1. The Overfitting Problem
Overfitting occurs when a model memorizes the training data instead of learning general patterns, like a student who memorizes past exam answers word-for-word but fails new questions.
Real-Life Analogy
| Student Type | Strategy | Result |
|---|---|---|
| Underfitting | "I'll guess 42 for everything" | Fails everything |
| Just Right | "I understand the core concepts" | Passes new tests |
| Overfitting | "I memorized every past paper" | Fails new questions |
Diagnosing Overfitting
When validation loss starts rising while training loss keeps falling, the model is overfitting.
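This divergence check can be automated. A minimal sketch, with made-up loss curves for illustration:

```python
# Illustrative loss curves (hypothetical numbers, not from the fraud example below).
train_loss = [0.9, 0.6, 0.4, 0.3, 0.25, 0.2]
val_loss = [1.0, 0.7, 0.5, 0.45, 0.5, 0.6]

def overfitting_epoch(train, val):
    """Return the first epoch where val loss rises while train loss still falls."""
    for t in range(1, len(val)):
        if val[t] > val[t - 1] and train[t] < train[t - 1]:
            return t
    return None

print(overfitting_epoch(train_loss, val_loss))  # 4: curves diverge at epoch 4
```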
2. The Bias-Variance Tradeoff
$$\text{Expected Error} = \underbrace{\text{Bias}^2}_{\text{underfitting}} + \underbrace{\text{Variance}}_{\text{overfitting}} + \underbrace{\sigma^2}_{\text{irreducible noise}}$$
| | Bias | Variance |
|---|---|---|
| Model too simple | High | Low |
| Model too complex | Low | High |
| Goal | Low | Low |
Regularization reduces variance (overfitting) at the cost of a slight increase in bias.
3. L2 Regularization (Weight Decay)
Idea: Penalize large weights to force the model to stay simple.
Modified Loss
$$L_{total} = L_{original} + \lambda \sum_{i} w_i^2$$
where $\lambda$ is the regularization strength hyperparameter.
Effect on Gradient Update
$$\frac{\partial L_{total}}{\partial w} = \frac{\partial L_{original}}{\partial w} + 2\lambda w$$

$$w \leftarrow w - \alpha\left(\frac{\partial L}{\partial w} + 2\lambda w\right) = w\underbrace{(1 - 2\alpha\lambda)}_{\text{decay}} - \alpha\frac{\partial L}{\partial w}$$
Every step, the weight decays slightly toward zero (hence "weight decay").
Numeric Example
With $\lambda = 0.01$, $\alpha = 0.1$, $w = 5.0$, $\dfrac{\partial L}{\partial w} = 0.3$:

$$w_{new} = 5.0 - 0.1(0.3 + 2 \times 0.01 \times 5.0) = 5.0 - 0.04 = 4.96$$
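The same update as a one-line sketch, reproducing the numbers above:

```python
# One L2-regularized SGD step, using the example's values.
lam, alpha = 0.01, 0.1     # regularization strength, learning rate
w, grad = 5.0, 0.3         # current weight, dL_original/dw

# The extra 2*lam*w term pulls the weight toward zero every step.
w_new = w - alpha * (grad + 2 * lam * w)
print(round(w_new, 2))  # 4.96
```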
Effect on Weights
| Without L2 | With L2 |
|---|---|
| $\mathbf{w} = [0.001, 892, -743]$ | $\mathbf{w} = [0.3, 0.8, -0.6]$ |
| Overconfident, memorizing | Balanced, generalizing |
4. L1 Regularization (Lasso)
$$L_{total} = L_{original} + \lambda \sum_{i} |w_i|$$
Gradient
$$\frac{\partial L_{total}}{\partial w} = \frac{\partial L}{\partial w} + \lambda \cdot \text{sign}(w)$$
L1 vs L2 Comparison
| Property | L1 | L2 |
|---|---|---|
| Penalty | $\lambda\sum \lvert w_i \rvert$ | $\lambda\sum w_i^2$ |
| Effect | Drives weights to exactly 0 | Shrinks all weights |
| Result | Sparse weights | Dense small weights |
| Use when | Feature selection needed | General regularization |
| Analogy | Muting some speakers | Turning all speakers down |
L1 produces sparsity:

$$\text{L2: } [0.3, \; 0.8, \; -0.6, \; 0.4, \; 0.2]$$

$$\text{L1: } [0.0, \; 0.9, \; 0.0, \; 0.0, \; -0.7] \quad \leftarrow \text{sparse!}$$
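A small sketch of why this happens: L1's pull has constant magnitude $\alpha\lambda$, so it can drive a small weight all the way to zero, while L2's pull is proportional to the weight and only shrinks it. The starting weight and loop length are illustrative, and the data gradient is assumed to be zero so only the penalty acts.

```python
# Compare penalty-only updates for one small weight under L1 vs L2.
def l1_step(w, lam, alpha):
    # Subgradient step, clipped so the penalty never overshoots past zero.
    sign = (w > 0) - (w < 0)
    step = alpha * lam * sign
    return 0.0 if abs(step) >= abs(w) else w - step

def l2_step(w, lam, alpha):
    return w * (1 - 2 * alpha * lam)  # proportional shrink, never exactly zero

w_l1 = w_l2 = 0.05
for _ in range(100):
    w_l1 = l1_step(w_l1, lam=0.01, alpha=0.1)
    w_l2 = l2_step(w_l2, lam=0.01, alpha=0.1)

print(w_l1)  # 0.0 -- L1 reached exactly zero
print(w_l2 > 0)  # True -- L2 left a small nonzero weight
```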
5. Dropout
Idea: During training, randomly zero out neurons with probability $p$.
How It Works
$$\tilde{a}_i = \begin{cases} 0 & \text{with probability } p \\ \dfrac{a_i}{1-p} & \text{with probability } 1-p \end{cases}$$
The $\frac{1}{1-p}$ scaling ensures the expected value of each activation is unchanged:

$$\mathbb{E}[\tilde{a}_i] = (1-p) \cdot \frac{a_i}{1-p} + p \cdot 0 = a_i$$
Basketball Team Analogy
A coach randomly sits out 5 of 10 players each drill. Every player learns all positions. In the real game, the full team is far more robust.
Without dropout → neurons become co-dependent and overfit together.
With dropout → each neuron must learn independently → robust features.
Dropout Rates
| Layer | Typical $p$ | Reason |
|---|---|---|
| Input layer | 0.1–0.2 | Don't lose too much input info |
| Hidden layers | 0.3–0.5 | Sweet spot |
| Output layer | 0.0 | Never drop the final output! |
Train vs Inference
| Mode | Dropout | Scaling |
|---|---|---|
| Training | Active (random zeroing) | $\frac{1}{1-p}$ applied |
| Inference | Off (all neurons active) | No scaling needed |
6. Batch Normalization
Problem: As training progresses, the distribution of each layer's outputs shifts, so later layers must constantly readjust.
$$\underbrace{\text{Layer 1 outputs}}_{\text{Epoch 1: } \mu \approx 0,\; \sigma \approx 1} \neq \underbrace{\text{Layer 1 outputs}}_{\text{Epoch 100: } \mu \approx 50,\; \sigma \approx 30}$$
This internal covariate shift slows training.
The Fix
Normalize each mini-batch:
$$\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}$$
Then scale and shift with learned parameters:
$$y_i = \gamma \hat{x}_i + \beta$$
where $\gamma$ and $\beta$ are learned, allowing the network to undo the normalization if needed.
Numeric Example
Batch: $[10, 20, 30, 40, 50]$

$$\mu = 30, \quad \sigma \approx 14.1$$

$$\hat{x} = [-1.41, \; -0.71, \; 0, \; 0.71, \; 1.41]$$

With $\gamma = 2, \beta = 1$:

$$y = [-1.82, \; -0.42, \; 1.0, \; 2.42, \; 3.82]$$
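The forward pass can be sketched directly; it reproduces the example's numbers (the last decimal differs slightly because the worked example rounds $\hat{x}$ before scaling):

```python
# Batch-norm forward pass over one mini-batch of one feature.
def batch_norm(xs, gamma, beta, eps=1e-5):
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    x_hat = [(x - mu) / (var + eps) ** 0.5 for x in xs]  # normalize
    return [gamma * xh + beta for xh in x_hat]           # scale and shift

y = batch_norm([10, 20, 30, 40, 50], gamma=2, beta=1)
print([round(v, 2) for v in y])  # [-1.83, -0.41, 1.0, 2.41, 3.83]
```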
Benefits
| Benefit | Explanation |
|---|---|
| Faster training | Allows higher learning rates |
| Less sensitive to init | More robust to weight initialization |
| Regularization effect | Adds noise via batch statistics |
| Gradient flow | Helps prevent vanishing/exploding gradients |
7. Early Stopping
Monitor validation loss during training and stop once it has stopped improving.
Typical patience: 5–20 epochs
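A patience-based stopping rule in a few lines (the validation losses are illustrative):

```python
# Stop after `patience` consecutive epochs without a new best validation loss.
def early_stop_epoch(val_losses, patience=3):
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # stop here; restore weights saved at best_epoch
    return len(val_losses) - 1

losses = [1.0, 0.8, 0.6, 0.55, 0.56, 0.57, 0.58, 0.60]
print(early_stop_epoch(losses, patience=3))  # 6: best was epoch 3, then 3 bad epochs
```

In practice the weights from the best epoch are checkpointed and restored, not the weights at the stopping epoch.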
8. Data Augmentation
Artificially expand training data by creating variations:
| Data Type | Augmentation Techniques |
|---|---|
| Images | Flip, rotate, crop, brightness, noise |
| Text | Synonym substitution, back-translation |
| Audio | Pitch shift, time stretch, noise |
| Tabular | SMOTE (synthetic minority oversampling) |
The model sees the same underlying pattern in many contexts, so it learns the concept, not the specific instance.
9. Regularization Techniques Summary
| Method | Formula | Best For |
|---|---|---|
| L2 | $+\lambda\sum w^2$ | General, most used |
| L1 | $+\lambda\sum \lvert w \rvert$ | Feature selection |
| Dropout | Random zeroing | Deep networks |
| Batch Norm | Normalize activations | Deep networks |
| Early Stopping | Monitor val loss | Always |
| Data Augment | More training variety | Vision, NLP |
10. Real Example β Fraud Detection
Dataset: 10,000 transactions (9,500 legit, 500 fraud)
| Configuration | Train Accuracy | Test Accuracy |
|---|---|---|
| No regularization | 99.9% | 71.0% |
| L2 only | 96.2% | 84.3% |
| Dropout only | 95.8% | 87.1% |
| L2 + Dropout + Early Stop | 94.0% | 91.5% |
The regularized model sacrificed training accuracy for massive real-world gains.
11. Quick Reference
$$\boxed{L_{L2} = L + \lambda\sum w^2, \quad L_{L1} = L + \lambda\sum|w|}$$

$$\boxed{\tilde{a}_i = \begin{cases} 0 & \text{with probability } p \\ a_i/(1-p) & \text{with probability } 1-p \end{cases}}$$

$$\boxed{\hat{x} = \frac{x - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}, \quad y = \gamma\hat{x} + \beta}$$