1. The Overfitting Problem
Overfitting occurs when a model memorizes the training data instead of learning general patterns, like a student who memorizes past exam answers word-for-word but fails new questions.
Real-Life Analogy
| Student Type | Strategy | Result |
|---|---|---|
| Underfitting | "I'll guess 42 for everything" | Fails everything |
| Just Right | "I understand the core concepts" | Passes new tests |
| Overfitting | "I memorized every past paper" | Fails new questions |
Diagnosing Overfitting
When validation loss starts rising while training loss keeps falling, the model is overfitting.
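This divergence check can be automated. A minimal sketch, with made-up loss curves for illustration:

```python
# Illustrative loss curves (hypothetical numbers, not from the fraud example below).
train_loss = [0.9, 0.6, 0.4, 0.3, 0.25, 0.2]
val_loss = [1.0, 0.7, 0.5, 0.45, 0.5, 0.6]

def overfitting_epoch(train, val):
    """Return the first epoch where val loss rises while train loss still falls."""
    for t in range(1, len(val)):
        if val[t] > val[t - 1] and train[t] < train[t - 1]:
            return t
    return None

print(overfitting_epoch(train_loss, val_loss))  # 4: curves diverge at epoch 4
```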
2. The Bias-Variance Tradeoff
$$\text{Expected Error} = \underbrace{\text{Bias}^2}_{\text{underfitting}} + \underbrace{\text{Variance}}_{\text{overfitting}} + \underbrace{\sigma^2}_{\text{irreducible noise}}$$
| | Bias | Variance |
|---|---|---|
| Model too simple | High | Low |
| Model too complex | Low | High |
| Goal | Low | Low |
Regularization reduces variance (overfitting) at the cost of a slight increase in bias.
3. L2 Regularization (Weight Decay)
Idea: Penalize large weights to force the model to stay simple.
Modified Loss
$$L_{total} = L_{original} + \lambda \sum_{i} w_i^2$$
where $\lambda$ is the regularization strength hyperparameter.
Effect on Gradient Update
$$\frac{\partial L_{total}}{\partial w} = \frac{\partial L_{original}}{\partial w} + 2\lambda w$$

$$w \leftarrow w - \alpha\left(\frac{\partial L}{\partial w} + 2\lambda w\right) = w\underbrace{(1 - 2\alpha\lambda)}_{\text{decay}} - \alpha\frac{\partial L}{\partial w}$$
Every step, the weight decays slightly toward zero (hence "weight decay").
Numeric Example
With $\lambda = 0.01$, $\alpha = 0.1$, $w = 5.0$, $\dfrac{\partial L}{\partial w} = 0.3$:

$$w_{new} = 5.0 - 0.1(0.3 + 2 \times 0.01 \times 5.0) = 5.0 - 0.04 = 4.96$$
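The same update as a one-line sketch, reproducing the numbers above:

```python
# One L2-regularized SGD step, using the example's values.
lam, alpha = 0.01, 0.1     # regularization strength, learning rate
w, grad = 5.0, 0.3         # current weight, dL_original/dw

# The extra 2*lam*w term pulls the weight toward zero every step.
w_new = w - alpha * (grad + 2 * lam * w)
print(round(w_new, 2))  # 4.96
```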
Effect on Weights
| Without L2 | With L2 |
|---|---|
| $\mathbf{w} = [0.001, 892, -743]$ | $\mathbf{w} = [0.3, 0.8, -0.6]$ |
| Overconfident, memorizing | Balanced, generalizing |
4. L1 Regularization (Lasso)
$$L_{total} = L_{original} + \lambda \sum_{i} |w_i|$$
Gradient
$$\frac{\partial L_{total}}{\partial w} = \frac{\partial L}{\partial w} + \lambda \cdot \text{sign}(w)$$
L1 vs L2 Comparison
| Property | L1 | L2 |
|---|---|---|
| Penalty | $\lambda\sum \lvert w_i \rvert$ | $\lambda\sum w_i^2$ |
| Effect | Drives weights to exactly 0 | Shrinks all weights |
| Result | Sparse weights | Dense small weights |
| Use when | Feature selection needed | General regularization |
| Analogy | Muting some speakers | Turning all speakers down |
L1 produces sparsity:

$$\text{L2: } [0.3, \; 0.8, \; -0.6, \; 0.4, \; 0.2]$$

$$\text{L1: } [0.0, \; 0.9, \; 0.0, \; 0.0, \; -0.7] \quad \leftarrow \text{sparse!}$$
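A small sketch of why this happens: L1's pull has constant magnitude $\alpha\lambda$, so it can drive a small weight all the way to zero, while L2's pull is proportional to the weight and only shrinks it. The starting weight and loop length are illustrative, and the data gradient is assumed to be zero so only the penalty acts.

```python
# Compare penalty-only updates for one small weight under L1 vs L2.
def l1_step(w, lam, alpha):
    # Subgradient step, clipped so the penalty never overshoots past zero.
    sign = (w > 0) - (w < 0)
    step = alpha * lam * sign
    return 0.0 if abs(step) >= abs(w) else w - step

def l2_step(w, lam, alpha):
    return w * (1 - 2 * alpha * lam)  # proportional shrink, never exactly zero

w_l1 = w_l2 = 0.05
for _ in range(100):
    w_l1 = l1_step(w_l1, lam=0.01, alpha=0.1)
    w_l2 = l2_step(w_l2, lam=0.01, alpha=0.1)

print(w_l1)  # 0.0 -- L1 reached exactly zero
print(w_l2 > 0)  # True -- L2 left a small nonzero weight
```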
5. Dropout
Idea: During training, randomly zero out neurons with probability $p$.
How It Works
$$\tilde{a}_i = \begin{cases} 0 & \text{with probability } p \\ \dfrac{a_i}{1-p} & \text{with probability } 1-p \end{cases}$$
The $\frac{1}{1-p}$ scaling ensures the expected value of each activation is unchanged:

$$\mathbb{E}[\tilde{a}_i] = (1-p) \cdot \frac{a_i}{1-p} + p \cdot 0 = a_i$$
Basketball Team Analogy
A coach randomly sits out 5 of 10 players each drill. Every player learns all positions. In the real game, the full team is far more robust.
Without dropout → neurons become co-dependent and overfit together.
With dropout → each neuron must learn independently → robust features.
Dropout Rates
| Layer | Typical $p$ | Reason |
|---|---|---|
| Input layer | 0.1–0.2 | Don't lose too much input info |
| Hidden layers | 0.3–0.5 | Sweet spot |
| Output layer | 0.0 | Never drop the final output! |
Train vs Inference
| Mode | Dropout | Scaling |
|---|---|---|
| Training | Active (random zeroing) | $\frac{1}{1-p}$ applied |
| Inference | Off (all neurons active) | No scaling needed |
6. Batch Normalization
Problem: As training progresses, the distribution of each layer's outputs shifts, so later layers must constantly readjust.
$$\underbrace{\text{Layer 1 outputs}}_{\text{Epoch 1: } \mu \approx 0,\; \sigma \approx 1} \neq \underbrace{\text{Layer 1 outputs}}_{\text{Epoch 100: } \mu \approx 50,\; \sigma \approx 30}$$
This internal covariate shift slows training.
The Fix
Normalize each mini-batch:
$$\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}$$
Then scale and shift with learned parameters:
$$y_i = \gamma \hat{x}_i + \beta$$
where $\gamma$ and $\beta$ are learned, allowing the network to undo the normalization if needed.
Numeric Example
Batch: $[10, 20, 30, 40, 50]$

$$\mu = 30, \quad \sigma \approx 14.1$$

$$\hat{x} = [-1.41, \; -0.71, \; 0, \; 0.71, \; 1.41]$$

With $\gamma = 2, \beta = 1$:

$$y = [-1.82, \; -0.42, \; 1.0, \; 2.42, \; 3.82]$$
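The forward pass can be sketched directly; it reproduces the example's numbers (the last decimal differs slightly because the worked example rounds $\hat{x}$ before scaling):

```python
# Batch-norm forward pass over one mini-batch of one feature.
def batch_norm(xs, gamma, beta, eps=1e-5):
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    x_hat = [(x - mu) / (var + eps) ** 0.5 for x in xs]  # normalize
    return [gamma * xh + beta for xh in x_hat]           # scale and shift

y = batch_norm([10, 20, 30, 40, 50], gamma=2, beta=1)
print([round(v, 2) for v in y])  # [-1.83, -0.41, 1.0, 2.41, 3.83]
```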
Benefits
| Benefit | Explanation |
|---|---|
| Faster training | Allows higher learning rates |
| Less sensitive to init | More robust to weight initialization |
| Regularization effect | Adds noise via batch statistics |
| Gradient flow | Helps prevent vanishing/exploding gradients |
7. Early Stopping
Monitor validation loss during training and stop once it has stopped improving.
Typical patience: 5–20 epochs
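A patience-based stopping rule in a few lines (the validation losses are illustrative):

```python
# Stop after `patience` consecutive epochs without a new best validation loss.
def early_stop_epoch(val_losses, patience=3):
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # stop here; restore weights saved at best_epoch
    return len(val_losses) - 1

losses = [1.0, 0.8, 0.6, 0.55, 0.56, 0.57, 0.58, 0.60]
print(early_stop_epoch(losses, patience=3))  # 6: best was epoch 3, then 3 bad epochs
```

In practice the weights from the best epoch are checkpointed and restored, not the weights at the stopping epoch.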
8. Data Augmentation
Artificially expand training data by creating variations:
| Data Type | Augmentation Techniques |
|---|---|
| Images | Flip, rotate, crop, brightness, noise |
| Text | Synonym substitution, back-translation |
| Audio | Pitch shift, time stretch, noise |
| Tabular | SMOTE (synthetic minority oversampling) |
The model sees the same underlying pattern in many contexts, so it learns the concept, not the specific instance.
9. Regularization Techniques Summary
| Method | Formula | Best For |
|---|---|---|
| L2 | $+\lambda\sum w^2$ | General, most used |
| L1 | $+\lambda\sum \lvert w \rvert$ | Feature selection |
| Dropout | Random zeroing | Deep networks |
| Batch Norm | Normalize activations | Deep networks |
| Early Stopping | Monitor val loss | Always |
| Data Augment | More training variety | Vision, NLP |
10. Real Example β Fraud Detection
Dataset: 10,000 transactions (9,500 legit, 500 fraud)
| Configuration | Train Accuracy | Test Accuracy |
|---|---|---|
| No regularization | 99.9% | 71.0% |
| L2 only | 96.2% | 84.3% |
| Dropout only | 95.8% | 87.1% |
| L2 + Dropout + Early Stop | 94.0% | 91.5% |
The regularized model sacrificed training accuracy for massive real-world gains.
11. Quick Reference
$$\boxed{L_{L2} = L + \lambda\sum w^2, \quad L_{L1} = L + \lambda\sum|w|}$$

$$\boxed{\tilde{a}_i = \begin{cases} 0 & \text{with probability } p \\ a_i/(1-p) & \text{with probability } 1-p \end{cases}}$$

$$\boxed{\hat{x} = \frac{x - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}, \quad y = \gamma\hat{x} + \beta}$$