ai · 2026-02-21 · 6 min read

Regularization

Teaching models to generalize, not memorize. Dropout, weight decay, and the bias-variance tradeoff.

1. The Overfitting Problem

Overfitting occurs when a model memorizes its training data instead of learning general patterns, like a student who memorizes past exam answers word-for-word but fails new questions.

Real-Life Analogy 🎓

| Student Type | Strategy | Result |
|---|---|---|
| Underfitting | "I'll guess 42 for everything" | Fails everything |
| Just Right | "I understand the core concepts" | Passes new tests ✅ |
| Overfitting | "I memorized every past paper" | Fails new questions |

Diagnosing Overfitting


When validation loss starts rising while training loss keeps falling → overfitting!
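One minimal way to automate this check, assuming you log per-epoch losses (the loss histories below are illustrative, not from a real run):

```python
# Flag overfitting when, over a recent window, training loss keeps
# falling while validation loss keeps rising.
train_loss = [1.0, 0.7, 0.5, 0.4, 0.3, 0.25, 0.2]
val_loss   = [1.1, 0.8, 0.6, 0.55, 0.6, 0.7, 0.8]

def is_overfitting(train, val, window=3):
    recent_train = train[-window:]
    recent_val = val[-window:]
    train_falling = all(b < a for a, b in zip(recent_train, recent_train[1:]))
    val_rising = all(b > a for a, b in zip(recent_val, recent_val[1:]))
    return train_falling and val_rising

print(is_overfitting(train_loss, val_loss))  # True
```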


2. The Bias-Variance Tradeoff

$$\text{Expected Error} = \underbrace{\text{Bias}^2}_{\text{underfitting}} + \underbrace{\text{Variance}}_{\text{overfitting}} + \underbrace{\sigma^2}_{\text{irreducible noise}}$$

| | Bias | Variance |
|---|---|---|
| Model too simple | High | Low |
| Model too complex | Low | High |
| Goal | Low | Low |

Regularization reduces variance (overfitting) at the cost of a slight increase in bias.


3. L2 Regularization (Weight Decay)

Idea: Penalize large weights, forcing the model to stay simple.

Modified Loss

$$L_{total} = L_{original} + \lambda \sum_{i} w_i^2$$

where $\lambda$ is the regularization strength hyperparameter.

Effect on Gradient Update

$$\frac{\partial L_{total}}{\partial w} = \frac{\partial L_{original}}{\partial w} + 2\lambda w$$

$$w \leftarrow w - \alpha\left(\frac{\partial L}{\partial w} + 2\lambda w\right) = w\underbrace{(1 - 2\alpha\lambda)}_{\text{decay}} - \alpha\frac{\partial L}{\partial w}$$

Every step, the weight decays slightly toward zero (hence "weight decay").

Numeric Example

With $\lambda = 0.01$, $\alpha = 0.1$, $w = 5.0$, and $\frac{\partial L}{\partial w} = 0.3$:

$$w_{new} = 5.0 - 0.1(0.3 + 2 \times 0.01 \times 5.0) = 5.0 - 0.04 = 4.96$$
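The update can be checked in a few lines of plain Python (variable names are mine; the values are from the example above):

```python
lam = 0.01    # regularization strength (lambda)
alpha = 0.1   # learning rate (alpha)
w = 5.0       # current weight
grad = 0.3    # dL_original/dw

# total gradient = original gradient + 2*lambda*w penalty term
w_new = w - alpha * (grad + 2 * lam * w)
print(w_new)  # 4.96
```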

Effect on Weights

| Without L2 | With L2 |
|---|---|
| $\mathbf{w} = [0.001, 892, -743]$ | $\mathbf{w} = [0.3, 0.8, -0.6]$ |
| Overconfident, memorizing ❌ | Balanced, generalizing ✅ |

4. L1 Regularization (Lasso)

$$L_{total} = L_{original} + \lambda \sum_{i} |w_i|$$

Gradient

$$\frac{\partial L_{total}}{\partial w} = \frac{\partial L}{\partial w} + \lambda \cdot \operatorname{sign}(w)$$

L1 vs L2 Comparison

| Property | L1 | L2 |
|---|---|---|
| Penalty | $\lambda\lvert w\rvert$ | $\lambda w^2$ |
| Effect | Drives weights to exactly 0 | Shrinks all weights |
| Result | Sparse weights | Dense small weights |
| Use when | Feature selection needed | General regularization |
| Analogy | Muting some speakers | Turning all speakers down |

L1 produces sparsity:

$$\text{L2: } [0.3,\ 0.8,\ -0.6,\ 0.4,\ 0.2]$$

$$\text{L1: } [0.0,\ 0.9,\ 0.0,\ 0.0,\ -0.7] \quad \leftarrow \text{sparse!}$$
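The sparsity comes from the constant-magnitude pull of the $\lambda \cdot \operatorname{sign}(w)$ term: unlike L2's proportional shrinkage, it can push small weights all the way to zero. A minimal sketch using the soft-thresholding operator, the proximal step for the L1 penalty (the weight values below are illustrative):

```python
import numpy as np

def soft_threshold(w, t):
    # Shrink every weight's magnitude by t, clipping at exactly 0.
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([0.05, 0.9, -0.04, 0.3])
print(soft_threshold(w, 0.1))  # small weights snap to exactly 0
```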


5. Dropout

Idea: During training, randomly zero out neurons with probability $p$.

How It Works

$$\tilde{a}_i = \begin{cases} 0 & \text{with probability } p \\ \dfrac{a_i}{1-p} & \text{with probability } 1-p \end{cases}$$

The $\frac{1}{1-p}$ scaling ensures the expected value of each activation is unchanged:

$$\mathbb{E}[\tilde{a}_i] = (1-p) \cdot \frac{a_i}{1-p} + p \cdot 0 = a_i$$

Basketball Team Analogy 🏀

A coach randomly sits out 5 of 10 players each drill. Every player learns all positions. In the real game, the full team is far more robust.

Without dropout, neurons become co-dependent and overfit together. With dropout, each neuron must learn independently useful features, making the network more robust.

Dropout Rates

| Layer | Typical $p$ | Reason |
|---|---|---|
| Input layer | 0.1–0.2 | Don't lose too much input info |
| Hidden layers | 0.3–0.5 | Sweet spot |
| Output layer | 0.0 | Never drop the final output! |

Train vs Inference

| Mode | Dropout | Scaling |
|---|---|---|
| Training | Active (random zeroing) | $\frac{1}{1-p}$ applied |
| Inference | Off (all neurons active) | No scaling needed |

6. Batch Normalization

Problem: As training progresses, activation distributions shift, so later layers must constantly readjust.

$$\underbrace{\text{Layer 1 outputs}}_{\text{Epoch 1: } \mu \approx 0,\ \sigma \approx 1} \neq \underbrace{\text{Layer 1 outputs}}_{\text{Epoch 100: } \mu \approx 50,\ \sigma \approx 30}$$

This internal covariate shift slows training.

The Fix

Normalize each mini-batch:

$$\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}$$

Then scale and shift with learned parameters:

$$y_i = \gamma \hat{x}_i + \beta$$

where $\gamma$ and $\beta$ are learned, allowing the network to undo the normalization if needed.

Numeric Example

Batch: $[10, 20, 30, 40, 50]$

$$\mu = 30, \quad \sigma \approx 14.1$$

$$\hat{x} \approx [-1.41,\ -0.71,\ 0,\ 0.71,\ 1.41]$$

With $\gamma = 2$, $\beta = 1$:

$$y \approx [-1.82,\ -0.42,\ 1.0,\ 2.42,\ 3.82]$$
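The worked example can be reproduced with a minimal batch-norm forward pass (plain NumPy sketch; `eps` is the small $\epsilon$ from the formula, and the output matches the numbers above up to rounding):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization forward pass over a 1-D mini-batch."""
    mu = x.mean()
    var = x.var()                          # population variance of the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to mean 0, std 1
    return gamma * x_hat + beta            # learned scale and shift

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
print(batch_norm(x, gamma=2.0, beta=1.0))
```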

Benefits

| Benefit | Explanation |
|---|---|
| Faster training | Allows higher learning rates |
| Less sensitive to init | More robust weight initialization |
| Regularization effect | Adds noise via batch statistics |
| Gradient flow | Prevents vanishing/exploding gradients |

7. Early Stopping

Monitor validation loss during training and stop once it stops improving.

Typical patience: 5–20 epochs
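A generic patience loop can be sketched as follows; `train_step` and `val_loss_fn` are placeholder callables, not a real framework API (a real loop would also checkpoint the best weights where the comment indicates):

```python
def train_with_early_stopping(train_step, val_loss_fn, max_epochs=100, patience=10):
    """Stop after `patience` consecutive epochs without validation improvement."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_step()
        loss = val_loss_fn()
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0   # checkpoint best weights here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                        # patience exhausted: stop early
    return best_loss, epoch
```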


8. Data Augmentation

Artificially expand training data by creating variations:

| Data Type | Augmentation Techniques |
|---|---|
| Images | Flip, rotate, crop, brightness, noise |
| Text | Synonym substitution, back-translation |
| Audio | Pitch shift, time stretch, noise |
| Tabular | SMOTE (synthetic minority oversampling) |

The model sees the same underlying pattern in many contexts → it learns the concept, not the specific instance.
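For images, a few of the transforms from the table can be sketched in plain NumPy (a real pipeline would use a library such as torchvision or albumentations; the parameters here are illustrative):

```python
import numpy as np

def augment(img, rng):
    """Random flip, brightness jitter, and noise on an image in [0, 1]."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                        # horizontal flip
    img = img * rng.uniform(0.8, 1.2)             # brightness jitter
    img = img + rng.normal(0, 0.01, img.shape)    # small Gaussian noise
    return np.clip(img, 0.0, 1.0)

rng = np.random.default_rng(0)
img = np.full((4, 4), 0.5)
print(augment(img, rng).shape)  # same shape, new variation each call
```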


9. Regularization Techniques Summary

| Method | Formula | Best For |
|---|---|---|
| L2 | $+\lambda\sum w^2$ | General, most used |
| L1 | $+\lambda\sum\lvert w\rvert$ | Feature selection |
| Dropout | Random zeroing | Deep networks ⭐ |
| Batch Norm | Normalize activations | Deep networks ⭐ |
| Early Stopping | Monitor val loss | Always ⭐ |
| Data Augment | More training variety | Vision, NLP ⭐ |

10. Real Example β€” Fraud Detection

Dataset: 10,000 transactions (9,500 legit, 500 fraud)

| Configuration | Train Accuracy | Test Accuracy |
|---|---|---|
| No regularization | 99.9% | 71.0% ❌ |
| L2 only | 96.2% | 84.3% |
| Dropout only | 95.8% | 87.1% |
| L2 + Dropout + Early Stop | 94.0% | 91.5% ✅ |

The regularized model sacrificed a little training accuracy for large real-world gains.


11. Quick Reference

$$\boxed{L_{L2} = L + \lambda\sum w^2, \quad L_{L1} = L + \lambda\sum|w|}$$

$$\boxed{\tilde{a}_i = \begin{cases} 0 & \text{with probability } p \\ a_i/(1-p) & \text{with probability } 1-p \end{cases}}$$

$$\boxed{\hat{x} = \frac{x - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}, \quad y = \gamma\hat{x} + \beta}$$

Filed under: ai
