ai · 2025-12-06 · 5 min read

Gradient Descent

Walking downhill to find the best weights. The optimization algorithm at the heart of deep learning.

1. Intuition

Gradient descent is the optimization algorithm that uses gradients (from backprop) to iteratively update weights in the direction that reduces the loss.

Real-Life Analogy 🏔️ — Blindfolded on a Mountain

You are blindfolded on a hilly mountain and need to reach the lowest valley:

  • You can't see the whole landscape
  • But you can feel the slope under your feet
  • So you take a small step in the downhill direction
  • Repeat until you can't go lower
Loss (height)
  │     You start here 🧍
  │         ↓
  │        🧍 step
  │       ↙
  │      🧍 step
  │     ↙
  │    🧍  ← getting closer
  │     ↘
  │      🧍 ← minimum! (best weights)
  └──────────────── Weights (position)

2. The Core Update Rule

$$\mathbf{W} \leftarrow \mathbf{W} - \alpha \cdot \frac{\partial L}{\partial \mathbf{W}}$$

$$\mathbf{b} \leftarrow \mathbf{b} - \alpha \cdot \frac{\partial L}{\partial \mathbf{b}}$$

Where:

  • $\alpha$ — learning rate (step size)
  • $\dfrac{\partial L}{\partial \mathbf{W}}$ — gradient (computed by backprop)
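In code, the update is one line of elementwise arithmetic per parameter. A minimal NumPy sketch (all values here are made up for illustration; the gradients would come from backprop):

```python
import numpy as np

# Illustrative values only -- in practice dW and db come from backprop.
W = np.array([[0.5, -0.3], [0.1, 0.8]])    # weights
b = np.array([0.0, 0.0])                   # biases
dW = np.array([[0.2, -0.1], [0.05, 0.4]])  # dL/dW
db = np.array([0.1, -0.2])                 # dL/db
alpha = 0.01                               # learning rate

# The core update rule: step against the gradient.
W = W - alpha * dW
b = b - alpha * db
```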

Why Subtract?

The gradient points uphill (direction of increasing loss). To go downhill, we go in the opposite direction:

$$\text{gradient} > 0 \implies \text{loss increases with } W \implies \text{decrease } W$$

$$\text{gradient} < 0 \implies \text{loss decreases with } W \implies \text{increase } W$$


3. Worked Example

Continuing from backpropagation:

Given: $w = 3$, $b = 1$, $\dfrac{\partial L}{\partial w} = 8$, $\dfrac{\partial L}{\partial b} = 4$, $\alpha = 0.1$ (with input $x = 2$ and target $y = 5$, as in the backprop post)

Update weights:

$$w_{\text{new}} = 3 - (0.1)(8) = 3 - 0.8 = 2.2$$

$$b_{\text{new}} = 1 - (0.1)(4) = 1 - 0.4 = 0.6$$

Forward pass with new weights:

$$z = (2.2)(2) + 0.6 = 5.0$$

$$a = \text{ReLU}(5.0) = 5.0$$

$$L = (5.0 - 5)^2 = \mathbf{0} \quad \text{🎉 Perfect!}$$

Loss dropped from 4 → 0 in one step!
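The arithmetic above can be checked in a few lines of Python, reading the input $x = 2$ and target $y = 5$ off the forward pass:

```python
# Checking the worked example: input x = 2, target y = 5, squared loss.
w, b, alpha = 3.0, 1.0, 0.1
dL_dw, dL_db = 8.0, 4.0        # gradients carried over from the backprop post

w = w - alpha * dL_dw          # 3 - 0.8 = 2.2
b = b - alpha * dL_db          # 1 - 0.4 = 0.6

z = w * 2 + b                  # forward pass: 2.2 * 2 + 0.6 = 5.0
a = max(0.0, z)                # ReLU
loss = (a - 5) ** 2            # (5.0 - 5)^2 = 0

print(round(w, 6), round(b, 6), round(loss, 6))  # → 2.2 0.6 0.0
```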


4. The Learning Rate $\alpha$

The learning rate controls the size of each step.

| Learning Rate | Effect | Outcome |
| --- | --- | --- |
| Too large ($\alpha = 10$) | Overshoots minimum | Diverges 💥 |
| Too small ($\alpha = 0.000001$) | Tiny steps | Takes forever 🐢 |
| Just right ($\alpha = 0.001$) | Steady progress | Converges ✅ |

Effect on loss curve:

Too Large:          Too Small:          Just Right:
Loss                Loss                Loss
│  /\               │                   │╲
│ /  \___           │╲╲╲╲╲╲╲╲╲          │ ╲___
│/                  │        ╲╲╲╲       │
└────────           └────────           └────────
  Diverges!           Slow!               Converges ✅

Common values: $\alpha \in \{0.1, 0.01, 0.001, 0.0001\}$
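The three regimes are easy to reproduce on a toy one-dimensional loss. Here I use $L(w) = w^2$ (gradient $2w$), an illustrative loss of my choosing, not one from the post:

```python
def run(alpha, steps=20, w0=1.0):
    """Gradient descent on the toy loss L(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w = w - alpha * 2 * w   # the core update rule
    return w

print(abs(run(10.0)))      # too large: |w| explodes (diverges)
print(abs(run(1e-6)))      # too small: barely moves from 1.0
print(abs(run(0.1)))       # just right: shrinks toward the minimum at 0
```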


5. Three Flavors of Gradient Descent

5.1 Batch Gradient Descent

Use all $n$ training samples to compute one gradient update:

$$\mathbf{W} \leftarrow \mathbf{W} - \alpha \cdot \frac{1}{n}\sum_{i=1}^{n}\nabla_{\mathbf{W}} L_i$$

| ✅ Pros | ❌ Cons |
| --- | --- |
| Stable, accurate gradient | Extremely slow for large datasets |
| Guaranteed convergence (for convex losses) | May not fit in memory |

5.2 Stochastic Gradient Descent (SGD)

Use one random sample per update:

$$\mathbf{W} \leftarrow \mathbf{W} - \alpha \cdot \nabla_{\mathbf{W}} L_i$$

| ✅ Pros | ❌ Cons |
| --- | --- |
| Very fast updates | Noisy, jumpy path |
| Can escape local minima | May not converge |

5.3 Mini-Batch Gradient Descent ⭐

Use a small batch ($B = 32$ to $256$) per update:

$$\mathbf{W} \leftarrow \mathbf{W} - \alpha \cdot \frac{1}{B}\sum_{i \in \text{batch}}\nabla_{\mathbf{W}} L_i$$

| ✅ Pros | ❌ Cons |
| --- | --- |
| Balance of speed and stability | Batch size is a hyperparameter |
| GPU-friendly | Slightly noisy |
| Standard in modern deep learning | |
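All three flavors share one loop body and differ only in how many samples feed each update. A sketch for a toy linear model with squared loss (the model, function name, and defaults are mine): `batch_size = n` recovers batch GD, and `batch_size = 1` recovers SGD.

```python
import numpy as np

def minibatch_gd(X, y, w, alpha=0.01, batch_size=32, epochs=10, seed=0):
    """Mini-batch gradient descent for linear least squares: L_i = (x_i . w - y_i)**2."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # mean gradient over the batch
            w = w - alpha * grad                 # the same core update rule
    return w
```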

6. Convergence Comparison

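One way to see why the three variants trace different convergence paths: each update uses a gradient *estimate*, and the noise in that estimate shrinks as the batch grows. A numeric sketch on a toy linear model (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000,))
y = 3 * X                    # toy targets; the weight starts far from w* = 3
w = 0.0

def grad(batch):
    """dL/dw of the squared loss, averaged over the given sample indices."""
    return np.mean(2 * (w * X[batch] - y[batch]) * X[batch])

full = grad(np.arange(1000))                            # batch GD: one exact gradient
singles = [grad([i]) for i in range(1000)]              # SGD: 1-sample estimates
minis = [grad(np.arange(i, i + 32)) for i in range(0, 960, 32)]  # mini-batches of 32

# Same direction on average, very different noise levels.
print(np.std(singles), np.std(minis))
```

The single-sample gradients average out to the full-batch gradient, which is why SGD still heads downhill overall despite its jumpy path.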

7. Advanced Optimizers

7.1 Momentum

Adds a "velocity" term that accumulates gradient history to build up speed:

$$\mathbf{v} \leftarrow \beta_1 \mathbf{v} + (1 - \beta_1)\nabla_\mathbf{W} L$$

$$\mathbf{W} \leftarrow \mathbf{W} - \alpha \mathbf{v}$$

Where $\beta_1 \approx 0.9$ is the momentum coefficient.
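One momentum step in code (a scalar sketch of the two formulas above; it works unchanged on NumPy arrays, and the function name is mine):

```python
def momentum_step(W, v, grad, alpha=0.01, beta1=0.9):
    """One momentum update: blend the new gradient into the velocity, then step."""
    v = beta1 * v + (1 - beta1) * grad   # exponential moving average of gradients
    W = W - alpha * v
    return W, v

# Example: repeated identical gradients make the velocity build up toward 1.0.
W, v = 1.0, 0.0
for _ in range(3):
    W, v = momentum_step(W, v, grad=1.0)
```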

7.2 Adam (Adaptive Moment Estimation) ⭐

Combines momentum + adaptive learning rates:

First moment (mean of gradients):

$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1)\mathbf{g}_t$$

Second moment (mean of squared gradients):

$$\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2)\mathbf{g}_t^2$$

Bias correction:

$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t}, \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t}$$

Final update:

$$\mathbf{W} \leftarrow \mathbf{W} - \frac{\alpha}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}\, \hat{\mathbf{m}}_t$$

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
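The four formulas combine into one update function. A sketch (the signature and the 1-based step counter `t` are my conventions; the defaults follow the post):

```python
import numpy as np

def adam_step(W, m, v, g, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * g            # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g**2         # second moment (mean of squared gradients)
    m_hat = m / (1 - beta1**t)                 # bias correction: early on, m and v
    v_hat = v / (1 - beta2**t)                 # are biased toward their zero init
    W = W - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v
```

On the very first step ($t = 1$) the bias-corrected moments equal the raw gradient and its square, so the update has magnitude close to $\alpha$ regardless of the gradient's scale.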

Optimizer Comparison

| Optimizer | Key Idea | When to Use |
| --- | --- | --- |
| SGD | Raw gradient | Simple problems |
| Momentum | Accumulated velocity | Noisy gradients |
| AdaGrad | Adaptive per-parameter LR | Sparse features |
| RMSProp | Adaptive + exponential decay | RNNs |
| Adam ⭐ | Momentum + adaptive | Almost everything |

8. Full Training Loop

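The loop ties everything together: forward pass, loss, backward pass, update, repeat. A minimal sketch for the single-neuron model from the worked example, fit to a hypothetical target function $y = 4x + 2$ with plain per-sample SGD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 3.0, size=50)
y = 4 * X + 2                       # hypothetical target the neuron should learn

w, b, alpha = 3.0, 1.0, 0.01
for epoch in range(200):
    for x_i, y_i in zip(X, y):
        z = w * x_i + b             # forward pass
        a = max(0.0, z)             # ReLU
        loss = (a - y_i) ** 2       # squared loss
        # backward pass: chain rule through the loss and the ReLU gate
        dz = 2 * (a - y_i) * (1.0 if z > 0 else 0.0)
        w -= alpha * dz * x_i       # dL/dw = dz * x
        b -= alpha * dz             # dL/db = dz
print(w, b)                         # approaches w ≈ 4, b ≈ 2
```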

9. Local Minima & Saddle Points

Loss landscape challenges:

Local minimum:         Saddle point:         Global minimum:
    ╲  /╲              ╲       /
     ╲/  ╲              ╲_____/                 ╲__/
     ↑                     ↑                      ↑
 Gets stuck!           Gradient ≈ 0            We want this!

Solutions:

  • Momentum can carry the weights through shallow local minima
  • The noise in SGD/mini-batch updates helps jump out of bad spots
  • Adam's adaptive rates keep progress going on flat plateaus

10. Quick Reference

$$\boxed{\mathbf{W} \leftarrow \mathbf{W} - \alpha \cdot \nabla_{\mathbf{W}} L}$$

$$\boxed{\text{Adam: } \mathbf{W} \leftarrow \mathbf{W} - \frac{\alpha}{\sqrt{\hat{\mathbf{v}}} + \epsilon}\,\hat{\mathbf{m}}}$$

Filed under: ai
