1. Intuition
Gradient descent is the optimization algorithm that uses gradients (from backprop) to iteratively update weights in the direction that reduces the loss.
Real-Life Analogy 🏔️ – Blindfolded on a Mountain
You are blindfolded on a hilly mountain and need to reach the lowest valley:
- You can't see the whole landscape
- But you can feel the slope under your feet
- So you take a small step in the downhill direction
- Repeat until you can't go lower
    Loss (height)
      │ 🧍 You start here
      │  ╲
      │   🧍 step
      │    ╲
      │     🧍 step
      │      ╲
      │       🧍 ← getting closer
      │        ╲
      │         🧍 ← minimum! (best weights)
      └──────────────────── Weights (position)
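The blindfolded descent can be sketched in a few lines of Python: we never look at the whole landscape, we only "feel the slope" at the current position with a finite difference and step downhill. The function names and the example valley here are illustrative, not from any library.

```python
def feel_slope(f, x, h=1e-6):
    """Estimate the local slope by probing just around the current spot."""
    return (f(x + h) - f(x - h)) / (2 * h)

def descend(f, x, step=0.1, n_steps=50):
    """Repeatedly take a small step in the downhill direction."""
    for _ in range(n_steps):
        x = x - step * feel_slope(f, x)
    return x

# A simple valley whose lowest point is at x = 3.
valley = lambda x: (x - 3) ** 2

print(descend(valley, x=10.0))  # ends up very close to 3.0
```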
2. The Core Update Rule
$$w_{\text{new}} = w_{\text{old}} - \eta \cdot \frac{\partial L}{\partial w}$$

Where:
- $\eta$ → learning rate (step size)
- $\partial L / \partial w$ → gradient (computed by backprop)
Why Subtract?
The gradient points uphill (the direction of increasing loss). To go downhill, we step in the opposite direction, hence the minus sign:

$$\Delta w = -\eta \cdot \frac{\partial L}{\partial w}$$
3. Worked Example
Continuing from backpropagation:
Given the weights, inputs, and gradients computed in the backpropagation example, update each weight with the rule:

$$w \leftarrow w - \eta \cdot \frac{\partial L}{\partial w}$$

Then run a forward pass with the new weights to measure the new loss.
Loss dropped from 4 → 0 in one step!
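The exact numbers from the backpropagation example aren't reproduced in this extract, so the sketch below assumes a tiny stand-in model: $\hat{y} = w \cdot x$ with $x = 2$, target $y = 4$, initial $w = 1$, squared-error loss, and $\eta = 0.125$, chosen so the loss drops from 4 to 0 in a single step.

```python
# Assumed toy setup (not the original example's numbers):
x, y = 2.0, 4.0                  # input and target
w, eta = 1.0, 0.125              # initial weight, learning rate

y_hat = w * x                    # forward pass: 2.0
loss = (y_hat - y) ** 2          # loss: 4.0
grad = 2 * (y_hat - y) * x       # dL/dw = -8.0 (what backprop would give)

w = w - eta * grad               # update: 1 - 0.125 * (-8) = 2.0
new_loss = (w * x - y) ** 2      # forward pass with the new weight: 0.0
print(loss, "->", new_loss)      # 4.0 -> 0.0
```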
4. The Learning Rate
The learning rate controls the size of each step.
| Learning Rate | Effect | Outcome |
|---|---|---|
| Too large | Overshoots minimum | Diverges 💥 |
| Too small | Tiny steps | Takes forever 🐢 |
| Just right | Steady progress | Converges ✅ |
Effect on loss curve:
    Too Large:         Too Small:         Just Right:
    Loss               Loss               Loss
    │  /\    /\        │╲                 │╲
    │ /  \  /  \       │ ╲╲╲╲╲╲╲╲         │ ╲
    │/    \/           │        ╲╲╲╲      │  ╲_____
    └────────          └────────          └────────
     Diverges! 💥       Slow! 🐢           Converges ✅
Common values: $\eta = 0.1$, $0.01$, or $0.001$ (Adam's default is $0.001$).
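A small experiment makes the table concrete: run the update rule on $f(w) = w^2$ (gradient $2w$) with three step sizes. The specific rates are illustrative choices.

```python
def run(eta, w=5.0, steps=10):
    """Gradient descent on f(w) = w**2, whose gradient is 2w."""
    for _ in range(steps):
        w = w - eta * 2 * w          # the core update rule
    return w ** 2                    # final loss

print(run(1.1))    # too large: each step overshoots, loss blows up
print(run(1e-4))   # too small: loss barely drops from the initial 25
print(run(0.3))    # just right: loss collapses toward 0
```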
5. Three Flavors of Gradient Descent
5.1 Batch Gradient Descent
Use all $N$ training samples to compute one gradient update:

$$w \leftarrow w - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \frac{\partial L_i}{\partial w}$$
| ✅ Pros | ❌ Cons |
|---|---|
| Stable, accurate gradient | Extremely slow for large datasets |
| Guaranteed convergence | May not fit in memory |
5.2 Stochastic Gradient Descent (SGD)
Use one random sample $i$ per update:

$$w \leftarrow w - \eta \cdot \frac{\partial L_i}{\partial w}$$
| ✅ Pros | ❌ Cons |
|---|---|
| Very fast updates | Noisy, jumpy path |
| Can escape local minima | May not converge |
5.3 Mini-Batch Gradient Descent ✅
Use a small batch (typically $32$ to $256$ samples) per update:

$$w \leftarrow w - \eta \cdot \frac{1}{B} \sum_{i \in \text{batch}} \frac{\partial L_i}{\partial w}$$
| ✅ Pros | ❌ Cons |
|---|---|
| Balance of speed and stability | Batch size is a hyperparameter |
| GPU-friendly | Slightly noisy |
| Used in all modern AI | |
6. Convergence Comparison
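The comparison can be run directly: the sketch below fits $w$ in $y = w \cdot x$ (true $w = 3$) with all three flavors on the same synthetic data. Dataset size, batch size, and learning rate are illustrative assumptions.

```python
import random

random.seed(0)
# Synthetic data from y = 3x (so the ideal weight is w = 3).
data = [(x, 3.0 * x) for x in [random.uniform(-1, 1) for _ in range(64)]]

def grad(w, batch):
    """d/dw of the mean squared error over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(batches, w=0.0, eta=0.5, epochs=40):
    for _ in range(epochs):
        for b in batches():
            w = w - eta * grad(w, b)
    return w

batch_gd  = train(lambda: [data])                                        # all 64 samples at once
sgd       = train(lambda: [[s] for s in random.sample(data, len(data))]) # one sample at a time
minibatch = train(lambda: [data[i:i + 8] for i in range(0, 64, 8)])      # batches of 8

print(batch_gd, sgd, minibatch)  # all three land close to 3.0
```

Batch descent follows a smooth path, SGD a noisy one, and mini-batch sits in between, but on this easy problem all three reach the same answer.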
7. Advanced Optimizers
7.1 Momentum
Adds a "velocity" term $v$ that accumulates gradient history to build up speed:

$$v_t = \beta \cdot v_{t-1} + \frac{\partial L}{\partial w}, \qquad w \leftarrow w - \eta \cdot v_t$$

Where $\beta$ (typically $0.9$) is the momentum coefficient.
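A minimal sketch of the momentum update (with an assumed $\beta = 0.9$), applied to $f(w) = w^2$:

```python
def momentum_step(w, v, g, eta=0.1, beta=0.9):
    v = beta * v + g        # velocity accumulates gradient history
    w = w - eta * v         # step along the velocity, not the raw gradient
    return w, v

w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, g=2 * w)   # gradient of f(w) = w**2
print(w)  # very close to 0
```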
7.2 Adam (Adaptive Moment Estimation) ✅
Combines momentum + adaptive learning rates, using the gradient $g_t = \partial L / \partial w$ at step $t$.
First moment (mean of gradients):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

Second moment (mean of squared gradients):

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

Bias correction:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Final update:

$$w \leftarrow w - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Default hyperparameters: $\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
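The formulas translate almost line-for-line into code. This sketch uses the default hyperparameters and minimizes $f(w) = w^2$:

```python
import math

def adam(grad_fn, w, lr=0.001, b1=0.9, b2=0.999, eps=1e-8, steps=10000):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = b1 * m + (1 - b1) * g          # first moment (mean of gradients)
        v = b2 * v + (1 - b2) * g * g      # second moment (mean of squared gradients)
        m_hat = m / (1 - b1 ** t)          # bias corrections
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

print(adam(lambda w: 2 * w, w=5.0))  # minimizes f(w) = w**2, lands near 0
```

Note how the step size is roughly `lr` regardless of the gradient's raw magnitude, because the update is normalized by $\sqrt{\hat{v}_t}$.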
Optimizer Comparison
| Optimizer | Key Idea | When to Use |
|---|---|---|
| SGD | Raw gradient | Simple problems |
| Momentum | Accumulated velocity | Noisy gradients |
| AdaGrad | Adaptive per-parameter LR | Sparse features |
| RMSProp | Adaptive + exponential decay | RNNs |
| Adam ✅ | Momentum + Adaptive | Almost everything |
8. Full Training Loop
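The loop itself isn't shown in this extract, so here is a generic sketch of the full recipe (forward pass → loss → gradients → update, repeated over shuffled mini-batches). The model ($\hat{y} = wx + b$), data, batch size, and learning rate are illustrative assumptions.

```python
import random

random.seed(0)
# Synthetic data from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [random.uniform(-1, 1) for _ in range(100)]]

w, b, eta, batch_size = 0.0, 0.0, 0.1, 10

for epoch in range(100):
    random.shuffle(data)                         # new sample order each epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Forward pass + gradients of mean squared error over the batch.
        dw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        db = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        w, b = w - eta * dw, b - eta * db        # gradient descent update

print(w, b)  # close to the true values 2.0 and 1.0
```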
9. Local Minima & Saddle Points
Loss landscape challenges:
    Local minimum:     Saddle point:      Global minimum:
    ╲    /╲            ╲                  ╲
     ╲__/  ╲            ╲_____             ╲
            ╲                 ╲_____        ╲___/
    Gets stuck!        Gradient ≈ 0       We want this!
Solutions:
- Momentum helps escape local minima
- The gradient noise from small batches (SGD) helps jump out of shallow basins
- Adam's adaptive rates help on plateaus
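The last bullet can be checked numerically: on a plateau where the gradient is a tiny constant (an illustrative $10^{-4}$), plain SGD barely moves, while Adam's normalization $\hat{m}/\sqrt{\hat{v}} \approx \pm 1$ keeps the step near the full learning rate.

```python
import math

def sgd_on_plateau(g, steps, lr=0.001):
    w = 1.0
    for _ in range(steps):
        w -= lr * g                        # raw gradient step: proportional to g
    return w

def adam_on_plateau(g, steps, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        w -= lr * (m / (1 - b1 ** t)) / (math.sqrt(v / (1 - b2 ** t)) + eps)
    return w

g = 1e-4                                   # plateau: tiny constant gradient
print(sgd_on_plateau(g, 500))   # barely moved: still ~1.0
print(adam_on_plateau(g, 500))  # covered ~0.5: steady progress
```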
10. Quick Reference