1. Intuition
Gradient descent is the optimization algorithm that uses gradients (from backprop) to iteratively update weights in the direction that reduces the loss.
Real-Life Analogy 🏔️ – Blindfolded on a Mountain
You are blindfolded on a hilly mountain and need to reach the lowest valley:
- You can't see the whole landscape
- But you can feel the slope under your feet
- So you take a small step in the downhill direction
- Repeat until you can't go lower
    Loss (height)
      │ 🧍 You start here
      │  ╲
      │   🧍 step
      │    ╲
      │     🧍 step
      │      ╲
      │       🧍 ← getting closer
      │        ╲
      │         🧍 ← minimum! (best weights)
      └──────────────────── Weights (position)
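The blindfolded descent can be sketched in a few lines of Python: we never look at the whole landscape, we only "feel the slope" at the current position with a finite difference and step downhill. The function names and the example valley here are illustrative, not from any library.

```python
def feel_slope(f, x, h=1e-6):
    """Estimate the local slope by probing just around the current spot."""
    return (f(x + h) - f(x - h)) / (2 * h)

def descend(f, x, step=0.1, n_steps=50):
    """Repeatedly take a small step in the downhill direction."""
    for _ in range(n_steps):
        x = x - step * feel_slope(f, x)
    return x

# A simple valley whose lowest point is at x = 3.
valley = lambda x: (x - 3) ** 2

print(descend(valley, x=10.0))  # ends up very close to 3.0
```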
2. The Core Update Rule
$$w_{\text{new}} = w_{\text{old}} - \eta \cdot \frac{\partial L}{\partial w}$$

Where:
- $\eta$ → learning rate (step size)
- $\partial L / \partial w$ → gradient (computed by backprop)
Why Subtract?
The gradient points uphill (the direction of increasing loss). To go downhill, we step in the opposite direction, hence the minus sign:

$$\Delta w = -\eta \cdot \frac{\partial L}{\partial w}$$
3. Worked Example
Continuing from backpropagation:
Given the weights, inputs, and gradients computed in the backpropagation example, update each weight with the rule:

$$w \leftarrow w - \eta \cdot \frac{\partial L}{\partial w}$$

Then run a forward pass with the new weights to measure the new loss.
Loss dropped from 4 → 0 in one step!
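The exact numbers from the backpropagation example aren't reproduced in this extract, so the sketch below assumes a tiny stand-in model: $\hat{y} = w \cdot x$ with $x = 2$, target $y = 4$, initial $w = 1$, squared-error loss, and $\eta = 0.125$, chosen so the loss drops from 4 to 0 in a single step.

```python
# Assumed toy setup (not the original example's numbers):
x, y = 2.0, 4.0                  # input and target
w, eta = 1.0, 0.125              # initial weight, learning rate

y_hat = w * x                    # forward pass: 2.0
loss = (y_hat - y) ** 2          # loss: 4.0
grad = 2 * (y_hat - y) * x       # dL/dw = -8.0 (what backprop would give)

w = w - eta * grad               # update: 1 - 0.125 * (-8) = 2.0
new_loss = (w * x - y) ** 2      # forward pass with the new weight: 0.0
print(loss, "->", new_loss)      # 4.0 -> 0.0
```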
4. The Learning Rate
The learning rate controls the size of each step.
| Learning Rate | Effect | Outcome |
|---|---|---|
| Too large | Overshoots minimum | Diverges 💥 |
| Too small | Tiny steps | Takes forever 🐢 |
| Just right | Steady progress | Converges ✅ |
Effect on loss curve:
    Too Large:         Too Small:         Just Right:
    Loss               Loss               Loss
    │  /\    /\        │╲                 │╲
    │ /  \  /  \       │ ╲╲╲╲╲╲╲╲         │ ╲
    │/    \/           │        ╲╲╲╲      │  ╲_____
    └────────          └────────          └────────
     Diverges! 💥       Slow! 🐢           Converges ✅
Common values: $\eta = 0.1$, $0.01$, or $0.001$ (Adam's default is $0.001$).
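A small experiment makes the table concrete: run the update rule on $f(w) = w^2$ (gradient $2w$) with three step sizes. The specific rates are illustrative choices.

```python
def run(eta, w=5.0, steps=10):
    """Gradient descent on f(w) = w**2, whose gradient is 2w."""
    for _ in range(steps):
        w = w - eta * 2 * w          # the core update rule
    return w ** 2                    # final loss

print(run(1.1))    # too large: each step overshoots, loss blows up
print(run(1e-4))   # too small: loss barely drops from the initial 25
print(run(0.3))    # just right: loss collapses toward 0
```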
5. Three Flavors of Gradient Descent
5.1 Batch Gradient Descent
Use all $N$ training samples to compute one gradient update:

$$w \leftarrow w - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \frac{\partial L_i}{\partial w}$$
| ✅ Pros | ❌ Cons |
|---|---|
| Stable, accurate gradient | Extremely slow for large datasets |
| Guaranteed convergence | May not fit in memory |
5.2 Stochastic Gradient Descent (SGD)
Use one random sample $i$ per update:

$$w \leftarrow w - \eta \cdot \frac{\partial L_i}{\partial w}$$
| ✅ Pros | ❌ Cons |
|---|---|
| Very fast updates | Noisy, jumpy path |
| Can escape local minima | May not converge |
5.3 Mini-Batch Gradient Descent ✅
Use a small batch (typically $32$ to $256$ samples) per update:

$$w \leftarrow w - \eta \cdot \frac{1}{B} \sum_{i \in \text{batch}} \frac{\partial L_i}{\partial w}$$
| ✅ Pros | ❌ Cons |
|---|---|
| Balance of speed and stability | Batch size is a hyperparameter |
| GPU-friendly | Slightly noisy |
| Used in all modern AI | |
6. Convergence Comparison
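The comparison can be run directly: the sketch below fits $w$ in $y = w \cdot x$ (true $w = 3$) with all three flavors on the same synthetic data. Dataset size, batch size, and learning rate are illustrative assumptions.

```python
import random

random.seed(0)
# Synthetic data from y = 3x (so the ideal weight is w = 3).
data = [(x, 3.0 * x) for x in [random.uniform(-1, 1) for _ in range(64)]]

def grad(w, batch):
    """d/dw of the mean squared error over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(batches, w=0.0, eta=0.5, epochs=40):
    for _ in range(epochs):
        for b in batches():
            w = w - eta * grad(w, b)
    return w

batch_gd  = train(lambda: [data])                                        # all 64 samples at once
sgd       = train(lambda: [[s] for s in random.sample(data, len(data))]) # one sample at a time
minibatch = train(lambda: [data[i:i + 8] for i in range(0, 64, 8)])      # batches of 8

print(batch_gd, sgd, minibatch)  # all three land close to 3.0
```

Batch descent follows a smooth path, SGD a noisy one, and mini-batch sits in between, but on this easy problem all three reach the same answer.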
7. Advanced Optimizers
7.1 Momentum
Adds a "velocity" term $v$ that accumulates gradient history to build up speed:

$$v_t = \beta \cdot v_{t-1} + \frac{\partial L}{\partial w}, \qquad w \leftarrow w - \eta \cdot v_t$$

Where $\beta$ (typically $0.9$) is the momentum coefficient.
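A minimal sketch of the momentum update (with an assumed $\beta = 0.9$), applied to $f(w) = w^2$:

```python
def momentum_step(w, v, g, eta=0.1, beta=0.9):
    v = beta * v + g        # velocity accumulates gradient history
    w = w - eta * v         # step along the velocity, not the raw gradient
    return w, v

w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, g=2 * w)   # gradient of f(w) = w**2
print(w)  # very close to 0
```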
7.2 Adam (Adaptive Moment Estimation) ✅
Combines momentum + adaptive learning rates, using the gradient $g_t = \partial L / \partial w$ at step $t$.
First moment (mean of gradients):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

Second moment (mean of squared gradients):

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

Bias correction:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Final update:

$$w \leftarrow w - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Default hyperparameters: $\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
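The formulas translate almost line-for-line into code. This sketch uses the default hyperparameters and minimizes $f(w) = w^2$:

```python
import math

def adam(grad_fn, w, lr=0.001, b1=0.9, b2=0.999, eps=1e-8, steps=10000):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = b1 * m + (1 - b1) * g          # first moment (mean of gradients)
        v = b2 * v + (1 - b2) * g * g      # second moment (mean of squared gradients)
        m_hat = m / (1 - b1 ** t)          # bias corrections
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

print(adam(lambda w: 2 * w, w=5.0))  # minimizes f(w) = w**2, lands near 0
```

Note how the step size is roughly `lr` regardless of the gradient's raw magnitude, because the update is normalized by $\sqrt{\hat{v}_t}$.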
Optimizer Comparison
| Optimizer | Key Idea | When to Use |
|---|---|---|
| SGD | Raw gradient | Simple problems |
| Momentum | Accumulated velocity | Noisy gradients |
| AdaGrad | Adaptive per-parameter LR | Sparse features |
| RMSProp | Adaptive + exponential decay | RNNs |
| Adam ✅ | Momentum + Adaptive | Almost everything |
8. Full Training Loop
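The loop itself isn't shown in this extract, so here is a generic sketch of the full recipe (forward pass → loss → gradients → update, repeated over shuffled mini-batches). The model ($\hat{y} = wx + b$), data, batch size, and learning rate are illustrative assumptions.

```python
import random

random.seed(0)
# Synthetic data from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [random.uniform(-1, 1) for _ in range(100)]]

w, b, eta, batch_size = 0.0, 0.0, 0.1, 10

for epoch in range(100):
    random.shuffle(data)                         # new sample order each epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Forward pass + gradients of mean squared error over the batch.
        dw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        db = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        w, b = w - eta * dw, b - eta * db        # gradient descent update

print(w, b)  # close to the true values 2.0 and 1.0
```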
9. Local Minima & Saddle Points
Loss landscape challenges:
    Local minimum:     Saddle point:      Global minimum:
    ╲    /╲            ╲                  ╲
     ╲__/  ╲            ╲_____             ╲
            ╲                 ╲_____        ╲___/
    Gets stuck!        Gradient ≈ 0       We want this!
Solutions:
- Momentum helps escape local minima
- The gradient noise from small batches (SGD) helps jump out of shallow basins
- Adam's adaptive rates help on plateaus
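The last bullet can be checked numerically: on a plateau where the gradient is a tiny constant (an illustrative $10^{-4}$), plain SGD barely moves, while Adam's normalization $\hat{m}/\sqrt{\hat{v}} \approx \pm 1$ keeps the step near the full learning rate.

```python
import math

def sgd_on_plateau(g, steps, lr=0.001):
    w = 1.0
    for _ in range(steps):
        w -= lr * g                        # raw gradient step: proportional to g
    return w

def adam_on_plateau(g, steps, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        w -= lr * (m / (1 - b1 ** t)) / (math.sqrt(v / (1 - b2 ** t)) + eps)
    return w

g = 1e-4                                   # plateau: tiny constant gradient
print(sgd_on_plateau(g, 500))   # barely moved: still ~1.0
print(adam_on_plateau(g, 500))  # covered ~0.5: steady progress
```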
10. Quick Reference