1. Intuition
Backpropagation (backprop) is the algorithm that computes how much each weight in the network contributed to the final error. It works backwards from the loss, distributing blame through the network using the chain rule of calculus.
Real-Life Analogy 🏭 — Factory Defect
Imagine a car factory:
Steel → Engine → Paint → Final Car
A defective car is shipped. The manager investigates backwards:
Is the paint wrong? → Partially
Is the engine wrong? → Partially
Is the steel wrong? → Partially
Each station gets a share of the blame proportional to its contribution.
Backprop does exactly this — assigns each weight a gradient (blame score).
2. The Chain Rule
Backprop is an application of the chain rule from calculus.
If $y$ depends on $z$, and $z$ depends on $x$:

$$\frac{dy}{dx} = \frac{dy}{dz} \cdot \frac{dz}{dx}$$
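The chain rule can be sanity-checked numerically. A minimal sketch (my own toy functions, not from the text): take $y = z^2$ with $z = 3x$, so the chain rule gives $\frac{dy}{dx} = 2z \cdot 3 = 18x$, which a finite difference should reproduce.

```python
# Chain-rule check: y = z**2 with z = 3*x  =>  dy/dx = (dy/dz)(dz/dx) = 2z * 3 = 18x
def f(x):
    z = 3 * x
    return z ** 2

x = 2.0
analytic = 18 * x                           # chain-rule prediction: 36.0
h = 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)   # central finite difference
print(analytic, numeric)                    # both approximately 36.0
```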
For a two-layer network, the same rule chains through every intermediate quantity:

$$\frac{\partial L}{\partial \mathbf{W}^{[1]}} = \frac{\partial L}{\partial \mathbf{a}^{[2]}} \cdot \frac{\partial \mathbf{a}^{[2]}}{\partial \mathbf{z}^{[2]}} \cdot \frac{\partial \mathbf{z}^{[2]}}{\partial \mathbf{a}^{[1]}} \cdot \frac{\partial \mathbf{a}^{[1]}}{\partial \mathbf{z}^{[1]}} \cdot \frac{\partial \mathbf{z}^{[1]}}{\partial \mathbf{W}^{[1]}}$$
3. Complete Worked Example
Network Setup
$$x \xrightarrow{\;z = wx+b\;} z \xrightarrow{\;a = \text{ReLU}(z)\;} a \xrightarrow{\;L = (a-y)^2\;} L$$
Given:
Input: $x = 2$
Weight: $w = 3$
Bias: $b = 1$
True label: $y = 5$
Step 1: Forward Pass
$$z = wx + b = (3)(2) + 1 = 7$$
$$a = \text{ReLU}(7) = 7$$
$$L = (a - y)^2 = (7 - 5)^2 = 4$$
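The forward pass above is three lines of plain Python (a minimal sketch; variable names are my own):

```python
# Forward pass for the scalar network x -> z -> a -> L
x, w, b, y = 2.0, 3.0, 1.0, 5.0

z = w * x + b        # (3)(2) + 1 = 7
a = max(0.0, z)      # ReLU(7) = 7
L = (a - y) ** 2     # (7 - 5)^2 = 4
print(z, a, L)       # 7.0 7.0 4.0
```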
Step 2: Backward Pass
We want $\dfrac{\partial L}{\partial w}$: how does the loss change if we nudge $w$?
Apply chain rule:
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$$
🔴 Term 1: $\dfrac{\partial L}{\partial a}$
$$L = (a-y)^2 \implies \frac{\partial L}{\partial a} = 2(a - y) = 2(7 - 5) = 4$$
🟡 Term 2: $\dfrac{\partial a}{\partial z}$
$$a = \text{ReLU}(z) = \max(0, z) \implies \frac{\partial a}{\partial z} = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z < 0 \end{cases}$$
Since $z = 7 > 0$: $\dfrac{\partial a}{\partial z} = 1$
🟢 Term 3: $\dfrac{\partial z}{\partial w}$
$$z = wx + b \implies \frac{\partial z}{\partial w} = x = 2$$
🔵 Chain them together:
$$\frac{\partial L}{\partial w} = 4 \cdot 1 \cdot 2 = \mathbf{8}$$
Interpretation: the gradient is 8, so to first order a small increase in $w$ raises the loss at a rate of 8 per unit. Gradient descent therefore moves $w$ the opposite way: we should decrease $w$!
Step 3: Gradient for Bias
$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial b} = 4 \cdot 1 \cdot 1 = 4$$
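The full backward pass, plus one gradient-descent update, can be sketched in plain Python (the learning rate here is an arbitrary choice of mine, not from the text):

```python
# Recompute the forward pass, then chain the three local derivatives
x, w, b, y = 2.0, 3.0, 1.0, 5.0
z = w * x + b                    # 7
a = max(0.0, z)                  # 7

dL_da = 2 * (a - y)              # 2(7 - 5) = 4
da_dz = 1.0 if z > 0 else 0.0    # ReLU'(7) = 1
dz_dw = x                        # 2
dz_db = 1.0

dL_dw = dL_da * da_dz * dz_dw    # 4 * 1 * 2 = 8
dL_db = dL_da * da_dz * dz_db    # 4 * 1 * 1 = 4

# One gradient-descent step: move opposite the gradient
lr = 0.1
w -= lr * dL_dw                  # 3.0 - 0.8 = 2.2
b -= lr * dL_db                  # 1.0 - 0.4 = 0.6
print(dL_dw, dL_db, w, b)
```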
4. Data Flow Diagram
5. General Backprop Equations
For layer $l$ in a deep network:

Error signal (delta):

$$\boldsymbol{\delta}^{[l]} = \left(\mathbf{W}^{[l+1]\top} \boldsymbol{\delta}^{[l+1]}\right) \odot f'\left(\mathbf{z}^{[l]}\right)$$

Weight gradient:

$$\frac{\partial L}{\partial \mathbf{W}^{[l]}} = \boldsymbol{\delta}^{[l]} \cdot \mathbf{a}^{[l-1]\top}$$

Bias gradient:

$$\frac{\partial L}{\partial \mathbf{b}^{[l]}} = \boldsymbol{\delta}^{[l]}$$

Where $\odot$ denotes the element-wise (Hadamard) product and $f'$ is the derivative of the activation function.
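These vectorized equations can be sketched in NumPy for a small 2-layer ReLU network (the shapes, random seed, and squared-error loss are my own choices), with a finite-difference check on one weight:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)

# 2-3-1 network: a1 = relu(W1 x + b1), output z2 = W2 a1 + b2, L = (z2 - y)^2
x = rng.normal(size=(2, 1))
y = np.array([[1.0]])
W1, b1 = rng.normal(size=(3, 2)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))

def loss(W1):
    a1 = relu(W1 @ x + b1)
    z2 = W2 @ a1 + b2
    return float(((z2 - y) ** 2).sum())

# Backward pass with the delta equations (output layer is linear, so f' = 1 there)
z1 = W1 @ x + b1
a1 = relu(z1)
z2 = W2 @ a1 + b2
delta2 = 2 * (z2 - y)                   # dL/dz2
delta1 = (W2.T @ delta2) * (z1 > 0)     # (W^[2]T delta^[2]) ⊙ relu'(z1)
dW1 = delta1 @ x.T                      # delta^[1] a^[0]T, with a^[0] = x
db1 = delta1                            # bias gradient equals the delta

# Central finite-difference check on W1[0, 0]
h = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += h
Wm[0, 0] -= h
numeric = (loss(Wp) - loss(Wm)) / (2 * h)
print(abs(dW1[0, 0] - numeric) < 1e-6)  # analytic and numeric gradients agree
```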
6. Vanishing & Exploding Gradients
Vanishing Gradient
When per-layer derivative factors are small ($< 1$), gradients shrink exponentially with depth. With 4 layers each contributing a factor of 0.3, the gradient reaching the first layer is

$$g_1 = 0.3^4 = 0.0081 \approx 0$$

Early layers learn almost nothing.
Fixes:
Use ReLU (gradient = 1 for positive inputs)
ResNet skip connections
BatchNorm
Exploding Gradient
When per-layer factors are large ($> 1$), gradients grow exponentially. With 4 layers each contributing a factor of 3:

$$g_1 = 3^4 = 81 \quad \text{(weights blow up!)}$$
Fixes:
Gradient clipping: rescale the gradient when its norm exceeds a threshold, $g \leftarrow g \cdot \dfrac{g_{max}}{\lVert g \rVert}$ whenever $\lVert g \rVert > g_{max}$
Careful weight initialization
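Both failure modes, and norm-based clipping, are easy to demonstrate (the helper name `clip_by_norm` and the threshold 5.0 are my own illustrative choices):

```python
import numpy as np

# Product of per-layer gradient factors across 4 layers
print(0.3 ** 4)   # ~0.0081  -> vanishing
print(3.0 ** 4)   # 81.0     -> exploding

# Gradient clipping by global norm: rescale only when the norm exceeds the threshold
def clip_by_norm(g, max_norm):
    norm = np.linalg.norm(g)
    return g if norm <= max_norm else g * (max_norm / norm)

g = np.array([30.0, 40.0])        # norm = 50
print(clip_by_norm(g, 5.0))       # rescaled to norm 5: [3. 4.]
```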
7. Summary Table
| Step | Operation | Direction | Formula |
|------|-----------|-----------|---------|
| Forward | Compute activations | $\rightarrow$ | $a^{[l]} = f(W^{[l]}a^{[l-1]} + b^{[l]})$ |
| Compute Loss | Measure error | — | $L = \text{loss}(\hat{y}, y)$ |
| Backward | Compute deltas | $\leftarrow$ | $\delta^{[l]} = (W^{[l+1]\top}\delta^{[l+1]}) \odot f'(z^{[l]})$ |
| Gradients | Weight gradients | — | $\nabla W^{[l]} = \delta^{[l]} a^{[l-1]\top}$ |
8. Quick Reference
$$\boxed{\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}}$$

$$\boxed{\boldsymbol{\delta}^{[l]} = \left(\mathbf{W}^{[l+1]\top} \boldsymbol{\delta}^{[l+1]}\right) \odot f'\left(\mathbf{z}^{[l]}\right)}$$