1. Intuition
Backpropagation (backprop) is the algorithm that computes how much each weight in the network contributed to the final error. It works backwards from the loss, distributing blame through the network using the chain rule of calculus.
Real-Life Analogy 🏭 — Factory Defect
Imagine a car factory:
Steel → Engine → Paint → Final Car
A defective car is shipped. The manager investigates backwards:
Is the paint wrong? → Partially
Is the engine wrong? → Partially
Is the steel wrong? → Partially
Each station gets a share of the blame proportional to its contribution.
Backprop does exactly this — assigns each weight a gradient (blame score).
2. The Chain Rule
Backprop is an application of the chain rule from calculus.
If $y$ depends on $z$, and $z$ depends on $x$:

$$\frac{dy}{dx} = \frac{dy}{dz} \cdot \frac{dz}{dx}$$
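The chain rule can be sanity-checked numerically. A minimal sketch (my own toy functions, not from the text): take $y = z^2$ with $z = 3x$, so the chain rule gives $\frac{dy}{dx} = 2z \cdot 3 = 18x$, which a finite difference should reproduce.

```python
# Chain-rule check: y = z**2 with z = 3*x  =>  dy/dx = (dy/dz)(dz/dx) = 2z * 3 = 18x
def f(x):
    z = 3 * x
    return z ** 2

x = 2.0
analytic = 18 * x                           # chain-rule prediction: 36.0
h = 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)   # central finite difference
print(analytic, numeric)                    # both approximately 36.0
```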
For a two-layer network, the same rule chains through every intermediate quantity:

$$\frac{\partial L}{\partial \mathbf{W}^{[1]}} = \frac{\partial L}{\partial \mathbf{a}^{[2]}} \cdot \frac{\partial \mathbf{a}^{[2]}}{\partial \mathbf{z}^{[2]}} \cdot \frac{\partial \mathbf{z}^{[2]}}{\partial \mathbf{a}^{[1]}} \cdot \frac{\partial \mathbf{a}^{[1]}}{\partial \mathbf{z}^{[1]}} \cdot \frac{\partial \mathbf{z}^{[1]}}{\partial \mathbf{W}^{[1]}}$$
3. Complete Worked Example
Network Setup
$$x \xrightarrow{\;z = wx+b\;} z \xrightarrow{\;a = \text{ReLU}(z)\;} a \xrightarrow{\;L = (a-y)^2\;} L$$
Given:
Input: $x = 2$
Weight: $w = 3$
Bias: $b = 1$
True label: $y = 5$
Step 1: Forward Pass
$$z = wx + b = (3)(2) + 1 = 7$$
$$a = \text{ReLU}(7) = 7$$
$$L = (a - y)^2 = (7 - 5)^2 = 4$$
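The forward pass above is three lines of plain Python (a minimal sketch; variable names are my own):

```python
# Forward pass for the scalar network x -> z -> a -> L
x, w, b, y = 2.0, 3.0, 1.0, 5.0

z = w * x + b        # (3)(2) + 1 = 7
a = max(0.0, z)      # ReLU(7) = 7
L = (a - y) ** 2     # (7 - 5)^2 = 4
print(z, a, L)       # 7.0 7.0 4.0
```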
Step 2: Backward Pass
We want $\dfrac{\partial L}{\partial w}$: how does the loss change if we nudge $w$?
Apply chain rule:
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$$
🔴 Term 1: $\dfrac{\partial L}{\partial a}$
$$L = (a-y)^2 \implies \frac{\partial L}{\partial a} = 2(a - y) = 2(7 - 5) = 4$$
🟡 Term 2: $\dfrac{\partial a}{\partial z}$
$$a = \text{ReLU}(z) = \max(0, z) \implies \frac{\partial a}{\partial z} = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z < 0 \end{cases}$$
Since $z = 7 > 0$: $\dfrac{\partial a}{\partial z} = 1$
🟢 Term 3: $\dfrac{\partial z}{\partial w}$
$$z = wx + b \implies \frac{\partial z}{\partial w} = x = 2$$
🔵 Chain them together:
$$\frac{\partial L}{\partial w} = 4 \cdot 1 \cdot 2 = \mathbf{8}$$
Interpretation: the gradient is 8, so to first order a small increase in $w$ raises the loss at a rate of 8 per unit. Gradient descent therefore moves $w$ the opposite way: we should decrease $w$!
Step 3: Gradient for Bias
$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial b} = 4 \cdot 1 \cdot 1 = 4$$
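The full backward pass, plus one gradient-descent update, can be sketched in plain Python (the learning rate here is an arbitrary choice of mine, not from the text):

```python
# Recompute the forward pass, then chain the three local derivatives
x, w, b, y = 2.0, 3.0, 1.0, 5.0
z = w * x + b                    # 7
a = max(0.0, z)                  # 7

dL_da = 2 * (a - y)              # 2(7 - 5) = 4
da_dz = 1.0 if z > 0 else 0.0    # ReLU'(7) = 1
dz_dw = x                        # 2
dz_db = 1.0

dL_dw = dL_da * da_dz * dz_dw    # 4 * 1 * 2 = 8
dL_db = dL_da * da_dz * dz_db    # 4 * 1 * 1 = 4

# One gradient-descent step: move opposite the gradient
lr = 0.1
w -= lr * dL_dw                  # 3.0 - 0.8 = 2.2
b -= lr * dL_db                  # 1.0 - 0.4 = 0.6
print(dL_dw, dL_db, w, b)
```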
4. Data Flow Diagram
5. General Backprop Equations
For layer $l$ in a deep network:

Error signal (delta):

$$\boldsymbol{\delta}^{[l]} = \left(\mathbf{W}^{[l+1]\top} \boldsymbol{\delta}^{[l+1]}\right) \odot f'\left(\mathbf{z}^{[l]}\right)$$

Weight gradient:

$$\frac{\partial L}{\partial \mathbf{W}^{[l]}} = \boldsymbol{\delta}^{[l]} \cdot \mathbf{a}^{[l-1]\top}$$

Bias gradient:

$$\frac{\partial L}{\partial \mathbf{b}^{[l]}} = \boldsymbol{\delta}^{[l]}$$

Where $\odot$ denotes the element-wise (Hadamard) product and $f'$ is the derivative of the activation function.
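These vectorized equations can be sketched in NumPy for a small 2-layer ReLU network (the shapes, random seed, and squared-error loss are my own choices), with a finite-difference check on one weight:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)

# 2-3-1 network: a1 = relu(W1 x + b1), output z2 = W2 a1 + b2, L = (z2 - y)^2
x = rng.normal(size=(2, 1))
y = np.array([[1.0]])
W1, b1 = rng.normal(size=(3, 2)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))

def loss(W1):
    a1 = relu(W1 @ x + b1)
    z2 = W2 @ a1 + b2
    return float(((z2 - y) ** 2).sum())

# Backward pass with the delta equations (output layer is linear, so f' = 1 there)
z1 = W1 @ x + b1
a1 = relu(z1)
z2 = W2 @ a1 + b2
delta2 = 2 * (z2 - y)                   # dL/dz2
delta1 = (W2.T @ delta2) * (z1 > 0)     # (W^[2]T delta^[2]) ⊙ relu'(z1)
dW1 = delta1 @ x.T                      # delta^[1] a^[0]T, with a^[0] = x
db1 = delta1                            # bias gradient equals the delta

# Central finite-difference check on W1[0, 0]
h = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += h
Wm[0, 0] -= h
numeric = (loss(Wp) - loss(Wm)) / (2 * h)
print(abs(dW1[0, 0] - numeric) < 1e-6)  # analytic and numeric gradients agree
```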
6. Vanishing & Exploding Gradients
Vanishing Gradient
When per-layer derivative factors are small ($< 1$), gradients shrink exponentially with depth. With 4 layers each contributing a factor of 0.3, the gradient reaching the first layer is

$$g_1 = 0.3^4 = 0.0081 \approx 0$$

Early layers learn almost nothing.
Fixes:
Use ReLU (gradient = 1 for positive inputs)
ResNet skip connections
BatchNorm
Exploding Gradient
When per-layer factors are large ($> 1$), gradients grow exponentially. With 4 layers each contributing a factor of 3:

$$g_1 = 3^4 = 81 \quad \text{(weights blow up!)}$$
Fixes:
Gradient clipping: rescale the gradient when its norm exceeds a threshold, $g \leftarrow g \cdot \dfrac{g_{max}}{\lVert g \rVert}$ whenever $\lVert g \rVert > g_{max}$
Careful weight initialization
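Both failure modes, and norm-based clipping, are easy to demonstrate (the helper name `clip_by_norm` and the threshold 5.0 are my own illustrative choices):

```python
import numpy as np

# Product of per-layer gradient factors across 4 layers
print(0.3 ** 4)   # ~0.0081  -> vanishing
print(3.0 ** 4)   # 81.0     -> exploding

# Gradient clipping by global norm: rescale only when the norm exceeds the threshold
def clip_by_norm(g, max_norm):
    norm = np.linalg.norm(g)
    return g if norm <= max_norm else g * (max_norm / norm)

g = np.array([30.0, 40.0])        # norm = 50
print(clip_by_norm(g, 5.0))       # rescaled to norm 5: [3. 4.]
```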
7. Summary Table
| Step | Operation | Direction | Formula |
|------|-----------|-----------|---------|
| Forward | Compute activations | $\rightarrow$ | $a^{[l]} = f(W^{[l]}a^{[l-1]} + b^{[l]})$ |
| Compute Loss | Measure error | — | $L = \text{loss}(\hat{y}, y)$ |
| Backward | Compute deltas | $\leftarrow$ | $\delta^{[l]} = (W^{[l+1]\top}\delta^{[l+1]}) \odot f'(z^{[l]})$ |
| Gradients | Weight gradients | — | $\nabla W^{[l]} = \delta^{[l]} a^{[l-1]\top}$ |
8. Quick Reference
$$\boxed{\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}}$$

$$\boxed{\boldsymbol{\delta}^{[l]} = \left(\mathbf{W}^{[l+1]\top} \boldsymbol{\delta}^{[l+1]}\right) \odot f'\left(\mathbf{z}^{[l]}\right)}$$