ai · 2025-11-29 · 4 min read

Backpropagation

Figuring out who to blame — and by how much. How gradients flow backwards through a network.

1. Intuition

Backpropagation (backprop) is the algorithm that computes how much each weight in the network contributed to the final error. It works backwards from the loss, distributing blame through the network using the chain rule of calculus.

Real-Life Analogy 🏭 — Factory Defect

Imagine a car factory:

Steel → Engine → Paint → Final Car

A defective car is shipped. The manager investigates backwards:

  1. Is the paint wrong? → Partially
  2. Is the engine wrong? → Partially
  3. Is the steel wrong? → Partially

Each station gets a share of the blame proportional to its contribution.

Backprop does exactly this — assigns each weight a gradient (blame score).


2. The Chain Rule

Backprop is an application of the chain rule from calculus.

If y depends on z, and z depends on x:

\frac{dy}{dx} = \frac{dy}{dz} \cdot \frac{dz}{dx}

For a neural network with layers:

\frac{\partial L}{\partial \mathbf{W}^{[1]}} = \frac{\partial L}{\partial \mathbf{a}^{[2]}} \cdot \frac{\partial \mathbf{a}^{[2]}}{\partial \mathbf{z}^{[2]}} \cdot \frac{\partial \mathbf{z}^{[2]}}{\partial \mathbf{a}^{[1]}} \cdot \frac{\partial \mathbf{a}^{[1]}}{\partial \mathbf{z}^{[1]}} \cdot \frac{\partial \mathbf{z}^{[1]}}{\partial \mathbf{W}^{[1]}}
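The chain rule can be sanity-checked numerically. A minimal sketch (the composite function sin(x²) is an arbitrary illustration, not part of the network above):

```python
import math

# Composite: y = sin(z), z = x**2, so dy/dx = cos(x**2) * 2x by the chain rule.
x = 1.5
dy_dz = math.cos(x ** 2)      # outer derivative, evaluated at z = x**2
dz_dx = 2 * x                 # inner derivative
chain = dy_dz * dz_dx

# Finite-difference estimate of dy/dx for comparison
h = 1e-6
fd = (math.sin((x + h) ** 2) - math.sin((x - h) ** 2)) / (2 * h)
print(abs(chain - fd) < 1e-6)  # the two estimates agree
```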


3. Complete Worked Example

Network Setup

x \xrightarrow{z = wx+b} z \xrightarrow{a = \text{ReLU}(z)} a \xrightarrow{L = (a-y)^2} L

Given:

  • Input: x = 2
  • Weight: w = 3
  • Bias: b = 1
  • True label: y = 5

Step 1: Forward Pass

z = wx + b = (3)(2) + 1 = 7

a = \text{ReLU}(7) = 7

L = (a - y)^2 = (7 - 5)^2 = 4
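The forward pass above can be sketched in a few lines of Python:

```python
def forward(x, w, b, y):
    """Tiny network: linear layer -> ReLU -> squared-error loss."""
    z = w * x + b            # z = (3)(2) + 1 = 7
    a = max(0.0, z)          # ReLU(7) = 7
    L = (a - y) ** 2         # (7 - 5)^2 = 4
    return z, a, L

z, a, L = forward(x=2.0, w=3.0, b=1.0, y=5.0)
print(z, a, L)  # 7.0 7.0 4.0
```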


Step 2: Backward Pass

We want \dfrac{\partial L}{\partial w} — how does loss change if we nudge w?

Apply chain rule:

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}

🔴 Term 1: \dfrac{\partial L}{\partial a}

L = (a-y)^2 \implies \frac{\partial L}{\partial a} = 2(a - y) = 2(7 - 5) = 4

🟡 Term 2: \dfrac{\partial a}{\partial z}

a = \text{ReLU}(z) = \max(0, z) \implies \frac{\partial a}{\partial z} = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z < 0 \end{cases}

Since z = 7 > 0: \dfrac{\partial a}{\partial z} = 1

🟢 Term 3: \dfrac{\partial z}{\partial w}

z = wx + b \implies \frac{\partial z}{\partial w} = x = 2

🔵 Chain them together:

\frac{\partial L}{\partial w} = 4 \cdot 1 \cdot 2 = \mathbf{8}

Interpretation: if we increase w slightly, the loss rises at a rate of about 8 per unit of w. So we should decrease w!


Step 3: Gradient for Bias

\frac{\partial L}{\partial b} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial b} = 4 \cdot 1 \cdot 1 = 4
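The whole backward pass for this example, term by term, with a finite-difference check confirming \partial L/\partial w (a sketch; the helper `loss` exists only for the check):

```python
x, w, b, y = 2.0, 3.0, 1.0, 5.0

# Forward pass (same numbers as above)
z = w * x + b                    # 7.0
a = max(0.0, z)                  # ReLU -> 7.0

# Backward pass, term by term
dL_da = 2 * (a - y)              # 4.0
da_dz = 1.0 if z > 0 else 0.0    # ReLU gate -> 1.0
dz_dw = x                        # 2.0
dL_dw = dL_da * da_dz * dz_dw    # 4 * 1 * 2 = 8.0
dL_db = dL_da * da_dz * 1.0      # dz/db = 1, so 4.0

# Finite-difference check on dL/dw
def loss(w_):
    a_ = max(0.0, w_ * x + b)
    return (a_ - y) ** 2

h = 1e-6
fd = (loss(w + h) - loss(w - h)) / (2 * h)
print(dL_dw, dL_db)              # 8.0 4.0
```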


4. Data Flow Diagram

(Interactive diagram: forward and backward data flow through the network.)

5. General Backprop Equations

For layer ll in a deep network:

Error signal (delta):

δ[l]=(W[l+1]Tδ[l+1])f(z[l])\boldsymbol{\delta}^{[l]} = \left(\mathbf{W}^{[l+1]T} \boldsymbol{\delta}^{[l+1]}\right) \odot f'\left(\mathbf{z}^{[l]}\right)

Weight gradient:

LW[l]=δ[l]a[l1]T\frac{\partial L}{\partial \mathbf{W}^{[l]}} = \boldsymbol{\delta}^{[l]} \cdot \mathbf{a}^{[l-1]T}

Bias gradient:

Lb[l]=δ[l]\frac{\partial L}{\partial \mathbf{b}^{[l]}} = \boldsymbol{\delta}^{[l]}

Where \odot denotes the element-wise (Hadamard) product and f' is the derivative of the activation function.
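These equations can be sketched with NumPy for a two-layer network (the layer sizes, linear output layer, and seeded random initialization are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))                  # a^[0], a single input column
y = rng.normal(size=(2, 1))                  # target

W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(2, 4)), np.zeros((2, 1))

relu = lambda z: np.maximum(0.0, z)

# Forward pass
z1 = W1 @ x + b1
a1 = relu(z1)
z2 = W2 @ a1 + b2
a2 = z2                                      # linear output layer, so f' = 1
L = np.sum((a2 - y) ** 2)

# Backward pass: delta^[l] = (W^[l+1].T @ delta^[l+1]) * f'(z^[l])
delta2 = 2 * (a2 - y)
delta1 = (W2.T @ delta2) * (z1 > 0)          # ReLU derivative gates the signal

# Gradients: dL/dW^[l] = delta^[l] @ a^[l-1].T, dL/db^[l] = delta^[l]
dW2, db2 = delta2 @ a1.T, delta2
dW1, db1 = delta1 @ x.T, delta1

# Finite-difference check on one weight of W1
h = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += h
Wm[0, 0] -= h
Lp = np.sum((W2 @ relu(Wp @ x + b1) + b2 - y) ** 2)
Lm = np.sum((W2 @ relu(Wm @ x + b1) + b2 - y) ** 2)
print(abs(dW1[0, 0] - (Lp - Lm) / (2 * h)) < 1e-4)
```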


6. Vanishing & Exploding Gradients

(Interactive diagram: gradient magnitude shrinking or growing layer by layer.)

Vanishing Gradient

When activation derivatives are small (< 1), gradients shrink exponentially with depth. With four layers each contributing a factor of 0.3:

g_1 = 0.3^4 = 0.0081 \approx 0

Early layers learn nothing.

Fixes:

  • Use ReLU (gradient = 1 for positive inputs)
  • ResNet skip connections
  • BatchNorm

Exploding Gradient

When per-layer gradient factors are large (> 1), gradients grow exponentially. With four layers each contributing a factor of 3:

g_1 = 3^4 = 81 \quad \text{(weights blow up!)}

Fixes:

  • Gradient clipping: g \leftarrow \min\left(g, g_{\max}\right)
  • Careful weight initialization
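A sketch of both failure modes, plus clipping by norm (a common variant of the element-wise min above; `clip_by_norm` is a hypothetical helper, not a library function):

```python
import numpy as np

# Gradients through n layers scale roughly like (per-layer factor)^n.
print(round(0.3 ** 4, 4))   # 0.0081 -> vanishing
print(3.0 ** 4)             # 81.0   -> exploding

def clip_by_norm(g, max_norm):
    """Rescale gradient vector g so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g

g = np.array([6.0, 8.0])    # norm 10
print(clip_by_norm(g, 5.0)) # rescaled to [3., 4.], norm 5
```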

7. Summary Table

| Step | Operation | Direction | Formula |
| --- | --- | --- | --- |
| Forward | Compute activations | \rightarrow | a^{[l]} = f(W^{[l]}a^{[l-1]} + b^{[l]}) |
| Compute loss | Measure error | | L = \text{loss}(\hat{y}, y) |
| Backward | Compute deltas | \leftarrow | \delta^{[l]} = (W^{[l+1]T}\delta^{[l+1]}) \odot f'(z^{[l]}) |
| Gradients | Weight gradients | | \nabla W^{[l]} = \delta^{[l]} a^{[l-1]T} |

8. Quick Reference

\boxed{\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}}

\boxed{\boldsymbol{\delta}^{[l]} = \left(\mathbf{W}^{[l+1]T} \boldsymbol{\delta}^{[l+1]}\right) \odot f'\left(\mathbf{z}^{[l]}\right)}

Filed under: ai
