1. Why Activation Functions?
Without activation functions, stacking multiple linear layers collapses into a single linear transformation:
$$z^{[2]} = W^{[2]}\left(W^{[1]}x + b^{[1]}\right) + b^{[2]} = \underbrace{W^{[2]}W^{[1]}}_{W'}\,x + \underbrace{W^{[2]}b^{[1]} + b^{[2]}}_{b'}$$
No matter how many layers you stack, the result is still just $W'x + b'$: a single straight line, useless for complex patterns.
Activation functions introduce non-linearity — the ability to learn curves, shapes, and complex decision boundaries.
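A quick NumPy sketch (layer shapes chosen arbitrarily for illustration) confirms the collapse: two stacked linear layers are numerically identical to one linear map with $W' = W^{[2]}W^{[1]}$ and $b' = W^{[2]}b^{[1]} + b^{[2]}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear "layers" with no activation in between (shapes are arbitrary).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Stacked linear layers...
z2 = W2 @ (W1 @ x + b1) + b2

# ...collapse into a single linear map W'x + b'.
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
assert np.allclose(z2, W_prime @ x + b_prime)
```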
Real-Life Analogy 🧠 — Brain Neuron
A biological neuron only fires if the incoming signal exceeds a threshold:
- Below threshold → silence
- Above threshold → fires!
Activation functions mimic this: they decide whether and how strongly a neuron "fires."
2. Sigmoid
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Properties
| Property | Value |
|---|---|
| Output range | (0,1) |
| Derivative | σ(z)(1−σ(z)) |
| Centered at | 0.5 |
| Saturates | Yes (both ends) |
Derivative
$$\frac{d\sigma}{dz} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)$$
Behavior
| z | σ(z) | Interpretation |
|---|---|---|
| −10 | ≈0.00005 | Neuron off |
| 0 | 0.5 | Uncertain |
| 10 | ≈0.99995 | Neuron on |
Use Case 📧 — Spam Detection Output
$$\sigma(2.5) = \frac{1}{1 + e^{-2.5}} \approx 0.924 \;\Longrightarrow\; \text{"92.4\% spam"}$$
Vanishing Gradient Problem
At saturation, derivative →0:
$$\left.\frac{d\sigma}{dz}\right|_{z=10} \approx 0.99995 \times 0.00005 \approx 5 \times 10^{-5}$$
$$\left.\frac{d\sigma}{dz}\right|_{z=20} \approx 2 \times 10^{-9}$$
Early layers receive essentially zero gradient → learn nothing.
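The saturation effect is easy to reproduce. Below is a minimal sketch (helper names `sigmoid` and `sigmoid_grad` are mine) showing both the spam-score example and how the gradient collapses at large $|z|$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # d sigma/dz = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(2.5))        # ~0.924  -> "92.4% spam"
print(sigmoid_grad(0.0))   # 0.25, the maximum possible gradient
print(sigmoid_grad(10.0))  # ~4.5e-05, nearly vanished
print(sigmoid_grad(20.0))  # ~2e-09, effectively zero
```

Even at the peak ($z = 0$) the gradient is only 0.25, so chaining many sigmoid layers multiplies several small factors together, which is the vanishing-gradient problem in miniature.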
3. Tanh
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
Properties
| Property | Value |
|---|---|
| Output range | (−1,1) |
| Derivative | 1−tanh2(z) |
| Zero-centered | ✅ Yes |
| Saturates | Yes (both ends) |
Derivative
$$\frac{d\tanh}{dz} = 1 - \tanh^{2}(z)$$
Sigmoid vs Tanh
| Property | Sigmoid | Tanh |
|---|---|---|
| Output range | (0,1) | (−1,1) |
| Zero-centered | ❌ No | ✅ Yes |
| Gradient flow | Weaker | Stronger |
| Preferred for | Output layer | Hidden layers |
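Both "zero-centered" and "stronger gradient" are easy to check numerically. This sketch (with my own `sigmoid` helper) compares the two functions on a symmetric range of inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3, 3, 7)  # symmetric inputs around 0

# tanh outputs average to ~0 (zero-centered); sigmoid's average to ~0.5.
print(np.tanh(z).mean())   # ~0.0
print(sigmoid(z).mean())   # ~0.5

# Peak gradients at z = 0: tanh'(0) = 1 vs sigmoid'(0) = 0.25.
print(1 - np.tanh(0.0) ** 2)                  # 1.0
print(sigmoid(0.0) * (1 - sigmoid(0.0)))      # 0.25
```

The 4x larger peak gradient is why tanh trains noticeably better than sigmoid in hidden layers, even though both still saturate.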
4. ReLU — The Game Changer ⭐
$$\mathrm{ReLU}(z) = \max(0, z)$$
Properties
| Property | Value |
|---|---|
| Output range | [0,∞) |
| Derivative | 1[z>0] |
| Computationally | Very fast |
| Vanishing gradient | No (for z>0) |
Derivative
$$\frac{d\,\mathrm{ReLU}}{dz} = \begin{cases} 1 & z > 0 \\ 0 & z < 0 \end{cases}$$
Why ReLU Solved Vanishing Gradient
$$\frac{d\,\mathrm{ReLU}}{dz} = 1 \quad \text{(for positive inputs)}$$
Gradient never shrinks! Information flows freely through positive neurons.
Dying ReLU Problem
If z is always negative for a neuron:
$$\mathrm{ReLU}(z) = 0 \;\Longrightarrow\; \frac{d\,\mathrm{ReLU}}{dz} = 0 \;\Longrightarrow\; \text{weights never update} \;\Longrightarrow\; \text{dead neuron 🪦}$$
Fix: Use Leaky ReLU or initialize weights carefully.
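The dying-ReLU failure mode can be demonstrated in a few lines (the helper names `relu` and `relu_grad` are mine): once every input a neuron sees is negative, both its output and its gradient are identically zero, so gradient descent can never move its weights again.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # 1 for z > 0, 0 otherwise (the z = 0 case is conventionally set to 0)
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.5, 3.0])
print(relu(z))       # zeros for negative inputs, identity for positive
print(relu_grad(z))  # gradient: 0, 0, 1, 1

# Dying ReLU: a neuron whose pre-activation is always negative
# has zero output AND zero gradient -> its weights never update.
always_negative = np.array([-4.0, -1.2, -0.3])
assert relu(always_negative).sum() == 0.0
assert relu_grad(always_negative).sum() == 0.0
```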
5. Leaky ReLU
$$\mathrm{LeakyReLU}(z) = \max(\alpha z,\, z), \qquad \alpha = 0.01$$
$$\frac{d\,\mathrm{LeakyReLU}}{dz} = \begin{cases} 1 & z > 0 \\ \alpha & z \le 0 \end{cases}$$
No more dead neurons — negative inputs receive a small gradient α=0.01 instead of zero.
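A minimal implementation (helper names mine, default $\alpha = 0.01$ as above) shows how negative inputs keep a small but nonzero gradient:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # identity for z > 0, a small linear slope alpha*z otherwise
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

z = np.array([-4.0, -1.2, 0.5])
print(leaky_relu(z))       # small negative values instead of zeros
print(leaky_relu_grad(z))  # 0.01, 0.01, 1.0 -- never exactly zero
```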
6. Softmax — Multi-Class Output
$$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$
Converts raw logits into a probability distribution that sums to 1.
Digit Recognition Example 🔢
| Class | Logit z | ez | Probability |
|---|---|---|---|
| 0 | 1.2 | 3.32 | 11.3% |
| 1 | 0.5 | 1.65 | 5.6% |
| 2 | 3.1 | 22.20 | 75.5% ← winner |
| 3 | 0.8 | 2.23 | 7.6% |
| Sum | — | 29.39 | 100% ✅ |
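Recomputing the digit-recognition logits in code (the `softmax` helper is mine; the max-subtraction trick is a standard numerical-stability measure, not part of the formula itself):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([1.2, 0.5, 3.1, 0.8])  # classes 0-3
p = softmax(logits)
print(p.round(3))            # ~[0.113 0.056 0.755 0.076]
assert np.isclose(p.sum(), 1.0)
assert p.argmax() == 2       # class 2 wins
```

Shifting by the maximum logit leaves the output unchanged (numerator and denominator are scaled by the same factor) but prevents `np.exp` from overflowing on large logits.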
7. GELU — Used in Transformers (GPT, Claude, BERT)
$$\mathrm{GELU}(z) = z \cdot \Phi(z)$$
Where Φ(z) is the CDF of the standard normal distribution. Approximation:
$$\mathrm{GELU}(z) \approx 0.5\,z\left(1 + \tanh\!\left[\sqrt{\tfrac{2}{\pi}}\left(z + 0.044715\,z^{3}\right)\right]\right)$$
Unlike ReLU, GELU softly gates — negative inputs are suppressed rather than zeroed, preserving a small gradient flow.
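Both forms are easy to compare side by side. This sketch (helper names mine) implements the exact definition via the error function, $\Phi(z) = \tfrac{1}{2}\bigl(1 + \mathrm{erf}(z/\sqrt{2})\bigr)$, alongside the tanh approximation:

```python
import math

def gelu_exact(z):
    # z * Phi(z), with Phi the standard normal CDF written via erf
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    # tanh approximation commonly used in transformer implementations
    return 0.5 * z * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)))

for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{z:5.1f}  exact={gelu_exact(z):+.4f}  approx={gelu_tanh(z):+.4f}")
```

Note the outputs for negative inputs: small negative values rather than hard zeros, which is exactly the "soft gating" that distinguishes GELU from ReLU.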
8. Side-by-Side Comparison
| Function | Range | Vanishes | Dead? | Speed | Used In |
|---|---|---|---|---|---|
| Sigmoid | (0,1) | ✅ Yes | No | Slow | Output (binary) |
| Tanh | (−1,1) | ✅ Yes | No | Slow | Hidden (older) |
| ReLU | [0,∞) | ❌ No | Yes | Fast | Hidden ⭐ |
| Leaky ReLU | (−∞,∞) | ❌ No | No | Fast | Hidden |
| Softmax | (0,1) sum=1 | — | — | Medium | Output (multi-class) |
| GELU | (−∞,∞) | ❌ No | No | Medium | Transformers ⭐ |
9. Where to Use Each
*(Diagram not rendered: decision guide for picking an activation by layer type: ReLU or GELU for hidden layers, sigmoid for binary outputs, softmax for multi-class outputs.)*
10. Practical Architecture Example — Spam Classifier
$$x \xrightarrow{W^{[1]}} \mathrm{ReLU} \xrightarrow{W^{[2]}} \mathrm{ReLU} \xrightarrow{W^{[3]}} \sigma \longrightarrow \hat{y} \in (0,1)$$
| Layer | Activation | Output |
|---|---|---|
| Input | — | Feature vector |
| Hidden 1 | ReLU | Abstract features |
| Hidden 2 | ReLU | High-level features |
| Output | Sigmoid | Spam probability |
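A forward pass through this architecture fits in a few lines. The sketch below uses random weights and arbitrary layer sizes (10 → 8 → 4 → 1, my choice for illustration), so the printed probability is meaningless until the network is trained, but the activation placement matches the table:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy spam classifier: sizes are illustrative, weights untrained.
x = rng.normal(size=10)                       # input feature vector
W1, b1 = rng.normal(size=(8, 10)), np.zeros(8)
W2, b2 = rng.normal(size=(4, 8)),  np.zeros(4)
W3, b3 = rng.normal(size=(1, 4)),  np.zeros(1)

h1 = relu(W1 @ x + b1)          # hidden 1: ReLU
h2 = relu(W2 @ h1 + b2)         # hidden 2: ReLU
y_hat = sigmoid(W3 @ h2 + b3)   # output: sigmoid -> probability in (0,1)
print(float(y_hat))
```

Whatever the weights, the sigmoid output layer guarantees $\hat{y} \in (0,1)$, which is what lets us read it as a spam probability.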
11. Quick Reference
$$\sigma(z) = \frac{1}{1+e^{-z}}, \qquad \tanh(z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$$
$$\mathrm{ReLU}(z) = \max(0,z), \qquad \mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$