1. Why Activation Functions?
Without activation functions, stacking multiple linear layers collapses into a single linear transformation:
$$z^{[2]} = W^{[2]}\left(W^{[1]}x + b^{[1]}\right) + b^{[2]} = \underbrace{W^{[2]}W^{[1]}}_{W'}\,x + \underbrace{W^{[2]}b^{[1]} + b^{[2]}}_{b'}$$
No matter how many layers you stack, the result is still just $W'x + b'$: a single straight line, useless for complex patterns.
Activation functions introduce non-linearity — the ability to learn curves, shapes, and complex decision boundaries.
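A quick NumPy sketch (layer shapes chosen arbitrarily for illustration) confirms the collapse: two stacked linear layers are numerically identical to one linear map with $W' = W^{[2]}W^{[1]}$ and $b' = W^{[2]}b^{[1]} + b^{[2]}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear "layers" with no activation in between (shapes are arbitrary).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Stacked linear layers...
z2 = W2 @ (W1 @ x + b1) + b2

# ...collapse into a single linear map W'x + b'.
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
assert np.allclose(z2, W_prime @ x + b_prime)
```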
Real-Life Analogy 🧠 — Brain Neuron
A biological neuron only fires if the incoming signal exceeds a threshold:
- Below threshold → silence
- Above threshold → fires!
Activation functions mimic this: they decide whether and how strongly a neuron "fires."
2. Sigmoid
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Properties
| Property | Value |
|---|---|
| Output range | (0,1) |
| Derivative | σ(z)(1−σ(z)) |
| Centered at | 0.5 |
| Saturates | Yes (both ends) |
Derivative
$$\frac{d\sigma}{dz} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)$$
Behavior
| z | σ(z) | Interpretation |
|---|---|---|
| −10 | ≈0.00005 | Neuron off |
| 0 | 0.5 | Uncertain |
| 10 | ≈0.99995 | Neuron on |
Use Case 📧 — Spam Detection Output
$$\sigma(2.5) = \frac{1}{1 + e^{-2.5}} \approx 0.924 \;\Longrightarrow\; \text{"92.4\% spam"}$$
Vanishing Gradient Problem
At saturation, derivative →0:
$$\left.\frac{d\sigma}{dz}\right|_{z=10} \approx 0.99995 \times 0.00005 \approx 5 \times 10^{-5}$$
$$\left.\frac{d\sigma}{dz}\right|_{z=20} \approx 2 \times 10^{-9}$$
Early layers receive essentially zero gradient → learn nothing.
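The saturation effect is easy to reproduce. Below is a minimal sketch (helper names `sigmoid` and `sigmoid_grad` are mine) showing both the spam-score example and how the gradient collapses at large $|z|$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # d sigma/dz = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(2.5))        # ~0.924  -> "92.4% spam"
print(sigmoid_grad(0.0))   # 0.25, the maximum possible gradient
print(sigmoid_grad(10.0))  # ~4.5e-05, nearly vanished
print(sigmoid_grad(20.0))  # ~2e-09, effectively zero
```

Even at the peak ($z = 0$) the gradient is only 0.25, so chaining many sigmoid layers multiplies several small factors together, which is the vanishing-gradient problem in miniature.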
3. Tanh
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
Properties
| Property | Value |
|---|---|
| Output range | (−1,1) |
| Derivative | 1−tanh2(z) |
| Zero-centered | ✅ Yes |
| Saturates | Yes (both ends) |
Derivative
$$\frac{d\tanh}{dz} = 1 - \tanh^{2}(z)$$
Sigmoid vs Tanh
| Property | Sigmoid | Tanh |
|---|---|---|
| Output range | (0,1) | (−1,1) |
| Zero-centered | ❌ No | ✅ Yes |
| Gradient flow | Weaker | Stronger |
| Preferred for | Output layer | Hidden layers |
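Both "zero-centered" and "stronger gradient" are easy to check numerically. This sketch (with my own `sigmoid` helper) compares the two functions on a symmetric range of inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3, 3, 7)  # symmetric inputs around 0

# tanh outputs average to ~0 (zero-centered); sigmoid's average to ~0.5.
print(np.tanh(z).mean())   # ~0.0
print(sigmoid(z).mean())   # ~0.5

# Peak gradients at z = 0: tanh'(0) = 1 vs sigmoid'(0) = 0.25.
print(1 - np.tanh(0.0) ** 2)                  # 1.0
print(sigmoid(0.0) * (1 - sigmoid(0.0)))      # 0.25
```

The 4x larger peak gradient is why tanh trains noticeably better than sigmoid in hidden layers, even though both still saturate.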
4. ReLU — The Game Changer ⭐
$$\mathrm{ReLU}(z) = \max(0, z)$$
Properties
| Property | Value |
|---|---|
| Output range | [0,∞) |
| Derivative | 1[z>0] |
| Computationally | Very fast |
| Vanishing gradient | No (for z>0) |
Derivative
$$\frac{d\,\mathrm{ReLU}}{dz} = \begin{cases} 1 & z > 0 \\ 0 & z < 0 \end{cases}$$
Why ReLU Solved Vanishing Gradient
$$\frac{d\,\mathrm{ReLU}}{dz} = 1 \quad \text{(for positive inputs)}$$
Gradient never shrinks! Information flows freely through positive neurons.
Dying ReLU Problem
If z is always negative for a neuron:
$$\mathrm{ReLU}(z) = 0 \;\Longrightarrow\; \frac{d\,\mathrm{ReLU}}{dz} = 0 \;\Longrightarrow\; \text{weights never update} \;\Longrightarrow\; \text{dead neuron 🪦}$$
Fix: Use Leaky ReLU or initialize weights carefully.
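The dying-ReLU failure mode can be demonstrated in a few lines (the helper names `relu` and `relu_grad` are mine): once every input a neuron sees is negative, both its output and its gradient are identically zero, so gradient descent can never move its weights again.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # 1 for z > 0, 0 otherwise (the z = 0 case is conventionally set to 0)
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.5, 3.0])
print(relu(z))       # zeros for negative inputs, identity for positive
print(relu_grad(z))  # gradient: 0, 0, 1, 1

# Dying ReLU: a neuron whose pre-activation is always negative
# has zero output AND zero gradient -> its weights never update.
always_negative = np.array([-4.0, -1.2, -0.3])
assert relu(always_negative).sum() == 0.0
assert relu_grad(always_negative).sum() == 0.0
```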
5. Leaky ReLU
$$\mathrm{LeakyReLU}(z) = \max(\alpha z,\, z), \qquad \alpha = 0.01$$
$$\frac{d\,\mathrm{LeakyReLU}}{dz} = \begin{cases} 1 & z > 0 \\ \alpha & z \le 0 \end{cases}$$
No more dead neurons — negative inputs receive a small gradient α=0.01 instead of zero.
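A minimal implementation (helper names mine, default $\alpha = 0.01$ as above) shows how negative inputs keep a small but nonzero gradient:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # identity for z > 0, a small linear slope alpha*z otherwise
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

z = np.array([-4.0, -1.2, 0.5])
print(leaky_relu(z))       # small negative values instead of zeros
print(leaky_relu_grad(z))  # 0.01, 0.01, 1.0 -- never exactly zero
```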
6. Softmax — Multi-Class Output
$$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$
Converts raw logits into a probability distribution that sums to 1.
Digit Recognition Example 🔢
| Class | Logit z | ez | Probability |
|---|---|---|---|
| 0 | 1.2 | 3.32 | 11.3% |
| 1 | 0.5 | 1.65 | 5.6% |
| 2 | 3.1 | 22.20 | 75.5% ← winner |
| 3 | 0.8 | 2.23 | 7.6% |
| Sum | — | 29.39 | 100% ✅ |
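Recomputing the digit-recognition logits in code (the `softmax` helper is mine; the max-subtraction trick is a standard numerical-stability measure, not part of the formula itself):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([1.2, 0.5, 3.1, 0.8])  # classes 0-3
p = softmax(logits)
print(p.round(3))            # ~[0.113 0.056 0.755 0.076]
assert np.isclose(p.sum(), 1.0)
assert p.argmax() == 2       # class 2 wins
```

Shifting by the maximum logit leaves the output unchanged (numerator and denominator are scaled by the same factor) but prevents `np.exp` from overflowing on large logits.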
7. GELU — Used in Transformers (GPT, Claude, BERT)
$$\mathrm{GELU}(z) = z \cdot \Phi(z)$$
Where Φ(z) is the CDF of the standard normal distribution. Approximation:
$$\mathrm{GELU}(z) \approx 0.5\,z\left(1 + \tanh\!\left[\sqrt{\tfrac{2}{\pi}}\left(z + 0.044715\,z^{3}\right)\right]\right)$$
Unlike ReLU, GELU softly gates — negative inputs are suppressed rather than zeroed, preserving a small gradient flow.
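Both forms are easy to compare side by side. This sketch (helper names mine) implements the exact definition via the error function, $\Phi(z) = \tfrac{1}{2}\bigl(1 + \mathrm{erf}(z/\sqrt{2})\bigr)$, alongside the tanh approximation:

```python
import math

def gelu_exact(z):
    # z * Phi(z), with Phi the standard normal CDF written via erf
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    # tanh approximation commonly used in transformer implementations
    return 0.5 * z * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)))

for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{z:5.1f}  exact={gelu_exact(z):+.4f}  approx={gelu_tanh(z):+.4f}")
```

Note the outputs for negative inputs: small negative values rather than hard zeros, which is exactly the "soft gating" that distinguishes GELU from ReLU.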
8. Side-by-Side Comparison
| Function | Range | Vanishes | Dead? | Speed | Used In |
|---|---|---|---|---|---|
| Sigmoid | (0,1) | ✅ Yes | No | Slow | Output (binary) |
| Tanh | (−1,1) | ✅ Yes | No | Slow | Hidden (older) |
| ReLU | [0,∞) | ❌ No | Yes | Fast | Hidden ⭐ |
| Leaky ReLU | (−∞,∞) | ❌ No | No | Fast | Hidden |
| Softmax | (0,1) sum=1 | — | — | Medium | Output (multi-class) |
| GELU | (−∞,∞) | ❌ No | No | Medium | Transformers ⭐ |
9. Where to Use Each
*(Diagram not rendered: decision guide for picking an activation by layer type: ReLU or GELU for hidden layers, sigmoid for binary outputs, softmax for multi-class outputs.)*
10. Practical Architecture Example — Spam Classifier
$$x \xrightarrow{W^{[1]}} \mathrm{ReLU} \xrightarrow{W^{[2]}} \mathrm{ReLU} \xrightarrow{W^{[3]}} \sigma \longrightarrow \hat{y} \in (0,1)$$
| Layer | Activation | Output |
|---|---|---|
| Input | — | Feature vector |
| Hidden 1 | ReLU | Abstract features |
| Hidden 2 | ReLU | High-level features |
| Output | Sigmoid | Spam probability |
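A forward pass through this architecture fits in a few lines. The sketch below uses random weights and arbitrary layer sizes (10 → 8 → 4 → 1, my choice for illustration), so the printed probability is meaningless until the network is trained, but the activation placement matches the table:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy spam classifier: sizes are illustrative, weights untrained.
x = rng.normal(size=10)                       # input feature vector
W1, b1 = rng.normal(size=(8, 10)), np.zeros(8)
W2, b2 = rng.normal(size=(4, 8)),  np.zeros(4)
W3, b3 = rng.normal(size=(1, 4)),  np.zeros(1)

h1 = relu(W1 @ x + b1)          # hidden 1: ReLU
h2 = relu(W2 @ h1 + b2)         # hidden 2: ReLU
y_hat = sigmoid(W3 @ h2 + b3)   # output: sigmoid -> probability in (0,1)
print(float(y_hat))
```

Whatever the weights, the sigmoid output layer guarantees $\hat{y} \in (0,1)$, which is what lets us read it as a spam probability.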
11. Quick Reference
$$\sigma(z) = \frac{1}{1+e^{-z}}, \qquad \tanh(z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$$
$$\mathrm{ReLU}(z) = \max(0,z), \qquad \mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$