ai · 2025-12-13 · 4 min read

Activation Functions

The secret ingredient that gives neural networks their power — and why linearity alone falls short.

1. Why Activation Functions?

Without activation functions, stacking multiple linear layers collapses into a single linear transformation:

$$\mathbf{z}^{[2]} = \mathbf{W}^{[2]}(\mathbf{W}^{[1]}\mathbf{x} + \mathbf{b}^{[1]}) + \mathbf{b}^{[2]} = \underbrace{(\mathbf{W}^{[2]}\mathbf{W}^{[1]})}_{\mathbf{W}'}\mathbf{x} + \underbrace{(\mathbf{W}^{[2]}\mathbf{b}^{[1]} + \mathbf{b}^{[2]})}_{\mathbf{b}'}$$

No matter how many layers — it's still just $\mathbf{W}'\mathbf{x} + \mathbf{b}'$. A straight line. Useless for complex patterns.

Activation functions introduce non-linearity — the ability to learn curves, shapes, and complex decision boundaries.
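The collapse is easy to verify numerically. A minimal sketch in plain Python, with made-up 2×2 weights (any values work): composing two linear layers gives exactly the same output as the single collapsed layer $\mathbf{W}' = \mathbf{W}^{[2]}\mathbf{W}^{[1]}$, $\mathbf{b}' = \mathbf{W}^{[2]}\mathbf{b}^{[1]} + \mathbf{b}^{[2]}$.

```python
# Two linear layers with no activation collapse into one linear map.
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def vadd(a, b):
    return [ai + bi for ai, bi in zip(a, b)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Hand-picked illustrative weights (not from a trained model)
W1, b1 = [[1.0, 2.0], [0.0, 1.0]], [0.5, -0.5]
W2, b2 = [[2.0, 0.0], [1.0, 1.0]], [1.0, 0.0]
x = [3.0, -1.0]

# Layer by layer: z2 = W2 (W1 x + b1) + b2
z2 = vadd(matvec(W2, vadd(matvec(W1, x), b1)), b2)

# Collapsed: W' = W2 W1,  b' = W2 b1 + b2
W_prime = matmul(W2, W1)
b_prime = vadd(matvec(W2, b1), b2)
z_collapsed = vadd(matvec(W_prime, x), b_prime)

print(z2 == z_collapsed)  # True: two layers were one linear map all along
```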

Real-Life Analogy 🧠 — Brain Neuron

A biological neuron only fires if the incoming signal exceeds a threshold:

  • Below threshold → silence
  • Above threshold → fires!

Activation functions mimic this: they decide whether and how strongly a neuron "fires."


2. Sigmoid

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Properties

| Property | Value |
| --- | --- |
| Output range | $(0, 1)$ |
| Derivative | $\sigma(z)(1 - \sigma(z))$ |
| Centered at | 0.5 |
| Saturates | Yes (both ends) |

Derivative

$$\frac{d\sigma}{dz} = \sigma(z)\left(1 - \sigma(z)\right)$$

Behavior

| $z$ | $\sigma(z)$ | Interpretation |
| --- | --- | --- |
| $-10$ | $\approx 0.00005$ | Neuron off |
| $0$ | $0.5$ | Uncertain |
| $10$ | $\approx 0.99995$ | Neuron on |

Use Case 📧 — Spam Detection Output

$$\sigma(2.5) = \frac{1}{1 + e^{-2.5}} = 0.924 \implies \text{"92.4\% spam"}$$
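The spam score above checks out in a couple of lines of plain Python:

```python
import math

def sigmoid(z):
    """Squash any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(round(sigmoid(2.5), 3))  # 0.924 -> "92.4% spam"
print(sigmoid(0.0))            # 0.5   -> maximally uncertain
```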

Vanishing Gradient Problem

At saturation, the derivative $\to 0$:

$$\frac{d\sigma}{dz}\Big|_{z=10} \approx 0.99995 \times 0.00005 \approx 4.5 \times 10^{-5}$$

$$\frac{d\sigma}{dz}\Big|_{z=20} \approx 2 \times 10^{-9}$$

Early layers receive essentially zero gradient → learn nothing.
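Both effects are easy to see numerically. A small sketch: the gradient at saturated inputs, and how backprop multiplies one sigmoid-derivative factor per layer, so even the *best-case* factor of 0.25 (at $z = 0$) shrinks the signal exponentially with depth:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Gradient at saturation is tiny:
print(sigmoid_grad(10))  # ~4.5e-05
print(sigmoid_grad(20))  # ~2.1e-09

# Backprop multiplies one such factor per layer. Even the maximum
# sigmoid gradient (0.25 at z = 0) vanishes across 10 layers:
grad = 1.0
for _ in range(10):
    grad *= 0.25
print(grad)  # ~9.5e-07
```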


3. Tanh

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

Properties

| Property | Value |
| --- | --- |
| Output range | $(-1, 1)$ |
| Derivative | $1 - \tanh^2(z)$ |
| Zero-centered | ✅ Yes |
| Saturates | Yes (both ends) |

Derivative

$$\frac{d\tanh}{dz} = 1 - \tanh^2(z)$$

Sigmoid vs Tanh

| | Sigmoid | Tanh |
| --- | --- | --- |
| Output range | $(0, 1)$ | $(-1, 1)$ |
| Zero-centered | ❌ No | ✅ Yes |
| Gradient flow | Weaker | Stronger |
| Preferred | Output layer | Hidden layers |
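"Stronger gradient flow" can be made concrete: at $z = 0$, tanh's gradient is 4× sigmoid's maximum. A quick check:

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

# Peak gradients at z = 0: tanh passes 4x more signal than sigmoid.
print(sigmoid_grad(0.0))  # 0.25
print(tanh_grad(0.0))     # 1.0
```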

4. ReLU — The Game Changer ⭐

$$\text{ReLU}(z) = \max(0, z)$$

Properties

| Property | Value |
| --- | --- |
| Output range | $[0, \infty)$ |
| Derivative | $\mathbf{1}[z > 0]$ |
| Computation | Very fast |
| Vanishing gradient | No (for $z > 0$) |

Derivative

$$\frac{d\,\text{ReLU}}{dz} = \begin{cases} 1 & z > 0 \\ 0 & z < 0 \end{cases}$$

Why ReLU Solved Vanishing Gradient

$$\frac{d\,\text{ReLU}}{dz} = 1 \quad \text{(for positive inputs)}$$

Gradient never shrinks! Information flows freely through positive neurons.
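Contrast this with the sigmoid chain: through any number of active ReLU layers, the multiplied gradient factor stays exactly 1. A minimal sketch:

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

# Through 10 active ReLU layers the gradient factor stays exactly 1,
# unlike the exponentially shrinking sigmoid chain:
grad = 1.0
for _ in range(10):
    grad *= relu_grad(3.0)  # any positive pre-activation
print(grad)  # 1.0
```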

Dying ReLU Problem

If $z$ is always negative for a neuron:

$$\text{ReLU}(z) = 0 \implies \frac{d\,\text{ReLU}}{dz} = 0 \implies \text{weights never update} \implies \text{dead neuron 🪦}$$

Fix: Use Leaky ReLU or initialize weights carefully.


5. Leaky ReLU

$$\text{LeakyReLU}(z) = \max(\alpha z, z), \quad \alpha = 0.01$$

$$\frac{d\,\text{LeakyReLU}}{dz} = \begin{cases} 1 & z > 0 \\ \alpha & z \leq 0 \end{cases}$$

No more dead neurons — negative inputs receive a small gradient $\alpha = 0.01$ instead of zero.
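In code the fix is a one-character change to the negative branch (a minimal sketch with the $\alpha = 0.01$ from the formula above):

```python
def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

print(leaky_relu(5.0))        # 5.0, identical to ReLU for z > 0
print(leaky_relu(-5.0))       # ~ -0.05, a small leak instead of 0
print(leaky_relu_grad(-5.0))  # 0.01, so the neuron can still learn
```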


6. Softmax — Multi-Class Output

$$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$

Converts raw logits into a probability distribution that sums to 1.

Digit Recognition Example 🔢

| Class | Logit $z$ | $e^z$ | Probability |
| --- | --- | --- | --- |
| 0 | 1.2 | 3.3 | 11.3% |
| 1 | 0.5 | 1.6 | 5.6% |
| 2 | 3.1 | 22.2 | 75.5% ← winner |
| 3 | 0.8 | 2.2 | 7.6% |
| Sum | | 29.4 | 100% ✅ |
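The table can be reproduced in a few lines (a plain-Python sketch; production code would subtract `max(logits)` first for numerical stability):

```python
import math

def softmax(logits):
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits from the digit-recognition table above
probs = softmax([1.2, 0.5, 3.1, 0.8])
print([round(p, 3) for p in probs])  # class 2 dominates
print(round(sum(probs), 10))         # 1.0: a valid distribution
```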

7. GELU — Used in Transformers (GPT, Claude, BERT)

$$\text{GELU}(z) = z \cdot \Phi(z)$$

Where Φ(z)\Phi(z) is the CDF of the standard normal distribution. Approximation:

$$\text{GELU}(z) \approx 0.5z\left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}\left(z + 0.044715z^3\right)\right]\right)$$

Unlike ReLU, GELU softly gates — negative inputs are suppressed rather than zeroed, preserving a small gradient flow.
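Both forms are short to implement; the exact version writes $\Phi(z)$ via the error function, and the two agree closely (a sketch, using only the standard library):

```python
import math

def gelu_exact(z):
    # z * Phi(z), with the standard normal CDF written via erf
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    # The tanh approximation from the formula above
    return 0.5 * z * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)))

# Negative inputs are damped toward 0 but not clamped to 0 as in ReLU:
for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(z, round(gelu_exact(z), 4), round(gelu_tanh(z), 4))
```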


8. Side-by-Side Comparison

| Function | Range | Vanishes | Dead? | Speed | Used In |
| --- | --- | --- | --- | --- | --- |
| Sigmoid | $(0,1)$ | ✅ Yes | No | Slow | Output (binary) |
| Tanh | $(-1,1)$ | ✅ Yes | No | Slow | Hidden (older) |
| ReLU | $[0,\infty)$ | ❌ No | Yes | Fast | Hidden ⭐ |
| Leaky ReLU | $(-\infty,\infty)$ | ❌ No | No | Fast | Hidden |
| Softmax | $(0,1)$, sum $=1$ | — | — | Medium | Output (multi-class) |
| GELU | $(-\infty,\infty)$ | ❌ No | No | Medium | Transformers ⭐ |

9. Where to Use Each

  • Hidden layers → ReLU by default; Leaky ReLU if neurons die; GELU in transformers
  • Binary output → Sigmoid
  • Multi-class output → Softmax

10. Practical Architecture Example — Spam Classifier

$$\mathbf{x} \xrightarrow{W^{[1]}} \text{ReLU} \xrightarrow{W^{[2]}} \text{ReLU} \xrightarrow{W^{[3]}} \sigma \longrightarrow \hat{y} \in (0,1)$$

| Layer | Activation | Output |
| --- | --- | --- |
| Input | — | Feature vector |
| Hidden 1 | ReLU | Abstract features |
| Hidden 2 | ReLU | High-level features |
| Output | Sigmoid | Spam probability |
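The whole pipeline fits in a short forward pass. A minimal sketch of the ReLU → ReLU → sigmoid architecture above, with made-up illustrative weights (not a trained model):

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu_vec(v):
    return [max(0.0, z) for z in v]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, params):
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu_vec([z + b for z, b in zip(matvec(W1, x), b1)])   # hidden 1
    h2 = relu_vec([z + b for z, b in zip(matvec(W2, h1), b2)])  # hidden 2
    logit = sum(w * h for w, h in zip(W3[0], h2)) + b3[0]       # output
    return sigmoid(logit)  # spam probability in (0, 1)

# Hypothetical weights for a 2-feature toy input
params = (
    ([[0.5, -0.2], [0.3, 0.8]], [0.1, -0.1]),  # hidden 1
    ([[1.0, -0.5], [-0.3, 0.7]], [0.0, 0.2]),  # hidden 2
    ([[1.5, -1.0]], [-0.2]),                   # output
)

p = forward([2.0, 1.0], params)
print(0.0 < p < 1.0)  # True: sigmoid squashes the logit to a probability
```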

11. Quick Reference

$$\boxed{\sigma(z) = \frac{1}{1+e^{-z}}, \quad \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}}$$

$$\boxed{\text{ReLU}(z) = \max(0,z), \quad \text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}}$$

Filed under: ai
