1. Intuition
A loss function (also called cost function or objective function) quantifies the gap between what the model predicted ($\hat{y}$) and what the true answer actually is ($y$). It converts that gap into a single number that training tries to minimize.
Real-Life Analogy 🎯 — Darts
| Element | Darts | Neural Network |
|---|---|---|
| Target | Bullseye | Correct answer |
| Dart landing | Where it hits | Prediction |
| Distance from bullseye | Measurable error | Loss |
| Goal | Hit bullseye | Minimize loss |
2. Mean Squared Error (MSE)
Used for regression tasks — predicting continuous values like prices, temperatures, scores.
Formula

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
House Price Example

Suppose the true price is \$350{,}000 and the model predicts \$200{,}000 (illustrative numbers, single prediction):

$$\text{MSE} = (350{,}000 - 200{,}000)^2 = 2.25 \times 10^{10}$$

That's a huge loss — the model is very wrong!
Why Squared?
| Property | Explanation |
|---|---|
| Always positive | $(y - \hat{y})^2 \ge 0$ — negatives can't cancel positives |
| Penalizes large errors more | An error of $2$ contributes $4$, but an error of $10$ contributes $100$ |
| Smooth & differentiable | Easy to take gradient of |
MSE Gradient

$$\frac{\partial \text{MSE}}{\partial \hat{y}_i} = \frac{2}{n}(\hat{y}_i - y_i)$$
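The MSE and its gradient can be sketched in a few lines of plain Python (function names here are illustrative, not from any particular library):

```python
def mse(y_true, y_pred):
    """Mean squared error: (1/n) * sum((y_i - y_hat_i)^2)."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

def mse_grad(y_true, y_pred):
    """Gradient of MSE w.r.t. each prediction: (2/n) * (y_hat_i - y_i)."""
    n = len(y_true)
    return [2 * (p - t) / n for t, p in zip(y_true, y_pred)]

# Every prediction off by exactly 1 -> MSE = 1
loss = mse([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])   # 1.0
grad = mse_grad([0.0], [3.0])                  # [6.0]
```

Note how the gradient grows linearly with the error, so larger mistakes produce proportionally larger update steps.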
3. Binary Cross-Entropy Loss
Used for binary classification (spam/not-spam, fraud/legit, yes/no).
Formula

$$L = -\left[\, y \log(\hat{y}) + (1 - y)\log(1 - \hat{y}) \,\right]$$
Spam Detection Example 📧
| Variable | Value |
|---|---|
| True label | Spam ($y = 1$) |
| Model confidence | 20% spam ($\hat{y} = 0.2$) |

Since $y = 1$, only the first term survives:

$$L = -\log(0.2) \approx 1.61$$
Now if the model was confident and correct ($\hat{y} = 0.9$):

$$L = -\log(0.9) \approx 0.11$$
Why Logarithm?
| Confidence $\hat{y}$ (true label $y=1$) | Loss $-\log(\hat{y})$ | Interpretation |
|---|---|---|
| 0.01 | 4.61 | Very wrong, very penalized |
| 0.10 | 2.30 | Wrong, penalized |
| 0.50 | 0.69 | Uncertain |
| 0.90 | 0.11 | Mostly right |
| 0.99 | 0.01 | Very confident and correct |
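The table's values can be reproduced with a small sketch (pure Python; the `eps` clipping constant is an assumption to guard against `log(0)`):

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """BCE for one example: -[y*log(y_hat) + (1-y)*log(1-y_hat)]."""
    y_hat = min(max(y_hat, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Spam example from the text: true label 1, model only 20% confident
loss_wrong = binary_cross_entropy(1, 0.2)   # ~1.61
loss_right = binary_cross_entropy(1, 0.9)   # ~0.11
```

The clipping step mirrors what real frameworks do internally: without it, a prediction of exactly 0 or 1 on the wrong class would yield an infinite loss.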
4. Categorical Cross-Entropy
Used for multi-class classification (digit 0–9, sentiment, language).
Formula

$$L = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$

Where $C$ is the number of classes and $y_c$ is 1 for the correct class, 0 otherwise.
Digit Recognition Example 🔢
| Class | Logit | Softmax | True |
|---|---|---|---|
| 0 | 1.2 | 0.11 | 0 |
| 1 | 0.5 | 0.06 | 0 |
| 2 | 3.1 | 0.76 | 1 ← correct |
| 3 | 0.8 | 0.08 | 0 |

Because the target is one-hot, only the correct class contributes:

$$L = -\log(0.76) \approx 0.28$$
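A softmax-plus-cross-entropy sketch for the logits above (pure Python, natural log; with these exact logits, softmax assigns ≈0.76 to class 2):

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(probs, true_class):
    """One-hot target: only the correct class's term survives the sum."""
    return -math.log(probs[true_class])

logits = [1.2, 0.5, 3.1, 0.8]   # digit-recognition example
probs = softmax(logits)         # ~[0.11, 0.06, 0.76, 0.08]
loss = categorical_cross_entropy(probs, true_class=2)   # ~0.28
```

Subtracting the max logit before exponentiating changes nothing mathematically but prevents overflow for large logits, a standard trick in real implementations.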
5. Mean Absolute Error (MAE)

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert\, y_i - \hat{y}_i \,\rvert$$
MSE vs MAE Comparison
| Property | MSE | MAE |
|---|---|---|
| Formula | $\frac{1}{n}\sum_{i}(y_i - \hat{y}_i)^2$ | $\frac{1}{n}\sum_{i}\lvert y_i - \hat{y}_i \rvert$ |
| Outlier sensitivity | High | Low |
| Gradient | Smooth everywhere | Kink at 0 |
| Best when | Outliers matter | Outliers are noise |
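The outlier-sensitivity row can be demonstrated directly (illustrative data with one bad point):

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

def mae(y_true, y_pred):
    """Mean absolute error."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

y_true = [1.0, 2.0, 3.0, 100.0]   # the last point is an outlier
y_pred = [1.0, 2.0, 3.0, 4.0]     # model fits the first three perfectly

# Squaring makes the single outlier dominate:
# MSE = 96^2 / 4 = 2304.0, while MAE = 96 / 4 = 24.0
```

One wrong point pushes MSE two orders of magnitude higher than MAE, which is exactly why MAE is preferred when outliers are noise rather than signal.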
6. The Loss Landscape
Visualized as a 2D curve:

```
Loss
│*                 *
│ *               *
│   *           *
│     *       *
│        * *   ← global minimum
└──────────────────── Weight value
```
The loss landscape can be:
- Convex (MSE) — one global minimum, easy to optimize
- Non-convex (deep networks) — many local minima, harder to optimize
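The convex case can be sketched by sweeping a single weight of a toy model $\hat{y} = w \cdot x$ and evaluating MSE at each point (data and grid are illustrative):

```python
def mse_at(w, xs, ys):
    """MSE of the one-parameter model y_hat = w * x."""
    n = len(xs)
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # generated with true weight w = 2

# Sweep w over [0.0, 4.0] in steps of 0.1 and find the lowest loss
losses = [(w / 10, mse_at(w / 10, xs, ys)) for w in range(0, 41)]
best_w, best_loss = min(losses, key=lambda t: t[1])
# best_w = 2.0, best_loss = 0.0 — one global minimum, as the convex bowl predicts
```

For a deep network the same sweep over millions of weights would trace a rugged, non-convex surface, which is why optimizers can stall in flat regions or local minima.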
7. Choosing the Right Loss Function

| Task | Loss |
|---|---|
| Regression, outliers matter | MSE |
| Regression, outliers are noise | MAE |
| Binary classification | Binary cross-entropy |
| Multi-class classification | Categorical cross-entropy |
8. Quick Reference

| Loss | Formula | Output activation | Use case |
|---|---|---|---|
| MSE | $\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ | Linear | Regression |
| MAE | $\frac{1}{n}\sum_i \lvert y_i - \hat{y}_i \rvert$ | Linear | Robust regression |
| Binary cross-entropy | $-[y\log\hat{y} + (1-y)\log(1-\hat{y})]$ | Sigmoid | Binary classification |
| Categorical cross-entropy | $-\sum_{c=1}^{C} y_c \log \hat{y}_c$ | Softmax | Multi-class classification |