ai · 2025-11-15 · 3 min read

Loss Functions

Measuring how wrong your model is — precisely. A deep dive into the functions that guide learning.

1. Intuition

A loss function (also called cost function or objective function) quantifies the gap between what the model predicted ($\hat{y}$) and what the true answer actually is ($y$). It converts that gap into a single number that training tries to minimize.

Real-Life Analogy 🎯 — Darts

| Element | Darts | Neural Network |
|---|---|---|
| Target | Bullseye | Correct answer $y$ |
| Dart landing | Where it hits | Prediction $\hat{y}$ |
| Distance from bullseye | Measurable error | Loss $L$ |
| Goal | Hit bullseye | Minimize $L$ |

2. Mean Squared Error (MSE)

Used for regression tasks — predicting continuous values like prices, temperatures, scores.

Formula

$$L_{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$$

House Price Example

$$L = (300{,}000 - 799{,}000)^2 = (-499{,}000)^2 \approx 2.49 \times 10^{11}$$

That's a huge loss — the model is very wrong!
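A minimal NumPy sketch of MSE (the function name is illustrative), reproducing the house-price arithmetic above:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Single house: true price 300,000, predicted 799,000
loss = mse([300_000], [799_000])
print(f"{loss:.3e}")  # 2.490e+11
```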

Why Squared?

| Property | Explanation |
|---|---|
| Always positive | $(-499{,}000)^2 > 0$ — negatives can't cancel positives |
| Penalizes large errors more | $10^2 = 100$ but $100^2 = 10{,}000$ |
| Smooth & differentiable | Easy to take the gradient of |

MSE Gradient

$$\frac{\partial L_{MSE}}{\partial \hat{y}_i} = \frac{2}{n}\left(\hat{y}_i - y_i\right)$$

Each prediction $\hat{y}_i$ is nudged in proportion to its own error.
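A quick way to trust a gradient formula is to compare it against a finite-difference approximation; a sketch (all names here are illustrative):

```python
import numpy as np

def mse_grad(y_true, y_pred):
    """Analytic MSE gradient w.r.t. each prediction: (2/n) * (y_hat - y)."""
    n = len(y_true)
    return 2.0 / n * (np.asarray(y_pred, float) - np.asarray(y_true, float))

y = np.array([3.0, -0.5, 2.0])
y_hat = np.array([2.5, 0.0, 2.1])

def mse(p):
    return np.mean((y - p) ** 2)

# Finite-difference check on the first prediction
eps = 1e-6
bumped = y_hat.copy()
bumped[0] += eps
numeric = (mse(bumped) - mse(y_hat)) / eps

analytic = mse_grad(y, y_hat)[0]  # (2/3) * (2.5 - 3.0) = -1/3
print(analytic, numeric)          # both ≈ -0.333
```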


3. Binary Cross-Entropy Loss

Used for binary classification (spam/not-spam, fraud/legit, yes/no).

Formula

$$L_{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$$

Spam Detection Example 📧

| Variable | Value |
|---|---|
| True label | Spam ($y = 1$) |
| Model confidence | 20% spam ($\hat{y} = 0.2$) |

$$L = -\left[1 \cdot \log(0.2) + 0 \cdot \log(0.8)\right]$$

$$L = -\log(0.2) = -(-1.609) = 1.609 \quad \text{(high loss!)}$$

Now, if the model were correct ($\hat{y} = 0.95$):

$$L = -\log(0.95) = 0.051 \quad \text{(low loss ✅)}$$
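A sketch of BCE in NumPy, reproducing both spam-detection cases above. Predictions are clipped away from 0 and 1 so the logarithm stays finite (the `eps` value is an arbitrary choice):

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy; eps clips predictions away from 0/1 for log safety."""
    p = np.clip(np.asarray(y_pred, float), eps, 1 - eps)
    y = np.asarray(y_true, float)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

print(round(bce([1], [0.2]), 3))   # 1.609  (confident and wrong)
print(round(bce([1], [0.95]), 3))  # 0.051  (confident and right)
```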

Why Logarithm?

Behavior of $-\log(\hat{y})$:

| Confidence $\hat{y}$ | $-\log(\hat{y})$ | Interpretation |
|---|---|---|
| 0.01 | 4.61 | Very wrong, very penalized |
| 0.10 | 2.30 | Wrong, penalized |
| 0.50 | 0.69 | Uncertain |
| 0.90 | 0.11 | Mostly right |
| 0.99 | 0.01 | Very confident and correct |

4. Categorical Cross-Entropy

Used for multi-class classification (digit 0–9, sentiment, language).

Formula

$$L_{CCE} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$

Where $C$ is the number of classes and $y_c$ is 1 for the correct class, 0 otherwise.

Digit Recognition Example 🔢

| Class | Logit $z$ | Softmax $\hat{y}$ | True $y$ |
|---|---|---|---|
| 0 | 1.2 | 0.09 | 0 |
| 1 | 0.5 | 0.04 | 0 |
| 2 | 3.1 | 0.82 | 1 ← correct |
| 3 | 0.8 | 0.06 | 0 |

$$L = -\log(0.82) = 0.198 \quad \text{(low, correct prediction)}$$
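The softmax-then-CCE pipeline can be sketched in a few lines. Note that the table above presumably shows only 4 of the 10 digit classes, so running softmax over just these four logits gives slightly different probabilities than the table:

```python
import numpy as np

def softmax(z):
    """Convert raw logits into probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def cce(probs, true_class):
    """Categorical cross-entropy with a one-hot target:
    only the -log of the true class's probability survives the sum."""
    return float(-np.log(probs[true_class]))

logits = [1.2, 0.5, 3.1, 0.8]   # the four classes shown above
p = softmax(logits)
print(p.round(2))               # [0.11 0.06 0.76 0.08]
print(round(cce(p, 2), 3))      # 0.281
```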


5. Mean Absolute Error (MAE)

$$L_{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

MSE vs MAE Comparison

| Property | MSE | MAE |
|---|---|---|
| Formula | $(y - \hat{y})^2$ | $\lvert y - \hat{y} \rvert$ |
| Outlier sensitivity | High | Low |
| Gradient | Smooth everywhere | Kink at 0 |
| Best when | Outliers matter | Outliers are noise |
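The outlier-sensitivity row can be seen with a tiny experiment: one outlier among five points dominates MSE but barely moves MAE (the numbers are made up for illustration):

```python
import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 10.0, 100.0])  # last point is an outlier
y_pred = np.array([10.0, 12.0, 11.0, 10.0, 11.0])   # model ignores the outlier

mse = float(np.mean((y_true - y_pred) ** 2))  # 89^2 / 5
mae = float(np.mean(np.abs(y_true - y_pred))) # 89 / 5
print(mse, mae)  # 1584.2 vs 17.8
```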

6. The Loss Landscape


Visualized as a 2D curve:

Loss
  │*               *
  │  *           *
  │    *       *
  │      *   *
  │        *          ← global minimum
  └──────────────────── Weight value

The loss landscape can be:

  • Convex (e.g. MSE with a linear model) — one global minimum, easy to optimize
  • Non-convex (deep networks) — many local minima and saddle points, harder to optimize

7. Choosing the Right Loss Function

| Task | Loss function |
|---|---|
| Regression (continuous values) | MSE |
| Regression with outliers | MAE |
| Binary classification | Binary cross-entropy |
| Multi-class classification | Categorical cross-entropy |

8. Quick Reference

$$\boxed{L_{MSE} = \frac{1}{n}\sum(y - \hat{y})^2}$$

$$\boxed{L_{BCE} = -[y\log\hat{y} + (1-y)\log(1-\hat{y})]}$$

$$\boxed{L_{MAE} = \frac{1}{n}\sum\left|y - \hat{y}\right|}$$

Filed under: ai
