1. Intuition
A loss function (also called cost function or objective function) quantifies the gap between what the model predicted ($\hat{y}$) and what the true answer actually is ($y$). It converts that gap into a single number that training tries to minimize.
Real-Life Analogy 🎯 — Darts
| Element | Darts | Neural Network |
|---|---|---|
| Target | Bullseye | Correct answer |
| Dart landing | Where it hits | Prediction |
| Distance from bullseye | Measurable error | Loss |
| Goal | Hit bullseye | Minimize loss |
2. Mean Squared Error (MSE)
Used for regression tasks — predicting continuous values like prices, temperatures, scores.
Formula

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
House Price Example

Suppose the true price is \$350{,}000 and the model predicts \$200{,}000 (illustrative numbers, single prediction):

$$\text{MSE} = (350{,}000 - 200{,}000)^2 = 2.25 \times 10^{10}$$

That's a huge loss — the model is very wrong!
Why Squared?
| Property | Explanation |
|---|---|
| Always positive | $(y - \hat{y})^2 \ge 0$ — negatives can't cancel positives |
| Penalizes large errors more | An error of $2$ contributes $4$, but an error of $10$ contributes $100$ |
| Smooth & differentiable | Easy to take gradient of |
MSE Gradient

$$\frac{\partial \text{MSE}}{\partial \hat{y}_i} = \frac{2}{n}(\hat{y}_i - y_i)$$
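The MSE and its gradient can be sketched in a few lines of plain Python (function names here are illustrative, not from any particular library):

```python
def mse(y_true, y_pred):
    """Mean squared error: (1/n) * sum((y_i - y_hat_i)^2)."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

def mse_grad(y_true, y_pred):
    """Gradient of MSE w.r.t. each prediction: (2/n) * (y_hat_i - y_i)."""
    n = len(y_true)
    return [2 * (p - t) / n for t, p in zip(y_true, y_pred)]

# Every prediction off by exactly 1 -> MSE = 1
loss = mse([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])   # 1.0
grad = mse_grad([0.0], [3.0])                  # [6.0]
```

Note how the gradient grows linearly with the error, so larger mistakes produce proportionally larger update steps.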
3. Binary Cross-Entropy Loss
Used for binary classification (spam/not-spam, fraud/legit, yes/no).
Formula

$$L = -\left[\, y \log(\hat{y}) + (1 - y)\log(1 - \hat{y}) \,\right]$$
Spam Detection Example 📧
| Variable | Value |
|---|---|
| True label | Spam ($y = 1$) |
| Model confidence | 20% spam ($\hat{y} = 0.2$) |

Since $y = 1$, only the first term survives:

$$L = -\log(0.2) \approx 1.61$$
Now if the model was confident and correct ($\hat{y} = 0.9$):

$$L = -\log(0.9) \approx 0.11$$
Why Logarithm?
| Confidence $\hat{y}$ (true label $y=1$) | Loss $-\log(\hat{y})$ | Interpretation |
|---|---|---|
| 0.01 | 4.61 | Very wrong, very penalized |
| 0.10 | 2.30 | Wrong, penalized |
| 0.50 | 0.69 | Uncertain |
| 0.90 | 0.11 | Mostly right |
| 0.99 | 0.01 | Very confident and correct |
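The table's values can be reproduced with a small sketch (pure Python; the `eps` clipping constant is an assumption to guard against `log(0)`):

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """BCE for one example: -[y*log(y_hat) + (1-y)*log(1-y_hat)]."""
    y_hat = min(max(y_hat, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Spam example from the text: true label 1, model only 20% confident
loss_wrong = binary_cross_entropy(1, 0.2)   # ~1.61
loss_right = binary_cross_entropy(1, 0.9)   # ~0.11
```

The clipping step mirrors what real frameworks do internally: without it, a prediction of exactly 0 or 1 on the wrong class would yield an infinite loss.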
4. Categorical Cross-Entropy
Used for multi-class classification (digit 0–9, sentiment, language).
Formula

$$L = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$

Where $C$ is the number of classes and $y_c$ is 1 for the correct class, 0 otherwise.
Digit Recognition Example 🔢
| Class | Logit | Softmax | True |
|---|---|---|---|
| 0 | 1.2 | 0.11 | 0 |
| 1 | 0.5 | 0.06 | 0 |
| 2 | 3.1 | 0.76 | 1 ← correct |
| 3 | 0.8 | 0.08 | 0 |

Because the target is one-hot, only the correct class contributes:

$$L = -\log(0.76) \approx 0.28$$
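A softmax-plus-cross-entropy sketch for the logits above (pure Python, natural log; with these exact logits, softmax assigns ≈0.76 to class 2):

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(probs, true_class):
    """One-hot target: only the correct class's term survives the sum."""
    return -math.log(probs[true_class])

logits = [1.2, 0.5, 3.1, 0.8]   # digit-recognition example
probs = softmax(logits)         # ~[0.11, 0.06, 0.76, 0.08]
loss = categorical_cross_entropy(probs, true_class=2)   # ~0.28
```

Subtracting the max logit before exponentiating changes nothing mathematically but prevents overflow for large logits, a standard trick in real implementations.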
5. Mean Absolute Error (MAE)

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert\, y_i - \hat{y}_i \,\rvert$$
MSE vs MAE Comparison
| Property | MSE | MAE |
|---|---|---|
| Formula | $\frac{1}{n}\sum_{i}(y_i - \hat{y}_i)^2$ | $\frac{1}{n}\sum_{i}\lvert y_i - \hat{y}_i \rvert$ |
| Outlier sensitivity | High | Low |
| Gradient | Smooth everywhere | Kink at 0 |
| Best when | Outliers matter | Outliers are noise |
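The outlier-sensitivity row can be demonstrated directly (illustrative data with one bad point):

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

def mae(y_true, y_pred):
    """Mean absolute error."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

y_true = [1.0, 2.0, 3.0, 100.0]   # the last point is an outlier
y_pred = [1.0, 2.0, 3.0, 4.0]     # model fits the first three perfectly

# Squaring makes the single outlier dominate:
# MSE = 96^2 / 4 = 2304.0, while MAE = 96 / 4 = 24.0
```

One wrong point pushes MSE two orders of magnitude higher than MAE, which is exactly why MAE is preferred when outliers are noise rather than signal.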
6. The Loss Landscape
Visualized as a 2D curve:

```
Loss
│*                 *
│ *               *
│   *           *
│     *       *
│        * *   ← global minimum
└──────────────────── Weight value
```
The loss landscape can be:
- Convex (MSE) — one global minimum, easy to optimize
- Non-convex (deep networks) — many local minima, harder to optimize
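The convex case can be sketched by sweeping a single weight of a toy model $\hat{y} = w \cdot x$ and evaluating MSE at each point (data and grid are illustrative):

```python
def mse_at(w, xs, ys):
    """MSE of the one-parameter model y_hat = w * x."""
    n = len(xs)
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # generated with true weight w = 2

# Sweep w over [0.0, 4.0] in steps of 0.1 and find the lowest loss
losses = [(w / 10, mse_at(w / 10, xs, ys)) for w in range(0, 41)]
best_w, best_loss = min(losses, key=lambda t: t[1])
# best_w = 2.0, best_loss = 0.0 — one global minimum, as the convex bowl predicts
```

For a deep network the same sweep over millions of weights would trace a rugged, non-convex surface, which is why optimizers can stall in flat regions or local minima.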
7. Choosing the Right Loss Function

| Task | Loss |
|---|---|
| Regression, outliers matter | MSE |
| Regression, outliers are noise | MAE |
| Binary classification | Binary cross-entropy |
| Multi-class classification | Categorical cross-entropy |
8. Quick Reference

| Loss | Formula | Output activation | Use case |
|---|---|---|---|
| MSE | $\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ | Linear | Regression |
| MAE | $\frac{1}{n}\sum_i \lvert y_i - \hat{y}_i \rvert$ | Linear | Robust regression |
| Binary cross-entropy | $-[y\log\hat{y} + (1-y)\log(1-\hat{y})]$ | Sigmoid | Binary classification |
| Categorical cross-entropy | $-\sum_{c=1}^{C} y_c \log \hat{y}_c$ | Softmax | Multi-class classification |