The Definitive Guide to Machine Learning Loss Functions: From Theory to Implementation

A comprehensive guide to choosing a right loss function for the task and data

Machine LearningDeep LearningData Science

By Kuriko IWAI

Introduction

Choosing a right loss function is critical to build effective machine learning models because it defines the “goal” of the model by quantifying how far the model’s predictions are from the actual truth.

In other words, the choice of loss function directly dictates what the model is trying to optimize.

This guide explores major loss functions, detailing their mathematical formulations, conceptual graphs, and primary applications by task type.

1. Regression Problems

Regression problems involve predicting a continuous numerical output such as amount, height, weight.

The loss functions measure the difference between the predicted continuous value and the true continuous value.

◼ Mean Squared Error (MSE) / L2 Loss

MSE = \frac {1} {N} \sum_{i=1}^N (y_i - \hat y_i)^2

(N: Total number of samples, y_i: i_th true label in the sample set, y^_i: The corresponding prediction to the i_th true label)

Description: Calculates the average of the squared differences between predicted and actual values. It heavily penalizes large errors due to the squaring operation, making it sensitive to outliers.

When to use: Default choice for most regression tasks. Useful when large errors are particularly undesirable.

Kernel Labs | Kuriko IWAI | kuriko-iwai.com

Fig. MSE

◼ Mean Absolute Error (MAE) / L1 Loss

MAE = \frac {1} {N} \sum_{i=1}^N |y_i - \hat y_i |

(N: Total number of samples, y_i: i_th true label in the sample set, y^_i: The corresponding prediction to the i_th true label)

Description: Calculates the average of the absolute differences between predicted and actual values. It is less sensitive to outliers compared to MSE because it doesn't square the errors.

When to use: Need a more robust measure of errors when outliers are present in the dataset.

Kernel Labs | Kuriko IWAI | kuriko-iwai.com

Fig. MAE

◼ Huber Loss (Smooth Mean Absolute Error)

L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \le \delta \\ \delta|y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}

(δ: Transitioning point from quadratic to linear, y: A true label, y^: Prediction)

Description: Combines the best of MSE and MAE. It is quadratic for small errors and linear for large errors, with a hyperparameter (δ) defining the transition point. Because of this nature, Huber Loss is less sensitive to outliers than MSE while still being differentiable at zero (unlike MAE).

When to use: Need a loss function robust to outliers but also provides smooth gradients for gradient-based optimizers.

Kernel Labs | Kuriko IWAI | kuriko-iwai.com

Fig. Huber Loss

◼ Log-Cosh Loss

L(y, \hat{y}) = \sum_{i=1}^{N} \log(\cosh(\hat{y}_i - y_i))

(N: Total number of samples, y_i: i_th true label in the sample set, y^_i: The corresponding prediction to the i_th true label)

Description: A smoother version of MSE. It behaves much like MSE for small errors but is approximately linear for large errors, making it less sensitive to outliers than MSE.

When to use: Need a smooth, differentiable loss function that is more robust to outliers than MSE.

Kernel Labs | Kuriko IWAI | kuriko-iwai.com

Fig. Log-Cosh Loss

◼ Quantile Loss (Pinball Loss)

L_\tau(y, \hat{y}) = \sum_{i=1}^{N} \begin{cases} \tau(y_i - \hat{y}_i) & \quad \text{if } y_i > \hat{y}_i \\ (1-\tau)(\hat{y}_i - y_i) & \quad \text{if } y_i \leq \hat{y}_i \end{cases}

(τ: The quantile value (0 < τ < 1), N: The total number of samples, y_i: i_th true label in the sample set, y^_i: The corresponding prediction to the i_th true label)

Description: Used in quantile regression to predict specific quantiles (e.g., median, 25th percentile) of the target variable rather than just taking the mean. The quantile τ (tau) is a number between 0 and 1 (exclusively), that determines which quantile of the target distribution the model is being trained to predict.

This loss function is unique because it's asymmetric. Unlike MSE or MAE, Quantile Loss penalizes over- and under-predictions differently based on the value of the quantile.

When to use: Need to understand the uncertainty or distribution of predictions.

Kernel Labs | Kuriko IWAI | kuriko-iwai.com

Fig. Pinball Loss looking like pinball.

2. Classification Problems

Classification problems involve predicting a discrete category or class (e.g., is_spam, folder_to_classify) based on probabilities estimated by the model.

Loss functions are used to quantify the penalty for incorrect classification.

◼ Binary Cross-Entropy Loss (Log Loss)

L(y, \hat{p}) = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i)]

(N: Total number of samples, y_i: i_th true label in the sample set, y^_i: The corresponding prediction to the i_th true label, p^_i: Predicted probability that the i-th sample belongs to class 1)

Description: Used for binary classification tasks with two classes in the target sample set. It measures the dissimilarity between the predicted probability distribution and the true distribution, heavily penalizing confident but incorrect predictions.

When to use: A standard choice for binary classification problems, especially when the model outputs probabilities (e.g., logistic regression, neural networks with a sigmoid activation in the output layer).

Kernel Labs | Kuriko IWAI | kuriko-iwai.com

Fig. Binary Cross-Entropy Loss

◼ Sparse / Categorical Cross-Entropy Loss

L(y, \hat{p}) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log(\hat{p}_{ij})

(N: Total number of samples, C: Total number of the classes in the sample set, y_{ij}: True label for data point i and class j , p^_{ij}: Predicted probability that data point i belongs to class j)

Description: Used for multi-class classification problems where each sample belongs to exactly one category from multiple target classes.

▫ 1) Categorical Cross-Entropy Loss

Categorical Cross-Entropy Loss is used when the true labels are one-hot encoded (e.g., [0, 0, 1, 0] for Class 3).

The inner summation over j=1 to C calculates the loss for a single data point by summing the contributions from all possible classes. Because y_{ij} is one-hot encoded, it effectively cancels out incorrect probabilities (Class 1, 2, and 4 in the example) by multiplying zero.

When to use: True labels are one-hot encoded. To penalize the model for assigning a low probability to the correct class, and force it to increase that probability during training.

▫ 2) Sparse Categorical Cross-Entropy Loss

Functionally similar to Categorical Cross-Entropy but used when the true labels are integers (e.g., 0, 1, 2, 3) instead of one-hot encoded vectors.

It internally converts the integer labels to one-hot for calculation.

When to use: Similar to Categorical Cross-Entropy Loss, but saving memory and computational overhead on one-hot encoding process.

Kernel Labs | Kuriko IWAI | kuriko-iwai.com

Fig. Categorical Cross-Entropy Loss

◼ Hinge Loss

L(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i \cdot \hat{y}_i)

(N: Total number of samples, y_i: i_th true label in the sample set, y^_i: The corresponding prediction to the i_th true label)

Description: Primarily used for "maximum margin" binary classification problems, most famously with Support Vector Machines (SVMs). It penalizes predictions that are on the wrong side of the decision boundary or too close to it. The true labels are typically -1 or 1.

When to use: For SVMs or similar maximum-margin classifiers.

Kernel Labs | Kuriko IWAI | kuriko-iwai.com

◼ Squared Hinge Loss

L(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i \cdot \hat{y}_i)^2

(N: Total number of samples, y_i: i_th true label in the sample set, y^_i: The corresponding prediction to the i_th true label)

Description: A variant of hinge loss that provides a smoother error surface, making it easier for gradient optimizers to converge the model.

When to use: Similar to Hinge Loss, but when a smoother loss function is desired for training.

Kernel Labs | Kuriko IWAI | kuriko-iwai.com

◼ Focal Loss

L(y, \hat{p}) = -\alpha_t (1-p_t)^\gamma \log(p_t)

where:

α_t: The class-balancing factor (α_t if y = 1, 1 - α_t if y = 0. 0 < α_t < 1),
p_t: The predicted probability for the true class (p if y = 1, 1 - p if y = 0),
−log(p_t): The standard cross-entropy loss, and
(1 − p_t)^γ: The modulating factor with a focusing parameter γ≥0

Description: Addresses the issue of class imbalance in classification tasks by down-weighting the loss assigned to well-classified examples and focusing on hard, misclassified examples.

For a well-classified (“easy”) example, p_t is close to 1, making the modulating factor close to 0, which effectively down-weights the loss for this example.

For a misclassified (“hard”) example, p_t is small, making the modulating factor close to 1, leaving the loss essentially unchanged.

The focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted.

A higher γ makes the loss function more aggressively focus on hard examples, while a lower γ makes the modulating factor ignorable, making the loss function revert back to standard cross-entropy.

When to use: In object detection and other classification tasks where there's a significant imbalance between foreground and background classes.

Kernel Labs | Kuriko IWAI | kuriko-iwai.com

Fig. Focal loss

3. Ranking & Metric Learning Losses

Ranking & Metric Learning Losses are loss functions used when the goal is not just to predict a specific value or class, but to learn relative ordering or to embed data points in a space where distances have meaning (e.g., similar items are closer).

◼ Pairwise Ranking Loss

L(x) = -log(\sigma(\text{score_positive} - \text{score_negative}))

(σ: The sigmoid function)

Description: Designed for ranking tasks where the goal is to correctly order items. It encourages a higher score for positive (preferred) items than for negative (not preferred) items.

Bayesian Personalized Ranking (BPR) Loss is a common loss function for implicit feedback. It maximizes the difference between the scores of a positive item and a randomly sampled negative item for a given user.

When to use: When the goal is to provide a ranked list of recommendations, rather than just predicting a specific rating or interaction probability. Use cases include implicit feedback recommendation systems, information retrieval, learning to rank. The goal is to ensure a preferred item gets a higher score than a non-preferred one for a given user/query.

Kernel Labs | Kuriko IWAI | kuriko-iwai.com

Fig. BPR loss

◼ Triplet Loss

L(a,p,n)=\max(0,\|f(a)-f(p)\|^2 -\|f(a)-f(n)\|^2 +\alpha)

where:

a (anchor): A reference data point,
p (positive): A data point from the same class or category as the anchor,
n (negative): A data point from a different class than the anchor,
∥f(a)−f(p)∥^2: The squared Euclidean distance between the embeddings of two data points, a and p (or n), and
α: A margin hyperparameter.

Description: Used to learn embeddings such that an anchor (a) is closer to a positive (p) example than to a negative (n) example by at least a margin α. It aims to make an anchor closer to a positive example than to a negative example.

When to use: In recommendation systems for learning user and item embeddings where the relative distance between items is important.

Common applications include metric learning, face recognition, recommendation systems (for learning user/item embeddings), and anomaly detection.

Kernel Labs | Kuriko IWAI | kuriko-iwai.com

Fig. Triplet loss

4. Distributional & Generative Models

Generative models generate a probability distribution instead of a single value. Unique loss functions include KL Divergence and W/GAN losses.

◼ Kullback-Leibler (KL) Divergence / Relative Entropy

D_{KL}(P||Q) = \sum_{i} P(i)\log\left(\frac{P(i)}{Q(i)}\right)

where:

P: True probability distribution,
Q: Learned probability distribution by the model, and
i: i-th outcome.

Description: KL divergence measures the difference - how dissimilar two distributions are - using conceptual measures of information lost from P to Q:

D_{KL}(P∣∣Q) = 0: The two distributions are identical.
D_{KL}(P∣∣Q) > 0: The two distributions are different. A larger value indicates a greater difference between them.

The term log(Q(i) / P(i)) measures the difference in information content for each outcome i.

If Q(i) is a good approximation of P(i), the ratio is close to 1, and the log is close to 0, contributing very little to the total divergence, and vise versa.

When to use: Training probabilistic models. Common use cases are a regularization term for Variational Auto Encoders (VAEs).

Kernel Labs | Kuriko IWAI | kuriko-iwai.com

◼ Adversarial Loss (GAN Loss)

L(G,D)=E_{x∼p_{data(x)}}[logD(x)]+E_{z∼p_z(z)}[log(1−D(G(z)))]

where:

E: The expected value (average) over a distribution,
x∼p_{data}(x): The real data points x sampled from the true data distribution p_{data},
z∼p_z(z): The random noise vectors z sampled from a prior distribution (e.g., a normal distribution), p_z,
D(x): The discriminator's output for a real data point x,
G(z): The generator's output (a fake data point) when given noise z, and
D(G(z)): The discriminator's output for a fake data point generated by G.

Description: The minimax objective function for a Generative Adversarial Network (GAN), describing the adversarial game between two competing neural networks:

Discriminator (D): The discriminator is a classifier that tries to maximize this function by correctly identifying real data. It wants D(x) (the probability that real data x is real) to be close to 1, and D(G(z)) (the probability that fake data G(z) is real) to be close to 0.
Generator (G): The generator is a creative network that tries to minimize this function. It wants to create fake data G(z) that is so convincing it fools the discriminator, making D(G(z)) close to 1.

The loss is based on binary cross-entropy, which penalizes the networks for incorrect predictions.

The ultimate goal of this minimax game is to find a generator that can produce realistic data that the discriminator cannot distinguish from the real data.

When to use: For training GANs to generate realistic data like images or texts.

Kernel Labs | Kuriko IWAI | kuriko-iwai.com

Fig. GAN loss

◼ Wasserstein Distance / Earth Mover's Distance (WGAN Loss)

W_p(P,Q)=(\text{inf}{\gamma∈Π(P,Q)} E {(x,y)∼γ}[∥x−y∥ ^p]) ^{\frac {1}{p}}

where:

Π(P,Q): A set of all possible joint distributions γ(x, y) whose marginals are P and Q,
E_{(x,y)∼γ}[||x−y||^p]: The expected value of the distance between points x and y, representing the "cost" of a specific transport plan, and
inf: The infimum is the greatest lower bound.

Description: Measures the "cost" of transforming one probability distribution into another. Used as an alternative to adversarial loss in WGANs for more stable training and better convergence.

When to use: For training Wasserstein GANs (WGANs) to address training instability and mode collapse issues in standard GANs.

Kernel Labs | Kuriko IWAI | kuriko-iwai.com

Fig. Wasserstein distance

◼ Reconstruction Loss

Description: Not a unique loss function itself, but the application of existing losses (like MSE for continuous data or Binary Cross-Entropy for discrete data) within the VAE framework.

When to use: As part of the ELBO (Evidence Lower Bound) objective in VAEs, along with KL Divergence.

Common Applications: Part of the Evidence Lower Bound (ELBO) objective in VAEs, measuring how accurately the decoder reconstructs the input data from the latent representation.

5. Reinforcement Learning Specific Objectives

In reinforcement learning, loss functions indicate the objective functions that an agent optimizes to maximize cumulative reward. These are more about policy/value approximation errors or differences in expected returns.

◼ Mean Squared Error (for Value Function Approximation)

Description: An application of MSE to train neural networks (the "critic" in Actor-Critic methods, or Q-networks in DQN) to estimate the value of states or state-action pairs. The "true" value is typically a bootstrapped target.
Common Applications: Value-based RL methods (e.g., Q-learning, DQN), the critic component of Actor-Critic methods.

◼ Policy Gradient Objectives (often derived from Cross-Entropy principles)

Description: While often involving cross-entropy terms, the full objective in policy gradient methods like REINFORCE or Actor-Critic is more complex. It typically involves maximizing the expected return, often weighted by an advantage function. The "loss" would be the negative of this objective.
Common Applications: Policy-based RL methods (e.g., REINFORCE), the actor component of Actor-Critic methods.

Kernel Labs | Kuriko IWAI | kuriko-iwai.com

◼ Actor-Critic Losses

Description: The overall objective for Actor-Critic methods, which typically sums the actor's policy loss (e.g., a policy gradient objective) and the critic's value loss (e.g., MSE).
Common Applications: A broad class of RL algorithms that combine the strengths of value-based and policy-based approaches.

Continue Your Learning

If you enjoyed this blog, these related entries will complete the picture:

Scaling Generalization: Automating Flexible AI with Meta-Learning and NAS
Learn how to combine Meta-Learning (MAML, Reptile) with Neural Architecture Search (NAS) to build AI models that generalize and adapt to new tasks with minimal data.
A Comparative Guide to Hyperparameter Optimization Strategies
A technical deep dive comparing 5 hyperparameter optimization methods using CNN and SVM simulations. Learn which strategy balances cost and performance.
Optimizing LSTMs with Hyperband: A Comparative Guide to Bandit-Based Tuning
Master Hyperband for ML optimization. A deep dive into successive halving mechanics, PyTorch LSTM implementation for stock prediction, and performance benchmarks against Bayesian Optimization, GA, and Random Search.
Automating Deep Learning: A Guide to Neural Architecture Search (NAS) Strategies
Stop manual tuning. Learn how Neural Architecture Search (NAS) automates deep learning design. Explore RL, Evolutionary Algorithms, and Gradient-based strategies with Python examples.

Related Books for Further Understanding

These books cover the wide range of theories and practices; from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Share What You Learned

Kuriko IWAI, "The Definitive Guide to Machine Learning Loss Functions: From Theory to Implementation" in Kernel Labs

https://kuriko-iwai.com/loss-functions-in-machine-learning

Looking for Solutions?

Deploying ML Systems 👉 Book a briefing session
Hiring an ML Engineer 👉 Drop an email
Learn by Doing 👉 Enroll AI Engineering Masterclass

The Definitive Guide to Machine Learning Loss Functions: From Theory to Implementation

A comprehensive guide to choosing a right loss function for the task and data

Table of Contents

Introduction

1. Regression Problems

◼ Mean Squared Error (MSE) / L2 Loss

◼ Mean Absolute Error (MAE) / L1 Loss

◼ Huber Loss (Smooth Mean Absolute Error)

◼ Log-Cosh Loss

◼ Quantile Loss (Pinball Loss)

2. Classification Problems

◼ Binary Cross-Entropy Loss (Log Loss)

◼ Sparse / Categorical Cross-Entropy Loss

▫ 1) Categorical Cross-Entropy Loss

▫ 2) Sparse Categorical Cross-Entropy Loss

◼ Hinge Loss

◼ Squared Hinge Loss

◼ Focal Loss

3. Ranking & Metric Learning Losses

◼ Pairwise Ranking Loss

◼ Triplet Loss

4. Distributional & Generative Models

◼ Kullback-Leibler (KL) Divergence / Relative Entropy

◼ Adversarial Loss (GAN Loss)

◼ Wasserstein Distance / Earth Mover's Distance (WGAN Loss)

◼ Reconstruction Loss

5. Reinforcement Learning Specific Objectives

◼ Mean Squared Error (for Value Function Approximation)

◼ Policy Gradient Objectives (often derived from Cross-Entropy principles)

◼ Actor-Critic Losses

Continue Your Learning

Scaling Generalization: Automating Flexible AI with Meta-Learning and NAS

A Comparative Guide to Hyperparameter Optimization Strategies

Optimizing LSTMs with Hyperband: A Comparative Guide to Bandit-Based Tuning

Automating Deep Learning: A Guide to Neural Architecture Search (NAS) Strategies

Related Books for Further Understanding

Share What You Learned

Looking for Solutions?