Loss Functions
A loss function measures how wrong your model is. Training is the process of making it less wrong, one gradient at a time. Choosing the right loss function is not optional — using MSE for a classification problem is like using a ruler to measure temperature.
Conventions
All loss functions share the same signature pattern:
// Forward
bool tensor_<name>_loss(Tensor *out,
const Tensor *pred, // model output
const Tensor *target, // ground truth
Reduction reduction);
// Backward
bool tensor_<name>_loss_grad(Tensor *out, // gradient w.r.t. pred
const Tensor *pred,
const Tensor *target,
const Tensor *grad, // upstream gradient
Reduction reduction);
Reduction modes
enum Reduction {
REDUCTION_NONE, // Output has same shape as inputs
REDUCTION_MEAN, // Output is a scalar: mean of all losses
REDUCTION_SUM, // Output is a scalar: sum of all losses
};
REDUCTION_MEAN is the most common choice — it normalises by the number of elements so the loss scale doesn't depend on batch size.
For the high-level nn API, the reduction is fixed at REDUCTION_MEAN (controlled by the LossType passed to Model::compile()).
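To make the three reduction modes concrete, here is a minimal sketch over a plain float array. It does not use the library's Tensor type; `SketchReduction` and `reduce_losses` are hypothetical names introduced only for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for the library's Reduction enum. */
typedef enum { SKETCH_NONE, SKETCH_MEAN, SKETCH_SUM } SketchReduction;

/* Reduces n per-element losses.
 * SKETCH_NONE: copies the losses to out (out must hold n floats), returns n.
 * SKETCH_MEAN / SKETCH_SUM: writes one scalar to out[0], returns 1. */
size_t reduce_losses(float *out, const float *losses, size_t n,
                     SketchReduction mode) {
    if (mode == SKETCH_NONE) {
        for (size_t i = 0; i < n; ++i) out[i] = losses[i];
        return n;
    }
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) sum += losses[i];
    out[0] = (mode == SKETCH_MEAN && n > 0) ? sum / (float)n : sum;
    return 1;
}
```

With the sum mode, doubling the batch roughly doubles the loss value; with the mean mode it stays on the same scale, which is why a learning rate tuned under one mode usually needs retuning under the other.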
MSE Loss (Mean Squared Error)
L = (1/N) * Σ (pred_i - target_i)²
∂L/∂pred_i = (2/N) * (pred_i - target_i)
bool tensor_mse_loss(Tensor *out, const Tensor *pred, const Tensor *target, Reduction);
bool tensor_mse_loss_grad(Tensor *out, const Tensor *pred, const Tensor *target, const Tensor *grad, Reduction);
Use for: Regression tasks where outliers should be penalised heavily (because the squared term amplifies large errors).
Avoid when: Your data has outliers that genuinely shouldn't dominate — use Huber instead.
nn class: MSELoss
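The formulas above can be sketched on plain float arrays as follows. This is a reference sketch assuming mean reduction, not the library's implementation; `mse_loss` and `mse_loss_grad` are illustrative names, and the upstream gradient is taken to be a scalar.

```c
#include <stddef.h>

/* Mean-reduced MSE over n elements: (1/N) * sum((pred - target)^2). */
float mse_loss(const float *pred, const float *target, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float d = pred[i] - target[i];
        sum += d * d;
    }
    return n > 0 ? sum / (float)n : 0.0f;
}

/* Gradient w.r.t. pred: (2/N) * (pred - target), scaled by the scalar
 * upstream gradient. out must hold n floats. */
void mse_loss_grad(float *out, const float *pred, const float *target,
                   size_t n, float upstream) {
    for (size_t i = 0; i < n; ++i)
        out[i] = upstream * 2.0f * (pred[i] - target[i]) / (float)n;
}
```

An error of 2 contributes 4 to the sum while an error of 0.5 contributes 0.25, which is the amplification of large errors mentioned above.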
L1 Loss (Mean Absolute Error)
L = (1/N) * Σ |pred_i - target_i|
∂L/∂pred_i = (1/N) * sign(pred_i - target_i)
bool tensor_l1_loss(Tensor *out, const Tensor *pred, const Tensor *target, Reduction);
bool tensor_l1_loss_grad(...);
Use for: Regression with outliers. The linear penalty is less sensitive to large errors than MSE.
Gotcha: The gradient is discontinuous at zero (a subgradient of 0 is used there). This rarely causes problems in practice.
nn class: L1Loss, also MAELoss (which is identical — two names, same implementation).
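A minimal sketch of the L1 forward and backward passes on plain float arrays, assuming mean reduction and a scalar upstream gradient. The function names are illustrative; the only detail taken from the text above is that sign(0) is treated as 0.

```c
#include <math.h>
#include <stddef.h>

/* Mean-reduced L1 loss: (1/N) * sum |pred - target|. */
float l1_loss(const float *pred, const float *target, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) sum += fabsf(pred[i] - target[i]);
    return n > 0 ? sum / (float)n : 0.0f;
}

/* sign(d) with sign(0) = 0, the subgradient choice described above. */
static float sign0(float d) { return (float)((d > 0.0f) - (d < 0.0f)); }

/* Gradient w.r.t. pred: sign(pred - target) / N, times upstream. */
void l1_loss_grad(float *out, const float *pred, const float *target,
                  size_t n, float upstream) {
    for (size_t i = 0; i < n; ++i)
        out[i] = upstream * sign0(pred[i] - target[i]) / (float)n;
}
```

Note that the gradient magnitude is the same for an error of 0.01 and an error of 100; only the sign carries information.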
Huber Loss
L_i = 0.5 * d² if |d| ≤ δ
L_i = δ * (|d| - 0.5 * δ) otherwise
where d = pred_i - target_i
∂L_i/∂pred_i = d if |d| ≤ δ
= δ * sign(d) otherwise
(per element; REDUCTION_MEAN scales this by 1/N)
bool tensor_huber_loss(Tensor *out, const Tensor *pred, const Tensor *target,
float delta, Reduction);
bool tensor_huber_loss_grad(..., float delta, Reduction);
Use for: Regression with outliers. Huber is the best of both worlds: quadratic (MSE) for small errors, linear (L1) for large ones. The delta parameter controls the transition point.
The California Housing tutorial uses delta = 1.0 (the default) with house prices scaled to the [0, 5] range.
nn class: HuberLoss(float delta = 1.0f), exposed as LossType::HUBER in the enum.
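The piecewise definition is easiest to check per element. The sketch below assumes nothing about the library beyond the formulas above; `huber_elem` and `huber_elem_grad` are hypothetical helper names.

```c
#include <math.h>

/* Per-element Huber loss for error d = pred - target and threshold delta. */
float huber_elem(float d, float delta) {
    float a = fabsf(d);
    return a <= delta ? 0.5f * d * d : delta * (a - 0.5f * delta);
}

/* Per-element derivative: d inside the quadratic zone, +/-delta outside. */
float huber_elem_grad(float d, float delta) {
    if (fabsf(d) <= delta) return d;
    return d > 0.0f ? delta : -delta;
}
```

With delta = 1.0, an error of 0.5 costs 0.125 (quadratic branch) and an error of 3 costs 2.5 (linear branch). Both branches give 0.5 at |d| = 1, so the loss is continuous at the transition point.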
BCE Loss (Binary Cross-Entropy)
L = -(1/N) * Σ [t * log(p) + (1-t) * log(1-p)]
∂L/∂p = (1/N) * (p - t) / (p * (1 - p))
bool tensor_bce_loss(Tensor *out, const Tensor *pred, const Tensor *target, Reduction);
bool tensor_bce_loss_grad(...);
Inputs: pred must be in (0, 1) — pass sigmoid outputs, not logits. A small epsilon clamps values away from 0 and 1 to avoid log(0) = -inf.
Use for: Binary classification where the output layer applies sigmoid.
nn class: BCELoss
BCE with Logits Loss
L = log(1 + exp(x)) - x*y (naive form; exp(x) overflows for large x)
= max(x, 0) - x*y + log(1 + exp(-|x|)) (numerically stable form)
where x = logit, y = target
bool tensor_bce_with_logits_loss(Tensor *out, const Tensor *logits, const Tensor *target, Reduction);
bool tensor_bce_with_logits_loss_grad(...);
Inputs: logits are raw (unbounded) model outputs — not after sigmoid.
Advantage over BCE: Numerically more stable. For a large positive logit, evaluating exp(x) directly overflows; rewriting the loss so that only exp(-|x|) is evaluated avoids this, because exp(-|x|) never exceeds 1.
∂L/∂x = σ(x) - y
Use for: Binary classification with a linear output layer (no sigmoid).
nn class: BCEWithLogitsLoss
Cross-Entropy Loss
L = -(1/N) * Σ_batch Σ_class target * log(softmax(logits))
Numerically:
log_sum_exp = max(logits) + log(Σ exp(logits - max(logits)))
L_i = Σ_c target_c * (log_sum_exp - logits_c)
∂L/∂logits_c = (1/N) * (softmax(logits)_c - target_c)
bool tensor_cross_entropy_loss(Tensor *out, const Tensor *logits, const Tensor *target, Reduction);
bool tensor_cross_entropy_loss_grad(...);
Inputs: logits are raw scores (before softmax). target is a one-hot encoded distribution. The function applies log-softmax internally.
This is the loss used in the MNIST tutorial. It:
- Subtracts max(logits) for numerical stability.
- Computes log-sum-exp.
- Computes cross-entropy against the target distribution.
Use for: Multi-class classification.
tensor_cross_entropy_loss already applies softmax. Do not add an nn::Softmax layer to your model when using this loss.
nn class: CrossEntropyLoss, exposed as LossType::CROSS_ENTROPY.
NLL Loss (Negative Log-Likelihood)
L = -(1/N) * Σ_batch Σ_class target * log_probs
∂L/∂log_probs = -(1/N) * target
bool tensor_nll_loss(Tensor *out, const Tensor *log_probs, const Tensor *target, Reduction);
bool tensor_nll_loss_grad(...);
Inputs: log_probs must be log-probabilities (the output of log(softmax(x))). CrossEntropyLoss is essentially the composition NLLLoss(LogSoftmax(logits), target).
Use for: When you want explicit control over the log-softmax step, or when working with models that output log-probabilities directly.
nn class: NLLLoss
KL Divergence Loss
L = Σ target * (log(target) - pred)
= Σ target * log(target/exp(pred))
bool tensor_kl_div_loss(Tensor *out, const Tensor *pred, const Tensor *target, Reduction);
bool tensor_kl_div_loss_grad(...);
Inputs: pred is expected to be log-probabilities (i.e. after log_softmax). target is a probability distribution.
Use for: Knowledge distillation, variational autoencoders, any task where you're measuring the divergence between two distributions.
Gradient: ∂L/∂pred_i = -target_i / N under REDUCTION_MEAN, which is zero wherever target_i = 0.
nn class: KLDivLoss
Hinge Loss
L = (1/N) * Σ max(0, 1 - pred * target)
∂L/∂pred = -target / N if 1 - pred * target > 0
= 0 otherwise
bool tensor_hinge_loss(Tensor *out, const Tensor *pred, const Tensor *target, Reduction);
bool tensor_hinge_loss_grad(...);
Inputs: pred is the model output (can be any real number). target is ±1 (not 0/1).
Use for: Support Vector Machines (SVMs) and binary classification tasks. The "margin" in "maximum margin classifier" comes from this loss.
nn class: HingeLoss
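A sketch of the hinge loss and its gradient on plain float arrays, assuming mean reduction and a scalar upstream gradient. The function names are illustrative; `to_pm1` is a hypothetical helper for converting 0/1 labels to the ±1 encoding this loss expects.

```c
#include <stddef.h>

/* Maps a 0/1 label to the -1/+1 encoding hinge loss expects. */
float to_pm1(float label01) { return 2.0f * label01 - 1.0f; }

/* Mean-reduced hinge loss: (1/N) * sum max(0, 1 - pred * target). */
float hinge_loss(const float *pred, const float *target, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float m = 1.0f - pred[i] * target[i];
        if (m > 0.0f) sum += m;
    }
    return n > 0 ? sum / (float)n : 0.0f;
}

/* Gradient w.r.t. pred: -target / N where the margin is violated, else 0. */
void hinge_loss_grad(float *out, const float *pred, const float *target,
                     size_t n, float upstream) {
    for (size_t i = 0; i < n; ++i) {
        float m = 1.0f - pred[i] * target[i];
        out[i] = m > 0.0f ? upstream * (-target[i]) / (float)n : 0.0f;
    }
}
```

A prediction of 2 for a +1 target is beyond the margin and contributes nothing, while a correct but unconfident prediction of 0.5 still contributes 0.5.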
L2 Loss (Weight Regularization)
L = (1/N) * Σ 0.5 * w_i²
∂L/∂w_i = w_i / N
bool tensor_l2_loss(Tensor *out, const Tensor *weights, Reduction);
bool tensor_l2_loss_grad(Tensor *out, const Tensor *weights, const Tensor *grad, Reduction);
L2 regularisation on model weights. Note: in the nn API, weight regularisation is handled by the AdamW optimizer's weight_decay parameter (decoupled from the gradient), which is generally preferable to adding it as a loss term.
nn class: L2Loss
L1 Regularization
L = (1/N) * Σ |w_i|
∂L/∂w_i = sign(w_i) / N
bool tensor_l1_regularization(Tensor *out, const Tensor *weights, Reduction);
bool tensor_l1_regularization_grad(...);
L1 regularisation induces sparsity — it pushes weights toward exactly zero, unlike L2 which only makes them small. Useful when you want the model to actively select features.
nn class: L2Loss (note: L1 regularisation is also exposed via LossFunction but is not in the LossType enum — use directly).
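The sparsity claim follows from the shape of the two gradients, which the sketch below puts side by side. It works on plain float arrays with mean reduction; all three function names are illustrative and not part of the library.

```c
#include <math.h>
#include <stddef.h>

/* L1 penalty: (1/N) * sum |w_i|. */
float l1_penalty(const float *w, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) sum += fabsf(w[i]);
    return n > 0 ? sum / (float)n : 0.0f;
}

/* Gradient of the L1 penalty: sign(w_i) / N, a constant-size push
 * toward zero regardless of how large the weight is. */
void l1_penalty_grad(float *out, const float *w, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        float s = (float)((w[i] > 0.0f) - (w[i] < 0.0f)); /* 0 at w = 0 */
        out[i] = s / (float)n;
    }
}

/* Gradient of the L2 penalty (1/N) * sum 0.5 * w_i^2: w_i / N, a push
 * that shrinks as the weight approaches zero. */
void l2_penalty_grad(float *out, const float *w, size_t n) {
    for (size_t i = 0; i < n; ++i) out[i] = w[i] / (float)n;
}
```

For weights {0.01, -2, 0, 4}, the L1 gradient is {0.25, -0.25, 0, 0.25} while the L2 gradient is {0.0025, -0.5, 0, 1}. The tiny weight 0.01 is pushed just as hard as the large ones under L1, so it reaches zero quickly; under L2 the push on it almost vanishes.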
Cosine Embedding Loss
L = 1 - cosine_similarity(x1, x2) if y = 1
L = max(0, cosine_similarity(x1, x2) - margin) if y = -1
bool tensor_cosine_embedding_loss(Tensor *out, const Tensor *x1, const Tensor *x2,
const Tensor *target, float margin, Reduction);
Use for: Learning embeddings where similar pairs (y = 1) should point in the same direction (cosine similarity ≈ 1, so the loss ≈ 0) and dissimilar pairs (y = -1) should have a cosine similarity at or below margin, at which point they contribute no loss.
nn class: CosineEmbeddingLoss (use forward_triplet(x1, x2, target) — the standard forward is a placeholder that warns).
Triplet Loss
L = max(0, dist(anchor, positive) - dist(anchor, negative) + margin)
Where dist is Euclidean distance.
bool tensor_triplet_loss(Tensor *out, const Tensor *anchor, const Tensor *positive,
const Tensor *negative, float margin, Reduction);
Use for: Metric learning — learning an embedding space where the same class clusters together and different classes are separated. Each training example is a triplet: an anchor, a positive (same class), and a negative (different class).
nn class: TripletLoss (use forward_triplet(anchor, positive, negative)).
Choosing a Loss Function
| Task | Recommended Loss |
|---|---|
| Binary classification | BCEWithLogitsLoss (output: linear) or BCELoss (output: sigmoid) |
| Multi-class classification | CrossEntropyLoss |
| Regression, no outliers | MSELoss |
| Regression, outliers present | HuberLoss |
| Regression, heavy outliers | L1Loss |
| Embedding / metric learning | TripletLoss or CosineEmbeddingLoss |
| Knowledge distillation | KLDivLoss |
| SVM-style classification | HingeLoss |