Loss Operations

The autograd loss functions are differentiable wrappers around the tensor-level loss functions. Each one computes the scalar loss in the forward pass, saves the tensors needed for the gradient, and registers a backward_fn that computes the gradient of the loss with respect to the model's predictions and accumulates it into the prediction's grad tensor.

Header: include/autograd/autograd.hpp
Source: src/autograd/ops/loss/

Target variables never require gradients

All loss functions differentiate with respect to the prediction only, never the target. The target is wrapped as a create_leaf(..., requires_grad=false) and its pointer is saved for the backward computation, but no gradient is accumulated into it. This matches the standard ML convention — you update your model's predictions to match the target, not the other way around.

Output shape

Every loss function produces a scalar output: a 1-element tensor with shape [1]. The requires_grad of the output is inherited from the prediction input. When autograd::backward is called on the loss, it seeds the loss gradient to 1.0, and the chain rule propagates it back toward the prediction.


Common Structure

Every loss op follows this pattern:

Variable *mse_loss(Arena *arena, Variable *pred, Variable *target, Reduction reduction) {
    // 1. Create scalar output, compute forward value
    uint32_t scalar_shape[1] = {1};
    Tensor *out_data = tensor_create_zeros(arena, 1, scalar_shape);
    tensor_mse_loss(out_data, pred->data, target->data, reduction);

    // 2. Create output Variable
    Variable *out = arena->push<Variable>();
    out->data = out_data;
    out->requires_grad = pred->requires_grad;
    out->is_leaf = false;
    out->reduction = reduction; // stored for backward

    if (out->requires_grad) {
        // 3. Allocate grad, wire parent (pred only), save pred + target
        out->grad = tensor_create_zeros(arena, 1, scalar_shape);
        out->num_parents = 1;
        out->parents = arena->push_array<Edge>(1);
        out->parents[0] = {pred};
        out->num_saved = 2;
        out->saved_tensors = arena->push_array<Tensor *>(2);
        out->saved_tensors[0] = pred->data;
        out->saved_tensors[1] = target->data;

        // 4. Backward function: compute the local gradient into a temporary
        //    tensor, then accumulate it into the prediction's grad
        out->backward_fn = [](Variable *self, Arena *temp_arena) {
            Variable *parent = self->parents[0].node;
            if (!parent->requires_grad) return;
            Tensor *local_grad = tensor_create_zeros(
                temp_arena, parent->grad->ndims, parent->grad->shape);
            tensor_mse_loss_grad(local_grad,
                                 self->saved_tensors[0],
                                 self->saved_tensors[1],
                                 self->grad,
                                 static_cast<Reduction>(self->reduction));
            tensor_add(parent->grad, parent->grad, local_grad);
        };
    }
    return out;
}

The reduction mode is stored as a uint32_t in out->reduction (cast from the Reduction enum) so the backward function can recover it via static_cast<Reduction>(self->reduction).


Standard Two-Input Loss Functions

These all take (Arena*, Variable* pred, Variable* target, Reduction reduction) and differentiate with respect to pred.

mse_loss

Variable *mse_loss(Arena *arena, Variable *pred, Variable *target,
Reduction reduction);

L = (1/N) * Σ (pred_i - target_i)²

Gradient: ∂L/∂pred_i = (2/N) * (pred_i - target_i)

Saves: pred->data, target->data.

auto *loss = autograd::mse_loss(graph_arena, pred, target, REDUCTION_MEAN);
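The forward and gradient formulas above can be checked outside the library. The following is a minimal standalone sketch, assuming mean reduction and using std::vector<float> in place of Tensor. The names mse_forward and mse_grad are hypothetical helpers for illustration, not part of the autograd API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration; not the library's tensor_mse_loss.

// Mean-reduced MSE: L = (1/N) * sum_i (pred_i - target_i)^2
float mse_forward(const std::vector<float> &pred,
                  const std::vector<float> &target) {
    assert(!pred.empty() && pred.size() == target.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < pred.size(); ++i) {
        const float d = pred[i] - target[i];
        sum += d * d;
    }
    return sum / static_cast<float>(pred.size());
}

// dL/dpred_i = upstream * (2/N) * (pred_i - target_i).
// `upstream` plays the role of the loss's own grad (seeded to 1.0).
std::vector<float> mse_grad(const std::vector<float> &pred,
                            const std::vector<float> &target,
                            float upstream) {
    assert(!pred.empty() && pred.size() == target.size());
    const float scale = upstream * 2.0f / static_cast<float>(pred.size());
    std::vector<float> grad(pred.size());
    for (std::size_t i = 0; i < pred.size(); ++i) {
        grad[i] = scale * (pred[i] - target[i]);
    }
    return grad;
}
```

For pred = {1, 2, 3} and target = {1, 1, 1}, the squared errors are 0, 1, and 4, so the loss is 5/3 and the gradient is (2/3) * {0, 1, 2}.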

l1_loss

Variable *l1_loss(Arena *arena, Variable *pred, Variable *target,
Reduction reduction);

L = (1/N) * Σ |pred_i - target_i|

Gradient: ∂L/∂pred_i = (1/N) * sign(pred_i - target_i)

Gradient is zero where pred_i == target_i (a subgradient of 0 is chosen).
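The subgradient choice can be made concrete with a small standalone sketch. It assumes mean reduction and plain std::vector<float> inputs; sign_or_zero and l1_grad are hypothetical names, not library functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns -1, 0, or +1; the 0 case is the subgradient chosen at d == 0.
float sign_or_zero(float d) {
    if (d > 0.0f) return 1.0f;
    if (d < 0.0f) return -1.0f;
    return 0.0f;
}

// dL/dpred_i = (1/N) * sign(pred_i - target_i)
std::vector<float> l1_grad(const std::vector<float> &pred,
                           const std::vector<float> &target) {
    assert(!pred.empty() && pred.size() == target.size());
    const float inv_n = 1.0f / static_cast<float>(pred.size());
    std::vector<float> grad(pred.size());
    for (std::size_t i = 0; i < pred.size(); ++i) {
        grad[i] = inv_n * sign_or_zero(pred[i] - target[i]);
    }
    return grad;
}
```

Unlike MSE, the gradient magnitude does not grow with the error, which is what makes L1 less sensitive to outliers.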


huber_loss

Variable *huber_loss(Arena *arena, Variable *pred, Variable *target,
float delta, Reduction reduction);

With d = pred_i - target_i:

L_i = 0.5 * d²                       if |d| ≤ delta
L_i = delta * (|d| - 0.5 * delta)    otherwise

delta is stored in out->metadata_float and passed to the backward.

Gradient:

∂L/∂pred_i = (1/N) * d                    if |d| ≤ delta
∂L/∂pred_i = (1/N) * delta * sign(d)      otherwise
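The two branches of the Huber loss and its gradient can be sketched as follows. This is a standalone illustration under mean reduction, with std::vector<float> standing in for Tensor; huber_forward and huber_grad are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean-reduced Huber loss: quadratic inside |d| <= delta, linear outside.
float huber_forward(const std::vector<float> &pred,
                    const std::vector<float> &target, float delta) {
    assert(!pred.empty() && pred.size() == target.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < pred.size(); ++i) {
        const float d = pred[i] - target[i];
        const float ad = std::fabs(d);
        sum += (ad <= delta) ? 0.5f * d * d : delta * (ad - 0.5f * delta);
    }
    return sum / static_cast<float>(pred.size());
}

// Gradient: (1/N) * d in the quadratic region,
// (1/N) * delta * sign(d) in the linear region.
std::vector<float> huber_grad(const std::vector<float> &pred,
                              const std::vector<float> &target, float delta) {
    assert(!pred.empty() && pred.size() == target.size());
    const float inv_n = 1.0f / static_cast<float>(pred.size());
    std::vector<float> grad(pred.size());
    for (std::size_t i = 0; i < pred.size(); ++i) {
        const float d = pred[i] - target[i];
        if (std::fabs(d) <= delta) {
            grad[i] = inv_n * d;
        } else {
            grad[i] = inv_n * delta * (d > 0.0f ? 1.0f : -1.0f);
        }
    }
    return grad;
}
```

With delta = 1, an error of 0.5 contributes 0.125 to the sum and an error of 3 contributes 2.5. The gradient for the large error is capped at delta/N instead of growing with d.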

bce_loss

Variable *bce_loss(Arena *arena, Variable *pred, Variable *target,
Reduction reduction);

L = -(1/N) * Σ [t * log(p) + (1-t) * log(1-p)]

pred must contain values in (0, 1) — i.e. the output of a sigmoid layer. A small epsilon clamps values away from 0 and 1 to avoid log(0).

Gradient: ∂L/∂p = (1/N) * (p - t) / (p * (1 - p))

Use bce_with_logits_loss when possible

bce_loss requires predictions that have already passed through a sigmoid. bce_with_logits_loss takes raw logits and is more numerically stable.


bce_with_logits_loss

Variable *bce_with_logits_loss(Arena *arena, Variable *logits, Variable *target,
Reduction reduction);

Numerically stable BCE that accepts raw logits (before sigmoid). Uses the log-sum-exp trick. For a single logit x and target y, the per-element loss is:

L = max(x, 0) - x * y + log(1 + exp(-|x|))

Gradient: ∂L/∂x = (1/N) * (σ(x) - y)

This is the preferred loss for binary classification with a linear output layer.
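To see why the logits form is preferred, the sketch below compares the stable per-element formula with an unclamped sigmoid-then-log version. It is a standalone illustration with hypothetical function names. Note that the library's bce_loss clamps with an epsilon, so the naive function here is deliberately not a model of bce_loss.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Stable per-element loss: max(x, 0) - x*y + log(1 + exp(-|x|)).
float bce_with_logits_elem(float x, float y) {
    assert(y >= 0.0f && y <= 1.0f);
    return std::max(x, 0.0f) - x * y + std::log1p(std::exp(-std::fabs(x)));
}

// Sigmoid evaluated without overflowing exp() for large |x|.
float stable_sigmoid(float x) {
    if (x >= 0.0f) return 1.0f / (1.0f + std::exp(-x));
    const float e = std::exp(x);
    return e / (1.0f + e);
}

// Per-element gradient before the 1/N reduction factor: sigma(x) - y.
float bce_with_logits_grad_elem(float x, float y) {
    return stable_sigmoid(x) - y;
}

// Naive, unclamped form for comparison only: sigmoid followed by log.
float naive_bce_elem(float x, float y) {
    const float p = 1.0f / (1.0f + std::exp(-x));
    return -(y * std::log(p) + (1.0f - y) * std::log(1.0f - p));
}
```

For a confident wrong prediction such as x = 100 with y = 0, the sigmoid rounds to exactly 1 in single precision, so the naive form evaluates log(0) and returns infinity, while the stable form returns approximately 100.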


cross_entropy_loss

Variable *cross_entropy_loss(Arena *arena, Variable *logits, Variable *target,
Reduction reduction);

Applies log-softmax and negative log-likelihood in one numerically stable operation.

Forward: For each batch element, subtracts max(logits) for stability, computes log-sum-exp, then computes Σ target_c * (log_sum_exp - logit_c).

Gradient: ∂L/∂logit_c = (1/N) * (softmax(logit)_c - target_c)

The gradient is simply the difference between the predicted probability and the target probability — intuitive and efficient.

Saves: logits->data, target->data.

// pred: raw logits [batch, num_classes], target: one-hot [batch, num_classes]
auto *loss = autograd::cross_entropy_loss(graph_arena, pred, target, REDUCTION_MEAN);

Do not apply softmax before this

cross_entropy_loss applies log-softmax internally. Double-applying softmax will corrupt both the loss value and its gradient.
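The forward steps and gradient described above can be sketched for a single batch row. This is a standalone illustration using std::vector<float>. The function names are hypothetical, and the 1/N batch factor is omitted because only one row is processed.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Loss for one row: sum_c target_c * (log_sum_exp - logit_c),
// with max(logits) subtracted before exponentiating for stability.
float cross_entropy_row(const std::vector<float> &logits,
                        const std::vector<float> &target) {
    assert(!logits.empty() && logits.size() == target.size());
    const float m = *std::max_element(logits.begin(), logits.end());
    float sum_exp = 0.0f;
    for (float z : logits) sum_exp += std::exp(z - m);
    const float lse = m + std::log(sum_exp);
    float loss = 0.0f;
    for (std::size_t c = 0; c < logits.size(); ++c) {
        loss += target[c] * (lse - logits[c]);
    }
    return loss;
}

// Gradient for one row: softmax(logits)_c - target_c.
std::vector<float> cross_entropy_row_grad(const std::vector<float> &logits,
                                          const std::vector<float> &target) {
    assert(!logits.empty() && logits.size() == target.size());
    const float m = *std::max_element(logits.begin(), logits.end());
    float sum_exp = 0.0f;
    for (float z : logits) sum_exp += std::exp(z - m);
    std::vector<float> grad(logits.size());
    for (std::size_t c = 0; c < logits.size(); ++c) {
        grad[c] = std::exp(logits[c] - m) / sum_exp - target[c];
    }
    return grad;
}
```

Because the maximum is subtracted first, a row such as {1000, 0} stays finite: the largest exponent is exp(0) = 1 and the loss for the correct class is 0.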


nll_loss

Variable *nll_loss(Arena *arena, Variable *log_probs, Variable *target,
Reduction reduction);

Negative log-likelihood. Expects log_probs to be the output of log(softmax(x)) — i.e. log-probabilities.

L = -(1/N) * Σ_batch Σ_class target * log_probs

Gradient: ∂L/∂log_probs = -(1/N) * target

cross_entropy_loss combines log_softmax + nll_loss in one stable step and should be preferred for classification.
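The relationship between the two losses can be illustrated for a single row: applying a log-softmax and then the NLL formula gives the same value as the fused cross-entropy. The sketch below is standalone, and log_softmax_row and nll_row are hypothetical helper names, not library functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// log_softmax(z)_c = z_c - (max + log(sum_k exp(z_k - max)))
std::vector<float> log_softmax_row(const std::vector<float> &logits) {
    assert(!logits.empty());
    const float m = *std::max_element(logits.begin(), logits.end());
    float sum_exp = 0.0f;
    for (float z : logits) sum_exp += std::exp(z - m);
    const float lse = m + std::log(sum_exp);
    std::vector<float> out(logits.size());
    for (std::size_t c = 0; c < logits.size(); ++c) out[c] = logits[c] - lse;
    return out;
}

// NLL for one row: -sum_c target_c * log_probs_c.
// Its gradient with respect to log_probs_c is simply -target_c.
float nll_row(const std::vector<float> &log_probs,
              const std::vector<float> &target) {
    assert(log_probs.size() == target.size());
    float loss = 0.0f;
    for (std::size_t c = 0; c < log_probs.size(); ++c) {
        loss -= target[c] * log_probs[c];
    }
    return loss;
}
```

For logits {1, 2, 3} with the third class as the target, nll_row(log_softmax_row(logits), target) evaluates to log(e + e² + e³) - 3.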


kl_div_loss

Variable *kl_div_loss(Arena *arena, Variable *pred, Variable *target,
Reduction reduction);

L = (1/N) * Σ target * (log(target) - pred), where pred holds log-probabilities.

Gradient: ∂L/∂pred_i = -(1/N) * target_i (only where target_i > 0)

Used for knowledge distillation and variational models where you match predicted log-probabilities to a target distribution.


hinge_loss

Variable *hinge_loss(Arena *arena, Variable *pred, Variable *target,
Reduction reduction);

L = (1/N) * Σ max(0, 1 - pred * target)

Target must be ±1. Used for SVM-style binary classification.

Gradient: ∂L/∂pred = -(1/N) * target where 1 - pred * target > 0, else 0.
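The active-margin condition can be sketched in a few lines. This standalone example assumes mean reduction and targets of exactly +1 or -1; hinge_forward and hinge_grad are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// L = (1/N) * sum_i max(0, 1 - pred_i * target_i)
float hinge_forward(const std::vector<float> &pred,
                    const std::vector<float> &target) {
    assert(!pred.empty() && pred.size() == target.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < pred.size(); ++i) {
        sum += std::max(0.0f, 1.0f - pred[i] * target[i]);
    }
    return sum / static_cast<float>(pred.size());
}

// Gradient is -(1/N) * target_i where the margin is violated, else 0.
std::vector<float> hinge_grad(const std::vector<float> &pred,
                              const std::vector<float> &target) {
    assert(!pred.empty() && pred.size() == target.size());
    const float inv_n = 1.0f / static_cast<float>(pred.size());
    std::vector<float> grad(pred.size(), 0.0f);
    for (std::size_t i = 0; i < pred.size(); ++i) {
        if (1.0f - pred[i] * target[i] > 0.0f) grad[i] = -inv_n * target[i];
    }
    return grad;
}
```

Samples that are already classified with a margin of at least 1 contribute neither loss nor gradient.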


Regularisation Losses

These take a single parameter tensor rather than a prediction-target pair.

l2_loss

Variable *l2_loss(Arena *arena, Variable *weights, Reduction reduction);

L = (1/N) * Σ 0.5 * w_i²

Gradient: ∂L/∂w = (1/N) * w

Only one parent (weights). Saves weights->data.

Prefer AdamW's built-in weight decay

autograd::l2_loss adds L2 regularisation as a loss term, so its gradient is rescaled by Adam's adaptive learning rate together with the data gradient. optim::AdamW applies weight decay directly to the weights, independently of that scaling, so the decay strength stays predictable. Use AdamW unless you have a specific reason to add L2 through the loss.
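The difference shows up even in a single update of one scalar weight. The sketch below is a simplified standalone illustration, not the library's optimiser code. It assumes optim::AdamW follows the standard decoupled update w ← w - lr * (m̂ / (√v̂ + ε) + λw), and it uses the fact that after Adam's first bias-corrected step m̂ = g and v̂ = g². All function names are hypothetical.

```cpp
#include <cmath>

// Adam's first-step adaptive update for a scalar gradient g.
// With bias correction, m_hat = g and v_hat = g*g after step one.
float adam_first_step_delta(float g, float lr, float eps) {
    return lr * g / (std::sqrt(g * g) + eps);
}

// L2 through the loss: lambda * w is added to the gradient, so it is
// rescaled by the adaptive denominator like any other gradient.
float step_adam_with_l2(float w, float g_data, float lambda,
                        float lr, float eps) {
    return w - adam_first_step_delta(g_data + lambda * w, lr, eps);
}

// Decoupled weight decay: the decay term bypasses the adaptive scaling.
float step_adamw(float w, float g_data, float lambda,
                 float lr, float eps) {
    return w - adam_first_step_delta(g_data, lr, eps) - lr * lambda * w;
}
```

With w = 1, a zero data gradient, lambda = 0.1, and lr = 0.01, the L2-in-loss variant moves the weight by a full step of about 0.01 regardless of lambda, while the decoupled variant shrinks it by lr * lambda * w = 0.001.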


l1_regularization

Variable *l1_regularization(Arena *arena, Variable *weights,
Reduction reduction);

L = (1/N) * Σ |w_i|

Gradient: ∂L/∂w = (1/N) * sign(w)

L1 regularisation induces sparsity — weights are pushed toward exactly zero rather than merely made small. Useful when you want the model to actively select features.


Multi-Input Loss Functions

These take more than two Variable inputs and have specialised parent wiring.

cosine_embedding_loss

Variable *cosine_embedding_loss(Arena *arena, Variable *x1, Variable *x2,
Variable *target, float margin,
Reduction reduction);

L = 1 - cos_sim(x1, x2)                  if target == 1
L = max(0, cos_sim(x1, x2) - margin)     if target == -1

Three parents: x1, x2, target. The backward function currently accumulates a gradient into x1 only (see the note below); target never receives one. margin is stored in metadata_float.

Gradient w.r.t. x1:

∂L/∂x1_f = sign * (x2_f - x1_f * dot/‖x1‖²) / (‖x1‖ * ‖x2‖)

where sign = -1 for similar pairs (target=1) and +1 for dissimilar pairs where the margin is violated.

Note

The backward function in src/autograd/ops/loss/cosine_embedding_loss.cpp only computes the gradient for x1, not x2. If you need gradients for both embedding vectors, you will need to extend the backward function or compute x2's gradient symmetrically.

// metric learning: x1 and x2 should be similar (target=1) or dissimilar (target=-1)
auto *loss = autograd::cosine_embedding_loss(
graph_arena, x1, x2, target, /*margin=*/0.5f, REDUCTION_MEAN);
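The x1 gradient formula can be sketched for a single pair as follows. This is a standalone illustration with hypothetical function names, using std::vector<float> for the embeddings and an int target of +1 or -1. It does not guard against zero-length vectors.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

float dot(const std::vector<float> &a, const std::vector<float> &b) {
    assert(a.size() == b.size());
    float s = 0.0f;
    for (std::size_t f = 0; f < a.size(); ++f) s += a[f] * b[f];
    return s;
}

float cos_sim(const std::vector<float> &a, const std::vector<float> &b) {
    return dot(a, b) / (std::sqrt(dot(a, a)) * std::sqrt(dot(b, b)));
}

float cosine_embedding_pair(const std::vector<float> &x1,
                            const std::vector<float> &x2,
                            int target, float margin) {
    const float c = cos_sim(x1, x2);
    return target == 1 ? 1.0f - c : std::max(0.0f, c - margin);
}

// Gradient with respect to the FIRST argument:
// sign * (x2_f - x1_f * dot / |x1|^2) / (|x1| * |x2|)
std::vector<float> cosine_embedding_grad_first(const std::vector<float> &x1,
                                               const std::vector<float> &x2,
                                               int target, float margin) {
    const float d = dot(x1, x2);
    const float n1_sq = dot(x1, x1);
    const float norms = std::sqrt(n1_sq) * std::sqrt(dot(x2, x2));
    float sign = 0.0f;  // stays 0 for dissimilar pairs inside the margin
    if (target == 1) sign = -1.0f;
    else if (d / norms - margin > 0.0f) sign = 1.0f;
    std::vector<float> grad(x1.size());
    for (std::size_t f = 0; f < x1.size(); ++f) {
        grad[f] = sign * (x2[f] - x1[f] * d / n1_sq) / norms;
    }
    return grad;
}
```

Because cosine similarity is symmetric in its arguments, calling cosine_embedding_grad_first(x2, x1, target, margin) yields the gradient with respect to x2, which is one way to obtain the missing x2 gradient mentioned in the note above.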

triplet_loss

Variable *triplet_loss(Arena *arena, Variable *anchor, Variable *positive,
Variable *negative, float margin, Reduction reduction);

L = max(0, dist(anchor, positive) - dist(anchor, negative) + margin)

where dist is Euclidean distance.

Three parents: anchor, positive, negative. All three are saved. margin is stored in metadata_float.

Gradient w.r.t. anchor (when the triplet is active, i.e. the loss > 0):

∂L/∂anchor_f = (anchor_f - positive_f) / dist_ap - (anchor_f - negative_f) / dist_an

The backward function in the source computes the anchor gradient only. local_grad_positive and local_grad_negative are allocated, but the result is accumulated into parent_anchor's grad alone, so positive and negative receive no gradient. Extending the backward function to cover them follows from the symmetry of Euclidean distance.

auto *loss = autograd::triplet_loss(
graph_arena, anchor, positive, negative, /*margin=*/1.0f, REDUCTION_MEAN);
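A sketch of all three gradients for a single triplet is shown below, as one possible starting point for extending the backward function. It is a standalone illustration with hypothetical names and does not reflect the library's internal layout. It assumes both distances are non-zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct TripletGrads {
    std::vector<float> anchor, positive, negative;
};

float euclidean(const std::vector<float> &a, const std::vector<float> &b) {
    assert(a.size() == b.size());
    float s = 0.0f;
    for (std::size_t f = 0; f < a.size(); ++f) {
        s += (a[f] - b[f]) * (a[f] - b[f]);
    }
    return std::sqrt(s);
}

float triplet_single(const std::vector<float> &a, const std::vector<float> &p,
                     const std::vector<float> &n, float margin) {
    return std::max(0.0f, euclidean(a, p) - euclidean(a, n) + margin);
}

// Gradients are all zero when the triplet is inactive (loss == 0).
TripletGrads triplet_grads(const std::vector<float> &a,
                           const std::vector<float> &p,
                           const std::vector<float> &n, float margin) {
    const std::size_t dim = a.size();
    TripletGrads g{std::vector<float>(dim, 0.0f),
                   std::vector<float>(dim, 0.0f),
                   std::vector<float>(dim, 0.0f)};
    const float d_ap = euclidean(a, p);
    const float d_an = euclidean(a, n);
    if (d_ap - d_an + margin <= 0.0f) return g;
    for (std::size_t f = 0; f < dim; ++f) {
        const float u = (a[f] - p[f]) / d_ap;  // d(dist_ap)/d(anchor_f)
        const float v = (a[f] - n[f]) / d_an;  // d(dist_an)/d(anchor_f)
        g.anchor[f] = u - v;
        g.positive[f] = -u;  // dist_ap depends on positive with opposite sign
        g.negative[f] = v;   // -dist_an depends on negative with opposite sign
    }
    return g;
}
```

The three gradients sum to zero in every component, because the loss is unchanged when all three points are translated together. That property is a convenient sanity check for an extended backward function.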

Loss Function Quick Reference

| Function | Inputs | Use case |
| --- | --- | --- |
| mse_loss | pred, target | Regression |
| l1_loss | pred, target | Regression, outlier-robust |
| huber_loss | pred, target, delta | Regression, best of MSE+L1 |
| bce_loss | pred (sigmoid), target | Binary classification |
| bce_with_logits_loss | logits, target | Binary classification (preferred) |
| cross_entropy_loss | logits (raw), target (one-hot) | Multi-class classification |
| nll_loss | log_probs, target | When you control log-softmax step |
| kl_div_loss | log_probs, target (dist) | Distribution matching |
| hinge_loss | pred, target (±1) | SVM-style binary classification |
| l2_loss | weights | L2 weight regularisation |
| l1_regularization | weights | L1 weight regularisation |
| cosine_embedding_loss | x1, x2, target (±1), margin | Metric learning (pairs) |
| triplet_loss | anchor, pos, neg, margin | Metric learning (triplets) |