Optimizer Overview
Optimizers are responsible for updating model parameters after every backward pass. They read the gradients that autograd computes for each parameter and use them to update the parameter's data. Everything else in the training loop (the forward pass, the loss, the backward pass) exists to produce those gradients. The optimizer is where gradients become learning.
SGD
Stochastic Gradient Descent (SGD) is the oldest optimizer here, still competitive, and the easiest to reason about. Every weight moves directly opposite its gradient, scaled by the learning rate.
Adam
Adam (Adaptive Moment Estimation) maintains a running estimate of the first moment (mean) and second moment (uncentred variance) of the gradients, using them to scale the learning rate individually for each parameter. The result is an optimizer that adapts its step size to the scale of each parameter's gradients and typically requires less learning-rate tuning than SGD.
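As a rough sketch of the moment estimates described above, the following shows one Adam step on plain std::vector<double> buffers. The names AdamState and adam_step are hypothetical, the default hyperparameters are the commonly quoted ones, and the library's actual implementation may differ.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical state container for this sketch (not the library's type).
struct AdamState {
    std::vector<double> m;  // running mean of gradients (first moment)
    std::vector<double> v;  // running mean of squared gradients (second moment)
    long step = 0;          // number of updates taken so far
};

// One Adam step with bias correction.
inline void adam_step(std::vector<double>& w, const std::vector<double>& g,
                      AdamState& s, double lr = 1e-3, double beta1 = 0.9,
                      double beta2 = 0.999, double eps = 1e-8) {
    if (s.m.empty()) {
        s.m.assign(w.size(), 0.0);
        s.v.assign(w.size(), 0.0);
    }
    ++s.step;
    // Both moments start at zero, so early estimates are biased toward zero.
    const double bc1 = 1.0 - std::pow(beta1, static_cast<double>(s.step));
    const double bc2 = 1.0 - std::pow(beta2, static_cast<double>(s.step));
    for (std::size_t i = 0; i < w.size(); ++i) {
        s.m[i] = beta1 * s.m[i] + (1.0 - beta1) * g[i];
        s.v[i] = beta2 * s.v[i] + (1.0 - beta2) * g[i] * g[i];
        const double m_hat = s.m[i] / bc1;
        const double v_hat = s.v[i] / bc2;
        // Per-parameter step: the gradient mean divided by its RMS.
        w[i] -= lr * m_hat / (std::sqrt(v_hat) + eps);
    }
}
```

On the very first step the ratio m_hat / sqrt(v_hat) is close to the sign of the gradient, so every parameter moves by roughly lr regardless of how large or small its gradient is.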
AdamW
AdamW is Adam with decoupled weight decay. The key insight (Loshchilov & Hutter, 2019) is that adding L2 regularisation to the loss and then running Adam is not the same as applying weight decay directly to the weights. With L2 regularisation the decay term enters through the gradient, so Adam rescales it by the adaptive per-parameter step size, making its effective magnitude inconsistent across parameters and training steps. AdamW fixes this by applying weight decay as a direct multiplicative shrinkage of the weight, completely separate from the gradient update.
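The decoupling can be illustrated with a small sketch. The names AdamWState and adamw_step are hypothetical and the code works on plain std::vector<double> buffers; it is meant to show where the decay is applied, not to mirror the library's implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical state container for this sketch (not the library's type).
struct AdamWState {
    std::vector<double> m;  // first moment
    std::vector<double> v;  // second moment
    long step = 0;
};

// One AdamW step: weight decay shrinks the weight directly and never
// passes through the adaptive scaling applied to the gradient.
inline void adamw_step(std::vector<double>& w, const std::vector<double>& g,
                       AdamWState& s, double lr = 1e-3,
                       double weight_decay = 1e-2, double beta1 = 0.9,
                       double beta2 = 0.999, double eps = 1e-8) {
    if (s.m.empty()) {
        s.m.assign(w.size(), 0.0);
        s.v.assign(w.size(), 0.0);
    }
    ++s.step;
    const double bc1 = 1.0 - std::pow(beta1, static_cast<double>(s.step));
    const double bc2 = 1.0 - std::pow(beta2, static_cast<double>(s.step));
    for (std::size_t i = 0; i < w.size(); ++i) {
        // Decoupled decay: multiplicative shrinkage, independent of g[i].
        w[i] *= (1.0 - lr * weight_decay);
        // Ordinary Adam update computed from the raw gradient only.
        s.m[i] = beta1 * s.m[i] + (1.0 - beta1) * g[i];
        s.v[i] = beta2 * s.v[i] + (1.0 - beta2) * g[i] * g[i];
        const double m_hat = s.m[i] / bc1;
        const double v_hat = s.v[i] / bc2;
        w[i] -= lr * m_hat / (std::sqrt(v_hat) + eps);
    }
}
```

With a zero gradient the Adam term vanishes and the weight shrinks by exactly the factor (1 - lr * weight_decay), which is the behaviour that L2 regularisation inside Adam does not give.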
RMSprop
RMSprop (Root Mean Square Propagation) divides the learning rate by the square root of a running average of recent squared gradients. This normalises the effective step size across parameters: parameters with historically large gradients get smaller steps, and parameters with small gradients get larger ones. Unlike Adam, RMSprop does not maintain a first moment estimate, making it lighter on memory and slightly simpler.
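A minimal sketch of that normalisation follows, assuming plain std::vector<double> buffers. The function name rmsprop_step, the sq_avg buffer and the default values are assumptions made for illustration rather than the library's API.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: one RMSprop step.
// sq_avg holds the running average of squared gradients and must have the
// same length as w (initialise it to zeros before the first call).
inline void rmsprop_step(std::vector<double>& w, const std::vector<double>& g,
                         std::vector<double>& sq_avg, double lr = 1e-2,
                         double alpha = 0.99, double eps = 1e-8) {
    for (std::size_t i = 0; i < w.size(); ++i) {
        // Exponential moving average of g^2.
        sq_avg[i] = alpha * sq_avg[i] + (1.0 - alpha) * g[i] * g[i];
        // Divide by the root mean square, so the step size no longer
        // depends on the raw gradient magnitude.
        w[i] -= lr * g[i] / (std::sqrt(sq_avg[i]) + eps);
    }
}
```

Two parameters whose gradients differ by several orders of magnitude end up taking nearly identical first steps, which is the equalising effect described above.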
Adagrad
Adagrad (Adaptive Gradient Algorithm) accumulates the sum of all squared gradients seen so far and divides the current gradient by its square root. Parameters that receive large gradients frequently get progressively smaller effective learning rates; parameters with sparse, infrequent gradients retain a larger effective rate. This makes Adagrad especially well-suited for tasks with naturally sparse gradient signals.
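The accumulation can be sketched as follows. The function name adagrad_step, the sum_sq buffer and the defaults are hypothetical, and the code uses plain std::vector<double> buffers rather than the library's tensor type.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: one Adagrad step.
// sum_sq accumulates every squared gradient seen so far and is never
// decayed, so the effective learning rate can only shrink over time.
inline void adagrad_step(std::vector<double>& w, const std::vector<double>& g,
                         std::vector<double>& sum_sq, double lr = 1e-2,
                         double eps = 1e-10) {
    for (std::size_t i = 0; i < w.size(); ++i) {
        sum_sq[i] += g[i] * g[i];
        w[i] -= lr * g[i] / (std::sqrt(sum_sq[i]) + eps);
    }
}
```

A parameter whose gradient is zero on most steps adds nothing to its accumulator on those steps, so it keeps a larger effective rate for the occasions when it does receive a signal.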
L-BFGS
L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno) is a quasi-Newton optimizer. Rather than following the gradient directly, it uses a short history of recent parameter and gradient changes to build an approximation of the inverse Hessian, which describes the curvature of the loss landscape, and uses that approximation to compute a better search direction. On smooth, deterministic objectives this typically converges in far fewer gradient evaluations than first-order methods. The cost is that it requires a closure (a function that re-evaluates the loss and its gradients on demand for the line search) and is poorly suited to mini-batch stochastic training, because the curvature estimate and the line search both assume the same objective on every evaluation.
Optimizer Utilities
optim_utils.hpp is a header-only collection of inline helper functions used internally by the L-BFGS optimizer to flatten parameter and gradient vectors into a single 1D tensor, and to restore parameters from that flat representation. They are also useful when implementing custom second-order optimizers.
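The flatten and restore round trip might look something like the sketch below. It uses std::vector<double> as a stand-in for the library's tensor type, and the names flatten_params and unflatten_params are hypothetical; the actual helpers in optim_utils.hpp may have different names and signatures.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in: concatenate every parameter buffer into one
// flat vector, preserving order.
inline std::vector<double> flatten_params(
        const std::vector<std::vector<double>>& params) {
    std::size_t total = 0;
    for (const auto& p : params) total += p.size();
    std::vector<double> flat;
    flat.reserve(total);
    for (const auto& p : params) flat.insert(flat.end(), p.begin(), p.end());
    return flat;
}

// Hypothetical stand-in: copy a flat vector back into the parameter
// buffers. Each buffer keeps its existing size, and flat must hold at
// least as many values as all buffers combined.
inline void unflatten_params(const std::vector<double>& flat,
                             std::vector<std::vector<double>>& params) {
    std::ptrdiff_t offset = 0;
    for (auto& p : params) {
        const auto n = static_cast<std::ptrdiff_t>(p.size());
        std::copy(flat.begin() + offset, flat.begin() + offset + n, p.begin());
        offset += n;
    }
}
```

Working on one flat vector lets an optimizer such as L-BFGS treat the whole model as a single point in parameter space, so dot products and search directions do not need to loop over layers.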