
Optimizers

Every optimizer in GradCore-Tensor — SGD, Adam, AdamW, RMSprop, Adagrad, and L-BFGS — with update rules, constructor arguments, and when to use each one.

📄️AdamW

AdamW is Adam with decoupled weight decay. The key insight (Loshchilov & Hutter, 2019) is that adding L2 regularisation to the loss and then running Adam is not the same as applying weight decay directly to the weights. When the L2 penalty is folded into the gradient, Adam divides it by the same per-parameter adaptive denominator as the rest of the gradient, so the effective strength of the decay varies across parameters and training steps. AdamW fixes this by applying weight decay as a direct multiplicative shrinkage of the weights, kept separate from the adaptive gradient update.
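The difference is easiest to see with the gradient of the loss set to zero, so that only the decay term acts. The following is a minimal NumPy sketch, not GradCore-Tensor's implementation: the function `adam_step`, its argument names, and the hyperparameter values are all assumptions chosen for illustration.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8,
              l2=0.0, weight_decay=0.0):
    """One Adam step (hypothetical helper, not a library API).

    l2 > 0           -> Adam with an L2 penalty folded into the gradient.
    weight_decay > 0 -> AdamW-style decoupled decay.
    """
    g = grad + l2 * w                          # L2 term passes through the moments
    m = beta1 * m + (1 - beta1) * g            # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g        # second-moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w * (1 - lr * weight_decay)            # decoupled decay: direct shrinkage
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Two weights of different magnitude, with zero loss gradient so only decay acts.
w0 = np.array([1.0, 5.0])
zero_grad = np.zeros_like(w0)
state = np.zeros_like(w0)

w_l2, _, _ = adam_step(w0, zero_grad, state, state, t=1, l2=0.1)
w_adamw, _, _ = adam_step(w0, zero_grad, state, state, t=1, weight_decay=0.1)

print("Adam + L2 shrink factors:", w_l2 / w0)
print("AdamW shrink factors:    ", w_adamw / w0)
```

In this sketch the L2 variant moves both weights by roughly the same absolute amount, because the adaptive denominator normalises the penalty, so the relative shrinkage differs between them. The decoupled variant shrinks both weights by the same factor, `1 - lr * weight_decay`.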

📄️L-BFGS

L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno) is a quasi-Newton optimizer. Rather than following the gradient directly, it keeps a short history of recent parameter and gradient changes, uses that history to approximate the inverse Hessian (which captures the curvature of the loss landscape), and uses the approximation to compute a better search direction. On smooth, deterministic problems this typically gives much faster convergence per gradient evaluation than first-order methods. The cost is that it requires a closure (a function that re-evaluates the loss and its gradients on demand for the line search) and, in its standard form, is poorly suited to mini-batch stochastic training, because the curvature estimate and the line search both assume that the loss and gradients are consistent from one evaluation to the next.
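To make the role of the history and the closure concrete, here is a minimal NumPy sketch of L-BFGS using the standard two-loop recursion and a simple backtracking line search. It is an illustration under stated assumptions, not GradCore-Tensor's implementation: the names `two_loop_direction`, `lbfgs_minimize`, and `quadratic`, the closure signature (takes parameters, returns loss and gradient), and the test problem are all hypothetical.

```python
import numpy as np
from collections import deque

def two_loop_direction(grad, s_hist, y_hist):
    """Approximate -H^{-1} @ grad from stored (s, y) pairs (two-loop recursion)."""
    q = grad.copy()
    alphas = []
    for s, y in zip(reversed(s_hist), reversed(y_hist)):      # newest to oldest
        rho = 1.0 / y.dot(s)
        a = rho * s.dot(q)
        q -= a * y
        alphas.append((a, rho))
    if s_hist:                                                # initial inverse-Hessian scaling
        s, y = s_hist[-1], y_hist[-1]
        q *= s.dot(y) / y.dot(y)
    for (a, rho), s, y in zip(reversed(alphas), s_hist, y_hist):  # oldest to newest
        b = rho * y.dot(q)
        q += (a - b) * s
    return -q

def lbfgs_minimize(closure, x0, history=10, max_iter=100, tol=1e-6):
    """closure(x) -> (loss, grad). Assumed signature, for illustration only."""
    x = x0.astype(float)
    s_hist, y_hist = deque(maxlen=history), deque(maxlen=history)
    loss, grad = closure(x)
    for _ in range(max_iter):
        if np.linalg.norm(grad) < tol:
            break
        d = two_loop_direction(grad, s_hist, y_hist)
        step = 1.0
        while True:
            # Each trial point re-invokes the closure: this is why L-BFGS needs one.
            new_x = x + step * d
            new_loss, new_grad = closure(new_x)
            if new_loss <= loss + 1e-4 * step * grad.dot(d) or step < 1e-12:
                break
            step *= 0.5                                       # backtrack (Armijo condition)
        s, y = new_x - x, new_grad - grad
        if y.dot(s) > 1e-12:                                  # keep positive-curvature pairs only
            s_hist.append(s)
            y_hist.append(y)
        x, loss, grad = new_x, new_loss, new_grad
    return x, loss

# Deterministic test problem: f(x) = 0.5 x^T A x - b^T x, minimised where A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

def quadratic(x):
    return 0.5 * x.dot(A).dot(x) - b.dot(x), A.dot(x) - b

x_min, f_min = lbfgs_minimize(quadratic, np.zeros(2))
print("minimiser:", x_min)
```

The line search may call the closure several times per iteration, and every call is assumed to evaluate the same objective. With mini-batches, each call would see a different loss, which undermines both the Armijo test and the `(s, y)` curvature pairs.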