
Optimizers

Optimizers are responsible for updating model parameters after every backward pass. They read the gradient that autograd::backward accumulated into each Variable::grad, compute an update, and apply it to Variable::data. Everything else in the training loop — the forward pass, the loss, the backward pass — exists to produce those gradients. The optimizer is where gradients become learning.

Header directory: include/optim/
Namespace: gradientcore::optim

What optimizers actually do

Every optimizer in GradCore-Tensor follows the same contract. step(Arena*) iterates over every parameter that has requires_grad = true and a non-null grad tensor, reads the gradient values, updates any internal state tensors (moment estimates, accumulated squared gradients, etc.), and writes the new parameter values back. zero_grad() zeroes every gradient tensor so the next backward pass starts clean.
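The contract can be sketched with a toy optimizer. This is an illustration only, not the library's implementation: ToyVariable and ToySGD are hypothetical stand-ins, parameters are plain std::vector<float> rather than tensors, and the Arena* argument to step() is omitted because this sketch needs no temporaries.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for autograd::Variable (the real type holds tensors).
struct ToyVariable {
    std::vector<float> data;  // parameter values
    std::vector<float> grad;  // empty here plays the role of a null grad tensor
    bool requires_grad;
};

// Plain SGD that follows the step()/zero_grad() contract described above.
class ToySGD {
public:
    ToySGD(std::vector<ToyVariable*> params, float lr)
        : params_(std::move(params)), lr_(lr) {}

    // Read each gradient and write the updated values back into data.
    void step() {
        for (ToyVariable* p : params_) {
            // Skip frozen parameters and parameters without a gradient.
            if (!p->requires_grad || p->grad.empty()) continue;
            assert(p->grad.size() == p->data.size());
            for (std::size_t i = 0; i < p->data.size(); ++i)
                p->data[i] -= lr_ * p->grad[i];
        }
    }

    // Zero every gradient so the next backward pass starts clean.
    void zero_grad() {
        for (ToyVariable* p : params_)
            for (float& g : p->grad) g = 0.0f;
    }

private:
    std::vector<ToyVariable*> params_;
    float lr_;
};
```

Stateful optimizers would add a state update between reading the gradient and writing the parameter; the skip condition and the write-back stay the same.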


The Two-Arena Pattern

Optimizers interact with both arenas:

perm_arena: model parameters (data + grad), optimizer state (m, v, sum_sq, …)
graph_arena: passed to step() for any temporaries (e.g. SGD's scaled gradient)

State tensors (m, v, etc.) are allocated once in the constructor on perm_arena and persist for the entire training run. They are never freed during training — they accumulate information across batches.

optim::Adam adam(perm_arena, params, 0.001f);
// ↑ state tensors allocated here, live until perm_arena->destroy()

for (int epoch = 0; epoch < 40; epoch++) {
    for (auto& batch : loader) {
        uint64_t pos = graph_arena->get_pos();
        adam.zero_grad();        // clear gradients before the new backward pass
        // ... forward, loss, backward ...
        adam.step(graph_arena);  // graph_arena used for temporaries only
        graph_arena->pop_to(pos);
    }
}
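To make the get_pos()/pop_to() bracket concrete, here is a minimal bump-allocator sketch. ToyArena and its push() method are hypothetical; only the names get_pos and pop_to come from the loop above, and the real Arena may handle alignment, growth, and errors differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bump allocator; alignment is ignored to keep the sketch short.
class ToyArena {
public:
    explicit ToyArena(std::size_t capacity) : buffer_(capacity), pos_(0) {}

    // Reserve `bytes` bytes at the current position and advance past them.
    void* push(std::size_t bytes) {
        assert(pos_ + bytes <= buffer_.size() && "arena exhausted");
        void* p = buffer_.data() + pos_;
        pos_ += bytes;
        return p;
    }

    // Current high-water mark, suitable for a later pop_to().
    std::uint64_t get_pos() const { return pos_; }

    // Release everything allocated after `pos` in O(1).
    void pop_to(std::uint64_t pos) {
        assert(pos <= pos_);
        pos_ = pos;
    }

private:
    std::vector<unsigned char> buffer_;
    std::uint64_t pos_;
};
```

Under this model, whatever is pushed on the permanent arena at construction time keeps its position for the whole run, while the graph arena returns to the same mark at the end of every batch.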

Getting Parameters

Every optimizer constructor takes const std::vector<autograd::Variable*>& params. Obtain this from your model:

// From nn::Sequential directly
auto params = seq->parameters();

// From nn::Model (via the wrapped Sequential)
auto params = model.get_model()->parameters();

parameters() returns a flat list of all Variable* with requires_grad = true across the full module hierarchy. See Module for details.
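A flat list over a module hierarchy can be produced with a depth-first walk. The sketch below shows one way this might work; ToyParam, ToyModule, and collect_parameters are hypothetical names, and the actual traversal order used by parameters() is not specified here.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for autograd::Variable and a module node.
struct ToyParam {
    bool requires_grad;
};

struct ToyModule {
    std::vector<ToyParam*> own_params;  // parameters owned by this module
    std::vector<ToyModule*> children;   // nested submodules
};

// Depth-first walk: a module's own trainable parameters come first,
// followed by those of each child in order.
void collect_parameters(const ToyModule& m, std::vector<ToyParam*>& out) {
    for (ToyParam* p : m.own_params)
        if (p->requires_grad) out.push_back(p);
    for (const ToyModule* child : m.children) {
        assert(child != nullptr);
        collect_parameters(*child, out);
    }
}
```

Because the result holds pointers rather than copies, an optimizer built from the list updates the model's parameters in place.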


Choosing an Optimizer

| Optimizer | Best for | Notes |
| --- | --- | --- |
| SGD | Simple tasks, strong baselines | Needs careful LR tuning |
| Adam | General purpose | Great default; can over-fit on small datasets |
| AdamW | Regularised training | Like Adam but with correct weight decay |
| RMSprop | Non-stationary problems, RNNs | Classic adaptive method |
| Adagrad | Sparse gradients | LR decays to zero; use carefully |
| LBFGS | Small datasets, fine-tuning | Requires a closure; not mini-batch friendly |

For most new experiments, start with Adam (lr=0.001) or AdamW (lr=0.001, weight_decay=0.01). Switch to SGD if you need a reproducible baseline or are training very large models: as the Quick Reference shows, SGD keeps no per-parameter state and exposes only lr.


Quick Reference

| Class | State per param | Key hyperparams |
| --- | --- | --- |
| SGD | None | lr |
| Adam | m, v | lr, beta1, beta2, eps |
| AdamW | m, v | lr, beta1, beta2, eps, weight_decay |
| RMSprop | v | lr, alpha, eps, weight_decay |
| Adagrad | sum_sq | lr, eps, weight_decay |
| LBFGS | history deque | lr, history_size, tol_grad, tol_change |
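To show how the m and v state and the listed hyperparameters fit together, here is a sketch of the textbook Adam update with bias correction, plus the decoupled weight decay that distinguishes AdamW. It is an assumption that GradCore-Tensor follows this exact formulation (for example, where eps is added); ToyAdamState and toy_adam_step are hypothetical names, and the default values are the commonly used ones rather than confirmed library defaults.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-parameter state, allocated once and reused every step.
struct ToyAdamState {
    explicit ToyAdamState(std::size_t n) : m(n, 0.0f), v(n, 0.0f), t(0) {}
    std::vector<float> m;  // first-moment (mean) estimate
    std::vector<float> v;  // second-moment (uncentred variance) estimate
    int t;                 // number of steps taken so far
};

// One textbook Adam step. weight_decay > 0 applies decoupled (AdamW-style)
// decay directly to the parameter instead of adding it to the gradient.
void toy_adam_step(std::vector<float>& param, const std::vector<float>& grad,
                   ToyAdamState& s, float lr, float beta1 = 0.9f,
                   float beta2 = 0.999f, float eps = 1e-8f,
                   float weight_decay = 0.0f) {
    assert(param.size() == grad.size() && param.size() == s.m.size());
    s.t += 1;
    // Bias-correction terms compensate for m and v starting at zero.
    const float bc1 = 1.0f - std::pow(beta1, static_cast<float>(s.t));
    const float bc2 = 1.0f - std::pow(beta2, static_cast<float>(s.t));
    for (std::size_t i = 0; i < param.size(); ++i) {
        const float g = grad[i];
        s.m[i] = beta1 * s.m[i] + (1.0f - beta1) * g;
        s.v[i] = beta2 * s.v[i] + (1.0f - beta2) * g * g;
        const float m_hat = s.m[i] / bc1;
        const float v_hat = s.v[i] / bc2;
        param[i] -= lr * weight_decay * param[i];  // no-op when weight_decay == 0
        param[i] -= lr * m_hat / (std::sqrt(v_hat) + eps);
    }
}
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly lr regardless of gradient magnitude.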

Using Optimizers with nn::Model

The high-level Model API creates and manages the optimizer for you via compile():

model.compile(nn::OptimizerType::ADAMW,
              nn::LossType::HUBER,
              /*lr=*/0.001f, /*epochs=*/200, /*batch=*/128);
model.train(X_train, Y_train);

If you need a combination not in the OptimizerType enum, or you need access to optimizer internals (e.g. learning rate scheduling), construct the optimizer directly and use nn::Trainer:

optim::RMSprop rms(perm_arena, seq->parameters(), 0.01f, 0.99f);
nn::HuberLoss huber(2.0f);
nn::Trainer<optim::RMSprop, nn::HuberLoss> trainer(seq, &rms, &huber, graph);
trainer.fit_dataloader(loader, 100);

The Contiguous Fast Path

Every optimizer checks tensor_is_contiguous on the gradient and parameter tensors. If both are contiguous — which is true for all parameters created by nn::Linear, nn::BatchNorm, etc. — the optimizer uses direct pointer arithmetic with optional OpenMP parallelisation. If either is non-contiguous, it falls back to stride-aware indexing via tensor_get_flat_index.

In practice, model parameters are always contiguous, so the fast path is always taken during training.
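The two code paths can be sketched as follows. ToyView, toy_is_contiguous, toy_flat_index, and toy_sgd_update are hypothetical and only model the idea; the signatures of the library's tensor_is_contiguous and tensor_get_flat_index are not shown in this document and may differ. Strides are measured in elements and the layout is assumed to be row-major.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical strided view over a float buffer (strides in elements).
struct ToyView {
    float* data;
    std::vector<std::size_t> shape;
    std::vector<std::size_t> strides;
};

// Row-major contiguity: each stride equals the product of the later extents.
bool toy_is_contiguous(const ToyView& t) {
    std::size_t expected = 1;
    for (std::size_t d = t.shape.size(); d-- > 0;) {
        if (t.shape[d] != 1 && t.strides[d] != expected) return false;
        expected *= t.shape[d];
    }
    return true;
}

// Map a logical row-major element index to a storage offset via the strides.
std::size_t toy_flat_index(const ToyView& t, std::size_t linear) {
    std::size_t offset = 0;
    for (std::size_t d = t.shape.size(); d-- > 0;) {
        offset += (linear % t.shape[d]) * t.strides[d];
        linear /= t.shape[d];
    }
    return offset;
}

// SGD-style update that picks the fast or the stride-aware path.
void toy_sgd_update(ToyView& param, const ToyView& grad, float lr) {
    assert(param.shape == grad.shape);
    std::size_t n = 1;
    for (std::size_t s : param.shape) n *= s;
    if (toy_is_contiguous(param) && toy_is_contiguous(grad)) {
        // Fast path: one linear sweep over both buffers.
        for (std::size_t i = 0; i < n; ++i)
            param.data[i] -= lr * grad.data[i];
    } else {
        // Fallback: translate every logical index through the strides.
        for (std::size_t i = 0; i < n; ++i)
            param.data[toy_flat_index(param, i)] -=
                lr * grad.data[toy_flat_index(grad, i)];
    }
}
```

The fast-path loop has no dependencies between iterations, which is what makes it a natural candidate for OpenMP parallelisation; the fallback pays a modulo and a division per dimension for every element.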