Skip to main content

Variables & the Computation Graph

This page covers the two foundational functions of the autograd engine: create_leaf, which wraps a raw tensor into a graph-trackable Variable, and backward, which traverses the computation graph in reverse and accumulates gradients. Everything else in the autograd module builds on these two.

Header: include/autograd/autograd.hpp
Source: src/autograd/engine.cpp


create_leaf — Entering the Graph

Variable *create_leaf(Arena *arena, Tensor *data, bool requires_grad);

create_leaf is the entry point into the autograd graph. It wraps a raw Tensor in a Variable and declares whether that tensor participates in gradient computation.

ParameterDescription
arenaArena on which the Variable struct and its grad tensor are allocated.
dataThe tensor holding the actual float values. Not copied — the Variable holds a pointer to it.
requires_gradIf true, a zero-initialised gradient tensor is allocated alongside data and gradients will flow back through this node.
// A learnable parameter — gradients needed
Tensor *w_tensor = tensor_create(perm_arena, 2, shape);
autograd::Variable *W = autograd::create_leaf(perm_arena, w_tensor, true);

// An input batch — no gradient needed
Tensor *x_tensor = tensor_create(graph_arena, 2, batch_shape);
autograd::Variable *x = autograd::create_leaf(graph_arena, x_tensor, false);

What create_leaf sets

v->data = data;
v->requires_grad = requires_grad;
v->is_leaf = true;
v->backward_fn = nullptr; // leaves have no backward function
v->parents = nullptr; // leaves have no parents
v->num_parents = 0;
v->grad = requires_grad
? tensor_create_zeros(arena, data->ndims, data->shape)
: nullptr;

Leaf Variable structs allocated on perm_arena persist for the program lifetime — that is how parameters outlive the graph. Leaf Variable structs for input data are typically allocated on graph_arena and freed with the rest of the graph after each batch.

Two kinds of leaves

Kindarenarequires_gradPurpose
Parameterperm_arenatrueModel weights, biases, BatchNorm gamma/beta. Optimizer reads grad to update data.
Input / targetgraph_arenafalseBatch features and labels. No gradient needed; grad is nullptr.

Non-leaf Variables

Every autograd::op() call (e.g. autograd::relu, autograd::matmul) produces a non-leaf Variable:

v->is_leaf = false;
v->backward_fn = [](Variable *self, Arena *arena) { ... };
v->parents = arena->push_array<Edge>(num_inputs);
v->num_parents = num_inputs;
v->saved_tensors = ...; // tensors the backward_fn needs

Non-leaf Variables are always allocated on the graph arena. They are the intermediate activations, the loss node, and everything in between. They exist only for the duration of a single forward–backward step.


The Computation Graph

After a forward pass through a three-layer network:

x (leaf, no grad)
W1 (leaf, grad) ──► matmul ──► z1 ──► relu ──► a1
b1 (leaf, grad) ──► add ───────────────────────────► z2 ──► cross_entropy ──► loss
W2 (leaf, grad) ──► matmul ──────────────────────────────►
b2 (leaf, grad) ──► add ──────────────────────────────────►

Each interior node (matmul, relu, add, cross_entropy) is a Variable with:

  • A backward_fn lambda that knows how to push gradients toward its parents.
  • A parents array pointing to the Variables that were its inputs.
  • A saved_tensors array holding whatever tensor data the backward needs (e.g. relu saves the pre-activation input to know which elements were positive).

backward — Running the Chain Rule

void backward(Arena *arena, Variable *loss_node);

Runs the full backward pass from loss_node to all leaf parameters that have requires_grad = true.

ParameterDescription
arenaThe graph arena. Temporary tensors inside backward functions are allocated here and survive until pop_to is called externally.
loss_nodeThe scalar loss Variable to differentiate. Must have requires_grad = true.
autograd::backward(graph_arena, loss);

What backward does internally

// 1. Seed the loss gradient: dL/dL = 1
tensor_fill(loss_node->grad, 1.0f);

// 2. Build the topological order of the graph
std::vector<Variable *> topo;
std::unordered_set<Variable *> visited;
build_topo(loss_node, visited, topo); // recursive DFS

// 3. Call each backward_fn in reverse topological order
for (auto it = topo.rbegin(); it != topo.rend(); ++it) {
if ((*it)->backward_fn)
(*it)->backward_fn(*it, arena);
}

The topological sort (DFS from the loss node, appending each node after its children) guarantees that when backward_fn for node v is called, v->grad already contains the full accumulated upstream gradient — every downstream node has already contributed its piece via tensor_add(parent->grad, parent->grad, local_grad).

Gradient accumulation

Every backward_fn adds to parent->grad rather than assigning:

tensor_add(parent->grad, parent->grad, local_grad);

This is correct because a parameter may appear in multiple places in the graph (e.g. a weight matrix used in several layers via tensor_view). Each use contributes a separate gradient contribution, and they all accumulate into the same grad tensor.

This is also why zero_grad() must be called before each backward() — without it, gradients from the previous step remain in the grad tensors and corrupt the current update.


A Complete Step in Code

// --- Permanent setup (once) ---
optim::Adam adam(perm_arena, seq->parameters(), 0.001f);
nn::CrossEntropyLoss criterion;

// --- Per-batch training loop ---
for (each batch) {
uint64_t pos = graph_arena->get_pos(); // save graph arena position

// 1. Wrap batch data as non-gradient leaves
auto *x = autograd::create_leaf(graph_arena, batch_features, false);
auto *y = autograd::create_leaf(graph_arena, batch_labels, false);

// 2. Forward pass — builds the computation graph on graph_arena
auto *pred = seq->forward(graph_arena, x);
auto *loss = criterion.forward(graph_arena, pred, y);

// 3. Zero gradients — clear perm_arena grad tensors from last step
adam.zero_grad();

// 4. Backward pass — walks the graph, accumulates into perm grad tensors
autograd::backward(graph_arena, loss);

// 5. Optimizer step — reads grad tensors, updates data tensors
adam.step();

// 6. Free the graph — all non-leaf Variables and temporaries gone in O(1)
graph_arena->pop_to(pos);
}

saved_tensors vs parents

These two fields serve distinct purposes and are easy to confuse:

FieldTypePurpose
parentsEdge* (→ Variable*)Graph topology. Used by build_topo for traversal and by backward_fn to reach the parent's grad tensor.
saved_tensorsTensor**Data needed by the backward computation. Contains raw tensor pointers, not Variables.

For relu, the backward_fn needs to know which elements of the input were positive:

// Forward:
out->saved_tensors[0] = in->data; // the pre-activation input tensor

// Backward closure:
tensor_relu_grad(local_grad,
self->saved_tensors[0], // pre-activation values
self->grad); // upstream gradient
tensor_add(parent->grad, parent->grad, local_grad);

Only the tensor data is saved, not the entire Variable — because the backward function only needs the float values, not the graph structure.


Checking requires_grad

Operations propagate requires_grad conservatively: if any input has requires_grad = true, the output also has requires_grad = true. If no input requires a gradient, the output doesn't need one either and no backward function is registered.

Variable *out = arena->push<Variable>();
out->requires_grad = a->requires_grad || b->requires_grad;

if (out->requires_grad) {
out->grad = tensor_create_zeros(arena, ...);
out->backward_fn = [...](Variable *self, Arena *arena) { ... };
// ... wire parents, saved_tensors ...
}

This means operations involving only non-gradient inputs (e.g. two input batch tensors multiplied together for data augmentation) produce no graph nodes and incur no autograd overhead.


Memory Layout

perm_arena: graph_arena (one batch):
┌─────────────────────┐ ┌─────────────────────────────────────┐ ← saved pos
│ W1.data W1.grad │ │ batch x, y (leaf Variable structs) │
│ b1.data b1.grad │◄──────────►│ z1, a1, z2 (non-leaf Variables) │
│ W2.data W2.grad │ grad acc │ loss Variable + grad seed │
│ b2.data b2.grad │ │ backward temporaries (local_grad …) │
└─────────────────────┘ └─────────────────────────────────────┘
↑ pop_to(saved pos) → all gone

After pop_to, only perm_arena content survives: the updated parameter data tensors and their (now-zeroed, ready for next step) gradient tensors.