Tensors in the Autograd Engine
The tensor module provides the data. The autograd engine provides the memory of what you did with that data — so it can run the chain rule backwards. Here's how the two connect.
autograd::Variable
Every tensor that participates in a differentiable computation is wrapped in a Variable:
struct Variable {
Tensor *data; // The actual tensor values
Tensor *grad; // Gradient accumulator (same shape as data)
bool requires_grad; // Should we compute gradients for this?
bool is_leaf; // True for leaves made by create_leaf (parameters and input data); false for op outputs
Edge *parents; // Inputs to the op that created this Variable
uint32_t num_parents;
Tensor **saved_tensors; // Tensors saved for the backward pass
uint32_t num_saved;
uint32_t reduction; // For loss ops
float metadata_float; // alpha, delta, scale, etc.
void (*backward_fn)(Variable *self, Arena *arena);
};
Variable structs live on the graph arena — they're freed en masse when you call graph_arena->pop_to(pos) after each batch.
create_leaf — wrapping a tensor
Variable *x = autograd::create_leaf(graph_arena, t_x, false);
create_leaf is how you turn a raw Tensor into something the autograd graph can track:
Variable *create_leaf(Arena *arena, Tensor *data, bool requires_grad) {
Variable *v = arena->push<Variable>();
v->data = data;
v->requires_grad = requires_grad;
v->is_leaf = true;
v->backward_fn = nullptr; // Leaves don't have a backward fn
v->parents = nullptr; // Leaves have no inputs, so graph traversal stops here
v->num_parents = 0;
if (requires_grad) {
v->grad = tensor_create_zeros(arena, data->ndims, data->shape);
} else {
v->grad = nullptr;
}
return v;
}
- requires_grad = true: This is a parameter. The optimizer will update it, and backward will accumulate into v->grad.
- requires_grad = false: This is data (input batch, targets). No gradients are needed, so no grad tensor is allocated.
How Ops Build the Graph
Every differentiable operation (e.g. autograd::relu) does three things:
- Compute the output using the corresponding tensor_* function.
- Allocate a new Variable for the result on the graph arena.
- Wire up the backward function.
Variable *relu(Arena *arena, Variable *in) {
// 1. Forward computation
Tensor *out_data = tensor_create_zeros(arena, in->data->ndims, in->data->shape);
tensor_relu(out_data, in->data);
// 2. Allocate output Variable
Variable *out = arena->push<Variable>();
out->data = out_data;
out->requires_grad = in->requires_grad;
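out->is_leaf = false;
// Safe defaults for the no-gradient path, in case push<Variable>() does not zero memory
out->grad = nullptr;
out->backward_fn = nullptr;
out->parents = nullptr;
out->num_parents = 0;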
if (out->requires_grad) {
// Allocate grad tensor
out->grad = tensor_create_zeros(arena, out_data->ndims, out_data->shape);
// Wire parents (for graph traversal)
out->num_parents = 1;
out->parents = arena->push_array<Edge>(1);
out->parents[0] = {in};
// Save tensors needed by backward
out->num_saved = 1;
out->saved_tensors = arena->push_array<Tensor *>(1);
out->saved_tensors[0] = in->data; // Need input to compute gradient mask
// 3. Backward closure
out->backward_fn = [](Variable *self, Arena *temp_arena) {
Variable *parent = self->parents[0].node;
if (!parent->requires_grad) return;
Tensor *local_grad = tensor_create_zeros(
temp_arena, parent->grad->ndims, parent->grad->shape);
tensor_relu_grad(local_grad, self->saved_tensors[0], self->grad);
tensor_add(parent->grad, parent->grad, local_grad);
};
}
return out;
}
The result is a DAG (directed acyclic graph) of Variable nodes:
input ─► relu ─► linear ─► cross_entropy ─► loss
│ │ │ │
└parent └parent └parent └parent
│ │
[saved: [saved:
input] input, weight]
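To make the masking step in the relu backward function concrete, here is a minimal sketch using std::vector<float> in place of Tensor. The names relu_grad_sketch and accumulate are hypothetical stand-ins; the real tensor_relu_grad signature and its treatment of inputs that are exactly zero are assumptions here (this sketch gives them a zero gradient).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for tensor_relu_grad: passes the upstream gradient
// through wherever the saved forward input was positive, and zeroes it elsewhere.
std::vector<float> relu_grad_sketch(const std::vector<float> &saved_input,
                                    const std::vector<float> &upstream_grad) {
    assert(saved_input.size() == upstream_grad.size());
    std::vector<float> local_grad(saved_input.size(), 0.0f);
    for (std::size_t i = 0; i < saved_input.size(); ++i) {
        if (saved_input[i] > 0.0f) local_grad[i] = upstream_grad[i];
    }
    return local_grad;
}

// Mirrors tensor_add(parent->grad, parent->grad, local_grad): adds in place.
void accumulate(std::vector<float> &parent_grad,
                const std::vector<float> &local_grad) {
    assert(parent_grad.size() == local_grad.size());
    for (std::size_t i = 0; i < parent_grad.size(); ++i) {
        parent_grad[i] += local_grad[i];
    }
}
```

With a saved input of {-1, 0, 2} and an upstream gradient of {10, 20, 30}, the local gradient is {0, 0, 30}: only the element whose input was positive lets the gradient through.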
Backward Pass
autograd::backward(arena, loss_node) walks the graph in reverse topological order, from the loss back to the leaves:
void backward(Arena *arena, Variable *loss_node) {
// Start gradient: d(loss)/d(loss) = 1
tensor_fill(loss_node->grad, 1.0f);
// Topological sort of the computation graph
std::unordered_set<Variable *> visited; // Nodes already sorted (container type is an assumption)
std::vector<Variable *> topo;
build_topo(loss_node, visited, topo);
// Reverse traversal: apply each backward_fn
for (auto it = topo.rbegin(); it != topo.rend(); ++it) {
if ((*it)->backward_fn)
(*it)->backward_fn(*it, arena);
}
}
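build_topo is not shown above. One plausible shape for it is a post-order depth-first search, sketched below under the assumption that visited is a std::unordered_set and that Edge exposes a node pointer (as parents[0].node suggests). The Variable here is a stripped-down stand-in with only the traversal fields.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Minimal stand-ins for the real types; only the fields needed for traversal.
struct Variable;
struct Edge { Variable *node; };
struct Variable {
    Edge *parents = nullptr;
    uint32_t num_parents = 0;
};

// Post-order DFS: a node is appended only after all of its parents, so the
// loss node ends up last and reverse iteration visits it first.
void build_topo(Variable *v, std::unordered_set<Variable *> &visited,
                std::vector<Variable *> &topo) {
    assert(v != nullptr);
    if (!visited.insert(v).second) return; // already visited via another path
    for (uint32_t i = 0; i < v->num_parents; ++i) {
        build_topo(v->parents[i].node, visited, topo);
    }
    topo.push_back(v);
}
```

The visited set is what keeps a node shared by several ops from being appended twice. Because this version is recursive, a very deep graph could exhaust the call stack; an explicit stack would avoid that.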
Each backward_fn:
- Computes the local gradient contribution using tensor_*_grad.
- Adds it to the parent's grad tensor (tensor_add(parent->grad, parent->grad, local_grad)).
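Gradients accumulate via tensor_add rather than being overwritten. A Variable used by several ops receives one contribution per use, and the chain rule requires their sum, so this is the correct behaviour for parameters that appear in multiple places in the graph.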
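A scalar sketch shows why adding is the right rule when one parameter is used twice. Scalar, mul_backward_wrt_w, and grad_of_shared_weight are hypothetical names for illustration only, not part of the engine.

```cpp
#include <cassert>

// Scalar stand-in for a Variable: just a value and a gradient accumulator.
struct Scalar {
    float data;
    float grad;
};

// Backward step for out = w * x with respect to w: d(out)/dw = x, so the
// contribution is upstream * x. It is added to w.grad, never assigned.
void mul_backward_wrt_w(Scalar &w, float x, float upstream) {
    w.grad += upstream * x;
}

// loss = w * x1 + w * x2, so d(loss)/dw = x1 + x2.
float grad_of_shared_weight(float w_value, float x1, float x2) {
    Scalar w{w_value, 0.0f};
    const float upstream = 1.0f;          // d(loss)/d(loss)
    mul_backward_wrt_w(w, x1, upstream);  // contribution from the first use
    mul_backward_wrt_w(w, x2, upstream);  // contribution from the second use
    return w.grad;
}
```

For x1 = 2 and x2 = 3 the accumulated gradient is 5, matching the analytic derivative. Had the second call overwritten the first, the result would have been 3.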
Memory Layout During a Batch
graph_arena (before batch):
┌─────────────────────────────┐ ← saved_pos
│                             │
│                             │
└─────────────────────────────┘
graph_arena (during forward):
┌──────────────┬──────────────┬──────────────┬──────────────┐
│ batch data │ activations │ Variables │ grad tensors │
│ (x, y) │ (post-relu) │ (DAG) │ │
└──────────────┴──────────────┴──────────────┴──────────────┘
After pop_to(saved_pos): entirely reclaimed → back to empty
Everything allocated after saved_pos — intermediate tensors, grad tensors, the graph nodes themselves — vanishes in a single pointer reset. The only things that survive are on perm_arena: the model parameters and their gradient accumulators.
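The single pointer reset can be sketched with a small bump allocator. ArenaSketch, its push signature, and run_batches are illustrative assumptions; only pop_to and the saved-position pattern come from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bump allocator; the real Arena's fields and signatures may differ.
struct ArenaSketch {
    std::vector<uint8_t> buffer;
    std::size_t pos = 0;

    explicit ArenaSketch(std::size_t capacity) : buffer(capacity) {}

    // Bump-allocate `size` bytes at an offset aligned to `align` (a power of two).
    void *push(std::size_t size, std::size_t align = alignof(std::max_align_t)) {
        std::size_t start = (pos + align - 1) & ~(align - 1);
        assert(start + size <= buffer.size() && "arena out of memory");
        pos = start + size;
        return buffer.data() + start;
    }

    // Release everything allocated after `saved_pos` by resetting one offset.
    void pop_to(std::size_t saved_pos) {
        assert(saved_pos <= pos);
        pos = saved_pos;
    }
};

// Simulates `batches` iterations; returns the arena offset after the last one.
std::size_t run_batches(ArenaSketch &graph_arena, int batches) {
    for (int b = 0; b < batches; ++b) {
        std::size_t saved_pos = graph_arena.pos;
        graph_arena.push(256 * sizeof(float)); // stands in for activations
        graph_arena.push(256 * sizeof(float)); // stands in for grad tensors
        graph_arena.pop_to(saved_pos);         // whole batch reclaimed at once
    }
    return graph_arena.pos;
}
```

Because every batch ends where it started, a 4 KB arena can serve any number of batches in this sketch. Nothing is freed individually, which is also why pointers into the popped region must not be used afterwards.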
saved_tensors vs parents
These serve different purposes:
- parents: Links for graph traversal (the topology). Points to Variable nodes.
- saved_tensors: Data needed by the backward function. Points to Tensor data, because the input data is what you need to compute the local gradient, not the whole Variable.
For relu, the backward needs the pre-activation values (to know which elements were positive). For cross_entropy_loss, the backward needs both the logits and the targets. Only what's strictly necessary is saved.