Arena Allocator

The Arena is GradCore-Tensor's memory backbone. If you've ever watched a neural network training session grind to a halt because malloc decided it was a great time to coalesce free blocks, you'll appreciate what an arena allocator does for you: it doesn't bother with any of that.

The Core Idea

An arena (also called a "linear allocator" or "bump allocator") reserves one contiguous block of memory up front and tracks a single write position, pos, inside it.

To allocate, the arena simply advances pos by the requested size. To "free" an entire batch's worth of computation graphs, it resets pos back to where it was before. No per-object tracking, no fragmentation, no surprises.

Allocation: O(1) — bump a pointer.
Deallocation of a scope: O(1) — restore a saved position.
Deallocation of an individual object: not supported (and you don't need it).
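
The three properties above can be illustrated with a toy bump allocator. This is a minimal sketch over a fixed buffer, not GradCore-Tensor's Arena: the BumpArena type and its members are hypothetical names, and it leaves out zeroing, alignment, and growth.

```cpp
#include <cassert>
#include <cstdint>

// Toy fixed-capacity bump allocator (hypothetical, for illustration only).
struct BumpArena {
    unsigned char buffer[1024];
    std::uint64_t pos = 0;  // offset of the next free byte

    // O(1): hand out the next `size` bytes and advance pos.
    void *push(std::uint64_t size) {
        assert(pos + size <= sizeof(buffer) && "toy arena is full");
        void *result = buffer + pos;
        pos += size;
        return result;
    }

    // O(1): drop everything allocated after `saved` in one step.
    void pop_to(std::uint64_t saved) {
        assert(saved <= pos);
        pos = saved;
    }
};
```

Because freeing is just a store to pos, the memory handed out after a pop_to is the same memory that was in use before it. Nothing is returned to a general-purpose heap in between.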

Struct Layout

struct Arena {
    Arena   *current;       // Points to the most-recently-grown chunk
    Arena   *prev;          // Linked list of grown chunks

    uint64_t reserve_size;  // Total virtual address range reserved
    uint64_t commit_size;   // Chunk size for committing new pages
    bool     growable;      // Can this arena chain new chunks?

    uint64_t base_pos;      // Absolute offset of this chunk's start
    uint64_t pos;           // Current write position within this chunk
    uint64_t commit_pos;    // How far we've committed to the OS
};

Creating an Arena

Arena *perm_arena  = Arena::create(MiB(1024), MiB(64), true);
Arena *graph_arena = Arena::create(MiB(512),  MiB(32), true);

Parameter	Meaning
`reserve_size`	Virtual address space to reserve (does not consume physical RAM)
`commit_size`	Pages committed from the OS in chunks of this size
`growable`	If `true`, chains a new chunk when the current one fills up

MiB(n) and KiB(n) are constexpr helpers:

constexpr uint64_t MiB(uint64_t n) { return n << 20; }
constexpr uint64_t KiB(uint64_t n) { return n << 10; }
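
To make the difference between reserve_size and commit_size concrete, here is a sketch of the arithmetic a commit-on-demand arena might perform when an allocation passes the committed region. The helper next_commit_pos is hypothetical and the real implementation may round or clamp differently. The sketch assumes commits happen in whole multiples of commit_size and never exceed reserve_size.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t MiB(std::uint64_t n) { return n << 20; }

// Hypothetical helper: returns how far memory must be committed so that
// the byte range [0, new_pos) is backed by committed pages.
inline std::uint64_t next_commit_pos(std::uint64_t new_pos,
                                     std::uint64_t commit_pos,
                                     std::uint64_t commit_size,
                                     std::uint64_t reserve_size) {
    assert(commit_size > 0 && new_pos <= reserve_size);
    if (new_pos <= commit_pos) return commit_pos;  // already committed
    // Round the new end up to the next multiple of commit_size.
    std::uint64_t rounded = (new_pos + commit_size - 1) / commit_size * commit_size;
    // Never commit past the reserved address range.
    return rounded < reserve_size ? rounded : reserve_size;
}
```

Under these assumptions, an arena created with MiB(1024) and MiB(64) would commit its first 64 MiB on the first allocation and another 64 MiB only once pos crosses that boundary.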

Two-arena pattern

GradCore-Tensor uses two arenas everywhere:

Permanent arena (perm_arena) — model parameters, optimizer state, datasets. Lives for the entire program.
Graph arena (graph_arena) — forward-pass activations, autograd graph nodes, batch tensors. Rewound after every batch.

This split is what makes the memory model so clean: you never need to track which intermediate tensors to free.

Allocating on an Arena

`push<T>()` — allocate a single object

Tensor *t = arena->push<Tensor>();

Allocates sizeof(T) bytes, zero-initialises them, and returns a typed pointer. The arena hands back zeroed storage only and never calls a constructor, so use placement new when the type needs one:

nn::Linear *l = perm_arena->push<nn::Linear>();
new (l) nn::Linear(perm_arena, 8, 128);
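
The split between "zeroed storage" and "constructed object" can be seen in a toy version of push<T>(). This is a sketch only: ZeroingArena and Layer are hypothetical stand-ins, and the sketch assumes allocations are aligned to sizeof(void *) as described in the Alignment section.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <new>

// Toy stand-in for an arena: a fixed buffer and a bump position.
struct ZeroingArena {
    static constexpr std::size_t kAlign = sizeof(void *);
    alignas(std::max_align_t) unsigned char buffer[256];
    std::size_t pos = 0;

    // Returns zeroed, aligned storage for one T. No constructor runs here.
    template <typename T>
    T *push() {
        static_assert(alignof(T) <= kAlign, "toy arena cannot over-align");
        std::size_t start = (pos + kAlign - 1) & ~(kAlign - 1);  // align up
        assert(start + sizeof(T) <= sizeof(buffer) && "toy arena is full");
        std::memset(buffer + start, 0, sizeof(T));
        pos = start + sizeof(T);
        return reinterpret_cast<T *>(buffer + start);
    }
};

// Hypothetical type whose constructor sets non-zero state.
struct Layer {
    int in_features;
    int out_features;
    Layer(int in, int out) : in_features(in), out_features(out) {}
};
```

After push<Layer>() the bytes are all zero. The fields hold 8 and 128 only once new (ptr) Layer(8, 128) has run on the returned pointer.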

`push_array<T>(count)` — allocate an array

float *data = arena->push_array<float>(784);
Edge  *edges = arena->push_array<Edge>(num_parents);

`push_raw(size)` — raw bytes

void *buf = arena->push_raw(my_size_in_bytes);

All three variants take a non_zero flag that defaults to false, which means the memory is zeroed. Pass true to skip zeroing for a small performance gain when you're about to overwrite the memory immediately anyway.

Saving and Restoring Position

This is the killer feature. Before a forward pass:

uint64_t saved = graph_arena->get_pos();

// ... entire forward pass, loss, backward ...

graph_arena->pop_to(saved);   // All graph tensors gone in O(1)

get_pos() returns the current absolute write position across all chunks. pop_to(pos) rewinds to exactly that position, releasing any chunks that were grown past it.

`pop(size)` vs `pop_to(pos)`

arena->pop(1024);           // Free the last 1024 bytes
arena->pop_to(saved_pos);   // Free everything after saved_pos

Prefer pop_to — it's less error-prone than remembering how many bytes you pushed.
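
One way pop(size) can go wrong is alignment padding. The sketch below uses a hypothetical PaddedArena that tracks positions only and assumes every allocation start is rounded up to 8 bytes, with the padding counted in pos. Whether the real Arena accounts for padding this way is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Assumed alignment, matching sizeof(void *) on 64-bit platforms.
constexpr std::uint64_t kPadAlign = 8;

// Toy position counter: no memory, just the bookkeeping.
struct PaddedArena {
    std::uint64_t pos = 0;

    void push(std::uint64_t size) {
        pos = (pos + kPadAlign - 1) & ~(kPadAlign - 1);  // align the start
        pos += size;
    }
    void pop(std::uint64_t size) { assert(size <= pos); pos -= size; }
    void pop_to(std::uint64_t saved) { assert(saved <= pos); pos = saved; }
    std::uint64_t get_pos() const { return pos; }
};
```

With a saved position of 8, pushing 3 bytes and then 5 bytes moves pos to 21, because the second allocation starts at 16. Calling pop(3 + 5) leaves pos at 13 rather than 8, while pop_to(8) restores it exactly.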

ArenaTemp — RAII scope guard

For temporary scratch work inside a function:

{
    ArenaTemp temp(scratch_arena);
    // allocate freely on scratch_arena ...
} // temp's destructor calls pop_to automatically

ArenaTemp stores the position at construction and restores it at destruction. It is move-only (no copies).
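
A guard with this behaviour could be written roughly as follows. ScopedRewind and CountingArena are hypothetical names for illustration, not the library's ArenaTemp. The sketch shows only the save-on-construct, restore-on-destruct, move-only shape described above.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Toy arena exposing only the calls the guard needs.
struct CountingArena {
    std::uint64_t pos = 0;
    std::uint64_t get_pos() const { return pos; }
    void push_raw(std::uint64_t size) { pos += size; }
    void pop_to(std::uint64_t saved) { assert(saved <= pos); pos = saved; }
};

// Hypothetical guard modelled on the description of ArenaTemp.
class ScopedRewind {
public:
    explicit ScopedRewind(CountingArena *arena)
        : arena_(arena), saved_(arena->get_pos()) {}
    ~ScopedRewind() { if (arena_) arena_->pop_to(saved_); }

    // Copying would rewind twice, so copies are disabled.
    ScopedRewind(const ScopedRewind &) = delete;
    ScopedRewind &operator=(const ScopedRewind &) = delete;

    // Moving transfers the duty to rewind; the moved-from guard is inert.
    ScopedRewind(ScopedRewind &&other) noexcept
        : arena_(std::exchange(other.arena_, nullptr)), saved_(other.saved_) {}
    ScopedRewind &operator=(ScopedRewind &&) = delete;

private:
    CountingArena *arena_;
    std::uint64_t saved_;
};
```

Deleting the copy operations matters because two guards holding the same saved position would both rewind, and the second rewind could discard allocations made after the first guard ended.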

Thread-local Scratch Arenas

ArenaTemp scratch = scratch_get(conflicts, num_conflicts);

scratch_get returns an ArenaTemp over a thread-local scratch arena that is not one of the arenas in the conflicts array. This is used internally by backward passes to avoid aliasing issues when multiple arenas share the same thread. There are ARENA_NUM_SCRATCH (= 2) scratch arenas per thread.
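
The conflict check can be pictured as a small search over the per-thread pool. The sketch below is a guess at the selection logic, not the library's code: pick_scratch and ScratchArena are hypothetical, and it returns a raw pointer where the real scratch_get returns an ArenaTemp.

```cpp
#include <cassert>
#include <cstddef>

struct ScratchArena { int id; };  // stand-in; a real arena would own memory

constexpr std::size_t kNumScratch = 2;  // mirrors ARENA_NUM_SCRATCH

// Hypothetical selection logic: return the first thread-local scratch arena
// that is not listed in `conflicts`, or nullptr if every one conflicts.
inline ScratchArena *pick_scratch(ScratchArena *const *conflicts,
                                  std::size_t num_conflicts) {
    assert(conflicts != nullptr || num_conflicts == 0);
    thread_local ScratchArena pool[kNumScratch] = {{0}, {1}};
    for (std::size_t i = 0; i < kNumScratch; ++i) {
        bool in_use = false;
        for (std::size_t j = 0; j < num_conflicts; ++j) {
            if (conflicts[j] == &pool[i]) { in_use = true; break; }
        }
        if (!in_use) return &pool[i];
    }
    return nullptr;
}
```

With two arenas in the pool, a function that receives one scratch arena from its caller as an output arena can still obtain the other one for its own temporaries.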

Growing Across Chunks

When a growable arena runs out of reserved virtual memory, push_raw chains a new chunk.

get_pos() returns current->base_pos + current->pos — a single global offset that works correctly across all chunks. pop_to walks the linked list backwards, releasing chunks that are completely freed.
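
The interaction between base_pos, get_pos() and pop_to can be sketched with a toy chunk list that tracks positions but owns no user memory. Chunk and ChunkedArena are hypothetical types. The sketch assumes a new chunk's base_pos is the global position at the moment it is chained, and it uses new and delete where the real arena would reserve and release pages.

```cpp
#include <cassert>
#include <cstdint>

// Toy chunk: bookkeeping only.
struct Chunk {
    Chunk *prev;
    std::uint64_t base_pos;  // global offset where this chunk starts
    std::uint64_t pos;       // bytes used inside this chunk
    std::uint64_t capacity;
};

struct ChunkedArena {
    Chunk *current;

    explicit ChunkedArena(std::uint64_t capacity)
        : current(new Chunk{nullptr, 0, 0, capacity}) {}
    ~ChunkedArena() {
        while (current) { Chunk *p = current->prev; delete current; current = p; }
    }
    ChunkedArena(const ChunkedArena &) = delete;
    ChunkedArena &operator=(const ChunkedArena &) = delete;

    // One global offset that is valid across all chunks.
    std::uint64_t get_pos() const { return current->base_pos + current->pos; }

    void push(std::uint64_t size) {
        assert(size <= current->capacity);
        if (current->pos + size > current->capacity) {
            // Chain a new chunk that starts where the old one stopped.
            current = new Chunk{current, get_pos(), 0, current->capacity};
        }
        current->pos += size;
    }

    void pop_to(std::uint64_t target) {
        assert(target <= get_pos());
        // Release chunks that lie entirely at or past the target.
        while (current->prev && current->base_pos >= target) {
            Chunk *dead = current;
            current = current->prev;
            delete dead;
        }
        current->pos = target - current->base_pos;
    }
};
```

Because a saved position is a global offset rather than a pointer into one chunk, the caller does not need to know whether the arena grew between get_pos() and pop_to.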

Non-growable arenas

If growable = false and the arena fills up, push_raw calls platform::exit(1). Don't set growable = false unless you've sized the arena carefully.

Alignment

All allocations are aligned to sizeof(void*) (8 bytes on 64-bit platforms). This is ARENA_ALIGN:

constexpr size_t ARENA_ALIGN = sizeof(void *);

The arena header itself is placed at the very start of the first reserved chunk, so sizeof(Arena) bytes are consumed before any user data.
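
Rounding a position up to ARENA_ALIGN is usually done with a mask. The helper below is a common idiom shown as a sketch; the name align_up is an assumption and the library may spell it differently. It relies on the alignment being a power of two, which holds for sizeof(void *) on mainstream platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t ARENA_ALIGN = sizeof(void *);

// Rounds `value` up to the next multiple of `align` (a power of two).
inline std::uint64_t align_up(std::uint64_t value, std::uint64_t align) {
    assert(align != 0 && (align & (align - 1)) == 0 && "align must be a power of two");
    return (value + align - 1) & ~(align - 1);
}
```

For example, with an alignment of 8, a position of 13 rounds up to 16, while a position that is already a multiple of 8 is left unchanged.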

Destroying an Arena

arena->destroy();

Walks the chunk linked list and calls platform::mem_release (i.e. munmap on Linux) on each. After destroy(), the pointer is invalid — don't use it.

Full Example

// Permanent storage: model lives here forever
auto* perm  = Arena::create(MiB(512), MiB(32), true);

// Temporary storage: rewound every batch
auto* graph = Arena::create(MiB(256), MiB(16), true);

// Allocate model parameters on perm
uint32_t shape[2] = {784, 128};
Tensor *weights = tensor_create(perm, 2, shape);

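// Train loop (model, optimizer and dataloader are assumed to be set up elsewhere;
// x and y stand for the current batch's inputs and targets)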
for (int epoch = 0; epoch < 40; epoch++) {
    for (auto& batch : dataloader) {
        uint64_t pos = graph->get_pos();   // Save graph state

        // Forward pass (everything allocated on graph)
        auto* out  = model.forward(graph, x);
        auto* loss = cross_entropy_loss(graph, out, y, REDUCTION_MEAN);

        // Backward pass (also on graph)
        backward(graph, loss);
        optimizer.step(graph);

        graph->pop_to(pos);               // Free entire batch graph
    }
}

perm->destroy();
graph->destroy();
