Arithmetic Operations

These are the low-level tensor math functions. Every layer in GradCore-Tensor ultimately calls one or more of these. They all follow the same conventions:

Output tensor is always the first argument.
The output must be pre-allocated by the caller (typically on an arena).
Return bool — true on success, false on shape mismatch or null input.
All support both contiguous (fast path) and non-contiguous (correct-but-slower) tensors.
OpenMP parallelisation is enabled when compiled with -fopenmp.

Element-wise Operations

`tensor_add`

bool tensor_add(Tensor *out, const Tensor *a, const Tensor *b);

Computes out = a + b element-wise with broadcasting.

Broadcasting rules (NumPy-style): dimensions are compatible if they are equal or one of them is 1. The output shape is the element-wise maximum of the two input shapes, right-aligned.

a: shape [32, 128]       b: shape [1, 128]
─────────────────────────────────────────────
out: shape [32, 128]     (b is broadcast over the batch dimension)

This is how bias addition works in nn::Linear:

// out = matmul(x, weight) + bias
// bias has shape [1, out_features], out has shape [batch, out_features]
out = autograd::add(compute_arena, out, bias);

`tensor_sub`

bool tensor_sub(Tensor *out, const Tensor *a, const Tensor *b);

out = a - b, with the same broadcasting rules as tensor_add. Used in gradient computations (e.g. d(a-b)/db = -1, so the gradient is negated after summation).

`tensor_mul`

bool tensor_mul(Tensor *out, const Tensor *a, const Tensor *b);

out = a * b, element-wise with broadcasting. This is Hadamard product, not matrix multiplication. Used in:

Gradient of the mul autograd op: grad_a = grad_output * b
Dropout mask application

`tensor_scale`

void tensor_scale(Tensor *t, float scale);

In-place scalar multiplication: t *= scale. Unlike the element-wise ops, this modifies t directly rather than writing to a separate output. Used by:

Weight update: tensor_scale(grad, learning_rate)
Gradient negation: tensor_scale(reduced_grad, -1.0f) in the sub backward pass
LBFGS search direction scaling

Matrix Multiplication

`mat_mul`

bool mat_mul(Tensor *out,
             const Tensor *a,
             const Tensor *b,
             bool zero_out,
             bool transpose_a,
             bool transpose_b);

Batched matrix multiplication supporting optional transposition of either operand. This is the performance-critical kernel.

Parameter	Meaning
`out`	Pre-allocated output tensor
`a`, `b`	Input matrices (must be ≥ 2D)
`zero_out`	If `true`, zero `out` before accumulating
`transpose_a`	Treat `a` as transposed for the multiply
`transpose_b`	Treat `b` as transposed for the multiply

The effective operation:

out = (transpose_a ? a^T : a)  @  (transpose_b ? b^T : b)

Shapes

For 2D inputs with transpose_a=false, transpose_b=false:

a: [M, K]
b: [K, N]
out: [M, N]

For the common case of Linear forward pass:

// x: [batch, in_features]   weight: [in_features, out_features]
// out: [batch, out_features]
mat_mul(out_data, x->data, weight->data, true, false, false);

And the backward pass:

// grad_a = grad_output @ weight^T    (shape: [batch, in_features])
mat_mul(local_grad_a, self->grad, b_data, true, false, true);

// grad_b = x^T @ grad_output         (shape: [in_features, out_features])
mat_mul(local_grad_b, a_data, self->grad, true, true, false);

Batched Operation

For inputs with ndims > 2, the leading dimensions are treated as a batch. Broadcasting applies to the batch dimensions:

a: [batch, M, K]
b: [1, K, N]        ← broadcast over batch
out: [batch, M, N]

Implementation: Blocked GEMM

The kernel uses a classic blocked (tiled) GEMM strategy to maximise cache utilisation:

for m_block in [0, M, BLOCK_M=64]:
  for k_block in [0, K, BLOCK_K=256]:
    for n_block in [0, N, BLOCK_N=64]:
      micro_kernel(out[m..m+MR, n..n+NR],
                   a[m..m+MR, k..k+MK],
                   b[k..k+MK, n..n+NR])

The micro-kernel (MR=6 × NR=16) uses AVX2 FMA instructions when available:

AVX2 path:    processes 6×16 = 96 output elements per k-iteration
              using eight 256-bit registers (_mm256_fmadd_ps)

Scalar path:  generic fallback for CPUs without AVX2

Compile with -mavx2 -mfma (see Getting Started) to enable the fast path. The #if defined(USE_AVX2) branch is selected at compile time.

Reduction Operations

`tensor_sum`

float total = tensor_sum(Tensor *t);

Sum of all elements. Returns a scalar float (not a tensor). OpenMP reduction for the contiguous case.

`tensor_sum_to_shape`

bool tensor_sum_to_shape(Tensor *out, const Tensor *in);

Reduces in to the shape of out by summing over broadcast-expanded dimensions. This is the gradient accumulation operation for tensor_add:

in:  shape [32, 128]
out: shape [1, 128]   ← sum over dim 0 (batch)

Used in the add and sub backward passes to accumulate gradients back into bias tensors (which have shape [1, features]).

Fast vs Slow Path

Every element-wise operation checks tensor_is_contiguous and takes one of two paths:

if (tensor_is_contiguous(a) && tensor_is_contiguous(b) && tensor_is_contiguous(out)) {
    // Fast path: direct pointer arithmetic
    for (uint64_t i = 0; i < out->size; i++) {
        out->storage->data[out->offset + i] =
            a->storage->data[a->offset + i] + b->storage->data[b->offset + i];
    }
} else {
    // Correct path: stride-aware indexing
    uint32_t indices[MAX_TENSOR_DIMS] = {0};
    for (uint64_t i = 0; i < out->size; i++) {
        out->storage->data[tensor_get_flat_index(out, indices)] =
            a->storage->data[tensor_get_flat_index(a, indices)] +
            b->storage->data[tensor_get_flat_index(b, indices)];
        // advance indices ...
    }
}

In practice, tensors created by tensor_create and then used directly (without transposing) will always hit the fast path.

Element-wise Operations​

tensor_add​

tensor_sub​

tensor_mul​

tensor_scale​

Matrix Multiplication​

mat_mul​

Shapes​

Batched Operation​

Implementation: Blocked GEMM​

Reduction Operations​

tensor_sum​

tensor_sum_to_shape​

Fast vs Slow Path​