Activation Operations

Every activation function in GradCore-Tensor has an autograd-layer wrapper. Each wrapper computes the activation in the forward pass, saves whatever tensors its backward function needs, and registers a backward_fn that calls the corresponding tensor_*_grad function.

Header: include/autograd/autograd.hpp
Source: src/autograd/ops/activations/

What is saved for backward

Every activation except softmax saves the pre-activation input tensor (in->data), because its gradient formula is evaluated at the original input values: relu_grad checks which elements were positive, and sigmoid_grad recomputes σ(x) from the input. Softmax instead saves its output (the softmax probabilities), because its Jacobian is expressed more cleanly in terms of the output values.

Common Structure

Every activation op follows this pattern:

Variable *relu(Arena *arena, Variable *in) {
    // 1. Allocate output tensor and compute forward value
    Tensor *out_data = tensor_create_zeros(arena, in->data->ndims, in->data->shape);
    tensor_relu(out_data, in->data);

    // 2. Allocate output Variable
    Variable *out = arena->push<Variable>();
    out->data          = out_data;
    out->requires_grad = in->requires_grad;
    out->is_leaf       = false;

    if (out->requires_grad) {
        // 3. Allocate grad tensor, wire parent, save input for backward
        out->grad            = tensor_create_zeros(arena, out_data->ndims, out_data->shape);
        out->num_parents     = 1;
        out->parents         = arena->push_array<Edge>(1);
        out->parents[0]      = {in};
        out->num_saved       = 1;
        out->saved_tensors   = arena->push_array<Tensor *>(1);
        out->saved_tensors[0] = in->data;

        // 4. Register backward function
        out->backward_fn = [](Variable *self, Arena *temp_arena) {
            Variable *parent = self->parents[0].node;
            if (!parent->requires_grad) return;
            Tensor *local_grad = tensor_create_zeros(
                temp_arena, parent->grad->ndims, parent->grad->shape);
            tensor_relu_grad(local_grad, self->saved_tensors[0], self->grad);
            tensor_add(parent->grad, parent->grad, local_grad);
        };
    }
    return out;
}

The structure is identical for every activation — only the forward call, the saved tensor choice, and the backward call differ.
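The final tensor_add in the backward function accumulates into parent->grad rather than overwriting it, so an input consumed by several ops receives one contribution per consumer. The following is a minimal, self-contained sketch of that behaviour. ToyNode and toy_relu are hypothetical stand-ins written for this illustration (flat float vectors instead of Tensor, no Arena); they are not part of GradCore-Tensor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Variable, reduced to flat float buffers.
struct ToyNode {
    std::vector<float> data;    // forward value
    std::vector<float> grad;    // accumulated gradient, same length as data
    std::vector<float> saved;   // tensor saved for backward (the input, for relu)
    ToyNode *parent = nullptr;  // single parent edge
    void (*backward_fn)(ToyNode *self) = nullptr;
};

// Forward: compute max(0, x), save the input, register the backward function.
inline ToyNode toy_relu(ToyNode *in) {
    ToyNode out;
    out.data.resize(in->data.size());
    for (std::size_t i = 0; i < in->data.size(); ++i)
        out.data[i] = in->data[i] > 0.0f ? in->data[i] : 0.0f;
    out.grad.assign(in->data.size(), 0.0f);
    out.saved = in->data;
    out.parent = in;
    out.backward_fn = [](ToyNode *self) {
        ToyNode *p = self->parent;
        assert(p->grad.size() == self->grad.size());
        for (std::size_t i = 0; i < self->grad.size(); ++i) {
            // Local gradient: pass upstream through where the saved input was positive.
            float local = self->saved[i] > 0.0f ? self->grad[i] : 0.0f;
            p->grad[i] += local;  // accumulate, never overwrite
        }
    };
    return out;
}
```

If the same ToyNode feeds two toy_relu calls and both outputs receive an upstream gradient of 1, the parent ends up with a gradient of 2 at every positive element and 0 elsewhere.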

`relu`

Variable *relu(Arena *arena, Variable *in);

f(x) = max(0, x). Saves in->data (pre-activation) for the backward pass.

Backward: tensor_relu_grad — passes through the upstream gradient where the input was positive, zeros it where the input was negative or zero.

nn layer: nn::ReLU

`relu6`

Variable *relu6(Arena *arena, Variable *in);

f(x) = clamp(x, 0, 6). Saves in->data for the backward pass.

Backward: tensor_relu6_grad — passes the upstream gradient only where 0 < x ≤ 6, zeros elsewhere.

nn layer: nn::ReLU6

`leaky_relu`

Variable *leaky_relu(Arena *arena, Variable *in, float alpha);

f(x) = x if x > 0, else alpha * x. The alpha value is stored in out->metadata_float and passed to the backward function. Saves in->data.

Backward: tensor_leaky_relu_grad(out, saved_input, upstream_grad, self->metadata_float).
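As a rough illustration of what that call computes, here is an element-wise reference kernel over raw float arrays. The name leaky_relu_grad_ref and the explicit length parameter n are assumptions made for this sketch; only the argument order (output, saved input, upstream gradient, alpha) is taken from the call above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical reference kernel, not the library's implementation.
inline void leaky_relu_grad_ref(float *out, const float *saved_input,
                                const float *upstream_grad, std::size_t n,
                                float alpha) {
    assert(out && saved_input && upstream_grad);
    for (std::size_t i = 0; i < n; ++i) {
        // Slope is 1 for x > 0 and alpha otherwise, matching the forward definition.
        float slope = saved_input[i] > 0.0f ? 1.0f : alpha;
        out[i] = slope * upstream_grad[i];
    }
}
```

With alpha = 0.5, inputs {-2, 0, 3} and upstream gradients {1, 1, 2} yield {0.5, 0.5, 2}.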

nn layer: nn::LeakyReLU(float alpha = 0.01f)

`elu`

Variable *elu(Arena *arena, Variable *in, float alpha);

f(x) = x if x > 0, else alpha * (exp(x) - 1). The alpha value is stored in out->metadata_float, as for leaky_relu. Saves in->data.

Backward: tensor_elu_grad. The derivative is 1 for x > 0 and alpha * exp(x) for x ≤ 0, matching the two branches of the forward definition.
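A scalar sketch of the ELU forward and derivative, useful for checking the formula by hand. The function names elu_ref and elu_grad_ref are hypothetical and operate on a single double rather than a Tensor.

```cpp
#include <cmath>

// Hypothetical scalar reference for the ELU forward pass.
inline double elu_ref(double x, double alpha) {
    return x > 0.0 ? x : alpha * (std::exp(x) - 1.0);
}

// Hypothetical scalar reference for the ELU derivative.
// On the x <= 0 branch, alpha * exp(x) equals elu_ref(x, alpha) + alpha.
inline double elu_grad_ref(double x, double alpha) {
    return x > 0.0 ? 1.0 : alpha * std::exp(x);
}
```

The identity in the comment follows directly from the forward definition: alpha * (exp(x) - 1) + alpha = alpha * exp(x).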

nn layer: nn::ELU(float alpha = 1.0f)

`gelu`

Variable *gelu(Arena *arena, Variable *in);

f(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³))). Saves in->data.

Backward: tensor_gelu_grad — derivative is computed analytically from the tanh approximation:

f'(x) = 0.5 * (1 + t) + 0.5 * x * (1 - t²) * (√(2/π) + C * x²)
where t = tanh(√(2/π) * (x + 0.044715 * x³))
      C = 3 * 0.044715 * √(2/π) ≈ 0.1070322243
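One way to sanity-check the derivative above is to compare it against a central finite difference of the forward formula. The sketch below does this on scalars; gelu_ref, gelu_grad_ref and gelu_central_diff are hypothetical helper names, not library functions.

```cpp
#include <cassert>
#include <cmath>

// sqrt(2 / pi), written out so the sketch has no dependency on M_PI.
static const double kSqrt2OverPi = 0.7978845608028654;
static const double kCubic = 0.044715;

// Tanh approximation of GELU, as given in the formula above.
inline double gelu_ref(double x) {
    double t = std::tanh(kSqrt2OverPi * (x + kCubic * x * x * x));
    return 0.5 * x * (1.0 + t);
}

// Analytic derivative: product rule plus chain rule through tanh.
inline double gelu_grad_ref(double x) {
    double t = std::tanh(kSqrt2OverPi * (x + kCubic * x * x * x));
    double c = 3.0 * kCubic * kSqrt2OverPi;  // the constant C above
    return 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * (kSqrt2OverPi + c * x * x);
}

// Central finite difference of the forward function.
inline double gelu_central_diff(double x, double h = 1e-5) {
    assert(h > 0.0);
    return (gelu_ref(x + h) - gelu_ref(x - h)) / (2.0 * h);
}
```

At x = 0 the tanh term vanishes and the derivative is exactly 0.5; at other points the analytic and finite-difference values should agree to well within 1e-6.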

nn layer: nn::GELU

`swish`

Variable *swish(Arena *arena, Variable *in);

f(x) = x * σ(x) = x / (1 + exp(-x)). Saves in->data.

Backward: tensor_swish_grad — derivative is σ(x) * (1 + x * (1 - σ(x))).
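For reference, the derivative follows from the product rule together with the sigmoid identity σ'(x) = σ(x)(1 - σ(x)):

```latex
\begin{aligned}
f(x)  &= x\,\sigma(x), \qquad \sigma'(x) = \sigma(x)\bigl(1-\sigma(x)\bigr) \\
f'(x) &= \sigma(x) + x\,\sigma'(x) \\
      &= \sigma(x) + x\,\sigma(x)\bigl(1-\sigma(x)\bigr) \\
      &= \sigma(x)\Bigl(1 + x\bigl(1-\sigma(x)\bigr)\Bigr)
\end{aligned}
```

An equivalent form is f'(x) = f(x) + σ(x)(1 - f(x)), which reuses the forward value.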

nn layer: nn::Swish

`sigmoid`

Variable *sigmoid(Arena *arena, Variable *in);

f(x) = 1 / (1 + exp(-x)). Saves in->data.

Backward: tensor_sigmoid_grad — derivative is σ(x) * (1 - σ(x)), recomputed from the saved input.

nn layer: nn::Sigmoid

`tanh`

Variable *tanh(Arena *arena, Variable *in);

f(x) = tanh(x). Saves in->data.

Backward: tensor_tanh_grad — derivative is 1 - tanh²(x).

nn layer: nn::Tanh

`hard_sigmoid`

Variable *hard_sigmoid(Arena *arena, Variable *in);

f(x) = clamp((x + 3) / 6, 0, 1). Saves in->data.

Backward: tensor_hard_sigmoid_grad — derivative is 1/6 where -3 < x < 3, zero elsewhere.

nn layer: nn::HardSigmoid

`hard_swish`

Variable *hard_swish(Arena *arena, Variable *in);

f(x) = x * HardSigmoid(x). Saves in->data.

Backward: tensor_hard_swish_grad — piecewise derivative:

x ≤ -3 :  0
x ≥  3 :  1
else   :  (2x + 3) / 6

nn layer: nn::HardSwish

`softplus`

Variable *softplus(Arena *arena, Variable *in);

f(x) = log(1 + exp(x)); for large x the implementation uses the linear approximation f(x) ≈ x. Saves in->data.

Backward: tensor_softplus_grad — derivative is σ(x) = 1 / (1 + exp(-x)), which is exactly the sigmoid of the input.
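A scalar sketch of how a softplus with a linear branch for large inputs might look, together with its sigmoid derivative. The threshold value of 20 and the names softplus_ref and softplus_grad_ref are assumptions chosen for illustration; the cutoff the library actually uses is not stated here.

```cpp
#include <cmath>

// Hypothetical scalar softplus. Above the (assumed) threshold the function
// returns x directly, since log(1 + exp(x)) - x is about 2e-9 at x = 20.
inline double softplus_ref(double x, double threshold = 20.0) {
    if (x > threshold) return x;
    return std::log1p(std::exp(x));
}

// Derivative of softplus: the sigmoid of the input.
inline double softplus_grad_ref(double x) {
    return 1.0 / (1.0 + std::exp(-x));
}
```

The linear branch also keeps exp(x) from overflowing for very large inputs.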

nn layer: nn::SoftPlus

`softmax`

Variable *softmax(Arena *arena, Variable *in, int32_t dim = -1);

f(x)_i = exp(x_i - max(x)) / Σ exp(x_j - max(x)), applied along dim. The dim value is stored in out->metadata_float.

What is saved: Unlike all other activations, softmax saves the output (out_data) rather than the input. The Jacobian of softmax is expressed in terms of its output values:

∂f_i/∂x_j = f_i * (δ_ij - f_j)

Backward: tensor_softmax_grad — the efficient formulation avoids constructing the full Jacobian:

out_i = f_i * (grad_i - Σ_j(f_j * grad_j))

This computes the dot product Σ f_j * grad_j once and uses it to adjust every element.
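To see that the efficient formulation matches the full Jacobian product, the sketch below implements both for a single 1-D vector. All three function names are hypothetical, and the code works on std::vector<double> rather than Tensor; handling of the dim argument is omitted.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax over one vector (subtracts the maximum first).
inline std::vector<double> softmax_ref(const std::vector<double> &x) {
    assert(!x.empty());
    double m = x[0];
    for (double v : x) m = v > m ? v : m;
    std::vector<double> f(x.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        f[i] = std::exp(x[i] - m);
        sum += f[i];
    }
    for (double &v : f) v /= sum;
    return f;
}

// Efficient backward: one dot product, then one pass. O(n).
inline std::vector<double> softmax_grad_fast(const std::vector<double> &f,
                                             const std::vector<double> &grad) {
    assert(f.size() == grad.size());
    double dot = 0.0;
    for (std::size_t j = 0; j < f.size(); ++j) dot += f[j] * grad[j];
    std::vector<double> out(f.size());
    for (std::size_t i = 0; i < f.size(); ++i) out[i] = f[i] * (grad[i] - dot);
    return out;
}

// Naive backward: sum over the explicit Jacobian entries. O(n^2).
inline std::vector<double> softmax_grad_jacobian(const std::vector<double> &f,
                                                 const std::vector<double> &grad) {
    assert(f.size() == grad.size());
    std::vector<double> out(f.size(), 0.0);
    for (std::size_t j = 0; j < f.size(); ++j)
        for (std::size_t i = 0; i < f.size(); ++i) {
            double jac = f[i] * ((i == j ? 1.0 : 0.0) - f[j]);  // df_i/dx_j
            out[j] += grad[i] * jac;
        }
    return out;
}
```

Both functions return the same vector up to rounding, and its elements sum to zero because the softmax outputs always sum to one.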

Do not use softmax before CrossEntropyLoss

autograd::cross_entropy_loss applies log-softmax internally. Adding autograd::softmax before it applies softmax twice, producing a wrong loss value and gradient. The nn::CrossEntropyLoss documentation has the same warning — it bears repeating here.
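The effect of the double application can be seen numerically. The sketch below uses hypothetical scalar helpers (cross_entropy_ref and softmax_probs, not the library's functions) to compare the loss on raw logits against the loss computed after an extra softmax.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cross-entropy for one sample: negative log-softmax of the target entry.
inline double cross_entropy_ref(const std::vector<double> &z, std::size_t target) {
    assert(target < z.size());
    double m = z[0];
    for (double v : z) m = v > m ? v : m;
    double sum = 0.0;
    for (double v : z) sum += std::exp(v - m);
    return -(z[target] - m - std::log(sum));
}

// Plain softmax, standing in for the redundant call placed before the loss.
inline std::vector<double> softmax_probs(const std::vector<double> &z) {
    assert(!z.empty());
    double m = z[0];
    for (double v : z) m = v > m ? v : m;
    std::vector<double> p(z.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < z.size(); ++i) {
        p[i] = std::exp(z[i] - m);
        sum += p[i];
    }
    for (double &v : p) v /= sum;
    return p;
}
```

For logits {10, 0, 0} with target 0, cross_entropy_ref on the logits is close to zero, while cross_entropy_ref on softmax_probs(logits) stays above 0.55. Because probabilities lie in [0, 1], the double-softmax loss for n classes can never drop below log(1 + (n - 1) / e), so a confident, correct prediction is still penalised.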

nn layer: nn::Softmax(int32_t dim = -1)

Quick Reference

Function	Signature	Saves	nn layer
`relu`	`(arena, in)`	`in->data`	`nn::ReLU`
`relu6`	`(arena, in)`	`in->data`	`nn::ReLU6`
`leaky_relu`	`(arena, in, alpha)`	`in->data`	`nn::LeakyReLU`
`elu`	`(arena, in, alpha)`	`in->data`	`nn::ELU`
`gelu`	`(arena, in)`	`in->data`	`nn::GELU`
`swish`	`(arena, in)`	`in->data`	`nn::Swish`
`sigmoid`	`(arena, in)`	`in->data`	`nn::Sigmoid`
`tanh`	`(arena, in)`	`in->data`	`nn::Tanh`
`hard_sigmoid`	`(arena, in)`	`in->data`	`nn::HardSigmoid`
`hard_swish`	`(arena, in)`	`in->data`	`nn::HardSwish`
`softplus`	`(arena, in)`	`in->data`	`nn::SoftPlus`
`softmax`	`(arena, in, dim=-1)`	`out_data` (output!)	`nn::Softmax`
