Weight Initialisation

Good weight initialisation is not optional: initialise too small and gradients vanish, too large and they explode. The functions in nn::init provide the standard schemes from the deep learning literature, namely Xavier (Glorot) and Kaiming (He) initialisation plus simple uniform, normal, and constant fills.

Header: include/nn/utils/initialization.hpp
Namespace: gradientcore::nn::init

What they call

All initialisation functions write directly into a Variable's underlying Tensor data buffer. The random schemes draw their samples from prng::randf() or prng::std_norm() in the PRNG module; the constant schemes need no random numbers. Every function modifies the tensor in-place and returns nothing. None of them build autograd graph nodes, because initialisation happens before training, outside the computation graph.

General Signature

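All initialisation functions take an autograd::Variable* as their first parameter. uniform_, normal_, and constant_ add optional scalar parameters after it; the rest follow this shape: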

void scheme_(autograd::Variable *weight);

The trailing underscore (_) follows the PyTorch convention indicating an in-place operation. The function modifies weight->data directly.

All functions print an error and return early if weight or weight->data is null.
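The in-place, null-checked shape described above can be sketched without the library. Tensor and Variable below are hypothetical stand-ins whose field names are assumptions, and fill_ returns a bool purely so the guard is easy to observe; the documented functions return void.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical stand-ins for the library's types (field names are assumptions).
struct Tensor {
    float* data;
    std::uint64_t size;
};
struct Variable {
    Tensor* data;
};

// Illustrative in-place initialiser: validate the pointers, then overwrite
// the buffer. The bool return exists only for this sketch.
inline bool fill_(Variable* weight, float value) {
    if (weight == nullptr || weight->data == nullptr) {
        std::fprintf(stderr, "init: weight or weight->data is null\n");
        return false;  // return early, leaving everything untouched
    }
    Tensor* t = weight->data;
    for (std::uint64_t i = 0; i < t->size; ++i) {
        t->data[i] = value;
    }
    return true;
}
```

Calling fill_ with a null Variable, or with a Variable whose data pointer is null, logs a message and leaves memory alone.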

Kaiming (He) Initialisation

Designed for ReLU-family activations. It compensates for the fact that ReLU zeroes out roughly half of all activations, which would otherwise halve the variance at each layer.

`kaiming_normal_`

void kaiming_normal_(autograd::Variable *weight);

Draws weights from a normal distribution:

std = sqrt(2 / fan_in)
W ~ N(0, std²)

The sqrt(2) factor is the He correction for ReLU. fan_in is the number of input units to the layer (i.e. in_features for a Linear layer).

This is the default initialisation for nn::Linear.

init::kaiming_normal_(layer->weight);

`kaiming_uniform_`

void kaiming_uniform_(autograd::Variable *weight);

Draws weights uniformly:

limit = sqrt(6 / fan_in)
W ~ Uniform(-limit, limit)

This has the same variance as kaiming_normal_ (limit² / 3 = 2 / fan_in) but bounded support. Useful when you want to guarantee that no initial weight exceeds limit in magnitude.
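The variance match between the two Kaiming variants can be checked numerically. The sketch below is standalone C++ that does not call the library; kaiming_std, kaiming_limit, and uniform_variance are hypothetical helper names introduced only for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helpers, not part of nn::init.

// Standard deviation of the normal variant: sqrt(2 / fan_in).
inline float kaiming_std(std::uint32_t fan_in) {
    assert(fan_in > 0);
    return std::sqrt(2.0f / static_cast<float>(fan_in));
}

// Half-width of the uniform variant: sqrt(6 / fan_in).
inline float kaiming_limit(std::uint32_t fan_in) {
    assert(fan_in > 0);
    return std::sqrt(6.0f / static_cast<float>(fan_in));
}

// Variance of Uniform(-limit, limit) is limit^2 / 3.
inline float uniform_variance(float limit) {
    return limit * limit / 3.0f;
}
```

For fan_in = 784, kaiming_std gives about 0.0505 and kaiming_limit about 0.0875; squaring the first and passing the second through uniform_variance both yield 2 / 784.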

Xavier (Glorot) Initialisation

Designed for saturating activations such as Tanh and Sigmoid, which are approximately linear near zero. Xavier initialisation balances fan_in against fan_out so that variance stays roughly constant in both the forward pass and the backward pass.

`xavier_normal_`

void xavier_normal_(autograd::Variable *weight);

std = sqrt(2 / (fan_in + fan_out))
W ~ N(0, std²)

fan_in and fan_out are the number of input and output units respectively. For a Linear(M, N) layer: fan_in = M, fan_out = N.

init::xavier_normal_(layer->weight);

`xavier_uniform_`

void xavier_uniform_(autograd::Variable *weight);

limit = sqrt(6 / (fan_in + fan_out))
W ~ Uniform(-limit, limit)

The original Glorot & Bengio (2010) formulation. Still widely used as a general-purpose default, especially with Tanh activations.
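As a quick numeric companion to the two Xavier formulas, the sketch below computes the standard deviation and the uniform limit from a pair of fans. xavier_std and xavier_limit are hypothetical helper names, not library functions.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helpers, not part of nn::init.

// Standard deviation of the normal variant: sqrt(2 / (fan_in + fan_out)).
inline float xavier_std(std::uint32_t fan_in, std::uint32_t fan_out) {
    // Sum in float so large fans cannot overflow the integer type.
    const float total = static_cast<float>(fan_in) + static_cast<float>(fan_out);
    assert(total > 0.0f);
    return std::sqrt(2.0f / total);
}

// Half-width of the uniform variant: sqrt(6 / (fan_in + fan_out)).
inline float xavier_limit(std::uint32_t fan_in, std::uint32_t fan_out) {
    const float total = static_cast<float>(fan_in) + static_cast<float>(fan_out);
    assert(total > 0.0f);
    return std::sqrt(6.0f / total);
}
```

For a Linear(784, 128) layer this gives a standard deviation of about 0.0468 and a limit of about 0.0811. When fan_in equals fan_out the standard deviation reduces to sqrt(1 / fan_in), a factor of sqrt(2) below the Kaiming value.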

Simple Distributions

`uniform_`

void uniform_(autograd::Variable *weight,
              float min_val = -1.0f,
              float max_val =  1.0f);

Fills with values drawn uniformly from [min_val, max_val]:

W ~ Uniform(min_val, max_val)

init::uniform_(layer->weight, -0.1f, 0.1f);
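One plausible way to fill a buffer from unit-interval samples is to rescale each one into the target range. The sketch below assumes a generator that returns floats in [0, 1) and uses std::mt19937 as a stand-in for the library's PRNG; uniform_fill is a hypothetical name.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical sketch: std::mt19937 stands in for the library's PRNG so the
// example compiles on its own.
inline std::vector<float> uniform_fill(std::size_t n, float min_val,
                                       float max_val, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> unit(0.0f, 1.0f);
    std::vector<float> out(n);
    for (float& w : out) {
        const float u = unit(rng);              // unit-interval sample
        w = min_val + u * (max_val - min_val);  // rescale into the target range
    }
    return out;
}
```

With min_val = -0.1f and max_val = 0.1f every generated weight lies inside that interval.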

`normal_`

void normal_(autograd::Variable *weight,
             float mean = 0.0f,
             float std  = 1.0f);

Fills with values drawn from a normal distribution:

W ~ N(mean, std²)

Each sample is a standard normal value produced by the Box-Muller transform via prng::std_norm(), multiplied by std and shifted by mean.

init::normal_(layer->weight, 0.0f, 0.02f);   // typical for small init
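The scale-and-shift step can be illustrated on its own. In this sketch std::normal_distribution stands in for prng::std_norm(), and normal_fill, sample_mean, and sample_std are hypothetical helpers used only to confirm that the samples land near the requested moments.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical sketch: draw z ~ N(0, 1), then map it to mean + std_dev * z.
inline std::vector<float> normal_fill(std::size_t n, float mean,
                                      float std_dev, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<float> std_norm(0.0f, 1.0f);  // stand-in for prng::std_norm()
    std::vector<float> out(n);
    for (float& w : out) {
        w = mean + std_dev * std_norm(rng);
    }
    return out;
}

// Sample mean, accumulated in double for accuracy.
inline double sample_mean(const std::vector<float>& v) {
    if (v.empty()) return 0.0;
    double sum = 0.0;
    for (float x : v) sum += x;
    return sum / static_cast<double>(v.size());
}

// Population standard deviation of the samples.
inline double sample_std(const std::vector<float>& v) {
    if (v.empty()) return 0.0;
    const double m = sample_mean(v);
    double sq = 0.0;
    for (float x : v) sq += (x - m) * (x - m);
    return std::sqrt(sq / static_cast<double>(v.size()));
}
```

With a large sample, sample_mean and sample_std should come out close to the mean and std_dev passed in.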

Constant Initialisation

`constant_`

void constant_(autograd::Variable *weight, float value = 0.0f);

Sets every element to value.

init::constant_(layer->bias, 0.1f);   // small positive bias for ReLU

`zeros_`

inline void zeros_(autograd::Variable *weight) {
    constant_(weight, 0.0f);
}

Sets all elements to zero. Used for bias initialisation (zero bias is the standard default).

init::zeros_(layer->bias);

`ones_`

inline void ones_(autograd::Variable *weight) {
    constant_(weight, 1.0f);
}

Sets all elements to one. Used internally to initialise BatchNorm's gamma (scale) parameter — the identity transform is the right starting point before the network has learned to rescale.

init::ones_(bn->gamma);
init::zeros_(bn->beta);

Fan Calculation

All Kaiming and Xavier functions use an internal calculate_fans helper:

static void calculate_fans(const Tensor *tensor,
                            uint32_t &fan_in,
                            uint32_t &fan_out);

Tensor shape	`fan_in`	`fan_out`
`[M, N]` (2D — Linear weights)	`M`	`N`
`[N]` (1D)	`N`	`N`
Higher-dimensional	Last dim	Product of all other dims

For nn::Linear(in=784, out=128), the weight shape is [784, 128], so fan_in = 784, fan_out = 128.
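The table can be restated as a small function. The sketch below mirrors the rows exactly as documented but takes a plain shape vector instead of a Tensor*; Fans and fans_from_shape are hypothetical names, and returning zeros for an empty shape is an assumption, since the table does not cover that case.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type for this sketch.
struct Fans {
    std::uint32_t fan_in;
    std::uint32_t fan_out;
};

// Follows the documented table row by row.
inline Fans fans_from_shape(const std::vector<std::uint32_t>& shape) {
    if (shape.empty()) return {0, 0};                    // assumption: not in the table
    if (shape.size() == 1) return {shape[0], shape[0]};  // [N]    -> N, N
    if (shape.size() == 2) return {shape[0], shape[1]};  // [M, N] -> M, N
    // Higher-dimensional: fan_in is the last dim, fan_out the product of the rest.
    std::uint32_t rest = 1;
    for (std::size_t i = 0; i + 1 < shape.size(); ++i) {
        rest *= shape[i];
    }
    return {shape.back(), rest};
}
```

A shape of [2, 3, 4] therefore yields fan_in = 4 and fan_out = 6 under the documented rule.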

Choosing an Initialisation Scheme

Activation	Recommended scheme
ReLU, ReLU6, LeakyReLU, ELU	`kaiming_normal_` or `kaiming_uniform_`
GELU, Swish	`kaiming_normal_` (works well in practice)
Tanh, Sigmoid	`xavier_normal_` or `xavier_uniform_`
Linear output (regression)	`xavier_uniform_` or `normal_(0, 0.01)`
Custom / experimental	`uniform_` or `normal_` with manual scale
Biases	`zeros_` (universal default)
BatchNorm gamma	`ones_`
BatchNorm beta	`zeros_`

Re-initialising a Layer After Construction

nn::Linear calls reset_parameters() in its constructor, which applies kaiming_normal_ to weights and zeros_ to bias. You can override this after construction:

auto* l = perm->push<nn::Linear>();
new (l) nn::Linear(perm, 784, 128);   // kaiming_normal applied here

// Override with Xavier for a Tanh network:
init::xavier_uniform_(l->weight);
init::zeros_(l->bias);

model.add_layer(l);

Implementation Notes

Box-Muller Transform

kaiming_normal_, xavier_normal_, and normal_ all use Box-Muller to generate normal random numbers from pairs of uniform samples:

u1, u2 ~ Uniform(0, 1)
z0 = sqrt(-2 * log(u1)) * cos(2π * u2)
z1 = sqrt(-2 * log(u1)) * sin(2π * u2)

Both z0 and z1 are standard normal. The initialisation loops stride by 2 and use both values to avoid wasting half the computation:

for (uint64_t i = 0; i < size; i += 2) {
    float u1 = prng::randf();
    float u2 = prng::randf();
    if (u1 < 1e-7f) u1 = 1e-7f;   // guard against log(0)

    float z0 = std::sqrt(-2.0f * std::log(u1)) * std::cos(2.0f * M_PI * u2);
    data[i] = z0 * std_dev;

    if (i + 1 < size) {
        float z1 = std::sqrt(-2.0f * std::log(u1)) * std::sin(2.0f * M_PI * u2);
        data[i + 1] = z1 * std_dev;
    }
}

If size is odd, the last element uses only z0.
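For experimenting outside the library, the same loop can be written as a self-contained function. This is a sketch under stated assumptions: std::mt19937 replaces prng::randf(), the radius is computed once per pair, a local constant replaces M_PI (which is not guaranteed by the C++ standard), and box_muller_fill is a hypothetical name.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical standalone version of the paired Box-Muller fill.
inline std::vector<float> box_muller_fill(std::size_t size, float std_dev,
                                          unsigned seed) {
    std::mt19937 rng(seed);  // stand-in for the library's PRNG
    std::uniform_real_distribution<float> unit(0.0f, 1.0f);
    const float two_pi = 6.28318530718f;

    std::vector<float> data(size);
    for (std::size_t i = 0; i < size; i += 2) {
        float u1 = unit(rng);
        const float u2 = unit(rng);
        if (u1 < 1e-7f) u1 = 1e-7f;  // guard against log(0)

        const float radius = std::sqrt(-2.0f * std::log(u1));
        const float angle = two_pi * u2;

        data[i] = radius * std::cos(angle) * std_dev;
        if (i + 1 < size) {  // an odd size leaves the final z1 unused
            data[i + 1] = radius * std::sin(angle) * std_dev;
        }
    }
    return data;
}
```

Filling a large, odd-sized buffer and measuring its mean and variance is a simple sanity check: the mean should sit near zero and the variance near std_dev squared.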

Reproducibility

Initialisation draws from the shared PRNG state, which is thread-local. Seed it on the thread that constructs your model, before construction, for reproducible results:

prng::seed(42, 1);   // fixed seed

nn::Model model(perm, graph);
// ... add layers — weights will be the same every run

See PRNG for details.
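The effect of seeding before construction can be shown with a standard-library generator. In this sketch std::mt19937 is a stand-in for the library's PRNG, and init_two_layers is a hypothetical function that draws weights for two layers in a fixed order from one generator.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical sketch: one generator is seeded once, then two "layers" draw
// from it in a fixed order, as happens when layers are constructed in turn.
inline std::vector<float> init_two_layers(unsigned seed, std::size_t first,
                                          std::size_t second) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);

    std::vector<float> weights;
    weights.reserve(first + second);
    for (std::size_t i = 0; i < first; ++i) weights.push_back(dist(rng));   // layer 1
    for (std::size_t i = 0; i < second; ++i) weights.push_back(dist(rng));  // layer 2
    return weights;
}
```

Two calls with the same seed and the same construction order produce identical weights, while a different seed produces a different set. Changing the order in which layers draw from the generator would also change which values each layer receives.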
