Skip to main content

Tutorial 2: MNIST Digit Classification

This tutorial covers multi-class image classification using the MNIST handwritten digit dataset. You will learn how to set up batched loading with DataLoader, build a classifier, train with CrossEntropyLoss, evaluate test accuracy, and run per-sample inference with a visual terminal display.

The full source lives in examples/mnist/.

Dataset - Github


What you'll build

A three-layer MLP that classifies 28×28 greyscale digit images (flattened to 784 floats) into one of 10 classes. This matches GradCore-Tensor's MNIST benchmark: 97.77 % test accuracy.

Architecture at a glance:

Input (784) → Linear(784→128) → ReLU → Linear(128→10)

Optimizer: Adam, Loss: CrossEntropyLoss, LR: 0.0005, Epochs: 40, Batch size: 64.


Prerequisites

The MNIST CSVs are large files tracked by Git LFS (~110 MB train, ~18 MB test). Pull them before building:

git lfs pull
cd examples/mnist
./run.sh # compiles train, inference, and autoencoder binaries

Step 1 — Load and preprocess MNIST

MNIST is distributed as pixel rows in CSV format. The first column is the label (0–9); the remaining 784 columns are pixel intensities (0–255).

#include "../../include/gradient.hpp"
using namespace gradientcore;

// Load training data (skip header row)
auto csv_raw = CSVLoader::load_csv("data/mnist_train.csv", true);

std::vector<std::vector<float>> features, labels_raw, labels_onehot;
CSVLoader::parse_mnist_csv(csv_raw, features, labels_raw);

Normalize pixel values

normalize_minmax scales every value to [0, 1], which stabilizes gradient magnitudes during training:

CSVLoader::normalize_minmax(features);

One-hot encode labels

CrossEntropyLoss expects a target distribution, not a scalar class index. one_hot_encode converts each integer label into a length-10 vector with a single 1.0 at the correct class position:

CSVLoader::one_hot_encode(labels_raw, 10, labels_onehot);
// e.g. label "3" → [0,0,0,1,0,0,0,0,0,0]

Step 2 — Allocate arenas

MNIST training loads 60 000 images. The permanent arena must be large enough to hold the model parameters plus Adam's two moment tensors per parameter:

auto* perm_arena = Arena::create(MiB(1024), MiB(64), true);
auto* graph_arena = Arena::create(MiB(512), MiB(32), true);

The graph arena is rewound after each batch, so its size only needs to accommodate one batch at a time rather than the full dataset.


Step 3 — Define the model

nn::Model model(perm_arena, graph_arena);

auto* l1 = perm_arena->push<nn::Linear>();
new (l1) nn::Linear(perm_arena, 784, 128);
model.add_layer(l1);

auto* relu = perm_arena->push<nn::ReLU>();
new (relu) nn::ReLU();
model.add_layer(relu);

auto* l2 = perm_arena->push<nn::Linear>();
new (l2) nn::Linear(perm_arena, 128, 10);
model.add_layer(l2);

The final Linear(128→10) produces 10 raw logits — one per digit class. CrossEntropyLoss applies softmax internally, so no explicit Softmax layer is needed here.

Verifying the parameter count

std::cout << "Total parameters: "
<< model.get_model()->num_parameters() << std::endl;
// Expected: 784*128 + 128 + 128*10 + 10 = 101 770
if (model.get_model()->num_parameters() == 0) {
std::cerr << "Error: 0 parameters — check add_layer." << std::endl;
return 1;
}

Step 4 — Compile and train

model.compile(
nn::OptimizerType::ADAM,
nn::LossType::CROSS_ENTROPY,
0.0005f, // learning rate
40, // epochs
64 // batch size
);

model.train(features, labels_onehot);

Why CrossEntropyLoss? It applies log-softmax to the logits internally and computes the negative log-likelihood against the one-hot target, penalizing confidently wrong predictions much more than uncertain ones.

Why Adam with LR 0.0005? A lower learning rate prevents the optimizer from overshooting the minimum in the later epochs, which is important when pushing past 97 % accuracy.


Step 5 — Evaluate test accuracy

After training, load the test split and compute top-1 accuracy in batches of 100:

auto test_csv = CSVLoader::load_csv("data/mnist_test.csv", true);
std::vector<std::vector<float>> test_features, test_labels_raw;
CSVLoader::parse_mnist_csv(test_csv, test_features, test_labels_raw);
CSVLoader::normalize_minmax(test_features);

model.get_model()->eval(); // disable dropout, use BatchNorm running stats

uint32_t correct = 0;
uint32_t batch_size = 100;

for (uint32_t i = 0; i < test_features.size(); i += batch_size) {
uint32_t current_bs = std::min(batch_size, (uint32_t)test_features.size() - i);
uint64_t start_pos = graph_arena->get_pos();

uint32_t shape_x[2] = {current_bs, 784};
Tensor* t_x = tensor_create(graph_arena, 2, shape_x);

for (uint32_t b = 0; b < current_bs; b++)
for (uint32_t j = 0; j < 784; j++)
t_x->storage->data[t_x->offset + b * 784 + j] =
test_features[i + b][j];

autograd::Variable* x = autograd::create_leaf(graph_arena, t_x, false);
autograd::Variable* out = model.get_model()->forward(graph_arena, x);

for (uint32_t b = 0; b < current_bs; b++) {
float max_v = -1e9f; int pred = 0;
for (int c = 0; c < 10; c++) {
float v = out->data->storage->data[out->data->offset + b * 10 + c];
if (v > max_v) { max_v = v; pred = c; }
}
if (pred == (int)test_labels_raw[i + b][0]) correct++;
}

graph_arena->pop_to(start_pos);
}

float accuracy = 100.0f * correct / test_features.size();
std::cout << "Test Accuracy: " << correct << " / " << test_features.size()
<< " (" << accuracy << "%)" << std::endl;
// → Test Accuracy: 9777 / 10000 (97.77%)
note

model.get_model()->eval() is critical before evaluating. Without it, BatchNorm continues updating running statistics from test data, corrupting evaluation metrics.


Step 6 — Save the model

#include <filesystem>
std::filesystem::create_directories("bin");
model.save("bin/mnist_model.bin", "binary");

Step 7 — Inference with terminal visualization

The inference binary loads a saved model and lets you inspect any test sample interactively. It renders the digit using ANSI 256-color escape codes directly in your terminal.

if (!model.load("bin/mnist_model.bin")) { return 1; }
model.get_model()->eval();

int n;
std::cout << "Enter sample index (0 to " << features.size() - 1 << "): ";
std::cin >> n;

draw_mnist_digit(features[n].data()); // ANSI terminal render

uint32_t shape[2] = {1, 784};
Tensor* input = tensor_create(graph_arena, 2, shape);
std::memcpy(input->storage->data, features[n].data(), 784 * sizeof(float));

autograd::Variable* x = autograd::create_leaf(graph_arena, input, false);
autograd::Variable* out = model.get_model()->forward(graph_arena, x);

float max_v = -1e9f; int pred = 0;
for (int i = 0; i < 10; ++i)
if (out->data->storage->data[i] > max_v) {
max_v = out->data->storage->data[i]; pred = i;
}

std::cout << "Actual: " << labels_raw[n][0]
<< " Predicted: " << pred << std::endl;

Running the example

cd examples/mnist

# Train (writes bin/mnist_model.bin)
./bin/train_mnist

# Interactive inference
./bin/inference_mnist
# → Enter sample index (0 to 9999): 42
# → [terminal digit visualization]
# → Actual: 7 Predicted: 7

Bonus: Autoencoder

examples/mnist/autoencoder.cpp trains an encoder–decoder network to compress each 784-dimensional digit into a 32-dimensional latent space and reconstruct it:

Input(784) → Linear(784→128) → ReLU
→ Linear(128→32) → ReLU ← bottleneck
→ Linear(32→128) → ReLU
→ Linear(128→784) → Sigmoid

Loss is MSELoss between input and reconstruction — no labels needed. After training it prints both the original and its reconstruction side-by-side in the terminal.

./bin/autoencoder

Key concepts recap

ConceptWhere it appears
CSVLoader::parse_mnist_csvParsing MNIST CSV format
CSVLoader::normalize_minmaxPixel value scaling
CSVLoader::one_hot_encodeConverting integer labels to distributions
nn::Model::compile / trainHigh-level training API
model.get_model()->eval()Switching to inference mode
model.save / model.loadCheckpoint persistence
autograd::create_leaf(..., false)Inference without gradient tracking
graph_arena->pop_toO(1) graph memory reclamation

Next steps

You now understand the core training loop and inference patterns. From here you can explore:

  • Module deep dives — how tensor, autograd, nn, and optim are implemented internally.

  • Adding new layers — see the Contributing Guide for how to add a custom activation or layer type.

  • Extending the autoencoder — try varying the bottleneck dimension or adding noise to inputs for denoising autoencoding.