CSVLoader

CSVLoader is a static utility class for loading, parsing, and preprocessing tabular data from CSV files. It handles the entire pipeline from raw file to training-ready vectors — no external libraries, no Python dependency, no nonsense.

Header: include/nn/data/csv_loader.hpp
Namespace: gradientcore

All methods are static — you never instantiate CSVLoader. Just call the methods directly.


Step 1 — Load Raw CSV

load_csv

static std::vector<std::vector<std::string>>
load_csv(const std::string &filepath, bool skip_header = false);

Parameter     Description
filepath      Path to the CSV file.
skip_header   If true, the first row is discarded. Default false.

Reads the file line by line, splits each line on commas, and trims whitespace from each cell. Returns a 2D vector of strings — one vector<string> per row.

Returns an empty vector and prints an error if the file cannot be opened.

auto raw = CSVLoader::load_csv("data/housing.csv", /*skip_header=*/true);
// raw[0][0] = "-122.23" (first feature of first sample, as a string)

The raw format is intentionally untyped — you decide how to parse the strings in the next step.
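
A minimal sketch of the split-and-trim step described above, assuming naive comma splitting with no support for quoted fields (this page does not say whether load_csv handles quotes). The helper names trim_cell and split_csv_line are hypothetical and are not part of CSVLoader.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: strips leading and trailing whitespace from one cell.
inline std::string trim_cell(const std::string &cell) {
    const char *whitespace = " \t\r\n";
    const std::size_t first = cell.find_first_not_of(whitespace);
    if (first == std::string::npos) return "";  // empty or all-whitespace cell
    const std::size_t last = cell.find_last_not_of(whitespace);
    return cell.substr(first, last - first + 1);
}

// Hypothetical helper: splits one line on commas and trims every cell.
// Assumption: quoted fields containing commas are NOT handled.
// A trailing empty cell after a final comma is dropped by this loop.
inline std::vector<std::string> split_csv_line(const std::string &line) {
    std::vector<std::string> cells;
    std::stringstream stream(line);
    std::string cell;
    while (std::getline(stream, cell, ',')) {
        cells.push_back(trim_cell(cell));
    }
    return cells;
}
```

Trimming "\r" along with spaces and tabs keeps files with Windows line endings from leaving a stray character in the last cell of each row.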


Step 2 — Parse to Floats

parse_csv_to_float

For general tabular CSVs with a fixed number of feature columns:

static void parse_csv_to_float(
    const std::vector<std::vector<std::string>> &csv_data,
    uint32_t feature_cols,
    bool has_label,
    std::vector<std::vector<float>> &features, // output
    std::vector<std::vector<float>> &labels);  // output

Parameter      Description
csv_data       Raw string rows from load_csv.
feature_cols   Number of columns to treat as features (read from the left).
has_label      If true, the column after the feature columns is parsed as the label.
features       Output: one vector<float> per sample.
labels         Output: one vector<float> per sample (single element).

Rows with fewer columns than feature_cols are silently skipped. Cells that fail std::stof are replaced with 0.0f.

std::vector<std::vector<float>> features, labels;
CSVLoader::parse_csv_to_float(raw, /*feature_cols=*/8, /*has_label=*/true,
                              features, labels);
// features[i] has 8 floats, labels[i] has 1 float
std::cout << "Loaded " << features.size() << " samples\n";
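
The skipping and fallback rules above can be sketched for a single row as follows. This is an illustration under stated assumptions, not the library's implementation: cell_to_float and parse_row are hypothetical names, and the handling of a row that has exactly feature_cols columns while has_label is true is a guess, since this page does not specify it.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <string>
#include <vector>

// Hypothetical helper: converts one cell, falling back to 0.0f on failure.
inline float cell_to_float(const std::string &cell) {
    try {
        return std::stof(cell);
    } catch (const std::exception &) {
        // Covers std::invalid_argument (not a number) and std::out_of_range.
        return 0.0f;
    }
}

// Hypothetical helper: parses one row using the documented rules.
// Returns false when the row has fewer columns than feature_cols (row skipped).
inline bool parse_row(const std::vector<std::string> &row,
                      std::uint32_t feature_cols, bool has_label,
                      std::vector<float> &features, std::vector<float> &label) {
    if (row.size() < feature_cols) return false;
    features.clear();
    label.clear();
    for (std::size_t i = 0; i < feature_cols; ++i) {
        features.push_back(cell_to_float(row[i]));
    }
    // Assumption: the label is read only when the label column actually exists.
    if (has_label && row.size() > feature_cols) {
        label.push_back(cell_to_float(row[feature_cols]));
    }
    return true;
}
```

Note that std::stof accepts a numeric prefix, so a cell such as "3.5abc" parses as 3.5 instead of triggering the fallback.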

parse_mnist_csv

Specialised parser for the MNIST CSV format (label in column 0, 784 pixels in columns 1–784):

static void parse_mnist_csv(
    const std::vector<std::vector<std::string>> &csv_data,
    std::vector<std::vector<float>> &features, // output: 784-element vectors
    std::vector<std::vector<float>> &labels);  // output: 1-element vectors [0..9]

auto raw = CSVLoader::load_csv("data/mnist_train.csv", true);
std::vector<std::vector<float>> features, labels;
CSVLoader::parse_mnist_csv(raw, features, labels);
// features[i]: 784 floats (pixel values 0–255)
// labels[i]: 1 float (digit class 0–9)

Rows with fewer than 785 columns are silently skipped.
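
The MNIST row layout can be sketched for one row as below. parse_mnist_row and kMnistPixels are hypothetical names, and the 0.0f fallback for unparseable cells is an assumption carried over from parse_csv_to_float; this page does not state what parse_mnist_csv does in that case.

```cpp
#include <cstddef>
#include <exception>
#include <string>
#include <vector>

constexpr std::size_t kMnistPixels = 784;  // 28 x 28 image, flattened

// Hypothetical helper: column 0 is the label, columns 1..784 are the pixels.
// Returns false for rows with fewer than 785 columns (row skipped).
inline bool parse_mnist_row(const std::vector<std::string> &row,
                            std::vector<float> &pixels,
                            std::vector<float> &label) {
    if (row.size() < kMnistPixels + 1) return false;

    // Assumption: unparseable cells become 0.0f, as in parse_csv_to_float.
    auto to_float = [](const std::string &cell) {
        try {
            return std::stof(cell);
        } catch (const std::exception &) {
            return 0.0f;
        }
    };

    label.assign(1, to_float(row[0]));
    pixels.clear();
    pixels.reserve(kMnistPixels);
    for (std::size_t i = 1; i <= kMnistPixels; ++i) {
        pixels.push_back(to_float(row[i]));
    }
    return true;
}
```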


Step 3 — Preprocess

normalize_minmax

static void normalize_minmax(std::vector<std::vector<float>> &features);

Scales each feature column independently to [0, 1]:

x_norm = (x - min) / (max - min)

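Operates in-place. If a column is constant (min == max), it is left unchanged. Used for MNIST pixel normalisation (pixel values 0–255 → 0–1). Because the scaling is per column, the result matches dividing by 255 only for pixel columns whose observed minimum is 0 and maximum is 255; a constant column, such as a border pixel that is 0 in every image, keeps its original values.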

CSVLoader::normalize_minmax(features);
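
A minimal sketch of per-column min-max scaling as described above, assuming every row has the same number of columns. The function name minmax_columns is hypothetical; it illustrates the formula and is not the library's code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: scales each column to [0, 1] in place.
// Assumption: all rows have the same length as the first row.
inline void minmax_columns(std::vector<std::vector<float>> &features) {
    if (features.empty()) return;
    const std::size_t cols = features[0].size();
    for (std::size_t c = 0; c < cols; ++c) {
        // First pass: find the column's minimum and maximum.
        float lo = features[0][c];
        float hi = features[0][c];
        for (const auto &row : features) {
            lo = std::min(lo, row[c]);
            hi = std::max(hi, row[c]);
        }
        if (lo == hi) continue;  // constant column: left unchanged
        // Second pass: apply x_norm = (x - min) / (max - min).
        for (auto &row : features) {
            row[c] = (row[c] - lo) / (hi - lo);
        }
    }
}
```

For a column holding 0, 50 and 100 this yields 0, 0.5 and 1, while a column that is 5 in every row stays at 5.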

standardize

static void standardize(std::vector<std::vector<float>> &features);

Scales each feature column to zero mean and unit variance:

x_std = (x - mean) / std_dev

Operates in-place. Columns with zero standard deviation are left unchanged. Used for the California Housing dataset where features have wildly different natural scales (longitude vs total rooms vs median income).

CSVLoader::standardize(features);
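
A minimal sketch of per-column standardisation, assuming rectangular input and the population standard deviation (dividing by N). This page does not say whether the library divides by N or N - 1, so treat that choice, and the name standardize_columns, as assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: shifts each column to zero mean and unit variance in place.
// Assumptions: all rows have equal length; std_dev uses the population formula (/ N).
inline void standardize_columns(std::vector<std::vector<float>> &features) {
    if (features.empty()) return;
    const std::size_t n = features.size();
    const std::size_t cols = features[0].size();
    for (std::size_t c = 0; c < cols; ++c) {
        double sum = 0.0;
        for (const auto &row : features) sum += row[c];
        const double mean = sum / static_cast<double>(n);

        double squared_error = 0.0;
        for (const auto &row : features) {
            const double d = row[c] - mean;
            squared_error += d * d;
        }
        const double std_dev = std::sqrt(squared_error / static_cast<double>(n));
        if (std_dev == 0.0) continue;  // zero-variance column: left unchanged

        // Apply x_std = (x - mean) / std_dev.
        for (auto &row : features) {
            row[c] = static_cast<float>((row[c] - mean) / std_dev);
        }
    }
}
```

Accumulating in double keeps rounding error small when columns hold large values, such as total room counts.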

normalize_minmax vs standardize: use min-max for pixel data or bounded inputs, standardize for continuous features with unknown range and possible outliers.


Step 4 — Encode Labels

one_hot_encode

static void one_hot_encode(
    const std::vector<std::vector<float>> &labels,
    uint32_t num_classes,
    std::vector<std::vector<float>> &encoded); // output

Converts integer class labels to one-hot vectors:

label = 3, num_classes = 10
→ [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

Parameter     Description
labels        Input: each element is a 1-element vector containing an integer class index.
num_classes   Total number of classes. Determines output vector length.
encoded       Output: one-hot vectors of length num_classes.

Required for CrossEntropyLoss, which expects a probability distribution (one-hot is the hard-label special case of a distribution).

std::vector<std::vector<float>> labels_onehot;
CSVLoader::one_hot_encode(labels, /*num_classes=*/10, labels_onehot);

Class indices outside [0, num_classes) produce all-zero vectors (not an error, but worth knowing).
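
The encoding rule, including the all-zero result for out-of-range indices, can be sketched for a single label as below. The function name one_hot is hypothetical, and truncating a fractional label toward zero is an assumption; the documentation only covers integer-valued labels.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encodes one class index stored as a float.
// Indices outside [0, num_classes) yield an all-zero vector.
inline std::vector<float> one_hot(float label, std::uint32_t num_classes) {
    std::vector<float> encoded(num_classes, 0.0f);
    // Range-check on the float first so negative values are never cast to unsigned.
    if (label >= 0.0f && label < static_cast<float>(num_classes)) {
        // Assumption: a fractional label is truncated toward zero.
        encoded[static_cast<std::uint32_t>(label)] = 1.0f;
    }
    return encoded;
}
```

With label 3 and num_classes 10 only index 3 is set to 1; with label 10 or -1 every entry stays 0.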


Step 5 — Split

train_test_split

static void train_test_split(
    const std::vector<std::vector<float>> &features,
    const std::vector<std::vector<float>> &labels,
    float train_ratio,
    std::vector<std::vector<float>> &X_train, // output
    std::vector<std::vector<float>> &Y_train, // output
    std::vector<std::vector<float>> &X_test,  // output
    std::vector<std::vector<float>> &Y_test); // output

Splits data into training and test sets by taking the first train_ratio * N samples for training and the rest for testing. The split is deterministic and not shuffled — the order of rows in the CSV determines which samples end up in each split.
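
A sketch of the sequential split on a single matrix, assuming the cut point is train_ratio * N truncated toward zero. The rounding rule is not documented here, and split_index and sequential_split are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Hypothetical helper: number of leading rows that go to the training set.
// Assumption: the fractional part of train_ratio * n is truncated.
inline std::size_t split_index(std::size_t n, float train_ratio) {
    const double cut = static_cast<double>(n) * static_cast<double>(train_ratio);
    return std::min(n, static_cast<std::size_t>(std::max(cut, 0.0)));
}

// Hypothetical helper: the first rows become train, the remainder becomes test.
// Row order is preserved; nothing is shuffled.
inline void sequential_split(const Matrix &data, float train_ratio,
                             Matrix &train, Matrix &test) {
    const auto cut = static_cast<std::ptrdiff_t>(split_index(data.size(), train_ratio));
    train.assign(data.begin(), data.begin() + cut);
    test.assign(data.begin() + cut, data.end());
}
```

Calling the same helper on features and on labels keeps the two aligned, because both are cut at the same index.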

Parameter     Description
train_ratio   Fraction of data for training. 0.8 = 80% train, 20% test.

std::vector<std::vector<float>> X_train, Y_train, X_test, Y_test;
CSVLoader::train_test_split(features, labels, 0.8f,
                            X_train, Y_train,
                            X_test, Y_test);
std::cout << X_train.size() << " train, " << X_test.size() << " test\n";

Shuffle before splitting

train_test_split does not shuffle. If your CSV is sorted by class or time, a sequential split will produce biased train/test sets. Shuffle the rows, keeping features and labels aligned, before calling train_test_split. The DataLoader's built-in shuffle only reorders samples during training, so it cannot change which samples end up in the test set.
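
One way to shuffle externally is to permute features and labels with the same random order before splitting. The sketch below uses only the standard library; shuffle_in_unison is a hypothetical helper, not part of CSVLoader, and it assumes features and labels have the same number of rows.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Hypothetical helper: applies one random permutation to both matrices.
// Assumption: features.size() == labels.size().
inline void shuffle_in_unison(Matrix &features, Matrix &labels, unsigned seed) {
    // Build the index list 0..N-1 and shuffle it with a seeded generator.
    std::vector<std::size_t> order(features.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);

    // Move rows into their new positions so sample i keeps its own label.
    Matrix shuffled_features, shuffled_labels;
    shuffled_features.reserve(order.size());
    shuffled_labels.reserve(order.size());
    for (std::size_t i : order) {
        shuffled_features.push_back(std::move(features[i]));
        shuffled_labels.push_back(std::move(labels[i]));
    }
    features = std::move(shuffled_features);
    labels = std::move(shuffled_labels);
}
```

A fixed seed makes the resulting split reproducible between runs; call the helper once, right before train_test_split.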


Complete Pipeline Examples

California Housing (regression)

// 1. Load
auto raw = CSVLoader::load_csv("data/housing.csv", true);

// 2. Parse
std::vector<std::vector<float>> features, labels;
CSVLoader::parse_csv_to_float(raw, 8, true, features, labels);

// 3. Preprocess
CSVLoader::standardize(features);
for (auto& label : labels) label[0] /= 100000.0f; // scale to ~[0, 5]

// 4. Split
std::vector<std::vector<float>> X_train, Y_train, X_test, Y_test;
CSVLoader::train_test_split(features, labels, 0.8f,
                            X_train, Y_train, X_test, Y_test);

MNIST (classification)

// 1. Load
auto raw = CSVLoader::load_csv("data/mnist_train.csv", true);

// 2. Parse (MNIST-specific)
std::vector<std::vector<float>> features, labels_raw;
CSVLoader::parse_mnist_csv(raw, features, labels_raw);

// 3. Normalise pixels
CSVLoader::normalize_minmax(features);

// 4. One-hot encode labels
std::vector<std::vector<float>> labels_onehot;
CSVLoader::one_hot_encode(labels_raw, 10, labels_onehot);

// 5. No split needed — test set is a separate file
auto test_raw = CSVLoader::load_csv("data/mnist_test.csv", true);
std::vector<std::vector<float>> test_features, test_labels;
CSVLoader::parse_mnist_csv(test_raw, test_features, test_labels);
// Note: this rescales with the test file's own per-column min/max, not the training set's
CSVLoader::normalize_minmax(test_features);