
Dataloader Tutorial

kartikdutt18 edited this page Jun 2, 2020 · 2 revisions

The models repository provides an easy-to-use data loader that loads popular datasets in a single line of code. If you want to use it with some other dataset, you can do that too.

Template Parameters:

DataLoader defaults to using arma::mat for training features and predictions; however, if you want to play around with other Armadillo types, you can simply pass template parameters to change them.

DatasetX : Datatype for loading input features.
DatasetY : Datatype for prediction features.
ScalerType : mlpack's Scaler Object for scaling features.

1. Loading Popular datasets

Use our constructor to pass the required information about the dataset that you want to load and we will download it, extract it and process it so that it's ready to use. For supported datasets take a look at this list.

Constructor Parameters

To try out the data loader without any hassle, just pass the dataset name and whether or not to shuffle the data, and you are done. For example, loading the MNIST dataset is very simple.

Simple Usage

DataLoader<> dataloader("mnist", true);

This will fill TrainFeatures, TrainLabels, ValidFeatures, ValidLabels, and TestFeatures for the data loader. We will discuss them in detail below.

Advanced parameters: We are currently working on adding augmentation support, and we will update this tutorial once it is available.

datasetPath : Path or name of dataset.
shuffle : Whether or not to shuffle the data.
ratio : Ratio for train-test split. Defaults to 0.75.
useScaler : Use feature scaler for pre-processing the dataset. Defaults to false.
augmentation : Adds augmentation to training data only. Defaults to an empty vector.
augmentationProbability : Probability of applying augmentation on dataset. Defaults to 0.2.
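To make the ratio parameter concrete, here is a minimal sketch of how a split ratio partitions N samples into train and validation/test counts. This is a hypothetical helper for illustration only; the loader's actual rounding behavior may differ.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper: take the first `ratio * n` samples for training
// and the remainder for validation/testing.
std::pair<std::size_t, std::size_t> SplitCounts(std::size_t n, double ratio)
{
  const std::size_t trainCount = static_cast<std::size_t>(n * ratio);
  return {trainCount, n - trainCount};
}
```

With the default ratio of 0.75, a dataset of 100 samples would yield 75 training samples and 25 held-out samples.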

Advanced Usage

With the help of the above parameters you can use features such as scaling and augmentation to make your model robust. A sample usage is shown below.

Note: Augmentation class is under development.

DataLoader<arma::mat, arma::mat, mlpack::data::MinMaxScaler> dataloader("Pascal-VOC-detection",
    true, 0.7, true, {"horizontal-flip", "vertical-flip"}, 0.2);

Refer to accessor methods in data loader to understand how to use data loader for training and testing.

2. Loading Other datasets

You can use our data loader to load any dataset you want. We are currently developing image data loaders that read image paths from either CSVs or directories. Until then, we only support CSV datasets as part of our data loader.

Note: We support wrapped indices in our data loader, i.e. an index such as -1 refers to the last column / row, -2 to the second-to-last, and so on.
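The wrapped-index convention can be sketched with a small helper. This is a hypothetical function for illustration, not the data loader's actual implementation:

```cpp
// Resolve a wrapped index: a negative index counts back from the end,
// so -1 maps to the last column / row, -2 to the second-to-last, etc.
long ResolveWrappedIndex(long index, long size)
{
  return (index < 0) ? index + size : index;
}
```

For a dataset with 5 columns, an index of -1 resolves to column 4 and -2 resolves to column 3.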

Downloading datasets

In case your dataset is hosted on a server somewhere, you can use our utility functions to download it.

Utils::DownloadFile("path-in-mlpack-server", "path-where-to-save-the-dataset")

For more details on how to use it to download files from other servers refer to our Utils tutorial wiki page.

Usage

Use the default constructor to create the data loader object. Then use one of our data loader methods to load the data.

Load CSV Method

This method can be used to load CSVs and preprocess the loaded data.

Load CSV Usage

You can simply load a CSV, scale it, perform train-test split and split the data into input features and output labels.

datasetPath : Path to the dataset.
loadTrainData : Boolean to determine whether data will be loaded for
                training or testing. If true, data will be loaded for training.
                Note: If false, augmentation is set to empty, ratio is set to 1,
                and the scaler is only used to transform the test data.
shuffle : Boolean to determine whether or not to shuffle the data.
ratio : Ratio for train-test split.
useScaler : Fits the scaler on training data and transforms dataset.
dropHeader : Drops the first row from CSV.
startInputFeatures : First index of the columns which will be fed into the model as input.
endInputFeature : Last index of the columns which will be fed into the model as input.
startPredictionFeatures : First index of the columns which will be predicted by the model as output.
endPredictionFeatures : Last index of the columns which will be predicted by the model as output.
augmentation : Vector strings of augmentations supported by mlpack.
augmentationProbability : Probability of applying augmentation to a particular cell.

An example is given below:

DataLoader<> irisDataloader;

std::string datasetPath = "./iris.csv";
// Load the data for training rather than testing.
bool isTrainingData = true;
// Shuffle the data while loading.
bool shuffleData = true;
// Ratio for the train-test split.
double ratioForTrainTestSplit = 0.75;
// Fit the scaler on the training data and transform the dataset.
bool useFeatureScaling = true;
// Drop the first row (header) from the CSV.
bool dropHeader = true;
// Starting column index for training features.
int startInputFeatures = 0;
// Ending column index for training features (wrapped index: second-to-last column).
int endInputFeatures = -2;
// Prediction column (wrapped index: last column).
int startInputLabels = -1;

irisDataloader.LoadCSV(datasetPath, isTrainingData, shuffleData, ratioForTrainTestSplit,
    useFeatureScaling, dropHeader, startInputFeatures, endInputFeatures, startInputLabels);

Refer to accessor methods in data loader to understand how to use data loader for training and testing.

3. Accessor Methods : Using DataLoader object for training and inference

We provide access to the loaded data using accessor and modifier functions. This allows you to perform extra pre-processing on the dataset if you want. Details about the data loader members are given below.

TrainFeatures() : Returns input features to be used by the model during training.
TrainLabels() : Returns ground truth for the training input features.

TestFeatures() : Returns input features to be used by the model during testing.
TestLabels() : Returns predictions made by the model for the test input features. Initially empty.

ValidFeatures() : Returns input features to be used by the model during validation.
ValidLabels() : Returns ground truth for the validation input features.

TrainSet() : Returns a tuple containing both TrainFeatures and TrainLabels.

ValidSet() : Returns a tuple containing both ValidFeatures and ValidLabels.

TestSet() : Returns a tuple containing both TestFeatures and TestLabels.
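The tuple-returning accessors pair naturally with C++17 structured bindings. Below is a self-contained sketch of consuming a TrainSet()-style tuple; std::vector stands in for arma::mat so the example compiles without mlpack, and the iris-like data is made up for illustration.

```cpp
#include <tuple>
#include <vector>

// Stand-ins for arma::mat, so this sketch needs no mlpack headers.
using Features = std::vector<std::vector<double>>;
using Labels   = std::vector<double>;

// Hypothetical accessor mimicking DataLoader::TrainSet(): returns the
// training features and their labels together as a tuple.
std::tuple<Features, Labels> TrainSet()
{
  Features features = {{5.1, 3.5, 1.4, 0.2}, {6.2, 3.4, 5.4, 2.3}};
  Labels labels = {0.0, 2.0};
  return {features, labels};
}
```

Unpacking is then a one-liner: `auto [features, labels] = TrainSet();`.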

4. Supported Datasets

Currently we only support the MNIST dataset. Over the next few weeks we will be adding support for various other datasets such as Pascal-VOC, ImageNet, and many more. We are an open-source organization, and we would really appreciate it if you took the time to add a popular dataset to the data loader; alternatively, you can open an issue and someone will get to it.
