Efficient, check-pointed data loading for deep learning with massive data sets.

marian-nmt/infinibatch

 
 

InfiniBatch

Infinibatch is a library of checkpointable iterators for randomized data loading of massive data sets in deep neural network training.

Features

  • support for corpora much larger than fit into RAM
  • hierarchical block- and sentence-level randomization over the whole corpus, with a different randomization in each epoch
  • loads only the data that is needed
  • very fast start-up time (does not need to read the full corpus)
  • requires only the most basic data preparation (e.g. no indexing)
  • for multi-GPU training, each GPU loads only the data it needs
  • 100% accurate checkpointing; restoring from a checkpoint does not re-read all data up to the checkpoint
  • automatic bucketed batching with dynamic batch sizes
  • pre-fetching thread
  • composable, to support complex batching, e.g. negative samples from multiple documents

Getting Started

Infinibatch requires Python 3.6 or higher and has no dependencies. There is presently no pip package.

To install it, clone this repository and install it locally.

git clone https://github.com/microsoft/infinibatch
cd infinibatch
pip install -e .

Documentation

The documentation can be found here: https://microsoft.github.io/infinibatch/

Tutorial

This little tutorial walks you through the steps of preparing your data and consuming it from Python code as batches.

Infinibatch Basics: Iterators and Checkpointing

Infinibatch provides Python iterators to read your data. An iterator represents a stream of data that can be retrieved item by item, e.g. via a for loop or repeatedly calling next() on it.

Infinibatch is agnostic to the data type of the items, which is determined by a user-supplied file-read function. In NLP applications, items would typically be tuples of text. In other applications, they can be images or audio files with textual annotations.

Infinibatch makes it easy to read your data in randomized order, and supports checkpointing, which allows you to restart training exactly where you left off.

Randomization is done on the fly, which means that it is not necessary to read the entire data set into memory to be shuffled. Infinibatch implements a hierarchical shuffling algorithm that only holds a subset of the data in RAM at any point in time.
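
Infinibatch's hierarchical algorithm is more involved, but the core idea of shuffling with a bounded buffer can be sketched in a few lines of plain Python. This is a conceptual illustration only, not Infinibatch's actual implementation:

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def buffered_shuffle(items: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield all items in approximately random order while never holding
    more than buffer_size items in memory at a time."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # move a randomly chosen buffered item to the end and yield it
            i = rng.randrange(len(buffer))
            buffer[i], buffer[-1] = buffer[-1], buffer[i]
            yield buffer.pop()
    rng.shuffle(buffer)  # drain what remains at the end of the stream
    yield from buffer
```

Because at most buffer_size items are buffered, memory use is bounded regardless of corpus size; the trade-off is that the shuffle is only approximate, which is why the buffer should be large in practice.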

Infinibatch iterators are checkpointable. Checkpointing lets you retrieve the current position (the "checkpoint") in the data stream at any time, so that later, you can "rewind" to that same position. The sad reality is that long-running trainings occasionally crash. To be able to continue a crashed training as if it had not crashed, save your Infinibatch iterator's checkpoint to disk whenever you save an intermediate model during training. To restart a crashed training, reset the iterator to the saved checkpoint. The data reader will now yield the exact same data-item sequence it would have yielded without the crash.
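
The checkpointing contract can be illustrated with a toy iterator over an in-memory list. The getstate()/setstate() method names follow Infinibatch's convention; the class itself is a simplified stand-in, not part of the library:

```python
class CheckpointableListIterator:
    """Toy iterator whose position can be saved and restored.
    Illustrates the checkpointing idea only; Infinibatch's own
    iterators work over chunked data streams, not in-memory lists."""
    def __init__(self, items):
        self._items = list(items)
        self._pos = 0
    def __iter__(self):
        return self
    def __next__(self):
        if self._pos >= len(self._items):
            raise StopIteration
        item = self._items[self._pos]
        self._pos += 1
        return item
    def getstate(self):              # save the current position
        return {"pos": self._pos}
    def setstate(self, state):       # rewind to a saved position
        self._pos = state["pos"]

it = CheckpointableListIterator("abcde")
next(it); next(it)              # consume 'a', 'b'
checkpoint = it.getstate()      # save alongside an intermediate model
next(it)                        # training continues: yields 'c'
it.setstate(checkpoint)         # "crash", then restore from the checkpoint
assert next(it) == "c"          # the exact same item is yielded again
```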

Data Preparation

Infinibatch has one requirement on your data organization: To use your data with Infinibatch, it must be split into a large number of small chunks. A chunk is the smallest unit of data that is loaded from disk into RAM. Infinibatch holds a random subset of chunks in memory that it randomly draws samples from.

Below, we show how such a split can be created. An easy way to split your data into chunks is with the Linux split command.

In this tutorial, our "corpus" consists of 6 lines of text, where each line is one data item. To create that corpus, please run this command in a bash shell. It creates a 6-line text file named corpus.txt:

echo \
'Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
The quick brown fox jumps over the lazy dog.' \
> corpus.txt

Now let us split it into 3 chunks of 2 lines each. Each chunk is stored as a zipped text file. We will create them inside a new subdirectory called corpus_chunks:

mkdir corpus_chunks
split  --lines 2  --numeric-suffixes                 \
       --filter 'gzip > corpus_chunks/$FILE.txt.gz'  \
       corpus.txt  corpus.

This will have created three files: corpus_chunks/corpus.00.txt.gz, corpus_chunks/corpus.01.txt.gz, and corpus_chunks/corpus.02.txt.gz. To verify whether the data has been split as expected, you can use this command:

zcat corpus_chunks/corpus.*.txt.gz

Hint: For large corpora, we recommend replacing gzip with pigz (apt-get install pigz), which runs notably faster thanks to multi-threading.
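
If split is not available (e.g. on Windows), the same chunking can be done in a few lines of Python. This helper is a hypothetical convenience function, not part of Infinibatch; it streams the input so it also works for files that do not fit into RAM:

```python
import gzip
import os
from itertools import islice

def split_corpus(corpus_path: str, out_dir: str, lines_per_chunk: int = 2) -> None:
    """Split a text file into gzipped chunks of lines_per_chunk lines each,
    mimicking `split --lines N --filter 'gzip > ...'`."""
    os.makedirs(out_dir, exist_ok=True)
    with open(corpus_path, encoding="utf-8") as f:
        chunk_id = 0
        while True:
            chunk = list(islice(f, lines_per_chunk))  # read the next N lines
            if not chunk:
                break
            out_path = os.path.join(out_dir, f"corpus.{chunk_id:02d}.txt.gz")
            with gzip.open(out_path, "wt", encoding="utf-8") as out:
                out.writelines(chunk)
            chunk_id += 1
```

Running split_corpus('corpus.txt', 'corpus_chunks') on the 6-line corpus above produces the same three corpus.NN.txt.gz files as the split command.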

Reading Items in Random Order With Infinibatch

We will first show the easiest way to read data with Infinibatch, using the helper function chunked_dataset_iterator(). This function creates an Infinibatch iterator that yields the content of your data in random order. Please run the following program:

import gzip, glob

from infinibatch import datasets as ds

ds = ds.chunked_dataset_iterator(
    chunk_refs = glob.glob('corpus_chunks/corpus.*.txt.gz'),
    read_chunk_fn = lambda path: iter(gzip.decompress(open(path, "rb")  \
                                      .read()).decode(encoding='utf-8') \
                                      .splitlines()),
    buffer_size = 6, seed = 1)

for i in range(10):
    print(next(ds))

You should get output that contains the 6 example lines in randomized order:

Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
The quick brown fox jumps over the lazy dog.
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
consectetur adipiscing elit,
Lorem ipsum dolor sit amet,
The quick brown fox jumps over the lazy dog.
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Note: The buffer_size parameter determines how many sentences are read into memory at any given time, to draw randomized items from. In real settings with corpora of hundreds of millions of text lines, the buffer_size parameter should be set in the millions. RAM usage and startup time will be proportional to the buffer size (but much lower than having to load the entire corpus into RAM).

Reading Items of Different Lengths in Batches

For deep learning, we want to group multiple items into batches. For NLP tasks, items are often lines of text of varying length. Infinibatch implements an algorithm that randomizes the input sequence and groups it into batches of approximately the same length (aka bucketing).

Infinibatch's BucketedReadaheadBatchIterator performs this task. It implements an algorithm modeled after the Marian toolkit that preloads a large number of randomized items (typically millions; in this example: 6), sorts them and groups them into batches of similar length, and then yields them, in turn, in randomized order.

Here is an example. Note that the BucketedReadaheadBatchIterator accepts the previous randomized sentence sequence iterator (ds) as the source of items to randomize over. This is an example of how one forms pipelines of iterators with Infinibatch (a concept familiar from Python's own itertools). Once an iterator is passed to another as its source, consider it owned by that other iterator; it must no longer be accessed by the calling code.

import gzip, glob

from infinibatch import datasets as ds
from infinibatch import iterators as it

ds = ds.chunked_dataset_iterator(
    chunk_refs = glob.glob('corpus_chunks/corpus.*.txt.gz'),
    read_chunk_fn = lambda path: iter(gzip.decompress(open(path, "rb")  \
                                      .read()).decode(encoding='utf-8') \
                                      .splitlines()),
    buffer_size = 6, seed = 1)

bs = it.BucketedReadaheadBatchIterator(
    source_iterator = ds,   # note: this is the iterator from above
    read_ahead = 6,
    key = lambda line: len(line),
    batch_size = 2,
    seed = 1)

for i in range(25):
    print(next(bs))

This code should output something like this:

['sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.',
 'The quick brown fox jumps over the lazy dog.']
['consectetur adipiscing elit,', 'Lorem ipsum dolor sit amet,']
['Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.',
 'Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.']

followed by different permutations of the same tuples. As you can see, the sentences are in random order and grouped in batches of 2 of approximately the same length. You may notice that there is no variation in how the items get grouped into batches--that is an artifact of this example, and generally not the case in real use when the data size is much larger than the batch size.
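
The scheme described above (read ahead a window of items, sort by length, cut into batches, shuffle the batch order) can be sketched in plain Python. This is a conceptual illustration of the idea, not Infinibatch's implementation:

```python
import random
from typing import Iterable, Iterator, List

def bucketed_batches(lines: Iterable[str], read_ahead: int,
                     batch_size: int, seed: int = 0) -> Iterator[List[str]]:
    """Yield length-bucketed batches: sort each read-ahead window by length,
    cut it into batches, and emit the batches in random order."""
    rng = random.Random(seed)
    lines = list(lines)
    for start in range(0, len(lines), read_ahead):
        window = sorted(lines[start : start + read_ahead], key=len)
        batches = [window[i : i + batch_size]
                   for i in range(0, len(window), batch_size)]
        rng.shuffle(batches)  # randomize batch order within the window
        yield from batches
```

Because each window is sorted before being cut, every batch contains items of adjacent lengths, which is what keeps padding waste low.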

In NLP, sentence length often varies considerably. As a result, using batches with a fixed number of lines, as in the example above, will waste GPU RAM and cores. This is because the number of lines is limited by the longest possible sequence; batches of shorter lines would leave GPU cycles on the table. Ideally, one would use batches that have as many lines as fit into GPU RAM, given the number of tokens of the longest line in the batch. To support such variable batch sizes, Infinibatch allows you to pass a function as the batch_size parameter. That function is given the longest item of a batch and should estimate how many items of at most this length can fit.

In our example, we assume that batches can hold at most 150 tokens. Please change the above code as follows:

    batch_size = lambda longest_line: 150 // len(longest_line),

The output looks like this:

['consectetur adipiscing elit,', 'Lorem ipsum dolor sit amet,']
['Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.']
['sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.',
 'The quick brown fox jumps over the lazy dog.']
['Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.']

Note that the shorter sentences got grouped into one batch, while the longer ones did not, because any pair of them would exceed the total of 150 characters.
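
We can check the arithmetic of the batch_size function on two sentences from the example corpus (in this toy example, a "token" is simply a character):

```python
short_line = 'consectetur adipiscing elit,'
long_line = ('Ut enim ad minim veniam, quis nostrud exercitation '
             'ullamco laboris nisi ut aliquip ex ea commodo consequat.')

assert len(short_line) == 28
assert 150 // len(short_line) == 5   # up to 5 such short lines fit the budget
assert len(long_line) == 107
assert 150 // len(long_line) == 1    # a long line forms a batch by itself
```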

Reading Batches Into Numpy Arrays

Lastly, we will need to feed batches into our favorite deep-learning tool. We will show how to convert the batches of text lines into padded numpy arrays.

In a typical NLP application, text items would be tokenized, and then each token would be represented by an index into a unit vocabulary. For simplicity, in this example each character is its own token, and each token's numeric unit index is just its ASCII code. These sequences are then padded to equal length with -1, and converted into a numpy array.

Please rerun the previous example, but first insert the following code before the final for loop. This example uses an Infinibatch MapIterator, which applies a user-supplied function or lambda to each item:

import numpy as np
def collate(lines_batch):
    # tokenize all lines in the batch and map to unit ids
    ids_batch = [[ord(c) for c in line] for line in lines_batch]
    # create a padded numpy array as wide as the longest line,
    # where shorter sequences are padded with -1
    width = max(len(ids) for ids in ids_batch)
    return np.array([ids + [-1] * (width-len(ids)) for ids in ids_batch])

bs = it.MapIterator(
    source_iterator = bs,
    transform = collate)

This will output batches like this. Note that in batches with multiple sentences, some entries are padded with -1.

[[ 99 111 110 115 101  99 116 101 116 117 114  32  97 100 105 112 105 115
   99 105 110 103  32 101 108 105 116  44]
 [ 76 111 114 101 109  32 105 112 115 117 109  32 100 111 108 111 114  32
  115 105 116  32  97 109 101 116  44  -1]]
[[ 85 116  32 101 110 105 109  32  97 100  32 109 105 110 105 109  32 118
  101 110 105  97 109  44  32 113 117 105 115  32 110 111 115 116 114 117
  100  32 101 120 101 114  99 105 116  97 116 105 111 110  32 117 108 108
   97 109  99 111  32 108  97  98 111 114 105 115  32 110 105 115 105  32
  117 116  32  97 108 105 113 117 105 112  32 101 120  32 101  97  32  99
  111 109 109 111 100 111  32  99 111 110 115 101 113 117  97 116  46]]
[[115 101 100  32 100 111  32 101 105 117 115 109 111 100  32 116 101 109
  112 111 114  32 105 110  99 105 100 105 100 117 110 116  32 117 116  32
  108  97  98 111 114 101  32 101 116  32 100 111 108 111 114 101  32 109
   97 103 110  97  32  97 108 105 113 117  97  46]
 [ 84 104 101  32 113 117 105  99 107  32  98 114 111 119 110  32 102 111
  120  32 106 117 109 112 115  32 111 118 101 114  32 116 104 101  32 108
   97 122 121  32 100 111 103  46  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
   -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1]]
[[ 68 117 105 115  32  97 117 116 101  32 105 114 117 114 101  32 100 111
  108 111 114  32 105 110  32 114 101 112 114 101 104 101 110 100 101 114
  105 116  32 105 110  32 118 111 108 117 112 116  97 116 101  32 118 101
  108 105 116  32 101 115 115 101  32  99 105 108 108 117 109  32 100 111
  108 111 114 101  32 101 117  32 102 117 103 105  97 116  32 110 117 108
  108  97  32 112  97 114 105  97 116 117 114  46]]
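
A common next step, not covered by this tutorial, is deriving a padding mask from the -1 entries so that the model can ignore padded positions. A minimal numpy sketch:

```python
import numpy as np

# a toy padded batch in the same format as above: -1 marks padding
batch = np.array([[ 99, 111, 110,  -1,  -1],
                  [ 76, 111, 114, 101, 109]])

mask = batch != -1            # True at real tokens, False at padding
lengths = mask.sum(axis=1)    # number of real tokens per row
```

The boolean mask (or the per-row lengths) can then be passed to the model alongside the token array, e.g. as an attention mask.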

Where To Go From Here

The above tutorial showed you the use of the most common iterator type, as created by the convenience function chunked_dataset_iterator().

Not all real-life scenarios are covered by this function. For example, multi-task learning scenarios require more complex combinations of data. To create those, you will need to compose the necessary data reader from the underlying building blocks. This is described in the documentation of the iterators module.

Documentation

To view the documentation, please clone the repository and go to docs/infinibatch/index.html

When working on the documentation, install pdoc:

pip install pdoc3

You can then start a local http server that dynamically updates the documentation:

pdoc --template-dir docs --http : infinibatch

We currently haven't set up the CI to automatically generate the documentation. Before you merge anything into master, please delete the existing documentation in docs/infinibatch and run

pdoc -o docs --template-dir docs --html infinibatch

Testing

To run unit tests, run the following command.

python -m unittest discover -s test

If you would like the unit tests to stop after the first failed test, use:

python -m unittest discover -s test --failfast

To type-check with mypy (if installed):

mypy infinibatch

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
