The Vesuvius 3D DataStreamer is crafted to meet the growing demands of processing an increasing number of high-resolution 3D segments from the Herculaneum scrolls of the Vesuvius Challenge (https://scrollprize.org/).
It is designed to bypass the limitations of in-memory data loading by streaming sampled 3D blocks directly from disk, which becomes increasingly critical in training pipelines as the dataset expands.
When memory allows, it can also sample from NumPy arrays loaded in RAM, which gives faster data access.
- `streamer.py`: Implements `VesuviusStream`, a PyTorch `IterableDataset` for streaming 3D chunks from multiple Zarr archives (on disk) or NumPy arrays (loaded in RAM). When reading from disk, the algorithm loads only the data needed for each chunk. Two sampling strategies are available (see the sketch after this list): `uniform` first selects a file uniformly at random and then samples without replacement within that file; `proportional` selects a file with probability proportional to the number of samples it contains (more samples, higher probability).
- `converter.py`: A script to convert TIFF files into a Zarr archive, with options for cropping and chunking, facilitating efficient 3D array manipulation.
- `example_zarr`: A folder containing a sample `example.zarr` archive, created from the "monster scroll" with the 3D region-of-interest parameters listed below.
- `training_example.ipynb`: Demonstrates using `VesuviusStream` with PyTorch's `DataLoader` to feed a machine learning model.
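To make the two sampling strategies concrete, here is a minimal sketch of how the file-selection probabilities could be computed. This is only an illustration of the behaviour described above, not the library's internal code, and the sample counts are made up.

```python
import numpy as np

# Hypothetical number of available blocks in each file (illustrative values only).
samples_per_file = np.array([1000, 250, 4000])
n_files = len(samples_per_file)

# 'uniform': every file has the same chance of being selected first.
uniform_probs = np.full(n_files, 1.0 / n_files)

# 'proportional': a file is selected with probability proportional to its sample count.
proportional_probs = samples_per_file / samples_per_file.sum()

rng = np.random.default_rng(0)
print("uniform:     ", uniform_probs)       # [0.333 0.333 0.333]
print("proportional:", proportional_probs)  # [0.190 0.048 0.762]
print("picked file: ", rng.choice(n_files, p=proportional_probs))
```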
Install the required packages:
pip install -r requirements.txt
First, convert the TIFF images to Zarr for efficient access:
python tools/converter.py /path/to/tiff_folder/ /path/to/destination.zarr --parameters
All TIFF files from `00.tif` to `64.tif` for a segment or scroll must be present in the folder.
The parameters specify the ROI (region of interest) to extract from the 3D images, in terms of 3D coordinates.
`example.zarr` has been generated with the following ROI parameters:
- --z_start 26 --z_end 36
- --x_start 6000 --x_end 7000
- --y_start 4000 --y_end 5000
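Putting this together, the command used to build `example.zarr` would look roughly as follows (the TIFF folder path is a placeholder, and the chunk-size flags are left at their defaults):

python tools/converter.py /path/to/monster_scroll_tiffs/ ./example_zarr/example.zarr --z_start 26 --z_end 36 --x_start 6000 --x_end 7000 --y_start 4000 --y_end 5000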
Zarr stores multidimensional arrays in separate chunks. The chunk-size parameters determine how Zarr splits the data on disk, not the 3D blocks used for the ML model. The default chunk sizes are:
- --z_chunksize 4
- --y_chunksize 512
- --x_chunksize 512
These defaults produce chunks of roughly 2 MB each (for 16-bit voxels), but feel free to experiment with the settings.
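To see what this chunking means in practice, here is a small sketch using the `zarr` package directly (an illustration only; `converter.py` handles the real conversion, and the array contents here are random):

```python
import numpy as np
import zarr

# Create a Zarr array split into (4, 512, 512) chunks, mirroring the converter defaults.
volume = zarr.zeros(shape=(10, 1000, 1000), chunks=(4, 512, 512), dtype='uint16')
volume[:] = np.random.randint(0, 2**16, size=volume.shape, dtype='uint16')

print(volume.chunks)                    # (4, 512, 512): how the data is split in storage
block = volume[2:4, 100:164, 200:264]   # reading a small 3D block ...
print(block.shape)                      # (2, 64, 64): ... only touches the chunks it overlaps
```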
`VesuviusStream` loads data on the fly:
```python
from datautils.streamer import VesuviusStream
from torch.utils.data import DataLoader

# fragment_img = np.array(fragment loaded as 3D image)
dataset = VesuviusStream(files=['./example_zarr/example.zarr', fragment_img],
                         z_size=2, y_size=4, x_size=6,
                         samples_per_epoch=16,
                         sampling_method='uniform', shuffle=True)
loader = DataLoader(dataset, batch_size=4, num_workers=2)
```
Here the parameters `z_size`, `y_size`, and `x_size` indicate the shape of the 3D block to sample from the scroll/fragment.
Refer to training_example.ipynb for integrating data streaming into a training loop.
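For orientation, a minimal training-loop sketch is shown below. The model, the objective, and the assumption that each batch is a single tensor of shape `(batch, z_size, y_size, x_size)` are placeholders for illustration, not taken from the notebook.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from datautils.streamer import VesuviusStream

# Toy model just to exercise the data pipeline; replace with your real architecture.
model = nn.Sequential(nn.Conv3d(1, 8, kernel_size=1), nn.AdaptiveAvgPool3d(1),
                      nn.Flatten(), nn.Linear(8, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

dataset = VesuviusStream(files=['./example_zarr/example.zarr'],
                         z_size=2, y_size=4, x_size=6,
                         samples_per_epoch=16,
                         sampling_method='uniform', shuffle=True)
loader = DataLoader(dataset, batch_size=4, num_workers=2)

for epoch in range(2):
    for blocks in loader:
        # Assumed shape (batch, z, y, x); add a channel dimension for Conv3d.
        x = blocks.unsqueeze(1).float()
        out = model(x)
        loss = out.mean()  # placeholder objective, only to demonstrate the backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```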
The streamer reads only the Zarr chunks needed to assemble each block, which is what makes it memory efficient. It is recommended to set the chunk-size parameters in the converter larger than the dimensions of the blocks to fetch; however, if the chunk size is too large, the streamer will read a lot of unneeded data from disk.
In my Google Colab simulations, sampling 8 batches of 128 blocks each, with block_size = (4, 64, 64), takes about 15 seconds from disk. Sampling the same amount of data from a NumPy array in RAM takes about 6 seconds.
Therefore, if possible, load the data into RAM as a NumPy array.
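One way to do that, assuming the archive fits in memory and is stored as a single Zarr array, is to materialize it as a NumPy array before constructing the dataset (a sketch, not the only option):

```python
import zarr
from datautils.streamer import VesuviusStream

# Read the whole Zarr archive into RAM as a NumPy array (only if it fits in memory).
fragment_img = zarr.open('./example_zarr/example.zarr', mode='r')[:]

dataset = VesuviusStream(files=[fragment_img],
                         z_size=2, y_size=4, x_size=6,
                         samples_per_epoch=16,
                         sampling_method='uniform', shuffle=True)
```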
The tool is built for PyTorch but can be adapted for other frameworks.
Contributions are welcome.
Dr. Giorgio Angelotti
For any inquiries or further information, feel free to contact me at [email protected]
This project is licensed under the MIT License -- see the LICENSE file for details.