HDF5-FEMNIST provides easy access to, and fast loading of, the FEMNIST dataset from LEAF with the help of HDF5.
Publicly accessible federated datasets for research are currently limited. FEMNIST is one of the earliest and most popular of them, but it is implemented under the LEAF framework, which few researchers use nowadays. TensorFlow Federated has a built-in version of FEMNIST, but PyTorch users have no easy access to the dataset.
Following the idea behind FEMNIST, which is built from the NIST dataset, this repo simplifies the procedure of splitting the dataset by user (writer). The dataset is then converted into HDF5 format for easy access and fast loading. Loading all the writer datasets using PyTorch ImageFolder takes about 10 to 30 minutes, while loading the same data from HDF5 takes only a few seconds.
This repo can generate the per-writer HDF5 datasets in two settings. The first is the digits-only dataset, which contains only the digits 0 to 9. The second is the full dataset, which contains all 62 character classes. Note that not every writer has samples for all 62 classes, or even for all 10 digits in digits-only mode. The datasets are generated with the following structure:
HDF5-FEMNIST
├── f0000_14
│ ├── images
│ │ ├── n0 * 28 * 28
│ └── labels
│ ├── n0 * 1
├── f0001_41
│ ├── images
│ │ ├── n1 * 28 * 28
│ └── labels
│ ├── n1 * 1
├── ...
├── f4099_10
│ ├── images
│ │ ├── n4099 * 28 * 28
│ └── labels
│ ├── n4099 * 1
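As a rough illustration of this layout, the sketch below summarises each writer group: how many samples it holds and which classes never appear. It assumes every group exposes `images` and `labels` as shown in the tree. The helper name `summarize_writers` and its `num_classes` parameter are hypothetical and not part of this repo. The function accepts any mapping-like object, so an open `h5py.File` should work.

```python
import numpy as np


def summarize_writers(dataset, num_classes=10):
    """Return {writer_id: (n_samples, missing_labels)}.

    `dataset` is assumed to behave like an open h5py.File laid out as in the
    tree above: dataset[writer]["images"] and dataset[writer]["labels"].
    """
    summary = {}
    for writer in sorted(dataset.keys()):
        # Read the labels into memory and flatten the n * 1 column to length n
        labels = np.asarray(dataset[writer]["labels"]).ravel()
        # The first dimension of `images` is the number of samples
        n_samples = len(dataset[writer]["images"])
        present = {int(v) for v in np.unique(labels)}
        missing = sorted(set(range(num_classes)) - present)
        summary[writer] = (n_samples, missing)
    return summary
```

Called on the digits-only file, this would report per writer which of the digits 0 to 9 are absent; passing `num_classes=62` does the same for the full dataset.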
In digits only mode, the labels are in the range of 0 to 9. In full mode, the labels are in the range of 0 to 61. The mapping from the label to the character is in the following table:
Character | 0 | ... | 9 | A | ... | Z | a | ... | z |
---|---|---|---|---|---|---|---|---|---|
Label | 0 | ... | 9 | 10 | ... | 35 | 36 | ... | 61 |
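The mapping in the table can be reproduced with the standard library. This is a minimal sketch; the names `CHARACTERS`, `label_to_char` and `char_to_label` are illustrative and not provided by this repo.

```python
import string

# Index i holds the character for label i: 0-9, then A-Z, then a-z (62 in total)
CHARACTERS = string.digits + string.ascii_uppercase + string.ascii_lowercase


def label_to_char(label):
    """Map an integer label in [0, 61] to its character."""
    label = int(label)
    if not 0 <= label < len(CHARACTERS):
        raise ValueError(f"label must be in [0, {len(CHARACTERS) - 1}], got {label}")
    return CHARACTERS[label]


def char_to_label(char):
    """Map a character back to its integer label."""
    return CHARACTERS.index(char)
```

In digits only mode, only the first ten entries are used.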
To use the datasets, just clone the repo and run the following command:
./get_data.sh
The conversion takes about 2 hours and 12 minutes on a Linux machine with an Intel® Xeon® Platinum 8360Y. If you just want to use the dataset, you can download the compressed version directly:
wget https://github.com/Xiao-Chenguang/HDF5-FEMNIST/raw/refs/heads/main/write_digits.hdf5.gz
gunzip write_digits.hdf5.gz
md5sum write_digits.hdf5
Make sure the output of md5sum is a4a7beee023c2148ca9e4d6d4690e3ce write_digits.hdf5
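Where `md5sum` is not available, the same check can be done in Python. This is a minimal sketch using only the standard library; the function names are illustrative, and the expected value is the checksum quoted above for `write_digits.hdf5`.

```python
import hashlib

# Checksum quoted above for the decompressed write_digits.hdf5
EXPECTED_MD5 = "a4a7beee023c2148ca9e4d6d4690e3ce"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read 1 MiB at a time so large files are never loaded whole
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected
```

For example, `verify("write_digits.hdf5")` should return `True` for an intact download.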
You can specify the mode of the dataset by adding the argument -d followed by True (default, digits only mode) or False (all characters mode).
After converting the dataset to HDF5 format, you can use the following code to load the dataset:
import h5py
import torch
# load the dataset
dataset = h5py.File('path_to_your_femnist.hdf5', 'r')
# get the key of each writer dataset
writers = sorted(dataset.keys())
# read the images and labels of the first writer into numpy arrays
# ([:] copies the data into memory; without it you get h5py Dataset objects)
images = dataset[writers[0]]['images'][:]
labels = dataset[writers[0]]['labels'][:]
# transform the images and labels to torch tensors
images_tensor = torch.from_numpy(images)
labels_tensor = torch.from_numpy(labels)
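When several writers are needed at once, for instance to form a single client or a centralized baseline, their arrays can be pooled. This is a minimal sketch, assuming the `images` / `labels` layout described earlier; `pool_writers` is a hypothetical helper, not part of this repo, and it works on an open `h5py.File` or any similar mapping.

```python
import numpy as np


def pool_writers(dataset, writer_ids):
    """Concatenate the images and labels of the given writers.

    Returns (images, labels) with shapes (N, 28, 28) and (N,), where N is
    the total number of samples across the selected writers.
    """
    images, labels = [], []
    for writer in writer_ids:
        # np.asarray reads the stored data into memory as numpy arrays
        images.append(np.asarray(dataset[writer]["images"]))
        # Labels are stored as n * 1, so flatten them to length n
        labels.append(np.asarray(dataset[writer]["labels"]).ravel())
    return np.concatenate(images, axis=0), np.concatenate(labels, axis=0)
```

The returned arrays can be converted with `torch.from_numpy` in the same way as the single-writer arrays.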
A demo notebook is provided to explore properties of the dataset. Feel free to play with it.
Requirements: Python with the following packages:
- h5py
- numpy
- Pillow
- tqdm
- pandas