Writing a dataset to FFCV format
Datasets in FFCV are stored in a custom .beton format that allows for fast
reading (see the Making an FFCV dataloader section).
Such files can be generated using the class ffcv.writer.DatasetWriter from two potential sources:
- Indexable objects: They need to implement __len__ and a __getitem__ function returning the data associated with a sample as a tuple/list (of any length). Examples of this kind of dataset include, but are not limited to, torch.utils.data.Dataset, numpy.ndarray, or even Python lists.
- Webdataset (GitHub): This allows users to integrate large-scale and/or remote datasets into FFCV easily.
In this tutorial, we will show how to handle datasets from these two categories.
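To make the indexable-object requirement concrete, here is a minimal sketch showing that a plain Python list of tuples already satisfies it. The helper `looks_indexable` is hypothetical (it is not part of FFCV); it only checks for the two methods named above and that the first sample is itself a sized container such as a tuple, list, or array row.

```python
def looks_indexable(dataset):
    """Hypothetical helper, not part of FFCV: checks the interface described above."""
    if not (hasattr(dataset, "__len__") and hasattr(dataset, "__getitem__")):
        return False
    if len(dataset) == 0:
        return True  # nothing to inspect
    # A sample should be a container of elements, not a bare scalar.
    return hasattr(dataset[0], "__len__")


# A plain list of (input, label) tuples qualifies.
samples = [([0.1, 0.2], 1.0), ([0.3, 0.4], 0.0)]


class NoLength:
    """Supports indexing but has no __len__, so it does not qualify."""

    def __getitem__(self, idx):
        return (idx,)


list_ok = looks_indexable(samples)
no_len_ok = looks_indexable(NoLength())
scalars_ok = looks_indexable([1, 2])  # samples are bare ints, not tuples
```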
Additionally, in the folder /examples of our repository we also include a
conversion script illustrating the conversion of CIFAR-10 and ImageNet from their PyTorch counterparts.
The first step is to import the following class in your script:
from ffcv.writer import DatasetWriter
Indexable Dataset
For this example, we’ll construct a simple linear regression dataset that returns an input vector and its corresponding label:
import numpy as np
class LinearRegressionDataset:
    def __init__(self, N, d):
        self.X = np.random.randn(N, d)
        self.Y = np.random.randn(N)

    def __getitem__(self, idx):
        return (self.X[idx].astype('float32'), self.Y[idx])

    def __len__(self):
        return len(self.X)

N, d = (100, 6)
dataset = LinearRegressionDataset(N, d)
Note
The class LinearRegressionDataset implements the interface required to be a
torch.utils.data.Dataset so one could use any PyTorch Dataset instead of our
toy example here.
The class responsible for converting datasets to FFCV format is the
ffcv.writer.DatasetWriter. The writer takes in:
- A path, where the .beton will be written
- A dictionary mapping keys to fields (Field).
Each field corresponds to an element of the data tuple returned by our dataset, and specifies how the element should be written to (and later, read from) the FFCV dataset file. In our case, the dataset has two fields, one for the (vector) input and the other for the corresponding (scalar) label. Both of these fields already have default implementations in FFCV, which we use below:
from ffcv.fields import NDArrayField, FloatField
# write_path is the destination of the .beton file; define it before this call.
writer = DatasetWriter(write_path, {
    'covariate': NDArrayField(shape=(d,), dtype=np.dtype('float32')),
    'label': FloatField(),
}, num_workers=16)
Note
Since Python 3.7, dictionaries are guaranteed to preserve insertion order
(CPython 3.6 already behaved this way as an implementation detail), and
DatasetWriter uses this order to match the given fields to the elements
returned by the __getitem__ function of the dataset. Make sure to provide
the fields in the right order to avoid errors.
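The ordering rule can be illustrated without FFCV itself. The sketch below pairs field names with the elements of one sample purely by position, which is the matching described in the note. The string placeholders stand in for real Field objects, and `pair_fields_with_sample` is a hypothetical helper written for this illustration.

```python
def pair_fields_with_sample(fields, sample):
    """Hypothetical helper: pairs dict keys with sample elements by position."""
    if len(fields) != len(sample):
        raise ValueError(
            f"expected {len(fields)} elements per sample, got {len(sample)}"
        )
    return dict(zip(fields.keys(), sample))


# Placeholders stand in for NDArrayField / FloatField instances.
fields = {"covariate": "NDArrayField", "label": "FloatField"}
sample = ([0.5, -1.2, 3.0], 0.7)  # same order as __getitem__ returns

paired = pair_fields_with_sample(fields, sample)

# Declaring the keys in the wrong order pairs 'label' with the input vector.
swapped = pair_fields_with_sample(
    {"label": "FloatField", "covariate": "NDArrayField"}, sample
)
```

In the swapped case nothing fails at pairing time, which is why the declaration order deserves a second look before writing a large dataset.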
After constructing the writer, the only remaining step is to write the dataset:
writer.from_indexed_dataset(dataset)
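Putting the steps of this section together, a minimal end-to-end sketch might look as follows. It uses only the calls shown in this section; the function name `write_regression_dataset`, the default `num_workers=1`, and the example output path are assumptions made for illustration. The imports live inside the function so that defining it does not require FFCV to be installed.

```python
def write_regression_dataset(dataset, d, write_path, num_workers=1):
    """Sketch: write an indexable (vector, scalar) dataset to a .beton file."""
    # Imported lazily; numpy and ffcv are required when the function is called.
    import numpy as np
    from ffcv.writer import DatasetWriter
    from ffcv.fields import NDArrayField, FloatField

    writer = DatasetWriter(write_path, {
        # Order matches the tuple returned by __getitem__: (input, label).
        'covariate': NDArrayField(shape=(d,), dtype=np.dtype('float32')),
        'label': FloatField(),
    }, num_workers=num_workers)
    writer.from_indexed_dataset(dataset)
    return write_path


# Example call (the path is a placeholder):
# write_regression_dataset(dataset, d, '/tmp/linear_regression.beton')
```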
Webdataset
For this second example, we assume that you have access to a
webdataset version of ImageNet (or a similar dataset), and that all the
shards are in a folder called FOLDER.
In order to perform the conversion to a .beton file, we first need to
collect the list of shards. This can be simply done with glob:
from glob import glob
from os import path
my_shards = glob(path.join(FOLDER, '*'))
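Because glob returns paths in arbitrary order and the '*' pattern matches every entry in the folder, it can help to sort the result and fail early when nothing is found. The sketch below assumes the shards use the .tar extension, which is the usual webdataset convention but is not stated in this tutorial; `collect_shards` is a hypothetical helper.

```python
import tempfile
from glob import glob
from os import path


def collect_shards(folder, pattern='*.tar'):
    """Hypothetical helper: sorted shard list, with an early error if empty."""
    shards = sorted(glob(path.join(folder, pattern)))
    if not shards:
        raise FileNotFoundError(f"no shards matching {pattern!r} in {folder!r}")
    return shards


# Demonstration on a temporary folder holding two empty placeholder shards
# and one unrelated file that the pattern should skip.
with tempfile.TemporaryDirectory() as folder:
    for name in ('shard-000001.tar', 'shard-000000.tar', 'README.txt'):
        open(path.join(folder, name), 'w').close()
    shard_names = [path.basename(p) for p in collect_shards(folder)]
```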
Internally, FFCV will split the shards between the available workers.
However, each worker still needs to know how to decode a given shard. This is done
by defining a pipeline (very similar to how one would use a webdataset for training):
def pipeline(dataset):
    # Decode images to uint8 RGB arrays, then emit (image, label) tuples.
    return dataset.decode('rgb8').to_tuple('jpg;png;jpeg cls')
Since FFCV expects images as NumPy uint8 arrays, we pass 'rgb8' to
webdataset's decode step. We then convert each sample dictionary to a tuple
that FFCV will be able to process.
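To see what the dictionary-to-tuple step does conceptually, the following pure-Python sketch imitates it on a single decoded sample. It is a simplified illustration, not webdataset's implementation: each space-separated entry of the specification names one output element, and a ';' inside an entry lists alternative keys, of which the first one present in the sample is used. The function `to_tuple_like` and the sample contents are made up for the example.

```python
def to_tuple_like(sample, spec):
    """Simplified imitation of a to_tuple-style selection on one sample dict."""
    out = []
    for entry in spec.split():
        alternatives = entry.split(';')
        for key in alternatives:
            if key in sample:
                out.append(sample[key])
                break
        else:
            # No alternative key was present in this sample.
            raise KeyError(f"none of {alternatives} found in sample")
    return tuple(out)


# A decoded sample, keyed by file extension (contents are placeholders).
sample = {'__key__': 'example-0001', 'jpeg': 'image-array', 'cls': 0}
result = to_tuple_like(sample, 'jpg;png;jpeg cls')
```

The resulting tuple has one element per field, in the order in which the fields are later declared to the writer.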
We now just have to glue everything together:
from ffcv.fields import RGBImageField, IntField
writer = DatasetWriter(write_path, {
    'image': RGBImageField(),
    'label': IntField(),
}, num_workers=40)
writer.from_webdataset(my_shards, pipeline)
Fields
Beyond the examples used above, FFCV supports a variety of built-in fields that make it easy to directly convert most datasets. We review them below:
- RGBImageField: Handles images including (optional) compression and resizing. Pass in a PyTorch Tensor.
- IntField and FloatField: Handle simple scalar fields. Pass in int or float.
- BytesField: Stores byte arrays of variable length. Pass in numpy byte array.
- JSONField: Encodes a JSON document. Pass in dict that can be JSON-encoded.
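For JSONField, the requirement is that the dictionary can be JSON-encoded. One way to check this before writing is to try encoding it with the standard library. The helper `is_json_encodable` below is hypothetical and does not call FFCV.

```python
import json


def is_json_encodable(obj):
    """Hypothetical helper: True if the standard json module can encode obj."""
    try:
        json.dumps(obj)
    except (TypeError, ValueError):
        return False
    return True


ok = is_json_encodable({'id': 3, 'tags': ['cat', 'indoor']})
# A set is not a JSON type, so this dictionary cannot be encoded as-is.
bad = is_json_encodable({'id': 3, 'tags': {'cat', 'indoor'}})
```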
That’s it! You are now ready to construct loaders for this dataset and start loading the data.