Writing a dataset to FFCV format

Datasets in FFCV are stored in a custom .beton format that allows for fast reading (see the Making an FFCV dataloader section).

Such files can be generated using the class ffcv.writer.DatasetWriter from two potential sources:

  • Indexable objects: They need to implement __len__ and a __getitem__ function returning the data associated to a sample as a tuple/list (of any length). Examples of this kind of dataset include but are not limited to: torch.utils.data.Dataset, numpy.ndarray, or even Python lists.

  • Webdataset (Github): This allows users to integrate large scale and/or remote datasets into FFCV easily.

In this tutorial, we will show how to handle datasets from these two categories. Additionally, in the folder /examples of our repository we also include a conversion script illustrating the conversion of CIFAR-10 and ImageNet from their PyTorch counterparts.

The first step is to include the following class into your script:

from ffcv.writer import DatasetWriter

Indexable Dataset

For this example, we’ll construct a simple linear regression dataset that returns an input vector and its corresponding label:

import numpy as np

class LinearRegressionDataset:
    def __init__(self, N, d):
        self.X = np.random.randn(N, d)
        self.Y = np.random.randn(N)

    def __getitem__(self, idx):
        return (self.X[idx].astype('float32'), self.Y[idx])

    def __len__(self):
        return len(self.X)

N, d = (100, 6)
dataset = LinearRegressionDataset(N, d)


The class LinearRegressionDataset implements the interface required to be a torch.utils.data.Dataset so one could use any PyTorch Dataset instead of our toy example here.

The class responsible for converting datasets to FFCV format is the ffcv.writer.DatasetWriter. The writer takes in:

  • A path, where the .beton will be written

  • A dictionary mapping keys to fields (Field).

Each field corresponds to an element of the data tuple returned by our dataset, and specifies how the element should be written to (and later, read from) the FFCV dataset file. In our case, the dataset has two fields, one for the (vector) input and the other for the corresponding (scalar) label. Both of these fields already have default implementations in FFCV, which we use below:

from ffcv.fields import NDArrayField, FloatField

writer = DatasetWriter(write_path, {
    'covariate': NDArrayField(shape=(d,), dtype=np.dtype('float32')),
    'label': FloatField(),

}, num_workers=16)


Starting in Python 3.6, dictionary keys are ordered, and DatasetWriter uses this order to match the given fields to the elements returned by the __getitem__ function of the dataset. Make sure to provide the fields in the right order to avoid errors.

After constructing the writer, the only remaining step is to write the dataset:



For this second example we will assume that you have access to a webdataset version of ImageNet (or similar) dataset, and that all the shards are in a folder called FOLDER.

In order to perform the conversion to a .beton file, we first need to collect the list of shards. This can be simply done with glob:

from glob import glob
from os import path

my_shards = glob(path.join(FOLDER, '*'))

Internally, FFCV will split the shards between the available workers. However, each worker still needs to know how to decode a given shard. This is done by defining a pipeline (very similar to how one would use a webdataset for training):

def pipeline(dataset):
    return dataset.decode('rgb8').to_tuple('jpg:png;jpeg cls')

Since FFCV expects images in the numpy uint8 format, we use the parameter 'rgb8' of webdataset to decode the images. We then convert the dictionary to a tuple that FFCV will be able to process.

We now just have to glue everything together:

from ffcv.fields import RGBImageField, IntField

writer = DatasetWriter(write_path, {
    'image': RGBImageField()
    'label': IntField(),

}, num_workers=40)

writer.from_webdataset(my_shards, pipeline)


Beyond the examples used above, FFCV supports a variety of built-in fields that make it easy to directly convert most datasets. We review them below:

  • RGBImageField: Handles images including (optional) compression and resizing. Pass in a PyTorch Tensor.

  • IntField and FloatField: Handle simple scalar fields. Pass in int or float.

  • BytesField: Stores byte arrays of variable length. Pass in numpy byte array.

  • JSONField: Encodes a JSON document. Pass in dict that can be JSON-encoded.

That’s it! You are now ready to construct loaders for this dataset and start loading the data.