Writing a dataset to FFCV format
================================

Datasets in FFCV are stored in a custom ``.beton`` format that allows for fast reading (see
the :ref:`Making an FFCV dataloader` section). Such files can be generated using the class
:class:`ffcv.writer.DatasetWriter` from two potential sources:

- **Indexable objects**: They need to implement ``__len__`` and a ``__getitem__`` function
  returning the data associated with a sample as a tuple/list (of any length). Examples of
  this kind of dataset include, but are not limited to, ``torch.utils.data.Dataset``,
  ``numpy.ndarray``, or even Python lists.

- **Webdataset** (available on GitHub): This allows users to integrate large-scale and/or
  remote datasets into FFCV easily.

In this tutorial, we will show how to handle datasets from these two categories.
Additionally, in the folder ``/examples`` of our repository we also include a conversion
script illustrating how to convert CIFAR-10 and ImageNet from their PyTorch counterparts.

The first step is to import the following class into your script:

.. code-block:: python

    from ffcv.writer import DatasetWriter

Indexable Dataset
-----------------

For this example, we'll construct a simple linear regression dataset that returns an input
vector and its corresponding label:

.. code-block:: python

    import numpy as np

    class LinearRegressionDataset:
        def __init__(self, N, d):
            # N random input vectors of dimension d, and N random scalar labels
            self.X = np.random.randn(N, d)
            self.Y = np.random.randn(N)

        def __getitem__(self, idx):
            # Each sample is a tuple: (float32 input vector, scalar label)
            return (self.X[idx].astype('float32'), self.Y[idx])

        def __len__(self):
            return len(self.X)

    N, d = (100, 6)
    dataset = LinearRegressionDataset(N, d)

.. note::
    The class ``LinearRegressionDataset`` implements the interface required to be a
    ``torch.utils.data.Dataset``, so one could use any PyTorch Dataset instead of our toy
    example here.

The class responsible for converting datasets to FFCV format is the :class:`ffcv.writer.DatasetWriter`.
The writer takes in:

- A path, where the ``.beton`` file will be written
- A dictionary mapping keys to *fields* (:class:`~ffcv.fields.Field`).

Each field corresponds to an element of the data tuple returned by our dataset, and
specifies how the element should be written to (and later, read from) the FFCV dataset
file. In our case, the dataset has two fields, one for the (vector) input and the other
for the corresponding (scalar) label. Both of these fields already have default
implementations in FFCV, which we use below:

.. code-block:: python

    from ffcv.fields import NDArrayField, FloatField

    # write_path: where the .beton file will be written (define it beforehand)
    writer = DatasetWriter(write_path, {
        'covariate': NDArrayField(shape=(d,), dtype=np.dtype('float32')),
        'label': FloatField(),
    }, num_workers=16)

.. note::
    Python dictionaries preserve insertion order (guaranteed by the language since
    Python 3.7, and an implementation detail of CPython 3.6), and
    :class:`~ffcv.writer.DatasetWriter` uses this order to match the given fields to the
    elements returned by the ``__getitem__`` function of the dataset. Make sure to provide
    the fields in the same order as the elements of the tuple to avoid errors.

After constructing the writer, the only remaining step is to write the dataset:

.. code-block:: python

    writer.from_indexed_dataset(dataset)

Webdataset
----------

For this second example, we will assume that you have access to a ``webdataset`` version of
ImageNet (or a similar dataset), and that all the shards are in a folder called ``FOLDER``.

In order to perform the conversion to a ``.beton`` file, we first need to collect the list
of shards. This can be done with ``glob``:

.. code-block:: python

    from glob import glob
    from os import path

    my_shards = glob(path.join(FOLDER, '*'))

Internally, FFCV will split the shards between the available workers. However, each worker
still needs to know how to decode a given shard. This is done by defining a pipeline (very
similar to how one would use a ``webdataset`` for training):

.. code-block:: python

    def pipeline(dataset):
        # Decode images to uint8 RGB arrays, then emit (image, label) tuples.
        # Alternative image extensions are separated by ';'.
        return dataset.decode('rgb8').to_tuple('jpg;png;jpeg cls')

Since FFCV expects images in the numpy ``uint8`` format, we use the parameter ``'rgb8'`` of
``webdataset`` to decode the images. We then convert the dictionary to a tuple that FFCV
will be able to process.

We now just have to glue everything together:

.. code-block:: python

    from ffcv.fields import RGBImageField, IntField

    writer = DatasetWriter(write_path, {
        'image': RGBImageField(),
        'label': IntField(),
    }, num_workers=40)

    writer.from_webdataset(my_shards, pipeline)

Fields
------

Beyond the examples used above, FFCV supports a variety of built-in fields that make it
easy to directly convert most datasets. We review them below:

- :class:`~ffcv.fields.RGBImageField`: Handles images including (optional) compression
  and resizing. Pass in a ``numpy`` ``uint8`` array.
- :class:`~ffcv.fields.IntField` and :class:`~ffcv.fields.FloatField`: Handle simple
  scalar fields. Pass in an ``int`` or a ``float``.
- :class:`~ffcv.fields.BytesField`: Stores byte arrays of variable length. Pass in a
  ``numpy`` byte array.
- :class:`~ffcv.fields.JSONField`: Encodes a JSON document. Pass in a ``dict`` that can
  be JSON-encoded.

That's it! You are now ready to :ref:`construct loaders` for this dataset and start
loading the data.
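
Because the writer matches fields to tuple elements purely by position, a mismatch
between the field dictionary and what ``__getitem__`` returns is an easy mistake to make.
The sketch below shows one way to catch it before writing. Note that
``check_fields_match`` is a hypothetical helper written for this illustration and is not
part of FFCV; the ``None`` values merely stand in for real field objects so that the
snippet runs without FFCV installed.

```python
def check_fields_match(dataset, fields, num_samples=3):
    """Hypothetical pre-flight check (NOT part of FFCV).

    Verifies that `dataset` is indexable and that the first few samples are
    tuples/lists with exactly one element per entry in `fields`.
    """
    if not hasattr(dataset, '__len__') or not hasattr(dataset, '__getitem__'):
        raise TypeError('dataset must implement __len__ and __getitem__')

    field_names = list(fields)  # dict keys, in insertion order
    for idx in range(min(num_samples, len(dataset))):
        sample = dataset[idx]
        if not isinstance(sample, (tuple, list)):
            raise TypeError(f'sample {idx} is not a tuple/list')
        if len(sample) != len(field_names):
            raise ValueError(
                f'sample {idx} has {len(sample)} elements, '
                f'but {len(field_names)} fields were given: {field_names}'
            )
    return True


# A plain Python list of (input vector, label) tuples is a valid indexable object.
toy_dataset = [([0.1, 0.2], 1.0), ([0.3, 0.4], 0.0)]

# Stand-ins for field objects; only the keys and their order matter here.
fields = {'covariate': None, 'label': None}

check_fields_match(toy_dataset, fields)
```

A check like this only compares the number of elements. Whether each element has the
type and shape its field expects is still determined by the field itself at write time.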