Making an FFCV dataloader
After writing an FFCV dataset, we are ready to start loading data (and training models)! We'll continue using the same regression dataset as the previous guide, and we'll assume that the dataset has been written to `/path/to/dataset.beton`.
In order to load the dataset that we've written, we'll need the `ffcv.loader.Loader` class (which will do most of the heavy lifting), and a set of decoders corresponding to the fields present in the dataset (so in our case, the `FloatDecoder` and `NDArrayDecoder` classes):

```python
from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import NDArrayDecoder, FloatDecoder
```
Our first step is instantiating the `Loader` class:

```python
loader = Loader('/path/to/dataset.beton',
                batch_size=BATCH_SIZE,
                num_workers=NUM_WORKERS,
                order=ORDERING,
                pipelines=PIPELINES)
```
In order to create a loader, we need to specify a path to the FFCV dataset, a batch size, and a number of workers, as well as two less standard arguments, `order` and `pipelines`, which we discuss below:
Dataset ordering
The `order` option in the loader initialization is similar to the PyTorch DataLoader's `shuffle` option, with some additional choices. This argument takes an enum provided by `ffcv.loader.OrderOption`:

```python
from ffcv.loader import OrderOption

# Truly random shuffling (shuffle=True in PyTorch)
ORDERING = OrderOption.RANDOM

# Unshuffled (i.e., served in the order the dataset was written)
ORDERING = OrderOption.SEQUENTIAL

# Memory-efficient but not truly random loading
# Speeds up loading over RANDOM when the whole dataset does not fit in RAM!
ORDERING = OrderOption.QUASI_RANDOM
```
Note

The `order` options require different amounts of RAM, so the choice should take the available RAM into account on a case-by-case basis:

- `RANDOM` requires the most RAM, since it has to cache the entire dataset in order to sample perfectly at random. If the available RAM is not enough, it will throw an exception.
- `QUASI_RANDOM` requires much less RAM than `RANDOM`, but a bit more than `SEQUENTIAL`, in order to cache a subset of the samples. Use it when the entire dataset cannot fit in RAM.
- `SEQUENTIAL` requires the least RAM: it only keeps a few samples loaded ahead of time for upcoming training iterations.
Pipelines
The `pipelines` option in `Loader` tells the loader what fields to read, how to read them, and what operations to apply on top. Specifically, a pipeline is a key-value dictionary in which each key matches one used when writing the dataset, and the corresponding value is a sequence of operations to perform. The operations must start with an `ffcv.fields.decoders.Decoder` object corresponding to that field, followed by a sequence of transforms.
For example, the following pipeline reads the fields and then converts each one to a PyTorch tensor:

```python
from ffcv.transforms import ToTensor

PIPELINES = {
    'covariate': [NDArrayDecoder(), ToTensor()],
    'label': [FloatDecoder(), ToTensor()]
}
```
This is already enough to start loading data, but pipelines are also our opportunity to apply fast pre-processing to the data through a series of transforms. Transforms are automatically compiled to machine code at runtime, so for GPU-intensive applications like training neural networks, the data pipeline adds little to no overhead on top of training itself.
Note

In fact, declaring field pipelines is optional: for any field that exists in the dataset file without a corresponding pipeline specified in the `pipelines` dictionary, the `Loader` will default to the bare-bones pipeline above, i.e., first a decoder and then a conversion to a PyTorch tensor. (You can force FFCV to explicitly not load a field by adding a corresponding `None` entry to the `pipelines` dictionary.) If the entire `pipelines` argument is unspecified, this bare-bones pipeline will be applied to all fields.
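As a concrete sketch of the note above (using our regression dataset; the batch size and worker count are illustrative values, not prescriptions), a `pipelines` dictionary that customizes the `covariate` field while skipping the `label` field entirely could look like:

```python
from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import NDArrayDecoder
from ffcv.transforms import ToTensor

# 'covariate' gets an explicit pipeline; 'label' is set to None,
# so the loader will not decode or return that field at all.
loader = Loader('/path/to/dataset.beton',
                batch_size=512,     # illustrative value
                num_workers=8,      # illustrative value
                order=OrderOption.SEQUENTIAL,
                pipelines={
                    'covariate': [NDArrayDecoder(), ToTensor()],
                    'label': None,
                })
```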
Transforms
There are three easy ways to specify transforms in a pipeline:

1. A set of standard transforms provided in the `ffcv.transforms` module. These include standard image data augmentations such as `RandomHorizontalFlip` and `Cutout`.
2. Any subclass of `torch.nn.Module`: FFCV automatically converts it into an operation.
3. Custom transforms: you can implement your own by subclassing `ffcv.transforms.Operation`, as discussed in the Making custom transforms guide.
The following example shows a full pipeline for a vector field: it starts with the field decoder, `NDArrayDecoder`, followed by a conversion to `torch.Tensor`, and finally a custom transform, implemented as a `torch.nn.Module`, that adds Gaussian noise to each vector:

```python
from typing import List
import torch as ch
from ffcv.pipeline.operation import Operation

class AddGaussianNoise(ch.nn.Module):
    def __init__(self, scale=1):
        super().__init__()
        self.scale = scale

    def forward(self, x):
        return x + ch.randn_like(x) * self.scale

pipeline: List[Operation] = [
    NDArrayDecoder(),
    ToTensor(),
    AddGaussianNoise(0.1)
]
```
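Since `AddGaussianNoise` is a plain `torch.nn.Module`, one way to sanity-check it outside of FFCV (a quick standalone sketch, independent of the loader) is to apply it directly to a tensor:

```python
import torch as ch

class AddGaussianNoise(ch.nn.Module):
    def __init__(self, scale=1):
        super().__init__()
        self.scale = scale

    def forward(self, x):
        return x + ch.randn_like(x) * self.scale

x = ch.ones(4, 3)
# With scale=0 the transform is exactly the identity, so it is easy to verify
assert ch.equal(AddGaussianNoise(scale=0.0)(x), x)
# With a nonzero scale, the output keeps the input's shape
assert AddGaussianNoise(scale=0.1)(x).shape == x.shape
```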
As an example for a different field type, a pipeline for an `RGBImageField` could look like this:

```python
image_pipeline: List[Operation] = [
    SimpleRGBImageDecoder(),
    RandomHorizontalFlip(),
    torchvision.transforms.ColorJitter(.4, .4, .4),
    RandomTranslate(padding=2),
    ToTensor(),
    ToDevice('cuda:0', non_blocking=True),
    ToTorchImage(),
    Convert(ch.float16),
    torchvision.transforms.Normalize(MEAN, STD),  # Normalize using image statistics
]
```
Putting it together

Returning to our running linear regression dataset example, the final loader can be constructed as follows:
```python
loader = Loader('/path/to/dataset.beton',
                batch_size=BATCH_SIZE,
                num_workers=NUM_WORKERS,
                order=OrderOption.RANDOM,
                pipelines={
                    'covariate': [NDArrayDecoder(), ToTensor(), AddGaussianNoise(0.1)],
                    'label': [FloatDecoder(), ToTensor()]
                })
```
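Once constructed, the loader can be iterated over like a standard PyTorch DataLoader, with each batch containing one entry per loaded field. A minimal training-loop sketch (the model, loss, and hyperparameters below are illustrative assumptions, not part of the guide):

```python
import torch as ch

# D is the covariate dimension of the regression dataset (an assumption here)
model = ch.nn.Linear(D, 1)
opt = ch.optim.SGD(model.parameters(), lr=1e-3)

for epoch in range(NUM_EPOCHS):
    for covariates, labels in loader:
        opt.zero_grad()
        loss = ch.nn.functional.mse_loss(model(covariates), labels)
        loss.backward()
        opt.step()
```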
Other options
You can also specify the following additional options when constructing an `ffcv.loader.Loader`:

- `os_cache`: If `True`, the OS automatically determines whether the dataset is held in memory, depending on available RAM. If `False`, FFCV manages the caching, and the amount of RAM needed depends on the `order` option.
- `distributed`: For training on multiple GPUs.
- `seed`: Specify the random seed for batch ordering.
- `indices`: Provide indices to load a subset of the dataset.
- `custom_fields`: For specifying decoders for fields with custom encoders.
- `drop_last`: If `True`, drops the last non-full batch from each iteration.
- `batches_ahead`: Set the number of batches prepared in advance. Increasing it absorbs variation in processing time and helps ensure the training loop does not stall while waiting for batches; decreasing it reduces RAM usage.
- `recompile`: Recompile every iteration. Useful if you have transforms that change their behavior from epoch to epoch, for instance code that uses the shape as a compile-time parameter. (If they only change their memory usage, e.g., the resolution changes, this is not necessary.)
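For instance (a sketch combining options from the list above; the seed value and subset size are arbitrary choices), a loader that reads only the first 1,000 samples in a reproducible shuffled order and drops the final partial batch could be constructed as:

```python
loader = Loader('/path/to/dataset.beton',
                batch_size=BATCH_SIZE,
                num_workers=NUM_WORKERS,
                order=OrderOption.RANDOM,
                seed=42,                    # fixed seed for reproducible batch order
                indices=list(range(1000)),  # load only a subset of the dataset
                drop_last=True)             # drop the final non-full batch
```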
More information

For information on available transforms and the `Loader` class, see our API Reference.

For examples of constructing loaders and using them, see the tutorials Training CIFAR-10 in 36 seconds on a single A100 and Large-Scale Linear Regression.