Tuning Guide

FFCV is a generic library and achieving the very best performance may require tuning some options to fit the particular use case and computing resources.

In order to help users with that task, we consider a couple of common use cases and provide recommendations for setting parameters.

Scenario: Small dataset

If the dataset you are working on is small or if you are lucky enough to have a machine with large amounts of RAM, we recommend the following settings for ffcv.loader.Loader:

  • Use os_cache=True. The first epoch will be a little slower, as the operating system cannot pre-fetch the data as effectively, but once the dataset has been completely cached in RAM, subsequent epochs will read directly from memory with no overhead.

  • Set order to OrderOption.RANDOM or OrderOption.QUASI_RANDOM. They should both perform very similarly (QUASI_RANDOM might be marginally better).
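Put together, these settings translate to a Loader configuration along the following lines (the dataset path, batch size, and worker count are placeholders, and the pipelines argument is omitted for brevity):

```python
from ffcv.loader import Loader, OrderOption

# Dataset fits in RAM: let the OS page cache hold it after the first epoch.
loader = Loader(
    '/path/to/dataset.beton',  # placeholder path
    batch_size=256,
    num_workers=8,
    order=OrderOption.RANDOM,  # or OrderOption.QUASI_RANDOM
    os_cache=True,
)
```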

Scenario: Large scale datasets

If your dataset is too large to be cached on the machine we recommend:

  • Use os_cache=False. Since the data can’t be cached, FFCV will have to read it over and over. Having FFCV take over caching from the operating system is beneficial, as it knows in advance which samples will be needed in the future and can load them ahead of time.

  • For order, we recommend the QUASI_RANDOM traversal order if you need randomness but perfectly uniform sampling isn’t mission critical. This order minimizes reads on the underlying storage while maintaining very good randomness properties. If you have experience with the shuffle() function of webdataset and found the quality of its randomness insufficient, we still suggest you give QUASI_RANDOM a try, as it should be significantly better. Using RANDOM is infeasible in this situation because it would need to load the entire dataset into RAM, causing an out-of-memory exception.
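As a sketch, the recommendations above correspond to a configuration like this (path, batch size, and worker count are placeholders; the pipelines argument is omitted):

```python
from ffcv.loader import Loader, OrderOption

# Dataset larger than RAM: let FFCV manage caching and prefetching itself.
loader = Loader(
    '/path/to/dataset.beton',        # placeholder path
    batch_size=256,
    num_workers=8,
    order=OrderOption.QUASI_RANDOM,  # minimizes storage reads;
                                     # RANDOM would need the whole dataset in RAM
    os_cache=False,
)
```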

Scenario: Multi-GPU training (1 model, multiple GPUs)

FFCV’s Loader class offers a distributed flag that makes the loader behave similarly to PyTorch’s DistributedSampler used with its DataLoader. If that’s what your code is using, switching to FFCV should just be a matter of replacing the data loader.

FFCV should also work fine with PyTorch’s DataParallel wrapper but, in line with the PyTorch developers’ own recommendation, we suggest using DistributedDataParallel with FFCV’s distributed flag enabled.
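As a sketch, each per-GPU process (launched with a tool such as torchrun) would construct its own Loader with the distributed flag set; path, batch size, and worker count below are placeholders:

```python
from ffcv.loader import Loader, OrderOption

# One process per GPU; FFCV splits the dataset across ranks,
# much like PyTorch's DistributedSampler.
loader = Loader(
    '/path/to/dataset.beton',  # placeholder path
    batch_size=256,            # per-GPU batch size
    num_workers=8,
    order=OrderOption.RANDOM,  # QUASI_RANDOM is not supported with distributed=True
    os_cache=True,             # lets ranks on the same machine share cached pages
    distributed=True,
)
```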

The same recommendations above related to dataset size still apply here, but we emphasize that os_cache=True is particularly beneficial in this scenario. Since multiple processes access the same dataset, caching at the OS level allows the data to be shared between them, reducing overall memory consumption.


QUASI_RANDOM isn’t currently supported with distributed=True. While this is technically possible to implement, we haven’t yet invested the necessary time. It is on the medium-term roadmap, and we also welcome pull requests!

We encourage users to try different values for the num_workers parameter. As FFCV is usually very CPU-efficient, it is sometimes beneficial to use fewer workers to avoid scheduling and cache inefficiencies.

Scenario: Grid search (1 model per GPU)

This use case is similar to the previous one. One should still have one process per GPU, and if all models are trained on the same dataset, os_cache=True is preferred to allow cache sharing between the jobs. Note that if the dataset is bigger than the amount of main memory, os_cache=False might still perform better, and we encourage users to try both.

Scenario: Extreme grid search (2+ models per GPU)

Unlike other solutions, FFCV is thread based rather than process based. As a result, users are able to train multiple models on a single GPU. This is particularly useful for small models that can’t fully utilize a powerful GPU on their own. To do so, users have to do the following:

  • Run a single process per GPU

  • The main thread of that process should start one thread for each model which will be trained concurrently

  • Each thread creates its own FFCV Loader and model and trains normally

  • As for regular grid search, os_cache=True is usually the best choice here, but it doesn’t hurt to try disabling it for very large scale datasets
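The steps above can be sketched as follows. This is a minimal, stand-alone outline: train_model is a hypothetical stand-in for a real training function, which in practice would build its own FFCV Loader and model inside the thread.

```python
import threading

# Hypothetical per-model training function: in a real setup each thread
# would create its own FFCV Loader and model here, then run its loop:
#   loader = Loader(...); model = build_model().cuda(); ...
def train_model(model_id, results):
    results[model_id] = f'model-{model_id} trained'  # stand-in for real training

results = {}
# One thread per model; all threads live in a single process on one GPU.
threads = [threading.Thread(target=train_model, args=(i, results))
           for i in range(4)]  # e.g. 4 models sharing one GPU
for t in threads:
    t.start()
for t in threads:
    t.join()
```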


It is a common mistake to assume that running multiple processes on the same GPU will improve speed. Unless the NVIDIA MPS service is enabled, a GPU can only be used by a single process at a time (for isolation reasons). If you run more processes, GPU time will be time-sliced between them, but they will never run concurrently.


We have experienced some cuDNN bugs while running multiple models on the same GPU. The issue seems to originate from scheduling multiple BatchNorm layers concurrently. If you encounter it, a simple fix is to put a lock around the forward pass of your models, ensuring that no two forward passes are scheduled at the same time. This shouldn’t impact performance too much, as CUDA calls are asynchronous anyway.
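A minimal sketch of that workaround, assuming one training thread per model as described above (model and batch here are placeholders for your actual model and data):

```python
import threading

# Single lock shared by all training threads.
forward_lock = threading.Lock()

def training_step(model, batch):
    with forward_lock:         # no two forward passes are scheduled at once
        output = model(batch)  # CUDA kernels are only queued here, so the
                               # lock is held briefly
    # loss computation, backward(), and optimizer.step() can happen
    # outside the lock
    return output
```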