The Bottleneck Doctor


To summarize the scenarios from the Tuning Guide, the map below pairs each type of system bottleneck with the FFCV options that will help get the most performance out of your system:

Disk-read bottlenecks

Do your GPUs sit idle because of low disk throughput? Maybe you’re reading from a networked drive, maybe you have too many GPUs; either way, try:

  • If your dataset fits in memory, use OS-level page caching (enabled by default in FFCV) to ensure that concurrent training executions properly exploit caching.

  • If your dataset does not fit in memory, use process-level page caching (enabled by setting os_cache=False when constructing the ffcv.loader.Loader) to avoid caching the entire dataset at once.

  • Especially when using process-level caching, consider using the quasi-random data sampler, enabled using the order=OrderOption.QUASI_RANDOM argument to the Loader constructor. Quasi-random sampling tries to imitate random sampling while minimizing the underlying number of disk reads. (Again, note that QUASI_RANDOM is not yet supported for distributed training.)

  • Another option for computer vision datasets is storing resized images: many datasets have gigantic images that end up being resized and cropped anyway in the data augmentation pipeline. You can avoid paying the cost of loading these giant images by resizing them to an appropriate side length when writing the dataset with ffcv.writer.DatasetWriter (see the Working with Image Data in FFCV guide).

  • Similarly, you can store images in JPEG format to save both disk space and reading time, and lower the JPEG quality used when writing to shrink the dataset further.
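The caching and sampling advice above can be sketched as a small decision helper. This is a minimal sketch, not part of FFCV: choose_loader_options and build_loader are hypothetical names, the headroom factor of 0.8 is an assumed safety margin, and the batch size and worker count are placeholders. Only the os_cache, order, and distributed arguments come from the Loader itself.

```python
import os


def choose_loader_options(dataset_bytes, ram_bytes, distributed=False, headroom=0.8):
    """Pick caching and ordering options (hypothetical helper, not part of FFCV).

    `headroom` is an assumed safety margin: the dataset counts as "fitting in
    memory" only if it occupies at most that fraction of RAM.
    """
    if dataset_bytes <= headroom * ram_bytes:
        # Dataset fits: keep FFCV's default OS-level page cache.
        return {"os_cache": True, "order": "RANDOM"}
    # Dataset does not fit: use process-level caching, with quasi-random order
    # unless training is distributed (QUASI_RANDOM is not yet supported there).
    return {"os_cache": False, "order": "RANDOM" if distributed else "QUASI_RANDOM"}


def build_loader(beton_path, ram_bytes, batch_size=256, num_workers=8, distributed=False):
    """Construct a Loader from the options above (requires FFCV to be installed)."""
    from ffcv.loader import Loader, OrderOption  # imported lazily

    opts = choose_loader_options(os.path.getsize(beton_path), ram_bytes, distributed)
    return Loader(
        beton_path,
        batch_size=batch_size,
        num_workers=num_workers,
        os_cache=opts["os_cache"],
        order=getattr(OrderOption, opts["order"]),
        distributed=distributed,
    )
```

Because no pipelines are passed, the Loader falls back to each field's default pipeline; a real training script would normally supply its own.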

CPU bottlenecks

All CPUs at 100% and you’re still not hitting maximal GPU usage? Consider the following:

  • Use pre-made, JIT-compiled augmentations from ffcv.transforms: these augmentations use pre-allocated pinned memory, and are fused together and compiled to machine code at runtime, making them a much faster alternative to standard data augmentation functions.

  • Make your own JIT-compiled augmentations: If you don’t see your desired augmentation among the pre-implemented ones, implementing your own efficient augmentation is simple and only requires implementing a single Python function. See any of the existing augmentations for an example, or read the Customization guide (coming soon!) for a tutorial.

  • Store (some) raw pixel data: FFCV allows you to smoothly trade off I/O workload and compute workload (raw pixels require no JPEG decoding) by randomly storing a specified fraction of the dataset as raw pixel data instead of JPEG.
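As a rough illustration of the raw-pixel trade-off, the sketch below turns a desired raw fraction into keyword arguments for ffcv.fields.RGBImageField. The helper names image_field_kwargs and write_dataset are hypothetical, as are the default resolution and worker count. It assumes that in the 'proportion' write mode, compress_probability is the probability that an image is stored as JPEG, so the raw fraction is one minus that value.

```python
def image_field_kwargs(raw_fraction, max_resolution=None, jpeg_quality=90):
    """Build RGBImageField keyword arguments (hypothetical helper, not part of FFCV)."""
    if not 0.0 <= raw_fraction <= 1.0:
        raise ValueError("raw_fraction must be between 0 and 1")
    kwargs = {"jpeg_quality": jpeg_quality}
    if max_resolution is not None:
        # Cap the stored side length so oversized images are shrunk at write time.
        kwargs["max_resolution"] = max_resolution
    if raw_fraction == 1.0:
        kwargs["write_mode"] = "raw"
    elif raw_fraction == 0.0:
        kwargs["write_mode"] = "jpg"
    else:
        # Mixed storage: each image is JPEG-compressed with this probability.
        kwargs["write_mode"] = "proportion"
        kwargs["compress_probability"] = 1.0 - raw_fraction
    return kwargs


def write_dataset(dataset, beton_path, raw_fraction, max_resolution=256, num_workers=8):
    """Write an indexed (image, label) dataset to disk (requires FFCV to be installed)."""
    from ffcv.fields import IntField, RGBImageField  # imported lazily
    from ffcv.writer import DatasetWriter

    fields = {
        "image": RGBImageField(**image_field_kwargs(raw_fraction, max_resolution)),
        "label": IntField(),
    }
    DatasetWriter(beton_path, fields, num_workers=num_workers).from_indexed_dataset(dataset)
```

For example, image_field_kwargs(0.25, max_resolution=256) requests that roughly a quarter of the images be stored raw and the rest as JPEG. Raising raw_fraction shifts work from the CPU (JPEG decoding) to the disk (larger reads), so the right value depends on which of the two is the bottleneck.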

GPU bottlenecks

Even if you’re not bottlenecked by data loading, FFCV can still help you accelerate your system:

  • Asynchronous CPU-GPU data transfer: we always asynchronously transfer data, and also include tools for ensuring unblocked GPU execution.

  • Train multiple models on the same GPU: fully asynchronous thread-based data loading means that, unlike with other data loading systems, different training processes using FFCV running on the same GPU won’t block each other.

  • Offload compute to the CPU: because FFCV offers extremely fast JIT-compiled data transformations, it’s often helpful to move parts of the data pipeline (e.g., input normalization or image augmentation) to the CPU; FFCV will handle compilation and parallelization of these functions so that the CPU-induced slowdown isn’t too much, and the freed-up GPU time can be used for more GPU-intensive tasks (e.g., matrix multiplication).


This list is limited to what FFCV offers in data loading; check out guides like the PyTorch performance guide for more model-based ways to speed up training.