
Large-Scale Linear Regression

In this example, we’ll see how to run large-scale regularized linear regression with FFCV (by “large-scale” here we mean a dataset that definitely doesn’t fit in GPU memory, and may barely even fit in RAM).

See here for the script corresponding to this tutorial.

Setup: Generating a fake dataset

Let’s start by generating a fake dataset on which we’ll run linear regression. We’ll generate the independent variables (also known as the covariates or inputs) as random uniform vectors, and the dependent variable (also known as the responses or outputs) as a noised linear function of the independent variables, given by a fixed weight vector and bias:

import numpy as np
import pickle as pkl

# 1,000,000 inputs, each of dimension 10,000 = 40GB of data
N, D = 1000000, 10000
X = np.random.rand(N, D).astype('float32')
# Ground-truth weight vector and bias
W, b = np.random.rand(D).astype('float32'), np.random.rand()
# Response variables: noised linear function of the inputs
Y = X @ W + b + np.random.randn(N)
# Save the dataset:
pkl.dump((X, W, b, Y), open('/tmp/linreg_data.pkl', 'wb'))

Basic code template

Our goal is, given X and Y, to recover the true parameters W and b. We will accomplish this by running SGD on the squared loss:

import torch as ch
from tqdm import tqdm
import time

train_loader = None # TODO!

# Calculate data mean and variance for normalization
def calculate_stats(loader, N):
    mean, stdev = 0., 0.
    for x_batch, _ in tqdm(loader):
        mean += x_batch.sum(0) / N
        stdev += x_batch.pow(2).sum(0) / N
    return mean, ch.sqrt(stdev - mean.pow(2))

script_start = time.time()
mean, stdev = calculate_stats(train_loader, N)
mean, stdev = mean.cuda(), stdev.cuda()
w_est, b_est = ch.zeros(D).cuda(), ch.zeros(1).cuda() # Initial guesses for W and b
num_epochs = 10 # Number of full passes over the data to do

lr = 5e-2
for _ in range(num_epochs):
    total_loss, num_examples = 0., 0.
    start_time = time.time()
    for (x_batch, y_batch) in tqdm(train_loader):
        x_batch = x_batch.cuda()
        y_batch = y_batch.cuda()
        # Normalize the data for stability
        x_batch = (x_batch - mean) / stdev
        residual = x_batch @ w_est + b_est - y_batch
        # Gradients
        w_grad = x_batch.T @ residual / x_batch.shape[0]
        b_grad = ch.mean(residual, dim=0)
        w_est = w_est - lr * w_grad
        b_est = b_est - lr * b_grad
        total_loss += residual.pow(2).sum()
        num_examples += x_batch.shape[0]
    print('Epoch time:', time.time() - start_time)
    print(f'Average loss: {total_loss / num_examples:.3f} | '
          f'Norm diff: {ch.norm(w_est / stdev - ch.tensor(W).cuda()).cpu().item():.3f}')

print(f'Total script running time: {time.time() - script_start:.2f}s')

Note

In general, using vanilla gradient descent to solve regularized linear regression is a bad idea; users are better served by algorithms based on conjugate gradients or variance-reduced gradient methods. That said, exactly the same data loading principles apply to any algorithm, so we stick with gradient descent here to keep the code as simple as possible.
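
If you do want the “regularized” part of regularized linear regression, the only change to the update above is an extra penalty term in the weight gradient. Below is a minimal sketch assuming an L2 (ridge) penalty, where the penalty strength lam is a hypothetical hyperparameter not used anywhere else in this tutorial:

# Sketch: ridge-regularized gradient step; lam is a hypothetical hyperparameter
lam = 1e-3
for (x_batch, y_batch) in tqdm(train_loader):
    x_batch, y_batch = x_batch.cuda(), y_batch.cuda()
    x_batch = (x_batch - mean) / stdev
    residual = x_batch @ w_est + b_est - y_batch
    # Gradient of the squared loss plus the gradient of (lam / 2) * ||w||^2
    w_grad = x_batch.T @ residual / x_batch.shape[0] + lam * w_est
    b_grad = ch.mean(residual, dim=0)
    w_est = w_est - lr * w_grad
    b_est = b_est - lr * b_grad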

Naive approach: PyTorch TensorDataset

The only thing that remains unspecified in our implementation above is the train_loader. The standard way of making a loader here would be to use PyTorch’s built-in TensorDataset class, as follows:

import pickle as pkl
from torch.utils.data import TensorDataset, DataLoader

X, W, b, Y = pkl.load(open('/tmp/linreg_data.pkl', 'rb'))
dataset = TensorDataset(ch.tensor(X), ch.tensor(Y))
train_loader = DataLoader(dataset, batch_size=2048, num_workers=8, shuffle=True)
# ... rest of code as above

The resulting code is runnable and correct. It will use 40GB of memory, since the entire tensor X will be kept in RAM. Running our script in an environment with a single A100 GPU and 8 CPU cores takes 16 seconds per epoch.
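
If you want to verify the memory footprint yourself, one option is to check the process’s resident set size; the short sketch below assumes the psutil package is installed (it is not used anywhere else in this tutorial):

import psutil

# Print the current process's resident set size in GB
rss_gb = psutil.Process().memory_info().rss / 1e9
print(f'Resident memory: {rss_gb:.1f} GB')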

Speeding things up with FFCV

We’ll now try to improve on these results by replacing the standard PyTorch data loading pipeline with FFCV. The first step is to write X and Y out as an FFCV dataset (as detailed in the Writing a dataset to FFCV format guide):

from ffcv.fields import NDArrayField
from ffcv.writer import DatasetWriter

class LinearRegressionDataset:
    def __getitem__(self, idx):
        return (X[idx], np.array(Y[idx]).astype('float32'))

    def __len__(self):
        return len(X)

writer = DatasetWriter('/tmp/linreg_data.beton', {
    'covariate': NDArrayField(shape=(D,), dtype=np.dtype('float32')),
    'label': NDArrayField(shape=(1,), dtype=np.dtype('float32')),
}, num_workers=16)

writer.from_indexed_dataset(LinearRegressionDataset())

This allows us to replace the TensorDataset from the previous section with an FFCV data loader:

from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import NDArrayDecoder
from ffcv.transforms import ToTensor, Squeeze, ToDevice

train_loader = Loader('/tmp/linreg_data.beton', batch_size=2048,
            num_workers=8, order=OrderOption.RANDOM,
            pipelines={
                'covariate': [NDArrayDecoder(), ToTensor(), ToDevice(ch.device('cuda:0'))],
                'label': [NDArrayDecoder(), ToTensor(), Squeeze(), ToDevice(ch.device('cuda:0'))]
            })

With just this simple substitution, our code goes from 16 seconds per epoch on an A100 GPU to 6 seconds.

As expected, GPU utilization also increases dramatically, since data loading is no longer the bottleneck; this opens the door to optimizations elsewhere that make the code even faster!
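
One quick way to confirm this is to poll nvidia-smi while the training loop is running, either from a separate shell or programmatically as in the sketch below (this snippet is not part of the tutorial script and assumes the NVIDIA driver utilities are installed):

import subprocess

# Print the current GPU utilization reported by nvidia-smi
out = subprocess.run(['nvidia-smi', '--query-gpu=utilization.gpu', '--format=csv,noheader'],
                     capture_output=True, text=True)
print('GPU utilization:', out.stdout.strip())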

More speed, less memory

We conclude this guide by suggesting a few ways to make our linear regression program even faster, and to reduce its memory footprint:

  • In our example above, FFCV caches the entire dataset in memory, which means that, if RAM is insufficient, the program will not error out (unlike the TensorDataset example, which will simply crash); it will just become significantly slower. An alternative discussed in the Tuning Guide that we didn’t explore here is to initialize the loader with os_cache=False and order=OrderOption.QUASI_RANDOM (see the first sketch below): this disables caching of the full dataset (and thus works with very little memory), and reads examples in an order that is nearly random but still minimizes the number of underlying disk reads.

  • We can also optimize the main loop itself: for example, the gradient updates and the normalization should both be performed in-place (see the second sketch below). Since data loading is no longer the main bottleneck, such optimizations translate directly into faster epochs.
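
As a concrete starting point for the first suggestion, here is a minimal sketch of the low-memory loader configuration: it mirrors the loader above and only changes the os_cache and order arguments mentioned in the Tuning Guide.

from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import NDArrayDecoder
from ffcv.transforms import ToTensor, Squeeze, ToDevice

# Sketch: low-memory loader that streams from disk instead of caching everything
train_loader = Loader('/tmp/linreg_data.beton', batch_size=2048,
            num_workers=8,
            os_cache=False,                  # don't cache the full dataset in RAM
            order=OrderOption.QUASI_RANDOM,  # nearly random order, few disk reads
            pipelines={
                'covariate': [NDArrayDecoder(), ToTensor(), ToDevice(ch.device('cuda:0'))],
                'label': [NDArrayDecoder(), ToTensor(), Squeeze(), ToDevice(ch.device('cuda:0'))]
            })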
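
And here is a minimal sketch of the second suggestion, rewriting the inner loop with in-place tensor operations (the variables are the ones from the main script; the loss bookkeeping and prints are omitted for brevity):

# Sketch: inner loop with in-place normalization and parameter updates
for (x_batch, y_batch) in train_loader:
    x_batch, y_batch = x_batch.cuda(), y_batch.cuda()
    # Normalize in-place instead of allocating new tensors
    x_batch.sub_(mean).div_(stdev)
    residual = x_batch @ w_est + b_est - y_batch
    w_grad = x_batch.T @ residual / x_batch.shape[0]
    b_grad = ch.mean(residual, dim=0)
    # Update the parameters in-place
    w_est.sub_(lr * w_grad)
    b_est.sub_(lr * b_grad)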