Scheduler
Both of the following schedulers tend to converge faster, at the cost of a few extra hyper-parameters (a minimum and a maximum learning rate).
- CyclicLR (PyTorch 1.10.1 documentation), based on @smith_2017
- OneCycleLR (PyTorch 1.10.1 documentation)
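A minimal sketch of attaching either scheduler to an optimizer; the model, optimizer, and learning-rate bounds below are placeholder values, not tuned ones. Note that both schedulers are stepped once per batch, not once per epoch:

import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# Option 1: CyclicLR oscillates the LR between base_lr and max_lr
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-4, max_lr=1e-2, step_size_up=2000
)

# Option 2: OneCycleLR ramps up to max_lr once, then anneals until the end
# scheduler = torch.optim.lr_scheduler.OneCycleLR(
#     optimizer, max_lr=1e-2, total_steps=10_000
# )

# In the training loop, step the scheduler right after optimizer.step():
#   optimizer.step()
#   scheduler.step()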
Dataloader
- Use multiple data workers (num_workers > 0) to speed up loading the dataset, but be aware of data duplication:
  - For a map-style dataset, data is retrieved with indices generated by the sampler, so no duplication occurs;
  - For an iterable-style dataset, each worker needs its own handling via its init function and parameters (e.g. worker_init_fn or get_worker_info()), otherwise every worker returns the same data; see the sketch after this list;
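A minimal sketch of the iterable-style case, using torch.utils.data.get_worker_info() to give each worker a disjoint shard; the toy range dataset stands in for a real data stream:

import torch
from torch.utils.data import IterableDataset, DataLoader

class RangeStream(IterableDataset):
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = torch.utils.data.get_worker_info()
        if info is None:
            # Single-process loading: yield the whole range
            lo, hi = self.start, self.end
        else:
            # Multi-process loading: give each worker a disjoint slice,
            # otherwise every worker would yield identical (duplicate) data
            per_worker = (self.end - self.start + info.num_workers - 1) // info.num_workers
            lo = self.start + info.id * per_worker
            hi = min(lo + per_worker, self.end)
        yield from range(lo, hi)

loader = DataLoader(RangeStream(0, 100), num_workers=4)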
- pin_memory=True speeds up data transfer from host memory to GPU memory.
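A minimal sketch combining both points, assuming a map-style train_dataset defined elsewhere; pinned (page-locked) host memory lets the non_blocking copy to the GPU overlap with computation:

from torch.utils.data import DataLoader

loader = DataLoader(
    train_dataset,   # assumed to be a map-style dataset defined elsewhere
    batch_size=64,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)

for data, label in loader:
    # non_blocking=True only helps when the source tensors are pinned
    data = data.to("cuda", non_blocking=True)
    label = label.to("cuda", non_blocking=True)
    # ... forward / backward pass ...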
Automatic Mixed Precision
import torch

# Creates once at the beginning of training
scaler = torch.cuda.amp.GradScaler()

for data, label in data_iter:
    optimizer.zero_grad()

    # Casts operations to mixed precision
    with torch.cuda.amp.autocast():
        loss = model(data)

    # Scales the loss, and calls backward()
    # to create scaled gradients
    scaler.scale(loss).backward()

    # Unscales gradients and calls
    # or skips optimizer.step()
    scaler.step(optimizer)

    # Updates the scale for next iteration
    scaler.update()