Scheduler
Both of the following schedulers tend to converge faster, at the cost of a few extra hyper-parameters (a minimum and a maximum learning rate).
- CyclicLR (PyTorch 1.10.1 documentation), based on @smith_2017
- OneCycleLR (PyTorch 1.10.1 documentation)
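A minimal sketch of attaching either scheduler to an optimizer; the model, optimizer, and learning-rate bounds below are placeholder values, not tuned ones. Note that both schedulers are stepped once per batch, not once per epoch:

import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# Option 1: CyclicLR oscillates the LR between base_lr and max_lr
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-4, max_lr=1e-2, step_size_up=2000
)

# Option 2: OneCycleLR ramps up to max_lr once, then anneals until the end
# scheduler = torch.optim.lr_scheduler.OneCycleLR(
#     optimizer, max_lr=1e-2, total_steps=10_000
# )

# In the training loop, step the scheduler right after optimizer.step():
#   optimizer.step()
#   scheduler.step()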
Dataloader
- Use multiple data workers (num_workers > 0) to speed up loading the dataset, but be aware of data duplication:
  - For a map-style dataset, data is retrieved with indices generated by the sampler, so no duplication occurs;
  - For an iterable-style dataset, each worker needs its own handling via its init function and parameters (e.g. worker_init_fn or get_worker_info()), otherwise every worker returns the same data; see the sketch after this list;
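A minimal sketch of the iterable-style case, using torch.utils.data.get_worker_info() to give each worker a disjoint shard; the toy range dataset stands in for a real data stream:

import torch
from torch.utils.data import IterableDataset, DataLoader

class RangeStream(IterableDataset):
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = torch.utils.data.get_worker_info()
        if info is None:
            # Single-process loading: yield the whole range
            lo, hi = self.start, self.end
        else:
            # Multi-process loading: give each worker a disjoint slice,
            # otherwise every worker would yield identical (duplicate) data
            per_worker = (self.end - self.start + info.num_workers - 1) // info.num_workers
            lo = self.start + info.id * per_worker
            hi = min(lo + per_worker, self.end)
        yield from range(lo, hi)

loader = DataLoader(RangeStream(0, 100), num_workers=4)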
- pin_memory=True speeds up data transfer from host memory to GPU memory.
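A minimal sketch combining both points, assuming a map-style train_dataset defined elsewhere; pinned (page-locked) host memory lets the non_blocking copy to the GPU overlap with computation:

from torch.utils.data import DataLoader

loader = DataLoader(
    train_dataset,   # assumed to be a map-style dataset defined elsewhere
    batch_size=64,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)

for data, label in loader:
    # non_blocking=True only helps when the source tensors are pinned
    data = data.to("cuda", non_blocking=True)
    label = label.to("cuda", non_blocking=True)
    # ... forward / backward pass ...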
Automatic Mixed Precision
import torch

# Creates once at the beginning of training
scaler = torch.cuda.amp.GradScaler()

for data, label in data_iter:
    optimizer.zero_grad()

    # Casts operations to mixed precision
    with torch.cuda.amp.autocast():
        loss = model(data)

    # Scales the loss, and calls backward()
    # to create scaled gradients
    scaler.scale(loss).backward()

    # Unscales gradients and calls
    # or skips optimizer.step()
    scaler.step(optimizer)

    # Updates the scale for next iteration
    scaler.update()