Training on a single GPU (limitations, bottlenecks)
Mixed-precision training: FP32, BF16, FP16, FP8
Data parallelism and All-Reduce
ZeRO optimization stages
Fully Sharded Data Parallel (FSDP)
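Why FP16 training (unlike BF16) needs loss scaling can be shown with nothing but the standard library: `struct`'s `"e"` format round-trips a value through IEEE 754 half precision. This is a minimal simulation of the loss-scaling idea, not any framework's API; the `to_fp16` helper and the scale value are illustrative choices.

```python
import struct

def to_fp16(x: float) -> float:
    # Round-trip a Python float through IEEE 754 half precision
    # (struct's "e" format), simulating storage in FP16.
    return struct.unpack("e", struct.pack("e", x))[0]

# A tiny gradient underflows to zero in FP16 (smallest subnormal ~6e-8)...
grad = 1e-8
assert to_fp16(grad) == 0.0

# ...but survives if the loss (and therefore every gradient) is multiplied
# by a scale before the backward pass, then unscaled in FP32 afterwards.
scale = 2.0 ** 16                 # a common initial loss scale (65536)
recovered = to_fp16(grad * scale) / scale
assert recovered != 0.0
```

BF16 keeps FP32's 8-bit exponent, so its dynamic range makes this scaling dance largely unnecessary; its cost is coarser mantissa precision.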
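The All-Reduce collective that data-parallel training uses to sum gradients across workers can be sketched as a pure-Python simulation of the ring algorithm (reduce-scatter followed by all-gather). `ring_all_reduce` is a hypothetical teaching function, not a real library call; real systems use e.g. NCCL under the hood.

```python
def ring_all_reduce(inputs):
    """Simulate ring All-Reduce (sum): inputs[i][c] is worker i's value for
    chunk c, with as many chunks as workers. Every worker ends holding the
    elementwise sum over all workers."""
    n = len(inputs)
    data = [list(w) for w in inputs]          # each worker's local buffer
    # Phase 1: reduce-scatter. In step s, worker i sends chunk (i - s) mod n
    # to its ring neighbour (i + 1) mod n, which accumulates it. After n - 1
    # steps, worker i holds the fully reduced chunk (i + 1) mod n.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            data[(i + 1) % n][c] += data[i][c]
    # Phase 2: all-gather. Each worker forwards a fully reduced chunk around
    # the ring; after n - 1 more steps every worker has every reduced chunk.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            data[(i + 1) % n][c] = data[i][c]
    return data

grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]   # 3 workers, 3 chunks
result = ring_all_reduce(grads)
assert all(buf == [111, 222, 333] for buf in result)
```

The appeal of the ring layout is bandwidth optimality: each worker sends roughly 2(n-1)/n of the tensor in total, independent of the number of workers.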
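The core idea of ZeRO stage 1 can also be sketched in plain Python: every worker keeps full parameters and the already-averaged gradients, but the optimizer state (here SGD momentum) is partitioned so each rank stores and updates only its 1/N shard. The function `zero1_step` and its parameters are illustrative assumptions, not DeepSpeed's actual interface.

```python
def zero1_step(params, grad, momentum_shards, n_workers, lr=0.1, beta=0.9):
    """One ZeRO-1-style update. momentum_shards[rank] holds only rank's
    1/n_workers slice of the momentum state (assumes even divisibility)."""
    shard = len(params) // n_workers
    new_params = list(params)
    for rank in range(n_workers):
        lo, hi = rank * shard, (rank + 1) * shard
        for j in range(lo, hi):              # each rank updates only its shard
            momentum_shards[rank][j - lo] = (
                beta * momentum_shards[rank][j - lo] + grad[j]
            )
            new_params[j] = params[j] - lr * momentum_shards[rank][j - lo]
    return new_params    # in a real system: all-gather the updated shards

params = [1.0, 1.0, 1.0, 1.0]
grad = [0.5, 0.5, 0.5, 0.5]              # already averaged via All-Reduce
momentum = [[0.0, 0.0], [0.0, 0.0]]      # 2 workers, 2-element shards each
params = zero1_step(params, grad, momentum, n_workers=2)
assert all(abs(p - 0.95) < 1e-12 for p in params)
```

Stages 2 and 3 extend the same partitioning to gradients and then to the parameters themselves, which is essentially what PyTorch FSDP implements.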
- 📄 DeepSpeed ZeRO — memory optimization
- 📄 ZeRO-Offload — CPU offloading
- 📄 PyTorch FSDP — official documentation
- 📄 Mixed Precision Training — fundamentals
- 🤗 HF Playbook — hands-on training guide [HIGHLY RECOMMENDED]
- 📄 NVIDIA FP8 Training — low precision training