| Topic | Details |
| Topic 1 | - Control Plane Installation and Configuration: Covers deploying the software stack including Base Command Manager, OS, Slurm
- Enroot
- Pyxis, NVIDIA GPU and DOCA drivers, container toolkit, and NGC CLI.
|
| Topic 2 | - System and Server Bring-up: Covers end-to-end physical setup of GPU-based AI infrastructure, including BMC
- OOB
- TPM configuration, firmware upgrades, hardware installation, and power and cooling validation to ensure servers are workload-ready.
|
| Topic 3 | - Troubleshoot and Optimize: Covers identifying and replacing faulty hardware components such as GPUs, network cards, and power supplies, along with performance optimization for AMD
- Intel servers and storage.
|
| Topic 4 | - Cluster Test and Verification: Covers full cluster validation through HPL and NCCL benchmarks, NVLink and fabric bandwidth tests, cable and firmware checks, and burn-in testing using HPL, NCCL, and NeMo.
|
| Topic 5 | - Physical Layer Management: Covers configuring BlueField network platform devices and setting up Multi-Instance GPU (MIG) partitioning for AI and HPC workloads.
|