This repo contains a distributed-training benchmark harness for GPT-2 on NYU Big Purple. It launches multi-node runs through Slurm, writes structured artifacts per run, supports optional Nsight Systems profiling, and includes a checked-in DeepSpeed communication-tuning comparison that improved throughput by about 19.5% under fixed-work conditions.
The point of the project is not just to train GPT-2. It is to make multi-node runs comparable, inspectable, and easier to debug.
- Slurm launch workflow for 1-8 V100 GPT-2 benchmark runs.
- PyTorch/DeepSpeed training entrypoint with fixed-work controls.
- Batch-size invariant checks so runs do not silently compare different workloads.
- Structured `training_metrics.json` artifacts with git commit, command line, environment, hardware, Slurm metadata, throughput, step-time percentiles, CUDA memory, and completion markers.
- Optional NCCL and Nsight Systems capture paths for communication/debug artifacts.
- Parser utilities that turn Nsight text output into compact profiling summaries.
- Tests for metrics schema, scaling-report logic, artifact validation, and profiling parser behavior.
flowchart TD
A[Slurm launcher<br/>scripts/slurm/run_2node_8gpu.sbatch] --> B[srun / torchrun<br/>rank and rendezvous setup]
B --> C[Distributed training workers<br/>src/gpt2.py]
C --> D[Dataset + train/val loop]
C --> E[DeepSpeed + torch.distributed]
E --> F[NCCL / CUDA / 2 nodes x 4 V100]
A --> G[Optional debug / profiling toggles<br/>NSYS, NCCL_LOGS, DIST_DEBUG]
G --> B
C --> H[training_metrics.json]
C --> I[launcher_metadata.json]
C --> J[RUN_COMPLETE.txt]
B --> K[profiles/nsys_*.nsys-rep]
B --> L[nccl_rank_*.log / nccl_topo.xml / ibstat.txt / topo.txt]
K --> M[nsys stats text export]
M --> N[scripts/profiling/parse_nsys_stats.py]
N --> O[profiles/profile_summary.json]
H --> P[scripts/generate_scaling_table.py]
H --> Q[scripts/run_scaling_benchmarks.py]
H --> R[scripts/verify_run_artifacts.py]
The project has three layers:
- Orchestration: Slurm + `srun` + `torchrun` launch the multi-node job and enable profiling/debug modes.
- Runtime: `src/gpt2.py` drives the train loop while DeepSpeed, `torch.distributed`, NCCL, and CUDA handle distributed execution.
- Observability and analysis: each run emits structured artifacts, and the Python tooling turns raw profiler output into summaries that are easier to compare.
The full benchmark requires NYU Big Purple, Slurm, V100 GPUs, and local scratch storage. A fresh clone can still verify the documentation-relevant tooling:
python -m pytest -q
python scripts/generate_scaling_table.py \
--gpu1 tests/fixtures/repeat/1gpu_run1.json \
--gpu2 tests/fixtures/repeat/2gpu_run1.json \
--gpu4 tests/fixtures/repeat/4gpu_run1.json

What this verifies:
- Metrics JSON schema and Slurm metadata expectations.
- Scaling report calculations and invariant checks.
- Nsight parser behavior on fixture-style inputs.
- That local validation does not require fabricating a cluster rerun.
Recorded run context:
- Cluster: NYU Big Purple (Slurm)
- Nodes: 2 nodes (examples seen in runs: `gn-0013`, `gn-0014`)
- GPUs: Tesla V100-SXM2-16GB, 4 per node, 8 GPUs total (`world_size=8`)
- Python 3.11.14, PyTorch 2.9.1+cu128, DeepSpeed 0.18.3, Transformers 4.57.3, CUDA 12.8
- Model: GPT-2 (n_layer=12, n_head=12, n_embd=768), `seq_len=512`
- Precision: fp16 via DeepSpeed, ZeRO stage=1
- Dataset: `train_small` / `val_small` (subset sizes recorded in `training_metrics.json`)
Prereqs:
- `train_small.bin` / `val_small.bin` present in repo root (generate them with the data-preparation commands below).
- Run from the repo root.

python scripts/1_download_data.py
python scripts/preprocess_small.py

Use the following launch for throughput numbers. It keeps profiling off and runs a fixed work window.
RUN_DIR=/gpfs/scratch/$USER/GPT2-Optimization/benchmarks/bigpurple_v100_$(date +%F)/8gpu_2node_accum2_300 \
NSYS=0 NCCL_LOGS=0 TORCHRUN_LOGS=0 DIST_DEBUG=0 \
GRAD_ACCUM_STEPS=2 MICRO_BATCH_SIZE_PER_GPU=2 \
GPT2_EXTRA_ARGS="--profile_mode --max_train_steps 300 --max_val_steps 50" \
sbatch scripts/slurm/run_2node_8gpu.sbatch

Notes:
- `--profile_mode` throttles hot-loop logging/tqdm and adds stable, high-level NVTX ranges (`train/*`, `val/*`) on top of DeepSpeed's NVTX ranges.
- `--max_train_steps` / `--max_val_steps` bound the run for quick, repeatable comparisons.
- The exact command line is also recorded under `training_metrics.json["command_line"]`.
Use the following profiling launch for attribution, not for headline throughput.
RUN_DIR=/gpfs/scratch/$USER/GPT2-Optimization/benchmarks/bigpurple_v100_$(date +%F)/8gpu_2node_accum2_bucket200_nsys80 \
NSYS=1 NCCL_LOGS=0 TORCHRUN_LOGS=1 DIST_DEBUG=0 \
GRAD_ACCUM_STEPS=2 MICRO_BATCH_SIZE_PER_GPU=2 \
GPT2_EXTRA_ARGS="--profile_mode --max_train_steps 80 --max_val_steps 0" \
sbatch scripts/slurm/run_2node_8gpu.sbatch

Each Slurm run writes a run directory at `RUN_DIR` containing:
- `training_metrics.json` (rank0): schema v2.0 metrics, including tokens/sec, wall time, batch config, and Slurm metadata when available.
- `RUN_COMPLETE.txt` (rank0): completion marker including `world_size`, `tokens_per_sec`, and `total_wall_time_sec`.
- `launcher_metadata.json` (rank0): launcher context (host, env summary, Slurm info).
- Checkpoints (e.g. `epoch-1`): large model artifacts; not meant for git.
Optional NCCL/debug artifacts (enable with NCCL_LOGS=1):
- `nccl_topo.xml`: NCCL topology dump.
- `nccl_rank_<host>_<pid>.log`: per-rank NCCL debug logs (these runs show "Using network IB").
- `ibstat.txt`, `topo.txt`: network + GPU topology evidence (e.g., `mlx5_0` speed `100000`).
Optional profiling artifacts (enable with NSYS=1):
- `profiles/nsys_<jobid>_<host>.nsys-rep`
- `profiles/nsys_<jobid>_<host>.sqlite`
- `profiles/nsys_stats_<host>.txt` (NVTX/OSRT/CUDA API summaries)
- `profiles/profile_summary.json` (parsed top5, generated by `scripts/profiling/parse_nsys_stats.py`)
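A quick completeness check over a run directory can be sketched as follows. This is a hypothetical helper, not the repo's `scripts/verify_run_artifacts.py`; the name `missing_artifacts` is an assumption, and it only tests that the rank0 files listed above exist, not that their contents are valid.

```python
from pathlib import Path

# Rank0 files that every run directory is expected to contain.
REQUIRED_ARTIFACTS = (
    "training_metrics.json",
    "RUN_COMPLETE.txt",
    "launcher_metadata.json",
)


def missing_artifacts(run_dir, required=REQUIRED_ARTIFACTS):
    """Return the required file names that are absent from run_dir."""
    root = Path(run_dir)
    return [name for name in required if not (root / name).is_file()]
```

An empty list from `missing_artifacts(run_dir)` means all three rank0 files are present; optional NCCL and Nsight outputs are deliberately left out because they depend on `NCCL_LOGS` and `NSYS`.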
Checked-in artifacts used in this README live under:
artifacts/feature4_bigpurple_v100_2026-01-28/
Comparable A/B setup: constant world_size=8, seq_len=512, micro_batch=2, grad_accum=2, max_train_steps=300, max_val_steps=50.
Fixed variables for the A/B table:
| Variable | Value |
|---|---|
| World size | 8 GPUs |
| Hardware | 2 Big Purple nodes, 4 V100-SXM2-16GB GPUs per node |
| Sequence length | 512 |
| Micro-batch size per GPU | 2 |
| Gradient accumulation | 2 |
| Global batch size | 32 |
| Training steps | 300 |
| Validation steps | 50 |
| Precision | fp16 with DeepSpeed ZeRO-1 |
| Profiling | off for throughput table (NSYS=0) |
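The global batch size in the table follows from the other fixed variables. A minimal sketch of that invariant, assuming the usual data-parallel relation (micro-batch per GPU times gradient-accumulation steps times world size); the function names are hypothetical and not taken from `src/gpt2.py`:

```python
def global_batch_size(micro_batch_per_gpu, grad_accum_steps, world_size):
    """Sequences consumed per optimizer step across all ranks."""
    return micro_batch_per_gpu * grad_accum_steps * world_size


def check_global_batch(micro_batch_per_gpu, grad_accum_steps, world_size, expected):
    """Raise ValueError when the configuration does not yield the expected batch."""
    actual = global_batch_size(micro_batch_per_gpu, grad_accum_steps, world_size)
    if actual != expected:
        raise ValueError(
            f"global batch {actual} != expected {expected}; "
            "runs would compare different workloads"
        )
    return actual


# Values from the table: 2 per GPU x 2 accumulation steps x 8 GPUs.
print(check_global_batch(2, 2, 8, expected=32))  # 32
```

Failing loudly on a mismatch is the point: a run with a different `GRAD_ACCUM_STEPS` would otherwise produce a throughput number that looks comparable but is not.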
Changes applied for the tuned run:
- `src/deepspeed_config.json`: set `zero_optimization.reduce_bucket_size=200000000` and `zero_optimization.allgather_bucket_size=200000000` (about 200MB).
- `src/deepspeed_config.json`: disabled activation checkpoint partitioning (`activation_checkpointing.partition_activations=false`) for this workload.
Minimal config snippet:
{
"zero_optimization": {
"stage": 1,
"reduce_bucket_size": 200000000,
"allgather_bucket_size": 200000000
},
"activation_checkpointing": {
"partition_activations": false
}
}

Run A (baseline, no profiler):
RUN_DIR=/gpfs/scratch/$USER/GPT2-Optimization/benchmarks/bigpurple_v100_$(date +%F)/8gpu_2node_accum2_300 \
NSYS=0 NCCL_LOGS=0 TORCHRUN_LOGS=0 DIST_DEBUG=0 \
GRAD_ACCUM_STEPS=2 MICRO_BATCH_SIZE_PER_GPU=2 \
GPT2_EXTRA_ARGS="--profile_mode --max_train_steps 300 --max_val_steps 50" \
sbatch scripts/slurm/run_2node_8gpu.sbatch

Run B (tuned "bucket200", no profiler; same launch flags as Run A, with the tuned `src/deepspeed_config.json` in place):
RUN_DIR=/gpfs/scratch/$USER/GPT2-Optimization/benchmarks/bigpurple_v100_$(date +%F)/8gpu_2node_accum2_bucket200_300 \
NSYS=0 NCCL_LOGS=0 TORCHRUN_LOGS=0 DIST_DEBUG=0 \
GRAD_ACCUM_STEPS=2 MICRO_BATCH_SIZE_PER_GPU=2 \
GPT2_EXTRA_ARGS="--profile_mode --max_train_steps 300 --max_val_steps 50" \
sbatch scripts/slurm/run_2node_8gpu.sbatch

Compare:
- `RUN_DIR/training_metrics.json` -> `epochs[0].tokens_per_sec_global`, `epochs[0].step_time_p95_sec`
- `RUN_DIR/training_metrics.json` -> `summary.total_wall_time_sec`
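Pulling those three fields out of a run directory might look like the sketch below. The key paths come from the list above; the helper names `load_metrics` and `key_metrics` are hypothetical, and the rest of the schema is not assumed.

```python
import json
from pathlib import Path


def load_metrics(run_dir):
    """Load training_metrics.json from a run directory."""
    with open(Path(run_dir) / "training_metrics.json") as f:
        return json.load(f)


def key_metrics(metrics):
    """Extract the fields used for the A/B comparison."""
    epoch0 = metrics["epochs"][0]
    return {
        "tokens_per_sec_global": epoch0["tokens_per_sec_global"],
        "step_time_p95_sec": epoch0["step_time_p95_sec"],
        "total_wall_time_sec": metrics["summary"]["total_wall_time_sec"],
    }
```

Calling `key_metrics(load_metrics(run_dir))` on the Run A and Run B directories gives two small dicts that can be printed side by side.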
The numbers below are backed by checked-in files under:
artifacts/feature4_bigpurple_v100_2026-01-28/
Benchmark runs (NSYS=0, 300-step harness):
- Baseline metrics: `artifacts/feature4_bigpurple_v100_2026-01-28/accum2_300/training_metrics.json`
- Tuned metrics: `artifacts/feature4_bigpurple_v100_2026-01-28/bucket200_300/training_metrics.json`
Profiling runs (NSYS=1, used for attribution only):
- Baseline `nsys stats`: `artifacts/feature4_bigpurple_v100_2026-01-28/baseline_2026-01-26/nsys_stats_gn-0011.txt`
- Tuned `nsys stats`: `artifacts/feature4_bigpurple_v100_2026-01-28/bucket200_nsys80/nsys_stats_gn-0013.txt`
- Parsed top-5 summary: `artifacts/feature4_bigpurple_v100_2026-01-28/bucket200_nsys80/profile_summary.json`
| Run | run_dir (curated) | Tokens/sec (global) | total_wall_time_sec | step_time_p95_sec | Notes |
|---|---|---|---|---|---|
| Baseline (accum2) | `artifacts/feature4_bigpurple_v100_2026-01-28/accum2_300` | 29,971.23 | 82.96 | 0.07741 | NSYS=0, global_batch=32 |
| Tuned (bucket200) | `artifacts/feature4_bigpurple_v100_2026-01-28/bucket200_300` | 35,806.75 | 71.20 | 0.06317 | NSYS=0, ZeRO-1 bucket sizing |
Throughput improvement:
(35,806.75 / 29,971.23 - 1) is about +19.5%
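The same arithmetic can be applied to the other two columns of the table. A small sketch using only the values reported above (`percent_change` is a hypothetical helper name):

```python
def percent_change(baseline, tuned):
    """Relative change from baseline to tuned, in percent."""
    return (tuned / baseline - 1.0) * 100.0


# Values copied from the A/B table above.
throughput = percent_change(29_971.23, 35_806.75)
wall_time = percent_change(82.96, 71.20)
step_p95 = percent_change(0.07741, 0.06317)

print(f"tokens/sec (global): {throughput:+.1f}%")  # +19.5%
print(f"total wall time:     {wall_time:+.1f}%")   # -14.2%
print(f"step time p95:       {step_p95:+.1f}%")    # -18.4%
```

Wall time improves less than throughput, which would be consistent with the wall-clock figure including work outside the training loop; the metrics files are the place to confirm that.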
Rigor note:
- The A/B runs above were executed on different commits (`d8ca451` vs `ba03420`). The intended behavioral change for Feature 4 is the DeepSpeed bucket sizing + activation-checkpoint toggle described above; rerunning Run A on the latest commit is recommended for a single-commit apples-to-apples comparison.
Profiling-overhead example:
- `artifacts/feature4_bigpurple_v100_2026-01-28/bucket200_nsys80` reports `tokens_per_sec_global` of about `24,090.97` with `NSYS=1`.
The main profiler takeaway is that training time is dominated by backward and gradient synchronization rather than forward compute.
- Baseline NVTX (Nsight Systems `nvtx_sum`):
  - `:DeepSpeedEngine.backward` 53.9%
  - `:DeepSpeedEngine.allreduce_gradients` 40.8%
  - `NCCL:ncclAllReduce` appears with 42,748 instances
  - Source: `artifacts/feature4_bigpurple_v100_2026-01-28/baseline_2026-01-26/nsys_stats_gn-0011.txt`
- Tuned NVTX (bucket200, `NSYS=1`, 80-step profiling run):
  - `:DeepSpeedEngine.allreduce_gradients` 16.8%
  - `NCCL:ncclAllReduce` 520 instances
  - Source: `artifacts/feature4_bigpurple_v100_2026-01-28/bucket200_nsys80/nsys_stats_gn-0013.txt`
The baseline and tuned profiles above come from different capture windows, so they are useful as attribution evidence, not as direct benchmark comparisons.
OS Runtime Summary shows large time in poll / pthread_cond_timedwait / sem_wait / sem_timedwait, consistent with distributed waiting and synchronization.
- Run a short profiling job (`NSYS=1`) and wait for completion. Artifacts land under `RUN_DIR/profiles/`.
- Parse the stats into a compact summary:

python scripts/profiling/parse_nsys_stats.py --run_dir "$RUN_DIR"
cat "$RUN_DIR/profiles/profile_summary.json"

- View raw tables:

sed -n '/NVTX Range Summary/,/OS Runtime Summary/p' "$RUN_DIR"/profiles/nsys_stats_*.txt
sed -n '/OS Runtime Summary/,/CUDA API Summary/p' "$RUN_DIR"/profiles/nsys_stats_*.txt

- Open `profiles/nsys_<jobid>_<host>.nsys-rep` in the Nsight Systems GUI to inspect the full timeline.
- `src/`
  - `src/gpt2.py`: training entrypoint (baseline + DeepSpeed), metrics output, optional profiling-friendly mode.
  - `src/deepspeed_config.json`: DeepSpeed defaults (fp16, ZeRO-1, bucket sizing).
- `scripts/`
  - `scripts/slurm/run_2node_8gpu.sbatch`: 2-node launcher with optional NSYS/NCCL logs.
  - `scripts/profiling/parse_nsys_stats.py`: parses `nsys_stats_*.txt` into `profiles/profile_summary.json`.
  - `scripts/1_download_data.py`, `scripts/preprocess_small.py`: data pipeline for `train_small.bin` / `val_small.bin`.
- `benchmarks/`: example benchmark outputs (full runs).
- `artifacts/`: curated, small artifacts used to document Feature 4.
- Scope: this is an academic/HPC benchmark harness, not a production-scale training platform.
- Profiling overhead: `NSYS=1` reduces throughput; use it for attribution only.
- Comparable runs: for throughput claims, keep `world_size`, `seq_len`, `micro_batch_size_per_gpu`, `grad_accum_steps`, and step limits identical.
- Slurm specifics: always set `RUN_DIR` to a writable scratch path.
- NCCL logs are expensive: `NCCL_LOGS=1` produces large per-rank logs and can slow runs.
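The comparable-runs rule above can be checked mechanically before quoting a throughput delta. The sketch below assumes each run's settings have already been collected into a flat dict; that layout and the name `config_mismatches` are hypothetical, and the real `training_metrics.json` may nest these fields differently.

```python
# Settings that must match for two runs to be comparable (from the note above).
COMPARABLE_KEYS = (
    "world_size",
    "seq_len",
    "micro_batch_size_per_gpu",
    "grad_accum_steps",
    "max_train_steps",
    "max_val_steps",
)


def config_mismatches(run_a, run_b, keys=COMPARABLE_KEYS):
    """Return {key: (a_value, b_value)} for every setting that differs."""
    return {
        key: (run_a.get(key), run_b.get(key))
        for key in keys
        if run_a.get(key) != run_b.get(key)
    }


# Settings of the A/B runs described in this README.
baseline = {
    "world_size": 8,
    "seq_len": 512,
    "micro_batch_size_per_gpu": 2,
    "grad_accum_steps": 2,
    "max_train_steps": 300,
    "max_val_steps": 50,
}
tuned = dict(baseline)
print(config_mismatches(baseline, tuned))  # {}
```

A non-empty result names exactly which setting drifted, which is more useful than a bare pass/fail when a sweep script has changed an environment variable.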