GitHub - osayamenja/DataCruncher: Scripts for data analysis and plots

Data Cruncher

Nsight Traces

We open-source our traces in Azure Blob Storage. You can regenerate every dataset in the data directory by downloading the linked trace files and running the corresponding commands.

Note

The time filters in the commands restrict the data to complete iterations. You can verify this by viewing the traces in the Nsight Systems GUI.
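To make the time windows concrete, here is a minimal sketch that converts a --filter-time value, as written in the commands in this README, into a window length in milliseconds. It assumes the value is a start/end pair and handles only the "<N>s<M>ms" spelling seen here. The helper names to_ms and window_ms are hypothetical and not part of nsys.

```shell
#!/bin/sh
# to_ms: convert one endpoint written like "4s420ms" into milliseconds.
# Assumption: only the "<N>s<M>ms" spelling used in this README is handled.
to_ms() {
    secs=${1%%s*}        # digits before the first "s", e.g. "4"
    rest=${1#*s}         # remainder after that "s", e.g. "420ms"
    millis=${rest%ms}    # drop the trailing "ms", e.g. "420"
    expr "$secs" \* 1000 + "$millis"
}

# window_ms: length in milliseconds of a "start/end" window.
window_ms() {
    start=${1%%/*}       # text before the "/"
    end=${1##*/}         # text after the "/"
    expr "$(to_ms "$end")" - "$(to_ms "$start")"
}

# The three windows that appear in the nsys stats commands in this README.
for w in "4s420ms/35s390ms" "3s720ms/12s450ms" "0s510ms/13s870ms"; do
    printf '%s -> %s ms\n' "$w" "$(window_ms "$w")"
done
```

Comparing the window length against the iteration boundaries visible in the GUI is one way to check that a filter covers whole iterations.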

Requirements

CUDA Toolkit
Linux
Python
Nsight Systems CLI and GUI

Single-Node 1x8 350M

Download Trace from here
View in the Nsight GUI or 👇

Run the command below to generate single_1x8_350M_trace.txt

nsys stats --filter-time="4s420ms/35s390ms" -r cuda_gpu_trace --timeunit usec --format column --output @"grep -E (Start*|ncclKernel_SendRecv_RING*)" single_1x8_350M.nsys-rep

Run the command below to generate single_1x8_filtered_sum.txt

nsys stats --filter-time="4s420ms/35s390ms" -r cuda_gpu_sum --timeunit usec --format column single_1x8_350M.nsys-rep

Multi-Node 8x4

Download 1.3B Trace from here
Download 350M Trace from here
View in the Nsight GUI or 👇

Run the command below to generate multi_8x4_1.3B_trace.txt

nsys stats --filter-time="3s720ms/12s450ms" -r cuda_gpu_trace --timeunit usec --format column --output @"grep -E (Start*|ncclKernel_SendRecv_RING*)" multi_8x4_1.3B.nsys-rep

Run the command below to generate multi_8x4_1.3B_sum.txt

nsys stats --filter-time="3s720ms/12s450ms" -r cuda_gpu_sum --timeunit usec --format column multi_8x4_1.3B.nsys-rep

Run the command below to generate multi_8x4_350M_trace.txt

nsys stats --filter-time="0s510ms/13s870ms" -r cuda_gpu_trace --timeunit usec --format column --output @"grep -E (Start*|ncclKernel_SendRecv_RING*)" multi_8x4_350M.nsys-rep

Run the command below to generate multi_8x4_350M_sum.txt

nsys stats --filter-time="0s510ms/13s870ms" -r cuda_gpu_sum --timeunit usec --format column multi_8x4_350M.nsys-rep
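The summary commands above differ only in their time window and trace file, so they can be produced from one template. The sketch below is a dry run: it prints the three cuda_gpu_sum commands and does not execute nsys. The function name sum_cmd is hypothetical; the windows and file names are copied from the commands in this README.

```shell
#!/bin/sh
# sum_cmd: print the cuda_gpu_sum command for one time window and trace file.
# Nothing is executed here; the function only writes the command line to stdout.
sum_cmd() {
    window=$1   # e.g. "4s420ms/35s390ms"
    report=$2   # e.g. single_1x8_350M.nsys-rep
    printf 'nsys stats --filter-time="%s" -r cuda_gpu_sum --timeunit usec --format column %s\n' \
        "$window" "$report"
}

# Windows and trace names as they appear in this README.
sum_cmd "4s420ms/35s390ms" single_1x8_350M.nsys-rep
sum_cmd "3s720ms/12s450ms" multi_8x4_1.3B.nsys-rep
sum_cmd "0s510ms/13s870ms" multi_8x4_350M.nsys-rep
```

Printing first makes it easy to review each window against its trace before running anything.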

Single Node Profiling

We profiled with the command below. You can change the --delay and --duration values (both in seconds).

nsys profile -s none --delay 200 --duration 40 --cpuctxsw none -t cuda,nvtx,cudnn,cublas,cusparse --cuda-graph-trace=node <training_script_name>
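One way to vary the delay and duration without editing the command is to read them from environment variables. This is a minimal sketch, not the authors' workflow: the variable names DELAY and DURATION, the function profile_cmd, and the script name ./train.sh are all hypothetical. It prints the command instead of running it.

```shell
#!/bin/sh
# DELAY and DURATION are in seconds; the defaults match the command above.
DELAY=${DELAY:-200}
DURATION=${DURATION:-40}

# profile_cmd: print the nsys profile invocation for a given training command.
profile_cmd() {
    printf 'nsys profile -s none --delay %s --duration %s --cpuctxsw none -t cuda,nvtx,cudnn,cublas,cusparse --cuda-graph-trace=node %s\n' \
        "$DELAY" "$DURATION" "$1"
}

# ./train.sh stands in for <training_script_name>; it is a placeholder.
profile_cmd "./train.sh"
```

Setting DELAY or DURATION in the environment before running the script overrides the defaults.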

Perlmutter Profiling

For multi-node training, we used the script below and launched it with: srun /bin/bash <the below script>

#!/bin/bash
# Rank of the single task that runs under nsys; deliberately not node 0.
SCRIBE=1
if [ "${SLURM_PROCID}" -eq "${SCRIBE}" ]; then
        echo "Node ${SLURM_PROCID} will profile!"
        # --kill none sends no signal when collection ends, so training keeps running.
        nsys profile --kill none -s none --delay 120 --duration 15 --cpuctxsw none -t cuda,nvtx,cudnn,cublas,cusparse --cuda-graph-trace=node -o report_${SLURM_PROCID} <training_script_name>
else
        echo "Node ${SLURM_PROCID} will NOT profile!"
        <training_script_name>
fi
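The selection logic in the script can be checked without a cluster. The sketch below simulates the SLURM_PROCID values that srun would assign and reports which rank would run under nsys. The function name role_for and the rank range 0 to 3 are illustrative assumptions, not values taken from the actual job.

```shell
#!/bin/sh
SCRIBE=1   # same value as in the script above

# role_for: report what a task with the given rank would do.
role_for() {
    if [ "$1" -eq "$SCRIBE" ]; then
        echo "profile"
    else
        echo "train"
    fi
}

# Simulated ranks; a real job gets these from SLURM_PROCID.
for rank in 0 1 2 3; do
    printf 'rank %s: %s\n' "$rank" "$(role_for "$rank")"
done
```

Only the rank equal to SCRIBE is wrapped in the profiler; every other rank runs the training script directly, which keeps profiling overhead and report files limited to one task.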

