
Question about the unit of time in evaluation results of the analytical model #93

Open
huanyuqu opened this issue Feb 16, 2025 · 1 comment

Comments


Workload

```
HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 4 ep: 1 pp: 1 vpp: 4 ga: 1 all_gpus: 4 checkpoints: 0 checkpoint_initiates: 0 pp_comm: 0
21
grad_gather -1 1 NONE 0 1 NONE 0 1 ALLGATHER 4919918592 100
grad_param_comm -1 1 NONE 0 1 NONE 0 1 REDUCESCATTER 9839837184 100
grad_param_compute -1 1 NONE 0 34021000 NONE 0 1 NONE 0 100
layernorm -1 1 NONE 0 1 ALLREDUCE 4919918592 1 NONE 0 100
embedding_grads -1 1 NONE 0 1 ALLREDUCE 402653184 1 NONE 0 100
embedding_layer -1 799000 ALLREDUCE 402653184 1 NONE 0 17374000 NONE 0 100
attention_layer -1 1820000 ALLREDUCE 402653184 1820000 NONE 0 1820000 NONE 0 100
mlp_layer -1 2478000 ALLREDUCE 402653184 2478000 NONE 0 2478000 NONE 0 100
attention_layer -1 1820000 ALLREDUCE 402653184 1820000 NONE 0 1820000 NONE 0 100
mlp_layer -1 2478000 ALLREDUCE 402653184 2478000 NONE 0 2478000 NONE 0 100
attention_layer -1 1820000 ALLREDUCE 402653184 1820000 NONE 0 1820000 NONE 0 100
mlp_layer -1 2478000 ALLREDUCE 402653184 2478000 NONE 0 2478000 NONE 0 100
attention_layer -1 1820000 ALLREDUCE 402653184 1820000 NONE 0 1820000 NONE 0 100
mlp_layer -1 2478000 ALLREDUCE 402653184 2478000 NONE 0 2478000 NONE 0 100
cross_entropy1 -1 0 ALLREDUCE 65536 0 NONE 0 0 NONE 0 100
cross_entropy2 -1 0 ALLREDUCE 65536 0 NONE 0 0 NONE 0 100
cross_entropy3 -1 0 ALLREDUCE 65536 0 NONE 0 0 NONE 0 100
optimizer1 -1 0 ALLREDUCE 4 0 NONE 0 0 NONE 0 100
optimizer2 -1 0 ALLREDUCE 4 0 NONE 0 0 NONE 0 100
optimizer3 -1 0 ALLREDUCE 4 0 NONE 0 0 NONE 0 100
optimizer4 -1 0 ALLREDUCE 4 0 NONE 0 0 NONE 0 100
```

Evaluation

```
layer_name,Analytical_test,fwd compute,wg compute,ig compute,fwd exposed comm,wg exposed comm,ig exposed comm,fwd total comm,algbw,busbw,wg total comm,algbw,busbw,ig total comm,algbw,busbw
grad_gather,Analytical_test,0,0,0,0,0,0,0,-nan,0.00,0,inf,-nan,0,-nan,0.00
grad_param_comm,Analytical_test,0,0,0,0,0,0,0,-nan,0.00,0,inf,-nan,0,-nan,0.00
grad_param_compute,Analytical_test,0,0,34021,0,0,0,0,-nan,0.00,0,-nan,0.00,0,-nan,0.00
layernorm,Analytical_test,0,0,0,0,0,34365,0,-nan,0.00,0,-nan,0.00,34365,133.33,200.00
embedding_grads,Analytical_test,0,0,0,0,0,2812,0,-nan,0.00,0,-nan,0.00,2812,133.33,200.00
embedding_layer,Analytical_test,799,17374,0,2812,0,0,2812,133.33,200.00,0,-nan,0.00,0,-nan,0.00
attention_layer,Analytical_test,1820,1820,1820,2812,0,0,2812,133.33,200.00,0,-nan,0.00,0,-nan,0.00
mlp_layer,Analytical_test,2478,2478,2478,2812,0,0,2812,133.33,200.00,0,-nan,0.00,0,-nan,0.00
attention_layer,Analytical_test,1820,1820,1820,2812,0,0,2812,133.33,200.00,0,-nan,0.00,0,-nan,0.00
mlp_layer,Analytical_test,2478,2478,2478,2812,0,0,2812,133.33,200.00,0,-nan,0.00,0,-nan,0.00
attention_layer,Analytical_test,1820,1820,1820,2812,0,0,2812,133.33,200.00,0,-nan,0.00,0,-nan,0.00
mlp_layer,Analytical_test,2478,2478,2478,2812,0,0,2812,133.33,200.00,0,-nan,0.00,0,-nan,0.00
attention_layer,Analytical_test,1820,1820,1820,2812,0,0,2812,133.33,200.00,0,-nan,0.00,0,-nan,0.00
mlp_layer,Analytical_test,2478,2478,2478,2812,0,0,2812,133.33,200.00,0,-nan,0.00,0,-nan,0.00
cross_entropy1,Analytical_test,0,0,0,0,0,0,0,133.56,200.33,0,-nan,0.00,0,-nan,0.00
cross_entropy2,Analytical_test,0,0,0,0,0,0,0,133.56,200.33,0,-nan,0.00,0,-nan,0.00
cross_entropy3,Analytical_test,0,0,0,0,0,0,0,133.56,200.33,0,-nan,0.00,0,-nan,0.00
optimizer1,Analytical_test,0,0,0,0,0,0,0,inf,inf,0,-nan,0.00,0,-nan,0.00
optimizer2,Analytical_test,0,0,0,0,0,0,0,inf,inf,0,-nan,0.00,0,-nan,0.00
optimizer3,Analytical_test,0,0,0,0,0,0,0,inf,inf,0,-nan,0.00,0,-nan,0.00
optimizer4,Analytical_test,0,0,0,0,0,0,0,inf,inf,0,-nan,0.00,0,-nan,0.00
```

I have a question about the unit of the evaluation results from SimAI's analytical model. I'm trying to evaluate a 4-layer GPT3-175B model, and I'm seeing some discrepancies between the reported communication times and the theoretical calculations.
Here are the parameters of my model:
Model: GPT3-175B (4 layers)
Batch Size: 8
Sequence Length: 2048
Hidden Dimension: 12288
Tensor Parallelism (TP): 4
Based on these parameters, the total communication volume for the all-reduce operation in the MLP layer should be: 8 * 2048 * 12288 * 2 = 402653184 B
My theoretical calculation using the NCCL model, with an assumed bandwidth (B) of 200 GB/s, is as follows:
t = 2 * (n - 1) * S / (n * B), where n is the number of GPUs and S is the data size.
t = 2 * (4-1) * 402653184 / (4 * 200 * 10^9)
t ≈ 0.003 seconds ≈ 3000 us
However, the reported value is 2816, which made me wonder whether the time unit is ns.
Could you please clarify the unit of time used in the evaluation results?
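For reference, the estimate above can be reproduced with a short script. This is just the standard ring all-reduce cost model from the question, assuming fp16 activations (hence the factor of 2 bytes per element):

```python
# Back-of-envelope check of the ring all-reduce estimate above.
# Cost model: t = 2*(n-1)*S / (n*B), with S the message size in bytes.
batch, seq, hidden = 8, 2048, 12288
n = 4                          # TP group size (GPUs in the all-reduce)
B = 200e9                      # assumed bandwidth, 200 GB/s
S = batch * seq * hidden * 2   # bytes, assuming 2 bytes/element (fp16)
t_us = 2 * (n - 1) * S / (n * B) * 1e6
print(S, round(t_us))          # 402653184, 3020 (us)
```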

Yan824 (Collaborator) commented Feb 26, 2025

Hi, the unit of time used in SimAI_Analytical is us. Please see the details in the compute_time() function; the return value is divided by FREQ before being written to the csv.
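A microsecond unit is also consistent with the csv numbers themselves. The following is my own back-of-envelope cross-check (not SimAI code), assuming the busbw column is in GiB/s (i.e. 2**30 bytes per second):

```python
# Cross-check that the csv time column is in us, assuming busbw is GiB/s.
S = 402653184                       # all-reduce message size in bytes
n = 4                               # TP group size
bus_bytes = S * 2 * (n - 1) / n     # bytes per link in a ring all-reduce
t_us = bus_bytes / (200 * 2**30) * 1e6
print(t_us)                         # 2812.5 -- matches the reported 2812
```

With busbw read as 200 GiB/s, 2812.5 only matches the reported "2812" if the column is microseconds, which agrees with the answer above.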
