I have a question about the unit of the evaluation results from SimAI's analytical model. I'm trying to evaluate a 4-layer GPT3-175B model, and I'm seeing some discrepancies between the reported communication times and the theoretical calculations.
Here are the parameters of my model:
Model: GPT3-175B (4 layers)
Batch Size: 8
Sequence Length: 2048
Hidden Dimension: 12288
Tensor Parallelism (TP): 4
Based on these parameters, the total communication volume for the all-reduce operation in the MLP layer should be: 8 * 2048 * 12288 * 2 = 402653184 B
My theoretical calculation using the NCCL model, with an assumed bandwidth (B) of 200 GB/s, is as follows:
t = 2 * (n - 1) * S / (n * B) where n is number of GPUs, and S is the size of data.
t = 2 * (4-1) * 402653184 / (4 * 200 * 10^9)
t ≈ 0.003 seconds = 3000 us
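For reference, here is the same arithmetic as a short script. It only restates the calculation above; treating the factor of 2 as "2 bytes per fp16 element" and using 200 GB/s as the bus bandwidth are my assumptions.

```python
# Sanity-check of the numbers above (my own assumptions, not SimAI code).
batch, seq_len, hidden = 8, 2048, 12288
bytes_per_elem = 2            # assuming fp16 activations
n_gpus = 4                    # TP degree
bus_bw = 200e9                # assumed bus bandwidth in B/s

# All-reduce payload for one MLP layer's activation tensor.
size_bytes = batch * seq_len * hidden * bytes_per_elem
print(size_bytes)             # 402653184

# Ring all-reduce model: t = 2 * (n - 1) * S / (n * B)
t_seconds = 2 * (n_gpus - 1) * size_bytes / (n_gpus * bus_bw)
print(t_seconds * 1e6)        # ~3019.9 us, i.e. roughly 3000 us
```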
However, the reported value is 2816, so it seems that the time unit may be ns.
Could you please clarify the unit of time used in the evaluation results?
Hi, the unit of time used in SimAI_Analytical is us. Please see details in the function compute_time(); the return value is divided by FREQ before being written into the CSV.
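For other readers, a minimal sketch of the conversion this answer describes. It is not the actual SimAI_Analytical source; FREQ = 1000 is my assumption, based on the workload file below listing 2478000 for the mlp_layer compute time while the CSV reports 2478.

```python
# Illustrative sketch only -- not the SimAI_Analytical implementation.
# The reply says compute_time()'s return value is divided by FREQ before
# being written into the CSV, and that the CSV values are in microseconds.
FREQ = 1000  # assumed ns -> us conversion factor

def csv_time_us(raw_time_ns):
    """Time value as it would appear in the evaluation CSV (microseconds)."""
    return raw_time_ns / FREQ

print(csv_time_us(2478000))  # 2478.0, matching the mlp_layer compute column
```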
Workload
```
HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 4 ep: 1 pp: 1 vpp: 4 ga: 1 all_gpus: 4 checkpoints: 0 checkpoint_initiates: 0 pp_comm: 0
21
grad_gather -1 1 NONE 0 1 NONE 0 1 ALLGATHER 4919918592 100
grad_param_comm -1 1 NONE 0 1 NONE 0 1 REDUCESCATTER 9839837184 100
grad_param_compute -1 1 NONE 0 34021000 NONE 0 1 NONE 0 100
layernorm -1 1 NONE 0 1 ALLREDUCE 4919918592 1 NONE 0 100
embedding_grads -1 1 NONE 0 1 ALLREDUCE 402653184 1 NONE 0 100
embedding_layer -1 799000 ALLREDUCE 402653184 1 NONE 0 17374000 NONE 0 100
attention_layer -1 1820000 ALLREDUCE 402653184 1820000 NONE 0 1820000 NONE 0 100
mlp_layer -1 2478000 ALLREDUCE 402653184 2478000 NONE 0 2478000 NONE 0 100
attention_layer -1 1820000 ALLREDUCE 402653184 1820000 NONE 0 1820000 NONE 0 100
mlp_layer -1 2478000 ALLREDUCE 402653184 2478000 NONE 0 2478000 NONE 0 100
attention_layer -1 1820000 ALLREDUCE 402653184 1820000 NONE 0 1820000 NONE 0 100
mlp_layer -1 2478000 ALLREDUCE 402653184 2478000 NONE 0 2478000 NONE 0 100
attention_layer -1 1820000 ALLREDUCE 402653184 1820000 NONE 0 1820000 NONE 0 100
mlp_layer -1 2478000 ALLREDUCE 402653184 2478000 NONE 0 2478000 NONE 0 100
cross_entropy1 -1 0 ALLREDUCE 65536 0 NONE 0 0 NONE 0 100
cross_entropy2 -1 0 ALLREDUCE 65536 0 NONE 0 0 NONE 0 100
cross_entropy3 -1 0 ALLREDUCE 65536 0 NONE 0 0 NONE 0 100
optimizer1 -1 0 ALLREDUCE 4 0 NONE 0 0 NONE 0 100
optimizer2 -1 0 ALLREDUCE 4 0 NONE 0 0 NONE 0 100
optimizer3 -1 0 ALLREDUCE 4 0 NONE 0 0 NONE 0 100
optimizer4 -1 0 ALLREDUCE 4 0 NONE 0 0 NONE 0 100
```
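In case it helps, this is how I read that file back. The column layout (forward / input-grad / weight-grad groups, compute time in ns, collective type and size in bytes) is only my inference from matching these numbers against the CSV below, not the documented SimAI format.

```python
# Rough reader for the workload file above. Column meanings are my own
# inference (fwd / ig / wg groups), not taken from SimAI documentation.
def parse_workload(path):
    with open(path) as f:
        lines = [ln.strip() for ln in f if ln.strip()]
    header, count = lines[0], int(lines[1])
    layers = []
    for ln in lines[2:2 + count]:
        t = ln.split()
        layers.append({
            "name": t[0],
            "fwd": {"compute_ns": int(t[2]), "comm": t[3], "comm_bytes": int(t[4])},
            "ig":  {"compute_ns": int(t[5]), "comm": t[6], "comm_bytes": int(t[7])},
            "wg":  {"compute_ns": int(t[8]), "comm": t[9], "comm_bytes": int(t[10])},
        })
    return header, layers
```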
Evaluation
```csv
layer_name,Analytical_test,fwd compute,wg compute,ig compute,fwd exposed comm,wg exposed comm,ig exposed comm,fwd total comm,algbw,busbw,wg total comm,algbw,busbw,ig total comm,algbw,busbw
grad_gather,Analytical_test,0,0,0,0,0,0,0,-nan,0.00,0,inf,-nan,0,-nan,0.00
grad_param_comm,Analytical_test,0,0,0,0,0,0,0,-nan,0.00,0,inf,-nan,0,-nan,0.00
grad_param_compute,Analytical_test,0,0,34021,0,0,0,0,-nan,0.00,0,-nan,0.00,0,-nan,0.00
layernorm,Analytical_test,0,0,0,0,0,34365,0,-nan,0.00,0,-nan,0.00,34365,133.33,200.00
embedding_grads,Analytical_test,0,0,0,0,0,2812,0,-nan,0.00,0,-nan,0.00,2812,133.33,200.00
embedding_layer,Analytical_test,799,17374,0,2812,0,0,2812,133.33,200.00,0,-nan,0.00,0,-nan,0.00
attention_layer,Analytical_test,1820,1820,1820,2812,0,0,2812,133.33,200.00,0,-nan,0.00,0,-nan,0.00
mlp_layer,Analytical_test,2478,2478,2478,2812,0,0,2812,133.33,200.00,0,-nan,0.00,0,-nan,0.00
attention_layer,Analytical_test,1820,1820,1820,2812,0,0,2812,133.33,200.00,0,-nan,0.00,0,-nan,0.00
mlp_layer,Analytical_test,2478,2478,2478,2812,0,0,2812,133.33,200.00,0,-nan,0.00,0,-nan,0.00
attention_layer,Analytical_test,1820,1820,1820,2812,0,0,2812,133.33,200.00,0,-nan,0.00,0,-nan,0.00
mlp_layer,Analytical_test,2478,2478,2478,2812,0,0,2812,133.33,200.00,0,-nan,0.00,0,-nan,0.00
attention_layer,Analytical_test,1820,1820,1820,2812,0,0,2812,133.33,200.00,0,-nan,0.00,0,-nan,0.00
mlp_layer,Analytical_test,2478,2478,2478,2812,0,0,2812,133.33,200.00,0,-nan,0.00,0,-nan,0.00
cross_entropy1,Analytical_test,0,0,0,0,0,0,0,133.56,200.33,0,-nan,0.00,0,-nan,0.00
cross_entropy2,Analytical_test,0,0,0,0,0,0,0,133.56,200.33,0,-nan,0.00,0,-nan,0.00
cross_entropy3,Analytical_test,0,0,0,0,0,0,0,133.56,200.33,0,-nan,0.00,0,-nan,0.00
optimizer1,Analytical_test,0,0,0,0,0,0,0,inf,inf,0,-nan,0.00,0,-nan,0.00
optimizer2,Analytical_test,0,0,0,0,0,0,0,inf,inf,0,-nan,0.00,0,-nan,0.00
optimizer3,Analytical_test,0,0,0,0,0,0,0,inf,inf,0,-nan,0.00,0,-nan,0.00
optimizer4,Analytical_test,0,0,0,0,0,0,0,inf,inf,0,-nan,0.00,0,-nan,0.00
```
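As a quick cross-check of the bandwidth columns, the busbw/algbw ratio in the rows above looks like the standard ring all-reduce correction factor 2(n-1)/n. That interpretation is my assumption; the 133.33 and 200.00 values come from the CSV itself.

```python
# Relation between the algbw and busbw columns for an all-reduce with TP=4.
n_gpus = 4
algbw = 133.33                # GB/s, from the rows above
busbw = algbw * 2 * (n_gpus - 1) / n_gpus
print(busbw)                  # 199.995, i.e. ~200.00 as in the busbw column
```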