AllGather Message Size vs Workload TXT File Msg Size #74

vamsiDT · 2025-01-15T14:08:21Z

Hi,

The workload file specifies an AllGather communication with message size computed based on the total params of the model. I am assuming this is the size of the data received at each node after gathering?

The simulator reads this message size and generates the AllGather using it as the initial data size. The final data received at each node then becomes the number of ranks multiplied by the initial message size specified in the TXT workload.

Is this intended? I am wondering whether the AllGather specified in the TXT file needs a division by the number of ranks (world_size).

Please let me know.

--
Vamsi

vamsiDT · 2025-01-15T14:18:22Z

For example, in the following workload:

The initial grad_gather specifies a size of nearly 1.6 GB, and there are 256 GPUs in the topology. Based on how the simulator reads this, AllGather generates 1.6 GB x 256 of data per node (with ring, direct, or halvingDoubling). In total, the simulator generates 1.6 GB x 256 x 256 across all nodes.

This single operation completely dominates the training time, and the rest of the operations finish quickly by comparison.

It makes me wonder whether AllGather sizes need to be divided by the number of ranks in the TXT workload generator.

Or is this how it really works? 😯

HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 8 ep: 1 pp: 1 vpp: 36 ga: 32 all_gpus: 256 checkpoints: 0 checkpoint_initiates: 0 pp_comm: 0 
2350                                                                                                                                                   
grad_gather     -1      1       NONE    0       1       NONE    0       1       ALLGATHER       1649410048      100                                    
grad_param_comm -1      1       NONE    0       1       NONE    0       1       REDUCESCATTER   3298820096      100                                    
grad_param_compute      -1      1       NONE    0       16011264        NONE    0       1       NONE    0       100                                    
layernorm       -1      1       NONE    0       1       ALLREDUCE       1649410048      1       NONE    0       100                                    
embedding_grads -1      1       NONE    0       1       ALLREDUCE       33554432        1       NONE    0       100                                    
moe_grad_norm1  -1      1       NONE    0       1       NONE    0       1       ALLGATHER_DP_EP 0       100                                            
moe_grad_norm2  -1      1       NONE    0       1       NONE    0       1       REDUCESCATTER_DP_EP     0       100                                    
embedding_layer -1      107004929       ALLREDUCE       33554432        1       NONE    0       8005632 NONE    0       100                            
attention_layer -1      968236  ALLREDUCE       33554432        968236  NONE    0       968236  NONE    0       100                                    
mlp_layer       -1      1013846 ALLREDUCE       33554432        1013846 NONE    0       1013846 NONE    0       100                                    
attention_layer -1      968236  ALLREDUCE       33554432        968236  NONE    0       968236  NONE    0       100                                    
mlp_layer       -1      1013846 ALLREDUCE       33554432        1013846 NONE    0       1013846 NONE    0       100                                    
attention_layer -1      968236  ALLREDUCE       33554432        968236  NONE    0       968236  NONE    0       100                                    
mlp_layer       -1      1013846 ALLREDUCE       33554432        1013846 NONE    0       1013846 NONE    0       100                                    
attention_layer -1      968236  ALLREDUCE       33554432        968236  NONE    0       968236  NONE    0       100                                    
mlp_layer       -1      1013846 ALLREDUCE       33554432        1013846 NONE    0       1013846 NONE    0       100                                    
attention_layer -1      968236  ALLREDUCE       33554432        968236  NONE    0       968236  NONE    0       100                                    
mlp_layer       -1      1013846 ALLREDUCE       33554432        1013846 NONE    0       1013846 NONE    0       100                                    
attention_layer -1      968236  ALLREDUCE       33554432        968236  NONE    0       968236  NONE    0       100                                    
mlp_layer       -1      1013846 ALLREDUCE       33554432        1013846 NONE    0       1013846 NONE    0       100                                    
attention_layer -1      968236  ALLREDUCE       33554432        968236  NONE    0       968236  NONE    0       100
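
The arithmetic in the example above can be checked with a short sketch. It takes the ALLGATHER size from the grad_gather line and the 256-rank group size stated in the comment (an assumption of the comment, not something the workload file confirms), and treats the size as each rank's initial contribution, which is the reading described in this comment. The variable names are illustrative only.

```python
# Values taken from the comment above. The 256-rank group size is the
# commenter's assumption; the actual AllGather group may be smaller.
GRAD_GATHER_SIZE = 1_649_410_048  # bytes, ALLGATHER size on the grad_gather line
NUM_RANKS = 256                   # all_gpus in the workload header

# Reading the size as each rank's initial contribution:
# every rank ends up holding one contribution from each rank.
received_per_rank = GRAD_GATHER_SIZE * NUM_RANKS

# Summed over all ranks in the group.
total_all_ranks = received_per_rank * NUM_RANKS

print(f"received per rank:    {received_per_rank / 1e9:.1f} GB")
print(f"total over all ranks: {total_all_ranks / 1e12:.1f} TB")
```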

Huoyuan100861 · 2025-01-16T02:28:03Z

Hi,

The workload file specifies an AllGather communication with message size computed based on the total params of the model. I am assuming this is the size of the data received at each node after gathering?

The simulator reads this message size and generates the AllGather using it as the initial data size. The final data received at each node then becomes the number of ranks multiplied by the initial message size specified in the TXT workload.

Is this intended? I am wondering whether the AllGather specified in the TXT file needs a division by the number of ranks (world_size).

Please let me know.

-- Vamsi

In SimAI's workload format, [Collective Comm] [Size], the Size refers to the global data, that is, the total across all ranks. For instance, in an AllGather operation, the data per rank is Size divided by the number of ranks. This matches the convention used in framework-level discussions of the communication volume of a collective operation. In the NCCL interface, however, the count you specify is the amount for a single rank, so you need to be aware of this distinction.
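
As a rough illustration of the convention described in this reply, the sketch below converts a global Size into the per-rank amount that an NCCL-style AllGather expects (NCCL's `ncclAllGather` takes a per-rank send count, and the receive buffer holds that count times the number of ranks). The helper `per_rank_bytes` and the 256-rank group size are hypothetical, chosen to match the numbers in this thread; this is not SimAI or AICB code.

```python
def per_rank_bytes(global_size: int, num_ranks: int) -> int:
    """Split a global AllGather size evenly into one rank's contribution."""
    if num_ranks <= 0:
        raise ValueError("num_ranks must be positive")
    if global_size % num_ranks != 0:
        raise ValueError("global_size is not evenly divisible by num_ranks")
    return global_size // num_ranks


size = 1_649_410_048  # bytes, ALLGATHER size from the grad_gather line
ranks = 256           # assumed group size, as used in this thread

chunk = per_rank_bytes(size, ranks)  # what a single rank sends
gathered = chunk * ranks             # what each rank holds afterwards

print(f"per-rank contribution: {chunk} bytes")
print(f"gathered per rank:     {gathered} bytes")
```

Under this reading, the amount each rank holds after the AllGather equals the Size written in the workload file, rather than Size multiplied by the number of ranks.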

vamsiDT · 2025-01-16T02:39:56Z

Thanks for the response.

Is this correct then?

Workload generator specifies AllGather size based on the total model params: https://github.com/aliyun/aicb/blob/d9b4f5cd7d9d34a80cfbb0389831a16c7fe3ed7b/workload_generator/AIOB_simAI_workload_generator.py#L571C1-L572C1

SimAI's Workload class reads this directly and, for example, NcclTreeFlowModel uses it as the initial size for AllGather:

SimAI/astra-sim-alibabacloud/astra-sim/system/collective/NcclTreeFlowModel.cc

Line 110 in ed5c490

this->final_data_size = data_size * nodes_in_ring;

Either AICB or the AllGather input in the simulator might need a correction, right? Unless the training process actually requires every GPU to send out the entire model params. I am not sure how that works, given the parallelism strategy.

Any clarification would be very helpful to understand. Thanks!
