AllGather Message Size vs Workload TXT File Msg Size #74
Comments
For example, in the following workload: The initial This single operation totally dominates the training time, and the rest of the operations finish quickly. It makes me wonder whether AllGather sizes need to be divided by the number of ranks in the TXT workload generator. Or is this how it really works? 😯
In SimAI's workload format, [Collective Comm] [Size], Size refers to the global data volume across all ranks. For instance, in an AllGather operation, the data per rank is Size divided by the number of ranks. This follows the convention used in framework-level discussions of the communication volume of a collective operation. In the NCCL interface, however, the count you specify is the amount for a single rank, so you need to be aware of this distinction.
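To make the distinction concrete, here is a minimal sketch of the conversion, assuming the workload Size is a global byte count that divides evenly across ranks. The function names and the element size are hypothetical and are not taken from SimAI or AICB.

```python
def per_rank_bytes(global_size: int, num_ranks: int) -> int:
    """Bytes each rank contributes when `global_size` is the total gathered size."""
    if num_ranks <= 0:
        raise ValueError("num_ranks must be positive")
    if global_size % num_ranks != 0:
        # Assumption: the sketch only handles sizes that split evenly.
        raise ValueError("global_size must be divisible by num_ranks")
    return global_size // num_ranks


def per_rank_element_count(global_size: int, num_ranks: int, dtype_bytes: int) -> int:
    """Per-rank element count, i.e. the kind of value an NCCL-style
    AllGather call expects (a count for a single rank, not a global size)."""
    chunk = per_rank_bytes(global_size, num_ranks)
    if chunk % dtype_bytes != 0:
        raise ValueError("per-rank chunk must be a whole number of elements")
    return chunk // dtype_bytes


# Illustrative numbers only: 1024 bytes gathered across 8 ranks, 2-byte elements.
chunk = per_rank_bytes(1024, 8)
count = per_rank_element_count(1024, 8, dtype_bytes=2)
print(f"per-rank bytes: {chunk}, per-rank element count: {count}")
```

If the size in the TXT file is passed straight to a per-rank interface without this division, the gathered buffer ends up larger by a factor of the number of ranks.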
Thanks for the response. Is this correct, then? The workload generator specifies the AllGather size based on the total model params: https://github.com/aliyun/aicb/blob/d9b4f5cd7d9d34a80cfbb0389831a16c7fe3ed7b/workload_generator/AIOB_simAI_workload_generator.py#L571C1-L572C1 SimAI's Workload class reads this directly, and NcclTreeFlowModel, for example, uses it as the initial size for AllGather:
Either AICB or the AllGather input in the simulator might need a correction, right? Unless the training process actually requires every GPU to send out the entire model params. I am not sure how it works, given the parallelism strategy. Any clarification would be very helpful. Thanks!
Hi,
The workload file specifies an AllGather communication with message size computed based on the total params of the model. I am assuming this is the size of the data received at each node after gathering?
The simulator reads this message size and generates an AllGather using it as the initial data size. The final data received at each node then becomes the number of ranks multiplied by the initial message size specified in the TXT workload.
Is this intended? I am wondering whether the AllGather specified in the TXT file needs a division by the number of ranks (world_size).
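The two readings of the TXT message size can be compared numerically. This is only a sketch with illustrative numbers (a 1 GiB size and 8 ranks, neither taken from an actual workload), and the function names are hypothetical.

```python
def gathered_if_size_is_global(size: int, world_size: int) -> int:
    """Size is the total gathered data; each rank contributes size / world_size."""
    return size


def gathered_if_size_is_per_rank(size: int, world_size: int) -> int:
    """Size is used as each rank's initial chunk; the result is world_size chunks."""
    return size * world_size


SIZE = 1 << 30      # illustrative: 1 GiB written in the TXT workload
WORLD_SIZE = 8      # illustrative number of ranks

global_reading = gathered_if_size_is_global(SIZE, WORLD_SIZE)
per_rank_reading = gathered_if_size_is_per_rank(SIZE, WORLD_SIZE)
inflation = per_rank_reading // global_reading

print(f"gathered bytes (global reading):   {global_reading}")
print(f"gathered bytes (per-rank reading): {per_rank_reading}")
print(f"inflation factor: {inflation}")
```

Under the per-rank reading, the data each node ends up with grows by a factor equal to world_size, which is the discrepancy the question is about.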
Please let me know.
--
Vamsi