Simulation terminates abnormally #75
Comments
Could you also share the workload? That would help us debug and locate the problem.
Hi, the workload is attached. Could you please take a look? Thanks! gpt_7B-world_size32-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt
As a first step, you can try removing the collective communication entries whose size is < 1024 (their impact is minimal, and a fixed latency will replace them later).
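To see which entries that suggestion would affect, a small read-only sketch like the one below can list them before you edit anything. It rests on an assumption that is not confirmed in this thread: that each communication in the workload file appears as a type token (ALLREDUCE, ALLGATHER, REDUCESCATTER or ALLTOALL, possibly with a suffix) immediately followed by an integer size. The function name `find_small_collectives` is hypothetical, and the script only reports matches; it does not modify the workload.

```python
import re

# Assumption (not confirmed in this thread): a communication entry is a type
# token such as ALLREDUCE or ALLGATHER, optionally with a suffix like _EP,
# immediately followed by an integer size.
COMM_RE = re.compile(
    r"\b((?:ALLREDUCE|ALLGATHER|REDUCESCATTER|ALLTOALL)\w*)\s+(\d+)\b"
)


def find_small_collectives(lines, threshold=1024):
    """Return (line_no, comm_type, size) for every entry with size < threshold."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for comm_type, size in COMM_RE.findall(line):
            if int(size) < threshold:
                hits.append((line_no, comm_type, int(size)))
    return hits
```

One way to use it is to open the workload file, pass the file object to `find_small_collectives`, and print the returned tuples; the line numbers then show where the small entries sit.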
Hi, do you have a set of these files that runs successfully on your side? I would like to use them as a reference. Thanks!
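I am simulating GPT-7B on 4 machines with 32 GPUs. When the run reaches chunk size is: 4 , size is: 4 , layer_num is: 74762 , node: 0, it ends without any error or warning, but the generated ncclFlowModel_EndToEnd.csv and ncclFlowModel_detailed_36.csv are empty. Other large models behave similarly, while running microAllReduce.txt works fine.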
The command I ran is:
AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t nproc -w ./gpt_7B-world_size32-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt -n ./HPN_7_0_32_gpus_8_in_one_server_with_single_plane_100Gbps_A100 -c astra-sim-alibabacloud/inputs/config/SimAI.conf | tee gpt7b.log
The generated log is attached:
gpt7b.log
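Because the run ends without any error or warning, the empty result files are the only visible symptom. A minimal sketch for checking them after each run is shown below. The helper names `is_effectively_empty` and `report` are hypothetical, and the location of the CSV files is an assumption; adjust the paths to wherever your run writes its results.

```python
from pathlib import Path


def is_effectively_empty(path):
    """True if the file is missing, zero bytes, or contains only whitespace."""
    p = Path(path)
    if not p.is_file():
        return True
    return p.read_text(errors="replace").strip() == ""


def report(paths):
    """Map each path to True when it is missing or effectively empty."""
    return {str(p): is_effectively_empty(p) for p in paths}
```

For example, `report(["ncclFlowModel_EndToEnd.csv", "ncclFlowModel_detailed_36.csv"])` returns a dictionary that marks each of the two files named above as empty or not, assuming they are in the current directory.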