
Abnormal termination during simulation #75

Open
ykk199 opened this issue Jan 16, 2025 · 4 comments

Comments

@ykk199

ykk199 commented Jan 16, 2025

I am simulating GPT-7B on 4 machines with 32 GPUs. When the run reaches `chunk size is: 4 , size is: 4 , layer_num is: 74762 , node: 0`, it finishes without any error or warning, but the generated ncclFlowModel_EndToEnd.csv and ncclFlowModel_detailed_36.csv are empty. The same thing happens with other large models; running microAllReduce.txt works fine.

The command used:
AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t nproc -w ./gpt_7B-world_size32-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt -n ./HPN_7_0_32_gpus_8_in_one_server_with_single_plane_100Gbps_A100 -c astra-sim-alibabacloud/inputs/config/SimAI.conf | tee gpt7b.log

The generated log is attached:
gpt7b.log

@Huoyuan100861
Collaborator

Could you also share the workload file? That would make it easier for us to debug and locate the issue.

@ykk199
Author

ykk199 commented Jan 17, 2025

Hi, the workload is attached. Please take a look. Thanks!

gpt_7B-world_size32-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt

@Huoyuan100861
Collaborator

You can first try removing the collective-communication entries with size < 1024 from the workload (their impact here is minimal, and a fixed latency will replace them later).
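The suggestion above can be sketched as a small preprocessing script. This is a hypothetical sketch, not part of SimAI itself: it assumes each workload line is whitespace-separated with a collective type token (e.g. ALLREDUCE) immediately followed by its size in bytes, which may not match the actual workload format — adjust the parsing to your file before use.

```python
# Hypothetical filter for a SimAI-style workload file: drop lines whose
# collective-communication sizes are all below a threshold (e.g. 1024 bytes).
# The column layout (type token followed by its size) is an ASSUMPTION.
COMM_TYPES = {"ALLREDUCE", "ALLGATHER", "REDUCESCATTER", "ALLTOALL"}

def filter_small_collectives(lines, min_size=1024):
    kept = []
    for line in lines:
        fields = line.split()
        # Collect every size that directly follows a known collective token.
        sizes = [int(f) for t, f in zip(fields, fields[1:])
                 if t.upper() in COMM_TYPES and f.isdigit()]
        if sizes and all(s < min_size for s in sizes):
            continue  # every collective on this line is below the threshold
        kept.append(line)
    return kept

if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as f:
        for line in filter_small_collectives(f.read().splitlines()):
            print(line)
```

Lines without any recognized collective token (compute-only layers, headers) are kept unchanged, so only the small collectives the comment refers to are removed.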

@ykk199
Author

ykk199 commented Feb 12, 2025

Hi, do you have versions of these files that run successfully? I would like to use them as a reference. Thanks!

2 participants