
Abnormal termination during simulation #75

Open
ykk199 opened this issue Jan 16, 2025 · 4 comments

Comments

@ykk199

ykk199 commented Jan 16, 2025

I am simulating GPT-7B on 4 machines with 32 GPUs. When the run reaches `chunk size is: 4 , size is: 4 , layer_num is: 74762 , node: 0`, it finishes without any error or warning, but the generated ncclFlowModel_EndToEnd.csv and ncclFlowModel_detailed_36.csv are empty. The same thing happens with other large models; running microAllReduce.txt works fine.

The command used:
AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t nproc -w ./gpt_7B-world_size32-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt -n ./HPN_7_0_32_gpus_8_in_one_server_with_single_plane_100Gbps_A100 -c astra-sim-alibabacloud/inputs/config/SimAI.conf | tee gpt7b.log

The generated log is attached:
gpt7b.log

@Huoyuan100861
Collaborator

Could you also share the workload file? That would make it easier for us to debug and locate the issue.

@ykk199
Author

ykk199 commented Jan 17, 2025

Hi, the workload is attached. Please take a look. Thanks!

gpt_7B-world_size32-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt

@Huoyuan100861
Collaborator

You can first try removing the collective-communication entries with size < 1024 from the workload (their impact here is minimal, and a fixed latency will replace them later).
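The suggestion above can be sketched as a small preprocessing script. This is a hypothetical sketch, not part of SimAI itself: it assumes each workload line is whitespace-separated with a collective type token (e.g. ALLREDUCE) immediately followed by its size in bytes, which may not match the actual workload format — adjust the parsing to your file before use.

```python
# Hypothetical filter for a SimAI-style workload file: drop lines whose
# collective-communication sizes are all below a threshold (e.g. 1024 bytes).
# The column layout (type token followed by its size) is an ASSUMPTION.
COMM_TYPES = {"ALLREDUCE", "ALLGATHER", "REDUCESCATTER", "ALLTOALL"}

def filter_small_collectives(lines, min_size=1024):
    kept = []
    for line in lines:
        fields = line.split()
        # Collect every size that directly follows a known collective token.
        sizes = [int(f) for t, f in zip(fields, fields[1:])
                 if t.upper() in COMM_TYPES and f.isdigit()]
        if sizes and all(s < min_size for s in sizes):
            continue  # every collective on this line is below the threshold
        kept.append(line)
    return kept

if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as f:
        for line in filter_small_collectives(f.read().splitlines()):
            print(line)
```

Lines without any recognized collective token (compute-only layers, headers) are kept unchanged, so only the small collectives the comment refers to are removed.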

@ykk199
Author

ykk199 commented Feb 12, 2025

Hi, do you have versions of these files that run successfully? I would like to use them as a reference. Thanks!

2 participants