Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor comms trace parser and deprecate support for basic and Kineto traces #155

Closed
wants to merge 1 commit into from

Conversation

shengfukevin
Copy link
Contributor

Summary

Refactored the commsTraceParser to enhance readability and maintainability.
Removed deprecated support for parsing "basic" and "Kineto" traces; only Chakra host execution traces are now supported.
Added detailed logging to handle undecimal process group names with warnings instead of exceptions.
Modified _parse_proc_group_info and _parse_comms_op_node for improved processing of process group info and communication operations.
Updated comms_utils to support additional data types (signed char, unsigned char).
Adjusted the command-line interface to reflect the updated trace type options.

Test Plan

$ comm_replay --trace-type et --trace-path /home/sanshang/021_debug/000_code/param/trace/traces_megatronlm_gpt_43B_32ranks_pytnightly0703/execution_trace

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 9, 2024
@facebook-github-bot
Copy link
Contributor

@shengfukevin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Aug 9, 2024
…o traces (#155)

Summary:
Refactored the commsTraceParser to enhance readability and maintainability.
Removed deprecated support for parsing "basic" and "Kineto" traces; only Chakra host execution traces are now supported.
Added detailed logging to handle undecimal process group names with warnings instead of exceptions.
Modified _parse_proc_group_info and _parse_comms_op_node for improved processing of process group info and communication operations.
Updated comms_utils to support additional data types (signed char, unsigned char).
Adjusted the command-line interface to reflect the updated trace type options.


Test Plan: $ comm_replay --trace-type et --trace-path /home/sanshang/021_debug/000_code/param/trace/traces_megatronlm_gpt_43B_32ranks_pytnightly0703/execution_trace

Differential Revision: D61025278
@facebook-github-bot facebook-github-bot force-pushed the shengf/refactor-comm-trace-parser branch from afa7558 to 4c96222 Compare August 9, 2024 17:57
@facebook-github-bot
Copy link
Contributor

@shengfukevin has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D61025278

@shengfukevin
Copy link
Contributor Author

@TaekyungHeo and @Sergei-Lebedev, I fixed the parser for version 1.1.1, which missed passing commArgs. Also I saw _getTensorInfoFromPyTorchETEntry and tensorDtypeMap are no longer used, shall we clean them up. Thanks

@shengfukevin
Copy link
Contributor Author

@TaekyungHeo, test 2 rank Resnet run failed with the following call stack:

ank0]: File "/data/users/shengfu/fbsource/buck-out/v2/gen/fbcode/6ef5f323b6193f0f/param_bench/et_replay/comm_replay/comm_replay#link-tree/run_lpar_main.py", line 73, in
[rank0]: __invoke_main()
[rank0]: File "/data/users/shengfu/fbsource/buck-out/v2/gen/fbcode/6ef5f323b6193f0f/param_bench/et_replay/comm_replay/comm_replay#link-tree/run_lpar_main.py", line 35, in __invoke_main
[rank0]: run_as_main(module, main_function)
[rank0]: File "/data/users/shengfu/fbsource/buck-out/v2/gen/fbcode/6ef5f323b6193f0f/param_bench/et_replay/comm_replay/comm_replay#link-tree/par/meta_only/bootstrap.py", line 98, in run_as_main
[rank0]: oss_run_as_main(
[rank0]: File "/data/users/shengfu/fbsource/buck-out/v2/gen/fbcode/6ef5f323b6193f0f/param_bench/et_replay/comm_replay/comm_replay#link-tree/par/bootstrap.py", line 94, in run_as_main
[rank0]: main()
[rank0]: File "/data/users/shengfu/fbsource/buck-out/v2/gen/fbcode/6ef5f323b6193f0f/param_bench/et_replay/comm_replay/comm_replay#link-tree/et_replay/tools/comm_replay.py", line 1682, in main
[rank0]: traceBench.runBench(commsParams)
[rank0]: File "/data/users/shengfu/fbsource/buck-out/v2/gen/fbcode/6ef5f323b6193f0f/param_bench/et_replay/comm_replay/comm_replay#link-tree/et_replay/tools/comm_replay.py", line 1330, in runBench
[rank0]: self.benchTime(commsParams)
[rank0]: File "/data/users/shengfu/fbsource/buck-out/v2/gen/fbcode/6ef5f323b6193f0f/param_bench/et_replay/comm_replay/comm_replay#link-tree/et_replay/tools/comm_replay.py", line 1242, in benchTime
[rank0]: self.replayTrace(commsParams=commsParams, warmup=True)
[rank0]: File "/data/users/shengfu/fbsource/buck-out/v2/gen/fbcode/6ef5f323b6193f0f/param_bench/et_replay/comm_replay/comm_replay#link-tree/et_replay/tools/comm_replay.py", line 1069, in replayTrace
[rank0]: (latency, global_latency) = self.runComms(
[rank0]: File "/data/users/shengfu/fbsource/buck-out/v2/gen/fbcode/6ef5f323b6193f0f/param_bench/et_replay/comm_replay/comm_replay#link-tree/et_replay/tools/comm_replay.py", line 808, in runComms
[rank0]: self.collectiveArgs.srcOrDst = curComm.srcOrDst
[rank0]: AttributeError: 'commsArgs' object has no attribute 'srcOrDst'

Please take a look. Thanks

@TaekyungHeo
Copy link
Contributor

Thanks, @shengfukevin . We will take a look.

@GSSBMW
Copy link

GSSBMW commented Aug 12, 2024

This commit should be moved to this PR for fix:
cacba6d

@TaekyungHeo
Copy link
Contributor

Please see if this works: #156

…o traces (#155)

Summary:
Refactored the commsTraceParser to enhance readability and maintainability.
Removed deprecated support for parsing "basic" and "Kineto" traces; only Chakra host execution traces are now supported.
Added detailed logging to handle undecimal process group names with warnings instead of exceptions.
Modified _parse_proc_group_info and _parse_comms_op_node for improved processing of process group info and communication operations.
Updated comms_utils to support additional data types (signed char, unsigned char).
Adjusted the command-line interface to reflect the updated trace type options.


Test Plan: $ comm_replay --trace-type et --trace-path /home/sanshang/021_debug/000_code/param/trace/traces_megatronlm_gpt_43B_32ranks_pytnightly0703/execution_trace

Reviewed By: briancoutinho

Differential Revision: D61025278

Pulled By: shengfukevin
@facebook-github-bot facebook-github-bot force-pushed the shengf/refactor-comm-trace-parser branch from 4c96222 to b444dbc Compare August 13, 2024 06:20
@facebook-github-bot
Copy link
Contributor

@shengfukevin has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D61025278

@facebook-github-bot
Copy link
Contributor

@shengfukevin merged this pull request in c85f1b5.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants