Refactor comms trace parser and deprecate support for basic and Kineto traces #155
@shengfukevin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
…o traces (#155)
Summary:
Refactored the commsTraceParser to enhance readability and maintainability.
Removed deprecated support for parsing "basic" and "Kineto" traces; only Chakra host execution traces are now supported.
Added detailed logging so that non-decimal process group names produce warnings instead of exceptions.
Modified _parse_proc_group_info and _parse_comms_op_node for improved processing of process group info and communication operations.
Updated comms_utils to support additional data types (signed char, unsigned char).
Adjusted the command-line interface to reflect the updated trace type options.
Test Plan:
$ comm_replay --trace-type et --trace-path /home/sanshang/021_debug/000_code/param/trace/traces_megatronlm_gpt_43B_32ranks_pytnightly0703/execution_trace
Differential Revision: D61025278
afa7558 to 4c96222
@shengfukevin has updated the pull request. You must reimport the pull request before landing.
This pull request was exported from Phabricator. Differential Revision: D61025278
@TaekyungHeo and @Sergei-Lebedev, I fixed the parser for version 1.1.1, which missed passing commArgs. Also, I saw that _getTensorInfoFromPyTorchETEntry and tensorDtypeMap are no longer used. Shall we clean them up? Thanks
@TaekyungHeo, the 2-rank ResNet test run failed with the following call stack:
[rank0]: File "/data/users/shengfu/fbsource/buck-out/v2/gen/fbcode/6ef5f323b6193f0f/param_bench/et_replay/comm_replay/comm_replay#link-tree/run_lpar_main.py", line 73, in
Please take a look. Thanks
Thanks, @shengfukevin. We will take a look.
This commit should be moved to this PR for the fix:
Please see if this works: #156
…o traces (#155)
Reviewed By: briancoutinho
Differential Revision: D61025278
Pulled By: shengfukevin
4c96222 to b444dbc
@shengfukevin merged this pull request in c85f1b5.
Summary
Refactored the commsTraceParser to enhance readability and maintainability.
Removed deprecated support for parsing "basic" and "Kineto" traces; only Chakra host execution traces are now supported.
Added detailed logging so that non-decimal process group names produce warnings instead of exceptions.
Modified _parse_proc_group_info and _parse_comms_op_node for improved processing of process group info and communication operations.
Updated comms_utils to support additional data types (signed char, unsigned char).
Adjusted the command-line interface to reflect the updated trace type options.
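The warning-instead-of-exception behavior for process group names can be illustrated roughly as follows. This is a minimal sketch and not the code from this PR: the function name `parse_pg_ids`, the logger name, and the input shape (a plain list of name strings) are all assumptions made for illustration.

```python
import logging

# Hypothetical logger name, not taken from the PR.
logger = logging.getLogger("comms_trace_parser_sketch")


def parse_pg_ids(pg_names):
    """Map process group names to integer ids, skipping non-decimal names.

    `pg_names` is a hypothetical list of names as they might appear in a trace.
    """
    pg_ids = {}
    for name in pg_names:
        if not name.isdecimal():
            # Warn and continue rather than raising, so a single unexpected
            # name does not abort parsing of the whole trace.
            logger.warning("Skipping process group with non-decimal name: %r", name)
            continue
        pg_ids[name] = int(name)
    return pg_ids
```

With this approach a trace that contains an unexpected group name still yields the ids of every group that could be parsed, and the skipped names remain visible in the log.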
Test Plan
$ comm_replay --trace-type et --trace-path /home/sanshang/021_debug/000_code/param/trace/traces_megatronlm_gpt_43B_32ranks_pytnightly0703/execution_trace
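For readers unfamiliar with how a trace type option can be narrowed, here is a minimal argparse sketch. Only the flag names `--trace-type` and `--trace-path` and the value `et` come from the test plan above; the program name, the default, the help strings, and the use of `choices` are assumptions, and the real comm_replay CLI may be structured differently.

```python
import argparse


def build_parser():
    # Hypothetical parser; it mirrors only the two flags shown in the test plan.
    parser = argparse.ArgumentParser(prog="comm_replay_sketch")
    parser.add_argument(
        "--trace-type",
        choices=["et"],  # assumption: only Chakra host execution traces are accepted
        default="et",
        help="type of the input trace",
    )
    parser.add_argument(
        "--trace-path",
        required=True,
        help="path to the execution trace",
    )
    return parser


# Placeholder path for illustration only.
args = build_parser().parse_args(
    ["--trace-type", "et", "--trace-path", "/path/to/execution_trace"]
)
```

Restricting `choices` means that a removed trace type such as `basic` is rejected at argument parsing time with a usage error, instead of failing later inside the parser.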