Update FR tutorial to include file path, writer and print usage (#3058)
fduwjj authored Sep 23, 2024
1 parent 0b23f46 commit cd7f684
Showing 1 changed file with 27 additions and 2 deletions.
29 changes: 27 additions & 2 deletions prototype_source/flight_recorder_tutorial.rst
@@ -48,6 +48,8 @@ Enabling Flight Recorder
------------------------
There are two required environment variables to get the initial version of Flight Recorder working.

- ``TORCH_NCCL_DEBUG_INFO_TEMP_FILE``: Sets the path, including the file name prefix, where Flight Recorder data is dumped. One file is
  written per rank. The default value is ``/tmp/nccl_trace_rank_``.
- ``TORCH_NCCL_TRACE_BUFFER_SIZE = (0, N)``: Setting ``N`` to a positive number enables collection.
  ``N`` represents the number of entries that will be kept internally in a circular buffer.
  We recommend setting this value to *2000*.
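
As a minimal sketch of the two settings above, the variables can be set from Python before PyTorch distributed is initialized. The helper names ``flight_recorder_env`` and ``expected_dump_path`` are hypothetical. The idea that each rank's file name is the prefix followed by the rank number is an assumption inferred from the default value, not something this text states.

```python
import os


def flight_recorder_env(prefix="/tmp/nccl_trace_rank_", buffer_size=2000):
    """Return the environment settings described above as a dict of strings."""
    if buffer_size <= 0:
        raise ValueError("buffer_size must be positive to enable collection")
    return {
        "TORCH_NCCL_DEBUG_INFO_TEMP_FILE": prefix,
        "TORCH_NCCL_TRACE_BUFFER_SIZE": str(buffer_size),
    }


def expected_dump_path(prefix, rank):
    # Assumption: the per-rank file name is the prefix followed by the rank.
    return f"{prefix}{rank}"


# Apply the settings before PyTorch distributed is initialized.
os.environ.update(flight_recorder_env())
print(expected_dump_path(os.environ["TORCH_NCCL_DEBUG_INFO_TEMP_FILE"], 0))
```

Setting the variables in the launching shell works equally well; the Python form is only convenient when the training script itself owns its configuration.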
@@ -71,6 +73,9 @@ Additional Settings

``fast`` is a new experimental mode that is shown to be much faster than the traditional ``addr2line``.
Use this setting in conjunction with ``TORCH_NCCL_TRACE_CPP_STACK`` to collect C++ traces in the Flight Recorder data.
- If you prefer to have the Flight Recorder data written to your own storage rather than to the local disk, you can define your own writer class.
  This class should inherit from ``::c10d::DebugInfoWriter``. Register the new writer using ``::c10d::DebugInfoWriter::registerWriter``
  before initializing PyTorch distributed.

Retrieving Flight Recorder Data via an API
------------------------------------------
@@ -169,9 +174,29 @@ To run the convenience script, follow these steps:

2. To run the script, use this command:

.. code:: shell

   python fr_trace.py <dump dir containing trace files> [-o <output file>]

If you install the PyTorch nightly build or build from source with ``USE_DISTRIBUTED=1``, you can use the following
command directly:

.. code:: shell

   torchfrtrace <dump dir containing trace files> [-o <output file>]

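The choice between the two entry points above can be sketched in Python. This is an illustrative helper, not part of PyTorch: the name ``analyzer_command`` and the default script location ``fr_trace.py`` are assumptions, and only the arguments shown in the commands above are used.

```python
import shutil
import sys


def analyzer_command(dump_dir, output_file=None, script="fr_trace.py"):
    """Build the argv list for the analyzer, preferring ``torchfrtrace``."""
    if shutil.which("torchfrtrace"):
        # The console command is available on PATH.
        cmd = ["torchfrtrace"]
    else:
        # Fall back to running the script with the current interpreter.
        cmd = [sys.executable, script]
    cmd.append(dump_dir)
    if output_file is not None:
        cmd += ["-o", output_file]
    return cmd


print(analyzer_command("/tmp/dumps", "report.out"))
```

The resulting list can be passed to ``subprocess.run(cmd, check=True)`` once the dump directory exists.
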
Currently, the analyzer script supports two modes. In the first mode, the script applies heuristics to the parsed Flight
Recorder dumps to generate a report identifying potential culprits for the timeout. The second mode simply outputs the raw dumps.
By default, the script prints Flight Recorder dumps for all ranks and all ``ProcessGroups`` (PGs). This can be narrowed down to certain
ranks and PGs.

Caveat: the ``tabulate`` module is required, so you might need to ``pip install`` it first.

An example command is:

.. code:: shell

   python fr_trace.py <dump dir containing trace files> -j [--selected-ranks i j k ...]
   torchfrtrace <dump dir containing trace files> -j [--selected-ranks i j k ...]

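As a rough sketch, the example command above can be assembled programmatically when the set of ranks is computed at run time. The function name ``raw_dump_command`` and its ``entry_point`` parameter are hypothetical. The ``-j`` and ``--selected-ranks`` flags are copied from the example command and are not checked against the script's argument parser here.

```python
def raw_dump_command(dump_dir, selected_ranks=None, entry_point=("torchfrtrace",)):
    """Build the argv list for the example command, optionally limited to some ranks."""
    cmd = [*entry_point, dump_dir, "-j"]
    if selected_ranks:
        # Ranks are passed as separate space-delimited arguments.
        cmd.append("--selected-ranks")
        cmd += [str(rank) for rank in selected_ranks]
    return cmd


print(raw_dump_command("/tmp/dumps", [0, 2]))
```

Passing ``entry_point=("python", "fr_trace.py")`` yields the script form of the same command.
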
Conclusion
----------