Skip to content

Commit

Permalink
[c10d] update the work progress of PG periodically (pytorch#120438)
Browse files Browse the repository at this point in the history
Summary:
Previously, I added lastEnqueuedSeq_ and lastCompletedSeq_ to store the states of PG progress
but log only when there is timeout detected.

We found it is not enough since the 'straggler' itself might not detect
the timeout and hence there is no log from the 'straggler'.

In this PR, we can log these states periorically so that it would be
much easier for us to identify the straggler by checking which rank
has the smallest number of lastEnqueuedSeq_
Test Plan:
Log adding, build success

Pull Request resolved: pytorch#120438
Approved by: https://github.com/wconstab, https://github.com/XilunWu, https://github.com/kwen2501
  • Loading branch information
shuqiangzhang authored and pytorchmergebot committed Feb 23, 2024
1 parent 8f4ffd3 commit 2b0168a
Show file tree
Hide file tree
Showing 2 changed files with 16 additions and 0 deletions.
14 changes: 14 additions & 0 deletions torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1552,6 +1552,20 @@ void ProcessGroupNCCL::watchdogHandler() {
// Bump up heart beat by one.
heartbeat_++;

// Some versions of GLOG support less-spammy version of LOG_EVERY_MS
// in which case we don't want to spam the logs.
#ifdef LOG_EVERY_MS
// Log the progress of this PG periodically
C10_LOG_EVERY_MS(INFO, kWorkStatusUpdatePeriodMs) << c10::str(
logPrefix(),
"NCCL Work update periodically: ",
"last enqueued NCCL work: ",
lastEnqueuedSeq_,
", last completed NCCL work: ",
lastCompletedSeq_,
".");
#endif

for (auto it = workMetaList_.begin(); it != workMetaList_.end();
/* no increment */) {
auto& work = *it;
Expand Down
2 changes: 2 additions & 0 deletions torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -99,6 +99,8 @@ constexpr const char* NCCL_BACKEND_NAME = "nccl";

constexpr const char* TIMEOUT_DUMP = "timeout_dump";

constexpr const int kWorkStatusUpdatePeriodMs = 10 * 1000; // 10 seconds

constexpr auto kProcessGroupNCCLDefaultTimeout =
std::chrono::milliseconds(10 * 60 * 1000);

Expand Down

0 comments on commit 2b0168a

Please sign in to comment.