[c10d] update the work progress of PG periodically (pytorch#120438)
Summary: Previously, I added lastEnqueuedSeq_ and lastCompletedSeq_ to store the progress state of the PG, but they were logged only when a timeout was detected. We found this is not enough: the 'straggler' itself might not detect the timeout, so it produces no log. In this PR, we log these states periodically, which makes it much easier to identify the straggler by checking which rank has the smallest lastEnqueuedSeq_.

Test Plan: Log adding, build success

Pull Request resolved: pytorch#120438

Approved by: https://github.com/wconstab, https://github.com/XilunWu, https://github.com/kwen2501
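
To illustrate the straggler check described in the summary, here is a minimal sketch of how the periodically logged values might be compared across ranks. It is not part of this PR: the `PGProgress` type, the `find_stragglers` helper, and the sample numbers are all hypothetical, and the sketch assumes the per-rank values have already been collected from the logs (it does not parse any real log format).

```python
from typing import Dict, List, NamedTuple


class PGProgress(NamedTuple):
    """Hypothetical per-rank snapshot of the two counters named in the summary."""
    last_enqueued_seq: int
    last_completed_seq: int


def find_stragglers(progress: Dict[int, PGProgress]) -> List[int]:
    """Return the rank(s) with the smallest last enqueued sequence number."""
    if not progress:
        return []
    min_seq = min(p.last_enqueued_seq for p in progress.values())
    return sorted(
        rank for rank, p in progress.items() if p.last_enqueued_seq == min_seq
    )


# Made-up values for a 4-rank job; rank 2 has enqueued the fewest collectives.
snapshot = {
    0: PGProgress(last_enqueued_seq=120, last_completed_seq=118),
    1: PGProgress(last_enqueued_seq=120, last_completed_seq=119),
    2: PGProgress(last_enqueued_seq=97, last_completed_seq=97),
    3: PGProgress(last_enqueued_seq=120, last_completed_seq=118),
}

print(find_stragglers(snapshot))  # [2]
```

If every rank reports the same value, the helper returns all ranks, which suggests that no single rank is lagging at enqueue time.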