Support PyTorch 2.4.x+. Version 2.4.x has been extensively validated in production, while version 2.5 has undergone only limited preliminary validation for usability.
Support Python 3.10.
Support XPU_TIMER metric integration and training-hang detection (1st edition) with a new positive-diagnosis implementation.
Support a new fast-fail strategy for TensorFlow scenarios in the pending-timeout case.
Support more key events (for monitoring).
Refactor resource monitor.
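The training-hang detection above can be illustrated with a minimal sketch. This is not DLRover's actual implementation and all names here are hypothetical; it only shows the general timeout-based idea: workers report a heartbeat after each training step, and the job is flagged as hung when no step completes within a timeout window.

```python
import time


class HangDetector:
    """Minimal sketch of timeout-based training-hang detection.

    A worker reports a heartbeat after each completed training step;
    if no heartbeat arrives within `timeout` seconds, the job is
    flagged as hung. Illustrative only, not DLRover's API.
    """

    def __init__(self, timeout=300.0):
        self.timeout = timeout
        self.last_step_time = time.monotonic()

    def report_step(self):
        # Called by the worker after each completed step.
        self.last_step_time = time.monotonic()

    def is_hung(self, now=None):
        # True when no step has completed within the timeout window.
        now = time.monotonic() if now is None else now
        return now - self.last_step_time > self.timeout
```

A monitoring loop would periodically call `is_hung()` and trigger diagnosis or failover when it returns `True`.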
BugFix and Enhancement:
Fixed the issue where the worker fault-tolerance count did not meet expectations when the number of workers is less than the default retry count (3).
Fixed the issue where step 0 could not be saved.
Fixed the sporadic issue where concurrent directory deletion could cause an exception.
Fixed the issue in large-scale training scenarios where the master address in rdzv was occasionally read before it was written.
Fixed some known node-management issues.
Fixed an occasional master-address retrieval issue in torch training.
Enhanced the node heartbeat mechanism in some corner cases.
Fixed an unexpected failover failure caused by a resource-quota issue.
Fixed an unexpected process leak when using Ascend NPU (workaround).
Refactored 'job_context' to control all key state read/write operations.
Fixed a known issue related to master fault tolerance (internal feature).
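The rdzv read-before-write race mentioned above is a classic pattern: a worker queries the master address before the master has published it. A common fix, sketched below under assumed names (this is not DLRover's code; `kv_store` stands in for whatever rendezvous backend holds the address), is to poll with a backoff until the key appears instead of failing on the first empty read.

```python
import time


def get_master_addr(kv_store, key="master_addr", timeout=60.0, interval=0.5):
    """Poll a dict-like rendezvous store until the master address is written.

    Guards against the read-before-write race in large-scale jobs:
    retries the read with a short sleep instead of failing immediately.
    Illustrative sketch only; names are hypothetical.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        addr = kv_store.get(key)
        if addr is not None:
            return addr
        time.sleep(interval)  # back off and retry instead of failing fast
    raise TimeoutError(f"{key!r} was not written within {timeout}s")
```

With this pattern, the rare case where a worker races ahead of the master resolves on a later retry rather than aborting the rendezvous.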