
Release 0.4.0

@BalaBalaYi released this 20 Jan 06:30

Features:

  • Support PyTorch 2.4.x and above. Version 2.4.x has been extensively validated in production, while version 2.5 has undergone only limited preliminary usability validation.
  • Support Python 3.10.
  • Support XPU_TIMER metric integration and training hang detection (1st edition) with a new proactive diagnosis implementation (see the sketch after this list).
  • Support a new fast-fail strategy for TensorFlow scenarios under the pending-timeout case.
  • Support more key events (for monitoring).
  • Refactored the resource monitor.
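
For the training hang detection mentioned above, the sketch below illustrates only the general watchdog pattern such detection typically relies on: the training loop reports a heartbeat after each step, and a background thread flags a possible hang when no heartbeat arrives within a timeout. The names (`StepWatchdog`, `heartbeat`, `timeout_s`) and the thresholds are assumptions for illustration, not this project's actual API.

```python
import threading
import time


class StepWatchdog:
    """Hypothetical illustration of timer-based hang detection.

    The training loop calls `heartbeat()` after every step; a background
    thread flags a possible hang if no heartbeat arrives within `timeout_s`.
    """

    def __init__(self, timeout_s: float = 1800.0, poll_s: float = 30.0):
        self.timeout_s = timeout_s
        self.poll_s = poll_s
        self.hang_detected = False
        self._last_beat = time.monotonic()
        self._lock = threading.Lock()
        threading.Thread(target=self._watch, daemon=True).start()

    def heartbeat(self) -> None:
        # Called from the training loop after each completed step.
        with self._lock:
            self._last_beat = time.monotonic()

    def _watch(self) -> None:
        while True:
            time.sleep(self.poll_s)
            with self._lock:
                idle = time.monotonic() - self._last_beat
            if idle > self.timeout_s:
                self.hang_detected = True
                print(f"Possible training hang: no step for {idle:.0f}s")
```

In a real integration the timeout would be tuned to the expected step time, and the signal would feed a diagnosis component rather than just print.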

BugFix and Enhancement:

  • Fixed the issue where the worker fault-tolerance count did not meet expectations when the number of workers is less than the default retry count (3).
  • Fixed the issue where step 0 could not be saved.
  • Fixed a sporadic issue where concurrent directory deletion could cause an exception.
  • Fixed an issue in large-scale training scenarios where the master address in rdzv is occasionally read before it is written (see the first sketch after this list).
  • Fixed some known node-management issues.
  • Fixed an occasional master-address retrieval issue in torch training.
  • Enhanced the node heartbeat mechanism for some corner cases.
  • Fixed unexpected failover failures caused by resource quota issues.
  • Fixed an unexpected process leak when using Ascend NPU (workaround).
  • Refactored 'job_context' so that all key state read/write operations go through it (see the second sketch after this list).
  • Fixed a known issue related to master fault tolerance (internal feature).
  • Enhanced the node-check procedure.
  • Improved UT performance.
  • Other minor fixes and enhancements.
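
The rendezvous fix above addresses a classic read-before-write race: a worker may look up the master address before the master has published it. A common remedy is to retry the read until a deadline, as in the minimal sketch below; `kv_store`, the key name, and the timeouts are hypothetical placeholders, not the project's actual rendezvous API.

```python
import time


def read_master_addr(kv_store, key="master_addr",
                     timeout_s=300.0, interval_s=2.0):
    """Hypothetical retry loop: wait until the rendezvous key is written.

    `kv_store` is any dict-like store shared through the rendezvous
    backend; all names here are illustrative only.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        addr = kv_store.get(key)
        if addr:  # the writer side has finished publishing the address
            return addr
        time.sleep(interval_s)
    raise TimeoutError(f"'{key}' was not written within {timeout_s}s")
```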

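The 'job_context' refactor routes key state reads and writes through a single place. The sketch below shows only the general pattern of a centralized, lock-protected context; the class and method names are illustrative assumptions, not the project's real implementation.

```python
import threading


class JobContext:
    """Hypothetical sketch of a single owner for key job state.

    All reads and writes of node status go through this object instead
    of being scattered across modules; names are illustrative only.
    """

    def __init__(self):
        self._lock = threading.RLock()
        self._node_status = {}  # node_id -> status string

    def update_node_status(self, node_id: int, status: str) -> None:
        with self._lock:
            self._node_status[node_id] = status

    def get_node_status(self, node_id: int) -> str:
        with self._lock:
            return self._node_status.get(node_id, "UNKNOWN")
```

Funneling every state change through one locked object avoids the scattered, racy read/write paths that such a refactor targets.
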
Others:

  • Code base adjustment.