Releases: intelligent-machine-learning/dlrover

Release 0.4.0

20 Jan 06:30

Features:

  • Support PyTorch 2.4.x and later. Version 2.4.x has been extensively validated in production, while version 2.5 has undergone only limited preliminary validation for usability.
  • Support Python 3.10.
  • Support XPU_TIMER metric integration and training hang detection (1st edition) with a new positive diagnosis implementation.
  • Support a new fast-fail strategy for TensorFlow scenarios in the pending-timeout case.
  • Support more key events (for monitoring).
  • Refactored the resource monitor.

BugFix and Enhancement:

  • Fixed the issue where the worker fault tolerance count does not meet expectations when the number of workers is less than the default retry count (3).
  • Fixed the issue where step 0 could not be saved.
  • Fixed the sporadic issue where concurrent directory deletion could cause an exception.
  • Fixed the issue in large-scale training scenarios where reading the master address in rdzv occasionally occurs before writing.
  • Fixed some known node management issues.
  • Fixed an occasional master-address retrieval issue in torch training.
  • Enhanced the node heartbeat mechanism in some corner cases.
  • Fixed an unexpected failover failure due to a resource quota issue.
  • Fixed an unexpected process leak when using Ascend NPU (workaround).
  • Refactored 'job_context' to control all key state read/write operations.
  • Fixed a known issue related to master fault tolerance (internal feature).
  • Enhanced the node-check procedure.
  • Improved UT performance.
  • Other tiny fixes and enhancements.

Others

  • Code base adjustment.

Release 0.3.8

29 Sep 02:02
bdc5ed2

Features:

  • Added the basic implementation of the first version of positive diagnostics.
  • Supported a 'fast-fail' strategy for training jobs in some boundary scenarios, e.g. the pending case.
  • Accelerated pod creation (sync -> async).
  • Added the basic implementation of structured event logging.

BugFix:

  • Fixed unexpected rendezvous failure in occasional fault-tolerant scenarios.
  • Fixed unexpected socket client creation before socket creation.
  • Optimized 'network-check' implementation for 'Ascend NPU'.
  • Optimized some implementations for the master-fault-tolerance (internal) scenario.
  • Other numerous known issues fixed and optimized.

Release 0.3.7

13 May 06:02

Features:

  • Flash Checkpoint supports deleting old checkpoints.

BugFix:

  • Save/load the non-parameter variables of the distributed optimizer in Megatron-LM models.
  • The agent waits for asynchronous checkpoint saving to finish before exiting.

Release 0.3.6

24 Apr 06:18

Features:

  • Flash Checkpoint provides FlashCkptTrainer to support HuggingFace transformers.Trainer.
  • Flash Checkpoint supports loading the Megatron-LM checkpoint from memory.
  • Flash Checkpoint supports saving and loading FSDP checkpoints with a full state dict.
  • Job master can sort the node ranks by the access switches of the nodes.
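
The release note does not describe how the rank sorting works. As a rough illustration only, the sketch below assumes the goal is to give nodes under the same access switch consecutive ranks. The function name, the node names, and the switch labels are all hypothetical and are not part of DLRover's API.

```python
def assign_ranks_by_switch(node_switch: dict[str, str]) -> dict[str, int]:
    """Return a rank for each node so that nodes sharing a switch are adjacent."""
    # Assumption: ordering by (switch, node name) is enough to group
    # nodes behind the same access switch into one contiguous rank range.
    ordered = sorted(node_switch, key=lambda node: (node_switch[node], node))
    return {node: rank for rank, node in enumerate(ordered)}


# Hypothetical topology: two switches with two nodes each.
topology = {
    "node-0": "switch-b",
    "node-1": "switch-a",
    "node-2": "switch-b",
    "node-3": "switch-a",
}
ranks = assign_ranks_by_switch(topology)
```

With this ordering, ranks that are neighbours in a collective are more likely to sit behind the same switch, which is presumably the motivation for the feature.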

BugFix:

  • Fix the segmentation fault when restarting the training process.

Release 0.3.5

29 Mar 07:02

Features:

  • Flash checkpoint supports saving and loading Megatron-LM MOE models. #1042
  • APIs to extend the module to check the node with different chips. #1023
  • Automatically mark the node as unschedulable if the node fails. #1025

BugFix:

  • Fix saving and loading checkpoints in the DDP MNIST example. #1051
  • Fix the checkpoint name of DDP. #1034

Release 0.3.4

21 Feb 07:10

Features:

  • Flash checkpoint enables saving and loading Megatron-LM models from multiple ranks in parallel.
  • dlrover-run --auto-config automatically configures the number of nodes and the number of processes per node.
  • Users can customize the APIs of storage to save the checkpoint into different file systems.
  • Deletion strategy to clean the old checkpoint files.
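
The deletion strategy mentioned above is not specified further in these notes. A minimal sketch of one common policy, keeping only the newest N checkpoints, might look like the following. The `checkpoint-<step>` directory layout and the `clean_old_checkpoints` helper are assumptions made for illustration, not DLRover's actual interface.

```python
import re
import shutil
from pathlib import Path


def clean_old_checkpoints(ckpt_dir: Path, max_to_keep: int) -> list[int]:
    """Delete all but the newest `max_to_keep` step directories.

    Returns the list of deleted step numbers in ascending order.
    """
    # Hypothetical layout: one sub-directory per step, named "checkpoint-<step>".
    steps = sorted(
        int(m.group(1))
        for p in ckpt_dir.iterdir()
        if p.is_dir() and (m := re.fullmatch(r"checkpoint-(\d+)", p.name))
    )
    # steps[:-0] would be empty, so handle max_to_keep == 0 explicitly.
    to_delete = steps[:-max_to_keep] if max_to_keep > 0 else steps
    for step in to_delete:
        # ignore_errors tolerates another process removing the directory first.
        shutil.rmtree(ckpt_dir / f"checkpoint-{step}", ignore_errors=True)
    return to_delete
```

Sorting by the parsed integer step rather than by directory name avoids treating `checkpoint-100` as older than `checkpoint-30`.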

BugFix:

  • Fixed the error where the shared memory does not exist if the size of the checkpoint changes.

Release 0.3.3

25 Jan 02:28

Features:

  • Support Python > 3.10.
  • Support restarting the training process on Ascend NPU.
  • Support asynchronously saving the checkpoint of the distributed optimizer of Megatron-LM to the storage.

BugFix:

  • Fix the checkpoint shard inconsistency of all ranks.
  • Fix the bug in asynchronously saving the Megatron-LM checkpoint for jobs with multiple GPUs on multiple nodes.
  • Fix the bug in loading the Megatron-LM checkpoint.

Release 0.3.1

10 Jan 01:54

Feature:

  • Users can use flash checkpoint using torchrun or python -m torch.distributed.launch.

Bugfix:

  • Fixed the issue where the DLRover master could not print the error message of the faulty node in a kubeflow/PytorchJob.

Release 0.3.0

03 Jan 06:54

Features:

  • Flash Checkpoint asynchronously persists checkpoints to storage.
  • Flash Checkpoint recovers from failures using the checkpoint in memory.
  • Flash Checkpoint supports DDP/FSDP/DeepSpeed/Megatron.
  • Node detection supports NPU.

Examples

  • An example of training nanoGPT using DeepSpeed.
  • An example of saving/loading a sharded FSDP checkpoint.

Release 0.2.2

21 Nov 06:41

ElasticJob

Features:

  • dlrover-run can run in any distributed job that sets NODE_RANK and DLROVER_MASTER_ADDR in the environment.
  • DLRover can asynchronously save the checkpoint to storage, which blocks training only briefly.
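
To illustrate why asynchronous saving blocks training only briefly, here is a generic sketch using the Python standard library. It is not DLRover's implementation: the `save_async` helper, the use of `pickle`, and the background thread are all assumptions chosen to show the general pattern of snapshotting in the foreground and persisting in the background.

```python
import copy
import pickle
import threading
from pathlib import Path


def save_async(state: dict, path: Path) -> threading.Thread:
    """Snapshot `state` now, then write it to `path` in a background thread."""
    # The only blocking step: copy the state in memory so that training
    # can keep mutating the original while the copy is written out.
    snapshot = copy.deepcopy(state)

    def _persist() -> None:
        # Write to a temporary file first, then rename, so a reader
        # never sees a partially written checkpoint.
        tmp = path.with_name(path.name + ".tmp")
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        tmp.replace(path)

    worker = threading.Thread(target=_persist)
    worker.start()
    return worker
```

The caller should `join()` the returned thread before exiting so the last checkpoint is fully written.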

BugFix:

  • Fix the bug in loading the FSDP checkpoint.