Releases · intelligent-machine-learning/dlrover
Release 0.4.0
Features:
- Support PyTorch 2.4.x+. Version 2.4.x has been extensively validated in production, while version 2.5 has undergone only limited preliminary validation for usability.
- Support Python 3.10.
- Support XPU_TIMER metric integration and training hang detection (1st edition) with the new positive-diagnosis implementation.
- Support a new fast-fail strategy for TensorFlow scenarios in the pending-timeout case.
- More key events (for monitoring) supported.
- Refactor the resource monitor.
BugFix and Enhancement:
- Fixed the issue where the worker fault-tolerance count did not meet expectations when the number of workers was less than the default retry count (3).
- Fixed the issue where step 0 could not be saved.
- Fixed the sporadic issue where concurrent directory deletion could cause an exception.
- Fixed the issue in large-scale training scenarios where the master address in rendezvous (rdzv) was occasionally read before it was written.
- Fixed some node management known issues.
- Fixed occasional master-address retrieving issue in torch training.
- Enhanced the node heartbeat mechanism for some corner cases.
- Fixed unexpected failover failure due to resource quota issue.
- Fixed an unexpected process leak when using Ascend NPU (workaround).
- Refactored 'job_context' to control all key state read/write operations.
- Fixed a known issue related to master fault tolerance (internal feature).
- Enhanced the node-check procedure.
- Other tiny fixes and enhancements.
Others:
- Code base adjustment.
- UT performance improved.
Release 0.3.8
Features:
- Added the basic implementation of the first version of positive diagnostics.
- Supported a 'fast-fail' strategy for training jobs in some boundary scenarios, e.g. the pending case.
- Accelerated pod creation (sync -> async).
- Added the basic implementation of structured event logging.
BugFix:
- Fixed unexpected rendezvous failure in occasional fault-tolerant scenarios.
- Fixed unexpected socket client creation before the server socket was created.
- Optimized 'network-check' implementation for 'Ascend NPU'.
- Optimized some implementations for master-fault-tolerance(internal) scenario.
- Fixed and optimized numerous other known issues.
Release 0.3.7
Features:
- Flash Checkpoint supports deleting old checkpoints (see the sketch below).
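
A minimal sketch of what such a keep-latest deletion policy can look like; the function name and the `checkpoint-<step>` directory layout below are hypothetical illustrations, not DLRover's actual API:

```python
import os
import re
import shutil

def keep_latest_checkpoints(ckpt_dir: str, max_to_keep: int = 3) -> None:
    """Delete all but the newest `max_to_keep` checkpoint directories.

    Assumes checkpoints live in subdirectories named 'checkpoint-<step>'
    (a hypothetical layout used only for illustration).
    """
    pattern = re.compile(r"checkpoint-(\d+)$")
    steps = sorted(
        (int(m.group(1)), os.path.join(ckpt_dir, name))
        for name in os.listdir(ckpt_dir)
        if (m := pattern.match(name))
    )
    # Drop everything except the `max_to_keep` newest step directories.
    for _, path in steps[: max(0, len(steps) - max_to_keep)]:
        shutil.rmtree(path, ignore_errors=True)
```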
BugFix:
- Save/load the non-parameter variables of the distributed optimizer in Megatron-LM models.
- The agent waits for asynchronous checkpoint saving to finish before exiting.
Release 0.3.6
Features:
- Flash Checkpoint provides `FlashCkptTrainer` to support the HuggingFace `transformers.Trainer` (see the sketch after this list).
- Flash Checkpoint supports loading the Megatron-LM checkpoint from memory.
- Flash Checkpoint supports saving and loading FSDP checkpoints with a full state dict.
- The job master can sort node ranks by the access switches of the nodes.
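
A sketch of the drop-in usage implied by the `FlashCkptTrainer` item above. The import path is an assumption based on this release note and may differ between versions; the toy model and dataset are placeholders so the sketch is self-contained:

```python
import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, TrainingArguments

# Assumed import path; check the DLRover docs for the exact module.
from dlrover.trainer.torch.flash_checkpoint.hf_trainer import FlashCkptTrainer

class ToyDataset(Dataset):
    """Tiny synthetic causal-LM dataset, used only for illustration."""
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        ids = torch.randint(0, 50257, (16,))
        return {"input_ids": ids, "labels": ids.clone()}

model = AutoModelForCausalLM.from_pretrained("gpt2")
args = TrainingArguments(output_dir="./ckpt", save_steps=100, max_steps=200)

# Used like transformers.Trainer, but checkpoints go to shared memory
# first and are persisted to storage asynchronously.
trainer = FlashCkptTrainer(model=model, args=args, train_dataset=ToyDataset())
trainer.train()
```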
BugFix:
- Fix the segmentation fault when restarting the training process.
Release 0.3.5
Release 0.3.4
Features:
- Flash checkpoint enables saving and loading Megatron-LM models from multiple ranks in parallel.
- `dlrover-run --auto-config` automatically configures the number of nodes and the number of processes per node.
- Users can customize the storage APIs to save the checkpoint into different file systems (see the sketch after this list).
- A deletion strategy cleans up old checkpoint files.
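
An illustrative sketch of the kind of pluggable storage interface this describes; the class and method names here are hypothetical, not DLRover's actual API. A backend for S3, HDFS, or any other file system would implement the same two methods:

```python
import abc
import io
import torch

class CheckpointStorage(abc.ABC):
    """Hypothetical backend interface deciding where checkpoint bytes go."""

    @abc.abstractmethod
    def write(self, data: bytes, path: str) -> None: ...

    @abc.abstractmethod
    def read(self, path: str) -> bytes: ...

class PosixStorage(CheckpointStorage):
    """Example backend: plain files on a local or network file system."""

    def write(self, data: bytes, path: str) -> None:
        with open(path, "wb") as f:
            f.write(data)

    def read(self, path: str) -> bytes:
        with open(path, "rb") as f:
            return f.read()

def save_state_dict(state_dict: dict, path: str, storage: CheckpointStorage) -> None:
    """Serialize a state dict once, then delegate the write to any backend."""
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    storage.write(buf.getvalue(), path)
```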
BugFix:
- Fixed the bug that the shared memory did not exist if the size of the checkpoint changed.
Release 0.3.3
Features:
- Support Python > 3.10.
- Support restarting the training process on Ascend NPU.
- Support asynchronously saving the checkpoint of the distributed optimizer of Megatron-LM to the storage.
BugFix:
- Fix the checkpoint shard inconsistency of all ranks.
- Fix the bug in asynchronously saving the Megatron-LM checkpoint for jobs with multiple GPUs across multiple nodes.
- Fix the bug in loading the Megatron-LM checkpoint.
Release 0.3.1
Feature:
- Users can use flash checkpoint with `torchrun` or `python -m torch.distributed.launch` (see the sketch below).
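
A sketch of a DDP script saving flash checkpoints under a plain `torchrun --nproc_per_node=2 train.py` launch. The `DdpCheckpointer` import path and the `save_checkpoint` signature are assumptions based on DLRover's documentation and may differ across versions:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumed import path and signature; verify against the DLRover docs.
from dlrover.trainer.torch.flash_checkpoint.ddp import DdpCheckpointer

dist.init_process_group("gloo")  # use "nccl" on GPU clusters
model = DDP(torch.nn.Linear(8, 8))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
checkpointer = DdpCheckpointer("/tmp/flash_ckpt")

for step in range(1, 1001):
    x = torch.randn(4, 8)
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 100 == 0:
        # Snapshot goes to shared memory first; persistence is async.
        checkpointer.save_checkpoint(
            step, {"model": model.state_dict(), "optim": opt.state_dict()}
        )
```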
BugFix:
- The dlrover master could not print the error message of the faulty node in a Kubeflow PyTorchJob.
Release 0.3.0
Features:
- Flash Checkpoint asynchronously persists checkpoints to storage (see the sketch after this list).
- Flash Checkpoint recovers from failures using the checkpoint in memory.
- Flash Checkpoint supports DDP/FSDP/DeepSpeed/Megatron.
- Node detection supports NPU.
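
A minimal sketch of the asynchronous-persist idea, illustrative only and not DLRover's implementation: the training loop blocks only for an in-memory copy, while a background thread does the slow write to storage. Keeping the snapshot in memory is also what makes in-memory failure recovery possible:

```python
import copy
import queue
import threading
import torch

_persist_q: queue.Queue = queue.Queue()

def _persist_worker(ckpt_dir: str) -> None:
    """Drain snapshots and do the slow storage I/O off the training path."""
    while True:
        step, snapshot = _persist_q.get()
        torch.save(snapshot, f"{ckpt_dir}/ckpt-{step}.pt")
        _persist_q.task_done()

threading.Thread(target=_persist_worker, args=("/tmp/ckpt",), daemon=True).start()

def flash_save(step: int, state_dict: dict) -> None:
    """Training blocks only for this memory copy, not for storage I/O."""
    snapshot = {
        k: v.detach().cpu().clone() if torch.is_tensor(v) else copy.deepcopy(v)
        for k, v in state_dict.items()
    }
    _persist_q.put((step, snapshot))
```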
Examples:
- An example of training nanoGPT using DeepSpeed.
- An example of saving/loading a sharded FSDP checkpoint.
Release 0.2.2
ElasticJob
Features:
- `dlrover-run` can run any distributed job with `NODE_RANK` and `DLROVER_MASTER_ADDR` set in the environment (see the sketch after this list).
- DLRover can asynchronously save the checkpoint to storage, blocking training for only a short time.
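
A sketch of the environment contract this describes. The variable names come from the release note itself, while the `host:port` address format and the torchrun-style flags are assumptions:

```python
import os
import subprocess

env = dict(os.environ)
env["NODE_RANK"] = "0"                        # this node's rank in the job
env["DLROVER_MASTER_ADDR"] = "10.0.0.1:8000"  # assumed host:port format

# dlrover-run is a torchrun-compatible launcher; the flags below are
# torchrun-style and assumed to apply here.
subprocess.run(
    ["dlrover-run", "--nnodes=2", "--nproc_per_node=8", "train.py"],
    env=env,
    check=True,
)
```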
BugFix:
- Fix the bug in loading the FSDP checkpoint.