Releases · intelligent-machine-learning/dlrover
Release 0.4.0
Features:
- Support PyTorch 2.4.x+. Version 2.4.x has been extensively validated in production, while version 2.5 has undergone only limited preliminary validation for usability.
- Support Python 3.10.
- Support XPU_TIMER metric integration and training hang detection (1st edition) with the new positive-diagnosis implementation.
- Support a new fast-fail strategy for TensorFlow scenarios in the pending-timeout case.
- More key events (for monitoring) supported.
- Refactor the resource monitor.
BugFix and Enhancement:
- Fixed the issue where the worker fault-tolerance count did not meet expectations when the number of workers was less than the default retry count (3).
- Fixed the issue where step 0 could not be saved.
- Fixed the sporadic issue where concurrent directory deletion could cause an exception.
- Fixed the issue in large-scale training scenarios where the master address in rendezvous (rdzv) was occasionally read before it was written.
- Fixed some node management known issues.
- Fixed occasional master-address retrieving issue in torch training.
- Enhanced the node heartbeat mechanism for some corner cases.
- Fixed unexpected failover failure due to resource quota issue.
- Fixed an unexpected process leak when using Ascend NPU (workaround).
- Refactored 'job_context' to control all key state read/write operations.
- Fixed a known issue related to master fault tolerance (internal feature).
- Enhanced the node-check procedure.
- Other tiny fixes and enhancements.
Others:
- Code base adjustment.
- UT performance improved.
Release 0.3.8
Features:
- Added the basic implementation of the first version of positive diagnostics.
- Supported a 'fast-fail' strategy for training jobs in some boundary scenarios, e.g. the pending case.
- Accelerated pod creation (sync -> async).
- Added the basic implementation of structured event logging.
BugFix:
- Fixed unexpected rendezvous failure in occasional fault-tolerant scenarios.
- Fixed unexpected socket client creation before the server socket was created.
- Optimized 'network-check' implementation for 'Ascend NPU'.
- Optimized some implementations for master-fault-tolerance(internal) scenario.
- Fixed and optimized numerous other known issues.
Release 0.3.7
Features:
- Flash Checkpoint supports deleting old checkpoints (see the sketch below).
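
A minimal sketch of what such a keep-latest deletion policy can look like; the function name and the `checkpoint-<step>` directory layout below are hypothetical illustrations, not DLRover's actual API:

```python
import os
import re
import shutil

def keep_latest_checkpoints(ckpt_dir: str, max_to_keep: int = 3) -> None:
    """Delete all but the newest `max_to_keep` checkpoint directories.

    Assumes checkpoints live in subdirectories named 'checkpoint-<step>'
    (a hypothetical layout used only for illustration).
    """
    pattern = re.compile(r"checkpoint-(\d+)$")
    steps = sorted(
        (int(m.group(1)), os.path.join(ckpt_dir, name))
        for name in os.listdir(ckpt_dir)
        if (m := pattern.match(name))
    )
    # Drop everything except the `max_to_keep` newest step directories.
    for _, path in steps[: max(0, len(steps) - max_to_keep)]:
        shutil.rmtree(path, ignore_errors=True)
```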
BugFix:
- Save/load the non-parameter variables of the distributed optimizer in Megatron-LM models.
- The agent waits for asynchronous checkpoint saving to finish before exiting.
Release 0.3.6
Features:
- Flash Checkpoint provides `FlashCkptTrainer` to support the HuggingFace `transformers.Trainer` (see the sketch after this list).
- Flash Checkpoint supports loading the Megatron-LM checkpoint from memory.
- Flash Checkpoint supports saving and loading FSDP checkpoints with a full state dict.
- The job master can sort node ranks by the access switches of the nodes.
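
A sketch of the drop-in usage implied by the `FlashCkptTrainer` item above. The import path is an assumption based on this release note and may differ between versions; the toy model and dataset are placeholders so the sketch is self-contained:

```python
import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, TrainingArguments

# Assumed import path; check the DLRover docs for the exact module.
from dlrover.trainer.torch.flash_checkpoint.hf_trainer import FlashCkptTrainer

class ToyDataset(Dataset):
    """Tiny synthetic causal-LM dataset, used only for illustration."""
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        ids = torch.randint(0, 50257, (16,))
        return {"input_ids": ids, "labels": ids.clone()}

model = AutoModelForCausalLM.from_pretrained("gpt2")
args = TrainingArguments(output_dir="./ckpt", save_steps=100, max_steps=200)

# Used like transformers.Trainer, but checkpoints go to shared memory
# first and are persisted to storage asynchronously.
trainer = FlashCkptTrainer(model=model, args=args, train_dataset=ToyDataset())
trainer.train()
```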
BugFix:
- Fix the segmentation fault when restarting the training process.
Release 0.3.5
Release 0.3.4
Features:
- Flash checkpoint enables saving and loading Megatron-LM models from multiple ranks in parallel.
- `dlrover-run --auto-config` automatically configures the number of nodes and the number of processes per node.
- Users can customize the storage APIs to save the checkpoint into different file systems (see the sketch after this list).
- A deletion strategy cleans up old checkpoint files.
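
An illustrative sketch of the kind of pluggable storage interface this describes; the class and method names here are hypothetical, not DLRover's actual API. A backend for S3, HDFS, or any other file system would implement the same two methods:

```python
import abc
import io
import torch

class CheckpointStorage(abc.ABC):
    """Hypothetical backend interface deciding where checkpoint bytes go."""

    @abc.abstractmethod
    def write(self, data: bytes, path: str) -> None: ...

    @abc.abstractmethod
    def read(self, path: str) -> bytes: ...

class PosixStorage(CheckpointStorage):
    """Example backend: plain files on a local or network file system."""

    def write(self, data: bytes, path: str) -> None:
        with open(path, "wb") as f:
            f.write(data)

    def read(self, path: str) -> bytes:
        with open(path, "rb") as f:
            return f.read()

def save_state_dict(state_dict: dict, path: str, storage: CheckpointStorage) -> None:
    """Serialize a state dict once, then delegate the write to any backend."""
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    storage.write(buf.getvalue(), path)
```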
BugFix:
- Fixed the bug that the shared memory did not exist if the size of the checkpoint changed.
Release 0.3.3
Features:
- Support Python > 3.10.
- Support restarting the training process on Ascend NPU.
- Support asynchronously saving the checkpoint of the distributed optimizer of Megatron-LM to the storage.
BugFix:
- Fix the checkpoint shard inconsistency of all ranks.
- Fix the bug in asynchronously saving the Megatron-LM checkpoint for jobs with multiple GPUs across multiple nodes.
- Fix the bug in loading the Megatron-LM checkpoint.
Release 0.3.1
Feature:
- Users can use flash checkpoint with `torchrun` or `python -m torch.distributed.launch` (see the sketch below).
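
A sketch of a DDP script saving flash checkpoints under a plain `torchrun --nproc_per_node=2 train.py` launch. The `DdpCheckpointer` import path and the `save_checkpoint` signature are assumptions based on DLRover's documentation and may differ across versions:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumed import path and signature; verify against the DLRover docs.
from dlrover.trainer.torch.flash_checkpoint.ddp import DdpCheckpointer

dist.init_process_group("gloo")  # use "nccl" on GPU clusters
model = DDP(torch.nn.Linear(8, 8))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
checkpointer = DdpCheckpointer("/tmp/flash_ckpt")

for step in range(1, 1001):
    x = torch.randn(4, 8)
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 100 == 0:
        # Snapshot goes to shared memory first; persistence is async.
        checkpointer.save_checkpoint(
            step, {"model": model.state_dict(), "optim": opt.state_dict()}
        )
```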
BugFix:
- The dlrover master could not print the error message of the faulty node in a Kubeflow PyTorchJob.
Release 0.3.0
Features:
- Flash Checkpoint asynchronously persists checkpoints to storage (see the sketch after this list).
- Flash Checkpoint recovers from failures using the checkpoint in memory.
- Flash Checkpoint supports DDP/FSDP/DeepSpeed/Megatron.
- Node detection supports NPU.
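
A minimal sketch of the asynchronous-persist idea, illustrative only and not DLRover's implementation: the training loop blocks only for an in-memory copy, while a background thread does the slow write to storage. Keeping the snapshot in memory is also what makes in-memory failure recovery possible:

```python
import copy
import queue
import threading
import torch

_persist_q: queue.Queue = queue.Queue()

def _persist_worker(ckpt_dir: str) -> None:
    """Drain snapshots and do the slow storage I/O off the training path."""
    while True:
        step, snapshot = _persist_q.get()
        torch.save(snapshot, f"{ckpt_dir}/ckpt-{step}.pt")
        _persist_q.task_done()

threading.Thread(target=_persist_worker, args=("/tmp/ckpt",), daemon=True).start()

def flash_save(step: int, state_dict: dict) -> None:
    """Training blocks only for this memory copy, not for storage I/O."""
    snapshot = {
        k: v.detach().cpu().clone() if torch.is_tensor(v) else copy.deepcopy(v)
        for k, v in state_dict.items()
    }
    _persist_q.put((step, snapshot))
```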
Examples:
- An example of training nanoGPT using DeepSpeed.
- An example of saving/loading a sharded FSDP checkpoint.
Release 0.2.2
ElasticJob
Features:
- `dlrover-run` can run any distributed job with `NODE_RANK` and `DLROVER_MASTER_ADDR` set in the environment (see the sketch after this list).
- DLRover can asynchronously save the checkpoint to storage, blocking training for only a short time.
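
A sketch of the environment contract this describes. The variable names come from the release note itself, while the `host:port` address format and the torchrun-style flags are assumptions:

```python
import os
import subprocess

env = dict(os.environ)
env["NODE_RANK"] = "0"                        # this node's rank in the job
env["DLROVER_MASTER_ADDR"] = "10.0.0.1:8000"  # assumed host:port format

# dlrover-run is a torchrun-compatible launcher; the flags below are
# torchrun-style and assumed to apply here.
subprocess.run(
    ["dlrover-run", "--nnodes=2", "--nproc_per_node=8", "train.py"],
    env=env,
    check=True,
)
```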
BugFix:
- Fix the bug in loading the FSDP checkpoint.