v0.25.0
What's New
1. Torch 2.4.1 Compatibility (#3609)
We've added support for torch 2.4.1, including necessary patches to Torch.
Deprecations and breaking changes
1. Microbatch device movement (#3567)
Instead of moving the entire batch to device at once, we now move each microbatch to device. This saves memory for large inputs, e.g. multimodal data, when training with many microbatches.
This change may affect callbacks that operate on the batch and require it to already be on an accelerator, such as the two changed in this PR. We expect few such callbacks to exist, so we anticipate this change will be relatively safe.
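The idea behind per-microbatch device movement can be sketched as follows. This is a minimal illustration, not Composer's actual implementation: `iter_microbatches` and `FakeTensor` are hypothetical names, and `FakeTensor` is only a stand-in for anything that supports slicing and `.to(device)` the way `torch.Tensor` does, so the example runs without a GPU or PyTorch installed.

```python
from typing import Dict, Iterator


class FakeTensor:
    """Hypothetical stand-in for torch.Tensor: supports len, slicing, and .to()."""

    def __init__(self, data, device: str = "cpu"):
        self.data = list(data)
        self.device = device

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, index: slice) -> "FakeTensor":
        # Slicing keeps the result on the same device as the source.
        return FakeTensor(self.data[index], self.device)

    def to(self, device: str) -> "FakeTensor":
        # Return a copy on the target device; the original is left untouched.
        return FakeTensor(self.data, device)


def iter_microbatches(
    batch: Dict[str, FakeTensor], microbatch_size: int, device: str
) -> Iterator[Dict[str, FakeTensor]]:
    """Yield microbatches one at a time, moving each to `device` only when needed.

    The full batch stays where it was loaded (typically CPU), so at most one
    microbatch occupies accelerator memory per step of the loop.
    """
    num_samples = len(next(iter(batch.values())))
    for start in range(0, num_samples, microbatch_size):
        end = start + microbatch_size
        yield {key: value[start:end].to(device) for key, value in batch.items()}


batch = {"inputs": FakeTensor(range(8)), "labels": FakeTensor(range(8))}
for microbatch in iter_microbatches(batch, microbatch_size=3, device="cuda:0"):
    # A forward/backward pass on the microbatch would go here.
    print(microbatch["inputs"].device, microbatch["inputs"].data)
```

Under this scheme, code that reads the full batch before the microbatch loop sees it on its original device, which is why a callback that needs accelerator-resident data may have to move the batch itself.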
2. DeepSpeed deprecation version (#3634)
We have updated the Composer version in which we will remove DeepSpeed support to 0.27.0. Please reach out on GitHub if you have any concerns about this.
3. PyTorch legacy sharded checkpoint format
PyTorch briefly used a sharded checkpoint format that differs from the current one and quickly deprecated it. We have continued to support loading legacy-format checkpoints, but we will likely remove that support entirely in an upcoming release. Support for saving in this format was removed in #2262; the feature was originally added in #1902. Please reach out if you have concerns or need help converting your checkpoints to the new format.
What's Changed
- Set dev version back to 0.25.0.dev0 by @snarayan21 in #3582
- Microbatch Device Movement by @mvpatel2000 in #3567
- Init Dist Default None by @mvpatel2000 in #3585
- Explicit None Check in get_device by @mvpatel2000 in #3586
- Update protobuf requirement from <5.28 to <5.29 by @dependabot in #3591
- Bump databricks-sdk from 0.30.0 to 0.31.1 by @dependabot in #3592
- Update ci-testing to 0.2.2 by @dakinggg in #3590
- Bump Mellanox Tools by @mvpatel2000 in #3597
- Roll back ci-testing for daillies by @mvpatel2000 in #3598
- Revert driver changes by @mvpatel2000 in #3599
- Remove step in log_image for MLFlow by @mvpatel2000 in #3601
- Reduce system metrics logging frequency by @chenmoneygithub in #3604
- Bump databricks-sdk from 0.31.1 to 0.32.0 by @dependabot in #3608
- torch2.4.1 by @bigning in #3609
- Test with torch2.4.1 image by @bigning in #3610
- fix 2.4.1 test by @bigning in #3612
- Remove tensor option for _global_exception_occured by @irenedea in #3611
- Update error message for overwrite to be more user friendly by @mvpatel2000 in #3619
- Update wandb requirement from <0.18,>=0.13.2 to >=0.13.2,<0.19 by @dependabot in #3615
- Fix RNG key checking by @dakinggg in #3623
- Update datasets requirement from <3,>=2.4 to >=2.4,<4 by @dependabot in #3626
- Disable exceptions for MosaicML Logger by @mvpatel2000 in #3627
- Fix CPU dailies by @mvpatel2000 in #3628
- fix 2.4.1ckpt by @bigning in #3629
- More checkpoint debug logs by @mvpatel2000 in #3632
- Lower DeepSpeed deprecation version by @mvpatel2000 in #3634
- Bump version 25 by @dakinggg in #3633
Full Changelog: v0.24.1...v0.25.0