
[cp] apply fsdp to model when CP is enabled without DP for correct loss and lower mem usage #685

Merged: 22 commits merged into main on Dec 11, 2024

Conversation

@XilunWu (Contributor) commented Nov 20, 2024

Stack from ghstack (oldest at bottom):

**Summary**
Previously, when CP was enabled without DP, the model was not sharded via `apply_fsdp`. This led to high peak memory usage and diverging loss. A hedged sketch of the intended sharding is included after the test steps below.

**Test**
1. Modify `train_configs/llama3_8b.toml`:
```
steps = 20
context_parallel_degree = 8
```
2. Run training on 8xH100 GPUs:
`CONFIG_FILE="./train_configs/llama3_8b.toml" NGPU=8 LOG_RANK=0,1,2,3,4,5,6,7 ./run_llama_train.sh`
Before: CUDA OutOfMemory
After: training completes all 20 steps successfully
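
For illustration, here is a minimal, hedged sketch of the kind of sharding this fix enables. It is not torchtitan's actual `apply_fsdp` code: the `shard_model_for_cp` helper, the mesh dim names, the dtypes, and the assumption that the model exposes an iterable `model.layers` are all illustrative. It uses PyTorch's FSDP2 `fully_shard` and the private `DeviceMesh._flatten()` API to shard parameters over the combined `dp_shard x cp` mesh even when DP is disabled.

```
# Hedged sketch (not torchtitan's apply_fsdp): shard the model over the
# flattened dp_shard x cp mesh so CP-without-DP still gets FSDP sharding
# and gradients are reduced across CP ranks.
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._composable.fsdp import MixedPrecisionPolicy, fully_shard


def shard_model_for_cp(model: torch.nn.Module, dp_shard: int, cp: int) -> torch.nn.Module:
    # Build the mesh even if dp_shard == 1; the flattened "dp_shard_cp" dim is
    # then effectively just the CP dim, but FSDP still shards params over it.
    mesh = init_device_mesh(
        "cuda", (dp_shard, cp), mesh_dim_names=("dp_shard", "cp")
    )
    dp_shard_cp = mesh["dp_shard", "cp"]._flatten(mesh_dim_name="dp_shard_cp")

    mp_policy = MixedPrecisionPolicy(
        param_dtype=torch.bfloat16, reduce_dtype=torch.float32
    )
    for block in model.layers:  # assumes an iterable of transformer blocks
        fully_shard(block, mesh=dp_shard_cp, mp_policy=mp_policy)
    fully_shard(model, mesh=dp_shard_cp, mp_policy=mp_policy)
    return model
```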

@XilunWu changed the base branch from gh/XilunWu/12/base to main on November 20, 2024 01:31
XilunWu added a commit that referenced this pull request Nov 29, 2024
@XilunWu requested review from tianyu-l and fegin on December 4, 2024 01:36
@tianyu-l (Contributor) left a comment

I wonder if we should take a more systematic approach to creating and naming sub-meshes in `parallel_dims.py`. There are two layers of sub-meshes: atomic ones and derived ones.

  1. The atomic ones are at minimal granularity: `dp_shard`, `dp_replicate`, `tp`, `pp`, `cp`.
  2. The derived ones include `dp` (for data loading) and `dp_shard_cp` (for FSDP param sharding).

We can create or skip the atomic ones first, and then always create the derived ones. E.g., if `dp_shard` is enabled but `dp_replicate` and `cp` are not, we still create `dp` and `dp_shard_cp` by flattening `dp_shard` alone.

This way the code would be simpler and more readable, and the mesh creation in `parallel_dims` would be less confusing.
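
As a rough illustration of this proposal (not torchtitan's actual `parallel_dims.py`), the sketch below builds the world mesh from the atomic dims and then always derives `dp` and `dp_shard_cp` by flattening. For brevity it keeps degree-1 dims in the world mesh rather than skipping them, and it relies on the private `DeviceMesh._flatten()` API mentioned in the follow-up PR #720; the `build_meshes` name is illustrative.

```
# Hedged sketch of the atomic-then-derived mesh construction described above.
from torch.distributed.device_mesh import init_device_mesh


def build_meshes(pp: int, dp_replicate: int, dp_shard: int, cp: int, tp: int):
    # Atomic dims (degree-1 dims kept for simplicity; real code may skip them).
    world_mesh = init_device_mesh(
        "cuda",
        (pp, dp_replicate, dp_shard, cp, tp),
        mesh_dim_names=("pp", "dp_replicate", "dp_shard", "cp", "tp"),
    )
    # Derived dims, always created by flattening (flattening a single dim is
    # just an alias):
    #   "dp"          -> consumed by the data loader
    #   "dp_shard_cp" -> consumed by FSDP for parameter sharding
    dp = world_mesh["dp_replicate", "dp_shard"]._flatten(mesh_dim_name="dp")
    dp_shard_cp = world_mesh["dp_shard", "cp"]._flatten(
        mesh_dim_name="dp_shard_cp"
    )
    return world_mesh, dp, dp_shard_cp
```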

@fegin (Contributor) commented Dec 4, 2024

After looking into the table, I feel we should use `dp_shard_cp` for the `fully_shard` purpose and leave `dp` for the data loader. This would be more consistent.

I also think @tianyu-l's comment describes a good approach, but I'm not sure whether it becomes more complicated once MoE is involved.

@tianyu-l (Contributor) commented Dec 4, 2024

> But I'm not sure if this will become more complicated when MoE is involved.

@fegin I feel the idea could carry over to any new parallelisms: always initialize the lowest-level dimensions, and then gradually build up, with carefully chosen names.

@XilunWu requested a review from tianyu-l on December 5, 2024 01:03
@tianyu-l (Contributor) left a comment

lgtm!

XilunWu added a commit that referenced this pull request Dec 11, 2024
…ent (#720)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #684
* #685
* __->__ #720

**Summary**
This PR improves the design of the DeviceMesh hierarchy in torchtitan. All device meshes other than `world_mesh` now fall into 2 categories:
1. Basic meshes: meshes defined by users in the job `.toml` file. These include `pp` (`pipeline_parallel_degree`), `dp_replicate` (`data_parallel_replicate_degree`), `dp_shard` (`data_parallel_shard_degree`), `tp` (`tensor_parallel_degree`), and `cp` (`context_parallel_degree`).
2. Synthesized (or "derived") meshes: meshes synthesized from basic meshes via `_flatten()`. If a mesh is synthesized from a single basic mesh, this is equivalent to aliasing. So far we use 2 synthesized meshes: `dp` and `dp_shard_cp`. The `dp` mesh is used for data loading and the `dp_shard_cp` mesh is used for model parameter sharding.

**Test**
CI
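
For context, here is a hedged sketch of how the two synthesized meshes might be consumed. The `wire_up`, `build_dataloader`, and `shard_fn` names are placeholders rather than torchtitan APIs, and it assumes the flattened dims can be sliced back out of `world_mesh` by name (as in the sketch earlier in this thread).

```
# Illustrative only: how the two derived meshes are consumed.
def wire_up(world_mesh, model, build_dataloader, shard_fn):
    dp_mesh = world_mesh["dp"]                    # data loading
    dp_shard_cp_mesh = world_mesh["dp_shard_cp"]  # FSDP param sharding

    dataloader = build_dataloader(
        dp_rank=dp_mesh.get_local_rank(),
        dp_degree=dp_mesh.size(),
    )
    shard_fn(model, mesh=dp_shard_cp_mesh)  # e.g. fully_shard per block
    return dataloader, model
```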
@XilunWu merged commit 40a0873 into main on Dec 11, 2024
2 of 5 checks passed
Labels: CLA Signed (managed by the Meta Open Source bot)
4 participants