enable Context Parallel #592

Merged: 25 commits, Oct 23, 2024
Conversation

@XilunWu (Contributor) commented on Sep 30, 2024

Stack from ghstack (oldest at bottom):

**Summary**
This PR adds the DTensor Context Parallel (pytorch/pytorch#131351) option to torchtitan.

**Instructions**
Find the `context_parallel_degree` option in the `experimental` section of the `.toml` config file and assign the desired Context Parallel degree (i.e., the number of GPUs along the Context Parallel dimension), e.g. `context_parallel_degree = 2` under `[experimental]`.
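For illustration, here is a rough sketch of what that degree translates to under the hood, based on the experimental DTensor CP API from pytorch/pytorch#131351. The mesh construction, argument names, and buffer choices below are my assumptions for a minimal example rather than torchtitan's exact wiring, and the experimental API may change across PyTorch versions.

```python
# Sketch only (assumptions noted above): map a configured CP degree onto a device
# mesh and run forward/backward under the experimental DTensor CP context manager.
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.experimental import context_parallel  # experimental API

dp_degree, cp_degree = 4, 2  # e.g. 8 GPUs split as DP=4, CP=2 (illustrative)
world_mesh = init_device_mesh(
    "cuda", (dp_degree, cp_degree), mesh_dim_names=("dp", "cp")
)
cp_mesh = world_mesh["cp"]

def train_step(model, inputs, labels, freqs_cis, loss_fn):
    # Every buffer with a sequence dimension gets sharded across the CP ranks.
    with context_parallel(
        cp_mesh,
        buffers=[inputs, labels, freqs_cis],
        buffer_seq_dims=[1, 1, 0],  # which dim of each buffer is the sequence dim
        no_restore_buffers={inputs, labels},
    ):
        loss = loss_fn(model(inputs), labels)
        loss.backward()
    return loss
```

This assumes a distributed launch, e.g. torchrun with 8 processes.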

**Benchmark Settings**
The benchmark uses `train_configs/llama3_8b.toml` with the following extra changes:

- `norm_type = "fused_rmsnorm"`
- `mode = "full"` (full activation checkpointing lets us find the longest possible input sequence length on a device)

**Benchmark Results**

1. *max seq_len scalability*
The first set of benchmarks shows that with CP, adding more GPUs along the CP dimension allows a longer input seq_len, and this growth scales linearly. We use FSDP to shard the model, fix `data_parallel_shard_degree` to 8, and use 8, 16, 32, and 64 H100 GPUs (i.e. CP degree = 1, 2, 4, 8) to measure the longest possible input seq_len without GPU OOM. We adjust the CP degree by modifying `context_parallel_degree` in the `experimental` section of the `.toml` file.
<img width="600" alt="image" src="https://github.com/user-attachments/assets/03f68062-3368-4ea5-a324-9d2346806af3">

*As the Context Parallel degree goes up (1, 2, 4, 8), the max seq_len also increases (32k, 80k, 144k, 300k).*

<img width="597" alt="image" src="https://github.com/user-attachments/assets/b48078d3-ab4d-40dc-97cf-2d477b713e00">

*While increasing the Context Parallel degree increases the max seq_len, it also decreases the WPS on each device. This is the price paid for longer inputs.*

<img width="593" alt="image" src="https://github.com/user-attachments/assets/edef69ef-f0f0-41ec-9114-f7c420929064">

*However, increasing the CP degree doesn't significantly affect MFU.*

2. *loss curve convergence*
The second set of benchmarks shows that our implementation is correct: the loss curve converges just as with FSDP + TP. This experiment uses 64 H100 GPUs and fixes the TP degree and local batch size to 8 (`tensor_parallel_degree=8`, `batch_size=8`). As we increase the CP degree, we decrease the DP degree so that `CP * DP = 8`. We also scale the sequence length such that `seq_len = 8192 * CP`.

<img width="1061" alt="image" src="https://github.com/user-attachments/assets/8a25e0b1-cbd4-46d5-a444-8dd6f5277227">

*The training loss curves match among the 4 combinations of parallelisms: DP=8; DP=4, CP=2; DP=2, CP=4; CP=8 (TP is fixed to 8).*

<img width="2292" alt="Screenshot 2024-10-07 at 12 42 37 PM" src="https://github.com/user-attachments/assets/74a62cc8-562b-4441-babf-d601ec9bf375">

*The only observed difference is in the warm-up steps. This comes from our choice of the total number of steps; picking an appropriate number of warm-up steps should smooth this out.*

3. *max seq_len vs. local WPS on a fixed number of GPUs*
The third set of benchmarks gives users a guide on what CP degree to choose and what WPS to expect on a specific set of devices (e.g. number of devices, device type, etc.). Here we use H100s as an example, but we expect similar trade-off curves on other hardware.

<img width="830" alt="image" src="https://github.com/user-attachments/assets/b1ea2320-e0ae-439f-9434-4194bd4932a0">

*Increasing the CP degree allows a longer input sequence length at the cost of reducing the WPS on each device. The exception is pure CP: the seq_len increase is very limited compared to DP=2, CP=4 on 8 GPUs. This is due to our implementation, and we're working on enabling performant CP to support ultra-long input sequences.*

TODO:

1. An option to choose the CP tensor rotation implementation (all-to-all vs. all-gather) will be added once pytorch/pytorch#132820 ([CP] Implement AllGather based context parallelism) is landed.
2. Investigate the limitation of pure CP, to enable ultra-long input sequences.

[ghstack-poisoned]
XilunWu added a commit that referenced this pull request Sep 30, 2024
ghstack-source-id: b76e0d183826dad8c4c76426fe62abaf9ad43f2f
Pull Request resolved: #592
@facebook-github-bot added the "CLA Signed" label on Sep 30, 2024 (this label is managed by the Meta Open Source bot).
@XilunWu XilunWu marked this pull request as draft September 30, 2024 20:09
@XilunWu XilunWu requested a review from fegin September 30, 2024 20:21
XilunWu added a commit that referenced this pull request Sep 30, 2024
ghstack-source-id: 90f1bde378561c9bd1dee3ac82990f9d91ba59ab
Pull Request resolved: #592
-    # (use 2x max sequence length to be safe)
-    self.model_args.max_seq_len * 2,
+    # Note: removed the 2x relaxing in CP enablement
+    self.model_args.max_seq_len,
Contributor:

cc @tianyu-l, I want to understand whether this is okay.

For a general use case, we can also expand CP to support a stride-like feature.

Contributor:

Could you please elaborate a bit on why this change was needed by CP?

Contributor:

@tianyu-l CP parallelizes over the sequence dimension, so anything related to the sequence dimension needs to be sharded. `freqs_cis` is the positional embedding and must be sharded according to the sequence length, so it is easier to support CP if everything has the same sequence length.
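To make this concrete, here is a tiny sketch, in DTensor terms, of what sharding `freqs_cis` along the sequence dimension looks like. The shape and the `distribute_tensor`/`Shard` usage are illustrative assumptions, not the exact code path this PR adds:

```python
# Sketch only: freqs_cis is indexed by sequence position, so under CP it must be
# sharded on its sequence dim (dim 0) across the cp mesh, like the activations.
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

cp_mesh = init_device_mesh("cuda", (2,), mesh_dim_names=("cp",))
freqs_cis = torch.randn(8192, 64)  # illustrative (max_seq_len, dim) shape
freqs_cis_sharded = distribute_tensor(freqs_cis, cp_mesh, [Shard(0)])
# Each CP rank now holds a (8192 // 2, 64) local shard that lines up with its local
# slice of the input sequence, which is why every sequence-dependent tensor needs
# to be built with the same max sequence length.
```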

Contributor:

Sounds reasonable to me. @awgu to confirm this is OK.

Also, we need to add a note in docs/composability.md to clarify why this model change is needed. It can be addressed in a separate PR; in that case, please create an issue / leave a TODO.

@fegin fegin mentioned this pull request Oct 1, 2024
XilunWu added a commit that referenced this pull request Oct 3, 2024
ghstack-source-id: f9bc24ff92c0abce98dc3a0f847fc874fa77788c
Pull Request resolved: #592
XilunWu added a commit that referenced this pull request Oct 3, 2024
ghstack-source-id: 51288d0a142c839291d6035e6dddcc915e5e5a08
Pull Request resolved: #592
XilunWu added a commit that referenced this pull request Oct 4, 2024
ghstack-source-id: 6126585b13e49131e8b2d9e05a5ef1f736a0c4d9
Pull Request resolved: #592
@XilunWu XilunWu mentioned this pull request Oct 21, 2024
XilunWu added a commit that referenced this pull request Oct 21, 2024
ghstack-source-id: 9107fbfa09b4cf858ae4943ce9cb8180b28e5ea8
Pull Request resolved: #592
XilunWu added a commit that referenced this pull request Oct 21, 2024
ghstack-source-id: a0832f24bf6cfacb5e74dcdc3bca3fb58caca4aa
Pull Request resolved: #592
@XilunWu XilunWu marked this pull request as ready for review October 21, 2024 09:18
**Summary** (WIP)
This PR adds the DTensor Context Parallel (pytorch/pytorch#131351) option to torchtitan.

TODO:
1. Add seq_len scalability, loss convergence, and WPS performance benchmarks.
2. An option to choose the CP tensor rotation implementation (all-to-all vs. all-gather) will be added once pytorch/pytorch#132820 is landed.

[ghstack-poisoned]
@tianyu-l (Contributor) left a comment:

Looks awesome! Still have a few questions, but I think it's mostly ready.

"FSDP+TP+CP",
"fsdp+tp+cp",
ngpu=8,
),
Contributor:

Question: Looks like FSDP/HSDP + TP + CP is working. How about PP? We can also mention progress in the .md doc later.

Contributor (author):

Right, the next step is to test 4D/5D (with PP and HSDP).

else ("dp",)
)
dp_mesh = (
world_mesh[(*dp_mesh_dim_names, "cp")]._flatten("dp_cp")
Contributor:

`_flatten` seems to be called twice; there is another call in parallel_dims.py. I wonder how this API works?

else ("dp",)
)
dp_mesh = (
world_mesh[(*dp_mesh_dim_names, "cp")]._flatten("dp_cp")
Contributor:

Another question: given the current implementation, does it mean DP and CP have to be adjacent to each other in the device mesh? E.g. it seems we can't do [TP, CP, PP, DP] (from inner to outer) as in the Llama 3.1 paper. Is that correct?

@XilunWu (Contributor, author) commented on Oct 22, 2024:

@awgu Thanks for catching it! I had pasted a zoomed-in screenshot of the warm-up steps instead of the 5000-step loss curve. Updated.

@tianyu-l (Contributor):

> While increasing Context Parallel degree increases the max seq_len, it also decreases the WPS on each device. This is the price paid for longer input. However, increasing CP degree doesn't significantly affect MFU.

By looking into how tokens/sec and MFU change as seq_len increases, here's my explanation of this phenomenon.

There are two parts to the FLOPs computation (see https://github.com/pytorch/torchtitan/blob/main/torchtitan/utils.py#L150):

- (I) matmul, which scales linearly with the number of tokens
- (II) attention, which scales quadratically with the number of tokens

When doubling seq_len together with the CP degree (which also doubles the number of GPUs, since the other degrees are fixed):

- (I) matmul computation doubles; (II) attention computation quadruples.
- Even if CP doesn't bring any overhead (e.g. extra exposed comm), we shouldn't expect the same WPS per GPU: the compute resources double, but the computation load grows between 2x and 4x. Assuming perfect scaling, the throughput per GPU should land between 0.5x and 1x.
- The MFU change depends on the actual config, specifically on the ratio between (I) and (II). If (I) dominates, the MFU/WPS ratio wouldn't change as we increase seq_len; if (II) dominates, the MFU/WPS ratio would increase linearly with seq_len. For 8B and seq_len around 32k/64k, (II) is larger than (I), which is why we see WPS drop while MFU holds (because the ratio increases).

My conclusion is that the perf scaling we observed makes sense and looks good to me.
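A quick back-of-the-envelope sketch of this reasoning. The per-token FLOPs split below uses the common 6·P + 12·L·H·D_h·T approximation, and the Llama-3-8B-like dimensions are illustrative assumptions rather than the exact config:

```python
# Sketch only: compare the linear (matmul) and quadratic (attention) parts of the
# per-token FLOPs as seq_len doubles, using illustrative 8B-like dimensions.
def flop_per_token(num_params, n_layers, n_heads, head_dim, seq_len):
    matmul = 6 * num_params                                   # part (I), linear in tokens
    attention = 12 * n_layers * n_heads * head_dim * seq_len  # part (II), grows with seq_len
    return matmul, attention

for seq_len in (32_768, 65_536):
    m, a = flop_per_token(8e9, 32, 32, 128, seq_len)
    print(f"seq_len={seq_len}: matmul={m:.2e}, attention={a:.2e}, ratio={a / m:.2f}")
# At ~32k the attention part is already comparable to the matmul part, and at ~64k
# it dominates, which is consistent with WPS dropping while MFU holds as CP degree
# and seq_len scale together.
```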

XilunWu added a commit that referenced this pull request Oct 22, 2024
ghstack-source-id: 5a8900c61f5b735aee005aefe95adda0cd678144
Pull Request resolved: #592
@XilunWu XilunWu changed the base branch from gh/XilunWu/6/base to main October 22, 2024 22:02
XilunWu added a commit that referenced this pull request Oct 22, 2024
ghstack-source-id: a0ad832fe452f1cc35c37139f498a82c4bbeeae0
Pull Request resolved: #592
@XilunWu XilunWu requested a review from tianyu-l October 22, 2024 22:44
@tianyu-l (Contributor) left a comment:

LGTM! Thanks!

@XilunWu (Contributor, author) commented on Oct 22, 2024:

@tianyu-l Actually, the 8-GPU test has a failure. This is the HSDP + CP case. I think the flatten logic has a bug when handling `mesh["dp_cp"]`.

Flattening `mesh["dp_replicate", "dp_shard", "cp"]` into `mesh["dp_cp"]` is a workaround without actual cost. I think we can land this first with a TODO and put the `__getitem__` call back once it's fixed in DeviceMesh.
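For reference, a minimal sketch of the two access patterns being discussed: the `_flatten` workaround used in this PR versus direct access to the flattened mesh. Degrees and dim names are illustrative for an 8-GPU HSDP + CP layout; running it requires a distributed launch:

```python
# Sketch only: the flatten workaround vs. direct flattened-mesh access.
from torch.distributed.device_mesh import init_device_mesh

world_mesh = init_device_mesh(
    "cuda",
    (2, 2, 2),  # illustrative: dp_replicate x dp_shard x cp on 8 GPUs
    mesh_dim_names=("dp_replicate", "dp_shard", "cp"),
)

# Workaround used in this PR: explicitly flatten the three dims into "dp_cp" and
# use the returned mesh (e.g. for data loading and loss all-reduce).
dp_cp_mesh = world_mesh["dp_replicate", "dp_shard", "cp"]._flatten("dp_cp")

# Direct access to the flattened mesh, to switch back to once DeviceMesh handles
# flattened meshes built from more than two dims:
# dp_cp_mesh = world_mesh["dp_cp"]
```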

XilunWu added a commit that referenced this pull request Oct 22, 2024
ghstack-source-id: 584d30016c4598e9a64595c3c2cb227f88de9b00
Pull Request resolved: #592
@XilunWu XilunWu merged commit b19456a into main Oct 23, 2024
5 checks passed
XilunWu added a commit that referenced this pull request Oct 31, 2024
… with mesh access"


**Summary**
pytorch/pytorch#138945 fixes DeviceMesh access on flattened meshes that are constructed from more than 2 meshes. Refer to the fix PR for details if interested.

In #592 we avoided this issue by calling `_flatten` instead of directly accessing the flattened mesh. We want to switch back to mesh access, which is more straightforward, now that the fix has been merged in PyTorch.


[ghstack-poisoned]
XilunWu added a commit that referenced this pull request Oct 31, 2024
XilunWu added a commit that referenced this pull request Oct 31, 2024
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #666
XilunWu added a commit that referenced this pull request Oct 31, 2024
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #667

Note: This PR is a reland of #666, where the PR was mistakenly merged into the wrong branch.
mori360 pushed a commit to mori360/torchtitan that referenced this pull request Nov 26, 2024
from torch.nn.attention import sdpa_kernel, SDPBackend

# currently we only support these two SDP backends.
# TODO (xilunwu): support cuDNN backend
@tmm1:

Just curious if you recall what the blocker for CUDNN_ATTENTION is.

Contributor (author):

Hi @tmm1, it's simply that cuDNN attention has a different op signature. I'm adding support now and should have the PR draft out by next week.
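For context, a minimal sketch of restricting SDPA to the two backends mentioned in the diff above. Whether the real code uses a `with` block exactly like this is an assumption on my part:

```python
# Sketch only: run scaled_dot_product_attention with the backend set limited to
# flash and efficient attention; cuDNN attention is excluded because its op
# signature differs (per the discussion above).
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16)

with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```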
