
Renwuli/disable te qwen #26

Open

wants to merge 3,583 commits into base: rocm_dev

Conversation

amdrenwuli (Member)
This PR adds a new branch that supports Qwen1.0 with TE (Transformer Engine) disabled.

mikolajblaz and others added 30 commits December 20, 2023 10:59
Fix for newer apex version

See merge request ADLR/megatron-lm!1021
Signed-off-by: Selvaraj Anandaraj <[email protected]>
Add assert for overlap_param_gather

See merge request ADLR/megatron-lm!1029
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Added manifest

See merge request ADLR/megatron-lm!1035
Fix checkpointing with TransformerEngine

See merge request ADLR/megatron-lm!1038
Moved Dataloader processing to the CPU for pin memory usage/ Added a custom torch.split BProp implementation

See merge request ADLR/megatron-lm!911
Save checkpoint whenever batch size ramps up

See merge request ADLR/megatron-lm!1034
Signed-off-by: Selvaraj Anandaraj <[email protected]>
ericharper and others added 28 commits January 29, 2024 20:28
Fix `qkv_format` in TEDotProductAttention

See merge request ADLR/megatron-lm!1078
Add support for masked WordPiece datasets BERT and T5

See merge request ADLR/megatron-lm!1041
Distributed checkpointing implementation for MoE

See merge request ADLR/megatron-lm!1055
# Conflicts:
#   pretrain_bert.py
Fix the case when no token is allocated to local expert(s) with EP > 1.

See merge request ADLR/megatron-lm!1063
Generate causal mask for local layer spec

See merge request ADLR/megatron-lm!1047
Update minor version

See merge request ADLR/megatron-lm!1086
Adding bert local spec test

See merge request ADLR/megatron-lm!1072
Feature/Add E2E metrics logging

See merge request ADLR/megatron-lm!1049
JET Migration Updates

See merge request ADLR/megatron-lm!1066
use TE checkpointing when FP8

See merge request ADLR/megatron-lm!1080
gurpreet-dhami (Collaborator)

@amdrenwuli: Is this PR targeted to be merged into rocm_dev? I see this PR has a different base commit.

Marking as stale. No activity in 60 days.
