Autotp training #6922
Conversation
…-precision version before the rebase, but the grad norm differs (display issue)
@tjruwase @GuanhuaWang Thank you for your review. I've added modifications or explanations. Could you take another look? Thanks!
Hi @inkcherry, thanks for contributing. Just a heads-up: all the all_reduce calls in Domino are supposed to be asynchronous, and the current LinearAllreduce and LinearLayer need to be updated to work with Domino. For example, in LinearAllreduce we'd like to get the handle from the asynchronous all-reduce and synchronize it later to overlap computation (a sketch of that handle pattern is shown below).
The Domino work is still in progress, and it's not finalized yet. So, you don't need to worry about the compatibility with Domino at this point. But one thing you can easily support is the async TP, similar to Megatron here. Maybe it can be your next PR.
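A minimal sketch of the handle pattern, assuming a row-parallel linear whose forward returns the work handle; the class name and return convention here are illustrative, not DeepSpeed's actual LinearAllreduce API:

```python
import torch
import torch.distributed as dist

# Illustrative sketch only: launch the all-reduce asynchronously and hand the
# work handle back to the caller, so independent computation can be overlapped
# before handle.wait() is called.
class AsyncLinearAllreduce(torch.nn.Module):
    def __init__(self, weight, bias=None, group=None):
        super().__init__()
        self.weight = weight          # partitioned along the reduction dimension
        self.bias = bias
        self.group = group

    def forward(self, x):
        out = torch.matmul(x, self.weight.t())
        # async_op=True returns a work handle instead of blocking.
        handle = dist.all_reduce(out, group=self.group, async_op=True)
        return out, handle            # caller overlaps work, then calls handle.wait()

# Usage pattern, e.g. in an overlapped schedule:
#   out, handle = layer(x)
#   ...independent computation...
#   handle.wait()
#   if layer.bias is not None:
#       out = out + layer.bias
```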
Thanks for your help!
Hi @tjruwase, I noticed the CI updated the DCO check. Using the suggested rebase method would reintroduce many conflicts (https://github.com/deepspeedai/DeepSpeed/pull/6922/checks?check_run_id=36694512120), so I opted for a squash merge with sign-off instead. Hope you can take a look, thank you!
Same as [this PR](#6922). [affeb88](affeb88) I noticed the CI updated the DCO check recently. Using the suggested rebase method for sign-off would reintroduce many conflicts, so I opted for a squash merge with sign-off instead. thanks: ) Signed-off-by: inkcherry <[email protected]>
FYI @tjruwase @GuanhuaWang @delock @skyshine102 context: #5445
Changes/support: saving gathered 16-bit weights (`gather_16bit_weights_on_model_save=True` in ds config).
HF trainer dependency:
transformers: https://github.com/inkcherry/transformers/tree/ds_tp
accelerate: https://github.com/inkcherry/accelerate/tree/ds_tp
I can send those PRs once DeepSpeed supports these APIs.
Usage:
Users do not need to modify the client code; they only need to configure the settings in the config file to enable the functionality.
Below is example code for fine-tuning a LLaMA 2 model (SFT). It supports ZeRO-3/FSDP training and enables TP training by simply adjusting the configuration:
https://github.com/inkcherry/stanford_alpaca/commits/tp_demo_1127/
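For reference, a hypothetical sketch of the kind of ds config this enables; the placement of `gather_16bit_weights_on_model_save` and the TP knob names (`tensor_parallel`/`autotp_size`) are assumptions for illustration, not the finalized API:

```python
# Hypothetical DeepSpeed config sketch (key names for the TP knob are assumed).
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 1,                                  # e.g. the zero1+tp combination below
        "gather_16bit_weights_on_model_save": True,  # gather full weights at save time
    },
    "tensor_parallel": {"autotp_size": 2},           # assumed knob for the TP degree
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
}
```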
This branch contains three commits, with the last two commits added for quick experiments and logging purposes.
Results
Loss curves (gbs=16):
zero3 (baseline)
tp (this PR)
zero1 and zero1+tp (ZeRO compatibility)
[Figure: loss curves for the three configurations above]
Performance (for reference only):
zero3 (no acceleration enabled): 18 GB, 2.3 s/it
zero1: 38 GB, 1.30 s/it
zero1+tp: 24 GB, 1.66 s/it
Extension:
I think async-TP/Domino etc. can be implemented by inheriting a class and overriding the fwd/bwd methods; the gather/partition logic can be reused for this (please correct me if I am wrong). A rough sketch of the idea follows below.
Complex sharding can also be achieved through independent partitioning and gathering. Partitioning is mandatory, while gathering is required for training.
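As an illustration of that idea (hypothetical names, not the PR's actual classes): the per-shard math can live in an autograd.Function whose forward/backward a variant strategy overrides or replaces, while the surrounding partition/gather logic stays the same.

```python
import torch
import torch.distributed as dist

# Illustrative only: Megatron-style column-parallel linear expressed as an
# autograd.Function. An async-TP/Domino-style variant could override forward/
# backward (e.g. making the backward all-reduce asynchronous) and reuse the
# same weight partitioning.
class ColumnParallelLinearFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight, group):
        ctx.save_for_backward(x, weight)
        ctx.group = group
        # each rank computes its own output shard; weight is (out_shard, in)
        return torch.matmul(x, weight.t())

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        grad_x = torch.matmul(grad_out, weight)
        # the input is replicated across TP ranks, so its gradient must be summed
        dist.all_reduce(grad_x, group=ctx.group)
        grad_w = torch.matmul(grad_out.flatten(0, -2).t(), x.flatten(0, -2))
        return grad_x, grad_w, None
```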
TODO:
embedding vocab parallel
Currently, embedding parallelism is primarily hidden-dim parallelism combined with allreduce. This takes advantage of efficient reduction kernels, and it is not forced on. In training, however, the more common method is vocab parallelism; enabling it by default could save a certain amount of GPU memory. A sketch of the vocab-parallel idea is shown below.
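For context, a minimal sketch of what Megatron-style vocab-parallel embedding typically looks like; this is not the PR's implementation, and the names and shard layout are illustrative:

```python
import torch
import torch.distributed as dist

# Illustrative sketch: each rank owns a contiguous vocab shard, tokens owned by
# other ranks contribute zeros, and an all-reduce sums the partial embeddings.
class VocabParallelEmbedding(torch.nn.Module):
    def __init__(self, vocab_size, hidden_dim, group=None):
        super().__init__()
        self.group = group
        world = dist.get_world_size(group)
        rank = dist.get_rank(group)
        shard = vocab_size // world          # assumes vocab_size divisible by world
        self.start, self.end = rank * shard, (rank + 1) * shard
        self.weight = torch.nn.Parameter(torch.empty(shard, hidden_dim))
        torch.nn.init.normal_(self.weight, std=0.02)

    def forward(self, input_ids):
        mask = (input_ids < self.start) | (input_ids >= self.end)
        local_ids = (input_ids - self.start).clamp(0, self.weight.shape[0] - 1)
        out = torch.nn.functional.embedding(local_ids, self.weight)
        out[mask] = 0.0                      # zero rows for tokens on other ranks
        dist.all_reduce(out, group=self.group)  # sum partials into full embeddings
        return out
```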
Thanks to @delock for the guidance.
I also verified inference with CPU-inference workloads (Optimized Model List in https://github.com/intel/intel-extension-for-pytorch/tree/main).
Many thanks to @xuguangxin, @ikurtchen, @rogerxfeng8, @Yejing-Lai, @ys950902, etc., for helping review and address matters related to inference.