
[Auto Parallel] Update Gradient Synchronization in Static Mode #59057

Merged 13 commits into PaddlePaddle:develop on Dec 4, 2023

Conversation

JZ-LIANG (Contributor) commented Nov 16, 2023

PR types

Function optimization

PR changes

Others

Description

Pcard-76459

Different frameworks handle gradient synchronization with different mechanisms: DDP, FSDP, DTensor, Extra_sync_hook_for_sp, etc.

Before this PR, the gradient synchronization mechanism in AutoParallel static mode was problematic.
It only took the "Batch" dimension of the input tensor into account (hard-coded as the first dimension) and used a hard rule to conduct the synchronization:

If the "Batch" dimension of input activation is "sharded" on rank_groupA, the gradient of parameter need to be synchronized across rank_groupA in backward phase.
The above mechanism works OK when "Batch Dimension" is the only broadcast dimension that would be sharded (narrow-sense Data Parallel).

BUT it fails for Sequence Parallel (SP), Context Parallel (CP), and other, more general forms of "Data Parallel" where other broadcast dimensions of the input tensor are sharded. In those cases the required parameter gradient synchronization is missing, because the framework only considers the "batch" axis and ignores the other broadcast axes, as the sketch below illustrates.
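Below is a minimal, framework-agnostic sketch (NumPy only, not Paddle internals) of why sharding a non-batch broadcast dimension, such as the sequence axis, leaves each rank with only a partial weight gradient. The shapes and names are illustrative assumptions.

```python
import numpy as np

np.random.seed(0)
batch, seq, hidden, out = 2, 8, 4, 3
x = np.random.randn(batch, seq, hidden)       # input activation
grad_y = np.random.randn(batch, seq, out)     # upstream gradient of the matmul output

# Logical weight gradient for a parameter w of shape (hidden, out):
# reduce over both the batch and the sequence axes.
grad_w_full = np.einsum("bsh,bso->ho", x, grad_y)

# Shard the *sequence* axis (a broadcast axis for w) across 2 ranks,
# as Sequence/Context Parallel would do.
grad_w_rank0 = np.einsum("bsh,bso->ho", x[:, :4], grad_y[:, :4])
grad_w_rank1 = np.einsum("bsh,bso->ho", x[:, 4:], grad_y[:, 4:])

# Each rank holds only a partial sum; without an allreduce over the
# sequence-parallel group, the parameter update on every rank is wrong.
assert not np.allclose(grad_w_rank0, grad_w_full)
assert np.allclose(grad_w_rank0 + grad_w_rank1, grad_w_full)
```

The old rule never triggers here because the batch axis itself is not sharded, which is exactly the missing synchronization described above.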

This PR fixes the problem using a "Partial --> Replicated" mechanism.

Whenever a broadcast dimension of the input tensor is sharded, the generated parameter gradient is in Partial status, meaning each rank holds only a partial contribution to the gradient's logical value.
Before the gradient is fed to the optimizer, a Reshard operation (an allreduce) is performed to convert the gradient tensor from Partial to Replicated.

With this mechanism, no synchronization is missed.
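The following is a minimal sketch of that Partial --> Replicated step expressed with raw collectives in dygraph style for clarity. The helper name sync_partial_grads is hypothetical; in the actual PR the equivalent allreduce is inserted into the static program by the reshard pass rather than called by hand.

```python
import paddle
import paddle.distributed as dist

def sync_partial_grads(parameters, group=None):
    """Allreduce each parameter gradient so every rank ends up with the
    full (Replicated) value instead of its Partial contribution."""
    for p in parameters:
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=group)

# Usage inside a distributed training loop (illustrative):
#   loss.backward()                          # gradients are Partial on each rank
#   sync_partial_grads(model.parameters())   # Partial -> Replicated
#   optimizer.step()                         # optimizer sees the full gradient
```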


@JZ-LIANG JZ-LIANG changed the title [Auto Parallel] Unify Gradient Synchronization Using Partial Mechanism [Auto Parallel] Unify Gradient Synchronization Using Partial->Replicated Mechanism Nov 16, 2023

paddle-ci-bot bot commented Nov 29, 2023

Sorry to inform you that 9535157's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@JZ-LIANG JZ-LIANG changed the title [Auto Parallel] Unify Gradient Synchronization Using Partial->Replicated Mechanism [Auto Parallel] Unify Gradient Synchronization in Static Mode Nov 29, 2023
@JZ-LIANG JZ-LIANG changed the title [Auto Parallel] Unify Gradient Synchronization in Static Mode [Auto Parallel] Unify Gradient Synchronization Mechanism in Static Mode Nov 29, 2023
@JZ-LIANG JZ-LIANG changed the title [Auto Parallel] Unify Gradient Synchronization Mechanism in Static Mode [Auto Parallel] Update Gradient Synchronization in Static Mode Dec 4, 2023
wanghuancoder (Contributor) left a comment:

LGTM

heavyrain-lzy (Contributor) left a comment:

LGTM for SP

@JZ-LIANG JZ-LIANG merged commit 7e5f101 into PaddlePaddle:develop Dec 4, 2023
SigureMo pushed a commit to gouzil/Paddle that referenced this pull request Dec 5, 2023
…ePaddle#59057)

* completion bw partial

* debug

* bugfix

* insert param grad allreduce by partial

* reorder allreduce for opt

* fix typoes

* add grad sync unitest

* sp unitest

* fixed unitest