[Auto Parallel] Update Gradient Synchronization in Static Mode #59057

JZ-LIANG · 2023-11-16T08:42:43Z

PR types

Function optimization

PR changes

Others

Description

Pcard-76459

Many frameworks handle gradient synchronization using different mechanisms: DDP, FSDP, DTensor, Extra_sync_hook_for_sp, etc.

Before this PR, the gradient synchronization mechanism in AutoParallel static mode was problematic.
It only took into account the "Batch Dimension" (hard-coded as the first dimension) of the input tensor and used a hard rule to conduct the synchronization:

If the "Batch" dimension of the input activation is "sharded" on rank_groupA, the gradient of the parameter needs to be synchronized across rank_groupA in the backward phase.
The above mechanism works when the "Batch Dimension" is the only broadcast dimension that is sharded (narrow-sense Data Parallel).
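The hard rule described above can be sketched in a few lines. This is an illustration only, not the code this PR replaces: the function name `old_rule_sync_axes` is hypothetical, and the sketch assumes the input's sharding is given as a per-dimension list where each entry is a mesh axis index, or -1 when that dimension is not sharded.

```python
NOT_SHARDED = -1  # assumed convention: -1 marks a tensor dimension that is not sharded


def old_rule_sync_axes(input_dims_mapping):
    """Mesh axes over which the pre-PR rule would synchronize the parameter gradient.

    Only dimension 0 (the hard-coded "batch" dimension) is inspected;
    every other dimension of the input is ignored.
    """
    batch_axis = input_dims_mapping[0]
    return [] if batch_axis == NOT_SHARDED else [batch_axis]


# Input laid out as [batch, seq, hidden].
batch_sharded = old_rule_sync_axes([0, NOT_SHARDED, NOT_SHARDED])  # [0]
seq_sharded = old_rule_sync_axes([NOT_SHARDED, 0, NOT_SHARDED])    # []
```

With the batch dimension sharded on mesh axis 0, the rule requests synchronization on that axis. With only the sequence dimension sharded, it requests none.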

BUT it fails for Sequence Parallel (SP)/Context Parallel (CP) and other more general cases of "Data Parallel" where other broadcast dimensions of the input tensor are "sharded". The parameter gradient synchronization needed in those cases would be missing, since the framework only considers the "batch" axis but ignores other "broadcast" axes.
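A small numeric sketch may help show why a sharded sequence dimension also needs synchronization. It uses plain Python lists rather than Paddle tensors, and assumes a linear layer `y = x @ W`, whose weight gradient is a sum over every broadcast position (batch and sequence) of the input. The helper `weight_grad` and the toy values are made up for illustration.

```python
def weight_grad(x_rows, dy_rows):
    """dW[h][o] = sum over rows r of x[r][h] * dy[r][o], for y = x @ W."""
    hidden, out = len(x_rows[0]), len(dy_rows[0])
    grad = [[0.0] * out for _ in range(hidden)]
    for x, dy in zip(x_rows, dy_rows):
        for h in range(hidden):
            for o in range(out):
                grad[h][o] += x[h] * dy[o]
    return grad


# One sample (batch size 1), 4 sequence positions, hidden = 2, out = 1.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
dy = [[1.0], [1.0], [1.0], [1.0]]

full_grad = weight_grad(x, dy)

# Shard the sequence dimension across two ranks: each rank sees two positions.
rank0_grad = weight_grad(x[:2], dy[:2])
rank1_grad = weight_grad(x[2:], dy[2:])

# Summing the per-rank gradients recovers the full gradient.
summed = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(rank0_grad, rank1_grad)]
```

Here the batch dimension is not sharded at all, yet each rank holds only part of the weight gradient, so a rule keyed on the batch dimension alone would leave the two ranks with different, incomplete gradients.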

This PR fixes the problem using the "Partial --> Replicated" mechanism.

Whenever a broadcast dimension of the input tensor is sharded, the generated parameter gradient is in Partial status, which indicates that each rank's value of that gradient is only a partial contribution to the actual value in the logical view.
Before the gradient is fed to the optimizer, a Reshard operation (Allreduce) is performed to convert the gradient tensor's status from Partial to Replicated.

As a result, no synchronization is missed.
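The "Partial --> Replicated" idea can be sketched as follows. This is a simulation in plain Python, not Paddle's implementation: `grad_partial_axes` and `reshard_partial_to_replicated` are hypothetical names, the -1 "not sharded" convention is an assumption, and the all-reduce is emulated by summing lists held in one process.

```python
NOT_SHARDED = -1  # assumed convention: -1 marks a tensor dimension that is not sharded


def grad_partial_axes(input_dims_mapping, broadcast_dims):
    """Mesh axes on which the parameter gradient is Partial.

    Any broadcast dimension of the input that is sharded contributes its
    mesh axis, not just dimension 0.
    """
    return sorted({input_dims_mapping[d] for d in broadcast_dims
                   if input_dims_mapping[d] != NOT_SHARDED})


def reshard_partial_to_replicated(local_grads):
    """Emulated all-reduce(sum): every rank ends up holding the same total."""
    total = [sum(values) for values in zip(*local_grads)]
    return [list(total) for _ in local_grads]


# Input laid out as [batch, seq, hidden]; batch and seq are broadcast dimensions.
sp_only = grad_partial_axes([NOT_SHARDED, 0, NOT_SHARDED], broadcast_dims=(0, 1))  # [0]
dp_and_sp = grad_partial_axes([0, 1, NOT_SHARDED], broadcast_dims=(0, 1))          # [0, 1]
no_shard = grad_partial_axes([NOT_SHARDED] * 3, broadcast_dims=(0, 1))             # []

# Two ranks each hold a partial (flattened) gradient before the optimizer step.
replicated = reshard_partial_to_replicated([[4.0, 6.0], [12.0, 14.0]])
```

Because the Partial status is derived from every sharded broadcast dimension, the sequence-sharded case yields a mesh axis to reduce over, and after the emulated all-reduce both ranks hold the identical summed gradient.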


paddle-ci-bot · 2023-11-29T03:11:58Z

Sorry to inform you that 9535157's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.


wanghuancoder

LGTM

heavyrain-lzy

LGTM for SP

…ePaddle#59057) * completion bw partial * debug * bugfix * insert param grad allreduce by partial * reorder allreduce for opt * fix typoes * add grad sync unitest * sp unitest * fixed unitest

JZ-LIANG added 4 commits November 15, 2023 15:51

completion bw partial

87968ef

Merge remote-tracking branch 'upstream/develop' into semi-auto/unify-…

b9af95c

…grad-sync

Merge remote-tracking branch 'upstream/develop' into semi-auto/unify-…

dbd15a4

…grad-sync

debug

40bbfd4

JZ-LIANG changed the title ~~[Auto Parallel] Unify Gradient Synchronization Using Partial Mechanism~~ [Auto Parallel] Unify Gradient Synchronization Using Partial->Replicated Mechanism Nov 16, 2023

JZ-LIANG added 4 commits November 20, 2023 20:29

bugfix

f806aaf

insert param grad allreduce by partial

a3ec209

reorder allreduce for opt

b3e655c

fix typoes

9535157

JZ-LIANG added 3 commits November 29, 2023 15:22

add grad sync unitest

d09e575

sp unitest

3c590c8

Merge remote-tracking branch 'upstream/develop' into semi-auto/unify-…

524ba2d

…grad-sync

JZ-LIANG changed the title ~~[Auto Parallel] Unify Gradient Synchronization Using Partial->Replicated Mechanism~~ [Auto Parallel] Unify Gradient Synchronization in Static Mode Nov 29, 2023

JZ-LIANG changed the title ~~[Auto Parallel] Unify Gradient Synchronization in Static Mode~~ [Auto Parallel] Unify Gradient Synchronization Mechanism in Static Mode Nov 29, 2023

JZ-LIANG added 2 commits November 30, 2023 15:05

fixed unitest

0e5c048

Merge remote-tracking branch 'upstream/develop' into semi-auto/unify-…

04c19c5

…grad-sync

JZ-LIANG changed the title ~~[Auto Parallel] Unify Gradient Synchronization Mechanism in Static Mode~~ [Auto Parallel] Update Gradient Synchronization in Static Mode Dec 4, 2023

wanghuancoder approved these changes Dec 4, 2023


heavyrain-lzy approved these changes Dec 4, 2023


JZ-LIANG merged commit 7e5f101 into PaddlePaddle:develop Dec 4, 2023
