[BUG] ClusterLoadFallbackPolicy is not strictness when a shuffle with big partitions to register #30

TonyDoen · 2022-01-04T03:43:44Z

[BUG] ClusterLoadFallbackPolicy is not strictness when a shuffle with big partitions to register

What changes were proposed in this pull request?

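If a shuffle with 50,000 partitions tries to register while the cluster has only 30,000 slots, slot usage alone does not put the cluster in high-load status, so the fallback policy does not trigger and the allocation of 50,000 slots then fails.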

Why are the changes needed?

This change prevents a single shuffle from claiming more slots than the cluster can provide, which would otherwise disrupt the cluster's ongoing RSS processing.
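The stricter check described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the object name, the method name, the parameters, and the threshold value are all hypothetical, and it assumes one slot per partition.

```scala
// Hypothetical sketch: names and the threshold are assumptions, not the PR's actual code.
object ClusterLoadSketch {
  /** Returns true when the shuffle should fall back instead of registering with RSS. */
  def shouldFallback(
      totalSlots: Long,
      usedSlots: Long,
      requestedSlots: Long,
      highLoadThreshold: Double): Boolean = {
    val remainingSlots = totalSlots - usedSlots
    // Usage-only check: the cluster counts as highly loaded once usage crosses the threshold.
    val highLoad = totalSlots == 0 || usedSlots.toDouble / totalSlots >= highLoadThreshold
    // Stricter check: the request must also fit into the slots that remain.
    val requestTooLarge = requestedSlots > remainingSlots
    highLoad || requestTooLarge
  }
}
```

With the numbers from the description, `ClusterLoadSketch.shouldFallback(30000L, 0L, 50000L, 0.9)` reports a fallback even though the usage ratio is zero, because the request cannot fit into the remaining slots.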

What are the items that need reviewer attention?

Related issues.

#17

Related pull requests.

How was this patch tested?

/cc @FMX

/assign @wangshengjie

CLAassistant · 2022-01-04T03:43:49Z

All committers have signed the CLA.

wangshengjie123 · 2022-01-04T07:23:43Z

...e-manager-2/src/main/scala/org/apache/spark/shuffle/rss/RssShuffleFallbackPolicyRunner.scala

@@ -31,7 +31,8 @@ class RssShuffleFallbackPolicyRunner(sparkConf: SparkConf) extends Logging {
  def applyAllFallbackPolicy(dependency: ShuffleDependency[_, _, _],


dependency.partitioner.numPartitions seems to be enough. Could you please simplify this? Thanks.

Yes, I have removed the [dependency] parameter from this function.

wangshengjie123 · 2022-01-04T07:27:37Z

server-master/src/main/scala/com/aliyun/emr/rss/service/deploy/master/Master.scala

-  private def handleGetClusterLoadStatus(context: RpcCallContext): Unit = {
-    val (_, _, _, result) = getClusterLoad
+  private def handleGetClusterLoadStatus(context: RpcCallContext, numPartitions: Int): Unit = {
+    val (_, _, _, result) = getClusterLoad(numPartitions)


Why do we need so many return values? Please remove the unused ones.

[getClusterLoad] is a shared function, and the first three return values may be needed elsewhere. I can separate out the value by creating a new function.
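One possible shape for that separation is a thin wrapper that exposes only the fourth value. This is a sketch under assumptions: only the four-value shape of getClusterLoad comes from the diff above, while the element types, the placeholder values, and the name isClusterOverloaded are hypothetical.

```scala
// Hypothetical sketch: element types and values are placeholders, not the real Master code.
object ClusterLoadResultSketch {
  // Stand-in for the shared function; the first three values are assumed to be slot counts.
  def getClusterLoad(numPartitions: Int): (Long, Long, Long, Boolean) = {
    val totalSlots = 30000L
    val usedSlots = 0L
    val remainingSlots = totalSlots - usedSlots
    (totalSlots, usedSlots, remainingSlots, numPartitions > remainingSlots)
  }

  // Thin wrapper so callers that only need the verdict avoid destructuring the tuple.
  def isClusterOverloaded(numPartitions: Int): Boolean =
    getClusterLoad(numPartitions)._4
}
```

Callers such as the RPC handler could then use the wrapper, while other call sites keep access to the full tuple.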

FMX · 2022-01-06T10:11:01Z

LGTM

waitinfuture and others added 8 commits December 28, 2021 17:04

Initial Commit

cbaa5fa

Add template and ci.

8835f88

Merge pull request apache#16 from FMX/branch-templatesAndCI

0c63a18

rm one line

5192768

add check remainSlots bigger than requestSlots

20329b8

merge alibaba origin

ce6ca0b

Merge branch 'main' into issue-17

3ef21b5

make up shuffle-manager-3

71b581a

try change local user.email

28eb834

wangshengjie123 reviewed Jan 4, 2022

TonyDoen added 3 commits January 4, 2022 16:45

remove ShuffleDependency param

8b125e3

alter getClusterLoad return values

9c5b348

alter comment

5d1d8d1

waitinfuture reviewed Jan 17, 2022

server-master/src/main/scala/com/aliyun/emr/rss/service/deploy/master/Master.scala (outdated, resolved)

...e-manager-3/src/main/scala/org/apache/spark/shuffle/rss/RssShuffleFallbackPolicyRunner.scala (resolved)

TonyDoen and others added 4 commits January 24, 2022 14:30

retrieve workersSnapShot and add config: rss.clusterLoad.fallback.enabled

85903f4

Merge github.com:alibaba/RemoteShuffleService into issue-17

2537e34

rm unuseful numPartitions

c88cc15

fix checkstyle

62bbcfb

waitinfuture merged commit 302891a into apache:main Jan 26, 2022

waitinfuture linked an issue Jan 30, 2022 that may be closed by this pull request

[BUG] ClusterLoadFallbackPolicy is not strictness when a shuffle with big partitions to register #17

Closed

turboFei mentioned this pull request Dec 27, 2024

[CELEBORN-1720] Prevent stage re-run if task another attempt is running #3037

Closed
