Replace HashSet with ConcurrentHashMap.newKeySet #3100

Open
wants to merge 1 commit into main
Conversation

aidar-stripe

What changes were proposed in this pull request?

Replace the HashSet of PartitionLocations with a concurrent version (ConcurrentHashMap.newKeySet).
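
Concretely, the swap looks roughly like the sketch below. Elem is a placeholder for the actual element type (the PartitionLocation sets mentioned above), and the field name is illustrative rather than the real Celeborn declaration:

import java.util.concurrent.ConcurrentHashMap

object ConcurrentSetSketch {
  type Elem = String // placeholder for PartitionLocation, illustration only

  // Before (illustrative): a plain HashSet is safe only when every access
  // is guarded by the same lock.
  // val locations: java.util.Set[Elem] = new java.util.HashSet[Elem]()

  // After (illustrative): a concurrent set view backed by ConcurrentHashMap,
  // which tolerates unsynchronized readers.
  val locations: java.util.Set[Elem] = ConcurrentHashMap.newKeySet[Elem]()
}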

Why are the changes needed?

We are seeing race conditions between handleGetReducerFileGroup and tryFinalCommit, where reducers complete without processing a partition even though it has data.

Problematic logs

On the driver side:

25/01/31 14:23:02 {} INFO org.apache.celeborn.client.commit.ReducePartitionCommitHandler: Shuffle 23 commit files complete. File count 23200 using 240180 ms
...
25/01/31 14:23:02 {} INFO org.apache.celeborn.client.commit.ReducePartitionCommitHandler: Shuffle 23 partition 11931-0: primary lost, use replica PartitionLocation[
  id-epoch:11931-0
  host-rpcPort-pushPort-fetchPort-replicatePort:10.68.138.242-39557-35555-37139-39685
  mode:REPLICA
  peer:(empty)
  storage hint:StorageInfo{type=SSD, mountPoint='', finalResult=true, filePath=}
  mapIdBitMap:null].
...
25/01/31 14:23:02 {} INFO org.apache.celeborn.client.commit.ReducePartitionCommitHandler: Succeed to handle stageEnd for 23.

On the executor side:

25/01/31 14:23:02 {executorId=92, jobId=28, partitionId=420, stageId=74, taskAttemptId=82047} INFO org.apache.celeborn.client.ShuffleClientImpl: Shuffle 23 request reducer file group success using 59315 ms, result partition size 12000
...
25/01/31 14:40:54 {executorId=92, partitionId=11931, taskAttemptId=93846} INFO org.apache.spark.executor.Executor: Running task 11931.0 in stage 74.0 (TID 93846)
25/01/31 14:40:54 {jobId=28, executorId=92, taskAttemptId=93846, partitionId=11931, stageId=74} INFO org.apache.spark.shuffle.celeborn.SparkShuffleManager: Shuffle 24 write mode is changed to SORT because partition count 12000 is greater than threshold 2000
25/01/31 14:40:54 {executorId=92, jobId=28, partitionId=11931, stageId=74, taskAttemptId=93846} INFO org.apache.spark.shuffle.celeborn.CelebornShuffleReader: BatchOpenStream for 0 cost 0ms
25/01/31 14:40:54 {} WARN org.apache.celeborn.client.ShuffleClientImpl: Shuffle data is empty for shuffle 23 partition 11931.

How was this patch tested?

No additional tests for this: I've tried to reproduce it, but we've only seen this happen with a high number of nodes and over long execution times.

More explanation on why/how this happens

// write path
override def setStageEnd(shuffleId: Int): Unit = {
  getReducerFileGroupRequest synchronized {
    stageEndShuffleSet.add(shuffleId)
  }
  ...

// read path
override def handleGetReducerFileGroup(context: RpcCallContext, shuffleId: Int): Unit = {
  // Quick return for ended stage, avoid occupy sync lock.
  if (isStageEnd(shuffleId)) {
    replyGetReducerFileGroup(context, shuffleId)
  } else {
    getReducerFileGroupRequest.synchronized {
      ...

override def isStageEnd(shuffleId: Int): Boolean = {
  stageEndShuffleSet.contains(shuffleId)
}

Since the concurrency guarantees between the read and write paths are based on ConcurrentHashMap's volatile values, there is no guarantee that the contents of a plain HashSet will be fully visible to the reader thread.
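
To make the pattern concrete, here is a condensed, illustrative reduction of the two paths above (names mirror the excerpts, but this is not the actual Celeborn class): the writer adds under a lock, while the fast-path read in isStageEnd takes no lock, so visibility depends entirely on the set implementation.

import java.util.concurrent.ConcurrentHashMap

class StageEndTracker {
  private val lock = new Object
  // With a plain HashSet here, the unsynchronized read below would have no
  // happens-before edge with the synchronized write, so the reader could see
  // stale or partially published contents. ConcurrentHashMap.newKeySet makes
  // the lock-free read safe via the map's internal volatile accesses.
  private val stageEndShuffleSet: java.util.Set[Int] =
    ConcurrentHashMap.newKeySet[Int]()

  def setStageEnd(shuffleId: Int): Unit = lock.synchronized {
    stageEndShuffleSet.add(shuffleId)
  }

  // Fast path, mirroring isStageEnd: no lock taken.
  def isStageEnd(shuffleId: Int): Boolean =
    stageEndShuffleSet.contains(shuffleId)
}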

@FMX
Contributor

FMX commented Feb 17, 2025

@aidar-stripe Hi, PR #2986 "[CELEBORN-1769] Fix packed partition location cause GetReducerFileGroupResponse lose location" fixed the scenario you might have encountered.

Is that PR included in your distribution, or could you provide the commit ID of your worker distribution?

@aidar-stripe
Author

@FMX thanks for the link! I think you are absolutely right here: we were running a version of the Celeborn client based on 0.5.1 with some of our own commits for integrity checks (which were disabled).

I was able to confirm that the PbGetReducerFileGroupResponse conversion code only takes primaries there:

val fileGroup = pbGetReducerFileGroupResponse.getFileGroupsMap.asScala.map {
  case (partitionId, fileGroup) =>
    (
      partitionId,
      PbSerDeUtils.fromPbPackedPartitionLocationsPair(
        fileGroup.getPartitionLocationsPair)._1.asScala.toSet.asJava)
}.asJava

This explains the consistency of the failures we've seen much better than the potential concurrency issue with the HashSet. I would still like to merge this PR though; I think a concurrent set is still more appropriate there.
