Add support for concurrent batch Append and Replace #14407
Conversation
Leaving some comments, will come back to it.
Resolved review threads on:
...src/main/java/org/apache/druid/indexing/common/actions/SegmentTransactionalAppendAction.java
indexing-service/src/main/java/org/apache/druid/indexing/common/task/AppendTask.java
public SegmentIdWithShardSpec allocateOrGetSegmentForTimestamp(String timestamp)
{
  final DateTime time = DateTime.parse(timestamp);
  for (SegmentIdWithShardSpec pendingSegment : pendingSegments) {
    if (pendingSegment.getInterval().contains(time)) {
      return pendingSegment;
    }
  }
  return allocateNewSegmentForDate(time);
}

public SegmentIdWithShardSpec allocateNewSegmentForTimestamp(String timestamp)
{
  return allocateNewSegmentForDate(DateTime.parse(timestamp));
}
These are the methods that you are using to tell the task what to do from the test thread. But they are also doing all of their work on the actual test thread, not on the task's thread. This means that if one of these ever blocks, it's going to block your test thread instead of the task thread, meaning that your test cannot make progress.
In order to fix this, you will need to make the actual run() part of the task basically just sit and wait on a queue of work for it to do. Then these calls would add new Runnables to that queue. Once you do that, you will likely find that it's incredibly simple to control the behavior of the tasks from the test itself.
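A minimal sketch of that pattern, assuming a hypothetical ControllableTestTask (this is not the actual AppendTask code, just an illustration of the queue-driven run loop):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: run() just drains a work queue, so any blocking call happens on
// the task's own thread rather than the test thread.
class ControllableTestTask
{
  private static final Runnable STOP = () -> {};

  private final BlockingQueue<Runnable> workQueue = new LinkedBlockingQueue<>();

  // Runs on the task's own thread (e.g. invoked by the task runner).
  public void run() throws InterruptedException
  {
    while (true) {
      final Runnable work = workQueue.take(); // blocks only the task thread
      if (work == STOP) {
        return;
      }
      work.run();
    }
  }

  // Called from the test thread: enqueues work and returns immediately.
  public void submit(Runnable work)
  {
    workQueue.add(work);
  }

  // Called from the test thread to let run() exit cleanly.
  public void finish()
  {
    workQueue.add(STOP);
  }
}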
Resolved review threads on:
indexing-service/src/main/java/org/apache/druid/indexing/common/task/IndexTask.java
indexing-service/src/main/java/org/apache/druid/indexing/common/task/ReplaceTask.java
indexing-service/src/main/java/org/apache/druid/indexing/common/task/Task.java
if (lockType.equals(TaskLockType.APPEND) && preferredVersion == null) {
  return "1970-01-01T00:00:00.000Z";
}
return preferredVersion == null ? DateTimes.nowUtc().toString() : preferredVersion;
The defaulting happening in this getter is a bit weird; let's try to make the things that build the Request do the right thing and make this getter less intelligent.
Done
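For reference, a minimal sketch of the kind of refactor being suggested (all class and method names here are hypothetical, not the committed code):

import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;

// Sketch: resolve the default version when the request is built, so the
// getter simply returns the stored value.
class VersionedLockRequest
{
  enum TaskLockType { APPEND, EXCLUSIVE }

  private final String version;

  private VersionedLockRequest(String version)
  {
    this.version = version;
  }

  static VersionedLockRequest create(TaskLockType lockType, String preferredVersion)
  {
    final String resolved;
    if (preferredVersion != null) {
      resolved = preferredVersion;
    } else if (lockType == TaskLockType.APPEND) {
      // APPEND locks default to the epoch, matching "1970-01-01T00:00:00.000Z".
      resolved = new DateTime(0, DateTimeZone.UTC).toString();
    } else {
      resolved = DateTime.now(DateTimeZone.UTC).toString();
    }
    return new VersionedLockRequest(resolved);
  }

  String getVersion()
  {
    return version; // no defaulting logic here anymore
  }
}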
Resolved review threads on:
server/src/main/java/org/apache/druid/indexing/overlord/IndexerMetadataStorageCoordinator.java
.../main/java/org/apache/druid/segment/realtime/appenderator/TransactionalSegmentPublisher.java
There are some changes related to realtime tasks in this PR, primarily because the parameter useSharedLock has been removed. I think we should retain the useSharedLock parameter for the time being and only deprecate it for now, to retain backward compatibility.
Also, the realtime task related changes should be reverted from this PR.
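A sketch of what retaining and deprecating the parameter could look like (assuming the constant lives in Tasks, as the diff below suggests; the literal value is an assumption):

// Hypothetical sketch: keep the old context key but mark it deprecated.
public class Tasks
{
  /**
   * @deprecated Use the task lock type context parameter (Tasks.TASK_LOCK_TYPE)
   * with APPEND instead.
   */
  @Deprecated
  public static final String USE_SHARED_LOCK = "useSharedLock"; // assumed literal
}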
Resolved review threads on:
indexing-service/src/main/java/org/apache/druid/indexing/common/actions/TaskLocks.java
...xing-service/src/main/java/org/apache/druid/indexing/common/task/AbstractBatchIndexTask.java
@@ -297,15 +297,14 @@ public boolean determineLockGranularityAndTryLock(TaskActionClient client, List<
       Tasks.DEFAULT_FORCE_TIME_CHUNK_LOCK
   );
   IngestionMode ingestionMode = getIngestionMode();
-  final boolean useSharedLock = ingestionMode == IngestionMode.APPEND
-      && getContextValue(Tasks.USE_SHARED_LOCK, false);
+  final TaskLockType taskLockType = TaskLockType.valueOf(getContextValue(Tasks.TASK_LOCK_TYPE, TaskLockType.EXCLUSIVE.name()));
The default value of the task lock type should be a constant somewhere.
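For example, a sketch of hoisting the default into a named constant (the placement in Tasks and the constant name are assumptions; getContextValue mirrors the call site in the diff above):

// In Tasks (hypothetical placement):
public static final TaskLockType DEFAULT_TASK_LOCK_TYPE = TaskLockType.EXCLUSIVE;

// The call site then reads:
final TaskLockType taskLockType = TaskLockType.valueOf(
    getContextValue(Tasks.TASK_LOCK_TYPE, Tasks.DEFAULT_TASK_LOCK_TYPE.name())
);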
Resolved review threads on:
indexing-service/src/main/java/org/apache/druid/indexing/common/task/TaskLockHelper.java
indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskLockbox.java
… into overlordSimulator
Motivation:
- There is no usage of the `SegmentTransactionInsertAction` which passes a non-null, non-empty value of `segmentsToBeDropped`.
- This is not really needed either, as overshadowed segments are marked as unused by the Coordinator and need not be done in the same transaction as committing segments.
- It will also help simplify the changes being made in #14407.

Changes:
- Remove `segmentsToBeDropped` from the task action and all intermediate methods
- Remove related tests which are not needed anymore
Fixed: server/src/test/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinatorTest.java
This commit pulls out some changes from #14407 to simplify that PR.

Changes:
- Rename `IndexerMetadataStorageCoordinator.announceHistoricalSegments` to `commitSegments`
- Rename the overloaded method to `commitSegmentsAndMetadata`
- Fix some typos
Merging as the IT failure is unrelated.
() -> SegmentPublishResult.fail(
    "Invalid task locks. Maybe they are revoked by a higher priority task."
    + " Please check the overlord log for details."
)
We should have logged the intervals though.
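For instance, the failure message might include them like this (the intervals variable and formatting are illustrative, not the committed code):

() -> SegmentPublishResult.fail(
    String.format(
        "Invalid task locks for intervals[%s]. Maybe they are revoked by a higher priority task."
        + " Please check the overlord log for details.",
        intervals // hypothetical: the intervals the task attempted to lock
    )
)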
 * Replace segments in metadata storage. The segment versions must all be less than or equal to a lock held by
 * your task for the segment intervals.
This javadoc is not very clear. Versions of "what" segments? The ones being replaced? Also, what does it mean here by "your" task? Some verbosity here could be helpful.
Resolved review thread on:
...rc/main/java/org/apache/druid/indexing/common/actions/SegmentTransactionalReplaceAction.java
@@ -96,7 +112,8 @@ public static boolean isLockCoversSegments(
   final TimeChunkLock timeChunkLock = (TimeChunkLock) lock;
   return timeChunkLock.getInterval().contains(segment.getInterval())
          && timeChunkLock.getDataSource().equals(segment.getDataSource())
-         && timeChunkLock.getVersion().compareTo(segment.getVersion()) >= 0;
+         && (timeChunkLock.getVersion().compareTo(segment.getVersion()) >= 0
+             || TaskLockType.APPEND.equals(timeChunkLock.getType()));
Please leave some comments here, e.g. that APPEND, by definition, covers all versions, unlike other lock types.
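For illustration, the commented condition might read something like this (the comment wording is a suggestion, mirroring the diff above):

return timeChunkLock.getInterval().contains(segment.getInterval())
       && timeChunkLock.getDataSource().equals(segment.getDataSource())
       // An APPEND lock covers segments of all versions within its interval,
       // unlike other lock types, which only cover segments at or below the
       // lock's own version.
       && (timeChunkLock.getVersion().compareTo(segment.getVersion()) >= 0
           || TaskLockType.APPEND.equals(timeChunkLock.getType()));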
 * This method should be de-duplicated with {@link AbstractBatchIndexTask#determineLockType}
 * by passing the ParallelIndexSupervisorTask instance into the
 * SinglePhaseParallelIndexTaskRunner.
This passage is not clear. What kind of de-duplication does it refer to?
Resolved review threads on:
indexing-service/src/main/java/org/apache/druid/indexing/common/actions/TaskLocks.java
...xing-service/src/main/java/org/apache/druid/indexing/common/task/AbstractBatchIndexTask.java
final Set<Map<String, Object>> usedSegmentLoadSpecs = toolbox
    .getTaskActionClient()
    .submit(new RetrieveUsedSegmentsAction(getDataSource(), getInterval(), null, Segments.INCLUDING_OVERSHADOWED))
    .stream()
    .map(DataSegment::getLoadSpec)
    .collect(Collectors.toSet());

// Kill segments from the deep storage only if their load specs are not being used by any used segments
final List<DataSegment> segmentsToBeKilled = unusedSegments
    .stream()
    .filter(unusedSegment -> !usedSegmentLoadSpecs.contains(unusedSegment.getLoadSpec()))
    .collect(Collectors.toList());

toolbox.getDataSegmentKiller().kill(segmentsToBeKilled);
This action is very expensive, especially if the interval is eternity, which is often the case with the killUnusedSegments task.
This was already happening; this PR has just reduced the number of segments that are being killed by the DataSegmentKiller.
How is that, Kashif? Were we getting the list of used segments before too?
Ah, sorry, I misinterpreted the comment to be just for the last line, rather than this whole block of code.
Resolved review thread on:
indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskLockbox.java
 * there would be some used segments in the DB with versions higher than these
 * append segments.
 */
private Set<DataSegment> getSegmentsToUpgradeOnAppend(
Could be renamed to getExtraVersionsForAppendSegments.
"extra" versions is probably still confusing. How about get createUpgradedVersionsOfAppendSegments
?
or newVersionsOfAppendSegments
Yeah, that works too. But I just want to avoid confusion between words like "new", "upgrade" or "extra" when all of them mean the same thing in our context.
final Map<String, Set<Interval>> committedVersionToIntervals = new HashMap<>();
final Map<Interval, Set<DataSegment>> committedIntervalToSegments = new HashMap<>();
variable naming - committed -> overlapping
final Set<DataSegment> upgradedSegments = new HashSet<>();
for (Map.Entry<String, Set<Interval>> entry : committedVersionToIntervals.entrySet()) {
  final String upgradeVersion = entry.getKey();
  Map<Interval, Set<DataSegment>> segmentsToUpgrade = getSegmentsWithVersionLowerThan(
naming: segmentsToUpgrade --> extraSegmentVersions
This method is not returning extra versions, it's returning the segments that need to be upgraded.
I think when I hear upgrade, I see something going from V0 to V1, but here, we leave V0 as it is and also create V1.
Yes, that's correct. In this line, we are just identifying the segments of V0 that need to go to V1.
final Set<DataSegment> upgradedSegments = new HashSet<>();
for (Map.Entry<String, Set<Interval>> entry : committedVersionToIntervals.entrySet()) {
  final String upgradeVersion = entry.getKey();
  Map<Interval, Set<DataSegment>> segmentsToUpgrade = getSegmentsWithVersionLowerThan(
Just for my own understanding: there should never be an empty value for a function such as getSegmentsWithVersionHigherThan with the same arguments?
Do you mean that the returned value should not be empty?
Never mind. I was asking whether versions in the DB will always be higher than or equal to the append segment version.
 * Computes new Segment IDs for the {@code segmentsToUpgrade} being upgraded
 * to the given {@code upgradeVersion}.
 */
private Set<DataSegment> upgradeSegmentsToVersion(
The name is a bit confusing because it doesn't actually upgrade anything yet; it's just creating extra versions.
How about createUpgradedVersionOfSegments?
That works too.
@abhishekagarwal87, I have replied to your comments. The changes will be included in #15097.
Allows multiple Appending batch ingestion jobs to run concurrently with at most one Replacing job for an enclosing interval.
Description
This PR utilizes the new task lock types, APPEND and REPLACE, and builds on them to allow concurrent compaction with batch ingestion.
TODO - Describe the problems with segment locking, and briefly explain the mechanism used in this patch
Add new task actions: SegmentTransactionalAppendAction and SegmentTransactionalReplaceAction
The SegmentTransactionalAppendAction is used by appending tasks holding an APPEND lock. When committing segments to the metadata store, these tasks also transactionally commit, to the druid_segmentVersions table, the metadata corresponding to each segment id and the version of the REPLACE lock (if any) that was held on its interval.
The SegmentTransactionalReplaceAction is used by replacing tasks that hold a REPLACE lock. When committing the core partitions for a given interval, these tasks also carry forward the previously appended segments committed under the same lock (utilizing the metadata committed by the append action) to every version for which used segments exist.
Utilize the new task actions with the previously added lock types to facilitate concurrent compaction with batch ingestion (see the context sketch after this list)
Add new test Task types: AppendTask and ReplaceTask to help simulate various orders of events
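Based on the Tasks.TASK_LOCK_TYPE lookup shown earlier in this conversation, a hedged sketch of how a task's context might opt into the new lock types (the literal key "taskLockType" is an assumption, not confirmed by this PR text):

import java.util.HashMap;
import java.util.Map;

public class ConcurrentLockContextExample
{
  // Builds the task context for an appending ingestion task; a replacing task
  // (e.g. compaction) would use "REPLACE" instead.
  public static Map<String, Object> appendTaskContext()
  {
    final Map<String, Object> context = new HashMap<>();
    context.put("taskLockType", "APPEND"); // assumed literal for Tasks.TASK_LOCK_TYPE
    return context;
  }
}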
Release note - TODO
This PR has: