Add support for 'maxTotalRows' to incremental publishing kafka indexing task and appenderator based realtime task #6129

Merged · merged 4 commits · Sep 7, 2018

Conversation

@clintropolis (Member) commented Aug 8, 2018

Resolves #5898 by adding getMaxTotalRows and getMaxRowsPerSegment to AppenderatorConfig, extending the model used by IndexTask to IncrementalPublishingKafkaIndexTaskRunner and AppenderatorDriverRealtimeTask.

Additionally, this tweaks the maxRowsPerSegment behavior of Kafka indexing to match the appenderator-based realtime indexing change in #6125.
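For context, a minimal sketch of the kind of row-limit check being extended here (hypothetical names, not the actual IndexTask or Kafka indexing code): the comparisons are inclusive, matching the '>' to '>=' tweak carried over from #6125, and a null maxTotalRows means no total-row limit.

```java
// Hedged sketch with hypothetical names; not the actual IndexTask / Kafka indexing code.
class RowLimitSketch
{
  static boolean shouldPublish(
      long rowsInCurrentSegment,     // rows aggregated into the active segment so far
      long totalRowsAcrossSegments,  // rows across all segments waiting to be published
      int maxRowsPerSegment,
      Long maxTotalRows              // null means "no total-row limit"
  )
  {
    // Inclusive comparisons, matching the '>' to '>=' tweak carried over from #6125.
    final boolean segmentFull = rowsInCurrentSegment >= maxRowsPerSegment;
    final boolean totalLimitReached = maxTotalRows != null && totalRowsAcrossSegments >= maxTotalRows;
    return segmentFull || totalLimitReached;
  }
}
```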

@@ -429,9 +429,6 @@ public void run()
// if stop is requested or task's end offset is set by call to setEndOffsets method with finish set to true
if (stopRequested.get() || sequences.get(sequences.size() - 1).isCheckpointed()) {
Contributor:

What's the rationale behind breaking here if the last sequence is checkpointed?

Member Author:

I discussed this with @jihoonson: I noticed that previously the task would set its state to publishing and then, if the stop wasn't user-requested, run through a bunch of logic not related to publishing, and breaking here is what was intended to happen. As I understand it, the current code harmlessly but needlessly falls through the logic below before breaking out and publishing.

Contributor:

Got it. Seems fine, since we haven't read anything from the Kafka consumer yet.
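For readers following along, a self-contained sketch of the control flow being discussed (hypothetical names and structure, not the actual IncrementalPublishingKafkaIndexTaskRunner code): the check sits at the top of the ingestion loop, before anything is polled from the Kafka consumer, which is why breaking straight out to the publish path is harmless.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the loop shape discussed above; not the actual runner code.
class RunLoopSketch
{
  static class Sequence
  {
    private final boolean checkpointed;

    Sequence(boolean checkpointed)
    {
      this.checkpointed = checkpointed;
    }

    boolean isCheckpointed()
    {
      return checkpointed;
    }
  }

  public static void main(String[] args)
  {
    final AtomicBoolean stopRequested = new AtomicBoolean(false);
    final List<Sequence> sequences = new ArrayList<>();
    sequences.add(new Sequence(true)); // the last sequence is already checkpointed

    while (true) {
      // Break out to the publish path before polling the consumer, either because a stop
      // was requested or because the last sequence's end offsets are already set.
      if (stopRequested.get() || sequences.get(sequences.size() - 1).isCheckpointed()) {
        System.out.println("breaking out to publish; nothing has been read from the consumer yet");
        break;
      }
      // ... otherwise poll records from Kafka and add them to the appenderator ...
    }
  }
}
```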

/**
* Maximum number of rows in memory before persisting to local storage
*
* @return
Contributor:

Please get rid of the @return tag if it's not going to have anything useful on it; here, and in a few other places in the file.

long getMaxBytesInMemory();


Contributor:

The extra newline here isn't needed.

* @return
*/
@Nullable
default Long getMaxTotalRows()
Contributor:

Why null instead of Long.MAX_VALUE? That's more rows than one machine could possibly store anyway.

Member Author:

This should never be used, AFAIK; everything that uses this value gets it from JSON. This default is just so I didn't have to add it to RealtimeTuningConfig, which implements AppenderatorConfig.

Contributor:

If it's never going to be used, how about throwing UnsupportedOperationException and marking it non-nullable?

Member Author:

It's nullable to be consistent with how IndexTask was using it. I don't have strong opinions about null vs Long.MAX_VALUE; I'll rework where this is used to get rid of the nullability.

Member Author:

Oh, I hadn't read the comment on that; it looks like IndexTask needs this to be nullable. I'll throw UnsupportedOperationException in the default method at least.
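To make the outcome of this exchange concrete, a hedged sketch of what the interface default could look like (simplified and hypothetical; the actual AppenderatorConfig signature and javadoc may differ): the method stays @Nullable because IndexTask treats null as "no limit", while the default implementation throws so that a config that never supplies the value, like RealtimeTuningConfig here, fails loudly instead of returning something misleading.

```java
import javax.annotation.Nullable;

// Hedged, simplified sketch of the interface shape discussed above; not the exact
// AppenderatorConfig source.
interface AppenderatorConfigSketch
{
  /**
   * Maximum number of rows to aggregate across all segments before publishing;
   * null means there is no total-row limit.
   */
  @Nullable
  default Long getMaxTotalRows()
  {
    // Configs that never supply this value are not expected to call it; fail loudly
    // rather than returning a misleading default such as Long.MAX_VALUE.
    throw new UnsupportedOperationException("maxTotalRows is not implemented by this config");
  }
}
```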

@@ -611,6 +603,7 @@ public Object doCall() throws IOException
}
theSinks.put(identifier, sink);
sink.finishWriting();
totalRows.addAndGet(-sink.getNumRows());
Contributor:

It would make more sense to decrement this at the same time as rowsCurrentlyInMemory and bytesCurrentlyInMemory, rather than here.

Member Author:

Hmm, where are you suggesting? Those values are decremented at persist time; this one happens at publish time.

Contributor:

Oh, yeah, you're right. Let me re-read this with non-dumb eyes.

Contributor:

Ok, I re-read it, less dumbly this time, and it looks good to me. But consider the finishWriting-returning-boolean thing.

@@ -1118,16 +1094,18 @@ public String apply(SegmentIdentifier input)
final boolean removeOnDiskData
)
{
if (sink.isWritable()) {
Contributor:

What's the rationale behind moving this block?

Member Author:

For one, it seemed more legitimate to check that, to match the "we only count active sinks" comment, which implies writable to me. Additionally, since there are a couple of paths to decrementing the totalRows counter, I wanted to make sure I never double-decremented it (its other decrement is also tied to finishWriting).

Contributor:

Hmm, to prevent races (abandonSegment could be called in a separate thread), how about having sink.finishWriting() return true or false, corresponding to its writability state before the method was called? Then, only decrement the counters if it returns true.

Member Author:

👍
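A rough, self-contained sketch of the suggested pattern (hypothetical names, not the actual Druid Sink or AppenderatorImpl code): finishWriting() reports whether this particular call was the one that transitioned the sink out of its writable state, so racing callers such as the publish path and abandonSegment can't both decrement the shared row counter.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hedged sketch of the race-safe decrement pattern suggested above; hypothetical names,
// not the actual Druid Sink / AppenderatorImpl code.
class SinkSketch
{
  private final AtomicBoolean writable = new AtomicBoolean(true);
  private final long numRows = 123L; // rows held by this sink (fixed for the sketch)

  /** Returns true only for the single caller that transitions the sink out of its writable state. */
  boolean finishWriting()
  {
    return writable.compareAndSet(true, false);
  }

  long getNumRows()
  {
    return numRows;
  }
}

class DriverSketch
{
  private final AtomicLong totalRows = new AtomicLong(123L);

  // Called from both the publish path and abandonSegment, possibly on different threads.
  void closeSink(SinkSketch sink)
  {
    if (sink.finishWriting()) {
      // Only the caller that actually finished writing decrements, so totalRows can
      // never be double-decremented for the same sink.
      totalRows.addAndGet(-sink.getNumRows());
    }
  }
}
```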

@gianm (Contributor) left a comment:

👍 after CI

@jihoonson (Contributor) commented:

I'm reviewing this PR.

@@ -117,7 +117,8 @@ The tuningConfig is optional and default parameters will be used if no tuningCon
|`type`|String|The indexing task type, this should always be `kafka`.|yes|
|`maxRowsInMemory`|Integer|The number of rows to aggregate before persisting. This number is the post-aggregation rows, so it is not equivalent to the number of input events, but the number of aggregated rows that those events result in. This is used to manage the required JVM heap size. Maximum heap memory usage for indexing scales with maxRowsInMemory * (2 + maxPendingPersists). Normally user does not need to set this, but depending on the nature of data, if rows are short in terms of bytes, user may not want to store a million rows in memory and this value should be set.|no (default == 1000000)|
|`maxBytesInMemory`|Long|The number of bytes to aggregate in heap memory before persisting. This is based on a rough estimate of memory usage and not actual usage. Normally this is computed internally and user does not need to set it. The maximum heap memory usage for indexing is maxBytesInMemory * (2 + maxPendingPersists). |no (default == One-sixth of max JVM memory)|
|`maxRowsPerSegment`|Integer|The number of rows to aggregate into a segment; this number is post-aggregation rows. Handoff will happen either if `maxRowsPerSegment` is hit or every `intermediateHandoffPeriod`, whichever happens earlier.|no (default == 5000000)|
|`maxRowsPerSegment`|Integer|The number of rows to aggregate into a segment; this number is post-aggregation rows. Handoff will happen either if `maxRowsPerSegment` or `maxTotalRows` is hit or every `intermediateHandoffPeriod`, whichever happens earlier.|no (default == 5000000)|
|`maxTotalRows`|Integer|The number of rows to aggregate across all segments; this number is post-aggregation rows. Handoff will happen either if `maxRowsPerSegment` or `maxTotalRows` is hit or every `intermediateHandoffPeriod`, whichever happens earlier.|no (default == unlimited)|

@@ -281,6 +282,7 @@ public KafkaIndexTaskTest(boolean isIncrementalHandoffSupported)
);
}


Contributor:

Please remove this.

@@ -476,7 +478,7 @@ public void testIncrementalHandOff() throws Exception
}
final String baseSequenceName = "sequence0";
// as soon as any segment has more than one record, incremental publishing should happen
maxRowsPerSegment = 1;
maxRowsPerSegment = 2;
Contributor:

@clintropolis would you tell me why this change is needed?

Member Author:

This PR also matches the behavior of #6125, which changes a '>' to a '>=' and whose logic was moved here in this PR. So as not to modify the test a ton, I upped the count (since with the new behavior the test scenario would be pushing every row).

I updated the main description of the PR to reflect this. Thanks for the reminder, I forgot 👍

@@ -40,6 +40,7 @@
private final int maxRowsInMemory;
private final long maxBytesInMemory;
private final int maxRowsPerSegment;
private final Long maxTotalRows;
Contributor:

Please add @Nullable.


@JsonProperty
@Override
public Long getMaxTotalRows()
Contributor:

@Nullable.

private final long maxBytesInMemory;
private final int maxRowsPerSegment;
private final Long maxTotalRows;
Contributor:

@Nullable

@JsonProperty
public int getMaxRowsPerSegment()
{
return maxRowsPerSegment;
}

@Override
@JsonProperty
public Long getMaxTotalRows()
Contributor:

@Nullable

@@ -1485,6 +1485,7 @@ public long getMaxBytesInMemory()
}

@JsonProperty
@Override
Contributor:

Please add @Nullable here too.

Member Author:

Done (but the changed line is below this comment, so GitHub isn't collapsing this one in the UI).

sink.finishWriting();
if (sink.finishWriting()) {
// Decrement this sink's rows from rowsCurrentlyInMemory (we only count active sinks).
rowsCurrentlyInMemory.addAndGet(-sink.getNumRowsInMemory());
Contributor:

Hmm, is this valid? It looks like this would not be called if this is executed first.

@clintropolis (Member Author) commented:

@jihoonson do you have any additional comments? I believe I've addressed all of the existing review comments.

@jihoonson (Contributor) left a comment:

LGTM. @clintropolis thanks!

@clintropolis force-pushed the kafka-publish-total-rows branch from bcfd2d9 to e1af7cb on August 30, 2018 20:12