Behavior of index_parallel with appendToExisting=false and no bucketIntervals in GranularitySpec is surprising #6989
OK, I think I figured this out. Will test, but then will transform this issue into a docs suggestion and file a PR to fix.
What I was missing here is that native batch parallel ingestion effectively acts as if appendToExisting is true unless you specify explicit intervals in the GranularitySpec. This seems to be different from both Hadoop batch ingestion and the Local Index Task. This confused me. I'd like to help fix it! I think we should consider the current behavior a bug, and the top-level parallel index task should error if appendToExisting is false and no intervals are specified in the GranularitySpec.
While this would be a backwards-incompatible change in 0.14, native batch ingestion is still a very new feature, this behavior is very surprising, and there's a trivial workaround of setting appendToExisting to true if you like the current behavior. If that's not the right change, we could fix the docs instead: I'd update the documentation of appendToExisting in native_tasks.md to mention that it is effectively true if intervals aren't specified. (I suppose one could also make parallel indexing do two scans in this case, but I certainly would have been happier being asked to add one line to my spec than having my ingestion take twice as long, and two scans would be more complex.) I'm happy to implement either the new error or the docs update, whichever is best.
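For concreteness, the explicit-intervals case looks roughly like the sketch below. This is a simplified illustration following the 0.13 native batch docs, not a complete spec: the dataSource name is a placeholder, and the firehose, parser, and tuningConfig are omitted.

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "NONE",
        "intervals": ["2018-12-16/2018-12-17"]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "appendToExisting": false
    }
  }
}
```

With intervals listed up front, the task can determine its locks at submission time, and appendToExisting: false overwrites any existing segments in those intervals; without intervals, you currently get the append-like behavior described above.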
@glasser thank you for finding this! I agree with you that the behavior of indexParallelTask is supposed to be the same as (or at least similar to) indexTask or hadoopIndexTask. So, I think this is a bug: indexParallelTask is expected to overwrite existing segments unless appendToExisting is true. I think it's still possible to avoid another scan even if intervals aren't specified.
So, this would be mostly about allocating segmentIds and getting task locks. I think it would be better to modify the task to acquire locks dynamically. In summary, we may want to change this block to allocate segments and request locks at runtime instead of up front. What do you think?
I think advising about changing locking semantics is beyond my understanding of Druid (which expanded immensely today while researching this issue). What you describe seems reasonable, but I don't feel like I can confidently say that changing from a static set of locks determined at the beginning of the task to a dynamic set of locks is safe. That said, I could try to implement your suggestion even if I can't evaluate its correctness!
Thank you! Looking forward to your PR. I think it's safe to get locks dynamically. The indexTask already gets locks one by one per interval, though it happens in a tight loop.
Hi @jihoonson. A couple of random questions while I work on this:
I would recommend using IntelliJ or any other IDE you prefer. If you want to do it in the terminal, you can run …
I think it's because of #4550. Those classes were written before #4550, and at that time there was no concept of revoking locks. As a result, if two or more tasks got locks one by one dynamically, they might get stuck in the middle of ingestion; it could even cause a deadlock if they blocked each other. However, now higher-priority tasks can preempt lower-priority tasks.
Good question. I don't think we currently support that kind of replacement, but maybe it's worth supporting.
I started! Another question about locking. In allocateNewSegment there's this comment:
But if I am reading the implementation of TaskLockbox and LockListAction correctly, revoked locks will show up in the response to a LockListAction. Should that stream have a filter that excludes revoked locks?
@glasser sorry, just saw your last question. I left some comments in the PR.
index_parallel: support !appendToExisting with no explicit intervals (#7046)

This enables ParallelIndexSupervisorTask to dynamically request locks at runtime if it is run without explicit intervals in the granularity spec and with appendToExisting set to false. Previously, it behaved as if appendToExisting was set to true, which was undocumented and inconsistent with IndexTask and Hadoop indexing.

Also, when ParallelIndexSupervisorTask allocates segments in the explicit interval case, fail if its locks on the interval have been revoked.

Also make a few other additions/clarifications to native ingestion docs.

Fixes #6989.

* Review feedback. PR description on GitHub updated to match.
* Make native batch ingestion partitions start at 0
* Fix to previous commit
* Unit test. Verified to fail without the other commits on this branch.
* Another round of review
* Slightly scarier warning
We're experimenting with native batch ingestion on our 0.13-incubating cluster for the first time (with a custom firehose reading from files saved to GCS by Secor, and a custom InputRowParser).
There was a period of a week where the data source had no data. We ran batch ingestion (index_parallel) over one particular hour (5am-6am on December 16th) and it successfully ingested that hour — a segment showed up in the coordinator, it could be queried, etc. (Our segment granularity is HOUR.)
Then we ran it again on the entire 24 hours of December 16th. It ran 24 subtasks (our firehose divides up by hour) and ingested the full day, yay!
Except when we look in the coordinator, it now lists 2 segments with identical sizes for the 5am hour that we first tested with. Also, both of them are listed with the same version, which is from the first batch ingestion, not the version that the other 23 segments have from the second batch ingestion.
We did not explicitly specify appendToExisting in our ioConfig, but I believe the default is false, and looking at the task payload it is expanded to false.
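For reference, a simplified sketch of the relevant parts of the spec (not the exact spec: the dataSource name is a placeholder and our custom firehose, parser, and metrics are omitted); note that the granularitySpec has no intervals and the ioConfig does not set appendToExisting:

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR"
      }
    },
    "ioConfig": {
      "type": "index_parallel"
    }
  }
}
```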
Are we doing something wrong if our goal is to replace existing segments? Isn't that what
appendToExisting: false
should do?

The bad hour in the coordinator:
The good hour: