Add compaction task #4985
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -104,7 +104,7 @@ Tasks can have different default priorities depending on their types. Here are a | |
|---------|----------------| | ||
|Realtime index task|75| | ||
|Batch index task|50| | ||
|Merge/Append task|25| | ||
|Merge/Append/Compation task|25| | ||
Review comment: Compaction (spelling)
Reply: Fixed. |
||
|Other tasks|0| | ||
|
||
You can override the task priority by setting your priority in the task context like below. | ||
|
@@ -184,19 +184,6 @@ On the contrary, in the incremental publishing mode, segments are incrementally | |
|
||
To enable bulk publishing mode, `forceGuaranteedRollup` should be set in the TuningConfig. Note that this option cannot be used with either `forceExtendableShardSpecs` of TuningConfig or `appendToExisting` of IOConfig. | ||
|
||
### Task Context | ||
|
||
The task context is used for various task configuration parameters. The following parameters apply to all tasks. | ||
|
||
|property|default|description| | ||
|--------|-------|-----------| | ||
|taskLockTimeout|300000|task lock timeout in millisecond. For more details, see [the below Locking section](#locking).| | ||
|
||
<div class="note caution"> | ||
When a task acquires a lock, it sends a request via HTTP and awaits until it receives a response containing the lock acquisition result. | ||
As a result, an HTTP timeout error can occur if `taskLockTimeout` is greater than `druid.server.http.maxIdleTime` of overlords. | ||
</div> | ||
|
||
Segment Merging Tasks | ||
--------------------- | ||
|
||
|
@@ -210,7 +197,8 @@ Append tasks append a list of segments together into a single segment (one after | |
"id": <task_id>, | ||
"dataSource": <task_datasource>, | ||
"segments": <JSON list of DataSegment objects to append>, | ||
"aggregations": <optional list of aggregators> | ||
"aggregations": <optional list of aggregators>, | ||
"context": <task context> | ||
} | ||
``` | ||
|
||
|
@@ -228,7 +216,8 @@ The grammar is: | |
"dataSource": <task_datasource>, | ||
"aggregations": <list of aggregators>, | ||
"rollup": <whether or not to rollup data during a merge>, | ||
"segments": <JSON list of DataSegment objects to merge> | ||
"segments": <JSON list of DataSegment objects to merge>, | ||
"context": <task context> | ||
} | ||
``` | ||
|
||
|
@@ -245,10 +234,53 @@ The grammar is: | |
"dataSource": <task_datasource>, | ||
"aggregations": <list of aggregators>, | ||
"rollup": <whether or not to rollup data during a merge>, | ||
"interval": <DataSegment objects in this interval are going to be merged> | ||
"interval": <DataSegment objects in this interval are going to be merged>, | ||
"context": <task context> | ||
} | ||
``` | ||
|
||
### Compaction Task | ||
|
||
Compaction tasks merge all segments of the given interval. The syntax is: | ||
Review comment: I think this should include a segmentGranularity too. Unless your idea is that the
Reply: Yeah, all the segments of the
Review comment: I suggest adding two more sentences:
Reply: Added. |
||
|
||
```json | ||
{ | |
"type": "compact", | |
"id": <task_id>, | |
"dataSource": <task_datasource>, | |
"interval": <interval to specify segments to be merged>, | |
"tuningConfig": <index task tuningConfig>, | |
"context": <task context> | |
} | |
``` | ||
|
||
|Field|Description|Required| | ||
|-----|-----------|--------| | ||
|`type`|Task type. Should be `compact`|Yes| | ||
|`id`|Task id|No| | ||
|`dataSource`|dataSource name to be compacted|Yes| | ||
|`interval`|interval of segments to be compacted|Yes| | ||
|`tuningConfig`|[Index task tuningConfig](#tuningconfig)|No| | ||
|`context`|[Task context](#task-context)|No| | |
|
||
An example of a compaction task is: | |
|
||
```json | ||
{ | ||
"type" : "compact", | ||
"dataSource" : "wikipedia", | ||
"interval" : "2017-01-01/2018-01-01" | ||
} | ||
``` | ||
|
||
This compaction task merges _all segments_ of the interval `2017-01-01/2018-01-01`. | ||
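
For illustration, the same task with an explicit `tuningConfig` might look like the sketch below. The `tuningConfig` contents are an assumption: `type` and `maxRowsInMemory` are taken from the index task tuningConfig, which is not shown in this diff, and the value is arbitrary.

```json
{
  "type" : "compact",
  "dataSource" : "wikipedia",
  "interval" : "2017-01-01/2018-01-01",
  "tuningConfig" : {
    "type" : "index",
    "maxRowsInMemory" : 50000
  }
}
```

Because `tuningConfig` is optional, omitting it should leave the index task defaults in effect.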
|
||
A compaction task internally generates an indexTask spec for performing compaction work with some fixed parameters. | ||
Review comment: Probably more clear:
Reply: Done. |
||
For example, its `firehose` is always the [ingestSegmentSpec](./firehose.html) and `dimensionsSpec` and `metricsSpec` | ||
always include all dimensions and metrics of the input segments. | ||
|
||
Note that all input segments should have the same `queryGranularity` and `rollup`. See [Segment Metadata Queries](../querying/segmentmetadataquery.html#analysistypes) for more details. | ||
Review comment: What happens if they don't have consistent queryGranularity and rollup? (Docs should say and it should hopefully be reasonable, since this situation may happen in real life.)
Reply: Good point. It threw an exception before, but now it automatically checks and sets rollup if it is set for all input segments. |
||
|
||
Segment Destroying Tasks | ||
------------------------ | ||
|
||
|
@@ -261,7 +293,8 @@ Kill tasks delete all information about a segment and removes it from deep stora | |
"type": "kill", | ||
"id": <task_id>, | ||
"dataSource": <task_datasource>, | ||
"interval" : <all_segments_in_this_interval_will_die!> | ||
"interval" : <all_segments_in_this_interval_will_die!>, | ||
"context": <task context> | ||
} | ||
``` | ||
|
||
|
@@ -342,6 +375,21 @@ These tasks start, sleep for a time and are used only for testing. The available | |
} | ||
``` | ||
|
||
Task Context | ||
------------ | ||
|
||
The task context is used for various task configuration parameters. The following parameters apply to all task types. | ||
|
||
|property|default|description| | ||
|--------|-------|-----------| | ||
|taskLockTimeout|300000|Task lock timeout in milliseconds. For more details, see [the Locking section below](#locking).| | |
|priority|Different based on task types. See [Task Priority](#task-priority).|Task priority| | ||
|
||
<div class="note caution"> | ||
When a task acquires a lock, it sends a request via HTTP and waits until it receives a response containing the lock acquisition result. | |
As a result, an HTTP timeout error can occur if `taskLockTimeout` is greater than `druid.server.http.maxIdleTime` of overlords. | ||
</div> | ||
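
As a sketch of how these context parameters are passed, a compaction task could set both of them in its `context`. The values below are arbitrary and chosen only for illustration; they are not recommendations from the patch.

```json
{
  "type" : "compact",
  "dataSource" : "wikipedia",
  "interval" : "2017-01-01/2018-01-01",
  "context" : {
    "taskLockTimeout" : 120000,
    "priority" : 50
  }
}
```

Here `taskLockTimeout` is given in milliseconds and `priority` replaces the default of 25 listed for compaction tasks.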
|
||
Locking | ||
------- | ||
|
||
|
Good catch.