Option to configure default analysis types in SegmentMetadataQuery #4259

Merged (23 commits) on May 26, 2017

Conversation


@gkc2104 (Contributor) commented May 9, 2017:

#3773

Now that defaultAnalysisTypes is being updated, it would be easier for users to upgrade to 0.10.0 if they could easily revert to the old behavior without having to update all the SegmentMetadataQueries they use.

I suggest not only providing an option to revert to the old behavior, but also letting users specify their own defaultAnalysisTypes using druid.query.segmentMetadata.defaultAnalysisType.
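For example, the broker's runtime.properties could pin an explicit set cluster-wide (illustrative values; the property name matches the docs change reviewed below):

# Illustrative broker runtime.properties entry
druid.query.segmentMetadata.defaultAnalysisTypes=["cardinality", "interval", "minmax"]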


@JsonProperty
private Period defaultHistory = ISO_FORMATTER.parsePeriod(DEFAULT_PERIOD_STRING);

@JsonProperty
private EnumSet<SegmentMetadataQuery.AnalysisType> defaultAnalysisType = DEFAULT_ANALYSIS_TYPES;

Member:

Suggested defaultAnalysisTypes (update getter and setter names too)

@gianm (Contributor) left a comment:

I think with the way this is currently written, all nodes involved in a query (broker, historical, etc) will apply their own defaultAnalysisTypes, meaning we'll get weird behavior if different nodes have different values. To fix that, the broker should rewrite the query to include explicit analysis types, to prevent the historicals/etc from filling in their own (possibly different) set. If that's already happening somewhere then I missed it.

One place that could be done is in SegmentMetadataQueryQueryToolChest.mergeResults, it gets called pretty early in the query pipeline and can do rewrites like that. If you make this change then you should be able to revert the cache strategy back to the old code, since the caching layer won't activate until after the query is rewritten, and it can assume analysisTypes is set to an explicit value. As a sanity check you could have the cache strategy throw some kind of IllegalArgumentException if it's called on a query where analysisTypes is still null.
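A minimal, self-contained sketch of that idea (the class, enum, and method names here are illustrative, not the PR's actual code):

import java.util.EnumSet;

class BrokerRewriteSketch
{
  enum AnalysisType { CARDINALITY, SIZE, INTERVAL, MINMAX }

  // Broker-side: substitute the configured default when the user left analysisTypes null,
  // so the query that fans out to historicals always carries an explicit set.
  static EnumSet<AnalysisType> finalizeAnalysisTypes(
      EnumSet<AnalysisType> fromQuery,
      EnumSet<AnalysisType> configDefault
  )
  {
    return fromQuery != null ? fromQuery : configDefault;
  }

  public static void main(String[] args)
  {
    EnumSet<AnalysisType> defaults = EnumSet.of(AnalysisType.CARDINALITY, AnalysisType.INTERVAL, AnalysisType.MINMAX);
    System.out.println(finalizeAnalysisTypes(null, defaults));                          // broker fills in its defaults
    System.out.println(finalizeAnalysisTypes(EnumSet.of(AnalysisType.SIZE), defaults)); // an explicit value wins
  }
}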

public EnumSet<SegmentMetadataQuery.AnalysisType> getAnalysisTypes(SegmentMetadataQuery query)
{
  if (query.getAnalysisTypes() == null) {
    return config != null ? config.getDefaultAnalysisTypes() : SegmentMetadataQueryConfig.DEFAULT_ANALYSIS_TYPES;

Contributor:

It seems like handling null config shouldn't be necessary here. It shouldn't be null in production. If it is null sometimes in testing, how about replacing those spots with new SegmentMetadataQueryConfig() to get a default config? Then, SegmentMetadataQueryConfig.DEFAULT_ANALYSIS_TYPES could be private.
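A sketch of how the method might then read (assuming config can no longer be null):

public EnumSet<SegmentMetadataQuery.AnalysisType> getAnalysisTypes(SegmentMetadataQuery query)
{
  // config is never null in production; tests would pass new SegmentMetadataQueryConfig().
  if (query.getAnalysisTypes() == null) {
    return config.getDefaultAnalysisTypes();
  }
  return query.getAnalysisTypes();
}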

@gkc2104 (Contributor, Author) commented May 9, 2017:

Given that the config is specific to the broker's runtime.properties, would other nodes like historicals be able to have their own SegmentMetadataQueryConfig, since that is where it gets set? That would also mean they would be using different ToolChests. Is it intended for them to be able to set a different defaultHistory as well?

It has been reimplemented so that SegmentMetadataQueryQueryToolChest does not rely on the query for analysisTypes. I extract it from the ToolChest in
https://github.com/metamx/druid/blob/defaultAnalysisUpdate/processing/src/main/java/io/druid/query/metadata/SegmentMetadataQueryRunnerFactory.java#L93
and use that everywhere analysisTypes is required in the QueryRunnerFactory, falling back to the config's analysisTypes when they are not in the query.

@@ -86,6 +86,7 @@ See [groupBy server configuration](../querying/groupbyquery.html#server-configur
|Property|Description|Default|
|--------|-----------|-------|
|`druid.query.segmentMetadata.defaultHistory`|When no interval is specified in the query, use a default interval of defaultHistory before the end time of the most recent segment, specified in ISO8601 format. This property also controls the duration of the default interval used by GET /druid/v2/datasources/{dataSourceName} interactions for retrieving datasource dimensions/metrics.|P1W|
|`druid.query.segmentMetadata.defaultAnalysisTypes`|Sets the default analysis types for all segment metadata queries; this can be overridden when making the query.|[CARDINALITY, INTERVAL, MINMAX]|

Member:

The default value here should be updated to ["cardinality", "interval", "minmax"]. Also, please point to this config from the Segment Metadata Queries docs, line 35.

@fjy removed this from the 0.10.1 milestone (May 9, 2017)
@leventov (Member):

@fjy this PR actually blocks us from updating to 0.10.0 (not merely to 0.10.1), so it fits the criteria that I suggested before, unlike many other PRs that currently target 0.10.1 and don't block anybody from updating to 0.10.0.

@drcrallen (Contributor) commented May 10, 2017:

I think @gianm is proposing that the broker overwrite the analysis types if null is passed in, and if someone sets druid.query.segmentMetadata.defaultAnalysisType differently on historicals than brokers, then the broker setting should be enforced. If someone queries historicals directly, then caveat emptor. Or in other words, druid.query.segmentMetadata.defaultAnalysisType should never do anything on historicals unless someone is querying the historicals directly.

@gianm (Contributor) commented May 10, 2017:

Yeah, I meant what @drcrallen said.

In particular I'm suggesting that if the broker receives a query without analysisTypes set, it should rewrite the query to include analysisTypes before it passes it down to other nodes. Otherwise different nodes involved in the query might not agree on what the analysisTypes are, which could cause weird results.

@gkc2104 (Contributor, Author) commented May 11, 2017:

I'm not sure if this was the best way of doing it, but I made the analysisTypes field of SegmentMetadataQuery non-final and provided a setter, which is used to update the query in all the overridden methods of QueryToolChest, not only mergeResults. I also updated the cache strategy and the other methods that relied on analysisTypes being provided by the query to use the defaulted value.

@@ -166,12 +169,66 @@ public String getType()
return analysisTypes;
}

public void setAnalysisTypes(EnumSet<AnalysisType> analysisTypes)

Member:

Query objects should be immutable. Please replace this method with "withAnalysisTypes()", returning a new query object with the given analysis types. Also update SegmentMetadataQueryBuilder to include a new field.
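A minimal sketch of the requested pattern, using placeholder types rather than the actual query class:

import java.util.EnumSet;

final class ImmutableQuerySketch
{
  enum AnalysisType { CARDINALITY, INTERVAL, MINMAX }

  private final EnumSet<AnalysisType> analysisTypes; // stays final: no setter

  ImmutableQuerySketch(EnumSet<AnalysisType> analysisTypes)
  {
    this.analysisTypes = analysisTypes;
  }

  // Instead of setAnalysisTypes(...): return a copy with the field replaced.
  ImmutableQuerySketch withAnalysisTypes(EnumSet<AnalysisType> analysisTypes)
  {
    return new ImmutableQuerySketch(analysisTypes);
  }
}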

@@ -107,12 +108,14 @@ public SegmentMetadataQueryQueryToolChest(SegmentMetadataQueryConfig config, Gen
Map<String, Object> context
)
{
-Query<SegmentAnalysis> query = queryPlus.getQuery();
+SegmentMetadataQuery castedQuery = (SegmentMetadataQuery) queryPlus.getQuery();
+SegmentMetadataQuery updatedQuery = (SegmentMetadataQuery) (castedQuery.withAnalysisTypes(getFinalAnalysisTypes(castedQuery)));

Member:

withAnalysisTypes() could return SegmentMetadataQuery, so the cast would not be needed.

Member:

Suggested joining those methods into a single method like SegmentMetadataQuery.withFinalizedAnalysisTypes(toolChest), for less boilerplate, because currently they are always used together. Also, this method could avoid creating a new query object and return the same instance if analysisTypes are already set.


public EnumSet<SegmentMetadataQuery.AnalysisType> getFinalAnalysisTypes(SegmentMetadataQuery query)
{
  if (query.getAnalysisTypes() == null) {

Member:

Suggested Objects.firstNonNull()
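With Guava, the null check collapses to a single expression, roughly (a sketch; Objects.firstNonNull was later superseded by MoreObjects.firstNonNull):

return Objects.firstNonNull(query.getAnalysisTypes(), config.getDefaultAnalysisTypes());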

return new BinaryFn<SegmentAnalysis, SegmentAnalysis, SegmentAnalysis>()
{
-private final SegmentMetadataQuery query = (SegmentMetadataQuery) inQ;
+private final SegmentMetadataQuery query = updatedQuery;

Member:

Why is this field needed?

@@ -121,7 +124,10 @@ public SegmentMetadataQueryQueryToolChest(SegmentMetadataQueryConfig config, Gen
@Override
protected Ordering<SegmentAnalysis> makeOrdering(Query<SegmentAnalysis> query)

Member:

There seems to be no point in having this anonymous QueryRunner extend ResultMergeQueryRunner rather than BySegmentSkippingQueryRunner directly, because it overrides doRun(). Then this QueryRunner wouldn't need to follow the API of ResultMergeQueryRunner; it could accept SegmentMetadataQuery instead of Query<SegmentAnalysis> in makeOrdering() and createMergeFn(), to avoid casting.

@@ -121,7 +124,10 @@ public SegmentMetadataQueryQueryToolChest(SegmentMetadataQueryConfig config, Gen
@Override
protected Ordering<SegmentAnalysis> makeOrdering(Query<SegmentAnalysis> query)
{
-if (((SegmentMetadataQuery) query).isMerge()) {
+SegmentMetadataQuery castedQuery = (SegmentMetadataQuery) query;
+SegmentMetadataQuery updatedQuery = (SegmentMetadataQuery) (castedQuery.withAnalysisTypes(getFinalAnalysisTypes(castedQuery)));

Member:

This is already done in the upstream doRun() call.

return this;
}


Member:

Extra line

@@ -178,19 +187,21 @@ public SegmentAnalysis apply(SegmentAnalysis arg1, SegmentAnalysis arg2)
@Override
public CacheStrategy<SegmentAnalysis, SegmentAnalysis, SegmentMetadataQuery> getCacheStrategy(final SegmentMetadataQuery query)
{

Member:

Extra line

return new CacheStrategy<SegmentAnalysis, SegmentAnalysis, SegmentMetadataQuery>()
{
@Override
-public boolean isCacheable(SegmentMetadataQuery query, boolean willMergeRunners)
+public boolean isCacheable(SegmentMetadataQuery updatedQuery, boolean willMergeRunners)

Member:

Shouldn't be renamed?

@@ -235,7 +246,8 @@ public SegmentAnalysis apply(@Nullable SegmentAnalysis input)
@Override
public <T extends LogicalSegment> List<T> filterSegments(SegmentMetadataQuery query, List<T> segments)
{
if (!query.isUsingDefaultInterval()) {
SegmentMetadataQuery updatedQuery = (SegmentMetadataQuery) query.withAnalysisTypes(getFinalAnalysisTypes(query));

Member:

This is pointless, because it doesn't affect isUsingDefaultInterval.

@@ -962,6 +962,7 @@ public static ResultBuilder newResultBuilder()
private Boolean merge;
private Boolean lenientAggregatorMerge;
private Map<String, Object> context;
private Boolean usingDefaultInterval;

Member:

useDefaultInterval seems to be an unnecessary configuration that allows inconsistency: you can pass useDefaultInterval=false together with a querySegmentSpec that actually represents the default interval to the SegmentMetadataQuery constructor.

I suggest the following plan:

  • Don't add usingDefaultInterval to this builder
  • Leave usingDefaultInterval parameter of SegmentMetadataQuery for compatibility, but ignore it, and document the fact that it is going to be removed.
  • In the constructor, set usingDefaultInterval=true if querySegmentSpec == null, or if querySegmentSpec is not null and has just one interval which equals the default interval.
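A sketch of that constructor logic, with names taken from the surrounding discussion (illustrative, not the merged code):

if (querySegmentSpec == null) {
  this.usingDefaultInterval = true;
} else {
  List<Interval> intervals = querySegmentSpec.getIntervals();
  this.usingDefaultInterval = intervals.size() == 1 && intervals.get(0).equals(DEFAULT_INTERVAL);
}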

{
return new BinaryFn<SegmentAnalysis, SegmentAnalysis, SegmentAnalysis>()
{
-private final SegmentMetadataQuery query = (SegmentMetadataQuery) inQ;
+private final SegmentMetadataQuery query = inQ;

Member:

This field is not needed

),
MERGE_TRANSFORM_FN
);
}

@Override
-protected Ordering<SegmentAnalysis> makeOrdering(Query<SegmentAnalysis> query)
+protected Ordering<SegmentAnalysis> makeOrdering(SegmentMetadataQuery query)

Member:

Could be private

@@ -138,12 +138,11 @@ public int compare(
return query.getResultOrdering(); // No two elements should be equal, so it should never merge
}

@Override
-protected BinaryFn<SegmentAnalysis, SegmentAnalysis, SegmentAnalysis> createMergeFn(final Query<SegmentAnalysis> inQ)
+protected BinaryFn<SegmentAnalysis, SegmentAnalysis, SegmentAnalysis> createMergeFn(final SegmentMetadataQuery inQ)

Member:

Could be private

import io.druid.query.spec.MultipleIntervalSegmentSpec;
import io.druid.query.spec.QuerySegmentSpec;
import jdk.nashorn.internal.ir.annotations.Ignore;

Member:

Don't use an unrelated annotation.

@@ -106,7 +102,7 @@ public SegmentMetadataQuery(
@JsonProperty("merge") Boolean merge,
@JsonProperty("context") Map<String, Object> context,
@JsonProperty("analysisTypes") EnumSet<AnalysisType> analysisTypes,
@JsonProperty("usingDefaultInterval") Boolean useDefaultInterval,
@Ignore @JsonProperty("usingDefaultInterval") Boolean useDefaultInterval,

Member:

Please document in a simple comment that this parameter is ignored; also add a note that it is going to be removed, and is left now only for the sake of compatibility.

@@ -121,11 +117,17 @@ public SegmentMetadataQuery(
if (querySegmentSpec == null) {
this.usingDefaultInterval = true;
} else {
-this.usingDefaultInterval = useDefaultInterval == null ? false : useDefaultInterval;
+if (querySegmentSpec.getIntervals().size() == 1 && querySegmentSpec.getIntervals()

Member:

Missing .get(0)?

Member:

Also please add a test for this

Member:

Please address this. Also test that if the default interval is explicitly specified via the query builder (rather than left as null, which will be replaced with the default interval in SegmentMetadataQuery's constructor), then isDefaultInterval is correctly set to true.

Member:

Ok, I somehow missed .get(0) on the next line. The formatting is strange; suggested:

if (querySegmentSpec.getIntervals().size() == 1 &&
    querySegmentSpec.getIntervals().get(0).equals(DEFAULT_INTERVAL)) {

@@ -255,6 +256,14 @@ public boolean hasMinMax()
return Druids.SegmentMetadataQueryBuilder.copy(this).toInclude(includerator).build();
}

public SegmentMetadataQuery withFinalizedAnalysisTypes(SegmentMetadataQueryConfig config)
{
return Druids.SegmentMetadataQueryBuilder

Member:

Don't create a new query object if analysisTypes are already non-null; just return this.
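Incorporating that, the method might look like this sketch (the builder's analysisTypes setter is assumed to exist):

public SegmentMetadataQuery withFinalizedAnalysisTypes(SegmentMetadataQueryConfig config)
{
  if (analysisTypes != null) {
    return this; // already explicit: no need for a new query object
  }
  return Druids.SegmentMetadataQueryBuilder
               .copy(this)
               .analysisTypes(config.getDefaultAnalysisTypes())
               .build();
}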

@leventov (Member):

@gkc2104 BTW, is it valid that SegmentMetadataQueryQueryToolChest.filterSegments() doesn't use the intervals of the query?

@gkc2104 (Contributor, Author) commented May 11, 2017:

@leventov Agreed, it builds a new interval using defaultHistory and the max segment time.
I added a failing test that shows this as well. If you feel the filtering is being handled incorrectly, I could update it in this PR itself.

@@ -1077,6 +1081,33 @@ public Interval getInterval()
for (int i = 0; i < filteredSegments2.size(); i++) {
Assert.assertEquals(expectedSegments2.get(i).getInterval(), filteredSegments2.get(i).getInterval());
}

Contributor Author:

@leventov added the isUsingDefaultInterval handling in here, along with a failing test (because of the filterSegments implementation). What should I do about filterSegments, though?

@gkc2104 (Contributor, Author) commented May 12, 2017:

Updated the filterSegments implementation to take the query's intervals into consideration, instead of returning everything after defaultHistory and nothing before it even when overridden by the query.

final Interval targetInterval = new Interval(config.getDefaultHistory(), targetEnd);
List<Interval> intervals = query.getIntervals();

DateTime queryStartTime = JodaUtils.ETERNITY.getEnd();

Member:

In other parts of the Druid codebase new DateTime(JodaUtils.MAX_INSTANT) is used

List<Interval> intervals = query.getIntervals();

DateTime queryStartTime = JodaUtils.ETERNITY.getEnd();
DateTime queryEndTIme = JodaUtils.ETERNITY.getStart();

Member:

queryEndTime

DateTime queryStartTime = JodaUtils.ETERNITY.getEnd();
DateTime queryEndTIme = JodaUtils.ETERNITY.getStart();

for (Interval interval : intervals) {

Member:

Consider

queryStartTime = intervals.stream().map(Interval::getEnd)
    .max(Ordering.natural()).orElseThrow(IllegalStateException::new);
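For reference, the same stream idiom for both bounds might look like this (a sketch; note the start bound takes the minimum of the interval starts):

DateTime queryStartTime = intervals.stream().map(Interval::getStart)
    .min(Ordering.natural()).orElseThrow(IllegalStateException::new);
DateTime queryEndTime = intervals.stream().map(Interval::getEnd)
    .max(Ordering.natural()).orElseThrow(IllegalStateException::new);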

@leventov added this to the 0.10.1 milestone (May 16, 2017)
@leventov (Member):

@fjy restoring 0.10.1 milestone because it's an important change that restores compatibility of segment metadata queries.

@gianm @drcrallen could you please review the design again?

@gianm (Contributor) left a comment:

Looks good except for the defaultInterval stuff. I'm not sure how that got roped into this patch, but I think it's fine the way it is and doesn't need to be changed.

@leventov wrote,

useDefaultInterval seems to be an unnecessary configuration that allows inconsistency: you can pass useDefaultInterval=false together with a querySegmentSpec that actually represents the default interval to the SegmentMetadataQuery constructor.

I suggest the following plan:

  • Don't add usingDefaultInterval to this builder
  • Leave usingDefaultInterval parameter of SegmentMetadataQuery for compatibility, but ignore it, and document the fact that it is going to be removed.
  • In the constructor, set usingDefaultInterval=true if querySegmentSpec == null, or if querySegmentSpec is not null and has just one interval which equals the default interval.

See #1732 (comment) for the original rationale about why this exists. I guess it was confusing, so we should add a comment, but it does serve a purpose.

It definitely shouldn't be something a user can provide in the builder. Its only purpose is to help "remember" whether or not a user provided intervals. The idea is that we wanted providing null and explicitly providing the default interval to have different behavior: null uses druid.query.segmentMetadata.defaultHistory prior to maxTime, while explicitly specifying the interval from DEFAULT_INTERVAL will do exactly what the user asked for and use that specific interval, not the one based on druid.query.segmentMetadata.defaultHistory.
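In builder terms the intended contrast is roughly the following (a sketch; the builder method names are assumptions, not verified against the codebase):

// 1) No intervals given: the broker later applies defaultHistory before maxTime.
SegmentMetadataQuery implicitInterval = Druids.newSegmentMetadataQueryBuilder()
    .dataSource("wikipedia")
    .build();

// 2) DEFAULT_INTERVAL passed explicitly: that exact interval is used as given.
SegmentMetadataQuery explicitInterval = Druids.newSegmentMetadataQueryBuilder()
    .dataSource("wikipedia")
    .intervals(new MultipleIntervalSegmentSpec(ImmutableList.of(SegmentMetadataQuery.DEFAULT_INTERVAL)))
    .build();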

@@ -235,18 +236,27 @@ public SegmentAnalysis apply(@Nullable SegmentAnalysis input)
@Override
public <T extends LogicalSegment> List<T> filterSegments(SegmentMetadataQuery query, List<T> segments)

Contributor:

What's the purpose of the changes in this method? They don't seem related to the rest of the patch.

Contributor Author:

Yes, these are unrelated to the changes proposed in this patch; I'll open another PR for this.
The current implementation does not take the time range into consideration while filtering; I added an improvement to only include segments that are within the intervals specified by the query.

Contributor:

> The current implementation does not take the time range into consideration while filtering; I added an improvement to only include segments that are within the intervals specified by the query.

filterSegments doesn't need to worry about that, since the list of segments it gets has already been pruned down to whatever matches the intervals filter. In fact, many implementations just return segments with no changes -- check out timeseries for an example. filterSegments only has to potentially filter the list down even further, if it wants (see timeBoundary for another example of that). I suppose the javadocs for filterSegments could be clearer on this.

Please also see #4259 (review) for rationale about the current behavior of defaultInterval.
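For comparison, the timeseries-style no-op described above is simply (a sketch):

@Override
public <T extends LogicalSegment> List<T> filterSegments(SegmentMetadataQuery query, List<T> segments)
{
  // The list is already pruned to the query's intervals before this is called.
  return segments;
}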

Contributor Author:

I was under the impression that this method was called to prune a list of all segments down to only the required ones; I didn't know the list was already pruned. Now that I look at the examples you suggested, I understand why it is needed even though we already prune.

Re: defaultInterval - I had some failing tests because of defaultInterval. I don't remember the exact test case, but reverting to the old behavior did not break any tests (probably because the configs are no longer null), so we can keep the implementation that's on master.

I excluded these changes from this PR.

@@ -87,7 +87,8 @@ public SegmentMetadataQueryRunnerFactory(
public Sequence<SegmentAnalysis> run(QueryPlus<SegmentAnalysis> inQ, Map<String, Object> responseContext)
{
SegmentMetadataQuery query = (SegmentMetadataQuery) inQ.getQuery();
final SegmentAnalyzer analyzer = new SegmentAnalyzer(query.getAnalysisTypes());
SegmentMetadataQuery updatedQuery = query.withFinalizedAnalysisTypes(toolChest.getConfig());

Contributor:

Might as well do this and the cast in one line to avoid accidentally referencing query instead of updatedQuery.
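That is, something like this sketch:

SegmentMetadataQuery updatedQuery =
    ((SegmentMetadataQuery) inQ.getQuery()).withFinalizedAnalysisTypes(toolChest.getConfig());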

@@ -106,6 +101,7 @@ public SegmentMetadataQuery(
@JsonProperty("merge") Boolean merge,
@JsonProperty("context") Map<String, Object> context,
@JsonProperty("analysisTypes") EnumSet<AnalysisType> analysisTypes,
// useDefaultInterval will be removed, but is left for now for compatibility

Contributor:

What's wrong with useDefaultInterval?

@gianm (Contributor) left a comment:

LGTM.

@gianm (Contributor) commented May 25, 2017:

Needs one more design review.

@drcrallen merged commit dcb07d6 into apache:master (May 26, 2017)
@drcrallen deleted the defaultAnalysisUpdate branch (May 26, 2017 19:12)