Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add long/float ColumnSelectorStrategy implementations #3838

Merged
merged 23 commits into from
Feb 9, 2017

Conversation

jon-wei
Copy link
Contributor

@jon-wei jon-wei commented Jan 11, 2017

This PR adds long and float implementations of the query processing ColumnSelectorStrategy interface introduced in #3570.

This allows users to include non-String columns as dimensions within queries by specifying a typeHints map in the query context. This is described in more detail in the doc changes within the PR.

For benchmarking, the patch also adds a set of "simple" schemas to the benchmark schemas, which has a single dimension "dimSequential" and a row count aggregator, where "dimSequential" contains an ascending sequence of integer values, encoded as String/Long/Float. Results for the benchmarks are located at:

https://gist.github.com/jon-wei/44a2b8ab97678710a9ef5b24aa107dec


NOTE:

  • When applying extraction functions to a float/long, the resulting DimensionSelector (exFns can only output Strings currently) constructs an on-the-fly dictionary without limits on memory usage, this may be problematic if an extraction function's set of output values has high cardinality. This is needed since query engines rely on the existence of a dictionary when handling Strings, I haven't thought of a better way to handle this yet.

NOTE:

Benchmark                                                                 (defaultStrategy)  (initialBuckets)  (numProcessingThreads)  (numSegments)  (queryGranularity)  (rowsPerSegment)  Mode  Cnt      Score     Error  Units
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexFloatOnly                         v2                -1                       4              4                 all            250000  avgt   50  17895.755 ± 136.289  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexFloatThenLong                     v2                -1                       4              4                 all            250000  avgt   50  35926.555 ± 110.157  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexFloatThenString                   v2                -1                       4              4                 all            250000  avgt   50  44480.465 ± 398.283  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexFloatTwice                        v2                -1                       4              4                 all            250000  avgt   50  35673.128 ±  40.837  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexLongOnly                          v2                -1                       4              4                 all            250000  avgt   50  17920.575 ±  46.517  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexLongThenString                    v2                -1                       4              4                 all            250000  avgt   50  46077.899 ± 470.782  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexLongThenFloat                     v2                -1                       4              4                 all            250000  avgt   50  35986.157 ±  69.486  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexLongTwice                         v2                -1                       4              4                 all            250000  avgt   50  35938.898 ±  70.856  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexNumericOnly                       v2                -1                       4              4                 all            250000  avgt   50  20979.666 ±  51.640  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexNumericThenString                 v2                -1                       4              4                 all            250000  avgt   50  52961.866 ± 438.055  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexStringOnly                        v2                -1                       4              4                 all            250000  avgt   50  22801.356 ± 238.549  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexStringThenLong                    v2                -1                       4              4                 all            250000  avgt   50  44720.358 ± 454.617  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexStringThenNumeric                 v2                -1                       4              4                 all            250000  avgt   50  51805.316 ± 408.565  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexStringTwice                       v2                -1                       4              4                 all            250000  avgt   50  46402.020 ± 657.067  us/op

This set of benchmarks attempts determine the running time for a group by query against single string/long/float columns (monomorphic base case), and running times for sequences of queries that use more than one type implementation.

For example, StringThenLong runs the string query followed by the long query within the same benchmark.

There appears to be some overhead when multiple type implementations are called, e.g.:

LongOnly = 17920.575us
StringOnly = 22801.356us

which should sum to ~40000us for both.

StringThenLong however has a running time of 444720.358us, about 10% higher than expected.

The difference bigger for NumericOnly (query on both a long and float) and NumericThenString:

NumericOnly = 20979.666
StringOnly = 22801.356

Sum should be ~44000, but NumericThenString and StringThenNumeric are 52961.866 and 51805.316 respectively.

The difference between StringOnly and StringTwice is smaller, suggesting the overhead may be related to the polymorphism. However, the behavior is not seen for long/float only, e.g. LongTwice is very close to LongThenFloat and (LongOnly + FloatOnly)


Similar benchmarks for TopN don't show the same discrepancy:

Benchmark                                                            (numSegments)  (rowsPerSegment)  (schemaAndQuery)  (threshold)  Mode  Cnt      Score      Error  Units
TopNTypeInterfaceBenchmark.querySingleQueryableIndexFloatOnly                    1            250000           basic.A           10  avgt   50  41916.788 ±  590.935  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexFloatThenLong                1            250000           basic.A           10  avgt   50  84411.958 ±  446.999  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexFloatThenString              1            250000           basic.A           10  avgt   50  94284.029 ± 1395.548  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexFloatTwice                   1            250000           basic.A           10  avgt   50  85390.807 ±  934.008  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexLongOnly                     1            250000           basic.A           10  avgt   50  43835.461 ±  612.863  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexLongThenFloat                1            250000           basic.A           10  avgt   50  86043.197 ±  597.789  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexLongThenString               1            250000           basic.A           10  avgt   50  87899.020 ±  520.239  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexLongTwice                    1            250000           basic.A           10  avgt   50  86441.232 ±  635.296  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexStringOnly                   1            250000           basic.A           10  avgt   50  50487.063 ±  402.854  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexStringThenFloat              1            250000           basic.A           10  avgt   50  90925.010 ±  731.045  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexStringThenLong               1            250000           basic.A           10  avgt   50  91756.314 ±  741.951  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexStringTwice                  1            250000           basic.A           10  avgt   50  98454.496 ±  996.865  us/op

I am looking into the GroupBy performance discrepancy more and considering other possible approaches (e.g., some form of code generation), but opening this PR now to get some thoughts from the community on these aspects.

@jon-wei jon-wei added this to the 0.10.0 milestone Jan 11, 2017
@leventov leventov self-assigned this Jan 11, 2017
@jon-wei jon-wei force-pushed the dimtype3 branch 2 times, most recently from 6919b7f to 24394e3 Compare January 11, 2017 20:36
@jon-wei jon-wei closed this Jan 11, 2017
@jon-wei jon-wei reopened this Jan 11, 2017
@gianm
Copy link
Contributor

gianm commented Jan 19, 2017

When applying extraction functions to a float/long, the resulting DimensionSelector (exFns can only output Strings currently) constructs an on-the-fly dictionary without limits on memory usage, this may be problematic if an extraction function's set of output values has high cardinality. This is needed since query engines rely on the existence of a dictionary when handling Strings, I haven't thought of a better way to handle this yet.

SingleScanTimeDimSelector has the same problem. IMO, the query engines should maintain String dictionaries on their own if the selector doesn't have one, which it can report by returning CARDINALITY_UNKNOWN from getValueCardinality(). The query engines are in the best position to manage memory in an appropriate way. I don't think you need to do this as part of this PR, but, I think it's the best way to go ultimately.

@gianm
Copy link
Contributor

gianm commented Jan 19, 2017

Based on GroupByTypeInterfaceBenchmark, there is some virtual method overhead seen in GroupBy V2, like the concern discussed in this thread: #3797 (comment).

StringThenLong however has a running time of 444720.358us, about 10% higher than expected.

Do you have benchmarks on grouping on high-cardinality numeric columns as strings vs. as longs? (same data, but stored/grouped as string vs. stored/grouped as long). I suspect that this will be a much bigger difference than 10%, and if so, given that:

  • there is a lot of performance benefit on both topN and groupBy in grouping numeric data as longs instead of strings (we need numbers on this, though)
  • there is no measurable performance hit from polymorphism in topN
  • there is measurable hit from polymorphism in groupBy for strings, but groupBy is under a lot of flux anyway
  • we are investigating ways to un-do the performance hit for strings

I'm willing to say that the change is worth including.

@leventov
Copy link
Member

When applying extraction functions to a float/long, the resulting DimensionSelector (exFns can only output Strings currently) constructs an on-the-fly dictionary without limits on memory usage, this may be problematic if an extraction function's set of output values has high cardinality. This is needed since query engines rely on the existence of a dictionary when handling Strings, I haven't thought of a better way to handle this yet.

SingleScanTimeDimSelector has the same problem. IMO, the query engines should maintain String dictionaries on their own if the selector doesn't have one, which it can report by returning CARDINALITY_UNKNOWN from getValueCardinality(). The query engines are in the best position to manage memory in an appropriate way. I don't think you need to do this as part of this PR, but, I think it's the best way to go ultimately.

Also StringDimensionIndexer's DimensionSelector and QueryableIndexStorageAdapter's DimensionSelectors have positive getValueCardinality() but don't have dictionaries when there is non-null extractionFn. When I worked on #3858 I first changed them to return CARDINALITY_UNKNOWN in this case, but a lot of tests started to fail. Having cardinality or not and having dictionary or not are different properties. This is handled in #3858 by adding idLookup() method to DimensionSelector interface.

@leventov
Copy link
Member

I am looking into the GroupBy performance discrepancy more and considering other possible approaches (e.g., some form of code generation), but opening this PR now to get some thoughts from the community on these aspects.

I considered code generation when worked on #3798. Decided it's too difficult and changes the current source tree too much. So I suggest to resolve this by expanding #3798 to groupBy queries and other types of queries other than topN.

Copy link
Member

@leventov leventov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR review until TopNBinaryFn excluding

@Override
public DruidFloatPredicate makeFloatPredicate()
{
return new DruidFloatPredicate()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please extract a singleton

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using singletons for these false predicates now, in DruidLongPredicate/DruidFloatPredicate

@Override
public DruidFloatPredicate makeFloatPredicate()
{
return new DruidFloatPredicate()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please extract a singleton

@@ -67,6 +70,61 @@
return parseInt(query, "uncoveredIntervalsLimit", defaultValue);
}

public static <T> Map<String, Object> getContextTypeHints(Query<T> query)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't it accept Query<?> without a method type parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed this function now, based on @gianm's comment about moving the type specification to DimensionSpec

return query.getContextValue(QueryContextKeys.TYPE_HINTS);
}

public static <T> Map<String, ValueType> getContextTypeHints(Query<T> query, List<DimensionSpec> dimensions)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't it accept Query<?> without a method type parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed this function now, based on @gianm's comment about moving the type specification to DimensionSpec

}
}

private static <T> Map<String, ValueType> convertTypeHintsValues(Map<String, Object> oldMap)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<T> is not used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function was removed

@@ -137,6 +139,34 @@ public boolean applyLong(long input)
};
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the method which ends with this brace Predicates.equalTo() and singleton could be used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the String predicate here to use Predicates.equalTo() and replaced the always false predicates with singletons

@@ -225,4 +255,18 @@ private void initLongValue()
longsInitialized = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If caching resulting predicate directly instead of valueAsLong (it's possible because predicates are immutable), longsInitialized is not needed. Also, locking could be removed and benign race allowed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cached the predicate instead

int keyBufferPosition, Object rowObj, int rowValIdx, ByteBuffer keyBuffer
)
{
// rows from a long column are always size 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is not clear to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is saying that the constant "1" below is the number of values in a row. It's 1 because long columns can't be multi-value. Maybe this comment would be clearer if it was directly above the return rowValIdx < 1; line? Or if that constant was given a name like LongColumnSelector.ROW_SIZE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reworded the comment to mention multivalue support, and replaced the constant with a named one

int keyBufferPosition, Object rowObj, int rowValIdx, ByteBuffer keyBuffer
)
{
// rows from a float column are always size 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is not clear to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment to above

@@ -74,7 +74,10 @@ public void run(
{
boolean hasDimValSelector = (dimValSelector != null);

final int cardinality = params.getCardinality();
int cardinality = params.getCardinality();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change should be a part of this PR? cardinality < 0 probably means that completely different approach should be taken to compute topN, setting just to Integer.MAX_VALUE may result in a lot of excessive work and allocations in run() method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this is really necessary. It looks like this code path may not get executed for non-string selectors. If so, there should be no need to reset it here. If there is a need to reset it here, that seems suspicious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed BaseTopNAlgorithm to have a "with cardinality" path and "no cardinality" path, the "no cardinality" path doesn't use DimValSelector (current implementations need cardinality) and the multi-pass logic

@jon-wei
Copy link
Contributor Author

jon-wei commented Jan 21, 2017

@leventov @gianm

Thanks for the reviews so far, marking this WIP while I address the comments and get more benchmark results, will update this early next week

@jon-wei jon-wei changed the title Add long/float ColumnSelectorStrategy implementations Add long/float ColumnSelectorStrategy implementations [WIP] Jan 21, 2017
Copy link
Member

@leventov leventov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Complete review

if (input == null) {
return null;
}
if (input instanceof String) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block is redundant, String.toString() returns this

@@ -43,10 +43,6 @@ protected TopNParams(
this.cursor = cursor;
this.cardinality = selectorPlus.getColumnSelectorStrategy().getCardinality(selectorPlus.getSelector());
this.numValuesPerPass = numValuesPerPass;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to some change above, not sure this should be a part of this PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need to be here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is removed since the TopNParams can now hold CARDINALITY_UNKNOWN for certain dimension selectors

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add this check to anything that calls getCardinality(), getSelectorPlus(), or getDimSelector() and wouldn't be able to handle the CARDINALITY_UNKNOWN case? That's a useful sanity check to make sure that the code is routing the proper cases to the proper algorithms.

@@ -46,6 +46,10 @@ public int getCardinality(DimensionSelector selector)
// Optimization possibly requires a reverse lookup from value to ID, which is
// not possible when applying an extraction function

if (params.getCardinality() < 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And changes in this class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the check, no longer needed with separate code path for "cardinality unknown" case

capabilities = DEFAULT_STRING_CAPABILITIES;
}

// Currently, all extractionFns output Strings, so force the type if present
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"force the type if present" -- what does it mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed comment to:

"Currently, all extractionFns output Strings, so the column will return String values via a DimensionSelector if an extractionFn is present."

return ((Number) valObj).longValue();
} else if (valObj instanceof String) {
try {
String s = ((String) valObj).replace(",", "");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What removing commas?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took this parsing code from getLongMetric/getFloatMetric in MapBasedRowinitially, I kept the comma removal since these transform functions would be used on String columns with "numeric" data, and figured it would be useful to allow that format

I don't mind removing this if someone is against it though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's puzzling. I think it's worth to understand why it was added in the first place, and if the reason is still valid, to add a comment on this line.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I suppose the purpose is to remove 1000s delimiter from numbers, e. g. 1,000,000 is 1000000. But not in all locales, e. g. in Russian locale comma is a separator between integer part and fractional part in non-integer numbers.

IMO this is "too smart" and we shouldn't do this. Probably it should be discussed separately, not in this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, good point on the locales, I agree this is too smart, I've removed the comma removal and replaced these blocks with GuavaUtils.tryParseLong and Floats.tryParse

// TopNBinaryFn relies on this returning the same Float object when no conversion is needed
public static Float convertObjectToFloat(Object valObj)
{
if (valObj == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why there is no similar check in convertObjectToLong()?

return ((Number) valObj).floatValue();
} else if (valObj instanceof String) {
try {
return Float.valueOf(((String) valObj).replace(",", ""));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why removing commas?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can imagine it could become a source or bugs, because in some Locales comma is used as a separator between integer and fractional parts. So removing commas from "1,0" will silently produce 10.0f instead of failing with exception.

{
private final FloatColumnSelector delegate;
private final ExtractionFn exFn;
private final Map<String, Integer> valueToId = Maps.newHashMap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given how DimensionSelectors are used, accumulating valueToId and idToValue doesn't make much sense. I agree with @gianm that users of DimensionSelectors are in a better position to do such caching. If #3858 is merged first, this FloatWrappingDimensionSelector should return null from idLookup(), false from nameLookupPossibleInAdvance(), ZeroIndexedInts from getRow() and perform float -> String conversion in lookupName(). However, if FloatColumnSelector has equivalents of DimensionSelector's idLookup(), lookupName() and getValueCardinality(), FloatWrappingDimensionSelector could be optimized.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree on having the dictionary management reside in the query engines and other users, I'm in favor of getting #3858 merged first so I'll hold off on changing this part for now

import java.util.List;
import java.util.Map;

public class LongWrappingDimensionSelector implements DimensionSelector
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments as in FloatWrappingDimensionSelector

@Override
public DruidFloatPredicate makeFloatPredicate()
{
return new DruidFloatPredicate()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please replace with a singleton

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please replace with a singleton

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops, missed that the first time, thanks

}
}

private void initPredicate()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I'm losing the point of lazy initialization here, as well several similar places below. Maybe it's simpler to compute build a predicate eagerly.

Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stopped on StringTopNColumnSelectorStrategy.java, will continue later

@@ -21,4 +21,4 @@ The query context is used for various query configuration parameters.
|`maxResults`|500000|Maximum number of results groupBy query can process. Default value used can be changed by `druid.query.groupBy.maxResults` in druid configuration at broker and historical nodes. At query time you can only lower the value.|
|`maxIntermediateRows`|50000|Maximum number of intermediate rows while processing single segment for groupBy query. Default value used can be changed by `druid.query.groupBy.maxIntermediateRows` in druid configuration at broker and historical nodes. At query time you can only lower the value.|
|`groupByIsSingleThreaded`|false|Whether to run single threaded group By queries. Default value used can be changed by `druid.query.groupBy.singleThreaded` in druid configuration at historical nodes.|

|`typeHints`|{}| A map of column name -> column type (String, Long, Float). By default, druid returns all column values as strings within query results. If querying on a non-String column, `typeHints` must be included in a query, containing a mapping of the name of the non-String column to its desired return type. This is necessary because columns with the same name in different segments do not necessarily have the same value type, and a type must be chosen when merging results.|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be part of DimensionSpec rather than the query context. Usually the query context is tuning-type stuff or minor behavior modifications, but type hints are more than that.

For this part:

If querying on a non-String column, typeHints must be included in a query

Is that true or does the type default to string if not specified? The current language suggests that if typeHints are not included then the query will fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed typeHints and added an outputType field to DimensionSpec, docs have been updated accordingly

@@ -99,3 +99,21 @@ Possible codes for the *error* field include:
|`Query cancelled`|The query was cancelled through the query cancellation API.|
|`Resource limit exceeded`|The query exceeded a configured resource limit (e.g. groupBy maxResults).|
|`Unknown exception`|Some other exception occurred. Check errorMessage and errorClass for details, although keep in mind that the contents of those fields are free-form and may change from release to release.|


Column Types
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## Column Types (or however many #'s) is more normal

@Override
public void hashValues(LongColumnSelector dimSelector, HyperLogLogCollector collector)
{
collector.add(CardinalityAggregator.hashFn.hashLong(dimSelector.get()).asBytes());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's because hashFn doesn't have a hashFloat method. But you could do hashFn.hashInt(Float.floatToIntBits(f)).

@Override
public void hashValues(FloatColumnSelector dimSelector, HyperLogLogCollector collector)
{
collector.add(CardinalityAggregator.hashFn.hashUnencodedChars(String.valueOf(dimSelector.get())).asBytes());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hashFn.hashInt(Float.floatToIntBits(dimSelector.get())) should be faster than encoding the float as a string.

private final boolean hasUpperLongBound = BoundLongPredicateSupplier.this.hasUpperLongBound;
private final long lowerLongBound = hasLowerLongBound ? BoundLongPredicateSupplier.this.lowerLongBound : 0L;
private final long upperLongBound = hasUpperLongBound ? BoundLongPredicateSupplier.this.upperLongBound : 0L;
synchronized (initLock) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leventov the predicate is not trivial to compute (it involves some string parsing) – do you think it's likely that lock savings from allowing the race outweighs the cost of potentially computing it multiple times?


TopNResultValue arg1Vals = arg1.getValue();
TopNResultValue arg2Vals = arg2.getValue();

for (DimensionAndMetricValueExtractor arg1Val : arg1Vals) {
retVals.put(arg1Val.getStringDimensionValue(dimension), arg1Val);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the getStringDimensionValue() method still needed? if not, get rid of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't used anymore, I've removed getStringDimensionValue now

}
}

private DimensionAndMetricValueExtractor getTransformedExtractor(DimensionAndMetricValueExtractor argVal)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TopNBinaryFn isn't the right place to do this, it should happen when the TopNResultValues are built in the first place. Partially this is for efficiency but also it's because the binary fn doesn't always fire, so you can't rely on it for transformations that need to happen. If there's only one result value (like a topN over a single segment) then it won't fire.

So putting this in TopNMapFn would be better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the transform functions to TopNMapFn, the transformation is now applied within DimExtractionTopNAlgorithm as it adds results to the result builder

@Override
public int getCardinality(FloatColumnSelector selector)
{
return -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps pull this out into a TopNColumnSelectorStrategy.CARDINALITY_UNKNOWN constant for clarity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added TopNColumnSelectorStrategy.CARDINALITY_UNKNOWN

getContextTypeHints(query)
);

if (dimensions != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, this logic should be part of the DimensionSpec, maybe a method like dimensionSpec.getOutputType().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this to DimensionSpec outputType

DimensionSelector selector,
Cursor cursor,
Aggregator[][] rowSelector,
Map<Comparable, Aggregator[]> aggregatesStore
)
{
final IndexedInts dimValues = selector.getRow();
if (selector.getValueCardinality() < 0) {
dimExtractionScanAndAggregateNoCardinality(selector, rowSelector, aggregatesStore, cursor, query);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I guess the purpose of this is to support extractionFns on longs/floats?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This'll probably need some reconciling with #3858.

@jon-wei jon-wei force-pushed the dimtype3 branch 6 times, most recently from 77f3e88 to 4fd0ecb Compare January 25, 2017 02:59
@jon-wei
Copy link
Contributor Author

jon-wei commented Jan 25, 2017

Updated the patch, still waiting on benchmark results comparing performance of a groupBy on a String column with numeric data vs. its equivalent long and float versions

I've left the filter predicate locking/lazy initialization for now since we haven't reached a consensus on that yet, addressed the other comments in that area though

I'm in favor of getting #3858 merged before this PR, so I'll revisit the FloatWrapping/LongWrapping selectors after that

The latest update on this PR also changes the hash function used in GroupByV2, since benchmarks showed very poor hashing behavior (leading to ~50 seconds running time on the multiQueryableIndex benchmarks) on floats using the existing XOR and shift function. The benchmarks used are included.

@jon-wei jon-wei changed the title Add long/float ColumnSelectorStrategy implementations [WIP] Add long/float ColumnSelectorStrategy implementations Jan 25, 2017
@jon-wei jon-wei closed this Feb 7, 2017
@jon-wei jon-wei reopened this Feb 7, 2017
@@ -76,6 +76,21 @@ public static Builder builder()
return new Builder();
}

public static ValueType getValueTypeForSqlTypeName(SqlTypeName sqlTypeName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this to "Calcites" where most of the random utility functions should go.

throw new ISE("Cannot translate sqlTypeName[%s] to Druid type for field[%s]", sqlTypeName, name);
}

final DimensionSpec dimensionSpec = rex.toDimensionSpec(rowSignature, null, null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to pass ValueType.STRING in here explicitly and check for non-null outputType in toDimensionSpec.

public DimensionSpec toDimensionSpec(
final RowSignature rowSignature,
final String outputName,
final ValueType outputType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have a Preconditions.checkNotNull(outputType, "outputType") to prevent bugs due to accidentally forgetting to pass in an outputType. Callers should pass in STRING if that's what they want.

List<Row> expectedResults;

if (config.getDefaultStrategy().equals(GroupByStrategySelector.STRATEGY_V1)) {
expectedResults = Arrays.asList(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should throw an exception. In general, groupBy v1 (which doesn't support outputType other than STRING) should throw an exception if outputType is anything other than STRING. The message should be something like "groupBy v1 only supports outputType STRING" which will give people a little push to use groupBy v2.

final StringComparator naturalComparator;
if (columnType == ValueType.STRING) {
naturalComparator = StringComparators.LEXICOGRAPHIC;
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If outputType is COMPLEX or null this is not true. I know this is "impossible" but please check for it anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good way to do this is have an else { sortingNeeded = true; break; }

@gianm
Copy link
Contributor

gianm commented Feb 7, 2017

@jon-wei please also update documentations:

  • sql.md (the "Unsupported features" section)
  • dimensionspecs.md (add "outputType" to the extraction spec, and note where outputTypes are and are not supported out of the places that use dimension specs: topN, groupBy v1, groupBy v2, cardinality aggregator, search query, select query)

@jon-wei jon-wei force-pushed the dimtype3 branch 3 times, most recently from 679d671 to 5ebc21a Compare February 8, 2017 00:52
@@ -455,10 +456,17 @@ private DimensionSelector makeDimensionSelectorUndecorated(
}

if (columnDesc.getCapabilities().getType() == ValueType.LONG) {
// Filtered dimension specs are not supported on numerics
if (dimensionSpec instanceof BaseFilteredDimensionSpec) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, these instanceof BaseFilteredDimensionSpec checks everywhere are kind of gnarly. There's way too many of them for comfort. Is there something better we can do? Maybe making them work is a better option if it's not too hard… why don't they, again? They should be able to decorate a wrapped dimension selector just fine, right?

Is it just that we don't know if a DimensionSpec's "decorate" method is actually going to get called, because getColumnValueSelectorFromDimensionSpec might skip it by going and making something like a LongColumnSelector instead? If so, we could address that by adding a "mustDecorate" method to DimensionSpec, and if it mustDecorate, then we force creation of a DimensionSelector (rather than whatever type the underlying column is).

I would also accept alternate solutions that don't involve instanceof checks all over the place.

Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 after travis

@gianm gianm merged commit ca2b04f into apache:master Feb 9, 2017
@jon-wei jon-wei deleted the dimtype3 branch October 6, 2017 22:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants