Add long/float ColumnSelectorStrategy implementations #3838

jon-wei · 2017-01-11T04:07:33Z

This PR adds long and float implementations of the query processing ColumnSelectorStrategy interface introduced in #3570.

This allows users to include non-String columns as dimensions within queries by specifying a typeHints map in the query context. This is described in more detail in the doc changes within the PR.

For benchmarking, the patch also adds a set of "simple" schemas to the benchmark schemas, which has a single dimension "dimSequential" and a row count aggregator, where "dimSequential" contains an ascending sequence of integer values, encoded as String/Long/Float. Results for the benchmarks are located at:

https://gist.github.com/jon-wei/44a2b8ab97678710a9ef5b24aa107dec

NOTE:

When applying extraction functions to a float/long, the resulting DimensionSelector (exFns can only output Strings currently) constructs an on-the-fly dictionary without limits on memory usage, this may be problematic if an extraction function's set of output values has high cardinality. This is needed since query engines rely on the existence of a dictionary when handling Strings, I haven't thought of a better way to handle this yet.

NOTE:

Based on GroupByTypeInterfaceBenchmark, there is some virtual method overhead seen in GroupBy V2, like the concern discussed in this thread: Filters: Use ColumnSelectorFactory directly for building row-based matchers. #3797 (comment).

Benchmark                                                                 (defaultStrategy)  (initialBuckets)  (numProcessingThreads)  (numSegments)  (queryGranularity)  (rowsPerSegment)  Mode  Cnt      Score     Error  Units
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexFloatOnly                         v2                -1                       4              4                 all            250000  avgt   50  17895.755 ± 136.289  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexFloatThenLong                     v2                -1                       4              4                 all            250000  avgt   50  35926.555 ± 110.157  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexFloatThenString                   v2                -1                       4              4                 all            250000  avgt   50  44480.465 ± 398.283  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexFloatTwice                        v2                -1                       4              4                 all            250000  avgt   50  35673.128 ±  40.837  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexLongOnly                          v2                -1                       4              4                 all            250000  avgt   50  17920.575 ±  46.517  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexLongThenString                    v2                -1                       4              4                 all            250000  avgt   50  46077.899 ± 470.782  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexLongThenFloat                     v2                -1                       4              4                 all            250000  avgt   50  35986.157 ±  69.486  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexLongTwice                         v2                -1                       4              4                 all            250000  avgt   50  35938.898 ±  70.856  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexNumericOnly                       v2                -1                       4              4                 all            250000  avgt   50  20979.666 ±  51.640  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexNumericThenString                 v2                -1                       4              4                 all            250000  avgt   50  52961.866 ± 438.055  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexStringOnly                        v2                -1                       4              4                 all            250000  avgt   50  22801.356 ± 238.549  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexStringThenLong                    v2                -1                       4              4                 all            250000  avgt   50  44720.358 ± 454.617  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexStringThenNumeric                 v2                -1                       4              4                 all            250000  avgt   50  51805.316 ± 408.565  us/op
GroupByTypeInterfaceBenchmark.querySingleQueryableIndexStringTwice                       v2                -1                       4              4                 all            250000  avgt   50  46402.020 ± 657.067  us/op

This set of benchmarks attempts determine the running time for a group by query against single string/long/float columns (monomorphic base case), and running times for sequences of queries that use more than one type implementation.

For example, StringThenLong runs the string query followed by the long query within the same benchmark.

There appears to be some overhead when multiple type implementations are called, e.g.:

LongOnly = 17920.575us
StringOnly = 22801.356us

which should sum to ~40000us for both.

StringThenLong however has a running time of 444720.358us, about 10% higher than expected.

The difference bigger for NumericOnly (query on both a long and float) and NumericThenString:

NumericOnly = 20979.666
StringOnly = 22801.356

Sum should be ~44000, but NumericThenString and StringThenNumeric are 52961.866 and 51805.316 respectively.

The difference between StringOnly and StringTwice is smaller, suggesting the overhead may be related to the polymorphism. However, the behavior is not seen for long/float only, e.g. LongTwice is very close to LongThenFloat and (LongOnly + FloatOnly)

Similar benchmarks for TopN don't show the same discrepancy:

Benchmark                                                            (numSegments)  (rowsPerSegment)  (schemaAndQuery)  (threshold)  Mode  Cnt      Score      Error  Units
TopNTypeInterfaceBenchmark.querySingleQueryableIndexFloatOnly                    1            250000           basic.A           10  avgt   50  41916.788 ±  590.935  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexFloatThenLong                1            250000           basic.A           10  avgt   50  84411.958 ±  446.999  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexFloatThenString              1            250000           basic.A           10  avgt   50  94284.029 ± 1395.548  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexFloatTwice                   1            250000           basic.A           10  avgt   50  85390.807 ±  934.008  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexLongOnly                     1            250000           basic.A           10  avgt   50  43835.461 ±  612.863  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexLongThenFloat                1            250000           basic.A           10  avgt   50  86043.197 ±  597.789  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexLongThenString               1            250000           basic.A           10  avgt   50  87899.020 ±  520.239  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexLongTwice                    1            250000           basic.A           10  avgt   50  86441.232 ±  635.296  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexStringOnly                   1            250000           basic.A           10  avgt   50  50487.063 ±  402.854  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexStringThenFloat              1            250000           basic.A           10  avgt   50  90925.010 ±  731.045  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexStringThenLong               1            250000           basic.A           10  avgt   50  91756.314 ±  741.951  us/op
TopNTypeInterfaceBenchmark.querySingleQueryableIndexStringTwice                  1            250000           basic.A           10  avgt   50  98454.496 ±  996.865  us/op

I am looking into the GroupBy performance discrepancy more and considering other possible approaches (e.g., some form of code generation), but opening this PR now to get some thoughts from the community on these aspects.

gianm · 2017-01-19T21:56:09Z

When applying extraction functions to a float/long, the resulting DimensionSelector (exFns can only output Strings currently) constructs an on-the-fly dictionary without limits on memory usage, this may be problematic if an extraction function's set of output values has high cardinality. This is needed since query engines rely on the existence of a dictionary when handling Strings, I haven't thought of a better way to handle this yet.

SingleScanTimeDimSelector has the same problem. IMO, the query engines should maintain String dictionaries on their own if the selector doesn't have one, which it can report by returning CARDINALITY_UNKNOWN from getValueCardinality(). The query engines are in the best position to manage memory in an appropriate way. I don't think you need to do this as part of this PR, but, I think it's the best way to go ultimately.

gianm · 2017-01-19T22:05:37Z

Based on GroupByTypeInterfaceBenchmark, there is some virtual method overhead seen in GroupBy V2, like the concern discussed in this thread: #3797 (comment).

StringThenLong however has a running time of 444720.358us, about 10% higher than expected.

Do you have benchmarks on grouping on high-cardinality numeric columns as strings vs. as longs? (same data, but stored/grouped as string vs. stored/grouped as long). I suspect that this will be a much bigger difference than 10%, and if so, given that:

there is a lot of performance benefit on both topN and groupBy in grouping numeric data as longs instead of strings (we need numbers on this, though)
there is no measurable performance hit from polymorphism in topN
there is measurable hit from polymorphism in groupBy for strings, but groupBy is under a lot of flux anyway
we are investigating ways to un-do the performance hit for strings

I'm willing to say that the change is worth including.

leventov · 2017-01-20T00:05:28Z

When applying extraction functions to a float/long, the resulting DimensionSelector (exFns can only output Strings currently) constructs an on-the-fly dictionary without limits on memory usage, this may be problematic if an extraction function's set of output values has high cardinality. This is needed since query engines rely on the existence of a dictionary when handling Strings, I haven't thought of a better way to handle this yet.

SingleScanTimeDimSelector has the same problem. IMO, the query engines should maintain String dictionaries on their own if the selector doesn't have one, which it can report by returning CARDINALITY_UNKNOWN from getValueCardinality(). The query engines are in the best position to manage memory in an appropriate way. I don't think you need to do this as part of this PR, but, I think it's the best way to go ultimately.

Also StringDimensionIndexer's DimensionSelector and QueryableIndexStorageAdapter's DimensionSelectors have positive getValueCardinality() but don't have dictionaries when there is non-null extractionFn. When I worked on #3858 I first changed them to return CARDINALITY_UNKNOWN in this case, but a lot of tests started to fail. Having cardinality or not and having dictionary or not are different properties. This is handled in #3858 by adding idLookup() method to DimensionSelector interface.

leventov · 2017-01-20T00:23:02Z

I am looking into the GroupBy performance discrepancy more and considering other possible approaches (e.g., some form of code generation), but opening this PR now to get some thoughts from the community on these aspects.

I considered code generation when worked on #3798. Decided it's too difficult and changes the current source tree too much. So I suggest to resolve this by expanding #3798 to groupBy queries and other types of queries other than topN.

leventov

PR review until TopNBinaryFn excluding

leventov · 2017-01-20T00:25:33Z

benchmarks/src/main/java/io/druid/benchmark/DimensionPredicateFilterBenchmark.java

+        @Override
+        public DruidFloatPredicate makeFloatPredicate()
+        {
+          return new DruidFloatPredicate()


Please extract a singleton

Using singletons for these false predicates now, in DruidLongPredicate/DruidFloatPredicate

leventov · 2017-01-20T00:26:14Z

benchmarks/src/main/java/io/druid/benchmark/FilterPartitionBenchmark.java

+          @Override
+          public DruidFloatPredicate makeFloatPredicate()
+          {
+            return new DruidFloatPredicate()


Please extract a singleton

leventov · 2017-01-20T00:27:58Z

processing/src/main/java/io/druid/query/BaseQuery.java

@@ -67,6 +70,61 @@
    return parseInt(query, "uncoveredIntervalsLimit", defaultValue);
  }

+  public static <T> Map<String, Object> getContextTypeHints(Query<T> query)


Couldn't it accept Query<?> without a method type parameter?

I've removed this function now, based on @gianm's comment about moving the type specification to DimensionSpec

leventov · 2017-01-20T00:28:22Z

processing/src/main/java/io/druid/query/BaseQuery.java

+    return query.getContextValue(QueryContextKeys.TYPE_HINTS);
+  }
+
+  public static <T> Map<String, ValueType> getContextTypeHints(Query<T> query, List<DimensionSpec> dimensions)


Couldn't it accept Query<?> without a method type parameter?

I've removed this function now, based on @gianm's comment about moving the type specification to DimensionSpec

leventov · 2017-01-20T00:30:50Z

processing/src/main/java/io/druid/query/BaseQuery.java

+    }
+  }
+
+  private static <T> Map<String, ValueType> convertTypeHintsValues(Map<String, Object> oldMap)


<T> is not used

Function was removed

leventov · 2017-01-20T02:15:26Z

processing/src/main/java/io/druid/query/filter/SelectorDimFilter.java

@@ -137,6 +139,34 @@ public boolean applyLong(long input)
            };
          }
        }


In the method which ends with this brace Predicates.equalTo() and singleton could be used.

I changed the String predicate here to use Predicates.equalTo() and replaced the always false predicates with singletons

leventov · 2017-01-20T02:17:23Z

processing/src/main/java/io/druid/query/filter/SelectorDimFilter.java

@@ -225,4 +255,18 @@ private void initLongValue()
      longsInitialized = true;


If caching resulting predicate directly instead of valueAsLong (it's possible because predicates are immutable), longsInitialized is not needed. Also, locking could be removed and benign race allowed.

Cached the predicate instead

leventov · 2017-01-20T02:18:58Z

processing/src/main/java/io/druid/query/groupby/epinephelinae/GroupByQueryEngineV2.java

+        int keyBufferPosition, Object rowObj, int rowValIdx, ByteBuffer keyBuffer
+    )
+    {
+      // rows from a long column are always size 1


This comment is not clear to me

This is saying that the constant "1" below is the number of values in a row. It's 1 because long columns can't be multi-value. Maybe this comment would be clearer if it was directly above the return rowValIdx < 1; line? Or if that constant was given a name like LongColumnSelector.ROW_SIZE.

I've reworded the comment to mention multivalue support, and replaced the constant with a named one

leventov · 2017-01-20T02:19:52Z

processing/src/main/java/io/druid/query/groupby/epinephelinae/GroupByQueryEngineV2.java

+        int keyBufferPosition, Object rowObj, int rowValIdx, ByteBuffer keyBuffer
+    )
+    {
+      // rows from a float column are always size 1


This comment is not clear to me

Similar comment to above

leventov · 2017-01-20T02:35:02Z

processing/src/main/java/io/druid/query/topn/BaseTopNAlgorithm.java

@@ -74,7 +74,10 @@ public void run(
  {
    boolean hasDimValSelector = (dimValSelector != null);

-    final int cardinality = params.getCardinality();
+    int cardinality = params.getCardinality();


Why this change should be a part of this PR? cardinality < 0 probably means that completely different approach should be taken to compute topN, setting just to Integer.MAX_VALUE may result in a lot of excessive work and allocations in run() method.

I'm wondering if this is really necessary. It looks like this code path may not get executed for non-string selectors. If so, there should be no need to reset it here. If there is a need to reset it here, that seems suspicious.

I've changed BaseTopNAlgorithm to have a "with cardinality" path and "no cardinality" path, the "no cardinality" path doesn't use DimValSelector (current implementations need cardinality) and the multi-pass logic

jon-wei · 2017-01-21T04:28:10Z

@leventov @gianm

Thanks for the reviews so far, marking this WIP while I address the comments and get more benchmark results, will update this early next week

leventov

Complete review

leventov · 2017-01-21T01:38:23Z

processing/src/main/java/io/druid/query/topn/TopNBinaryFn.java

+      if (input == null) {
+        return null;
+      }
+      if (input instanceof String) {


This block is redundant, String.toString() returns this

leventov · 2017-01-21T02:07:37Z

processing/src/main/java/io/druid/query/topn/TopNParams.java

@@ -43,10 +43,6 @@ protected TopNParams(
    this.cursor = cursor;
    this.cardinality = selectorPlus.getColumnSelectorStrategy().getCardinality(selectorPlus.getSelector());
    this.numValuesPerPass = numValuesPerPass;
-


Related to some change above, not sure this should be a part of this PR

Still need to be here?

This is removed since the TopNParams can now hold CARDINALITY_UNKNOWN for certain dimension selectors

Could you add this check to anything that calls getCardinality(), getSelectorPlus(), or getDimSelector() and wouldn't be able to handle the CARDINALITY_UNKNOWN case? That's a useful sanity check to make sure that the code is routing the proper cases to the proper algorithms.

leventov · 2017-01-21T02:10:13Z

processing/src/main/java/io/druid/query/topn/types/StringTopNColumnSelectorStrategy.java

@@ -46,6 +46,10 @@ public int getCardinality(DimensionSelector selector)
    // Optimization possibly requires a reverse lookup from value to ID, which is
    // not possible when applying an extraction function

+    if (params.getCardinality() < 0) {


And changes in this class

removed the check, no longer needed with separate code path for "cardinality unknown" case

leventov · 2017-01-21T02:15:00Z

processing/src/main/java/io/druid/segment/DimensionHandlerUtils.java

+      capabilities = DEFAULT_STRING_CAPABILITIES;
+    }
+
+    // Currently, all extractionFns output Strings, so force the type if present


"force the type if present" -- what does it mean?

Changed comment to:

"Currently, all extractionFns output Strings, so the column will return String values via a DimensionSelector if an extractionFn is present."

leventov · 2017-01-21T02:16:34Z

processing/src/main/java/io/druid/segment/DimensionHandlerUtils.java

+      return ((Number) valObj).longValue();
+    } else if (valObj instanceof String) {
+      try {
+        String s = ((String) valObj).replace(",", "");


What removing commas?

I took this parsing code from getLongMetric/getFloatMetric in MapBasedRowinitially, I kept the comma removal since these transform functions would be used on String columns with "numeric" data, and figured it would be useful to allow that format

I don't mind removing this if someone is against it though.

It's puzzling. I think it's worth to understand why it was added in the first place, and if the reason is still valid, to add a comment on this line.

Well, I suppose the purpose is to remove 1000s delimiter from numbers, e. g. 1,000,000 is 1000000. But not in all locales, e. g. in Russian locale comma is a separator between integer part and fractional part in non-integer numbers.

IMO this is "too smart" and we shouldn't do this. Probably it should be discussed separately, not in this PR

Hm, good point on the locales, I agree this is too smart, I've removed the comma removal and replaced these blocks with GuavaUtils.tryParseLong and Floats.tryParse

leventov · 2017-01-21T02:18:09Z

processing/src/main/java/io/druid/segment/DimensionHandlerUtils.java

+  // TopNBinaryFn relies on this returning the same Float object when no conversion is needed
+  public static Float convertObjectToFloat(Object valObj)
+  {
+    if (valObj == null) {


Why there is no similar check in convertObjectToLong()?

leventov · 2017-01-21T02:18:33Z

processing/src/main/java/io/druid/segment/DimensionHandlerUtils.java

+      return ((Number) valObj).floatValue();
+    } else if (valObj instanceof String) {
+      try {
+        return Float.valueOf(((String) valObj).replace(",", ""));


Why removing commas?

I can imagine it could become a source or bugs, because in some Locales comma is used as a separator between integer and fractional parts. So removing commas from "1,0" will silently produce 10.0f instead of failing with exception.

leventov · 2017-01-21T04:59:25Z

processing/src/main/java/io/druid/segment/FloatWrappingDimensionSelector.java

+{
+  private final FloatColumnSelector delegate;
+  private final ExtractionFn exFn;
+  private final Map<String, Integer> valueToId = Maps.newHashMap();


Given how DimensionSelectors are used, accumulating valueToId and idToValue doesn't make much sense. I agree with @gianm that users of DimensionSelectors are in a better position to do such caching. If #3858 is merged first, this FloatWrappingDimensionSelector should return null from idLookup(), false from nameLookupPossibleInAdvance(), ZeroIndexedInts from getRow() and perform float -> String conversion in lookupName(). However, if FloatColumnSelector has equivalents of DimensionSelector's idLookup(), lookupName() and getValueCardinality(), FloatWrappingDimensionSelector could be optimized.

Agree on having the dictionary management reside in the query engines and other users, I'm in favor of getting #3858 merged first so I'll hold off on changing this part for now

leventov · 2017-01-21T05:00:36Z

processing/src/main/java/io/druid/segment/LongWrappingDimensionSelector.java

+import java.util.List;
+import java.util.Map;
+
+public class LongWrappingDimensionSelector implements DimensionSelector


Same comments as in FloatWrappingDimensionSelector

leventov · 2017-01-21T05:18:40Z

processing/src/main/java/io/druid/segment/filter/SpatialFilter.java

+          @Override
+          public DruidFloatPredicate makeFloatPredicate()
+          {
+            return new DruidFloatPredicate()


Please replace with a singleton

whoops, missed that the first time, thanks

leventov · 2017-01-21T20:57:40Z

processing/src/main/java/io/druid/query/filter/BoundDimFilter.java

+        }
+      }
+
+      private void initPredicate()


Actually I'm losing the point of lazy initialization here, as well several similar places below. Maybe it's simpler to compute build a predicate eagerly.

gianm

Stopped on StringTopNColumnSelectorStrategy.java, will continue later

gianm · 2017-01-21T19:18:13Z

docs/content/querying/query-context.md

@@ -21,4 +21,4 @@ The query context is used for various query configuration parameters.
 |`maxResults`|500000|Maximum number of results groupBy query can process. Default value used can be changed by `druid.query.groupBy.maxResults` in druid configuration at broker and historical nodes. At query time you can only lower the value.|
 |`maxIntermediateRows`|50000|Maximum number of intermediate rows while processing single segment for groupBy query. Default value used can be changed by `druid.query.groupBy.maxIntermediateRows` in druid configuration at broker and historical nodes. At query time you can only lower the value.|
 |`groupByIsSingleThreaded`|false|Whether to run single threaded group By queries. Default value used can be changed by `druid.query.groupBy.singleThreaded` in druid configuration at historical nodes.|
-
+|`typeHints`|{}| A map of column name -> column type (String, Long, Float). By default, druid returns all column values as strings within query results. If querying on a non-String column, `typeHints` must be included in a query, containing a mapping of the name of the non-String column to its desired return type. This is necessary because columns with the same name in different segments do not necessarily have the same value type, and a type must be chosen when merging results.|


This should be part of DimensionSpec rather than the query context. Usually the query context is tuning-type stuff or minor behavior modifications, but type hints are more than that.

For this part:

If querying on a non-String column, typeHints must be included in a query

Is that true or does the type default to string if not specified? The current language suggests that if typeHints are not included then the query will fail.

Removed typeHints and added an outputType field to DimensionSpec, docs have been updated accordingly

gianm · 2017-01-21T19:20:05Z

docs/content/querying/querying.md

@@ -99,3 +99,21 @@ Possible codes for the *error* field include:
 |`Query cancelled`|The query was cancelled through the query cancellation API.|
 |`Resource limit exceeded`|The query exceeded a configured resource limit (e.g. groupBy maxResults).|
 |`Unknown exception`|Some other exception occurred. Check errorMessage and errorClass for details, although keep in mind that the contents of those fields are free-form and may change from release to release.|
+
+
+Column Types


## Column Types (or however many #'s) is more normal

gianm · 2017-01-21T19:26:32Z

...uid/query/aggregation/cardinality/types/LongCardinalityAggregatorColumnSelectorStrategy.java

+  @Override
+  public void hashValues(LongColumnSelector dimSelector, HyperLogLogCollector collector)
+  {
+    collector.add(CardinalityAggregator.hashFn.hashLong(dimSelector.get()).asBytes());


I guess it's because hashFn doesn't have a hashFloat method. But you could do hashFn.hashInt(Float.floatToIntBits(f)).

gianm · 2017-01-21T19:27:00Z

...id/query/aggregation/cardinality/types/FloatCardinalityAggregatorColumnSelectorStrategy.java

+  @Override
+  public void hashValues(FloatColumnSelector dimSelector, HyperLogLogCollector collector)
+  {
+    collector.add(CardinalityAggregator.hashFn.hashUnencodedChars(String.valueOf(dimSelector.get())).asBytes());


hashFn.hashInt(Float.floatToIntBits(dimSelector.get())) should be faster than encoding the float as a string.

gianm · 2017-01-21T19:33:32Z

processing/src/main/java/io/druid/query/filter/BoundDimFilter.java

-          private final boolean hasUpperLongBound = BoundLongPredicateSupplier.this.hasUpperLongBound;
-          private final long lowerLongBound = hasLowerLongBound ? BoundLongPredicateSupplier.this.lowerLongBound : 0L;
-          private final long upperLongBound = hasUpperLongBound ? BoundLongPredicateSupplier.this.upperLongBound : 0L;
+        synchronized (initLock) {


@leventov the predicate is not trivial to compute (it involves some string parsing) – do you think it's likely that lock savings from allowing the race outweighs the cost of potentially computing it multiple times?

gianm · 2017-01-21T21:30:36Z

processing/src/main/java/io/druid/query/topn/TopNBinaryFn.java


    TopNResultValue arg1Vals = arg1.getValue();
    TopNResultValue arg2Vals = arg2.getValue();

    for (DimensionAndMetricValueExtractor arg1Val : arg1Vals) {
-      retVals.put(arg1Val.getStringDimensionValue(dimension), arg1Val);


Is the getStringDimensionValue() method still needed? if not, get rid of it.

It wasn't used anymore, I've removed getStringDimensionValue now

gianm · 2017-01-21T21:43:39Z

processing/src/main/java/io/druid/query/topn/TopNBinaryFn.java

+    }
+  }
+
+  private DimensionAndMetricValueExtractor getTransformedExtractor(DimensionAndMetricValueExtractor argVal)


TopNBinaryFn isn't the right place to do this, it should happen when the TopNResultValues are built in the first place. Partially this is for efficiency but also it's because the binary fn doesn't always fire, so you can't rely on it for transformations that need to happen. If there's only one result value (like a topN over a single segment) then it won't fire.

So putting this in TopNMapFn would be better.

I moved the transform functions to TopNMapFn, the transformation is now applied within DimExtractionTopNAlgorithm as it adds results to the result builder

gianm · 2017-01-21T21:47:33Z

processing/src/main/java/io/druid/query/topn/types/FloatTopNColumnSelectorStrategy.java

+  @Override
+  public int getCardinality(FloatColumnSelector selector)
+  {
+    return -1;


Perhaps pull this out into a TopNColumnSelectorStrategy.CARDINALITY_UNKNOWN constant for clarity.

Added TopNColumnSelectorStrategy.CARDINALITY_UNKNOWN

gianm · 2017-01-21T21:58:57Z

processing/src/main/java/io/druid/query/BaseQuery.java

+        getContextTypeHints(query)
+    );
+
+    if (dimensions != null) {


IMO, this logic should be part of the DimensionSpec, maybe a method like dimensionSpec.getOutputType().

Moved this to DimensionSpec outputType

gianm · 2017-01-21T22:05:49Z

processing/src/main/java/io/druid/query/topn/types/StringTopNColumnSelectorStrategy.java

      DimensionSelector selector,
      Cursor cursor,
      Aggregator[][] rowSelector,
      Map<Comparable, Aggregator[]> aggregatesStore
  )
  {
-    final IndexedInts dimValues = selector.getRow();
+    if (selector.getValueCardinality() < 0) {
+      dimExtractionScanAndAggregateNoCardinality(selector, rowSelector, aggregatesStore, cursor, query);


Hm, I guess the purpose of this is to support extractionFns on longs/floats?

This'll probably need some reconciling with #3858.

jon-wei · 2017-01-25T03:43:16Z

Updated the patch, still waiting on benchmark results comparing performance of a groupBy on a String column with numeric data vs. its equivalent long and float versions

I've left the filter predicate locking/lazy initialization for now since we haven't reached a consensus on that yet, addressed the other comments in that area though

I'm in favor of getting #3858 merged before this PR, so I'll revisit the FloatWrapping/LongWrapping selectors after that

The latest update on this PR also changes the hash function used in GroupByV2, since benchmarks showed very poor hashing behavior (leading to ~50 seconds running time on the multiQueryableIndex benchmarks) on floats using the existing XOR and shift function. The benchmarks used are included.

gianm · 2017-02-07T22:02:47Z

sql/src/main/java/io/druid/sql/calcite/table/RowSignature.java

@@ -76,6 +76,21 @@ public static Builder builder()
    return new Builder();
  }

+  public static ValueType getValueTypeForSqlTypeName(SqlTypeName sqlTypeName)


Please move this to "Calcites" where most of the random utility functions should go.

gianm · 2017-02-07T22:13:04Z

sql/src/main/java/io/druid/sql/calcite/aggregation/ApproxCountDistinctSqlAggregator.java

+        throw new ISE("Cannot translate sqlTypeName[%s] to Druid type for field[%s]", sqlTypeName, name);
+      }
+
+      final DimensionSpec dimensionSpec = rex.toDimensionSpec(rowSignature, null, null);


Better to pass ValueType.STRING in here explicitly and check for non-null outputType in toDimensionSpec.

gianm · 2017-02-07T22:13:43Z

sql/src/main/java/io/druid/sql/calcite/expression/RowExtraction.java

+  public DimensionSpec toDimensionSpec(
+      final RowSignature rowSignature,
+      final String outputName,
+      final ValueType outputType


This should have a Preconditions.checkNotNull(outputType, "outputType") to prevent bugs due to accidentally forgetting to pass in an outputType. Callers should pass in STRING if that's what they want.

gianm · 2017-02-07T22:30:18Z

processing/src/test/java/io/druid/query/groupby/GroupByQueryRunnerTest.java

+    List<Row> expectedResults;
+
+    if (config.getDefaultStrategy().equals(GroupByStrategySelector.STRATEGY_V1)) {
+      expectedResults = Arrays.asList(


This should throw an exception. In general, groupBy v1 (which doesn't support outputType other than STRING) should throw an exception if outputType is anything other than STRING. The message should be something like "groupBy v1 only supports outputType STRING" which will give people a little push to use groupBy v2.

gianm · 2017-02-07T22:33:28Z

processing/src/main/java/io/druid/query/groupby/orderby/DefaultLimitSpec.java

+        final StringComparator naturalComparator;
+        if (columnType == ValueType.STRING) {
+          naturalComparator = StringComparators.LEXICOGRAPHIC;
+        } else {


If outputType is COMPLEX or null this is not true. I know this is "impossible" but please check for it anyway.

A good way to do this is have an else { sortingNeeded = true; break; }

gianm · 2017-02-07T22:50:45Z

@jon-wei please also update documentations:

sql.md (the "Unsupported features" section)
dimensionspecs.md (add "outputType" to the extraction spec, and note where outputTypes are and are not supported out of the places that use dimension specs: topN, groupBy v1, groupBy v2, cardinality aggregator, search query, select query)

gianm · 2017-02-08T02:08:13Z

processing/src/main/java/io/druid/segment/QueryableIndexStorageAdapter.java

@@ -455,10 +456,17 @@ private DimensionSelector makeDimensionSelectorUndecorated(
                      }

                      if (columnDesc.getCapabilities().getType() == ValueType.LONG) {
+                        // Filtered dimension specs are not supported on numerics
+                        if (dimensionSpec instanceof BaseFilteredDimensionSpec) {


Ugh, these instanceof BaseFilteredDimensionSpec checks everywhere are kind of gnarly. There's way too many of them for comfort. Is there something better we can do? Maybe making them work is a better option if it's not too hard… why don't they, again? They should be able to decorate a wrapped dimension selector just fine, right?

Is it just that we don't know if a DimensionSpec's "decorate" method is actually going to get called, because getColumnValueSelectorFromDimensionSpec might skip it by going and making something like a LongColumnSelector instead? If so, we could address that by adding a "mustDecorate" method to DimensionSpec, and if it mustDecorate, then we force creation of a DimensionSelector (rather than whatever type the underlying column is).

I would also accept alternate solutions that don't involve instanceof checks all over the place.

gianm

👍 after travis

jon-wei added Discuss Feature labels Jan 11, 2017

jon-wei added this to the 0.10.0 milestone Jan 11, 2017

leventov self-assigned this Jan 11, 2017

jon-wei force-pushed the dimtype3 branch 2 times, most recently from 6919b7f to 24394e3 Compare January 11, 2017 20:36

jon-wei closed this Jan 11, 2017

jon-wei reopened this Jan 11, 2017

Add long/float ColumnSelectorStrategy implementations

b67e83e

jon-wei force-pushed the dimtype3 branch from 24394e3 to b67e83e Compare January 11, 2017 21:50

leventov requested changes Jan 20, 2017

View reviewed changes

jon-wei changed the title ~~Add long/float ColumnSelectorStrategy implementations~~ Add long/float ColumnSelectorStrategy implementations [WIP] Jan 21, 2017

leventov requested changes Jan 21, 2017

View reviewed changes

leventov reviewed Jan 21, 2017

View reviewed changes

gianm reviewed Jan 21, 2017

View reviewed changes

jon-wei force-pushed the dimtype3 branch 6 times, most recently from 77f3e88 to 4fd0ecb Compare January 25, 2017 02:59

jon-wei changed the title ~~Add long/float ColumnSelectorStrategy implementations [WIP]~~ Add long/float ColumnSelectorStrategy implementations Jan 25, 2017

Address PR comments

95e2e50

jon-wei force-pushed the dimtype3 branch from 4fd0ecb to 95e2e50 Compare January 25, 2017 21:08

ScanQuery test and comment about isInputRaw

f0c3f5b

jon-wei force-pushed the dimtype3 branch from 8aaa6ff to 832f919 Compare February 7, 2017 04:12

jon-wei added 2 commits February 6, 2017 22:13

Add outputType to extractionDimensionSpec, tweak SQL tests

b5aef87

Fix limit spec optimization for numerics

e5bca4a

jon-wei force-pushed the dimtype3 branch from 832f919 to e5bca4a Compare February 7, 2017 06:14

Add cardinality sanity checks to TopN

299e13f

jon-wei closed this Feb 7, 2017

jon-wei reopened this Feb 7, 2017

jon-wei added 2 commits February 6, 2017 23:17

Merge remote-tracking branch 'upstream/master' into dimtype3

ab813bc

Fix import from merge

3e0ff60

gianm reviewed Feb 7, 2017

View reviewed changes

Add tests for filtered dimension spec outputType

5b5f385

jon-wei force-pushed the dimtype3 branch 3 times, most recently from 679d671 to 5ebc21a Compare February 8, 2017 00:52

Address PR comments

b5dad54

jon-wei force-pushed the dimtype3 branch from 5ebc21a to b5dad54 Compare February 8, 2017 01:18

gianm reviewed Feb 8, 2017

View reviewed changes

jon-wei force-pushed the dimtype3 branch from f51c8ff to d1f0669 Compare February 8, 2017 23:42

Allow filtered dimspecs on numerics

3e993ee

jon-wei force-pushed the dimtype3 branch from d1f0669 to 3e993ee Compare February 9, 2017 00:07

More comments

bfd6c8c

gianm approved these changes Feb 9, 2017

View reviewed changes

gianm merged commit ca2b04f into apache:master Feb 9, 2017

gianm mentioned this pull request Feb 9, 2017

SQL: Add context and contextual functions to planner. #3919

Merged

gianm mentioned this pull request Feb 28, 2017

Druid 0.10.0 release notes #3944

Closed

jon-wei deleted the dimtype3 branch October 6, 2017 22:22

clambertus unassigned dclim Jul 6, 2018

@@ @@ -137,6 +139,34 @@ public boolean applyLong(long input) @@
                           };
                         }
                       }

		@@ -225,4 +255,18 @@ private void initLongValue()
		longsInitialized = true;

Add long/float ColumnSelectorStrategy implementations #3838

Add long/float ColumnSelectorStrategy implementations #3838

Conversation

jon-wei commented Jan 11, 2017 • edited Loading

gianm commented Jan 19, 2017

gianm commented Jan 19, 2017 • edited Loading

leventov commented Jan 20, 2017

leventov commented Jan 20, 2017

leventov left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jon-wei commented Jan 21, 2017

leventov left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gianm left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jon-wei commented Jan 25, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jon-wei commented Jan 11, 2017 •

edited

Loading

gianm commented Jan 19, 2017 •

edited

Loading

jon-wei commented Jan 25, 2017 •

edited

Loading