-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add long/float ColumnSelectorStrategy implementations #3838
Conversation
6919b7f
to
24394e3
Compare
SingleScanTimeDimSelector has the same problem. IMO, the query engines should maintain String dictionaries on their own if the selector doesn't have one, which it can report by returning |
Do you have benchmarks on grouping on high-cardinality numeric columns as strings vs. as longs? (same data, but stored/grouped as string vs. stored/grouped as long). I suspect that this will be a much bigger difference than 10%, and if so, given that:
I'm willing to say that the change is worth including. |
Also |
I considered code generation when worked on #3798. Decided it's too difficult and changes the current source tree too much. So I suggest to resolve this by expanding #3798 to groupBy queries and other types of queries other than topN. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR review until TopNBinaryFn excluding
@Override | ||
public DruidFloatPredicate makeFloatPredicate() | ||
{ | ||
return new DruidFloatPredicate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please extract a singleton
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using singletons for these false predicates now, in DruidLongPredicate/DruidFloatPredicate
@Override | ||
public DruidFloatPredicate makeFloatPredicate() | ||
{ | ||
return new DruidFloatPredicate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please extract a singleton
@@ -67,6 +70,61 @@ | |||
return parseInt(query, "uncoveredIntervalsLimit", defaultValue); | |||
} | |||
|
|||
public static <T> Map<String, Object> getContextTypeHints(Query<T> query) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't it accept Query<?>
without a method type parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed this function now, based on @gianm's comment about moving the type specification to DimensionSpec
return query.getContextValue(QueryContextKeys.TYPE_HINTS); | ||
} | ||
|
||
public static <T> Map<String, ValueType> getContextTypeHints(Query<T> query, List<DimensionSpec> dimensions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't it accept Query<?>
without a method type parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed this function now, based on @gianm's comment about moving the type specification to DimensionSpec
} | ||
} | ||
|
||
private static <T> Map<String, ValueType> convertTypeHintsValues(Map<String, Object> oldMap) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<T>
is not used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Function was removed
@@ -137,6 +139,34 @@ public boolean applyLong(long input) | |||
}; | |||
} | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the method which ends with this brace Predicates.equalTo()
and singleton could be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the String predicate here to use Predicates.equalTo()
and replaced the always false predicates with singletons
@@ -225,4 +255,18 @@ private void initLongValue() | |||
longsInitialized = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If caching resulting predicate directly instead of valueAsLong
(it's possible because predicates are immutable), longsInitialized
is not needed. Also, locking could be removed and benign race allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cached the predicate instead
int keyBufferPosition, Object rowObj, int rowValIdx, ByteBuffer keyBuffer | ||
) | ||
{ | ||
// rows from a long column are always size 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is not clear to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is saying that the constant "1" below is the number of values in a row. It's 1 because long columns can't be multi-value. Maybe this comment would be clearer if it was directly above the return rowValIdx < 1;
line? Or if that constant was given a name like LongColumnSelector.ROW_SIZE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reworded the comment to mention multivalue support, and replaced the constant with a named one
int keyBufferPosition, Object rowObj, int rowValIdx, ByteBuffer keyBuffer | ||
) | ||
{ | ||
// rows from a float column are always size 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is not clear to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment to above
@@ -74,7 +74,10 @@ public void run( | |||
{ | |||
boolean hasDimValSelector = (dimValSelector != null); | |||
|
|||
final int cardinality = params.getCardinality(); | |||
int cardinality = params.getCardinality(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change should be a part of this PR? cardinality < 0 probably means that completely different approach should be taken to compute topN, setting just to Integer.MAX_VALUE
may result in a lot of excessive work and allocations in run()
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if this is really necessary. It looks like this code path may not get executed for non-string selectors. If so, there should be no need to reset it here. If there is a need to reset it here, that seems suspicious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed BaseTopNAlgorithm to have a "with cardinality" path and "no cardinality" path, the "no cardinality" path doesn't use DimValSelector
(current implementations need cardinality) and the multi-pass logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Complete review
if (input == null) { | ||
return null; | ||
} | ||
if (input instanceof String) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block is redundant, String.toString() returns this
@@ -43,10 +43,6 @@ protected TopNParams( | |||
this.cursor = cursor; | |||
this.cardinality = selectorPlus.getColumnSelectorStrategy().getCardinality(selectorPlus.getSelector()); | |||
this.numValuesPerPass = numValuesPerPass; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to some change above, not sure this should be a part of this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still need to be here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is removed since the TopNParams can now hold CARDINALITY_UNKNOWN for certain dimension selectors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add this check to anything that calls getCardinality()
, getSelectorPlus()
, or getDimSelector()
and wouldn't be able to handle the CARDINALITY_UNKNOWN case? That's a useful sanity check to make sure that the code is routing the proper cases to the proper algorithms.
@@ -46,6 +46,10 @@ public int getCardinality(DimensionSelector selector) | |||
// Optimization possibly requires a reverse lookup from value to ID, which is | |||
// not possible when applying an extraction function | |||
|
|||
if (params.getCardinality() < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And changes in this class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the check, no longer needed with separate code path for "cardinality unknown" case
capabilities = DEFAULT_STRING_CAPABILITIES; | ||
} | ||
|
||
// Currently, all extractionFns output Strings, so force the type if present |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"force the type if present" -- what does it mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed comment to:
"Currently, all extractionFns output Strings, so the column will return String values via a DimensionSelector if an extractionFn is present."
return ((Number) valObj).longValue(); | ||
} else if (valObj instanceof String) { | ||
try { | ||
String s = ((String) valObj).replace(",", ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What removing commas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took this parsing code from getLongMetric/getFloatMetric
in MapBasedRow
initially, I kept the comma removal since these transform functions would be used on String columns with "numeric" data, and figured it would be useful to allow that format
I don't mind removing this if someone is against it though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's puzzling. I think it's worth to understand why it was added in the first place, and if the reason is still valid, to add a comment on this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I suppose the purpose is to remove 1000s delimiter from numbers, e. g. 1,000,000
is 1000000
. But not in all locales, e. g. in Russian locale comma is a separator between integer part and fractional part in non-integer numbers.
IMO this is "too smart" and we shouldn't do this. Probably it should be discussed separately, not in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, good point on the locales, I agree this is too smart, I've removed the comma removal and replaced these blocks with GuavaUtils.tryParseLong
and Floats.tryParse
// TopNBinaryFn relies on this returning the same Float object when no conversion is needed | ||
public static Float convertObjectToFloat(Object valObj) | ||
{ | ||
if (valObj == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why there is no similar check in convertObjectToLong()
?
return ((Number) valObj).floatValue(); | ||
} else if (valObj instanceof String) { | ||
try { | ||
return Float.valueOf(((String) valObj).replace(",", "")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why removing commas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can imagine it could become a source or bugs, because in some Locales comma is used as a separator between integer and fractional parts. So removing commas from "1,0"
will silently produce 10.0f
instead of failing with exception.
{ | ||
private final FloatColumnSelector delegate; | ||
private final ExtractionFn exFn; | ||
private final Map<String, Integer> valueToId = Maps.newHashMap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given how DimensionSelectors are used, accumulating valueToId
and idToValue
doesn't make much sense. I agree with @gianm that users of DimensionSelectors are in a better position to do such caching. If #3858 is merged first, this FloatWrappingDimensionSelector
should return null
from idLookup()
, false
from nameLookupPossibleInAdvance()
, ZeroIndexedInts
from getRow()
and perform float -> String conversion in lookupName()
. However, if FloatColumnSelector
has equivalents of DimensionSelector's idLookup()
, lookupName()
and getValueCardinality()
, FloatWrappingDimensionSelector
could be optimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree on having the dictionary management reside in the query engines and other users, I'm in favor of getting #3858 merged first so I'll hold off on changing this part for now
import java.util.List; | ||
import java.util.Map; | ||
|
||
public class LongWrappingDimensionSelector implements DimensionSelector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comments as in FloatWrappingDimensionSelector
@Override | ||
public DruidFloatPredicate makeFloatPredicate() | ||
{ | ||
return new DruidFloatPredicate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace with a singleton
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace with a singleton
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whoops, missed that the first time, thanks
} | ||
} | ||
|
||
private void initPredicate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I'm losing the point of lazy initialization here, as well several similar places below. Maybe it's simpler to compute build a predicate eagerly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stopped on StringTopNColumnSelectorStrategy.java, will continue later
@@ -21,4 +21,4 @@ The query context is used for various query configuration parameters. | |||
|`maxResults`|500000|Maximum number of results groupBy query can process. Default value used can be changed by `druid.query.groupBy.maxResults` in druid configuration at broker and historical nodes. At query time you can only lower the value.| | |||
|`maxIntermediateRows`|50000|Maximum number of intermediate rows while processing single segment for groupBy query. Default value used can be changed by `druid.query.groupBy.maxIntermediateRows` in druid configuration at broker and historical nodes. At query time you can only lower the value.| | |||
|`groupByIsSingleThreaded`|false|Whether to run single threaded group By queries. Default value used can be changed by `druid.query.groupBy.singleThreaded` in druid configuration at historical nodes.| | |||
|
|||
|`typeHints`|{}| A map of column name -> column type (String, Long, Float). By default, druid returns all column values as strings within query results. If querying on a non-String column, `typeHints` must be included in a query, containing a mapping of the name of the non-String column to its desired return type. This is necessary because columns with the same name in different segments do not necessarily have the same value type, and a type must be chosen when merging results.| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be part of DimensionSpec rather than the query context. Usually the query context is tuning-type stuff or minor behavior modifications, but type hints are more than that.
For this part:
If querying on a non-String column,
typeHints
must be included in a query
Is that true or does the type default to string if not specified? The current language suggests that if typeHints are not included then the query will fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed typeHints and added an outputType field to DimensionSpec, docs have been updated accordingly
docs/content/querying/querying.md
Outdated
@@ -99,3 +99,21 @@ Possible codes for the *error* field include: | |||
|`Query cancelled`|The query was cancelled through the query cancellation API.| | |||
|`Resource limit exceeded`|The query exceeded a configured resource limit (e.g. groupBy maxResults).| | |||
|`Unknown exception`|Some other exception occurred. Check errorMessage and errorClass for details, although keep in mind that the contents of those fields are free-form and may change from release to release.| | |||
|
|||
|
|||
Column Types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
## Column Types
(or however many #'s) is more normal
@Override | ||
public void hashValues(LongColumnSelector dimSelector, HyperLogLogCollector collector) | ||
{ | ||
collector.add(CardinalityAggregator.hashFn.hashLong(dimSelector.get()).asBytes()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's because hashFn doesn't have a hashFloat
method. But you could do hashFn.hashInt(Float.floatToIntBits(f))
.
@Override | ||
public void hashValues(FloatColumnSelector dimSelector, HyperLogLogCollector collector) | ||
{ | ||
collector.add(CardinalityAggregator.hashFn.hashUnencodedChars(String.valueOf(dimSelector.get())).asBytes()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hashFn.hashInt(Float.floatToIntBits(dimSelector.get()))
should be faster than encoding the float as a string.
private final boolean hasUpperLongBound = BoundLongPredicateSupplier.this.hasUpperLongBound; | ||
private final long lowerLongBound = hasLowerLongBound ? BoundLongPredicateSupplier.this.lowerLongBound : 0L; | ||
private final long upperLongBound = hasUpperLongBound ? BoundLongPredicateSupplier.this.upperLongBound : 0L; | ||
synchronized (initLock) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@leventov the predicate is not trivial to compute (it involves some string parsing) – do you think it's likely that lock savings from allowing the race outweighs the cost of potentially computing it multiple times?
|
||
TopNResultValue arg1Vals = arg1.getValue(); | ||
TopNResultValue arg2Vals = arg2.getValue(); | ||
|
||
for (DimensionAndMetricValueExtractor arg1Val : arg1Vals) { | ||
retVals.put(arg1Val.getStringDimensionValue(dimension), arg1Val); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the getStringDimensionValue()
method still needed? if not, get rid of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wasn't used anymore, I've removed getStringDimensionValue
now
} | ||
} | ||
|
||
private DimensionAndMetricValueExtractor getTransformedExtractor(DimensionAndMetricValueExtractor argVal) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TopNBinaryFn isn't the right place to do this, it should happen when the TopNResultValues are built in the first place. Partially this is for efficiency but also it's because the binary fn doesn't always fire, so you can't rely on it for transformations that need to happen. If there's only one result value (like a topN over a single segment) then it won't fire.
So putting this in TopNMapFn would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the transform functions to TopNMapFn
, the transformation is now applied within DimExtractionTopNAlgorithm
as it adds results to the result builder
@Override | ||
public int getCardinality(FloatColumnSelector selector) | ||
{ | ||
return -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps pull this out into a TopNColumnSelectorStrategy.CARDINALITY_UNKNOWN
constant for clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added TopNColumnSelectorStrategy.CARDINALITY_UNKNOWN
getContextTypeHints(query) | ||
); | ||
|
||
if (dimensions != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, this logic should be part of the DimensionSpec, maybe a method like dimensionSpec.getOutputType()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this to DimensionSpec outputType
DimensionSelector selector, | ||
Cursor cursor, | ||
Aggregator[][] rowSelector, | ||
Map<Comparable, Aggregator[]> aggregatesStore | ||
) | ||
{ | ||
final IndexedInts dimValues = selector.getRow(); | ||
if (selector.getValueCardinality() < 0) { | ||
dimExtractionScanAndAggregateNoCardinality(selector, rowSelector, aggregatesStore, cursor, query); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I guess the purpose of this is to support extractionFns on longs/floats?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This'll probably need some reconciling with #3858.
77f3e88
to
4fd0ecb
Compare
Updated the patch, still waiting on benchmark results comparing performance of a groupBy on a String column with numeric data vs. its equivalent long and float versions I've left the filter predicate locking/lazy initialization for now since we haven't reached a consensus on that yet, addressed the other comments in that area though I'm in favor of getting #3858 merged before this PR, so I'll revisit the FloatWrapping/LongWrapping selectors after that The latest update on this PR also changes the hash function used in GroupByV2, since benchmarks showed very poor hashing behavior (leading to ~50 seconds running time on the |
@@ -76,6 +76,21 @@ public static Builder builder() | |||
return new Builder(); | |||
} | |||
|
|||
public static ValueType getValueTypeForSqlTypeName(SqlTypeName sqlTypeName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move this to "Calcites" where most of the random utility functions should go.
throw new ISE("Cannot translate sqlTypeName[%s] to Druid type for field[%s]", sqlTypeName, name); | ||
} | ||
|
||
final DimensionSpec dimensionSpec = rex.toDimensionSpec(rowSignature, null, null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to pass ValueType.STRING
in here explicitly and check for non-null outputType in toDimensionSpec.
public DimensionSpec toDimensionSpec( | ||
final RowSignature rowSignature, | ||
final String outputName, | ||
final ValueType outputType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should have a Preconditions.checkNotNull(outputType, "outputType")
to prevent bugs due to accidentally forgetting to pass in an outputType. Callers should pass in STRING if that's what they want.
List<Row> expectedResults; | ||
|
||
if (config.getDefaultStrategy().equals(GroupByStrategySelector.STRATEGY_V1)) { | ||
expectedResults = Arrays.asList( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should throw an exception. In general, groupBy v1 (which doesn't support outputType other than STRING) should throw an exception if outputType is anything other than STRING. The message should be something like "groupBy v1 only supports outputType STRING" which will give people a little push to use groupBy v2.
final StringComparator naturalComparator; | ||
if (columnType == ValueType.STRING) { | ||
naturalComparator = StringComparators.LEXICOGRAPHIC; | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If outputType is COMPLEX
or null
this is not true. I know this is "impossible" but please check for it anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good way to do this is have an else { sortingNeeded = true; break; }
@jon-wei please also update documentations:
|
679d671
to
5ebc21a
Compare
@@ -455,10 +456,17 @@ private DimensionSelector makeDimensionSelectorUndecorated( | |||
} | |||
|
|||
if (columnDesc.getCapabilities().getType() == ValueType.LONG) { | |||
// Filtered dimension specs are not supported on numerics | |||
if (dimensionSpec instanceof BaseFilteredDimensionSpec) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, these instanceof BaseFilteredDimensionSpec
checks everywhere are kind of gnarly. There's way too many of them for comfort. Is there something better we can do? Maybe making them work is a better option if it's not too hard… why don't they, again? They should be able to decorate a wrapped dimension selector just fine, right?
Is it just that we don't know if a DimensionSpec's "decorate" method is actually going to get called, because getColumnValueSelectorFromDimensionSpec might skip it by going and making something like a LongColumnSelector instead? If so, we could address that by adding a "mustDecorate" method to DimensionSpec, and if it mustDecorate, then we force creation of a DimensionSelector (rather than whatever type the underlying column is).
I would also accept alternate solutions that don't involve instanceof checks all over the place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 after travis
This PR adds long and float implementations of the query processing ColumnSelectorStrategy interface introduced in #3570.
This allows users to include non-String columns as dimensions within queries by specifying a
typeHints
map in the query context. This is described in more detail in the doc changes within the PR.For benchmarking, the patch also adds a set of "simple" schemas to the benchmark schemas, which has a single dimension "dimSequential" and a row count aggregator, where "dimSequential" contains an ascending sequence of integer values, encoded as String/Long/Float. Results for the benchmarks are located at:
https://gist.github.com/jon-wei/44a2b8ab97678710a9ef5b24aa107dec
NOTE:
NOTE:
This set of benchmarks attempts determine the running time for a group by query against single string/long/float columns (monomorphic base case), and running times for sequences of queries that use more than one type implementation.
For example, StringThenLong runs the string query followed by the long query within the same benchmark.
There appears to be some overhead when multiple type implementations are called, e.g.:
LongOnly = 17920.575us
StringOnly = 22801.356us
which should sum to ~40000us for both.
StringThenLong however has a running time of 444720.358us, about 10% higher than expected.
The difference bigger for NumericOnly (query on both a long and float) and NumericThenString:
NumericOnly = 20979.666
StringOnly = 22801.356
Sum should be ~44000, but NumericThenString and StringThenNumeric are 52961.866 and 51805.316 respectively.
The difference between StringOnly and StringTwice is smaller, suggesting the overhead may be related to the polymorphism. However, the behavior is not seen for long/float only, e.g. LongTwice is very close to LongThenFloat and (LongOnly + FloatOnly)
Similar benchmarks for TopN don't show the same discrepancy:
I am looking into the GroupBy performance discrepancy more and considering other possible approaches (e.g., some form of code generation), but opening this PR now to get some thoughts from the community on these aspects.