Allow EARLIEST/EARLIEST_BY/LATEST/LATEST_BY for STRING columns without specifying maxStringBytes #14848

Merged (17 commits, Aug 23, 2023)
docs/querying/sql-aggregations.md (15 changes: 5 additions & 10 deletions)
@@ -86,16 +86,11 @@ In the aggregation functions supported by Druid, only `COUNT`, `ARRAY_AGG`, and
|`STDDEV_POP(expr)`|Computes standard deviation population of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
|`STDDEV_SAMP(expr)`|Computes standard deviation sample of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
|`STDDEV(expr)`|Computes standard deviation sample of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
|`EARLIEST(expr)`|Returns the earliest value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like `__time` in a Druid datasource), the "earliest" is taken from the row with the overall earliest non-null value of the timestamp column. If the earliest non-null value of the timestamp column appears in multiple rows, the `expr` may be taken from any of those rows. If `expr` does not come from a relation with a timestamp, then it is simply the first value encountered.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
|`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`|
|`EARLIEST_BY(expr, timestampExpr)`|Returns the earliest value of `expr`, which must be numeric. The earliest value of `expr` is taken from the row with the overall earliest non-null value of `timestampExpr`. If the earliest non-null value of `timestampExpr` appears in multiple rows, the `expr` may be taken from any of those rows.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
|`EARLIEST_BY(expr, timestampExpr, maxBytesPerString)`| Like `EARLIEST_BY(expr, timestampExpr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`|
|`LATEST(expr)`|Returns the latest value of `expr`, which must be numeric. The `expr` must come from a relation with a timestamp column (like `__time` in a Druid datasource) and the "latest" is taken from the row with the overall latest non-null value of the timestamp column. If the latest non-null value of the timestamp column appears in multiple rows, the `expr` may be taken from any of those rows. |`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
|`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`|
|`LATEST_BY(expr, timestampExpr)`|Returns the latest value of `expr`, which must be numeric. The latest value of `expr` is taken from the row with the overall latest non-null value of `timestampExpr`. If the overall latest non-null value of `timestampExpr` appears in multiple rows, the `expr` may be taken from any of those rows.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
|`LATEST_BY(expr, timestampExpr, maxBytesPerString)`|Like `LATEST_BY(expr, timestampExpr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`|
|`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be numeric. This aggregator can simplify and optimize the performance by returning the first encountered value (including null)|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`|
|`EARLIEST(expr, [maxBytesPerString])`|Returns the earliest value of `expr`.<br />If `expr` comes from a relation with a timestamp column (like `__time` in a Druid datasource), the "earliest" is taken from the row with the overall earliest non-null value of the timestamp column.<br />If the earliest non-null value of the timestamp column appears in multiple rows, the `expr` may be taken from any of those rows. If `expr` does not come from a relation with a timestamp, then it is simply the first value encountered.<br /><br />If `expr` is a string or complex type, `maxBytesPerString` bytes of space are allocated for the aggregation. Strings longer than this limit are truncated. The `maxBytesPerString` parameter should be set as low as possible, since high values will lead to wasted memory.<br />If `maxBytesPerString` is omitted, it defaults to `1024`.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0` or `''` (depending on the type of `expr`)|
Member:

nit: even though this is only for strings right now, I wonder if we should future-proof this a bit by calling `maxBytesPerString` something like `maxBytesPerValue` instead.

Member Author:

I was thinking of also replacing `maxBytesPerString` everywhere in the main codebase, but it seems like `maxBytesPerString` may also appear in the native API, so I decided against that:
https://druid.apache.org/docs/latest/querying/aggregations/

I believe a change like that may cause some trouble, so I wonder which path we should take:

  • (A) rename those as well
    • I would be worried that doing this would confuse users / native API clients of the system
  • (B) leave them alone
    • current state of the PR
  • (C) undo this rename
    • so the native API stays in line with the docs on this
  • (X) something else? :D

Member:

I think leaving them alone is best, since in that case it is actually specific to the native stringFirst/stringLast aggregators, so it's fine that the JSON properties of that spec are string specific.

|`EARLIEST_BY(expr, timestampExpr, [maxBytesPerString])`|Returns the earliest value of `expr`.<br />The earliest value of `expr` is taken from the row with the overall earliest non-null value of `timestampExpr`.<br />If the earliest non-null value of `timestampExpr` appears in multiple rows, the `expr` may be taken from any of those rows.<br /><br />If `expr` is a string or complex type, `maxBytesPerString` bytes of space are allocated for the aggregation. Strings longer than this limit are truncated. The `maxBytesPerString` parameter should be set as low as possible, since high values will lead to wasted memory.<br />If `maxBytesPerString` is omitted, it defaults to `1024`.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0` or `''` (depending on the type of `expr`)|
|`LATEST(expr, [maxBytesPerString])`|Returns the latest value of `expr`.<br />The `expr` must come from a relation with a timestamp column (like `__time` in a Druid datasource), and the "latest" is taken from the row with the overall latest non-null value of the timestamp column.<br />If the latest non-null value of the timestamp column appears in multiple rows, the `expr` may be taken from any of those rows.<br /><br />If `expr` is a string or complex type, `maxBytesPerString` bytes of space are allocated for the aggregation. Strings longer than this limit are truncated. The `maxBytesPerString` parameter should be set as low as possible, since high values will lead to wasted memory.<br />If `maxBytesPerString` is omitted, it defaults to `1024`.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0` or `''` (depending on the type of `expr`)|
|`LATEST_BY(expr, timestampExpr, [maxBytesPerString])`|Returns the latest value of `expr`.<br />The latest value of `expr` is taken from the row with the overall latest non-null value of `timestampExpr`.<br />If the overall latest non-null value of `timestampExpr` appears in multiple rows, the `expr` may be taken from any of those rows.<br /><br />If `expr` is a string or complex type, `maxBytesPerString` bytes of space are allocated for the aggregation. Strings longer than this limit are truncated. The `maxBytesPerString` parameter should be set as low as possible, since high values will lead to wasted memory.<br />If `maxBytesPerString` is omitted, it defaults to `1024`.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0` or `''` (depending on the type of `expr`)|
|`ANY_VALUE(expr, [maxBytesPerString])`|Returns any value of `expr`, including null. This aggregator can simplify and optimize performance by returning the first value it encounters (including `null`).<br /><br />If `expr` is a string or complex type, `maxBytesPerString` bytes of space are allocated for the aggregation. Strings longer than this limit are truncated. The `maxBytesPerString` parameter should be set as low as possible, since high values will lead to wasted memory.<br />If `maxBytesPerString` is omitted, it defaults to `1024`.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0` or `''` (depending on the type of `expr`)|
|`GROUPING(expr, expr...)`|Returns a number to indicate which groupBy dimension is included in a row, when using `GROUPING SETS`. Refer to [additional documentation](aggregations.md#grouping-aggregator) on how to infer this number.|N/A|
|`ARRAY_AGG(expr, [size])`|Collects all values of `expr` into an ARRAY, including null values, with `size` in bytes limit on aggregation size (default of 1024 bytes). If the aggregated array grows larger than the maximum size in bytes, the query will fail. Use of `ORDER BY` within the `ARRAY_AGG` expression is not currently supported, and the ordering of results within the output array may vary depending on processing order.|`null`|
|`ARRAY_AGG(DISTINCT expr, [size])`|Collects all distinct values of `expr` into an ARRAY, including null values, with `size` in bytes limit on aggregation size (default of 1024 bytes) per aggregate. If the aggregated array grows larger than the maximum size in bytes, the query will fail. Use of `ORDER BY` within the `ARRAY_AGG` expression is not currently supported, and the ordering of results will be based on the default for the element type.|`null`|
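
As a quick illustration of the behavior documented above, here is a minimal, hypothetical JDBC sketch (not part of this PR). It assumes the Avatica JDBC driver is on the classpath, a Druid router is listening at `localhost:8888`, and a `wikipedia` datasource exists with string columns `channel` and `page`:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class LatestStringExample
{
  public static void main(String[] args) throws Exception
  {
    // Hypothetical connection details: a local Druid router exposing the Avatica endpoint.
    String url = "jdbc:avatica:remote:url=http://localhost:8888/druid/v2/sql/avatica/";
    try (Connection connection = DriverManager.getConnection(url);
         Statement statement = connection.createStatement();
         // LATEST/EARLIEST/LATEST_BY/EARLIEST_BY now accept a string column directly;
         // maxBytesPerString is optional and defaults to 1024 bytes.
         ResultSet resultSet = statement.executeQuery(
             "SELECT channel, LATEST(page), EARLIEST_BY(page, __time) "
             + "FROM wikipedia GROUP BY channel")) {
      while (resultSet.next()) {
        System.out.println(resultSet.getString(1) + " -> " + resultSet.getString(2));
      }
    }
  }
}
```

Before this change, `LATEST(page)` on a string column required an explicit `maxBytesPerString` argument; with the 1024-byte default it can be omitted.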
@@ -82,7 +82,7 @@ AggregatorFactory createAggregatorFactory(
String fieldName,
String timeColumn,
ColumnType type,
int maxStringBytes
Integer maxStringBytes
)
{
switch (type.getType()) {
@@ -108,7 +108,7 @@ AggregatorFactory createAggregatorFactory(
String fieldName,
String timeColumn,
ColumnType type,
int maxStringBytes
Integer maxStringBytes
)
{
switch (type.getType()) {
@@ -134,7 +134,7 @@ AggregatorFactory createAggregatorFactory(
String fieldName,
String timeColumn,
ColumnType type,
int maxStringBytes
Integer maxStringBytes
)
{
switch (type.getType()) {
@@ -157,7 +157,7 @@ abstract AggregatorFactory createAggregatorFactory(
String fieldName,
String timeColumn,
ColumnType outputType,
int maxStringBytes
Integer maxStringBytes
);
}
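
The hunks above change `maxStringBytes` from `int` to `Integer` so that an omitted SQL argument can be represented as `null` rather than a sentinel value. A minimal standalone sketch of resolving such a nullable parameter to the documented 1024-byte default (illustrative names only, not Druid's actual helpers):

```java
public final class MaxBytesResolver
{
  // Assumed default, matching the 1024 bytes documented in sql-aggregations.md above.
  private static final int DEFAULT_MAX_STRING_BYTES = 1024;

  private MaxBytesResolver() {}

  /**
   * Returns the configured limit, or the default when the SQL call omitted the argument
   * (represented here as a null Integer).
   */
  public static int resolveMaxStringBytes(Integer maxStringBytes)
  {
    return maxStringBytes != null ? maxStringBytes : DEFAULT_MAX_STRING_BYTES;
  }

  public static void main(String[] args)
  {
    System.out.println(resolveMaxStringBytes(null));  // 1024
    System.out.println(resolveMaxStringBytes(2048));  // 2048
  }
}
```

A boxed `Integer` keeps "not specified" distinguishable from any legitimate byte limit, which a primitive `int` sentinel such as `-1` cannot express as cleanly.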

@@ -236,7 +236,7 @@ public Aggregation toDruidAggregation(
final AggregatorFactory theAggFactory;
switch (args.size()) {
case 1:
theAggFactory = aggregatorType.createAggregatorFactory(aggregatorName, fieldName, null, outputType, -1);
theAggFactory = aggregatorType.createAggregatorFactory(aggregatorName, fieldName, null, outputType, null);
break;
case 2:
int maxStringBytes;
@@ -327,8 +327,7 @@ private static class EarliestLatestSqlAggFunction extends SqlAggFunction
EARLIEST_LATEST_ARG0_RETURN_TYPE_INFERENCE,
InferTypes.RETURN_TYPE,
OperandTypes.or(
OperandTypes.NUMERIC,
OperandTypes.BOOLEAN,
OperandTypes.ANY,
OperandTypes.sequence(
"'" + aggregatorType.name() + "(expr, maxBytesPerString)'",
OperandTypes.ANY,
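
In the hunk above, the single-argument operand checker changes from `NUMERIC`/`BOOLEAN` to `OperandTypes.ANY`, which is what lets `EARLIEST(stringColumn)` and `LATEST(stringColumn)` validate. A rough, hypothetical sketch of composing such a checker with Calcite's `OperandTypes` (not Druid's exact code):

```java
import org.apache.calcite.sql.type.OperandTypes;
import org.apache.calcite.sql.type.SqlOperandTypeChecker;

public class LatestOperandCheckerSketch
{
  public static void main(String[] args)
  {
    // Accept either LATEST(expr) for an operand of any type, or
    // LATEST(expr, maxBytesPerString) where the second operand is numeric.
    SqlOperandTypeChecker checker = OperandTypes.or(
        OperandTypes.ANY,
        OperandTypes.sequence(
            "'LATEST(expr, maxBytesPerString)'",
            OperandTypes.ANY,
            OperandTypes.NUMERIC
        )
    );
    // The composite checker allows between one and two operands.
    System.out.println(checker.getOperandCountRange().getMin()
        + " to " + checker.getOperandCountRange().getMax());
  }
}
```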
@@ -128,7 +128,7 @@ public Aggregation toDruidAggregation(
rexNodes.get(1)
),
outputType,
-1
null
);
break;
case 3:
@@ -13957,13 +13957,13 @@ public void testGreatestFunctionForStringWithIsNull()
.setVirtualColumns(
expressionVirtualColumn(
"v0",
"CAST(greatest(\"dim1\",\"dim2\"), 'DOUBLE')",
ColumnType.DOUBLE
"greatest(\"dim1\",\"dim2\")",
ColumnType.STRING
)
)
.setGranularity(Granularities.ALL)
.addDimension(new DefaultDimensionSpec("l1", "_d0", ColumnType.LONG))
.addAggregator(new DoubleLastAggregatorFactory("a0", "v0", null))
.addAggregator(new StringLastAggregatorFactory("a0", "v0", null, 1024))
.setPostAggregatorSpecs(ImmutableList.of(
expressionPostAgg("p0", "isnull(\"a0\")")
))
@@ -13976,9 +13976,9 @@
new Object[]{325323L, false}
) :
ImmutableList.of(
new Object[]{null, true},
new Object[]{null, false},
new Object[]{0L, false},
new Object[]{7L, true},
new Object[]{7L, false},
new Object[]{325323L, false}
)
);
@@ -14269,4 +14269,36 @@ public void testFilterWithNVLAndNotIn()
)
);
}

@Test
public void testLatestByOnStringColumnWithoutMaxBytesSpecified()
{
String defaultString = useDefault ? "" : null;
cannotVectorize();
testQuery(
"SELECT dim2,LATEST(dim3),LATEST_BY(dim1, __time),EARLIEST(dim3),EARLIEST_BY(dim1, __time),ANY_VALUE(dim3) FROM druid.foo where dim2='abc' group by 1",
ImmutableList.of(
GroupByQuery.builder()
.setDataSource(CalciteTests.DATASOURCE1)
.setInterval(querySegmentSpec(Filtration.eternity()))
.setGranularity(Granularities.ALL)
.setVirtualColumns(
expressionVirtualColumn("v0", "'abc'", ColumnType.STRING))
.setDimFilter(equality("dim2", "abc", ColumnType.STRING))
.setDimensions(
dimensions(new DefaultDimensionSpec("v0", "d0", ColumnType.STRING)))
.setAggregatorSpecs(
aggregators(
new StringLastAggregatorFactory("a0", "dim3", "__time", 1024),
new StringLastAggregatorFactory("a1", "dim1", "__time", 1024),
new StringFirstAggregatorFactory("a2", "dim3", "__time", 1024),
new StringFirstAggregatorFactory("a3", "dim1", "__time", 1024),
new StringAnyAggregatorFactory("a4", "dim3", 1024)))
.build()

),
ImmutableList.of(
new Object[] {"abc", defaultString, "def", defaultString, "def", defaultString}
));
}
}