-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow EARLIEST/EARLIEST_BY/LATEST/LATEST_BY for STRING columns without specifying maxStringBytes #14848
Conversation
…er the interfaces were for `int` and so -1 was passed
@@ -14269,4 +14269,36 @@ public void testFilterWithNVLAndNotIn() | |||
) | |||
); | |||
} | |||
|
|||
@Test | |||
public void testLatestBy() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test name is a bit too generic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
totally agree - renamed it to testLatestByWithoutMaxBytes
at first but decided to go with testLatestByOnStringColumnWithoutMaxBytesSpecified
to be more specific
thanks @kgyrtkirk. LGTM. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks for this change, will make this more customer friendly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall lgtm 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think its probably fine to consolidate the doc entries for these functions. Should we also adjust the entries on the function reference page https://github.com/apache/druid/blob/master/docs/querying/sql-functions.md#earliest to combine them?
docs/querying/sql-aggregations.md
Outdated
|`LATEST_BY(expr, timestampExpr, maxBytesPerString)`|Like `LATEST_BY(expr, timestampExpr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`| | ||
|`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be numeric. This aggregator can simplify and optimize the performance by returning the first encountered value (including null)|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`| | ||
|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`| | ||
|`EARLIEST(expr, [maxBytesPerString])`|Returns the earliest value of `expr`.<br />If `expr` comes from a relation with a timestamp column (like `__time` in a Druid datasource), the "earliest" is taken from the row with the overall earliest non-null value of the timestamp column.<br />If the earliest non-null value of the timestamp column appears in multiple rows, the `expr` may be taken from any of those rows. If `expr` does not come from a relation with a timestamp, then it is simply the first value encountered.<br /><br />If `expr` is a string or complex type `maxBytesPerString` amount of space is allocated for the aggregation. Strings longer than this limit are truncated. The `maxBytesPerString` parameter should be set as low as possible, since high values will lead to wasted memory.<br/>If `maxBytesPerString`is omitted; it defaults to `1024`. |`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0` or `''` (depending on the type of `expr`)| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: even though this is only for strings right now, i wonder if we should future-proof this a bit by calling maxBytesPerString
something like maxBytesPerValue
instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking to also replace maxBytesPerString
everywhere in the main codebase - but it seems like this maxBytesPerString
may also appear in the native
api...so decided against that:
https://druid.apache.org/docs/latest/querying/aggregations/
I believe a change like that may cause some troubles - so I wonder which path should we take:
- (A) rename those as well
- I would be worried that by doing this would cause some confusion to users / native api client using the system
- (B) leave them alone
- current state of the PR
- (C) undo this rename
- so the native api is in line with the docs regarding this
- (X) something else? :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think leaving them alone is best since in that case it is actually specific to the native stringFirst/stringLast aggregators so its fine that the json properties of that spec are string specific.
earlier:
LATEST(string_col)
have casted their string argument to double ; so that it resulted in0
LATEST_BY(string_col)
returned an errorThis change enables to use the default 1024 bytes to aggregate strings in these cases without the need to specify the buffersize directly for the function.
Release note:
The
EARLIEST/EARLIEST_BY/LATEST/LATEST_BY
SQL aggregates no longer interpret string arguments as double - instead they configure the aggregation with a 1024 sized buffer.This PR has: