Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow EARLIEST/EARLIEST_BY/LATEST/LATEST_BY for STRING columns without specifying maxStringBytes #14848

Merged
merged 17 commits into from
Aug 23, 2023

Conversation

kgyrtkirk
Copy link
Member

@kgyrtkirk kgyrtkirk commented Aug 16, 2023

earlier:

  • LATEST(string_col) have casted their string argument to double ; so that it resulted in 0
  • LATEST_BY(string_col) returned an error

This change enables to use the default 1024 bytes to aggregate strings in these cases without the need to specify the buffersize directly for the function.

Release note:

The EARLIEST/EARLIEST_BY/LATEST/LATEST_BY SQL aggregates no longer interpret string arguments as double - instead they configure the aggregation with a 1024 sized buffer.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

@@ -14269,4 +14269,36 @@ public void testFilterWithNVLAndNotIn()
)
);
}

@Test
public void testLatestBy() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test name is a bit too generic

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

totally agree - renamed it to testLatestByWithoutMaxBytes at first but decided to go with testLatestByOnStringColumnWithoutMaxBytesSpecified to be more specific

@abhishekagarwal87
Copy link
Contributor

thanks @kgyrtkirk. LGTM.

Copy link
Contributor

@somu-imply somu-imply left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks for this change, will make this more customer friendly

@kgyrtkirk kgyrtkirk marked this pull request as ready for review August 16, 2023 20:28
Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm 👍

docs/querying/sql-aggregations.md Outdated Show resolved Hide resolved
docs/querying/sql-aggregations.md Outdated Show resolved Hide resolved
Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think its probably fine to consolidate the doc entries for these functions. Should we also adjust the entries on the function reference page https://github.com/apache/druid/blob/master/docs/querying/sql-functions.md#earliest to combine them?

|`LATEST_BY(expr, timestampExpr, maxBytesPerString)`|Like `LATEST_BY(expr, timestampExpr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`|
|`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be numeric. This aggregator can simplify and optimize the performance by returning the first encountered value (including null)|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`|
|`EARLIEST(expr, [maxBytesPerString])`|Returns the earliest value of `expr`.<br />If `expr` comes from a relation with a timestamp column (like `__time` in a Druid datasource), the "earliest" is taken from the row with the overall earliest non-null value of the timestamp column.<br />If the earliest non-null value of the timestamp column appears in multiple rows, the `expr` may be taken from any of those rows. If `expr` does not come from a relation with a timestamp, then it is simply the first value encountered.<br /><br />If `expr` is a string or complex type `maxBytesPerString` amount of space is allocated for the aggregation. Strings longer than this limit are truncated. The `maxBytesPerString` parameter should be set as low as possible, since high values will lead to wasted memory.<br/>If `maxBytesPerString`is omitted; it defaults to `1024`. |`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0` or `''` (depending on the type of `expr`)|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: even though this is only for strings right now, i wonder if we should future-proof this a bit by calling maxBytesPerString something like maxBytesPerValue instead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking to also replace maxBytesPerString everywhere in the main codebase - but it seems like this maxBytesPerString may also appear in the native api...so decided against that:
https://druid.apache.org/docs/latest/querying/aggregations/

I believe a change like that may cause some troubles - so I wonder which path should we take:

  • (A) rename those as well
    • I would be worried that by doing this would cause some confusion to users / native api client using the system
  • (B) leave them alone
    • current state of the PR
  • (C) undo this rename
    • so the native api is in line with the docs regarding this
  • (X) something else? :D

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think leaving them alone is best since in that case it is actually specific to the native stringFirst/stringLast aggregators so its fine that the json properties of that spec are string specific.

@clintropolis clintropolis merged commit e806d09 into apache:master Aug 23, 2023
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants