Allow EARLIEST/EARLIEST_BY/LATEST/LATEST_BY for STRING columns without specifying maxStringBytes #14848

kgyrtkirk · 2023-08-16T12:53:46Z

Earlier:

LATEST(string_col) cast its string argument to double, so the result was 0
LATEST_BY(string_col) returned an error

This change uses the default of 1024 bytes to aggregate strings in these cases, so the buffer size no longer needs to be specified directly in the function call.
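The fallback described above can be sketched roughly as follows. This is a minimal illustration, not Druid's actual code: the class `MaxBytesSketch`, the constant `DEFAULT_MAX_BYTES`, and both method names are hypothetical. Only the 1024-byte default and the idea of truncating longer strings come from this PR. Cutting on a UTF-8 character boundary is an assumption of the sketch.

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical helper, not Druid's actual code: an optional byte limit that
 * falls back to a 1024-byte default, plus truncation to that limit.
 */
final class MaxBytesSketch {
    // Default taken from the PR description; the constant name is made up.
    static final int DEFAULT_MAX_BYTES = 1024;

    /** Returns the requested limit, or the default when none was given. */
    static int resolveMaxBytes(Integer requested) {
        if (requested == null) {
            return DEFAULT_MAX_BYTES;
        }
        if (requested < 0) {
            throw new IllegalArgumentException("maxBytes must be >= 0, got " + requested);
        }
        return requested;
    }

    /** Truncates value to at most maxBytes UTF-8 bytes without splitting a character. */
    static String truncate(String value, int maxBytes) {
        if (value == null) {
            return null;
        }
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return value;
        }
        // Step back while the cut would land on a UTF-8 continuation byte (10xxxxxx).
        int end = maxBytes;
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
            end--;
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }
}
```

Using a nullable `Integer` for "not specified" avoids a sentinel such as -1, which would be rejected by a non-negative check.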

Release note:

The EARLIEST/EARLIEST_BY/LATEST/LATEST_BY SQL aggregates no longer interpret string arguments as double; instead, they configure the aggregation with a 1024-byte buffer.

This PR has:

been self-reviewed.
added documentation for new or modified features or behaviors.
a release note entry in the PR description.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.


abhishekagarwal87 · 2023-08-16T12:55:23Z

sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java

@@ -14269,4 +14269,36 @@ public void testFilterWithNVLAndNotIn()
        )
    );
  }
+
+  @Test
+  public void testLatestBy() {


test name is a bit too generic

totally agree - renamed it to testLatestByWithoutMaxBytes at first but decided to go with testLatestByOnStringColumnWithoutMaxBytesSpecified to be more specific

abhishekagarwal87 · 2023-08-16T14:43:58Z

thanks @kgyrtkirk. LGTM.

somu-imply

Looks good. Thanks for this change, will make this more customer friendly

clintropolis

overall lgtm 👍

docs/querying/sql-aggregations.md

clintropolis

I think it's probably fine to consolidate the doc entries for these functions. Should we also adjust the entries on the function reference page https://github.com/apache/druid/blob/master/docs/querying/sql-functions.md#earliest to combine them?

clintropolis · 2023-08-21T21:56:51Z

docs/querying/sql-aggregations.md

-|`LATEST_BY(expr, timestampExpr, maxBytesPerString)`|Like `LATEST_BY(expr, timestampExpr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`|
-|`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be numeric. This aggregator can simplify and optimize the performance by returning the first encountered value (including null)|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
-|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`|
+|`EARLIEST(expr, [maxBytesPerString])`|Returns the earliest value of `expr`.<br />If `expr` comes from a relation with a timestamp column (like `__time` in a Druid datasource), the "earliest" is taken from the row with the overall earliest non-null value of the timestamp column.<br />If the earliest non-null value of the timestamp column appears in multiple rows, the `expr` may be taken from any of those rows. If `expr` does not come from a relation with a timestamp, then it is simply the first value encountered.<br /><br />If `expr` is a string or complex type `maxBytesPerString` amount of space is allocated for the aggregation. Strings longer than this limit are truncated. The  `maxBytesPerString` parameter should be set as low as possible, since high values will lead to wasted memory.<br/>If `maxBytesPerString`is omitted; it defaults to `1024`. |`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0` or `''` (depending on the type of `expr`)|
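As a rough illustration of the row-selection rule in the entry above, the following sketch picks the value from the row with the earliest (or latest) non-null timestamp. It is not Druid's aggregator implementation: `EarliestLatestSketch`, `Row`, and `pick` are hypothetical names. The sketch leaves out truncation to `maxBytesPerString` and the "first value encountered" case for relations without a timestamp. On ties it keeps the first row seen, which is one of the outcomes the documentation allows.

```java
import java.util.List;

/** Hypothetical illustration of EARLIEST/LATEST row selection; not Druid code. */
final class EarliestLatestSketch {

    /** A row with a nullable timestamp and a string value. */
    static final class Row {
        final Long timestamp;
        final String value;

        Row(Long timestamp, String value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /**
     * Returns the value from the row with the earliest non-null timestamp,
     * or the latest one when {@code latest} is true. Returns null if no row
     * has a timestamp.
     */
    static String pick(List<Row> rows, boolean latest) {
        Row best = null;
        for (Row row : rows) {
            if (row.timestamp == null) {
                continue; // rows without a timestamp never win
            }
            if (best == null) {
                best = row;
                continue;
            }
            int cmp = Long.compare(row.timestamp, best.timestamp);
            // Strict comparison: on a tie, the first row seen is kept.
            if (latest ? cmp > 0 : cmp < 0) {
                best = row;
            }
        }
        return best == null ? null : best.value;
    }
}
```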


nit: even though this is only for strings right now, i wonder if we should future-proof this a bit by calling maxBytesPerString something like maxBytesPerValue instead

I was thinking of also replacing maxBytesPerString everywhere in the main codebase, but it seems that maxBytesPerString may also appear in the native API, so I decided against that:
https://druid.apache.org/docs/latest/querying/aggregations/

I believe a change like that may cause some trouble, so I wonder which path we should take:

(A) rename those as well

I would be worried that doing this would confuse users and native API clients of the system

(B) leave them alone

current state of the PR

(C) undo this rename

so the native api is in line with the docs regarding this

(X) something else? :D

I think leaving them alone is best, since in that case it is actually specific to the native stringFirst/stringLast aggregators, so it's fine that the JSON properties of that spec are string-specific.

kgyrtkirk added 4 commits August 16, 2023 10:52

PR#8815 added some check that negative values are not accepted; however the interfaces were for `int` and so -1 was passed

1444487

accept any for earliest/latest

110d8c0

extend test

90d10db

fix test

33b8129

abhishekagarwal87 reviewed Aug 16, 2023


kgyrtkirk added 3 commits August 16, 2023 13:21

cleanup test; fix checkstyle

905508d

more specific test name

1efcb64

accept removal of cast from plan

83402e5

abhishekagarwal87 approved these changes Aug 16, 2023


abhishekagarwal87 added the Area - Querying label Aug 16, 2023

update test; update docs

cf3ed18

github-actions bot added the Area - Documentation label Aug 16, 2023

somu-imply approved these changes Aug 16, 2023


kgyrtkirk marked this pull request as ready for review August 16, 2023 20:28

clintropolis reviewed Aug 17, 2023


clintropolis added the Release Notes label Aug 17, 2023

kgyrtkirk added 6 commits August 17, 2023 10:53

extend test; disable vectorization - no support for earliest

d0f5487

add test for ANY

3b7eaf9

update docs

ee73a8b

fix br

e41893a

remove ws

b75be8e

Merge remote-tracking branch 'apache/master' into latest-by-error

11c0fd8

soumyava approved these changes Aug 21, 2023


clintropolis approved these changes Aug 21, 2023


kgyrtkirk added 3 commits August 22, 2023 09:11

merge

8f2691a

rename maxBytesPerString to maxBytesPerValue

9908504

update docs/querying/sql-functions.md

531328c

clintropolis approved these changes Aug 23, 2023


clintropolis merged commit e806d09 into apache:master Aug 23, 2023

LakshSingla added this to the 28.0 milestone Oct 12, 2023

LakshSingla mentioned this pull request Nov 4, 2023

[DRAFT] 28.0.0 release notes #15326

Closed
