Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix string first/last aggregator comparator #12773

Merged
merged 2 commits into from
Aug 1, 2022

Conversation

rohangarg
Copy link
Member

Currently, the first/last string aggregator's getComparator method (used in topN ranking, grouping by ordering, post agg ordering) compares two aggregator values on __time first and then on values. This change makes that comparator to compare only on values which is the correct behavior. The numerical first/last aggregators already compare on the values.

The current users who are ordering their results on the string first/last aggregator would need to augment their queries a max(timeCol) aggreagtor to achieve the same results.
For instance, a query like SELECT stringCol2, LATEST(stringCol, 1024) as lat FROM table GROUP BY stringCol2 ORDER BY lat would need to be rewritten as SELECT stringCol2, LATEST(stringCol, 1024) as lat FROM table GROUP BY stringCol2 ORDER BY MAX(timeCol)

The ingestion ordering is kept as is it since it doesn't impact the query results.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable, though I think it also changes ingest time behavior so that columns ingested after this change might produce different segments with the same input data than prior to the changes here?

I tagged this with release notes and incompatible labels so we are sure to document these behavior changes in the next release. I don't personally think we need to retain backwards compatible behavior, but it might be worth getting some additional approval.

Comment on lines 53 to 59
if (o1 == null && o2 == null) {
comparation = 0;
} else if (o1 == null) {
comparation = -1;
} else if (o2 == null) {
comparation = 1;
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, should we just wrap this comparator in Comparator.nullsFirst so that this part only has to deal with non-null pairs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the pointer, re-wrote the comparators in a consice manner 👍

regarding the ingestion part, I've kept the ingestion comparator to be the same as before. Further, I'm not sure if we provide ability to order rows by metric values. In any case, I think the effect should only be at query time.

Copy link
Contributor

@imply-cheddar imply-cheddar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. And, we should document the change in behavior. I think that the change won't actually be significant in terms of impacting anyone, but we should call it out in case it is and, indicate that anyone who cares about the old behavior should be doing an ORDER+LIMIT on the MAX(__time).

I do worry a little bit about the virtual function call overhead of the composed-chained Comparator versus a method that just does the work, but we can let profiling tell us that it's a bottleneck before we go on the warpath.

@rohangarg rohangarg merged commit 7ae6cc6 into apache:master Aug 1, 2022
@abhishekagarwal87 abhishekagarwal87 added this to the 24.0.0 milestone Aug 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants