Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unnest now works on MSQ #14886

Merged
merged 31 commits into from
Sep 25, 2023
Merged

Unnest now works on MSQ #14886

merged 31 commits into from
Sep 25, 2023

Conversation

somu-imply
Copy link
Contributor

@somu-imply somu-imply commented Aug 21, 2023

This entails:

  1. Removing the enableUnnest flag and additional machinery
  2. Updating the datasource plan and frame processors to support unnest
  3. Adding support in MSQ for UnnestDataSource and FilteredDataSource
  4. CalciteArrayTest now has a MSQ test component
  5. Additional tests for Unnest on MSQ
  6. Removing engine feature of unnest alongside (1)

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@soumyava soumyava added Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 WIP labels Aug 23, 2023
@somu-imply somu-imply changed the title Unnest msq Unnest now works on MSQ Aug 24, 2023
@somu-imply somu-imply marked this pull request as ready for review August 29, 2023 15:27
Copy link
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Left some comments.

// An UnnestDataSource or FilteredDataSource can have a join as a base
// In such a case a side channel is expected to be there
if (!(dataSource instanceof JoinDataSource
|| dataSource instanceof UnnestDataSource
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnest seems to have 2 data sources FilteredDS and UnnestDS. Could you please help me understand a bit more here.
Seems like the filter could have easily been pushed to the unnest data source. Is there are reason not to ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: If we are keeping it as is, then there should be a better pattern than doing an instanceof for all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question:
What happens if we have a simple unnest query like -

SELECT * FROM tab1,UNNEST(col1)

What is expected to be in the side channel then? Seems like that they should be empty in this case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a Filtered data source is a data source with a filter. It was introduced for an issue where unnesting over a data source with a selector filter was planned as a QueryDataSource instead of a table data source. A FilteredDS is basically a base data source with a filter. It pushes a filter to the data source and not requiring a query data source in between does not need results to be materialized to the broker

Copy link
Contributor

@LakshSingla LakshSingla Sep 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, should we just check that the delegate of the FilteredDataSource is not a JoinDataSource or an UnnestDataSource, instead of doing the check on the FilteredDataSource as a whole?
This will help if the FilteredDataSource evolves to have a delegate that is something other than the unnest data source.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think, we just need to check if the FilteredDataSource or UnnestDataSource has a Join as its base

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be recursive which we already do. It would be good if we could cleanly merge this with the existing code path, that recurses on the data source's children, though not necessary.

// An UnnestDataSource or FilteredDataSource can have a join as a base
// In such a case a side channel is expected to be there
if (!(dataSource instanceof JoinDataSource
|| dataSource instanceof UnnestDataSource
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: If we are keeping it as is, then there should be a better pattern than doing an instanceof for all.

// An UnnestDataSource or FilteredDataSource can have a join as a base
// In such a case a side channel is expected to be there
if (!(dataSource instanceof JoinDataSource
|| dataSource instanceof UnnestDataSource
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question:
What happens if we have a simple unnest query like -

SELECT * FROM tab1,UNNEST(col1)

What is expected to be in the side channel then? Seems like that they should be empty in this case

@@ -107,7 +113,8 @@ private static Pair<List<ReadableFrameChannel>, BroadcastJoinHelper> makeInputCh
inputChannels.add(baseInput.getChannel());
}

if (dataSource instanceof JoinDataSource) {

if (dataSource instanceof JoinDataSource || dataSource instanceof UnnestDataSource || dataSource instanceof FilteredDataSource) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing the same pattern as above, it should be a good refactor candidate.

broadcast
);
} else if (dataSource instanceof UnnestDataSource) {
checkQuerySegmentSpecIsEternity(dataSource, querySegmentSpec);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-confirm if this check is required here. I think it should be removed unless UNNEST on a table cannot have a time filter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were you able to confirm if the check is required?

Copy link
Contributor Author

@somu-imply somu-imply Sep 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have asked around, I'll update it once I get back the comments. I have removed it as the base data source types should take care of it, but will add it back otherwise

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any updates on this? I wanted to get a final check on this before the approval

@@ -137,6 +139,30 @@ public static DataSourcePlan forDataSource(
} else if (dataSource instanceof LookupDataSource) {
checkQuerySegmentSpecIsEternity(dataSource, querySegmentSpec);
return forLookup((LookupDataSource) dataSource, broadcast);
} else if (dataSource instanceof FilteredDataSource) {
checkQuerySegmentSpecIsEternity(dataSource, querySegmentSpec);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the other comment about the check. Perhaps this should also be removed since we would plan the base data source, and there would be individual checks there.
See the plan for broadcast joins - The base data source doesn't enforce the eternity check, however, it is required on the clauses. I think something of that sort should be applicable in Filtered and Unnest data sources (please confirm it though)

@somu-imply
Copy link
Contributor Author

@LakshSingla @cryptoe I have addressed the comments, added the refactoring, and also removed the context level feature to push it into individual engines. Please re-review

Copy link
Contributor

@LakshSingla LakshSingla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last minute review comments, overall LGTM.

@@ -123,6 +124,7 @@ public static List<RelOptRule> rules(PlannerContext plannerContext)
retVal.add(DruidFilterUnnestRule.DruidProjectOnUnnestRule.instance());
}


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Unnecessary

Suggested change

Copy link
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left my final review. Please let me know once those are addressed. Will approve and merge.

querySegmentSpec,
null,
maxWorkerCount,
Math.max(minStageNumber, subQueryDefBuilder.getNextStageNumber()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wont this line add a new stage ?

Copy link
Contributor

@LakshSingla LakshSingla Sep 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be fine as it won't be adding a new stage, however, this is effective Math.max(minStageNumber, 0) since the builder here doesn't contain any stages. We can simplify it as below:

Suggested change
Math.max(minStageNumber, subQueryDefBuilder.getNextStageNumber()),
minStageNumber,

Copy link
Contributor

@LakshSingla LakshSingla Sep 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, thinking about it, we probably didn't need to create a separate subqueryDefBuilder since we aren't adding a new stage anyway, we can pass in the same one to the recursive call which might be causing this confusion. Since there are a few changes pending, let's clean this up as well.

@LakshSingla
Copy link
Contributor

From a convo with @somu-imply, it seems like queries with a time filter SELECT * FROM foo, unnest(col) WHERE __time >= t1 and __time < t2 doesn't prune over the segments it has to iterate. It will iterate over all of the segments, and then apply the filter on top of it.
If this is correct, we should change the query segment spec to break down the where clause in the filter and the segment spec. However, I am confused if it's the same behavior in native queries as well.

@LakshSingla
Copy link
Contributor

Seems like the above is a problem in the native query as well. #15020

Copy link
Contributor

@LakshSingla LakshSingla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once @cryptoe's comments have been addressed. The major concern behind the correct optimization of the filter is present in the native stack as well (occurs while query planning), and is unrelated to enabling unnest in MSQ.

@somu-imply
Copy link
Contributor Author

Addressed the comments and added the test case

Copy link
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM!!

@cryptoe cryptoe merged commit c184b52 into apache:master Sep 25, 2023
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Area - Querying
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants