Limit results generated by SELECT queries in MSQ #14370

Merged
merged 5 commits into from
Jun 15, 2023


LakshSingla
Contributor

Description

MSQ queries return the results in the query reports. Querying large data sources can generate a lot of rows, all of which will get materialized in the query reports, causing them to grow very large. This PR limits the results of the SELECT queries that we put in the reports.

Note: After this change, the only way to fetch the complete result set is to run the query using the native engine. However, we might add a way to execute long-running queries that post their complete results elsewhere. The rows in the reports would therefore serve as a preview of the actual result set.

Release note

SELECT queries executed using MSQ generate only a subset of the results in the query reports.


Key changed/added classes in this PR
  • ControllerImpl

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Contributor

@adarshsanjeev adarshsanjeev left a comment

LGTM! Thanks for including the patch notes too.

return Sequences.simple(retVal);
}
)
.limit(Limits.MAX_SELECT_RESULT_ROWS)

For the end user, we need to set a flag in the report mentioning the results are truncated.

).collect(Collectors.toList());

final List<SqlTypeName> sqlTypeNames = task.getSqlTypeNames();
final List<Object[]> retVal = new ArrayList<>();

This might blow up.
A good test case for this issue is to try to do a select * on a datasource with 1 million rows.
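The two review concerns above (flagging truncation for the end user, and not materializing every row on the controller) can be illustrated with a minimal sketch. This is not the ControllerImpl code from this PR: the class `ResultPreview`, its nested `Preview` type, and the `take` method are hypothetical names, and the sketch assumes rows arrive through a plain `Iterator`, so at most `maxRows` rows are ever held in memory.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Druid. */
final class ResultPreview
{
  /** Rows kept for the report, plus whether more rows existed beyond the limit. */
  static final class Preview
  {
    final List<Object[]> rows;
    final boolean truncated;

    Preview(List<Object[]> rows, boolean truncated)
    {
      this.rows = rows;
      this.truncated = truncated;
    }
  }

  /** Pulls at most maxRows rows from the source without reading the rest. */
  static Preview take(Iterator<Object[]> source, int maxRows)
  {
    final List<Object[]> rows = new ArrayList<>();
    while (rows.size() < maxRows && source.hasNext()) {
      rows.add(source.next());
    }
    // If the source still has rows once the limit is reached, the preview is truncated.
    return new Preview(rows, source.hasNext());
  }
}
```

Because the loop stops at the limit, a select * over a very large datasource would buffer only the preview rows in this sketch, and the `truncated` flag is derived from whether the source has anything left.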

@cryptoe
Contributor

cryptoe commented Jun 13, 2023

Please add the truncated flag to the README.
LGTM otherwise. Also, please test this locally on a very large file to confirm that the controller heap does not blow up.

@cryptoe cryptoe added the Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 label Jun 15, 2023
@cryptoe cryptoe merged commit 4935f24 into apache:master Jun 15, 2023
@cryptoe
Contributor

cryptoe commented Jun 15, 2023

cc @vogievetsky We now have a new field inside the task report: multiStageQuery.payload.results.resultsTruncated, which denotes whether the results are truncated.
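A client consuming the report could check this field roughly as follows. This is a sketch under stated assumptions: the report JSON is assumed to have already been parsed into nested `Map` objects by whatever JSON library the client uses, the class `ReportFields` and method `isTruncated` are hypothetical names, and treating a missing field as "not truncated" is a choice made here, not documented behavior.

```java
import java.util.Map;

/** Hypothetical client-side helper; not part of Druid. */
final class ReportFields
{
  private static final String[] PATH = {"multiStageQuery", "payload", "results", "resultsTruncated"};

  /** Walks the nested maps along PATH and returns true only if the leaf is Boolean true. */
  static boolean isTruncated(Map<String, Object> report)
  {
    Object node = report;
    for (String key : PATH) {
      if (!(node instanceof Map)) {
        // Assumption: a missing or malformed path is treated as "not truncated".
        return false;
      }
      node = ((Map<?, ?>) node).get(key);
    }
    return Boolean.TRUE.equals(node);
  }
}
```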

@abhishekagarwal87 abhishekagarwal87 added this to the 27.0 milestone Jul 19, 2023
Labels
Area - Documentation Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262