Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: cache transformations in PersistentQueryMetadata #3708

Merged
merged 3 commits into from
Nov 15, 2019

Conversation

vpapavas
Copy link
Member

@vpapavas vpapavas commented Oct 31, 2019

Description

Addresses #3542 . Don't build transformations for every request on the same state store but rather built them only the first time.

Two follow up issues: #3855 and #3854

Testing done

Describe the testing strategy. Unit and integration tests are expected for any behavior changes.

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@vpapavas vpapavas requested a review from a team as a code owner October 31, 2019 00:10
@ghost
Copy link

ghost commented Oct 31, 2019

@confluentinc It looks like @vpapavas just signed our Contributor License Agreement. 👍

Always at your service,

clabot

Copy link
Contributor

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracing through the code, this LGTM. Persistent queries are registered once at EngineExecutor::executePersistentQuery and the queryMetadata is held in the hashmap.

engineContext.registerQuery(queryMetadata); So should be safe to cache this.

Good to get one more review from @rodesai or @agavra as well. And may be add a test to check the materialization is cached (this way if we accidentally undo this caching later on, it won't silently fail)

agavra
agavra previously requested changes Oct 31, 2019
@@ -134,6 +137,10 @@ public PhysicalSchema getPhysicalSchema() {
final QueryId queryId,
final QueryContext.Stacker contextStacker
) {
return materializationProvider.map(builder -> builder.build(queryId, contextStacker));
if (!materialization.isPresent()) {
materialization = materializationProvider.map(builder ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like we're caching a value that may not have the same queyId that the caller is making (i.e. the first time I call it with queryId=123 and second time I call it with queryId=124, it will use 123 for both of them) - digging into the code, it looks like it might cause our logging pipeline to be a little messed up. We should look at this with a critical eye and see what we can do

@agavra agavra requested a review from a team October 31, 2019 17:04
@vpapavas
Copy link
Member Author

vpapavas commented Nov 1, 2019

It seems that the QueryId is only used for the ProcessingLogger which is used to log errors. This logger is used for writing record processing log messages. Is this something we want to do for pull queries?

@vinothchandar
Copy link
Contributor

good catch @agavra .. One way to fix could be to to pull queryID out of the materialization itself and just pass it in during execution..

@rodesai
Copy link
Contributor

rodesai commented Nov 1, 2019

good catch @agavra .. One way to fix could be to to pull queryID out of the materialization itself and just pass it in during execution..

Another way of looking at it would be (if we think that what's really expensive here is codegen) to pull out all the codegen stuff and just do that up front. For example, for projections store the result of codegen (List<SelectExpression>) in ProjectInfo, and have SelectValueMapperFactory take a List<SelectInfo> instead of List<SelectExpression>.

@vpapavas vpapavas force-pushed the pull-transformations branch from 11be4a5 to 9f88966 Compare November 1, 2019 18:49
@vpapavas
Copy link
Member Author

vpapavas commented Nov 1, 2019

For streaming queries, it makes sense every query to get a different id and to write to a different logger. For pull queries, since they are one-off queries, it doesn't make sense to create random id and write every time to a different logger. That's why I create one queryId per sink and all pull queries to that sink share the same id and hence write to the same logger. I synchronized the access to the logger for multi-threading.

@agavra agavra dismissed their stale review November 4, 2019 16:58

the bug I identified is no longer present

@agavra agavra requested a review from a team November 4, 2019 16:59
@vpapavas
Copy link
Member Author

vpapavas commented Nov 4, 2019

Hey @big-andy-coates , what do you think about this? Do you have a different suggestion?

Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vpapavas

As I comment below, I think a better approach is to pull the queryId out of the mappers and filters etc, so that it can be passed in as a parameter when being executed. This would allow us to create the baked code as we're building the source aggregate query, i.e. rather than build the materialization, and cache it, on first use, we just build it when building the source aggregate query and store it, rather than the provider, in the query metadata.

Are you up for giving that a go? Maybe leave this PR as is and have a go at creating a new one with the above approach. Then if we're stuck for time we can look to put this one if for the initial launch and then switch to the better approach later.

@@ -310,20 +311,20 @@ public void shouldMaterializeCorrectly() {
// When:
final PersistentQueryMetadata queryMetadata = queryBuilder.buildQuery(
STATEMENT_TEXT,
QUERY_ID,
PULL_QUERY_ID,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be QUERY_ID as this is the id of the query that is building the materialized state, not the id of the pull query accessing it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that this should be actually the pull query ID because we do queryMetadata.getMaterialization() which is only called by pull queries.

@vpapavas vpapavas changed the title feat: cache transformations in PersistentQueryMetadata refactor: cache transformations in PersistentQueryMetadata Nov 6, 2019
@vpapavas
Copy link
Member Author

vpapavas commented Nov 7, 2019

Are you up for giving that a go? Maybe leave this PR as is and have a go at creating a new one with the above approach. Then if we're stuck for time we can look to put this one if for the initial launch and then switch to the better approach later.

Yes I am! I will create a separate PR for it.

@vpapavas vpapavas force-pushed the pull-transformations branch from 2b98680 to d21e720 Compare November 7, 2019 23:53
Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vpapavas

Yes I am! I will create a separate PR for it.

Can you create a github issue in the pull query project to track this please, so that we don't forget, and reference it in this PR.

Can you also create another github issue to revert to a unique query id per request, (as this PR makes it a static id).

I've flagged up a couple of nits, (please fix), and also a couple of design issues. However, as noted below, I think we can live with the design issues as we have Github issues to track then and we quickly fix them post launch. I'm happy to help out fixing!

@@ -161,11 +162,10 @@ public static void validate(

final WhereInfo whereInfo = extractWhereInfo(analysis, query);

final QueryId queryId = uniqueQueryId();
final QueryContext.Stacker contextStacker = new Stacker();
final QueryId queryId = new QueryId(QUERY_ID_PREFIX + getSourceName(analysis).name());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

design note: this duplicates functionality in PersistentQueryMetadata that builds the query id. Such duplication is generally a poor design. If someone changes one bit of code without also changing the other, then the two get out of sync, and the query id returned to users will no longer reflect the query id logged to the processing log.

Hence, it is general good practice to avoid such duplication of code. Instead, define the code in one place and call it from the other, e.g. in this instance I would probably define QUERY_ID_PREFIX in this class, (as its specific to pull queries), not in PersistentQueryMetadata, which deals with all query types. I'd then add a method to build the query id from the prefix and the sink name and would then call this new method from here and from the PersistentQueryMetadata constructor.

However, in this case this is hopefully just a temporary fix to get ksqlDB over the line. As long as we have a Github issue to track this and we resolve it quickly after launch, then I'm happy for this to be merged as-is. Please reference the github issue here for future reference.

Copy link
Member Author

@vpapavas vpapavas Nov 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is a far better design. The reason I duplicated the code is that there is currently no dependency between the two modules, ksql-engine -> ksql-rest-app. Adding this dependency, creates a circular dependency between these modules. To address this, I added the method in PersistenQueryMetadata although I agree that it is more suited in StaticQueryExecutor.

Copy link
Contributor

@agavra agavra Nov 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we even need a queryID here anymore (now that we're not passing it in)? it seems wrong to build it here at all to be honest... It's just being used in the TableRowsEntity and it looks like that queryID there is never being surfaced (cc @big-andy-coates you added it, what is it used for?). I think we should remove it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also don't block your PR on this, it can be a follow-up

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TableRowsEntity and it looks like that queryID there is never being surfaced (cc @big-andy-coates you added it, what is it used for?

Before we switch pull queries over to /query the TableRowsEntity was the entity returned to the caller - i.e. the caller got to see the query id, and each call got a unique query id. Hence users could use this query id when checking the processing log for any errors.

removed extranneous synchronized and queryid

fixed race condition

Addressed Andy's comments
@vpapavas vpapavas force-pushed the pull-transformations branch from d21e720 to b0ac339 Compare November 14, 2019 22:32
@vpapavas vpapavas merged commit 537d23d into confluentinc:master Nov 15, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants