Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposed execution plan schema #3969

Closed
wants to merge 2 commits into from

Conversation

rodesai
Copy link
Contributor

@rodesai rodesai commented Nov 25, 2019

Description

This patch isn't meant to be merged. It's a place to review/finalize the proposed schema for
the KSQL physical plan.

Join example: https://gist.github.com/rodesai/19ff1591a5c0d17cccd2936da4d7704d
Agg example: https://gist.github.com/rodesai/3a032b7ec8388eac230de4fcbb6ecde2
Project example: https://gist.github.com/rodesai/806613157f115f90ac567d31cc11f162

@rodesai rodesai requested a review from a team as a code owner November 25, 2019 04:42
Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rodesai

This is looking good IMHO. I've flagged a lot of stuff in the review. With some of the things I've flagged we may choose that fixing them in our schema is not worth the work. However, I think we should at least discussion/consider each.

I also think some of these points may highlight the benefits of splitting our serialized from from our implementation. Splitting is a pain up front, but:

  • gives us a more clearly defined delimitation of the model / impl
  • makes us more flexible in how we map between the two
  • allows the model to be more exact representation of what we support, while allowing the impl to be more flexible. A good example would be the current JoinType enum is the super set. The schema for the stream-table joins step should be a subset if we're being accurate.

Of course, this comes at a cost, and that's what we need to think-about/discuss. Maybe its unnecessary. I'd like to hear others thoughts on this.

Have you thought about splitting this schema file up into smaller, per-type, files? I think this will help with merging / management / reviewing down the line. (Ref: https://stackoverflow.com/questions/18376215/jsonschema-split-one-big-schema-file-into-multiple-logical-smaller-files). It might be nice to have a file structure such as:

 resources/json/schema
         |---- Ksql.json
         |---- KsqlPlanV1.json
         |---- /query
         |            |-----QueryPlanV1.json
         |            |-----/steps
         |                       |----- StreamGroupByV1.json
         |                       |----- ... other step types
         |
         |---- /cmd
                      |---- CreateTableV1.json
                      |---- ... other command types

Also, can you link to some gists with example plans again please?

We've some related types in the schema, e.g. StreamSource, WindowedStreamSource, TableSource, etc. It might be useful to make use of a base Source type in the schema and have these types extend that one type. (https://json-schema.org/understanding-json-schema/structuring.html#extending)

@purplefox
Copy link
Contributor

I think what I am missing from this discussion is a clear and unambiguous explanation of why we want to version this stuff. I think the assumption seems to be that we must do this, and all the discussion is about the implementation of a solution. Now, it may well be that there is no alternative but to do this, but I am not convinced yet, and maybe that's just because I wasn't around when the original discussions about this happened.
@rodesai would it possible to put together a short doc, maybe just a few paragraphs, not looking for anything formal like a one pager, just explaining the motivation for this, and including some concrete examples? (E.g. involving topic partitioning and whatnot). I think this would also help others coming to this anew.

@agavra
Copy link
Contributor

agavra commented Nov 26, 2019

Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not so familiar with the individual steps that I can give feedback on whether or not every single field should/needs to be there, but this seems sound to me (I think it's also easier to review than looking at Java code to define our spec).

My only concern is with trying to use JSON for this (see #3969 (comment)), I think it isn't really well equipped to handle schema evolution.

"title" : "streamAggregateV1",
"required" : [ "@type", "properties", "source", "formats", "nonFuncColumnCount", "aggregations" ]
},
"DefaultExecutionStepProperties" : {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we have finalized the schemas, it'd be nice to add docs to all of the objects to understand what they are used for (and setup good practice for people extending the schema). For example, what does this one represent (why is it "default" - that seems like an implementation thing, not a schema thing)? (And what is "queryContext"?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Is there a way to get such comments into the schema - that would be awesome.

@purplefox
Copy link
Contributor

@purplefox https://github.com/confluentinc/ksql/blob/master/design-proposals/klip-6-execution-plans.md may help

Hey @agavra I've read that before and the only example there is the case of changing a key from a concatenated string to a struct. It seems to me that using concatenated string for a key is a design smell and we should change this to a struct anyway (I think that is already planned?). If the key is a generic struct it's unlikely we'd ever want to change it to anything else in the future?

@rodesai
Copy link
Contributor Author

rodesai commented Nov 27, 2019

I also think some of these points may highlight the benefits of splitting our serialized from from our implementation. Splitting is a pain up front, but:

I agree that it makes sense to keep the serialized model logically decoupled from KSQL's implementation (even if it was originally derived from the implementation =P). What's proposed here is intended to be that. You pointed out a bunch of problems inline, but those things will be cleaned up.

I also think having a decoupled model is orthogonal to whether or not we serialize from jackson-annotated types. You can have a decoupled or tightly-coupled model that uses jackson-annotated types or generated avro or protobuf.

@purplefox
Copy link
Contributor

I'm going to say my final words on the approach taken here, as I don't want to hold up the work.

I continue to have serious concerns about the general approach taken here, as discussed in other places, we're effectively defining a very broad protocol that is strongly coupled to our internal implementation, and I worry that is going to reduce our ability to evolve the project effectively going ahead. I suspect this will trip up us in the future, but maybe you will prove me wrong. We'll see ;)

Even though I don't agree with the approach taken, I appreciate that others want to press ahead with it, so I will respect that, and I won't block anything related to this work.

@big-andy-coates
Copy link
Contributor

I'm going to say my final words on the approach taken here, as I don't want to hold up the work.

I continue to have serious concerns about the general approach taken here, as discussed in other places, we're effectively defining a very broad protocol that is strongly coupled to our internal implementation, and I worry that is going to reduce our ability to evolve the project effectively going ahead. I suspect this will trip up us in the future, but maybe you will prove me wrong. We'll see ;)

Even though I don't agree with the approach taken, I appreciate that others want to press ahead with it, so I will respect that, and I won't block anything related to this work.

Hey @purplefox, what do you propose as an alternative?

@rodesai
Copy link
Contributor Author

rodesai commented Jan 6, 2020

I'm closing this out. We should be able to enable the feature with the current schema. There's a few issues in review that have not been addressed yet, but can be fixed even after plans are turned on. I've filed github issues for those and linked to them inline.

@rodesai rodesai closed this Jan 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants