Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't allow partial (intra-value) schema inference #5959

Open
agavra opened this issue Aug 6, 2020 · 1 comment
Open

Don't allow partial (intra-value) schema inference #5959

agavra opened this issue Aug 6, 2020 · 1 comment
Labels
breaking-change P1 Slightly lower priority to P0 ;) streaming-engine Tickets owned by the ksqlDB Streaming Team

Comments

@agavra
Copy link
Contributor

agavra commented Aug 6, 2020

Thanks for the detailed thoughts @big-andy-coates!

If a prior schema does exist, then there are a few points where we care about this:

  1. CREATE statements where the users supplies a list of value columns:
    a. ksqlDB needs to ensure the supplied column definitions can read the data in the source topic.
    b. ksqlDB needs to configure the the deserializer with an appropriate schema.
  2. INSERT INTO statements _inserting into sources created by the above type of CREATE statement:
    a. ksqlDB needs to check it can write data that is compatible with the existing schema.
    b. ksqlDB needs to configure the serializer with an appropriate schema.

This frames the problem nicely, but I want to suggest a different solution: don't allow inline specification of columns in ksql for schemas already registered in schema registry. I think allowing this is really a bug - it introduces a lot of problems (including some more serious, like #5673) and there are compatibility features in some data formats that don't have a sql/ksql equivalent (e.g. additionlProperties described in #5798). If you want a projection of fields in a schema (I'm guessing the only use case for specifying subset fields), you should specify it as a projection - not do it sneakily in the serde suite.

The big benefit is that we don't have to reason about compatibility - that's a hard problem and we should delegate it 100% to schema registry. We also don't want to be in the business of doing this per format, which is something we'd need to get into because each format has their own compatibility requirements.

We can get away with this if we separate the case-sensitivity from the serialization layer. We can always coerce the GenericRow into the desired serialization schema and vice-versa if the SQL schema and the serialization schema are compatible (which we know they are in CT/CS statements because we generate the SQL schema from it).

Originally posted by @agavra in #5801 (comment)

@guozhangwang
Copy link
Contributor

I'm big +1 on disallowing CSAS/CTAS to define a schema on selected columns for those cases that have schema registration integrated. If users are only interested in a subset of the fields, they can just create a new stream/table right after with projected fields. Behind the scene, naively we would simply:

  • Use the registered serde to read from source topic, and optionally to write to materialized store / changelog if materialization is needed. In practice, materialization would not usually be happening (ideally) since users would only manipulate on the second projected table.
  • Use the new serde defined in the second CREATE statement to write to materialized store / changelog for the projected table.

In the future, we can go very fancy if needed, to e.g. optimize and read directly from Kafka and push down the projection to broker's consumption interceptor if we know that the first table would not need to be materialized. But in all I think by disallowing it we can clear a tricky challenge to separate our logical/physical formats.

@jzaralim jzaralim added the streaming-engine Tickets owned by the ksqlDB Streaming Team label Feb 4, 2021
@agavra agavra removed their assignment Sep 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change P1 Slightly lower priority to P0 ;) streaming-engine Tickets owned by the ksqlDB Streaming Team
Projects
None yet
Development

No branches or pull requests

3 participants