Primary keys #4986

Merged (10 commits) on Apr 8, 2020

@big-andy-coates (Contributor) commented Apr 3, 2020

Description

KLIP for this (which needs approving first): #5008

fixes: #3681

The commit introduces PRIMARY KEY columns into the ksqlDB syntax for tables. Streams will continue to have KEY columns. For example,

CREATE TABLE ORDERS (ID BIGINT PRIMARY KEY, USER_ID BIGINT, ...
--vs
CREATE STREAM ORDER_UPDATES (ID BIGINT KEY, USER_ID BIGINT, ...

Note: this change only introduces the PRIMARY KEY syntax. It does not change how data is processed by ksqlDB.

This change in syntax differentiates the key handling semantics for tables vs streams:

A ksqlDB TABLE works much like tables in other SQL systems: each row is identified by its PRIMARY KEY. PRIMARY KEY column(s) cannot be NULL.
A message in the underlying Kafka topic with the same key as an existing row replaces the earlier row in the table,
or deletes the row if the message's value is NULL, as long as the earlier row does not have a later timestamp / ROWTIME.

A ksqlDB STREAM is a stream of facts. Each fact is immutable and is unique. A stream can store its data in either KEY or VALUE columns. Columns stored in the key of the Kafka message are not PRIMARY KEY columns. They are just non-primary-key columns that happen to be stored in the key of the Kafka message.
Both KEY and VALUE columns can be NULL. No special processing is done if two rows have the same key.

The table below contrasts key handling for streams and tables:

| | STREAM | TABLE |
| --- | --- | --- |
| Key column type | KEY | PRIMARY KEY |
| NON NULL key constraint | No | Yes <br> Messages in the Kafka topic with a NULL PRIMARY KEY are ignored |
| Unique key constraint | No <br> Messages with the same key as another have no special meaning | Yes <br> Later messages with the same key replace earlier |
| Tombstones | No <br> Messages with NULL values are ignored | Yes <br> NULL message values are treated as a tombstone <br> Any existing row with a matching key is deleted |
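
The rules in the table can be modelled in a few lines. The sketch below is only an illustration of the semantics described above, not how ksqlDB implements them: `Message`, `materialize_table`, `read_stream` and the sample data are hypothetical names invented for this example, and the timestamp check follows the "earlier row does not have a later timestamp" wording of the description.

```python
from typing import Dict, List, NamedTuple, Optional


class Message(NamedTuple):
    """Hypothetical stand-in for a Kafka record: key, value and timestamp."""
    key: Optional[int]
    value: Optional[dict]
    timestamp: int


def materialize_table(messages: List[Message]) -> Dict[int, dict]:
    """Apply the TABLE rules: NULL keys ignored, upsert by key, NULL value deletes."""
    rows: Dict[int, Message] = {}
    for msg in messages:
        if msg.key is None:
            continue  # NON NULL key constraint: message is ignored
        existing = rows.get(msg.key)
        if existing is not None and existing.timestamp > msg.timestamp:
            continue  # the existing row has a later timestamp, so it wins
        if msg.value is None:
            rows.pop(msg.key, None)  # tombstone: delete any matching row
        else:
            rows[msg.key] = msg  # later message replaces the earlier row
    return {key: msg.value for key, msg in rows.items()}


def read_stream(messages: List[Message]) -> List[Message]:
    """Apply the STREAM rules: every fact is kept, NULL values are ignored."""
    return [msg for msg in messages if msg.value is not None]


MESSAGES = [
    Message(1, {"USER_ID": 10}, 100),
    Message(1, {"USER_ID": 11}, 200),     # same key: replaces the row in a table
    Message(None, {"USER_ID": 12}, 300),  # NULL key: ignored by a table
    Message(2, {"USER_ID": 20}, 400),
    Message(2, None, 500),                # NULL value: tombstone for a table
    Message(1, {"USER_ID": 9}, 150),      # older than the existing row for key 1
]

if __name__ == "__main__":
    print(materialize_table(MESSAGES))
    print(len(read_stream(MESSAGES)))
```

With this sample input the table ends up with a single row for key 1, while the stream keeps every message except the one with a NULL value, including the message with the NULL key.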

Testing done

usual

Reviewing notes:

Commits broken down into:

  • First commit: Prod code changes and associated tests.
  • Second commit: doc updates
  • Third commit: Updates to other tests, i.e. adding the PRIMARY to other tests that use tables.
  • Fourth commit: New historical query plans (The change is just the addition of the PRIMARY key word in the SQL).
  • Fifth commit: doc update to fix table layout. Turns out github does support multi-line tables :(

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@big-andy-coates requested review from JimGalasyn and a team as code owners April 3, 2020 11:36
@big-andy-coates changed the title from "Primitive keys" to "Primary keys" Apr 3, 2020

@JimGalasyn (Member) left a comment

LGTM, with some suggestions. Thanks for the great doc updates!

@rodesai (Contributor) left a comment

The code change LGTM. Is this discussed in a KLIP somewhere?

| | STREAM | TABLE |
| ------------------------ | --------------------------------------------------------------| ----------------------------------------------------------------- |
| Key column type | `KEY` | `PRIMARY KEY` |
| NON NULL key constraint | No | Yes <br> Messages in the Kafka topic with a NULL `PRIMARY KEY` are ignored |

Contributor

This is true for now (since we only support unwrapped primitive keys), but it won't be true in the future if we support single-element wrapped keys right?

Contributor Author

In a strict SQL sense this should remain true for wrapped and multi-column primary keys. Primary key columns in SQL must be NON NULL.

The main reason for this is that an SQL NULL is not comparable to another SQL NULL, i.e. NULL = NULL evaluates to NULL (unknown) rather than true. So a NULL value in a primary key could never be matched.
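
The comparison behaviour is easy to check against any SQL engine. A minimal sketch using Python's built-in `sqlite3` module follows; SQLite is only a stand-in here (an assumption for illustration, it is not ksqlDB), and `null_comparison` is a hypothetical helper name. The equality comparison yields NULL, which a `WHERE` clause does not treat as true, so no row matches; only `IS NULL` detects the value.

```python
import sqlite3


def null_comparison():
    """Evaluate NULL comparisons in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        # NULL = NULL evaluates to NULL, which sqlite3 surfaces as None.
        (equality,) = conn.execute("SELECT NULL = NULL").fetchone()
        # A WHERE clause keeps only rows whose predicate is true, so nothing matches.
        matches = conn.execute("SELECT 1 WHERE NULL = NULL").fetchall()
        # IS NULL is the operator that actually detects a NULL.
        (is_null,) = conn.execute("SELECT NULL IS NULL").fetchone()
        return equality, matches, is_null
    finally:
        conn.close()


if __name__ == "__main__":
    print(null_comparison())
```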

@big-andy-coates requested review from rodesai and a team April 6, 2020 09:44

@big-andy-coates (Contributor Author) commented Apr 6, 2020

@rodesai

> Is this discussed in a KLIP somewhere?

I'm not aware that it's been covered explicitly in a KLIP. However, I have discussed this change with Product.

cc @MichaelDrogalis @derekjn

UPDATE: KLIP available here: #5008

@agavra (Contributor) left a comment

LGTM, pending #5008

Development

Successfully merging this pull request may close these issues.

Use PRIMARY KEY syntax on tables vs existing KEY syntax on streams.
4 participants