Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ingest unstructured json records or capture unrecognized fields #12207

Closed
lmatz opened this issue Sep 11, 2023 · 11 comments · Fixed by #18437
Closed

ingest unstructured json records or capture unrecognized fields #12207

lmatz opened this issue Sep 11, 2023 · 11 comments · Fixed by #18437

Comments

@lmatz
Copy link
Contributor

lmatz commented Sep 11, 2023

In a discussion with @fuyufjh that is inspired by a user's question

When a user has some rows in JSON format, he would like to:

  1. Ingest some/all the columns by defining a concrete data type for each column.
  2. Ingest the entire row as a JSONB column.

Right now, when (1), if a JSON field is in the row data but is not defined in the table schema, it will not be parsed and ingested into Risingwave.

For (2), right now, the user has to wrap the entire row by another field in JSON, e.g. 'data': {the original data}. Therefore, from time to time, it requires the user to do another transformation before ingesting data into Risingwave. However, the user may not have control of the data format in the source as the source data is collected by some other data team.
By enabling users to do so, they can do more ETL workload all in RW instead of bringing up another system.

If we give the option that users can group all the JSON fields undefined in the table schema as one field, (2) is naturally solved.

Also, the user often wants to take the primary key out as a single column to be the primary key to duplicate the source stream of the table but keep everything else in a huge data JSONB column.

Welcome more observations and counter-examples

@github-actions github-actions bot added this to the release-1.3 milestone Sep 11, 2023
@fuyufjh fuyufjh removed this from the release-1.3 milestone Oct 10, 2023
@fuyufjh
Copy link
Member

fuyufjh commented Oct 10, 2023

Will be tracked on User-requested issues (Notion)

@BugenZhao
Copy link
Member

Also, the user often wants to take the primary key out as a single column to be the primary key to duplicate the source stream of the table but keep everything else in a huge data JSONB column.

IIUC, this can be done by defining a generated column accessing that JSONB.

Copy link
Contributor

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

@BugenZhao
Copy link
Member

FYI, this is somehow similar to #[serde(flatten)]:

#[derive(Serialize, Deserialize)]
struct S {
    a: u32,
    b: String,
    #[serde(flatten)]
    other: Map<String, Value>,
}

from serde-rs/serde#941 (comment)

We need to find a way to mark the column.

@BugenZhao BugenZhao changed the title ingest a row in JSON with many fields on the top level as one single jsonb column ingest unstructured json records into a single column, or capture unrecognized fields Jun 3, 2024
@BugenZhao BugenZhao changed the title ingest unstructured json records into a single column, or capture unrecognized fields ingest unstructured json records or capture unrecognized fields Jun 3, 2024
@fuyufjh
Copy link
Member

fuyufjh commented Aug 12, 2024

We are enhancing the schemaless features recently. Shall we work out this out based on include columns? Example:

CREATE TABLE t1 (
-- can be empty
)
INCLUDE PAYLOAD AS payload JSONB
WITH (  connector = 'kafka',
	topic = 'test_include_key')
FORMAT PLAIN ENCODE JSON

where the payload can be either JSONB, VARCHAR or BYTEA

@xiangjinwu
Copy link
Contributor

Related: #17959

create source foo (
  raw bytea,
  data jsonb as convert_from(raw, 'utf-8')::jsonb
) with (...) format plain encode bytes;

@fuyufjh
Copy link
Member

fuyufjh commented Aug 12, 2024

Related: #17959

create source foo (
  raw bytea,
  data jsonb as convert_from(raw, 'utf-8')::jsonb
) with (...) format plain encode bytes;

Yeah, this leverages generated column, which is like a work-around to me. Now I am considering making it a more "formal" usage.

@lmatz
Copy link
Contributor Author

lmatz commented Sep 2, 2024

#17650 (comment) proposed another variant of syntax

FORMAT DYNAMODB_CDC ENCODE JSON (
    single_blob_column = 'data'
)

@tabVersion
Copy link
Contributor

#17650 (comment) proposed another variant of syntax

FORMAT DYNAMODB_CDC ENCODE JSON (
    single_blob_column = 'data'
)

I prefer this approach because the new collecting column has to be JSON type. Encode json guarantees a jsonb column can ingest the raw data losslessly. I am not sure JSONB also works for other encodes.

@fuyufjh
Copy link
Member

fuyufjh commented Sep 6, 2024

I still vote for INCLUDE PAYLOAD AS payload JSONB because include is exactly the syntax to to introduce a new column. I'd like to keep this style consistent.

Conversely, if we choose ( single_blob_column = 'data' ), why not using ( kafka_headers_column = 'headers' ) to add Kafka message headers as well? 😄

@tabVersion
Copy link
Contributor

I still vote for INCLUDE PAYLOAD AS payload JSONB because include is exactly the syntax to to introduce a new column. I'd like to keep this style consistent.

From the impl side, this approach is more doable.

Conversely, if we choose ( single_blob_column = 'data' ), why not using ( kafka_headers_column = 'headers' ) to add Kafka message headers as well? 😄

In the original design, additional means the field comes from places other than the message payload.
But PAYLOAD exactly means the message payload, which is out of the additional's scope. So I object to the idea at the moment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants