ingest unstructured json records or capture unrecognized fields #12207

lmatz · 2023-09-11T09:49:48Z

This came out of a discussion with @fuyufjh, prompted by a user's question.

When a user has rows in JSON format, they would like to:

1. Ingest some or all of the columns by defining a concrete data type for each column.
2. Ingest the entire row as a JSONB column.

Right now, in case (1), a JSON field that is present in the row data but not defined in the table schema is neither parsed nor ingested into RisingWave.

For (2), the user currently has to wrap the entire row in another JSON field, e.g. 'data': {the original data}. This often forces the user to run an extra transformation before ingesting data into RisingWave. However, the user may not control the data format at the source, because the source data is collected by another data team.
If users could ingest such rows directly, they could do more of their ETL workload in RW instead of bringing up another system.

If we let users group all the JSON fields that are not defined in the table schema into one field, (2) is naturally solved.
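To make the intended semantics concrete, here is a minimal Python sketch of that grouping. It is only an illustration of the idea, not RisingWave's parser: the function name `split_record` and the catch-all column name `rest` are hypothetical.

```python
import json
from typing import Any, Dict, Sequence


def split_record(raw: str, declared: Sequence[str], catch_all: str = "rest") -> Dict[str, Any]:
    """Split one JSON object into declared columns plus one catch-all column."""
    record = json.loads(raw)
    # Declared columns are taken by name; a missing field becomes None (NULL).
    row = {name: record.get(name) for name in declared}
    # Every field not declared in the schema is grouped into a single JSON value.
    row[catch_all] = {k: v for k, v in record.items() if k not in declared}
    return row


# Case (1): some columns are declared, the remainder is captured.
print(split_record('{"id": 1, "name": "a", "extra": {"x": true}}', ("id", "name")))

# Case (2): nothing is declared, so the whole row lands in the catch-all column.
print(split_record('{"id": 1, "name": "a"}', ()))
```

With an empty list of declared columns, the catch-all column holds the entire row, which is why (2) falls out as a special case of capturing undefined fields.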

Also, the user often wants to pull the primary key out into its own column, so that the table can deduplicate the source stream, while keeping everything else in one large JSONB column.

More observations and counter-examples are welcome.

fuyufjh · 2023-10-10T09:22:33Z

Will be tracked on User-requested issues (Notion)

BugenZhao · 2023-10-10T09:25:04Z

> Also, the user often wants to take the primary key out as a single column to be the primary key to duplicate the source stream of the table but keep everything else in a huge data JSONB column.

IIUC, this can be done by defining a generated column accessing that JSONB.

github-actions · 2023-12-11T01:50:39Z

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

BugenZhao · 2023-12-11T05:41:24Z

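FYI, this is somewhat similar to #[serde(flatten)]:

use serde::{Deserialize, Serialize};
use serde_json::{Map, Value};

#[derive(Serialize, Deserialize)]
struct S {
    a: u32,
    b: String,
    // Collects every field other than `a` and `b`.
    #[serde(flatten)]
    other: Map<String, Value>,
}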

from serde-rs/serde#941 (comment)

We need to find a way to mark the column.

fuyufjh · 2024-08-12T06:24:17Z

We have been enhancing the schemaless features recently. Shall we work this out based on INCLUDE columns? Example:

CREATE TABLE t1 (
    -- can be empty
)
INCLUDE PAYLOAD AS payload JSONB
WITH (
    connector = 'kafka',
    topic = 'test_include_key'
)
FORMAT PLAIN ENCODE JSON

where the payload can be either JSONB, VARCHAR or BYTEA
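As a rough illustration of what the three candidate types would preserve, the following Python sketch treats one message as raw bytes, as decoded text, and as a parsed JSON value. The mapping to BYTEA, VARCHAR and JSONB is an analogy made for this sketch; RisingWave's actual JSONB behavior is not verified here.

```python
import json

# Raw bytes of one message, with unusual spacing and a duplicate key.
message = b'{"b": 1,  "a": 2, "a": 3}'

as_bytes = message                 # analogous to a BYTEA payload: exact bytes
as_text = message.decode("utf-8")  # analogous to a VARCHAR payload: exact text
as_json = json.loads(as_text)      # analogous to a parsed JSON payload

# The text form round-trips to the original bytes.
assert as_text.encode("utf-8") == as_bytes

# The parsed form keeps the data but not the exact bytes: Python's parser
# keeps only the last value of a duplicate key and drops the extra spacing.
print(as_json)
print(json.dumps(as_json) == as_text)
```

If byte-exact replay of the source message matters, the text or bytes form is the safer choice in this sketch; the parsed form is the convenient one for querying fields.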

xiangjinwu · 2024-08-12T06:42:04Z

Related: #17959

create source foo (
  raw bytea,
  data jsonb as convert_from(raw, 'utf-8')::jsonb
) with (...) format plain encode bytes;
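For readers less familiar with the SQL above, the expression convert_from(raw, 'utf-8')::jsonb is a decode step followed by a parse step. A minimal Python analogue is sketched below; the helper name `try_bytes_to_json` is hypothetical, and returning None on failure is a choice made for this sketch, not a statement about how RisingWave handles malformed input.

```python
import json
from typing import Any, Optional


def try_bytes_to_json(raw: bytes) -> Optional[Any]:
    """Decode raw bytes as UTF-8 and parse them as JSON; return None on failure."""
    try:
        return json.loads(raw.decode("utf-8"))
    except ValueError:
        # ValueError covers both UnicodeDecodeError and json.JSONDecodeError.
        # Note that a valid JSON `null` also yields None in this sketch.
        return None


print(try_bytes_to_json(b'{"id": 7, "tags": ["a", "b"]}'))
print(try_bytes_to_json(b"\xff\xfe not utf-8"))
```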

fuyufjh · 2024-08-12T06:44:11Z

> Related: #17959
>
> create source foo (
>   raw bytea,
>   data jsonb as convert_from(raw, 'utf-8')::jsonb
> ) with (...) format plain encode bytes;

Yeah, this leverages a generated column, which feels like a workaround to me. I am now considering making it a more "formal" usage.

lmatz · 2024-09-02T03:38:02Z

#17650 (comment) proposed another variant of syntax

FORMAT DYNAMODB_CDC ENCODE JSON (
    single_blob_column = 'data'
)

tabVersion · 2024-09-06T03:57:51Z

> #17650 (comment) proposed another variant of syntax
>
> FORMAT DYNAMODB_CDC ENCODE JSON (
>     single_blob_column = 'data'
> )

I prefer this approach because the new collecting column has to be of a JSON type. ENCODE JSON guarantees that a JSONB column can ingest the raw data losslessly. I am not sure JSONB also works for other encodings.

fuyufjh · 2024-09-06T05:58:17Z

I still vote for INCLUDE PAYLOAD AS payload JSONB because INCLUDE is exactly the syntax for introducing a new column. I'd like to keep this style consistent.

Conversely, if we choose ( single_blob_column = 'data' ), why not use ( kafka_headers_column = 'headers' ) to add Kafka message headers as well? 😄

tabVersion · 2024-09-06T06:32:35Z

> I still vote for INCLUDE PAYLOAD AS payload JSONB because INCLUDE is exactly the syntax for introducing a new column. I'd like to keep this style consistent.

From the implementation side, this approach is more doable.

> Conversely, if we choose ( single_blob_column = 'data' ), why not use ( kafka_headers_column = 'headers' ) to add Kafka message headers as well? 😄

In the original design, an additional column means a field that comes from somewhere other than the message payload.
PAYLOAD, however, is exactly the message payload, so it falls outside the scope of additional columns. So I object to the idea at the moment.

github-actions bot added this to the release-1.3 milestone Sep 11, 2023

lmatz added type/feature needs-discussion labels Sep 11, 2023

fuyufjh removed this from the release-1.3 milestone Oct 10, 2023

github-actions bot added the no-issue-activity label Dec 11, 2023

github-actions bot removed the no-issue-activity label Dec 12, 2023

BugenZhao changed the title ~~ingest a row in JSON with many fields on the top level as one single jsonb column~~ ingest unstructured json records into a single column, or capture unrecognized fields Jun 3, 2024

BugenZhao changed the title ~~ingest unstructured json records into a single column, or capture unrecognized fields~~ ingest unstructured json records or capture unrecognized fields Jun 3, 2024

fuyufjh assigned tabVersion Aug 13, 2024

lmatz mentioned this issue Aug 21, 2024: a user-friendly way to suppress the undefined field warnings during connector parsing #18153 (Open)

tabVersion mentioned this issue Sep 11, 2024: feat: schemaless ingestion for encode json (INCLUDE payload) #18437 (Merged, 9 tasks)

tabVersion closed this as completed in #18437 Sep 12, 2024
