# [Improve][Docs][Clickhouse] Reconstruct the clickhouse connec… #5085

Merged 2 commits on Jul 19, 2023.
207 changes: 102 additions & 105 deletions in docs/en/connector-v2/sink/Clickhouse.md

> Clickhouse sink connector

## Support Those Engines

> Spark<br/>
> Flink<br/>
> SeaTunnel Zeta<br/>

## Key Features

- [ ] [exactly-once](../../concept/connector-v2-features.md)


- [x] [cdc](../../concept/connector-v2-features.md)


> The ClickHouse sink plug-in can achieve exactly-once semantics through idempotent writes, which requires pairing it with AggregatingMergeTree or another table engine that supports deduplication.

## Description

Used to write data to Clickhouse.

## Supported DataSource Info

In order to use the Clickhouse connector, the following dependencies are required.
They can be downloaded via `install-plugin.sh` or from the Maven central repository.

| Datasource | Supported Versions | Dependency |
|------------|--------------------|------------------------------------------------------------------------------------------------------------------|
| Clickhouse | universal | [Download](https://mvnrepository.com/artifact/org.apache.seatunnel/seatunnel-connectors-v2/connector-clickhouse) |

## Data Type Mapping

| SeaTunnel Data type | Clickhouse Data type |
|---------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| STRING              | String / Int128 / UInt128 / Int256 / UInt256 / Point / Ring / Polygon / MultiPolygon                                                           |
| INT | Int8 / UInt8 / Int16 / UInt16 / Int32 |
| BIGINT | UInt64 / Int64 / IntervalYear / IntervalQuarter / IntervalMonth / IntervalWeek / IntervalDay / IntervalHour / IntervalMinute / IntervalSecond |
| DOUBLE | Float64 |
| DECIMAL | Decimal |
| FLOAT | Float32 |
| DATE | Date |
| TIME | DateTime |
| ARRAY | Array |
| MAP | Map |
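
To make the mapping concrete, here is a sketch of a SeaTunnel `schema` block whose fields exercise several rows of the table above; the field names and exact type spellings are illustrative:

```hocon
schema {
  fields {
    c_string  = string             # -> ClickHouse String
    c_int     = int                # -> Int8 / Int16 / Int32 family
    c_bigint  = bigint             # -> Int64 / UInt64
    c_double  = double             # -> Float64
    c_decimal = "decimal(10, 2)"   # -> Decimal
    c_date    = date               # -> Date
    c_array   = "array<int>"       # -> Array
    c_map     = "map<string, int>" # -> Map
  }
}
```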

## Sink Options
| Name                                  | Type    | Required | Default | Description |
|---------------------------------------|---------|----------|---------|-------------|
| host                                  | String  | Yes      | -       | `ClickHouse` cluster address in `host:port` format; multiple hosts can be specified, such as `"host1:8123,host2:8123"`. |
| database                              | String  | Yes      | -       | The `ClickHouse` database name. |
| table                                 | String  | Yes      | -       | The table name. |
| username                              | String  | Yes      | -       | `ClickHouse` username. |
| password                              | String  | Yes      | -       | `ClickHouse` user password. |
| clickhouse.config                     | Map     | No       | -       | In addition to the mandatory parameters above, users can specify multiple optional parameters, covering all of the [parameters](https://github.com/ClickHouse/clickhouse-jdbc/tree/master/clickhouse-client#configuration) provided by `clickhouse-jdbc`. |
| bulk_size                             | Int     | No       | 20000   | The number of rows written per batch through [Clickhouse-jdbc](https://github.com/ClickHouse/clickhouse-jdbc); defaults to `20000`. If checkpoints are enabled, a write is also triggered whenever a checkpoint completes. |
| split_mode                            | Boolean | No       | false   | This mode only supports ClickHouse tables whose engine is 'Distributed', and the `internal_replication` option should be `true`. SeaTunnel splits the distributed table data and writes directly to each shard, taking the shard weights defined in ClickHouse into account. |
| sharding_key                          | String  | No       | -       | When split_mode is enabled, the target shard is chosen at random by default; the 'sharding_key' parameter specifies the field used by the sharding algorithm. This option only takes effect when 'split_mode' is true. |
| primary_key                           | String  | No       | -       | Marks the primary key column of the ClickHouse table; INSERT/UPDATE/DELETE statements are executed against the table based on this key. |
| support_upsert                        | Boolean | No       | false   | Support upserting rows by querying the primary key. |
| allow_experimental_lightweight_delete | Boolean | No       | false   | Allow experimental lightweight deletes based on the `*MergeTree` table engine. |
| common-options                        |         | No       | -       | Sink plugin common parameters; please refer to [Sink Common Options](common-options.md) for details. |
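
As a quick illustration of how these options combine, the sketch below wires a multi-node `host` list, credentials, and a custom `bulk_size` into one sink block; the host addresses, table name, and batch size are placeholder values:

```hocon
sink {
  Clickhouse {
    # multiple cluster nodes may be listed, comma separated
    host = "host1:8123,host2:8123"
    database = "default"
    table = "my_table"
    username = "xxxxx"
    password = "xxxxx"
    # flush every 10000 rows instead of the 20000-row default
    bulk_size = 10000
  }
}
```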

## How to Create a Clickhouse Data Synchronization Job

The following example demonstrates how to create a data synchronization job that writes randomly generated data to a Clickhouse database:

```hocon
# Set the basic configuration of the task to be performed
env {
  execution.parallelism = 1
  job.mode = "BATCH"
  checkpoint.interval = 1000
}

source {
  FakeSource {
    row.num = 2
    bigint.min = 0
    bigint.max = 10000000
    split.num = 1
    split.read-interval = 300
    schema {
      fields {
        c_bigint = bigint
      }
    }
  }
}

sink {
  Clickhouse {
    host = "127.0.0.1:8123"
    database = "default"
    table = "test"
    username = "xxxxx"
    password = "xxxxx"
  }
}
```

### Tips

> 1. To deploy SeaTunnel, see the [SeaTunnel Deployment Document](../../start-v2/locally/deployment.md).<br/>
> 2. The table to be written to needs to be created in advance, before synchronization starts.<br/>
> 3. When the sink writes to the ClickHouse table, you don't need to set its schema, because the connector queries ClickHouse for the current table's schema information before writing.<br/>

## Clickhouse Sink Config

```hocon
sink {
  Clickhouse {
    host = "localhost:8123"
    database = "default"
    table = "fake_all"
    username = "xxxxx"
    password = "xxxxx"
    clickhouse.config = {
      max_rows_to_read = "100"
      read_overflow_mode = "throw"
    }
  }
}
```

## Split Mode

```hocon
sink {
  Clickhouse {
    host = "localhost:8123"
    database = "default"
    table = "fake_all"
    username = "xxxxx"
    password = "xxxxx"

    # split mode options
    split_mode = true
    # ...
  }
}
```
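
The example above is truncated. As a sketch of a complete split-mode block, the version below also pins the sharding algorithm to a field via `sharding_key`; the field name is an assumption for illustration:

```hocon
sink {
  Clickhouse {
    host = "localhost:8123"
    database = "default"
    table = "fake_all"
    username = "xxxxx"
    password = "xxxxx"

    # split mode options
    split_mode = true
    # pick the target shard from this column rather than at random
    sharding_key = "age"
  }
}
```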

## CDC(Change data capture) Sink

```hocon
sink {
  Clickhouse {
    host = "localhost:8123"
    database = "default"
    table = "fake_all"
    username = "xxxxx"
    password = "xxxxx"

    # cdc options
    primary_key = "id"
    # ...
  }
}
```
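
Since the CDC example is also truncated, here is a complete sketch under the assumption that `support_upsert` is enabled alongside the primary key, which is how updates are resolved according to the options table:

```hocon
sink {
  Clickhouse {
    host = "localhost:8123"
    database = "default"
    table = "fake_all"
    username = "xxxxx"
    password = "xxxxx"

    # cdc options: apply INSERT/UPDATE/DELETE events by primary key
    primary_key = "id"
    # assumed setting: resolve updates as upserts on the primary key
    support_upsert = true
  }
}
```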

## CDC(Change data capture) for *MergeTree engine

```hocon
sink {
  Clickhouse {
    host = "localhost:8123"
    database = "default"
    table = "fake_all"
    username = "xxxxx"
    password = "xxxxx"

    # cdc options
    primary_key = "id"
    # ...
  }
}
```
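
For the `*MergeTree` variant, a complete sketch presumably also enables the experimental lightweight-delete switch documented in the options table; both boolean settings below are assumptions for illustration:

```hocon
sink {
  Clickhouse {
    host = "localhost:8123"
    database = "default"
    table = "fake_all"
    username = "xxxxx"
    password = "xxxxx"

    # cdc options
    primary_key = "id"
    support_upsert = true                          # assumed setting
    # lightweight DELETEs on *MergeTree engines (experimental)
    allow_experimental_lightweight_delete = true   # assumed setting
  }
}
```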
