[Feature][Connector-V2][File] Support compress (#3899)
* [Feature][Connector-V2][File] Support compress

* [Feature][Connector-V2][File] Update e2e tests

* [Feature][Connector-V2][File] Update docs

* [Improve][Connector-V2][File] Update docs

* [Improve][Connector-V2][File] Update option rule
TyrantLucifer authored Jan 16, 2023
1 parent 84508fc commit 55602f6
Showing 26 changed files with 268 additions and 126 deletions.
15 changes: 14 additions & 1 deletion docs/en/connector-v2/sink/FtpFile.md
@@ -49,6 +49,7 @@ By default, we use 2PC commit to ensure `exactly-once`
| sink_columns | array | no | | When this parameter is empty, all fields are sink columns |
| is_enable_transaction | boolean | no | true | |
| batch_size | int | no | 1000000 | |
| compress_codec | string | no | none | |
| common-options | object | no | - | |

### host [string]
@@ -157,6 +158,16 @@ Only support `true` now.

The maximum number of rows in a file. For SeaTunnel Engine, the number of rows in a file is determined jointly by `batch_size` and `checkpoint.interval`. If `checkpoint.interval` is large enough, the sink writer keeps writing rows into one file until the row count exceeds `batch_size`; if `checkpoint.interval` is small, the sink writer creates a new file each time a new checkpoint is triggered.

### compress_codec [string]

The compress codec of the output files. The codecs supported by each file format are listed below (a configuration sketch follows the list):

- txt: `lzo` `none`
- json: `lzo` `none`
- csv: `lzo` `none`
- orc: `lzo` `snappy` `lz4` `zlib` `none`
- parquet: `lzo` `snappy` `lz4` `gzip` `brotli` `zstd` `none`
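
For orientation, here is a minimal sketch of how `compress_codec` could sit alongside the sink's other options. It is illustrative only: the connection options are omitted, and `file_format` and all values shown are placeholders to be checked against the option table above.

```hocon
FtpFile {
    # Connection options (host, port, credentials) omitted; see the option table above
    path = "/data/seatunnel/sink"
    # Text output, so only lzo or none are valid codecs here
    file_format = "text"
    compress_codec = "lzo"
    # Rows per file; file rollover also depends on the job-level checkpoint.interval
    batch_size = 1000000
}
```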

### common options

Sink plugin common parameters, please refer to [Sink Common Options](common-options.md) for details.
@@ -224,4 +235,6 @@ FtpFile {
- Sink columns mapping failed
- When restore writer from states getting transaction directly failed

- [Improve] Support setting batch size for every file ([3625](https://github.com/apache/incubator-seatunnel/pull/3625))

- [Improve] Support file compress ([3899](https://github.com/apache/incubator-seatunnel/pull/3899))
17 changes: 13 additions & 4 deletions docs/en/connector-v2/sink/HdfsFile.md
@@ -50,6 +50,7 @@ By default, we use 2PC commit to ensure `exactly-once`
| sink_columns | array | no | | When this parameter is empty, all fields are sink columns |
| is_enable_transaction | boolean | no | true | |
| batch_size | int | no | 1000000 | |
| compress_codec | string | no | none | |
| kerberos_principal | string | no | - | |
| kerberos_keytab_path | string | no | - | |
@@ -153,6 +154,16 @@ Only support `true` now.

The maximum number of rows in a file. For SeaTunnel Engine, the number of rows in a file is determined jointly by `batch_size` and `checkpoint.interval`. If `checkpoint.interval` is large enough, the sink writer keeps writing rows into one file until the row count exceeds `batch_size`; if `checkpoint.interval` is small, the sink writer creates a new file each time a new checkpoint is triggered.

### compress_codec [string]

The compress codec of the output files. The codecs supported by each file format are listed below (a configuration sketch follows the list):

- txt: `lzo` `none`
- json: `lzo` `none`
- csv: `lzo` `none`
- orc: `lzo` `snappy` `lz4` `zlib` `none`
- parquet: `lzo` `snappy` `lz4` `gzip` `brotli` `zstd` `none`
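
A minimal sketch for this connector as well, assuming the `fs.defaultFS` and `path` options documented on this page; the values are placeholders, not taken from this commit.

```hocon
HdfsFile {
    # Placeholder cluster address and output path
    fs.defaultFS = "hdfs://namenode:9000"
    path = "/tmp/seatunnel/sink"
    # ORC output accepts lzo, snappy, lz4, zlib or none
    file_format = "orc"
    compress_codec = "zlib"
}
```
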
### kerberos_principal [string]

The principal of kerberos
@@ -161,9 +172,6 @@ The principal of kerberos

The keytab path of kerberos


### common options
Sink plugin common parameters, please refer to [Sink Common Options](common-options.md) for details.

@@ -245,4 +253,5 @@ HdfsFile {
### Next version
- [Improve] Support setting batch size for every file ([3625](https://github.com/apache/incubator-seatunnel/pull/3625))
- [Improve] Support lzo compression for text in file format ([3782](https://github.com/apache/incubator-seatunnel/pull/3782))
- [Improve] Support kerberos authentication ([3840](https://github.com/apache/incubator-seatunnel/pull/3840))
- [Improve] Support file compress ([3899](https://github.com/apache/incubator-seatunnel/pull/3899))
13 changes: 13 additions & 0 deletions docs/en/connector-v2/sink/LocalFile.md
@@ -45,6 +45,7 @@ By default, we use 2PC commit to ensure `exactly-once`
| sink_columns | array | no | | When this parameter is empty, all fields are sink columns |
| is_enable_transaction | boolean | no | true | |
| batch_size | int | no | 1000000 | |
| compress_codec | string | no | none | |
| common-options | object | no | - | |

### path [string]
@@ -137,6 +138,16 @@ Only support `true` now.

The maximum number of rows in a file. For SeaTunnel Engine, the number of rows in a file is determined jointly by `batch_size` and `checkpoint.interval`. If `checkpoint.interval` is large enough, the sink writer keeps writing rows into one file until the row count exceeds `batch_size`; if `checkpoint.interval` is small, the sink writer creates a new file each time a new checkpoint is triggered.

### compress_codec [string]

The compress codec of the output files. The codecs supported by each file format are listed below (a configuration sketch follows the list):

- txt: `lzo` `none`
- json: `lzo` `none`
- csv: `lzo` `none`
- orc: `lzo` `snappy` `lz4` `zlib` `none`
- parquet: `lzo` `snappy` `lz4` `gzip` `brotli` `zstd` `none`
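
A similar hedged sketch, here pairing a columnar format with one of its extra codecs; the path and values are placeholders.

```hocon
LocalFile {
    # Placeholder output path
    path = "/tmp/seatunnel/sink"
    # Parquet additionally accepts gzip, brotli and zstd
    file_format = "parquet"
    compress_codec = "zstd"
}
```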

### common options

Sink plugin common parameters, please refer to [Sink Common Options](common-options.md) for details.
@@ -206,3 +217,5 @@ LocalFile {
- When restore writer from states getting transaction directly failed

- [Improve] Support setting batch size for every file ([3625](https://github.com/apache/incubator-seatunnel/pull/3625))

- [Improve] Support file compress ([3899](https://github.com/apache/incubator-seatunnel/pull/3899))
15 changes: 14 additions & 1 deletion docs/en/connector-v2/sink/OssFile.md
@@ -52,6 +52,7 @@ By default, we use 2PC commit to ensure `exactly-once`
| sink_columns | array | no | | When this parameter is empty, all fields are sink columns |
| is_enable_transaction | boolean | no | true | |
| batch_size | int | no | 1000000 | |
| compress_codec | string | no | none | |
| common-options | object | no | - | |

### path [string]
@@ -160,6 +161,16 @@ Only support `true` now.

The maximum number of rows in a file. For SeaTunnel Engine, the number of rows in a file is determined jointly by `batch_size` and `checkpoint.interval`. If `checkpoint.interval` is large enough, the sink writer keeps writing rows into one file until the row count exceeds `batch_size`; if `checkpoint.interval` is small, the sink writer creates a new file each time a new checkpoint is triggered.

### compress_codec [string]

The compress codec of the output files. The codecs supported by each file format are listed below (a configuration sketch follows the list):

- txt: `lzo` `none`
- json: `lzo` `none`
- csv: `lzo` `none`
- orc: `lzo` `snappy` `lz4` `zlib` `none`
- parquet: `lzo` `snappy` `lz4` `gzip` `brotli` `zstd` `none`
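
A sketch for the OSS sink as well, assuming the connection options documented on this page (`bucket`, `access_key`, `access_secret`, `endpoint`); every value below is a placeholder.

```hocon
OssFile {
    path = "/seatunnel/sink"
    bucket = "oss://example-bucket"
    # Placeholder credentials and endpoint
    access_key = "xxxxxxxxxxx"
    access_secret = "xxxxxxxxxxx"
    endpoint = "oss-accelerate.aliyuncs.com"
    # JSON output accepts lzo or none
    file_format = "json"
    compress_codec = "lzo"
}
```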

### common options

Sink plugin common parameters, please refer to [Sink Common Options](common-options.md) for details.
@@ -244,4 +255,6 @@ For orc file format simple config
- Sink columns mapping failed
- When restore writer from states getting transaction directly failed

- [Improve] Support setting batch size for every file ([3625](https://github.com/apache/incubator-seatunnel/pull/3625))

- [Improve] Support file compress ([3899](https://github.com/apache/incubator-seatunnel/pull/3899))
25 changes: 22 additions & 3 deletions docs/en/connector-v2/sink/OssJindoFile.md
@@ -52,6 +52,7 @@ By default, we use 2PC commit to ensure `exactly-once`
| sink_columns | array | no | | When this parameter is empty, all fields are sink columns |
| is_enable_transaction | boolean | no | true | |
| batch_size | int | no | 1000000 | |
| compress_codec | string | no | none | |
| common-options | object | no | - | |

### path [string]
@@ -156,6 +157,20 @@ Please note that, If `is_enable_transaction` is `true`, we will auto add `${tran

Only support `true` now.

### batch_size [int]

The maximum number of rows in a file. For SeaTunnel Engine, the number of rows in a file is determined jointly by `batch_size` and `checkpoint.interval`. If `checkpoint.interval` is large enough, the sink writer keeps writing rows into one file until the row count exceeds `batch_size`; if `checkpoint.interval` is small, the sink writer creates a new file each time a new checkpoint is triggered.

### compress_codec [string]

The compress codec of the output files. The codecs supported by each file format are listed below:

- txt: `lzo` `none`
- json: `lzo` `none`
- csv: `lzo` `none`
- orc: `lzo` `snappy` `lz4` `zlib` `none`
- parquet: `lzo` `snappy` `lz4` `gzip` `brotli` `zstd` `none`

### common options

Sink plugin common parameters, please refer to [Sink Common Options](common-options.md) for details.
@@ -166,7 +181,7 @@ For text file format with `have_partition` and `custom_filename` and `sink_colum

```hocon
OssJindoFile {
path="/seatunnel/sink"
bucket = "oss://tyrantlucifer-image-bed"
access_key = "xxxxxxxxxxx"
Expand All @@ -192,7 +207,7 @@ For parquet file format with `sink_columns`

```hocon
OssJindoFile {
path = "/seatunnel/sink"
bucket = "oss://tyrantlucifer-image-bed"
access_key = "xxxxxxxxxxx"
@@ -221,6 +236,10 @@ For orc file format simple config

## Changelog

### 2.3.0 2022-12-30

- Add OSS Jindo File Sink Connector

### Next version

- [Improve] Support file compress ([3899](https://github.com/apache/incubator-seatunnel/pull/3899))