Store user defined insert_id in ZooKeeper #7461
Comments
Makes sense. We can introduce a setting named `…`. Looks easy to implement.
Please, could you elaborate here: how exactly is the user supposed to do this?
The setting allows a user to provide their own deduplication semantics in Replicated*MergeTree. If provided, it is used instead of the data digest to generate the block ID. So, for example, by providing a unique value for the setting in each INSERT statement, the user can avoid the same inserted data being deduplicated. Issue: ClickHouse#7461
The setting allows a user to provide their own deduplication semantics in Replicated*MergeTree. If provided, it is used instead of the data digest to generate the block ID. So, for example, by providing a unique value for the setting in each INSERT statement, the user can avoid the same inserted data being deduplicated. Data inserted within the same INSERT statement is split into blocks according to the *insert_block_size* settings (max_insert_block_size, min_insert_block_size_rows, min_insert_block_size_bytes). Each block within the same INSERT statement gets an ordinal number. The ordinal number is appended to insert_deduplication_token to form the block dedup token, i.e. <token>_0, <token>_1, ... Deduplication is done per block. So, to guarantee deduplication for two identical INSERT queries, the dedup token and the number of blocks have to be the same. Issue: ClickHouse#7461
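A rough illustration of the behaviour described in these commits. The table name `events` and the `execute()` helper are placeholders, not real APIs; the token replaces the data digest in the block ID, so an exact retry with the same token is deduplicated, while a different token makes otherwise identical data insert again.

```python
# Sketch only: `execute(sql)` stands in for whatever client call sends a
# query to the server; `events` is a placeholder Replicated*MergeTree table.

def insert_batch(execute, values_sql, token):
    # The user-defined token is used instead of the data digest to build the
    # block ID; blocks within one INSERT get <token>_0, <token>_1, ...
    execute(
        f"INSERT INTO events "
        f"SETTINGS insert_deduplication_token = '{token}' "
        f"VALUES {values_sql}"
    )

# insert_batch(execute, "(1, 'a'), (2, 'b')", "batch-42")
# insert_batch(execute, "(1, 'a'), (2, 'b')", "batch-42")  # exact retry -> deduplicated
# insert_batch(execute, "(1, 'a'), (2, 'b')", "batch-43")  # new token   -> inserted again
```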
+ reduce the number of allocations on the replicated merge tree path
+ bash test: move insert block settings into a variable
Issue: ClickHouse#7461
Use case
I want to guarantee exactly-once delivery to ClickHouse, without requiring separate persistence, for the ReplicatedMergeTree family of engines when uploading data from a Kafka-like message broker.
Most message brokers have a streaming API. We read a buffer of data, push it to ClickHouse, and commit the offsets once that is done. So in case of failures, such client code may experience double reads; and if the worker re-pushes data to ClickHouse, this may lead to double writes as well. ClickHouse already has a well-defined deduplication mechanism: it will deduplicate the data if you push the exact same batch multiple times. So in our worker (assuming we write to ClickHouse in a single thread) we must know the offsets of the last request. To achieve this, I have to use a separate storage and do two writes into it (one before the insert with the right offset, and one after the ack from ClickHouse to confirm the left offset).
User flow looks something like this:
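A rough sketch of that flow, using hypothetical helper names (`read_batch`, `offset_store`, `ch_insert`, `commit_offsets` are placeholders rather than real APIs):

```python
# Sketch of the current approach: exactly-once requires a *separate*
# persistent offset store next to ClickHouse. All helpers are placeholders.

def run_worker(broker, offset_store, ch_insert):
    while True:
        batch, left, right = broker.read_batch()   # rows plus [left, right) offsets

        # First extra write: remember the right offset we are about to insert.
        offset_store.save_pending(right)

        # Push the batch; ClickHouse dedups an exact retry of the same batch,
        # but the worker cannot tell a retry from new data without the store.
        ch_insert(batch)

        # Second extra write: confirm the range once ClickHouse acknowledged it.
        offset_store.confirm(left)

        broker.commit_offsets(right)
```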
Describe the solution you'd like
The main idea is to eliminate the need for separate storage. ClickHouse already relies on persistence via ZooKeeper, so we could use it for this need as well. If the user provides some string that is meaningful to them with each write request, we can store it consistently in ZK. Then, on the client side, we can look it up and decide which offsets can be skipped.
Ideal user flow looks something like this:
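A sketch of the ideal flow, where the offset (or any user-defined string) is passed along with the INSERT and persisted by ClickHouse/ZooKeeper itself, so the client only has to read it back on restart. Again, `ch_insert_with_token`, `ch_last_token`, and the broker helpers are hypothetical placeholders:

```python
# Sketch of the proposed flow: the only persistence is ClickHouse/ZooKeeper.
# On startup the worker asks ClickHouse for the last stored token (offset)
# and resumes from there; no separate offset store is needed.

def run_worker(broker, ch_insert_with_token, ch_last_token):
    # Recover the position from the last user-defined token stored with the data.
    last = ch_last_token()                       # e.g. "12345" -> resume after offset 12345
    broker.seek(int(last) + 1 if last else 0)

    while True:
        batch, left, right = broker.read_batch()

        # The token is meaningful to the client (here: the right offset).
        # A retry of the same batch with the same token is deduplicated.
        ch_insert_with_token(batch, token=str(right))

        broker.commit_offsets(right)
```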
Describe alternatives you've considered
We could continue to use a separate offset store.
Additional context
It may also be useful internally for the Kafka storage engine (which has the same problem).