Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stale lock during tidb rolling restart results in resolved ts lag and cdc lag increase #52108

Closed
fubinzh opened this issue Mar 26, 2024 · 6 comments
Assignees
Labels
affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.1 may-affects-6.5 may-affects-7.1 severity/major sig/sql-infra SIG: SQL Infra type/bug The issue is confirmed as a bug.

Comments

@fubinzh
Copy link

fubinzh commented Mar 26, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. TiDB cluster with titan on, cluster with 30 TiKV nodes, 10 TiDB nodes, cluster size: 70TB, througtput ~40MB/s
    3 CDC changefeed running to sync 4000 tables.
  2. Rolling restart TiDB

2. What did you expect to see? (Required)

CDC lag should not be <10s

3. What did you see instead (Required)

TiKV resolved ts lag increases and cdc lag increases as a results.

image

CDC log indicates that cdc tries to resolved lock.

[2024/03/26 07:56:35.689 +00:00] [INFO] [lock_resolver.go:56] ["resolve lock starts"] [regionID=751118934] [maxVersion=448643672832999424] [namespace=default] [changefeed=jxxu-all]
[2024/03/26 08:02:25.688 +00:00] [INFO] [lock_resolver.go:56] ["resolve lock starts"] [regionID=317512213] [maxVersion=448643763796967424] [namespace=default] [changefeed=jxxu-all]

4. What is your TiDB version? (Required)

Release Version: v8.0.0-alpha-629-g75c8347
Edition: Community
Git Commit Hash: 75c834728e82f9efe7cc12263ff7d33cd810acbf
Git Branch: HEAD
UTC Build Time: 2024-03-26 06:01:54
GoVersion: go1.21.4
Race Enabled: false
Check Table Before Drop: false
Store: unistore

@fubinzh fubinzh added the type/bug The issue is confirmed as a bug. label Mar 26, 2024
@fubinzh fubinzh changed the title stale lock during restarting tikv results in TiKV resolved ts lag and cdc lag increase stale lock during tidb rolling restart results in TiKV resolved ts lag and cdc lag increase Mar 26, 2024
@seiya-annie seiya-annie added the sig/transaction SIG:Transaction label Mar 27, 2024
@cfzjywxk cfzjywxk added type/question The issue belongs to a question. sig/sql-infra SIG: SQL Infra and removed type/bug The issue is confirmed as a bug. sig/transaction SIG:Transaction labels Mar 27, 2024
@ti-chi-bot ti-chi-bot added the affects-8.1 This bug affects the 8.1.x(LTS) versions. label Apr 9, 2024
@fubinzh
Copy link
Author

fubinzh commented Apr 26, 2024

/remove-type question
/type bug

@ti-chi-bot ti-chi-bot bot added type/bug The issue is confirmed as a bug. and removed type/question The issue belongs to a question. labels Apr 26, 2024
@fubinzh
Copy link
Author

fubinzh commented Apr 26, 2024

/severity major

@fubinzh fubinzh changed the title stale lock during tidb rolling restart results in TiKV resolved ts lag and cdc lag increase stale lock during tidb rolling restart results in resolved ts lag and cdc lag increase May 9, 2024
@fubinzh
Copy link
Author

fubinzh commented May 10, 2024

/assign @bb7133

@ti-chi-bot ti-chi-bot added affects-7.5 This bug affects the 7.5.x(LTS) versions. and removed may-affects-7.5 labels May 21, 2024
@YangKeao
Copy link
Member

YangKeao commented Aug 20, 2024

/assign @YangKeao

After doing some experiments and investigating the logic of TiDB's graceful shutdown, here is a brief introduction to the routine and how to avoid this issue. The shutdown process will have three stages:

  1. TiDB announces itself as unhealthy. Then LB will not create new connections on it. This stage will last for graceful-wait-before-shutdown seconds (by default, 0). During this stage, TiDB itself will still be able to accept new connections, execute new queries / transactions.
  2. Stop the listener, enter the shutdown-mode. In this stage, TiDB cannot accept new connections, and cannot begin new transactions or execute new commands (including PREPARE/EXECUTE/QUERY/..., all commands). The client will get an error if it runs new commands on the existing connection. It'll last for 15s, not configurable.
  3. Kill all queries, and then stop the internal services, finally stop the process.

If we don't expect to leak locks, we can have two choices:

  1. Avoid killing queries / transactions. Let these running queries / transactions to stop by themselves.
  2. Avoid leaking locks even when the queries / transactions are killed.

The second goal is relatively hard to reach. However, the first one should be able to handle most of the cases.

  1. For short connection scenario, we can increase graceful-wait-before-shutdown. Therefore, all connections will end their lifetime during the stage 1, and this tidb will finally have no connections, then no lock will be leaked.
  2. For long connection scenario and small transactions, all transactions / running queries will finish normally (ending with COMMIT or ROLLBACK), and return an error to the client. TiDB will also have no connections (at least, no running transactions), so no lock will leak during reboot.
    However, there is a tiny issue that the auto-commit queries are not waited TiDB will not wait for auto-commit statements during graceful shutdown #55464. I'm fixing it right now 🍻 .
  3. For long connection (> graceful-wait-before-shutdown) and big transactions (> 15s), TiDB cannot do well now 🤦. It's still under investigation and I'm not sure how difficult it is to fix it.

And finally, no matter which options you choose, make sure to also increase terminationGracePeriodSeconds if you are using kubernetes, or TiDB will not have enough time to graceful shutdown.

@YangKeao
Copy link
Member

YangKeao commented Aug 26, 2024

After fixing the auto-commit issues, I found that the background async-commit goroutines are not waited. Therefore, if most of the workload is async-commit transactions, TiDB may exit too early and kill background async-commit goroutines and leak the lock. (Also, committing secondary keys happens in background, so if there are many transactions with more than one keys, it'll cause similar issues).

I've submitted PRs to wait for the goroutines which are used to async commit or commit secondary keys: ref tikv/client-go#1432 and #55608

@YangKeao
Copy link
Member

For the most of the cases (the transactions which last for shorter than 15s), the lock will not leak after merging #55608. Therefore I'll close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.1 may-affects-6.5 may-affects-7.1 severity/major sig/sql-infra SIG: SQL Infra type/bug The issue is confirmed as a bug.
Projects
None yet
Development

No branches or pull requests

6 participants