DDL gets stuck and keeps printing "syncer check all versions, someone is not synced" even though the job has already synced #57003

Open
D3Hunter opened this issue Oct 30, 2024 · 3 comments
Labels
affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. component/ddl This issue is related to DDL of TiDB. severity/moderate type/bug The issue is confirmed as a bug.

Comments

@D3Hunter
Contributor

D3Hunter commented Oct 30, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

timeline:

  • the old owner, node A, starts scheduling DDL jobs but has not yet queried the job table
  • another node, B, becomes the new owner, and a client submits a create-table DDL job J
  • node B moves J to the Done state and starts waiting for all nodes to sync
  • node A reads job J from the table, finds it in the Done state with un-synced MDL info, and tries to sync it
  • node B finishes the sync, deletes all MDL keys of the job, and finishes job J
  • node A calls waitSchemaSyncedForMDL, which never returns because the MDL keys will not be written again, so it keeps printing the log
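
The last two steps of the timeline can be illustrated with a toy model. This is only a sketch, not TiDB's code: `mdlStore`, `allSynced`, and `waitSynced` are hypothetical names, the store is an in-memory map rather than etcd, and the poll count is bounded so the example terminates. It assumes that a missing MDL key is treated as "not synced", which is what makes the wait on node A hang once node B has deleted the keys.

```go
package main

import "fmt"

// mdlStore is a hypothetical stand-in for the shared store that holds the
// per-node MDL keys of one job: node ID -> schema version the node reported.
type mdlStore map[string]int64

// allSynced reports whether every expected node has a key at or beyond target.
// Assumption: a missing key counts as "not synced".
func allSynced(s mdlStore, nodes []string, target int64) bool {
	for _, n := range nodes {
		v, ok := s[n]
		if !ok || v < target {
			return false
		}
	}
	return true
}

// waitSynced polls at most maxPolls times. It returns how many polls were
// used and whether the sync was observed. The bound exists only so this
// sketch terminates; an unbounded wait would spin forever in the stale case.
func waitSynced(s mdlStore, nodes []string, target int64, maxPolls int) (int, bool) {
	for i := 1; i <= maxPolls; i++ {
		if allSynced(s, nodes, target) {
			return i, true
		}
	}
	return maxPolls, false
}

func main() {
	nodes := []string{"A", "B"}
	store := mdlStore{"A": 42, "B": 42}

	// New owner B observes the sync, then deletes the job's MDL keys.
	_, ok := waitSynced(store, nodes, 42, 3)
	fmt.Println("owner B synced:", ok) // owner B synced: true
	for _, n := range nodes {
		delete(store, n)
	}

	// The stale scheduler on A now waits on the same job: the keys are gone
	// and nobody will write them again, so the condition can never hold.
	polls, ok := waitSynced(store, nodes, 42, 5)
	fmt.Println("stale scheduler on A synced:", ok, "after", polls, "polls")
	// stale scheduler on A synced: false after 5 polls
}
```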

the reason is that in older TiDB versions the DDL scheduler loop is not cancelled and does not exit when the node loses ownership; it keeps running. It checks whether it is the owner inside the loop, so once it has started scheduling jobs it does not return until it has finished one job step

tidb/pkg/ddl/job_table.go

Lines 279 to 284 in a7df4f9

if !d.isOwner() {
	isOnce = true
	d.onceMap = make(map[int64]struct{}, jobOnceCapacity)
	time.Sleep(dispatchLoopWaitingDuration)
	continue
}
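
To make the consequence of the excerpt above concrete, here is a minimal sketch of a loop iteration that checks ownership only at the top. The `scheduler` type, its `owner` field, and `runOnce` are hypothetical and are not TiDB's types; the point is only that a step which has already started is not interrupted when ownership moves.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// scheduler is a hypothetical stand-in for the old dispatch loop state.
type scheduler struct {
	owner atomic.Bool
}

// runOnce mimics one loop iteration: ownership is checked once at the top,
// and the job step then runs to completion regardless of later changes.
func (s *scheduler) runOnce(step func()) bool {
	if !s.owner.Load() {
		return false // not the owner: skip this iteration
	}
	step()
	return true
}

// demoStaleStep loses ownership in the middle of a step and reports whether
// the step ran, whether the node was still owner when the step finished, and
// whether the following iteration ran.
func demoStaleStep() (ran, ownerAtEnd, nextRan bool) {
	s := &scheduler{}
	s.owner.Store(true)
	ran = s.runOnce(func() {
		s.owner.Store(false) // another node becomes owner mid-step
		ownerAtEnd = s.owner.Load()
	})
	nextRan = s.runOnce(func() {})
	return ran, ownerAtEnd, nextRan
}

func main() {
	fmt.Println(demoStaleStep()) // true false false
}
```

The step completes even though the node is no longer the owner; only the next iteration notices the change. If that step is a wait that can never be satisfied, the loop never reaches the next check.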

marked as moderate because it is a corner case that is hard to reproduce in the real world. I found it while upgrading to the current master, where we force the new node to become the owner, so there is always an owner change and a higher chance of triggering this bug.

since v8.2.0, this part has been refactored to cancel the scheduler loop when the owner retires, so the issue does not occur there

tidb/pkg/ddl/job_table.go

Lines 176 to 177 in 821e491

func (s *jobScheduler) close() {
	s.cancel()

2. What did you expect to see? (Required)

the DDL does not get stuck

3. What did you see instead (Required)

the DDL gets stuck and keeps printing "syncer check all versions, someone is not synced"

4. What is your TiDB version? (Required)

@D3Hunter D3Hunter added type/bug The issue is confirmed as a bug. severity/moderate affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. component/ddl This issue is related to DDL of TiDB. affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. labels Oct 30, 2024
@D3Hunter
Contributor Author

workaround: restart the TiDB node that is stuck

@D3Hunter
Contributor Author

D3Hunter commented Oct 31, 2024

duplicate of #53073, but the issue is still there

@D3Hunter
Contributor Author

D3Hunter commented Jan 9, 2025

the fix in #53234 only mitigates the problem for DDLs on non-system tables; it does not work for system DDLs, such as those run during upgrade
