Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dm: Duplicate entry xx for key 'PRIMARY' on test-infra case tolerance 002 #3698

Closed
D3Hunter opened this issue Dec 2, 2021 · 11 comments
Closed
Assignees
Labels
affects-5.4 This bug affects the 5.4.x(LTS) versions. affects-6.0 affects-6.1 This bug affects the 6.1.x(LTS) versions. affects-6.2 area/dm Issues or PRs related to DM. found/automation Bugs found by automation cases severity/major type/bug The issue is confirmed as a bug.

Comments

@D3Hunter
Copy link
Contributor

D3Hunter commented Dec 2, 2021

dm-dm-worker-0.log

What did you do?

test-infra case tolerance 002 (unstable, may not reproduce)

  • task test runs on worker-1
  • after 1 minute, kill worker-1
  • test scheduled to worker-0
  • error after a while

What did you expect to see?

sync-diff success after sysbench finish

What did you see instead?

test paused due to error in title.

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

Release Version: v5.2.0-master
Git Commit Hash: 77029891ff124a5a0e07d5b4dbd4c4f40c6c6268
Git Branch: master
UTC Build Time: 2021-12-01 14:43:41
Go Version: go version go1.16.4 linux/amd64
(paste DM version here, and you must ensure versions of dmctl, DM-worker and DM-master are same)

Upstream MySQL/MariaDB server version:

(paste upstream MySQL/MariaDB server version here)

Downstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

(paste TiDB cluster version here)

How did you deploy DM: tiup or manually?

(leave TiUP or manually here)

Other interesting information (system version, hardware config, etc):

>
>

current status of DM cluster (execute query-status <task-name> in dmctl)

{
    "result": true,
    "msg": "",
    "sources": [
        {
            "result": true,
            "msg": "",
            "sourceStatus": {
                "source": "source-0",
                "worker": "dm-dm-worker-0",
                "result": null,
                "relayStatus": null
            },
            "subTaskStatus": [
                {
                    "name": "test",
                    "stage": "Paused",
                    "unit": "Sync",
[dm-dm-worker-0.log](https://github.com/pingcap/ticdc/files/7639643/dm-dm-worker-0.log)

                    "result": {
                        "isCanceled": false,
                        "errors": [
                            {
                                "ErrCode": 10006,
                                "ErrClass": "database",
                                "ErrScope": "not-set",
                                "ErrLevel": "high",
                                "Message": "startLocation: [position: (, 0), gtid-set: ], endLocation: [position: (mysql-bin.000004, 132471114), gtid-set: aeca5109-532d-11ec-93ba-a2baf50ca1f8:1-692071]: execute statement failed: UPDATE `dm_tolerance_002`.`sbtest3` SET `id` = ?, `k` = ?, `c` = ?, `pad` = ? WHERE `id` = ? LIMIT 1",
                                "RawCause": "context deadline exceeded",
                                "Workaround": ""
                            },
                            {
                                "ErrCode": 10006,
                                "ErrClass": "database",
                                "ErrScope": "not-set",
                                "ErrLevel": "high",
                                "Message": "startLocation: [position: (, 0), gtid-set: ], endLocation: [position: (mysql-bin.000004, 132861590), gtid-set: aeca5109-532d-11ec-93ba-a2baf50ca1f8:1-692296]: execute statement failed: DELETE FROM `dm_tolerance_002`.`sbtest3` WHERE `id` = ? LIMIT 1",
                                "RawCause": "context deadline exceeded",
                                "Workaround": ""
                            },
                            {
                                "ErrCode": 10006,
                                "ErrClass": "database",
                                "ErrScope": "not-set",
                                "ErrLevel": "high",
                                "Message": "startLocation: [position: (, 0), gtid-set: ], endLocation: [position: (mysql-bin.000004, 133203087), gtid-set: aeca5109-532d-11ec-93ba-a2baf50ca1f8:1-692493]: execute statement failed: commit",
                                "RawCause": "Error 1062: Duplicate entry '4796' for key 'PRIMARY'",
                                "Workaround": ""
                            }
                        ],
                        "detail": null
                    },
                    "unresolvedDDLLockID": "",
                    "sync": {
                        "totalEvents": "2307031",
                        "totalTps": "389",
                        "recentTps": "0",
                        "masterBinlog": "(mysql-bin.000005, 62909898)",
                        "masterBinlogGtid": "aeca5109-532d-11ec-93ba-a2baf50ca1f8:1-1271853",
                        "syncerBinlog": "(mysql-bin.000004, 112292682)",
                        "syncerBinlogGtid": "aeca5109-532d-11ec-93ba-a2baf50ca1f8:1-680420",
                        "blockingDDLs": [
                        ],
                        "unresolvedGroups": [
                        ],
                        "synced": false,
                        "binlogType": "remote",
                        "secondsBehindMaster": "142"
                    }
                }
            ]
        }
    ]
}
@D3Hunter D3Hunter added type/bug The issue is confirmed as a bug. area/dm Issues or PRs related to DM. labels Dec 2, 2021
@XuJianxu XuJianxu added found/automation Bugs found by automation cases severity/major labels Dec 2, 2021
@GMHDBJD
Copy link
Contributor

GMHDBJD commented Dec 6, 2021

May be the same as the root cause of #3737

@lance6716
Copy link
Contributor

close it due to we have a duplicate issue #3737

@D3Hunter
Copy link
Contributor Author

D3Hunter commented Jan 11, 2022

May be the same as the root cause of #3737

even we kill worker immediately in #4035, may still face this problem as safe-mode already disabled.

reopen this issue.

[2021/12/02 05:10:11.987 +00:00] [INFO] [mode.go:60] ["enable safe-mode because of task initialization"] [task=test] [unit="binlog replication"] ["duration in seconds"=60]
[2021/12/02 05:11:11.987 +00:00] [INFO] [mode.go:49] ["disable safe-mode after task initialization finished"] [task=test] [unit="binlog replication"]

...

[2021/12/02 05:11:52.197 +00:00] [WARN] [db.go:212] ["execute transaction too slow"] [task=test] [unit="binlog replication"]

...

[2021/12/02 05:14:16.183 +00:00] [ERROR] [db.go:221] ["execute statements failed after retry"] [task=test] [unit="binlog replication"] [queries="[INSERT INTO `dm_tolerance_002`.`sbtest4` (`id`,`k`,`c`,`pad`) VALUES (?,?,?,?) DELETE FROM `dm_tolerance_002`.`sbtest1` WHERE `id` = ? LIMIT 1 INSERT INTO `dm_tolerance_002`.`sbtest1` (`id`,`k`,`c`,`pad`) VALUES (?,?,?,?) UPDATE `dm_tolerance_002`.`sbtest1` SET `id` = ?, `k` = ?, `c` = ?, `pad` = ? WHERE `id` = ? LIMIT 1 UPDATE `dm_tolerance_002`.`sbtest2` SET `id` = ?, `k` = ?, `c` = ?, `pad` = ? WHERE `id` = ? LIMIT 1 DELETE FROM `dm_tolerance_002`.`sbtest1` WHERE `id` = ? LIMIT 1 INSERT INTO `dm_tolerance_002`.`sbtest1` (`id`,`k`,`c`,`pad`) VALUES (?,?,?,?) UPDATE `dm_tolerance_002`.`sbtest1` SET `id` = ?, `k` = ?, `c` = ?, `pad` = ? WHERE `id` = ? LIMIT 1 UPDATE `dm_tolerance_002`.`sbtest2` SET `id` = ?, `k` = ?, `c` = ?, `pad` = ? WHERE `id` = ? LIMIT 1 DELETE FROM `dm_tolerance_002`.`sbtest1` WHERE `id` = ? LIMIT 1 INSERT INTO `dm_tolerance_002`.`sbtest1` (`id`,`k`,`c`,`pad`) VALUES (?,?,?,?) UPDATE `dm_tolerance_002`.`sbtest1` SET `id` = ?, `k` = ?, `c` = ?, `pad` = ? WHERE `id` = ? LIMIT 1 UPDATE `dm..."] [arguments="[[4796 4033 00332538472-45300838740-16284058127-32233360499-15203057860-98779257262-59238368617-03103262563-94446194243-90885811681 52024155751-45996748575-68153529193-44617116940-31285824166] [5034] [5034 4978 59996577916-23220472548-86238799157-72258394656-49288020166-16175076139-82537802842-64491766574-28476777507-78479440960 92878987294-51102153908-87313354701-78211721587-98859087547] [5029 5719 43124473913-51491576245-20131002579-11047162146-81071920017-73857341512-41299613510-64468582292-81662157463-03074572800 26892742410-57882156655-83786876840-63735946207-29073029099 5029] [5027 5049 59224392607-81587276637-05345634467-76993296610-20251163487-25727552778-42994873654-71255443841-52194027527-30750576176 78526239341-32295437887-44733862143-29464260710-89343040520 5027] [5029] [5029 4398 62199317981-90946196588-72004110037-43591317516-99521048507-40500245829-19006959631-53139166283-28562673495-48410416800 83277985731-31640806611-43116984061-96256850333-95053762979] [5002 4994 43413062843-97829892709-8446..."] [error="[code=10006:class=database:scope=not-set:level=high], Message: execute statement failed: commit, RawCause: Error 1062: Duplicate entry '4796' for key 'PRIMARY'"]

@Ehco1996
Copy link
Contributor

May be the same as the root cause of #3737

even we kill worker immediately in #4035, may still face this problem as safe-mode already disabled.

reopen this issue.

[2021/12/02 05:10:11.987 +00:00] [INFO] [mode.go:60] ["enable safe-mode because of task initialization"] [task=test] [unit="binlog replication"] ["duration in seconds"=60]
[2021/12/02 05:11:11.987 +00:00] [INFO] [mode.go:49] ["disable safe-mode after task initialization finished"] [task=test] [unit="binlog replication"]

...

[2021/12/02 05:11:52.197 +00:00] [WARN] [db.go:212] ["execute transaction too slow"] [task=test] [unit="binlog replication"]

...

[2021/12/02 05:14:16.183 +00:00] [ERROR] [db.go:221] ["execute statements failed after retry"] [task=test] [unit="binlog replication"] [queries="[INSERT INTO `dm_tolerance_002`.`sbtest4` (`id`,`k`,`c`,`pad`) VALUES (?,?,?,?) DELETE FROM `dm_tolerance_002`.`sbtest1` WHERE `id` = ? LIMIT 1 INSERT INTO `dm_tolerance_002`.`sbtest1` (`id`,`k`,`c`,`pad`) VALUES (?,?,?,?) UPDATE `dm_tolerance_002`.`sbtest1` SET `id` = ?, `k` = ?, `c` = ?, `pad` = ? WHERE `id` = ? LIMIT 1 UPDATE `dm_tolerance_002`.`sbtest2` SET `id` = ?, `k` = ?, `c` = ?, `pad` = ? WHERE `id` = ? LIMIT 1 DELETE FROM `dm_tolerance_002`.`sbtest1` WHERE `id` = ? LIMIT 1 INSERT INTO `dm_tolerance_002`.`sbtest1` (`id`,`k`,`c`,`pad`) VALUES (?,?,?,?) UPDATE `dm_tolerance_002`.`sbtest1` SET `id` = ?, `k` = ?, `c` = ?, `pad` = ? WHERE `id` = ? LIMIT 1 UPDATE `dm_tolerance_002`.`sbtest2` SET `id` = ?, `k` = ?, `c` = ?, `pad` = ? WHERE `id` = ? LIMIT 1 DELETE FROM `dm_tolerance_002`.`sbtest1` WHERE `id` = ? LIMIT 1 INSERT INTO `dm_tolerance_002`.`sbtest1` (`id`,`k`,`c`,`pad`) VALUES (?,?,?,?) UPDATE `dm_tolerance_002`.`sbtest1` SET `id` = ?, `k` = ?, `c` = ?, `pad` = ? WHERE `id` = ? LIMIT 1 UPDATE `dm..."] [arguments="[[4796 4033 00332538472-45300838740-16284058127-32233360499-15203057860-98779257262-59238368617-03103262563-94446194243-90885811681 52024155751-45996748575-68153529193-44617116940-31285824166] [5034] [5034 4978 59996577916-23220472548-86238799157-72258394656-49288020166-16175076139-82537802842-64491766574-28476777507-78479440960 92878987294-51102153908-87313354701-78211721587-98859087547] [5029 5719 43124473913-51491576245-20131002579-11047162146-81071920017-73857341512-41299613510-64468582292-81662157463-03074572800 26892742410-57882156655-83786876840-63735946207-29073029099 5029] [5027 5049 59224392607-81587276637-05345634467-76993296610-20251163487-25727552778-42994873654-71255443841-52194027527-30750576176 78526239341-32295437887-44733862143-29464260710-89343040520 5027] [5029] [5029 4398 62199317981-90946196588-72004110037-43591317516-99521048507-40500245829-19006959631-53139166283-28562673495-48410416800 83277985731-31640806611-43116984061-96256850333-95053762979] [5002 4994 43413062843-97829892709-8446..."] [error="[code=10006:class=database:scope=not-set:level=high], Message: execute statement failed: commit, RawCause: Error 1062: Duplicate entry '4796' for key 'PRIMARY'"]

yes this is another problem when worker stopped but execSQL thread are still retry 😂, maybe add a cancel func to cancel DMLWorker.tctx

@D3Hunter
Copy link
Contributor Author

also we should make sure safe-mode is enabled when executing the half-done transaction in worker2 as when downstream is slow, 1 minute may not enough.

@Ehco1996
Copy link
Contributor

also we should make sure safe-mode is enabled when executing the half-done transaction in worker2 as when downstream is slow, 1 minute may not enough.

dose worker-2 is the new worker? if so, it will enable safe-mode until exitSafeModeLoc

so if we can make sure all execSQL thread was shutdown before old worker killed, there is no more write conflicts

@D3Hunter
Copy link
Contributor Author

dose worker-2 is the new worker?

yes

it will enable safe-mode until exitSafeModeLoc

if so, seems the error should not happen. see attached log, safe-mode is enabled between 05:10:11.987 and 05:11:11.987, but after 3 minutes, at 05:14:16.183 we got a Duplicate entry '4796' for key 'PRIMARY' error.

@Ehco1996
Copy link
Contributor

if so, seems the error should not happen. see attached log, safe-mode is enabled between 05:10:11.987 and 05:11:11.987, but after 3 minutes, at 05:14:16.183 we got a Duplicate entry '4796' for key 'PRIMARY' error.

i think this is caused by old-worker's retry-write, after #4035 is merged, woker will cancel all execSQL thread before close, i will try reproduce and check wheather that pr can fix this issue case today

@jebter jebter added the affects-5.4 This bug affects the 5.4.x(LTS) versions. label Jan 18, 2022
@niubell
Copy link
Contributor

niubell commented Apr 27, 2022

/assign WizardXiao

@niubell
Copy link
Contributor

niubell commented Apr 27, 2022

/unassign D3Hunter

@VelocityLight VelocityLight added the affects-6.1 This bug affects the 6.1.x(LTS) versions. label May 20, 2022
@WizardXiao
Copy link
Contributor

WizardXiao commented Jul 25, 2022

case tolerance 002 has been changed to dm_tolerance_worker_down, and previous comments indicated that there was no repetition, so this issue may be old, we can close it and if similar problems occur, we can find it out in new use case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-5.4 This bug affects the 5.4.x(LTS) versions. affects-6.0 affects-6.1 This bug affects the 6.1.x(LTS) versions. affects-6.2 area/dm Issues or PRs related to DM. found/automation Bugs found by automation cases severity/major type/bug The issue is confirmed as a bug.
Projects
None yet
Development

No branches or pull requests

9 participants