
snap_backup: snapshot backup isn't compatible with importing #46850

Closed
YuJuncen opened this issue Sep 11, 2023 · 14 comments
Labels
affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. component/br This issue is related to BR of TiDB. feature/developing the related feature is in development severity/major type/bug The issue is confirmed as a bug.

Comments

@YuJuncen (Contributor)

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Run importing.
  2. Run snapshot backup.
  3. Restore the backup.

2. What did you expect to see? (Required)

The restore should succeed, because the backup succeeded.

3. What did you see instead? (Required)

The restored cluster (sometimes) keeps panicking due to an "ingest SST not found" error.

4. What is your TiDB version? (Required)

current master.

@YuJuncen YuJuncen added the type/bug The issue is confirmed as a bug. label Sep 11, 2023
@YuJuncen (Contributor, Author) commented Sep 11, 2023

Applying the raft command Ingest requires external context beyond the raft log (the SST to be ingested must be on the local disk). But snapshot restoring only guarantees the consistency of the raft state machine; any outer events might be reordered. (That is reasonable in most cases, because the consistency of TiKV itself is essentially built on raft.)

For example, one error-prone event sequence (ordered by wall clock) would be:

[1] ----^-------|-$--->
[2] --------|-^---$--->

Legend:
^: `write` or `download` RPC done (which creates the SST to be imported).
|: the snapshot taken.
$: the node applied the `Ingest` command.

Here, the Ingest command is applied by Node 1, hence Node 1 will try to replicate it to all of its followers once restored. But when Node 2 took its snapshot, that SST didn't exist on it yet. Once Node 2 tries to apply the Ingest command, it will panic.
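The timeline above can be sketched as a toy replay (illustrative only, not TiKV code): the raft log only names the SST, while the snapshot captures each node's local files, so a node whose snapshot predates the `write` RPC panics when replaying the Ingest command.

```python
def restore_and_apply(snapshot_ssts, replayed_log):
    """Replay Ingest commands on a restored node. The command only names the
    SST; the file itself must already exist in the snapshot's local state."""
    local = set(snapshot_ssts)
    for cmd, sst in replayed_log:
        if cmd == "ingest" and sst not in local:
            raise RuntimeError(f"ingest sst not found: {sst}")
    return "ok"

log = [("ingest", "sst-1")]  # replicated to both nodes by the restored leader

# Node 1: the write RPC (^) finished before its snapshot (|) -> SST on disk.
assert restore_and_apply({"sst-1"}, log) == "ok"

# Node 2: its snapshot (|) was taken before the write RPC (^) -> SST missing.
try:
    restore_and_apply(set(), log)
    assert False, "expected a panic"
except RuntimeError as e:
    assert "ingest sst not found" in str(e)
```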

@YuJuncen (Contributor, Author) commented Sep 11, 2023

Simply waiting for lightning to exit will be fine, because tidb-lightning and BR internally keep the order of downloading, writing, and ingesting. And tidb-lightning exiting implies that no further import events related to it will happen.

So, given that the "importing context" (the state of the SSTs to be ingested and the Ingest commands of an (instant, node_id) tuple, for all node_ids) is consistent and will no longer change after tidb-lightning or br exits, it is easy to prove that choosing ANY instant after the exit point on each node yields a consistent "importing context".

For each node:

[Many events triggered by importing]-->Exit
                                       ^ Consistency holds here,
                                         and no further events will break it.
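The argument above can be sketched as a small simulation (illustrative only): the restored cluster is consistent iff every node's snapshot already contains every SST that the replayed raft log will ingest, and any per-node snapshot instants at or after the importer's exit satisfy this.

```python
def cluster_consistent(nodes, instants, ingested_ssts):
    """nodes: per-node event lists of (time, kind, sst); instants: per-node
    snapshot times; ingested_ssts: SSTs referenced by Ingest commands in the
    raft log, which every restored node must replay."""
    for events, t in zip(nodes, instants):
        on_disk = {s for (ts, kind, s) in events if kind == "write" and ts <= t}
        if not set(ingested_ssts) <= on_disk:
            return False  # this node would panic: ingest sst not found
    return True

leader   = [(1, "write", "a"), (2, "ingest", "a")]
follower = [(3, "write", "a"), (4, "ingest", "a")]  # its write RPC landed later
log = ["a"]

# Mid-import snapshots (both nodes at t=2) are inconsistent: the follower's
# snapshot predates its write RPC, yet it must replay the leader's Ingest.
assert not cluster_consistent([leader, follower], [2, 2], log)

# Any snapshot instants after the importer exited (t >= 4) are consistent.
assert cluster_consistent([leader, follower], [4, 5], log)
```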

@lance6716 (Contributor)

So RegisterTask is very critical to supporting this check. Will it return an error when keepalive fails?

@lance6716 (Contributor)

Also, a lightning task may run for multiple hours. Is it acceptable that the RPO becomes larger due to the import? Between backup and import, which has higher priority?

@YuJuncen (Contributor, Author)

> And lightning task may run for multiple hours, is it acceptable that RPO is larger due to import? Backup or import, which has higher priority?

I think, given that taking the snapshot backup lasts only a tiny time period (the CreateVolumeSnapshot request usually responds within seconds), it might be acceptable to temporarily pause importing? (Thanks to checkpoints.)
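The coordination being proposed can be sketched as a lease-based gate (names are illustrative, not the real BR/TiKV API): BR suspends ingestion for a short lease, takes the volume snapshot, and the importer simply retries until the lease expires, resuming from its checkpoint.

```python
import time

class ImportGate:
    """Toy model of suspending SST ingestion for a short lease while a
    volume snapshot is taken (hypothetical API, for illustration only)."""

    def __init__(self):
        self.suspended_until = 0.0

    def suspend(self, lease_secs):
        # BR: pause ingestion for the duration of the snapshot.
        self.suspended_until = time.monotonic() + lease_secs

    def try_ingest(self):
        remaining = self.suspended_until - time.monotonic()
        if remaining > 0:
            return f"imports are suspended for {remaining:.2f}s"
        return "ingested"

gate = ImportGate()
gate.suspend(lease_secs=0.05)            # BR: CreateVolumeSnapshot starts
assert "suspended" in gate.try_ingest()  # importer: sees busy, will retry
time.sleep(0.06)                         # snapshot finishes within seconds
assert gate.try_ingest() == "ingested"   # importer resumes from checkpoint
```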

@lance6716 (Contributor)

> > And lightning task may run for multiple hours, is it acceptable that RPO is larger due to import? Backup or import, which has higher priority?
>
> I think given taking snapshot backup lasts for a tiny time period(the CreateVolumeSnapshot request usually response within seconds), It might be acceptable to temporarily stop importing? (thanks to checkpoints)

LGTM. I think you can ask the PM to make a final decision. Maybe let SSTImporter return some error message to let lightning restart from the write API.

@YuJuncen (Contributor, Author)

> So RegisterTask is very critical to support this check. Will it return error when keepalive fails?

Unfortunately, not for now. It just prints errors and retries registering itself. I think an onError hook for registration might be useful.
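The suggested onError hook could look like this sketch (hypothetical API, not the current implementation): instead of only logging, the register surfaces keepalive failures to the caller, so the backup-side safety check can react.

```python
class TaskRegister:
    """Toy model of a task register whose keepalive failures are surfaced
    through an optional onError hook (illustrative names only)."""

    def __init__(self, on_error=None):
        self.on_error = on_error
        self.errors = []

    def keepalive(self, ok):
        if not ok:
            err = "keepalive failed"
            self.errors.append(err)       # today: just log...
            if self.on_error:
                self.on_error(err)        # ...proposed: also notify the caller
            # ...then retry registering itself, as the current code does

seen = []
reg = TaskRegister(on_error=seen.append)
reg.keepalive(ok=True)
reg.keepalive(ok=False)
assert seen == ["keepalive failed"]
```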

@YuJuncen YuJuncen added severity/major affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. labels Sep 11, 2023
@ti-chi-bot ti-chi-bot bot added may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.1 labels Sep 11, 2023
@YuJuncen YuJuncen removed may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.1 labels Sep 11, 2023
@BornChanger (Contributor) commented Sep 11, 2023

/component br

@ti-chi-bot bot commented Sep 11, 2023

@BornChanger: The label(s) component/backup-restore cannot be applied, because the repository doesn't have them.

In response to this:

/component backup-restore

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

BornChanger pushed a commit to BornChanger/tidb that referenced this issue Sep 28, 2023
ti-chi-bot bot pushed a commit that referenced this issue Oct 16, 2023
@ti-chi-bot ti-chi-bot added the affects-7.5 This bug affects the 7.5.x(LTS) versions. label Oct 23, 2023
@BornChanger (Contributor)

@YuJuncen I think we can close this issue.

@lance6716 (Contributor)

@BornChanger let me check if lightning handles this new behaviour tomorrow

@lance6716 (Contributor) commented Nov 7, 2023

The lightning side will see a "ServerIdBusy" error with the message `imports are suspended for {time_to_lease_expire:?}`, e.g. `Suspended { time_to_lease_expire: 292.97s }`, and lightning will retry the ingest later; no changes need to be made.

@lance6716 lance6716 reopened this Nov 21, 2023
@lance6716 (Contributor)

lightning sees an RPC error instead of an RPC response with error fields. Although lightning can handle it as a default error, some unnecessary retries could be skipped, and we should record this error to display it to the user.

@mittalrishabh (Contributor)

It retries from the beginning instead of from the checkpoint, which impacts the speed of ingestion. This problem is severe because a backup is taken every 30 minutes.
Even though we are talking about master here, I would assume that it exists in 6.5 as well.

guoshouyan pushed a commit to guoshouyan/tidb that referenced this issue Mar 5, 2024
@YuJuncen YuJuncen closed this as completed Mar 6, 2024