
snap_backup: snapshot backup isn't compatible with importing #46850

Closed
YuJuncen opened this issue Sep 11, 2023 · 14 comments
Labels
affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. component/br This issue is related to BR of TiDB. feature/developing the related feature is in development severity/major type/bug The issue is confirmed as a bug.

Comments

@YuJuncen (Contributor)

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Run importing.
  2. Run snapshot backup.
  3. Restore the backup.

2. What did you expect to see? (Required)

The restore should succeed, because the backup succeeded.

3. What did you see instead? (Required)

The restored cluster (sometimes) keeps panicking due to an "ingest SST not found" error.

4. What is your TiDB version? (Required)

current master.

@YuJuncen YuJuncen added the type/bug The issue is confirmed as a bug. label Sep 11, 2023
@YuJuncen (Contributor, Author) commented Sep 11, 2023

Applying the raft command Ingest requires external context beyond the raft log (the SST to be ingested must be on the local disk). But snapshot restoring only guarantees the consistency of the raft state machine; any outer events might be reordered. (That is reasonable in most cases, because the consistency of TiKV itself is essentially built on raft.)

For example, one error-prone event sequence (ordered by wall clock) would be:

[1] ----^-------|-$--->
[2] --------|-^---$--->

Legend:
^: `write` or `download` RPC done (which creates the SST to be imported).
|: the snapshot taken.
$: the node applied the `Ingest` command.

Here, the Ingest command is applied by Node 1, hence Node 1 will try to replicate it to all of its followers once restored. But when Node 2 took its snapshot, that SST didn't exist on it yet. Once Node 2 tries to apply the Ingest command, it will panic.
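The timeline above can be sketched as a toy replay (illustrative only, not TiKV code): the raft log only names the SST, while the snapshot captures each node's local files, so a node whose snapshot predates the `write` RPC panics when replaying the Ingest command.

```python
def restore_and_apply(snapshot_ssts, replayed_log):
    """Replay Ingest commands on a restored node. The command only names the
    SST; the file itself must already exist in the snapshot's local state."""
    local = set(snapshot_ssts)
    for cmd, sst in replayed_log:
        if cmd == "ingest" and sst not in local:
            raise RuntimeError(f"ingest sst not found: {sst}")
    return "ok"

log = [("ingest", "sst-1")]  # replicated to both nodes by the restored leader

# Node 1: the write RPC (^) finished before its snapshot (|) -> SST on disk.
assert restore_and_apply({"sst-1"}, log) == "ok"

# Node 2: its snapshot (|) was taken before the write RPC (^) -> SST missing.
try:
    restore_and_apply(set(), log)
    assert False, "expected a panic"
except RuntimeError as e:
    assert "ingest sst not found" in str(e)
```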

@YuJuncen (Contributor, Author) commented Sep 11, 2023

Simply waiting for lightning to exit will be fine, because tidb-lightning and BR internally keep the order of downloading, writing, and ingesting. And tidb-lightning exiting implies that no further import events related to it will happen.

So, given that the "importing context" (the state of the SSTs to be ingested and the Ingest commands of an (instant, node_id) tuple, for all node_ids) is consistent and will no longer change after tidb-lightning or br exits, it is easy to prove that choosing ANY instant after the exit point on each node yields a consistent "importing context".

For each node:

[Many events triggered by importing]-->Exit
                                       ^ Consistency holds here,
                                         and no further events will break it.
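The argument above can be sketched as a small simulation (illustrative only): the restored cluster is consistent iff every node's snapshot already contains every SST that the replayed raft log will ingest, and any per-node snapshot instants at or after the importer's exit satisfy this.

```python
def cluster_consistent(nodes, instants, ingested_ssts):
    """nodes: per-node event lists of (time, kind, sst); instants: per-node
    snapshot times; ingested_ssts: SSTs referenced by Ingest commands in the
    raft log, which every restored node must replay."""
    for events, t in zip(nodes, instants):
        on_disk = {s for (ts, kind, s) in events if kind == "write" and ts <= t}
        if not set(ingested_ssts) <= on_disk:
            return False  # this node would panic: ingest sst not found
    return True

leader   = [(1, "write", "a"), (2, "ingest", "a")]
follower = [(3, "write", "a"), (4, "ingest", "a")]  # its write RPC landed later
log = ["a"]

# Mid-import snapshots (both nodes at t=2) are inconsistent: the follower's
# snapshot predates its write RPC, yet it must replay the leader's Ingest.
assert not cluster_consistent([leader, follower], [2, 2], log)

# Any snapshot instants after the importer exited (t >= 4) are consistent.
assert cluster_consistent([leader, follower], [4, 5], log)
```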

@lance6716 (Contributor)

So RegisterTask is very critical to supporting this check. Will it return an error when keepalive fails?

@lance6716 (Contributor)

Also, a lightning task may run for multiple hours. Is it acceptable that the RPO becomes larger due to the import? Between backup and import, which has higher priority?

@YuJuncen (Contributor, Author)

> And lightning task may run for multiple hours, is it acceptable that RPO is larger due to import? Backup or import, which has higher priority?

I think, given that taking the snapshot backup lasts only a tiny time period (the CreateVolumeSnapshot request usually responds within seconds), it might be acceptable to temporarily pause importing? (Thanks to checkpoints.)
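The coordination being proposed can be sketched as a lease-based gate (names are illustrative, not the real BR/TiKV API): BR suspends ingestion for a short lease, takes the volume snapshot, and the importer simply retries until the lease expires, resuming from its checkpoint.

```python
import time

class ImportGate:
    """Toy model of suspending SST ingestion for a short lease while a
    volume snapshot is taken (hypothetical API, for illustration only)."""

    def __init__(self):
        self.suspended_until = 0.0

    def suspend(self, lease_secs):
        # BR: pause ingestion for the duration of the snapshot.
        self.suspended_until = time.monotonic() + lease_secs

    def try_ingest(self):
        remaining = self.suspended_until - time.monotonic()
        if remaining > 0:
            return f"imports are suspended for {remaining:.2f}s"
        return "ingested"

gate = ImportGate()
gate.suspend(lease_secs=0.05)            # BR: CreateVolumeSnapshot starts
assert "suspended" in gate.try_ingest()  # importer: sees busy, will retry
time.sleep(0.06)                         # snapshot finishes within seconds
assert gate.try_ingest() == "ingested"   # importer resumes from checkpoint
```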

@lance6716 (Contributor)

> > And lightning task may run for multiple hours, is it acceptable that RPO is larger due to import? Backup or import, which has higher priority?
>
> I think given taking snapshot backup lasts for a tiny time period(the CreateVolumeSnapshot request usually response within seconds), It might be acceptable to temporarily stop importing? (thanks to checkpoints)

LGTM. I think you can ask the PM to make a final decision. Maybe let SSTImporter return some error message to let lightning restart from the write API.

@YuJuncen (Contributor, Author)

> So RegisterTask is very critical to support this check. Will it return error when keepalive fails?

Unfortunately, not for now. It just prints errors and retries registering itself. I think an onError hook for registration might be useful.
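The suggested onError hook could look like this sketch (hypothetical API, not the current implementation): instead of only logging, the register surfaces keepalive failures to the caller, so the backup-side safety check can react.

```python
class TaskRegister:
    """Toy model of a task register whose keepalive failures are surfaced
    through an optional onError hook (illustrative names only)."""

    def __init__(self, on_error=None):
        self.on_error = on_error
        self.errors = []

    def keepalive(self, ok):
        if not ok:
            err = "keepalive failed"
            self.errors.append(err)       # today: just log...
            if self.on_error:
                self.on_error(err)        # ...proposed: also notify the caller
            # ...then retry registering itself, as the current code does

seen = []
reg = TaskRegister(on_error=seen.append)
reg.keepalive(ok=True)
reg.keepalive(ok=False)
assert seen == ["keepalive failed"]
```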

@YuJuncen YuJuncen added severity/major affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. labels Sep 11, 2023
@ti-chi-bot ti-chi-bot bot added may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.1 labels Sep 11, 2023
@YuJuncen YuJuncen removed may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.1 labels Sep 11, 2023
@BornChanger (Contributor) commented Sep 11, 2023

/component br

@ti-chi-bot bot commented Sep 11, 2023

@BornChanger: The label(s) component/backup-restore cannot be applied, because the repository doesn't have them.

In response to this:

/component backup-restore

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

BornChanger pushed a commit to BornChanger/tidb that referenced this issue Sep 28, 2023
ti-chi-bot bot pushed a commit that referenced this issue Oct 16, 2023
@ti-chi-bot ti-chi-bot added the affects-7.5 This bug affects the 7.5.x(LTS) versions. label Oct 23, 2023
@BornChanger (Contributor)

@YuJuncen I think we can close this issue.

@lance6716 (Contributor)

@BornChanger let me check if lightning handles this new behaviour tomorrow

@lance6716 (Contributor) commented Nov 7, 2023

The lightning side will see a "ServerIdBusy" error with the message `imports are suspended for {time_to_lease_expire:?}`, e.g. `Suspended { time_to_lease_expire: 292.97s }`, and lightning will retry the ingest later; no changes need to be made.

@lance6716 lance6716 reopened this Nov 21, 2023
@lance6716 (Contributor)

lightning sees an RPC error instead of an RPC response with error fields. Although lightning can handle it as a default error, some unnecessary retries could be skipped, and we should record this error to display it to the user.

@mittalrishabh (Contributor)

It retries from the beginning instead of from the checkpoint, which impacts the speed of ingestion. This problem is severe because a backup is taken every 30 minutes.
Even though we are talking about master here, I would assume that it exists in 6.5 as well.

guoshouyan pushed a commit to guoshouyan/tidb that referenced this issue Mar 5, 2024
@YuJuncen YuJuncen closed this as completed Mar 6, 2024