Deal with concurrent checkpointing from both old and new pageserver during timeline migration #971

LizardWizzard · 2021-12-08T12:02:11Z

Started from discussion here #874 (comment)

Two pageservers can end up writing data concurrently to the same underlying S3 storage, which is dangerous. Even if it happens to work, this is scary given the incremental format we use.

There are some open questions and some invariants we need to uphold. We shouldn't be able to overwrite the local metadata of an active timeline with something older/newer from S3. As far as I understand, the opposite is not a problem because we save metadata with each checkpoint. Currently a local overwrite shouldn't happen because we do not schedule downloads for timelines that are present locally. So this is good.
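The download rule described above can be sketched roughly as follows. This is only an illustration of the invariant, not the pageserver's code; the function name `timelines_to_download` and the use of plain strings as timeline ids are assumptions made for the example.

```python
from __future__ import annotations


def timelines_to_download(remote: set[str], local: set[str]) -> set[str]:
    """Return the remote timelines that may be scheduled for download.

    Hypothetical helper: anything already present locally is skipped, so the
    local metadata of an active timeline is never replaced by a copy from S3.
    """
    return remote - local


# "b" exists locally, so only "a" is scheduled for download.
scheduled = timelines_to_download({"a", "b"}, {"b"})
```

With this rule, the only way remote state reaches the local disk is for a timeline that has no local state yet.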

What if we have some nondeterminism in the checkpointer, so that it can create different files on two pageservers?
Will there be two checkpoints at the same LSN, or will they interleave somehow?
Even if there is no nondeterminism, what if there are two different pageserver versions that use different code to produce checkpoints? E.g. a new format version, changed layout, new files, missing files, etc.

Feel free to correct me, maybe I'm missing something

LizardWizzard · 2022-01-13T10:26:07Z

The current vision is that we should allow concurrent checkpointing without any problems. I created a test for that, but ran into unrelated errors which are currently being fixed on main. So I'll continue my investigation.

LizardWizzard · 2022-01-19T22:42:19Z

Note: while concurrent checkpointing shouldn't lead to correctness issues, we still might want to avoid it in some future scenarios where a timeline is attached to two pageservers, e.g. to spread get page requests or to support failover. Currently, concurrent checkpointing might happen during tenant migration, when the new and old pageservers are active simultaneously.

stepashka · 2022-04-08T11:01:20Z

this is waiting for #1396

LizardWizzard · 2022-04-21T10:20:00Z

Things changed and we decided to introduce etcd. The new vision is to use it in order to prevent concurrent uploads from happening.

problame · 2023-04-24T15:53:52Z

I think we can close this, given we're set to implement relocation as specified in RFC #3868

LizardWizzard · 2023-05-10T14:20:58Z

I think this is still relevant. The issue describes the problem that the RFC should solve in one way or another. In the first iteration we decided to avoid this problem by detaching before the attach, so there is no concurrent background activity from more than one pageserver at a time. The project currently takes into account only the first stage, so I'm not sure whether we should keep the issue in the project (keep it with a separate label?). WDYT @problame?

shanyp · 2023-12-26T11:08:25Z

fixed by generations
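For readers arriving later, here is a rough sketch of why generation numbers remove the overwrite hazard described at the top of this issue. It assumes that every attachment of a tenant gets a strictly increasing generation number and that this number is appended to each object key a pageserver writes. The key layout, the `-{generation:08x}` suffix, and the names `remote_key` and `latest_index` are illustrative assumptions, not the pageserver's actual implementation.

```python
from __future__ import annotations


def remote_key(base: str, generation: int) -> str:
    """Build an object key that is unique per generation (assumed layout)."""
    return f"{base}-{generation:08x}"


def latest_index(keys: list[str], base: str, my_generation: int) -> str | None:
    """Pick the key written by the highest generation that is <= ours."""
    prefix = base + "-"
    best: tuple[int, str] | None = None
    for key in keys:
        if not key.startswith(prefix):
            continue
        try:
            generation = int(key[len(prefix):], 16)
        except ValueError:
            continue  # not a generation-suffixed key
        if generation <= my_generation and (best is None or generation > best[0]):
            best = (generation, key)
    return best[1] if best else None


# Hypothetical base key; a dict stands in for the S3 bucket.
BASE = "timelines/t1/index_part.json"
bucket: dict[str, bytes] = {}

# The old pageserver (generation 1) and the new one (generation 2) both
# upload concurrently. The keys differ, so neither overwrites the other.
bucket[remote_key(BASE, 1)] = b"written by old pageserver"
bucket[remote_key(BASE, 2)] = b"written by new pageserver"

chosen = latest_index(list(bucket), BASE, my_generation=2)
```

In this sketch the stale writer can still upload, but its objects land under its own generation suffix, and a reader with a newer generation prefers the newest index it is allowed to see.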

LizardWizzard mentioned this issue Dec 8, 2021

Epic: tenant relocation between pageserver nodes #886

Closed

1 task

LizardWizzard changed the title ~~Deal with concurrent checkpointing from both old and new pageserver~~ Deal with concurrent checkpointing from both old and new pageserver during timeline migration Dec 8, 2021

LizardWizzard mentioned this issue Dec 8, 2021

Use archives for s3 checkpoints, download timelines on demand #874

Merged

LizardWizzard added the launch blocker and c/storage/pageserver (Component: storage: pageserver) labels Dec 9, 2021

SomeoneToIgnore mentioned this issue Dec 9, 2021

Epic: s3 synchronisation follow ups #977

Closed

11 tasks

stepashka added this to the Technical preview milestone Dec 13, 2021

stepashka added s3 and removed launch blocker labels Dec 13, 2021

stepashka modified the milestones: Technical preview, Limited Preview Dec 17, 2021

neondatabase-bot bot modified the milestones: Limited Preview, Technical preview Jan 24, 2022

neondatabase-bot bot modified the milestones: Technical preview, 0.6 Towards Tech Prev Mar 15, 2022

LizardWizzard mentioned this issue Apr 21, 2022

Epic: s3 in pageserver stage 2 #1556

Closed

12 tasks

neondatabase-bot bot modified the milestones: 0.7 Towards Tech Prev, 1.0 Technical preview Apr 21, 2022

stepashka modified the milestones: 1.0 Technical preview, 2022/06 Jun 17, 2022

problame closed this as not planned Apr 24, 2023

LizardWizzard reopened this May 10, 2023

shanyp closed this as completed Dec 26, 2023
