
[RFC] pageserver s3 coordination #2676

Merged 3 commits into main from dkr/pageserver-s3-coordination on Nov 2, 2022

Conversation

@LizardWizzard (Contributor)

During a discussion with @SomeoneToIgnore about the background-activity toggling PR (#2665), we decided to step back a little and explore alternatives.

The approach that avoids metadata entirely looks promising to me. I'm not really proposing to implement one of the approaches yet; rather, I wanted to brainstorm the possible options.

Rendered: docs/rfcs/020-pageserver-s3-coordination.md

@LizardWizzard LizardWizzard marked this pull request as ready for review October 21, 2022 20:51
@LizardWizzard LizardWizzard force-pushed the dkr/pageserver-s3-coordination branch from 5f43282 to 4ff5475 on October 21, 2022 20:53
@SomeoneToIgnore (Contributor) left a comment:

I lean towards the general model, where PS nodes are able to read "small" (orders of magnitude smaller than the layer size) metadata about all available layers and decide what to do with them.

For that, the broker approach seems to be the most promising to me among what's proposed. I'd try doing that, since we already have etcd and leader election implemented and tested on the SKs.

@LizardWizzard (Contributor, Author)

> For that, the broker approach seems to be the most promising to me among what's proposed.

With my current understanding, the broker approach cannot solve all of the problems mentioned. It can coordinate certain actions between pageservers, but it cannot solve the mutable-metadata problem.

@koivunej (Member) left a comment:

Just writing down a "have read" review. Good writeup of the relocation and (hot?) standby issues. I still have so many question marks around latest_gc_cutoff_lsn, and I didn't even realize that WAL ingestion wasn't deterministic, so I cannot really comment more right now.

@arssher (Contributor) commented on Oct 25, 2022:

So there are two loosely related problems -- coordination of pageserver work with S3, and remote index maintenance. The latter is only needed to deal with slow S3 listings and to update remote_consistent_lsn and latest_gc_cutoff_lsn.

Imagine for now that S3 listings are fast and we don't have index files. How should work with S3 look, taking into account multiple pageservers?

  • Obviously we should strive to have only one pageserver upload to (and clean up) S3; otherwise the work is redundant.
  • However, S3 doesn't provide a CAS API, which means we can never reliably linearize S3 operations -- the S3 side can't tell an old uploader's request from a newer uploader's request, so they can interleave. Well, we could rely on physical time to ensure uploaders don't overlap, but that's fragile (and an S3 request could already be in flight). External coordination can't improve on this. Consequently, any approach writing different contents to the same key (like the current index file) is broken (unless we operate manually).
  • Fortunately, there is generally nothing bad about concurrent uploaders -- we might get an intersection of data in layers, but redo is (or should be) able to cope with that, and data won't be corrupted.

That said, I think the following would work:

  • Clearly we need some way to determine the uploading pageserver, but without strict requirements on not intersecting with the previous uploader. Something similar to how safekeepers approach this would do -- just deterministically calculate it based on data exchanged through the broker. Just a reminder that (it might sound surprising) IMV the specialized API which etcd provides here doesn't really save effort -- it doesn't help to distinguish old and new leaders, and without that (generation numbers), leader election per se is trivial.
  • This uploading pageserver lists the remote contents and then maintains them -- uploads new layers, removes compacted and gc'ed ones. There might be some concurrent activity, but it won't do harm. If a pageserver maintains a local cache of the S3 contents, it might get out of date: 1) a layer fetch might fail because the layer was compacted by a competing pageserver -- then it should relist the S3 contents to find the newer version; 2) gc should also sometimes delete layers uploaded by others (again a relist is required), but this can be very rare.
  • Other pageservers, if they accept the stream of WAL, process it, compact and gc locally (creating another independent set of layers), but don't upload. If they evict unused layers and thus want to be able to fetch them back later, they need to relist the S3 contents. We might at some point broadcast S3 operations through the broker to speed up remote index syncing, but I'm not sure.
  • The current index file also stores remote_consistent_lsn and latest_gc_cutoff_lsn. These can be written to S3 under the (unique) uploading pageserver id; a read would collect the max() of all such values. A sketch of this and of the deterministic uploader choice follows this list.
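A minimal sketch, using hypothetical types rather than the actual broker/etcd API, of what "deterministically calculate the uploader from broker-exchanged data" and "collect the max() of per-pageserver LSN values" could look like:

```rust
use std::collections::BTreeMap;

/// Hypothetical pageserver node id; not the real pageserver id type.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct NodeId(u64);

/// Hypothetical per-pageserver data published through the broker
/// for one tenant/timeline.
struct BrokerEntry {
    /// remote_consistent_lsn advertised by this pageserver
    remote_consistent_lsn: u64,
    /// whether this node's heartbeat is still fresh
    alive: bool,
}

/// Every node runs the same pure function over the same broker snapshot,
/// so no explicit election protocol is needed. Here the smallest alive
/// node id wins; any deterministic rule would do.
fn choose_uploader(snapshot: &BTreeMap<NodeId, BrokerEntry>) -> Option<NodeId> {
    snapshot
        .iter()
        .filter(|(_, entry)| entry.alive)
        .map(|(id, _)| *id)
        .next() // BTreeMap iterates in key order, so this is the minimum
}

/// The timeline-wide remote_consistent_lsn is the max() of what each
/// pageserver persisted under its own unique key.
fn aggregate_remote_consistent_lsn(snapshot: &BTreeMap<NodeId, BrokerEntry>) -> u64 {
    snapshot
        .values()
        .map(|entry| entry.remote_consistent_lsn)
        .max()
        .unwrap_or(0)
}
```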

Getting back to slow S3 listings, the following approaches are possible:

  1. Actually do nothing, i.e. drop the index files. It would be nice to check/write down somewhere how slow S3 listing actually is (and how many layers a db might have); perhaps it is not really required now. The metadata LSNs can be put under different prefixes, as said above.
  2. Have a separate storage for the S3 index -- this seems to be a popular approach; PG would do, e.g. the control plane database. So, on each layer creation, the layer is first uploaded and then inserted into the db. On each deletion, it is first deleted from the db and then from S3. This works fine, but the obvious con is one more involved entity.
  3. We can try to stay with the current index file, but spread it out to avoid overwrites (like in the rfc's diagram example): similar to the LSNs, each pageserver writes its index under its own id prefix. Creations are obvious; if pageserver A deletes (compacts) a layer created by pageserver B, it writes a 'deleted' op for it. To list files we need to load all indexes and union them; layers with 'delete' ops are removed. Such entries can themselves be deleted once they go behind the gc horizon. A sketch of the read-side union follows this list.
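A sketch of the read side of option 3, with illustrative types rather than the existing IndexPart format: load every per-pageserver index, union the created entries, and drop anything that any index marks as deleted.

```rust
use std::collections::{HashMap, HashSet};

/// Illustrative layer metadata; a real index would also carry checksums etc.
#[derive(Clone)]
struct LayerMeta {
    file_size: u64,
}

/// Illustrative per-pageserver index file written under that node's id prefix.
struct PageserverIndex {
    /// layer file name -> metadata for layers this node uploaded
    created: HashMap<String, LayerMeta>,
    /// layer file names this node compacted/gc'ed away ('deleted' ops)
    deleted: HashSet<String>,
}

/// Union all indexes: a layer is visible iff some index created it
/// and no index has a 'deleted' op for it.
fn merge_indexes(indexes: &[PageserverIndex]) -> HashMap<String, LayerMeta> {
    let mut merged = HashMap::new();
    for index in indexes {
        for (name, meta) in &index.created {
            merged.entry(name.clone()).or_insert_with(|| meta.clone());
        }
    }
    for index in indexes {
        for name in &index.deleted {
            merged.remove(name);
        }
    }
    merged
}
```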

@LizardWizzard (Contributor, Author)

> The latter is only needed to deal with slow S3 listings and to update remote_consistent_lsn and latest_gc_cutoff_lsn.

To be more precise: currently the index file contains a snapshot of the TimelineMetadata file at the moment of upload.

Its structure is the following:

struct TimelineMetadataBodyV2 {
    disk_consistent_lsn: Lsn,
    // This is only set if we know it. We track it in memory when the page
    // server is running, but we only track the value corresponding to
    // 'last_record_lsn', not 'disk_consistent_lsn' which can lag behind by a
    // lot. We only store it in the metadata file when we flush *all* the
    // in-memory data so that 'last_record_lsn' is the same as
    // 'disk_consistent_lsn'.  That's OK, because after page server restart, as
    // soon as we reprocess at least one record, we will have a valid
    // 'prev_record_lsn' value in memory again. This is only really needed when
    // doing a clean shutdown, so that there is no more WAL beyond
    // 'disk_consistent_lsn'
    prev_record_lsn: Option<Lsn>,
    ancestor_timeline: Option<TimelineId>,
    ancestor_lsn: Lsn,
    latest_gc_cutoff_lsn: Lsn,
    initdb_lsn: Lsn,
    pg_version: u32,
}

remote_consistent_lsn is the disk_consistent_lsn from this metadata snapshot.

> Consequently, any approach writing different contents to the same key (like the current index file) is broken (unless we operate manually).

Agreed. Even with manual orchestration it is not so obvious. E.g. on startup the pageserver needs to ask the control plane whether it is still the leader (at least that is how I see it; maybe there is a workaround -- check the updated RFC for the note and the diagram).

> 1) a layer fetch might fail because the layer was compacted by a competing pageserver -- then it should relist the S3 contents to find the newer version; 2) gc should also sometimes delete layers uploaded by others (again a relist is required), but this can be very rare.

Depending on the local cache size (basically the eviction frequency), compaction can cause frequent deletions, which will cause higher latency for on-demand downloads on the follower, since it needs to relist things frequently and do that on the get_page request path.

I think coordination through broker is the way.
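To illustrate the relist cost mentioned above, here is a rough sketch of a follower's on-demand download path (hypothetical trait and error type, not the pageserver's actual remote_storage API):

```rust
/// Stand-in for the remote storage client plus its cached remote index.
trait RemoteIndex {
    /// Try to download a layer using the currently cached remote index.
    fn download(&mut self, layer: &str) -> Result<Vec<u8>, DownloadError>;
    /// Re-list S3 and refresh the cached index (the expensive step).
    fn relist(&mut self) -> Result<(), DownloadError>;
}

#[derive(Debug)]
enum DownloadError {
    /// The cached index is stale: the layer was compacted away
    /// by the uploading pageserver.
    NotFound,
    Other(String),
}

/// If the cached index is stale, the follower has to relist and retry --
/// directly on the get_page request path, which is where the extra
/// latency comes from.
fn download_with_relist<R: RemoteIndex>(
    remote: &mut R,
    layer: &str,
) -> Result<Vec<u8>, DownloadError> {
    match remote.download(layer) {
        Err(DownloadError::NotFound) => {
            remote.relist()?; // S3 LIST on the read path
            remote.download(layer)
        }
        other => other,
    }
}
```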

> Have a separate storage for the S3 index -- this seems to be a popular approach; PG would do, e.g. control plane database.

That creates a chicken-and-egg problem which would be nice to avoid. The option reminds me of the ClickHouse approach, which used to store info about partitions in ZooKeeper so that everyone knows what exists where.

> We can try to stay with the current index file, but spread it out to avoid overwrites

I like this idea because it will allow us to continue using the index file to store file sizes and checksums for layers.
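For illustration only, a per-layer entry in such a spread-out index could carry roughly the following (field names are assumptions, not the existing IndexPart layout):

```rust
/// Illustrative entry in a per-pageserver index file.
struct IndexLayerEntry {
    /// layer file name as stored in S3
    file_name: String,
    /// size in bytes, so callers don't need a HEAD request
    file_size: u64,
    /// e.g. a SHA-256 digest of the layer file contents
    checksum: [u8; 32],
    /// set when this (or another) pageserver compacted/gc'ed the layer away
    deleted: bool,
}
```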


I updated the RFC; please let me know if I missed something.

@SomeoneToIgnore (Contributor) left a comment:

Feels like everybody is fine with the selected approach; it's documented, so can we merge it?

@LizardWizzard (Contributor, Author)

Merging since no objections were raised. The next step is working on the implementation; progress is tracked here: #2739

@LizardWizzard LizardWizzard merged commit e56d11c into main Nov 2, 2022
@LizardWizzard LizardWizzard deleted the dkr/pageserver-s3-coordination branch November 2, 2022 15:15