
[RFC] pageserver s3 coordination #2676

Merged 3 commits into main from dkr/pageserver-s3-coordination on Nov 2, 2022

Conversation

@LizardWizzard (Contributor)

During a discussion with @SomeoneToIgnore about the background-activity toggling PR (#2665), we decided to step back a little and explore alternatives.

The approach that avoids metadata entirely looks promising to me. I'm not really proposing to implement one of the approaches yet; rather, I wanted to brainstorm the possible options.

Rendered: docs/rfcs/020-pageserver-s3-coordination.md

@LizardWizzard LizardWizzard marked this pull request as ready for review October 21, 2022 20:51
@LizardWizzard LizardWizzard force-pushed the dkr/pageserver-s3-coordination branch from 5f43282 to 4ff5475 on October 21, 2022 20:53
@SomeoneToIgnore (Contributor) left a comment:

I lean towards the general model, where PS nodes are able to read "small" (orders of magnitude smaller than the layer size) metadata about all available layers and decide what to do with them.

For that, the broker approach seems to be the most promising to me among what's proposed. I'd try doing that, since we already have etcd and leader election implemented and tested on the SKs.

@LizardWizzard (Contributor, Author)

> For that, the broker approach seems to be the most promising to me among what's proposed.

With my current understanding, the broker approach cannot solve all of the problems mentioned. It can coordinate certain actions between pageservers, but it cannot solve the mutable-metadata problem.

@koivunej (Member) left a comment:

Just writing down a "have read" review. Good writeup of the relocation and (hot?) standby issues. I still have so many question marks around latest_gc_cutoff_lsn, and I didn't even realize that WAL ingestion wasn't deterministic, so I cannot really comment more right now.

@arssher (Contributor) commented on Oct 25, 2022:

So there are two loosely related problems -- coordination of pageserver work with S3, and remote index maintenance. The latter is only needed to deal with slow S3 listings and to update remote_consistent_lsn and latest_gc_cutoff_lsn.

Imagine for now that S3 listings are fast and we don't have index files. How should work with S3 look, taking into account multiple pageservers?

  • Obviously we should strive to have only one pageserver upload to (and clean up) S3; otherwise the work is redundant.
  • However, S3 doesn't provide a CAS API, which means we can never reliably linearize S3 operations -- the S3 side can't tell an old uploader's request from a newer uploader's request, so they can interleave. Well, we could rely on physical time to ensure uploaders don't overlap, but that's fragile (and an S3 request could already be in flight). External coordination can't improve on this. Consequently, any approach writing different contents to the same key (like the current index file) is broken (unless we operate manually).
  • Fortunately, there is generally nothing bad about concurrent uploaders -- we might get an intersection of data in layers, but redo is (or should be) able to cope with that, and data won't be corrupted.

That said, I think the following would work:

  • Clearly we need some way to determine the uploading pageserver, but without strict requirements on not intersecting with the previous uploader. Something similar to how safekeepers approach this would do -- just deterministically calculate it based on data exchanged through the broker. Just a reminder that (it might sound surprising) IMV the specialized API which etcd provides here doesn't really save effort -- it doesn't help to distinguish old and new leaders, and without that (generation numbers), leader election per se is trivial.
  • This uploading pageserver lists the remote contents and then maintains them -- uploads new layers, removes compacted and gc'ed ones. There might be some concurrent activity, but it won't do harm. If a pageserver maintains a local cache of the S3 contents, it might get out of date: 1) a layer fetch might fail because the layer was compacted by a competing pageserver -- then it should relist the S3 contents to find the newer version; 2) gc should also sometimes delete layers uploaded by others (again a relist is required), but this can be very rare.
  • Other pageservers, if they accept the stream of WAL, process it, compact and gc locally (creating another independent set of layers), but don't upload. If they evict unused layers and thus want to be able to fetch them back later, they need to relist the S3 contents. We might at some point broadcast S3 operations through the broker to speed up remote index syncing, but I'm not sure.
  • The current index file also stores remote_consistent_lsn and latest_gc_cutoff_lsn. These can be written to S3 under the (unique) uploading pageserver id; a read would collect the max() of all such values. A sketch of this and of the deterministic uploader choice follows this list.
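A minimal sketch, using hypothetical types rather than the actual broker/etcd API, of what "deterministically calculate the uploader from broker-exchanged data" and "collect the max() of per-pageserver LSN values" could look like:

```rust
use std::collections::BTreeMap;

/// Hypothetical pageserver node id; not the real pageserver id type.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct NodeId(u64);

/// Hypothetical per-pageserver data published through the broker
/// for one tenant/timeline.
struct BrokerEntry {
    /// remote_consistent_lsn advertised by this pageserver
    remote_consistent_lsn: u64,
    /// whether this node's heartbeat is still fresh
    alive: bool,
}

/// Every node runs the same pure function over the same broker snapshot,
/// so no explicit election protocol is needed. Here the smallest alive
/// node id wins; any deterministic rule would do.
fn choose_uploader(snapshot: &BTreeMap<NodeId, BrokerEntry>) -> Option<NodeId> {
    snapshot
        .iter()
        .filter(|(_, entry)| entry.alive)
        .map(|(id, _)| *id)
        .next() // BTreeMap iterates in key order, so this is the minimum
}

/// The timeline-wide remote_consistent_lsn is the max() of what each
/// pageserver persisted under its own unique key.
fn aggregate_remote_consistent_lsn(snapshot: &BTreeMap<NodeId, BrokerEntry>) -> u64 {
    snapshot
        .values()
        .map(|entry| entry.remote_consistent_lsn)
        .max()
        .unwrap_or(0)
}
```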

Getting back to slow S3 listings, the following approaches are possible:

  1. Actually do nothing, i.e. drop the index files. It would be nice to check/write down somewhere how slow S3 listing actually is (and how many layers a db might have); perhaps it is not really required now. The metadata LSNs can be put under different prefixes, as said above.
  2. Have a separate storage for the S3 index -- this seems to be a popular approach; PG would do, e.g. the control plane database. So, on each layer creation, the layer is first uploaded and then inserted into the db. On each deletion, it is first deleted from the db and then from S3. This works fine, but the obvious con is one more involved entity.
  3. We can try to stay with the current index file, but spread it out to avoid overwrites (like in the rfc's diagram example): similar to the LSNs, each pageserver writes its index under its own id prefix. Creations are obvious; if pageserver A deletes (compacts) a layer created by pageserver B, it writes a 'deleted' op for it. To list files we need to load all indexes and union them; layers with 'delete' ops are removed. Such entries can themselves be deleted once they go behind the gc horizon. A sketch of the read-side union follows this list.
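A sketch of the read side of option 3, with illustrative types rather than the existing IndexPart format: load every per-pageserver index, union the created entries, and drop anything that any index marks as deleted.

```rust
use std::collections::{HashMap, HashSet};

/// Illustrative layer metadata; a real index would also carry checksums etc.
#[derive(Clone)]
struct LayerMeta {
    file_size: u64,
}

/// Illustrative per-pageserver index file written under that node's id prefix.
struct PageserverIndex {
    /// layer file name -> metadata for layers this node uploaded
    created: HashMap<String, LayerMeta>,
    /// layer file names this node compacted/gc'ed away ('deleted' ops)
    deleted: HashSet<String>,
}

/// Union all indexes: a layer is visible iff some index created it
/// and no index has a 'deleted' op for it.
fn merge_indexes(indexes: &[PageserverIndex]) -> HashMap<String, LayerMeta> {
    let mut merged = HashMap::new();
    for index in indexes {
        for (name, meta) in &index.created {
            merged.entry(name.clone()).or_insert_with(|| meta.clone());
        }
    }
    for index in indexes {
        for name in &index.deleted {
            merged.remove(name);
        }
    }
    merged
}
```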

@LizardWizzard (Contributor, Author)

> The latter is only needed to deal with slow S3 listings and to update remote_consistent_lsn and latest_gc_cutoff_lsn.

To be more precise: currently the index file contains a snapshot of the TimelineMetadata file at the moment of upload.

Its structure is the following:

struct TimelineMetadataBodyV2 {
    disk_consistent_lsn: Lsn,
    // This is only set if we know it. We track it in memory when the page
    // server is running, but we only track the value corresponding to
    // 'last_record_lsn', not 'disk_consistent_lsn' which can lag behind by a
    // lot. We only store it in the metadata file when we flush *all* the
    // in-memory data so that 'last_record_lsn' is the same as
    // 'disk_consistent_lsn'.  That's OK, because after page server restart, as
    // soon as we reprocess at least one record, we will have a valid
    // 'prev_record_lsn' value in memory again. This is only really needed when
    // doing a clean shutdown, so that there is no more WAL beyond
    // 'disk_consistent_lsn'
    prev_record_lsn: Option<Lsn>,
    ancestor_timeline: Option<TimelineId>,
    ancestor_lsn: Lsn,
    latest_gc_cutoff_lsn: Lsn,
    initdb_lsn: Lsn,
    pg_version: u32,
}

remote_consistent_lsn is the disk_consistent_lsn from this metadata snapshot.

> Consequently, any approach writing different contents to the same key (like the current index file) is broken (unless we operate manually).

Agreed. Even with manual orchestration it is not so obvious. E.g. on startup the pageserver needs to ask the control plane whether it is still the leader (at least that is how I see it; maybe there is a workaround -- check the updated RFC for the note and the diagram).

> 1) a layer fetch might fail because the layer was compacted by a competing pageserver -- then it should relist the S3 contents to find the newer version; 2) gc should also sometimes delete layers uploaded by others (again a relist is required), but this can be very rare.

Depending on the local cache size (basically the eviction frequency), compaction can cause frequent deletions, which will cause higher latency for on-demand downloads on the follower, since it needs to relist things frequently and do that on the get_page request path.

I think coordination through broker is the way.
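To illustrate the relist cost mentioned above, here is a rough sketch of a follower's on-demand download path (hypothetical trait and error type, not the pageserver's actual remote_storage API):

```rust
/// Stand-in for the remote storage client plus its cached remote index.
trait RemoteIndex {
    /// Try to download a layer using the currently cached remote index.
    fn download(&mut self, layer: &str) -> Result<Vec<u8>, DownloadError>;
    /// Re-list S3 and refresh the cached index (the expensive step).
    fn relist(&mut self) -> Result<(), DownloadError>;
}

#[derive(Debug)]
enum DownloadError {
    /// The cached index is stale: the layer was compacted away
    /// by the uploading pageserver.
    NotFound,
    Other(String),
}

/// If the cached index is stale, the follower has to relist and retry --
/// directly on the get_page request path, which is where the extra
/// latency comes from.
fn download_with_relist<R: RemoteIndex>(
    remote: &mut R,
    layer: &str,
) -> Result<Vec<u8>, DownloadError> {
    match remote.download(layer) {
        Err(DownloadError::NotFound) => {
            remote.relist()?; // S3 LIST on the read path
            remote.download(layer)
        }
        other => other,
    }
}
```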

> Have a separate storage for the S3 index -- this seems to be a popular approach; PG would do, e.g. control plane database.

That creates a chicken-and-egg problem which would be nice to avoid. The option reminds me of the ClickHouse approach, which used to store info about partitions in ZooKeeper so that everyone knows what exists where.

> We can try to stay with the current index file, but spread it out to avoid overwrites

I like this idea because it will allow us to continue using the index file to store file sizes and checksums for layers.
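For illustration only, a per-layer entry in such a spread-out index could carry roughly the following (field names are assumptions, not the existing IndexPart layout):

```rust
/// Illustrative entry in a per-pageserver index file.
struct IndexLayerEntry {
    /// layer file name as stored in S3
    file_name: String,
    /// size in bytes, so callers don't need a HEAD request
    file_size: u64,
    /// e.g. a SHA-256 digest of the layer file contents
    checksum: [u8; 32],
    /// set when this (or another) pageserver compacted/gc'ed the layer away
    deleted: bool,
}
```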


I updated the RFC; please let me know if I missed something.

@SomeoneToIgnore (Contributor) left a comment:

Feels like everybody is fine with the selected approach; it's documented, so can we merge it?

@LizardWizzard (Contributor, Author)

Merging since no objections were raised. The next step is working on the implementation; progress is tracked here: #2739

@LizardWizzard LizardWizzard merged commit e56d11c into main Nov 2, 2022
@LizardWizzard LizardWizzard deleted the dkr/pageserver-s3-coordination branch November 2, 2022 15:15