[RFC] pageserver s3 coordination #2676
Conversation
I lean towards the general model, where PS nodes are able to read "small" (orders of magnitude smaller than the layer size) metadata about all available layers and decide what to do with them.
For that, the broker approach seems the most promising to me among what's proposed.
I'd try doing that, since we already have etcd and leader election implemented and tested on the SKs.
With my current understanding, the broker approach cannot solve all of the mentioned problems. It can coordinate certain actions between pageservers, but it cannot solve the mutable metadata problem.
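For concreteness, here is a minimal sketch (my own, not part of the RFC; all names are hypothetical) of the kind of "small" per-layer record a pageserver could publish to the broker / etcd, so that other nodes learn what layers exist without listing s3:

type Lsn = u64; // stub for the sketch; the real crate has its own Lsn type

struct LayerAnnouncement {
    tenant_id: String,
    timeline_id: String,
    /// Layer file name as stored in s3.
    layer_file_name: String,
    /// Size in bytes, useful for planning on-demand downloads.
    file_size: u64,
    /// LSN range covered by the layer.
    lsn_start: Lsn,
    lsn_end: Lsn,
    /// Which pageserver uploaded it (the current "leader" for the tenant).
    uploaded_by_node: String,
}

Records like this cover the immutable per-layer facts; the mutable timeline-level values (latest_gc_cutoff_lsn and friends) are exactly the part the broker alone does not solve, per the reply above.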
Just writing down a "have read" review. Good writeup of the relocation and (hot?) standby issues. I still have so many question marks around latest_gc_cutoff_lsn, and I didn't even realize that WAL ingestion wasn't deterministic, so I cannot really comment more now.
So there are two loosely related problems -- coordination of pageserver work with s3, and remote index maintenance. The latter is only needed to deal with slow s3 listings and to update remote_consistent_lsn and latest_gc_cutoff_lsn. Imagine for now that s3 listings are fast and we don't have index files. What should work with s3 look like, taking multiple pageservers into account?
That said, I think the following would work:
Getting back to slow s3 listing, the following approaches are possible:
To be more precise: currently the index file contains a snapshot of the timeline metadata. Its structure is the following:

struct TimelineMetadataBodyV2 {
    disk_consistent_lsn: Lsn,
    // This is only set if we know it. We track it in memory when the page
    // server is running, but we only track the value corresponding to
    // 'last_record_lsn', not 'disk_consistent_lsn' which can lag behind by a
    // lot. We only store it in the metadata file when we flush *all* the
    // in-memory data so that 'last_record_lsn' is the same as
    // 'disk_consistent_lsn'. That's OK, because after page server restart, as
    // soon as we reprocess at least one record, we will have a valid
    // 'prev_record_lsn' value in memory again. This is only really needed when
    // doing a clean shutdown, so that there is no more WAL beyond
    // 'disk_consistent_lsn'.
    prev_record_lsn: Option<Lsn>,
    ancestor_timeline: Option<TimelineId>,
    ancestor_lsn: Lsn,
    latest_gc_cutoff_lsn: Lsn,
    initdb_lsn: Lsn,
    pg_version: u32,
}
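To illustrate the mutable-metadata aspect (my sketch, not part of the RFC): disk_consistent_lsn and latest_gc_cutoff_lsn only move forward on the running pageserver, so whoever writes this file to s3 must make sure an older copy never overwrites a newer one. A hypothetical check, assuming the struct above and the crate's Lsn ordering:

// Hedged sketch, not actual pageserver code: only allow replacing the remote
// metadata copy if the consistency horizon and GC cutoff do not move backwards.
fn may_overwrite_remote(local: &TimelineMetadataBodyV2, remote: &TimelineMetadataBodyV2) -> bool {
    local.disk_consistent_lsn >= remote.disk_consistent_lsn
        && local.latest_gc_cutoff_lsn >= remote.latest_gc_cutoff_lsn
}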
Agreed. Even with manual orchestration it is not so obvious. E.g. on startup the pageserver needs to ask the control plane whether it's still the leader (at least that is how I see it; maybe there is a workaround, check the updated RFC for the note + diagram).
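A rough sketch of what that startup check could look like (the ControlPlaneClient trait and its method are assumptions made up for illustration, not a real API):

type TenantId = String; // stub for the sketch

trait ControlPlaneClient {
    /// Returns the id of the pageserver the tenant is currently attached to.
    fn current_attachment(&self, tenant: &TenantId) -> Result<String, String>;
}

/// Only the pageserver that is still the "leader" (attached node) for the
/// tenant resumes uploads, GC and compaction; everyone else stays passive.
fn may_resume_background_tasks(
    cp: &dyn ControlPlaneClient,
    my_node_id: &str,
    tenant: &TenantId,
) -> Result<bool, String> {
    let owner = cp.current_attachment(tenant)?;
    Ok(owner == my_node_id)
}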
Depending on the local cache size (basically the eviction frequency), compaction can cause frequent deletions, which will cause higher latency for on-demand downloads on the follower since it needs to relist things frequently and do that on the …
I think coordination through the broker is the way.
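To make the broker idea concrete, a follower could consume addition/deletion events from the leader instead of relisting s3 after every compaction. A hypothetical event type and follower-side handler (my sketch, not the RFC's wire format):

use std::collections::HashMap;

// Events the leader could publish to the broker after uploading or deleting
// layer files; names are illustrative only.
enum RemoteLayerEvent {
    /// A new layer file was uploaded by the leader.
    Added { layer_file_name: String, file_size: u64 },
    /// A layer file was deleted, e.g. replaced by compaction output.
    Deleted { layer_file_name: String },
}

/// Follower-side view of what exists in s3 for one timeline: file name -> size.
fn apply_event(remote_layers: &mut HashMap<String, u64>, event: RemoteLayerEvent) {
    match event {
        RemoteLayerEvent::Added { layer_file_name, file_size } => {
            remote_layers.insert(layer_file_name, file_size);
        }
        RemoteLayerEvent::Deleted { layer_file_name } => {
            remote_layers.remove(&layer_file_name);
        }
    }
}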
That creates a chicken-or-egg problem, which would be nice to avoid. The options remind me of the ClickHouse approach, which used to store info about partitions in ZooKeeper so everyone knows what exists where.
I like this idea because it will allow us to continue using the index file to store file sizes and checksums for layers.
I updated the RFC, please let me know if I missed something.
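For illustration only (field and type names are made up, not the actual index_part format), the index file could then keep per-layer sizes and checksums next to the timeline metadata shown earlier:

use std::collections::HashMap;

struct LayerFileMetadata {
    file_size: u64,
    /// E.g. a CRC32C or SHA-256 of the layer file contents.
    checksum: String,
}

struct IndexPart {
    /// Layer file name -> its size and checksum.
    layer_metadata: HashMap<String, LayerFileMetadata>,
    /// Serialized timeline metadata (the TimelineMetadataBodyV2 shown above).
    timeline_metadata_bytes: Vec<u8>,
}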
Feels like everybody is fine with the selected approach; it's documented, so can we merge it?
Merging since no objections were raised. The next step is working on the implementation. Progress is tracked here: #2739
During a discussion with @SomeoneToIgnore about the PR for toggling background activity (#2665), we decided to step back a little bit and explore alternatives.
The approach of avoiding metadata entirely looks promising to me.
I'm not really proposing to implement one of the approaches here; rather, I wanted to brainstorm over the possible options.
Rendered