rfc: the state of pageserver tenant relocation #3868
The RFC clearly promotes the simplistic approach, but I'm curious whether it is possible to make things work using the broker instead of console commands, and to implement the simplistic approach through that.
One idea: in the current model, we don't keep a tenant attached to multiple PS nodes for long; that state is not normal, and we use various calls to suppress the S3 writes, for example.
Since we still target the broker-based coordinated approach, can we use it now for the simplistic approach?
If PS nodes are able to detect through the broker that a certain tenant is attached on multiple nodes, they can stop the background operations and S3 writes that may lead to discrepancies, and only resume them once a single PS node is left with this tenant attached.
That would save us multiple extra console calls for stopping the tasks and allow us to transition towards the coordination-based approach more gradually.
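To make the idea concrete, here is a minimal sketch of the bookkeeping a PS node could do, assuming the broker delivers some form of attach/detach notifications per tenant. The `AttachmentTracker` class, its method names, and the node and tenant ids are all hypothetical and not part of the pageserver or broker API; the sketch only shows the "pause while attached on several nodes, resume when one is left" rule.

```python
class AttachmentTracker:
    """Tracks which PS nodes report a tenant as attached.

    Hypothetical helper: it assumes attach/detach events arrive via the broker.
    """

    def __init__(self, self_node_id):
        self.self_node_id = self_node_id
        self._attached = {}  # tenant_id -> set of node ids with the tenant attached

    def on_attach(self, tenant_id, node_id):
        # Record that node_id announced the tenant as attached.
        self._attached.setdefault(tenant_id, set()).add(node_id)

    def on_detach(self, tenant_id, node_id):
        # Record that node_id no longer has the tenant attached.
        self._attached.get(tenant_id, set()).discard(node_id)

    def background_work_allowed(self, tenant_id):
        # Background operations and S3 writes run only when this node
        # is the single node that has the tenant attached.
        return self._attached.get(tenant_id, set()) == {self.self_node_id}


old_node = AttachmentTracker("ps-1")
old_node.on_attach("tenant-a", "ps-1")
before = old_node.background_work_allowed("tenant-a")   # only ps-1 attached
old_node.on_attach("tenant-a", "ps-2")
during = old_node.background_work_allowed("tenant-a")   # attached on two nodes
old_node.on_detach("tenant-a", "ps-2")
after = old_node.background_work_allowed("tenant-a")    # back to a single node
```

With this rule no explicit console call is needed to stop or restart the tasks; each node derives the decision from the attachment set it observes.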
Very much in favor of the simplest approach, as the current plan is to make relocation a human-initiated "back-office" type operation.
Let's revisit east-west traffic between pageservers when we get to sharding.
IMO this is the distinction between the simplistic approach and the more complicated ones. How do you see it? Theoretically we can move the relocation logic to the pageserver and let the pageserver be the oracle that manages things, but it makes certain things more complicated, because 1) we can lose the pageserver, and 2) changes need to be made in the control plane database, so the pageserver will need to go to it to change the pageserver id and to schedule the availability check. Actually, I like the idea from a testing point of view. We can mock the availability check and the call to change the pageserver id in our local control plane. So step two can be to move in this direction. I think this is a direction that is worth exploring. Thoughts?
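As a rough illustration of the testing angle, the sketch below mocks the two control plane interactions mentioned above: changing the pageserver id and running the availability check. `MockControlPlane`, `finish_relocation`, and every method name here are hypothetical stand-ins, not the real control plane API.

```python
class MockControlPlane:
    """Hypothetical in-memory stand-in for the control plane in local tests."""

    def __init__(self, tenant_to_pageserver, available=True):
        self.tenant_to_pageserver = dict(tenant_to_pageserver)
        self.available = available
        self.calls = []  # recorded calls, so a test can assert on their order

    def set_pageserver_id(self, tenant_id, pageserver_id):
        self.calls.append(("set_pageserver_id", tenant_id, pageserver_id))
        self.tenant_to_pageserver[tenant_id] = pageserver_id

    def check_availability(self, tenant_id):
        self.calls.append(("check_availability", tenant_id))
        return self.available


def finish_relocation(control_plane, tenant_id, target_pageserver_id):
    # Assumed final steps of a pageserver-driven relocation: point the tenant
    # at the target pageserver, then run the availability check.
    control_plane.set_pageserver_id(tenant_id, target_pageserver_id)
    return control_plane.check_availability(tenant_id)


cp = MockControlPlane({"tenant-a": "ps-1"})
ok = finish_relocation(cp, "tenant-a", "ps-2")
```

A test can then inspect `cp.calls` and `cp.tenant_to_pageserver` without any real control plane running.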
Thinking about that a bit more, I think the main problem will be handling possible pageserver outages, so that another pageserver can pick the process up and finish it. For the console this problem is already solved by operations management. What could this look like?
I think the relocation could be kept as simplistic as it gets, but the background process handling could be done better than by handling HTTP requests. I had hoped that syncing over the broker would not be much harder than making calls from the console.
I'm not sure what this could look like; can you elaborate? Maybe write down the possible sequence? (It can be high level and imprecise.)
ba492b5 to 0a1eba7
992 tests run: 944 passed, 0 failed, 48 skipped (full report for 45ff311). Flaky tests: PostgreSQL 15 (release build).
0a1eba7 to 45ff311