rfc: the state of pageserver tenant relocation #3868

Merged
merged 2 commits into from
May 19, 2023

LizardWizzard
Contributor

@SomeoneToIgnore SomeoneToIgnore left a comment

The RFC clearly promotes the simplistic approach, but I'm generally curious: is it possible to make things work using the broker instead of console commands, and to implement the simplistic approach through that?

One idea: in the current model, we don't have multiple PS nodes with the same tenant attached for long. That state is not normal, and we use various calls to suppress the S3 writes, for example.

Since we still target the broker-based coordinated approach, can we use it now for the simplistic approach?
If PS nodes are able to detect through the broker that a certain tenant is attached on multiple nodes, they can stop the background operations and S3 writes that may lead to discrepancies, and resume them only once a single PS node is left with this tenant attached.
That would save us multiple extra console calls for stopping the tasks and allow us to transition towards the coordination-based approach more gradually.
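
A minimal sketch of the gating rule described above, assuming a hypothetical broker feed that reports the full set of pageserver nodes that currently have a tenant attached. None of the names here (`AttachmentView`, `on_broker_update`, `may_write_to_s3`) come from the pageserver or broker code; they only illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class AttachmentView:
    """One pageserver's local view of tenant attachments (hypothetical)."""

    node_id: str
    # tenant_id -> node ids that the broker last reported as holding the tenant
    attached: Dict[str, Set[str]] = field(default_factory=dict)

    def on_broker_update(self, tenant_id: str, node_ids: Set[str]) -> None:
        # Assumed message shape: the complete set of nodes holding this tenant.
        self.attached[tenant_id] = set(node_ids)

    def may_write_to_s3(self, tenant_id: str) -> bool:
        # Background work and S3 writes are allowed only while this node is
        # the sole holder. An unknown tenant is treated as "not allowed".
        return self.attached.get(tenant_id, set()) == {self.node_id}
```

With this rule, a node that sees `{"ps-1", "ps-2"}` for a tenant pauses its writes, and resumes once the set shrinks back to its own id. The sketch ignores delayed or lost broker messages, which a real design would have to account for.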

Contributor

@problame problame left a comment

Very much in favor of the simplest approach, as the current plan is to make relocation a human-initiated "back-office" type operation.

Let's revisit east-west traffic between pageservers when we get to sharding.

@LizardWizzard
Contributor Author

> The RFC clearly promotes the simplistic approach, but I'm generally curious: is it possible to make things work using the broker instead of console commands, and to implement the simplistic approach through that?

IMO this is the distinction between the simplistic approach and the more complicated ones. How do you see it?

Theoretically, we could move the relocation logic into the pageserver and let the pageserver be the oracle that manages things, but that makes certain things more complicated, because 1) we can lose the pageserver, and 2) changes need to be made in the control plane database, so the pageserver would need to contact it to change the pageserver id and to schedule the availability check.

Actually, I like the idea from a testing point of view. We can mock the availability check and the call that changes the pageserver id in our local control plane. So step two can be to move in this direction. I think this is a direction worth exploring. Thoughts?

@LizardWizzard
Contributor Author

Thinking about this a bit more, I think the main problem will be handling possible pageserver outages, so that another pageserver can pick up the relocation and finish it. For the console, this problem is already solved by operations management.

What could this look like?
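
One way to picture the outage concern: if relocation progress is recorded step by step in a durable store, any worker can resume from the last completed step. The sketch below is only an illustration under stated assumptions. The step names, the `run_relocation` helper, and the plain dict standing in for a durable store are all hypothetical; they are not taken from the RFC or from the console's operations management.

```python
# Hypothetical step names, loosely based on the actions mentioned in this thread.
STEPS = [
    "attach_on_target",
    "switch_pageserver_id",
    "availability_check",
    "detach_from_source",
]


def run_relocation(store, tenant_id, execute, worker):
    """Run the steps that are still pending for `tenant_id`.

    `store` maps tenant_id -> number of completed steps and stands in for a
    durable store. `execute(step, worker)` performs one step and may raise
    if the worker fails. Returns True once every step has completed.
    """
    done = store.get(tenant_id, 0)
    for step in STEPS[done:]:
        execute(step, worker)
        done += 1
        # Record progress after each step so another worker can resume here.
        store[tenant_id] = done
    return done == len(STEPS)
```

If a worker fails after executing a step but before recording it, the next worker repeats that step, so each step would need to be idempotent.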

@SomeoneToIgnore
Contributor

SomeoneToIgnore commented Mar 23, 2023

I think the relocation could be kept as simplistic as it gets, but the background process handling could be done better than by handling HTTP requests. I hoped that syncing over the broker would not be much harder than making calls from the console.

@LizardWizzard
Contributor Author

> I think the relocation could be kept as simplistic as it gets, but the background process handling could be done better than by handling HTTP requests. I hoped that syncing over the broker would not be much harder than making calls from the console.

I'm not sure what this could look like. Can you elaborate? Maybe write down a possible sequence? (It can be high-level and imprecise.)

@github-actions

github-actions bot commented May 10, 2023

992 tests run: 944 passed, 0 failed, 48 skipped (full report for 45ff311)


Flaky tests

PostgreSQL 15 (release build)

The comment gets automatically updated with the latest test results ♻️

@LizardWizzard LizardWizzard force-pushed the dkr/the-state-of-ps-relocation-rfc branch from 0a1eba7 to 45ff311 Compare May 16, 2023 20:52
@LizardWizzard LizardWizzard merged commit 7529ee2 into main May 19, 2023
@LizardWizzard LizardWizzard deleted the dkr/the-state-of-ps-relocation-rfc branch May 19, 2023 11:35
4 participants