
Metas -> workers configuration propagation #91

Closed
slinkydeveloper opened this issue Feb 15, 2023 · 5 comments

slinkydeveloper commented Feb 15, 2023

We need to refine the mechanism for metas to propagate service configurations to workers.

slinkydeveloper added the needs-refinement label on Feb 15, 2023
slinkydeveloper commented:

I'm gonna expand this issue with the details about the two proposals:

  • eagerly pull configuration when service not known
  • lazily push configuration to workers

slinkydeveloper commented Mar 2, 2023

01/03/23 discussion

There are essentially two sub-problems where the metas -> workers configuration propagation mechanism has user-visible effects:

  • A partition processor (PP) communicating with a PP living in another worker process
  • Ingresses

How the propagation physically happens, be it pull or push, over TCP/HTTP/gRPC, is just an implementation detail, hidden behind some abstraction such as LocalMeta, NodeMeta, or something similar.
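
As a rough illustration of what such an abstraction could expose (the trait name, methods, and types below are hypothetical, not the actual Restate interface):

```rust
// Hypothetical sketch of the LocalMeta/NodeMeta abstraction: callers only see
// the locally available configuration, never how it was propagated.

/// Monotonically increasing revision of the global service configuration.
pub type ConfigurationRevision = u64;

/// Placeholder for whatever per-service configuration the components need
/// (key extraction instructions, Protobuf descriptors, etc.).
pub struct ServiceConfiguration;

/// Local view of the metadata, independent of the propagation mechanism
/// (pull or push, TCP/HTTP/gRPC).
pub trait LocalMeta {
    /// Revision of the configuration currently available in this process.
    fn current_revision(&self) -> ConfigurationRevision;

    /// Configuration of a service, if known at the current revision.
    fn service_configuration(&self, service_name: &str) -> Option<ServiceConfiguration>;
}
```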

Worker-to-worker

  • Every cross-worker communication is tagged with a "global configuration revision" number.
  • Every time a worker receives a message from a partition processor located in another worker, it checks this global revision number. If the local revision number is lower, it waits until an updated configuration is available (see the sketch after this list). This specific "wait" point can be implemented in a number of places, be it the network layer or the partition processor. At the moment we think the partition processor would be the right place, but it is also conceivable to have a third new component for that. In any case, it makes sense that we don't try to continue processing any additional messages from this worker with a newer global configuration revision than ours, given that most of those messages will likely require the new global configuration revision.
  • From this point onward, every component that needs the configuration can be synchronous and assume the configuration is available in the local process, accessible through some ad-hoc interfaces provided by the LocalMeta/NodeMeta.
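
A minimal sketch of that wait point, assuming the local revision is published through a tokio watch channel updated by LocalMeta/NodeMeta; all names are illustrative, not the actual implementation:

```rust
// Minimal sketch of the "wait for a newer configuration" point. Assumes the
// local revision is published over a tokio watch channel; names are illustrative.
use tokio::sync::watch;

/// Revision number tagged on every cross-worker message.
pub type ConfigurationRevision = u64;

/// A message coming from a partition processor in another worker process.
pub struct CrossWorkerMessage {
    /// Global configuration revision the sending worker was at.
    pub configuration_revision: ConfigurationRevision,
    pub payload: Vec<u8>,
}

/// Suspend until the local configuration has caught up with the revision
/// carried by the message; afterwards the partition processor can assume the
/// configuration is available in the local process.
pub async fn await_configuration_alignment(
    local_revision: &mut watch::Receiver<ConfigurationRevision>,
    msg: &CrossWorkerMessage,
) {
    loop {
        // borrow_and_update marks the current value as seen, so changed() below
        // waits for the next revision bump instead of returning immediately.
        if *local_revision.borrow_and_update() >= msg.configuration_revision {
            return;
        }
        local_revision
            .changed()
            .await
            .expect("configuration revision publisher dropped");
    }
}
```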

Ingress

The ingress needs the service configuration for:

  • Extracting the key
  • Transcode JSON to Protobuf
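
For illustration, these two lookups could sit behind an interface along the following lines; the names and error variants are made up for the example:

```rust
// Illustrative only: the two schema lookups the ingress needs. Names, types and
// error variants are hypothetical, not the actual Restate interface.

/// Opaque key identifying the target service instance.
pub struct ServiceKey(pub Vec<u8>);

/// Failures the ingress has to map to an HTTP response.
pub enum SchemaError {
    /// Service or method unknown at the current configuration revision (-> 404).
    NotFound,
    /// The JSON payload does not match the expected Protobuf schema (-> 400).
    InvalidPayload(String),
}

/// What the ingress asks of the locally available service configuration.
pub trait IngressSchemas {
    /// Extract the key of the target service instance from the request payload.
    fn extract_key(
        &self,
        service: &str,
        method: &str,
        payload: &[u8],
    ) -> Result<ServiceKey, SchemaError>;

    /// Transcode a JSON body into the Protobuf request expected by the method.
    fn transcode_json_to_protobuf(
        &self,
        service: &str,
        method: &str,
        json: &[u8],
    ) -> Result<Vec<u8>, SchemaError>;
}
```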

There are essentially two solutions here, each with its own pros and cons (the second bullet below is a variant of the first):

  • When the service cannot be found, or when the transcoding json to protobuf fails, wait to receive the next global configuration revision.
    • Pro: better user experience, as the user will never get a 404 or a transcoding failure after registering a service.
    • Cons: The blocking behaviour makes us susceptible to denial of service attacks, as one can easily fill the ingress global concurrency semaphore by sending empty requests with random service names, or alternatively correct service and method names, but wrong payloads.
  • (A slightly different variant of the above case) When the service cannot be found, or when the transcoding json to protobuf fails, ask the meta if there is a new service revision for that particular service.
    • Pro: Same as above, but simpler to implement (?)
    • Cons: More susceptible to denial of service attacks, and requires some implementation complexity (e.g. batching) to avoid overloading metas as well
  • Never wait on a configuration update, simply fail with the appropriate error (404, 400, etc.). When registering a service with the meta, optimistically try to respond to the registration only after every known worker has received the propagated configuration.
    • Pro: Fail fast, doesn't suffer from denial of service attacks
    • Cons: There are some cases, such as when workers are suffering a network partition, where the user might get a spurious failure (e.g. a 404 for a recently registered service)

It is conceivable that we support both solutions, and make this behaviour configurable.
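
If we do end up supporting both, the knob could be as simple as an enum in the ingress options; the option name and variants below are purely illustrative:

```rust
// Hypothetical configuration knob for the ingress behaviour described above;
// the name and variants are illustrative, not actual Restate options.
use serde::Deserialize;

/// How the ingress reacts when a service/method is unknown or transcoding fails.
#[derive(Debug, Clone, Copy, Deserialize)]
#[serde(rename_all = "kebab-case")]
pub enum UnknownServicePolicy {
    /// Block the request until a newer configuration revision arrives
    /// (better UX, but exposes the concurrency semaphore to abuse).
    WaitForNewerConfiguration,
    /// Fail fast with 404/400 and rely on registration-time propagation.
    FailFast,
}
```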

slinkydeveloper commented:

With the recent updates on retries, including #495, I think this discussion becomes much simpler: now we can let the resolution fail when there is a configuration misalignment between workers, and that error will simply be retried.
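
In other words (hypothetical names, just to illustrate the idea), a configuration misalignment surfaces as a plain error that the existing retry machinery treats as transient:

```rust
// Illustration only: configuration misalignment becomes an ordinary error that
// the existing retry logic handles; names are hypothetical.
#[derive(Debug)]
pub enum ResolutionError {
    /// The local configuration revision is older than the one the peer referenced;
    /// retrying after the configuration catches up is expected to succeed.
    ConfigurationMisaligned { required: u64, local: u64 },
}

impl ResolutionError {
    /// Whether the retry layer should retry the operation.
    pub fn is_retryable(&self) -> bool {
        matches!(self, ResolutionError::ConfigurationMisaligned { .. })
    }
}
```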

slinkydeveloper commented:

@tillrohrmann seems I can close this now.

tillrohrmann added a commit to tillrohrmann/restate that referenced this issue May 16, 2024
…361ef..1898426f

1898426f Update min/max of protocol version fields in endpoint_manifest_schema
08871e0b Introduce service.Version and discovery.Version enums
2f3e461f Workflow api changes (restatedev#91)
1fa71a5b Add versioning info (restatedev#90)

git-subtree-dir: crates/service-protocol/service-protocol
git-subtree-split: 1898426fc98c16d704068594cd54394912845ff7