Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invoker concurrency quota #548

Merged
merged 8 commits into from
Jul 3, 2023

Conversation

slinkydeveloper
Copy link
Contributor

@slinkydeveloper slinkydeveloper commented Jun 28, 2023

Fix #540, introducing a configurable concurrency limit in the invoker. In order to be able to unit test this feature, I had to do quite some changes to allow mocking. Hopefully this is going to be useful in future anyway, to test the invoker state machine transitions in an isolated manner.

@slinkydeveloper slinkydeveloper added this to the 1B milestone Jun 28, 2023
@slinkydeveloper slinkydeveloper changed the title Invoker quota Invoker concurrency quota Jun 28, 2023
…oal of this commit is to make more manageable testing the state machine logic.

Also make sure we unreserve the quota when we abort an invocation task.
…plify the structure.

Move quota to a separate file.
Removed Codec from Invoker (this was unused)
Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating this PR @slinkydeveloper. The changes make sense and look good to me. I had some minor comments that we can resolve before merging this PR.

src/invoker/src/service/mod.rs Outdated Show resolved Hide resolved
src/invoker/src/service/state_machine_coordinator.rs Outdated Show resolved Hide resolved
src/invoker/src/service/state_machine_coordinator.rs Outdated Show resolved Hide resolved
src/invoker/src/service/state_machine_tree.rs Outdated Show resolved Hide resolved
src/invoker/src/service/mod.rs Outdated Show resolved Hide resolved
src/invoker/src/service/mod.rs Show resolved Hide resolved
Comment on lines +688 to +689
self.quota.unreserve_slot();
self.status_store.on_end(&partition, &service_invocation_id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we unreserve whenever we end an invocation. Currently, we reserve a slot and only later we call InvocationStatusStore.on_start. Maybe there is some symmetry and we could combine it so that future changes don't run the risk to forget calling one of them (like whenever we start an invocation we reserve a slot, whenever we end and invocation, we unreserve it).

As a side effect, if we always start when reserving a slot, then InvocationStatusStore.on_failure wouldn't be called w/o having first called on_start.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is not easy to refactor as you propose:

  • We reserve a slot when we get the InvokeCommand, and we unreserve it whenever this invoker will "forget" about this sid, meaning it removes the sid from the InvocationStateMachines
  • InvocationStatusStore.on_start signals when we actually start the invocation task, setting the in_flight to true and last executed time to now. In particular, when coming from a retry, we need to "flip" the status from in_flight = false to in_flight = true and reset the last executed time to now. Meaning after reserving the quota once, we may invoke several times InvocationStatusStore.on_start in case of retries

The only case where it may make sense to unify is on_end and unreserve_slot, as they're both triggered when the invoker is "removing" the given sid from its internal state. But TBH I prefer to keep them separate, as it makes more explicit what's going on. We can revisit this later if we want to.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright. I guess to make it work the status needs to be slightly changed: We create the InvocationStateMachine when allocating the quota and release the quota when releasing it. That would bind the InvocationStateMachine to the lifecycle of its slot/quota reservation. Nothing we have to do right now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could potentially resolve another asymmetry which is that if we fail on start_invocation_task, then we need to create a InvocationStateMachine to call handle_error_event. Moving the concern of creating this state machine out of the start_invocation_task could simplify the control flow a tad bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that to create the state machine I need the endpoint metadata first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this could become part of the state variants that have a resolved endpoint. I could see that the endpoint metadata might change across restarts if there is a newer endpoint version and no committed journal entries yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see that the endpoint metadata might change across restarts if there is a newer endpoint version and no committed journal entries yet.

I think we should keep this "relaxed" and continue to do what we do today, which is that we don't "fix" the endpoint after resolving it, but we re-resolve on every restart, to allow the user to "fix it" even if there are committed journal entries. This is especially useful if we allow some "configuration options" of the endpoint to be mutable: A case for this could be, for example, that you have made some changes to the gateway between restate and the service endpoint, perhaps requiring a new header, but you still did not configure the header in the restate "endpoint". By adding the header to the endpoint configuration, the runtime will be able again to access again the service endpoint.

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. +1 for merging.

src/invoker/src/service/mod.rs Outdated Show resolved Hide resolved
@slinkydeveloper slinkydeveloper merged commit fc35ad5 into restatedev:main Jul 3, 2023
@slinkydeveloper slinkydeveloper deleted the issues/540 branch July 3, 2023 10:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Invoker concurrency limit
2 participants