Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add runners to startup the ocis' services #8802

Merged
merged 13 commits into from
Apr 29, 2024
Merged

Conversation

jvillafanez
Copy link
Member

Description

Add the runner package to startup the ocis' services. This is intended to replace the oklog library

See technical details below

Related Issue

No opened issue. However, there is no clean way to stop the services right now.

Motivation and Context

The package is designed to fit our use case and can be improved if needed. For now, it should be easy to use.

How Has This Been Tested?

Currently just tested with the upcoming collaboration service. Adoption for all the services is expected at some point.

See the technical details for the expected code changes in the services

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Technical debt
  • Tests only (no source changes)

Checklist:

  • Code changes
  • Unit tests added
  • Acceptance tests added
  • Documentation ticket raised:

Technical details (note that terminology might change)

In order to create a Runner, you need to provide a runnable task, and a stopper which will notify the task to stop.

There are clear responsibilities:

  • The Runner will execute the runnable task, which will be running indefinitely. This means that the thread / goroutine should get blocked somewhere within the task (or go in an infinite cycle of run - stop - run - stop). The runner will consider the task as completed (either with a success or failure) once the runnable task ends.
  • The Stopper will just notify the runnable task to end. It's still up to the runnable task to finish properly.

In general, creating a Runner for a particular service should be as follows:

runner := runner.NewRunner("aRandomId", func() error {
  server.Run()  // the run will block the thread
}, func() {
  server.Stop()
})

For our go-micro service it might be slightly different. The http server might look like:

ch := make(chan error, 1)
runner := runner.NewRunner("aRandomId", func() error {
  httpServer.Server().Start()
  return <-ch 
}, func() {
  ch <- httpServer.Server().Stop()
})

This is because the httpServer.Server().Run() method will wait for an OS signal, but we don't want the server to wait for it. So we need to use the Start() method instead, but it doesn't block the thread. We'll have to wait in a custom channel. Luckily, the Stop() method returns when the httpServer has stopped, so sending the result through the channel would finish the task.

Since practically all our services will need to run multiple servers in parallel (some of GRPC, HTTP, and / or debug), a GroupRunner is also provided.
The GroupRunner will execute the runners in an all-or-none fashion. Note that the GroupRunner will notify all of its runners to stop, but it won't force anything, which means that stopping the GroupRunner might take some time.

Both the "regular" Runner and the GroupRunner have a Run(ctx context.Context) method. The provided context will determine for how long the task will run. Once the context is marked as done, the Runner (and GroupRunner) will notify its tasks to stop using the provided stopper function. Again, this is just a notification, and the task might still take a while until it finishes.

For our use case, the expectation is to use signal.NotifyContext(...) to use a context that will be done when we receive an OS termination signal (and / or other similar signals)

Overall, the code should look something like:

ctx, cancel := signal.NotifyContext(context.Background(), os.Interrupt)  // any other context is fine too
defer cancel()

gr := runner.NewGroupRunner()

gr.Add(runner.NewRunner("grpcServer", func() error {
.....
}, func() {
.....
}))

gr.Add(runner.NewRunner("debugServer", func() error {
.....
}, func() {
......
}))

gr.Run(ctx)

Receiving an OS interrupt signal would mark the context as done, which would make the GroupRunner to call the stopper function on all its runner and make all the tasks in the group to eventually finish. The Run method would finish and the service could finish naturally.

important note: if any runner in the group finishes (maybe because some error in the grpcServer cases it to stop, for example), all the runners in the group will also finish (their stopper will be called, which should make the task finish).

Advantages

  • Clear definition of the behavior. Expectations and limitations are clearly documented in the code, specially the one with the unique runner ids.
  • Context package usage makes it more clear when the service should stop.
  • Results / errors are available if needed (might need minor improvements)

@jvillafanez jvillafanez self-assigned this Apr 8, 2024
Copy link

update-docs bot commented Apr 8, 2024

Thanks for opening this pull request! The maintainers of this repository would appreciate it if you would create a changelog item based on your changes.

@jvillafanez jvillafanez marked this pull request as draft April 9, 2024 08:03
@jvillafanez jvillafanez force-pushed the servers_startup branch 2 times, most recently from c7119a3 to a42c7a8 Compare April 9, 2024 09:07
Copy link
Collaborator

@kobergj kobergj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions/suggestions

ocis-pkg/runner/grouprunner.go Outdated Show resolved Hide resolved
ocis-pkg/runner/grouprunner.go Show resolved Hide resolved
ocis-pkg/runner/grouprunner.go Outdated Show resolved Hide resolved
ocis-pkg/runner/grouprunner.go Outdated Show resolved Hide resolved
ocis-pkg/runner/grouprunner.go Outdated Show resolved Hide resolved
ocis-pkg/runner/grouprunner.go Outdated Show resolved Hide resolved
ocis-pkg/runner/runner.go Outdated Show resolved Hide resolved
ocis-pkg/runner/runner.go Outdated Show resolved Hide resolved
ocis-pkg/runner/runner.go Outdated Show resolved Hide resolved
ocis-pkg/runner/types.go Show resolved Hide resolved
// Having notified that the context has been finished, we still need to
// wait for the rest of the results
for i := len(results); i < len(gr.runners); i++ {
result := <-ch
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
result := <-ch
select {
case result := <-ch:
results[result.RunnerID] = result
case time.Tick(time.Minute):
log(...)
os.Exit(1)

something like this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmh. I don't think this is enough. A GroupRunner can be created with any Runner. There is no guarantee this is a InterruptedTimeoutRunner. If I just call this with a broken custom Runner this will hang forever. We need the timeout here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a matter of deciding who's responsible of ensuring the program won't hang forever.

My assumption is that the responsible is the one using the package, because he knows the task and how to stop the task, so there shouldn't be a reason for him to provide a faulty task (otherwise, it's a bug that he needs to fix). The InterruptedTimeoutRunner can help, knowing its limitations, to ensure we don't block the thread, but it's the developer's choice to use it or not.
Basically, if the task hangs, it's your fault (whoever is using the package). Code comments should be clear in this regard (if it isn't clear enough, we should add more info about it).

If we're going to be responsible, there are a couple of important things to notice:

  • We can't ensure that the resources used by the task will be ever freed even if we return an error result. This needs to be clarified because it's something we CAN'T guarantee.
    Note that, before, it was your responsibility to ensure this doesn't happen, but now it's ours, and we can't ensure it.
  • We'll add more complexity to the runners. The code is kind of delicate because we need to ensure it's thread-safe, so I'd rather move the complexity out of the way if possible. Adding more code also increases the probability of more bugs.

Copy link
Collaborator

@kobergj kobergj Apr 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asically, if the task hangs, it's your fault (whoever is using the package). Code comments should be clear in this regard (if it isn't clear enough, we should add more info about it).

I strongly disagree. If the task is broken for some reason the programm MUST still exit. It cannot hang forever saying "this is your fault, I don't care". Since this is supposed to be the supervisor of all tasks it MUST make sure its tasks finish after a certain amount of time.

We can't ensure that the resources used by the task will be ever freed even if we return an error result.

Why not? If we exit within the grouprunner, all resources of our spawned go routines should be freed.

We'll add more complexity to the runners.

I tend to disagree again. We could remove the complete InterruptedTimeoutRunner and replace it with only one select statement. This would reduce complexity in my opinion

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jvillafanez this comment is still open? Could you add the timeout here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The runner has a guaranteed exit with the timeout, so we'll eventually get a result. A deadlock isn't possible.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Does it really need to be that complicated? We now need another channel and another go-routine to make sure a result is delivered. We could omit all that with just one single line here:

case <-time.After(r.interruptDur):

Wouldn't this be much simpler?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

runner's Run and RunAsync method as well as group runner's RunAsync method should behave the same way (returning a result after the timeout period has been reached). Just checking for the timeout there would mean that the timeout behavior would be exclusive for that method.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking for the timeout there would mean that the timeout behavior would be exclusive for that method.

But that is exactly what it is. Only the GroupRunner cares about the timeout because it needs to govern several Runners. One Runner started alone doesn't necessarily need a timeout. It could deadlock forever if its creator wants it so. But the GroupRunner needs to make sure it finishes in a reasonable amount of time.

@jvillafanez jvillafanez force-pushed the servers_startup branch 2 times, most recently from 40fca5e to 5fe4718 Compare April 12, 2024 13:03
@jvillafanez jvillafanez marked this pull request as ready for review April 15, 2024 09:07
Copy link
Collaborator

@kobergj kobergj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add a changelog

ocis-pkg/runner/runner_suite_test.go Outdated Show resolved Hide resolved
// Having notified that the context has been finished, we still need to
// wait for the rest of the results
for i := len(results); i < len(gr.runners); i++ {
result := <-ch
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmh. I don't think this is enough. A GroupRunner can be created with any Runner. There is no guarantee this is a InterruptedTimeoutRunner. If I just call this with a broken custom Runner this will hang forever. We need the timeout here.

ocis-pkg/runner/helper.go Outdated Show resolved Hide resolved
@butonic
Copy link
Member

butonic commented Apr 18, 2024

In reva gracefulShutdown has an os.Exit(0) call at the end, so to allow all services to end gracefully, we will need to fix that as well. Originally tracked in #6602

As far as this PR is concerned, I vote for merging it to:

  • get rid of a dependency to github.com/oklog/run
  • make the code to stop all servers more explicit - we can no longer forget to call cancel() when one of them ends

@rhafer
Copy link
Contributor

rhafer commented Apr 18, 2024

@jvillafanez Pardon my ignorance, but I still feel this needs some more explanation. Why do we need to replace oklog/run with this? What can this code do better than oklog/run? I am just trying to understand it, as this introduces an additional maintenance burden.

@jvillafanez
Copy link
Member Author

Mostly explicit context handling and behavior (as far as I know, with oklog you handle the context on your own), as well as clearly documented behavior (hopefully).

With the PR, once the context is marked as done anyhow, all the tasks associated with that context will be asked to stop. Our Run(ctx) method will be checking the state so we can ask the task to stop once the context is done.
In addition, some of the behavior of the oklog isn't clear (at least for me), so this PR also documents clearly the behavior of each method (if there are doubts about the expected behavior, we should include a comment and / or fix a possible bug).

My worry with oklog is that I'm not sure whether we're using oklog correctly, or if oklog is suited for our purposes. For example, it isn't clear for me when the interrupt function is called in oklog other than when the first task finishes, so how a task finishes?
There seem to be a lot of burden shifted to the caller:

  • You need to provide a task that it is ensured to finish somehow. If you want to use contexts, it's up to you how to manage them.
  • You'll get blocked if you provide a task with an undefined running time, such as a server. You'll have to rely on the server to have some mechanism to stop it, probably via context, signals, etc.
  • Running the tasks is always synchronous and it will always block. It isn't possible to run a task while waiting for another thing.

Since the our main usage will be to run servers, we have to assume that the servers won't stop unless explicitly requested. This is where I don't think oklog is suited for this: there is no explicit way to interrupt / stop the task from the group. If the way to stop the servers will be via context, why don't we cancel the parent / top context ourselves? Remember that oklog isn't doing anything with the context.

@jvillafanez
Copy link
Member Author

Runners are guaranteed to provide a result after being interrupted either manually or when the context (in the Run(ctx) method) is done.

The runner will execute the task indefinitely. Once the task is interrupted, a timeout will start based on the "interruption duration" provided when the runner was created. If the timeout is reached and the task hasn't provided a result yet, a timeout result will be returned instead. Note that the task might still be running in a different goroutine and consuming resources.

For the case of group runners, there is no change. Since the runners are guaranteed to return after a while, the group runner won't block forever after being interrupted. Maximum waiting time should be the same as the maximum timeout duration among the runners in the group.

@kobergj
Copy link
Collaborator

kobergj commented Apr 19, 2024

I still don't understand why you don't want to add a timeout check to the group runner. I still think this is the best way to go.

The problem with the current implementation is that it doesn't have a sane default. If I just pass 0 as interrupt duration, the task will stop immediately. If we want to go with it, we need a sane default value.

@jvillafanez
Copy link
Member Author

jvillafanez commented Apr 19, 2024

The behavior should be consistent for both the "regular" runner and the group runner. Making the change in the group runner implies that using just the "regular" runner can cause problems.
It's true that we don't have plans to use "regular" runners on their own, but having consistency should make maintenance easier.

The problem with the current implementation is that it doesn't have a sane default. If I just pass 0 as interrupt duration, the task will stop immediately. If we want to go with it, we need a sane default value.

I think the parameter needs to be included in the creation. Using a duration parameter in the Interrupt() could cause problems if the context is done (Run method would need to also ask for the duration). If we also include the duration in the Run method, not only we need to adjust code in the group runner, but also the behavior could be different with the RunAsync (both "regular" and group runner).

I'll add an options parameter, the same way it's done across oCIS. This way, the duration can be optional and we can set a default duration if it isn't provided.

@rhafer
Copy link
Contributor

rhafer commented Apr 22, 2024

Mostly explicit context handling and behavior (as far as I know, with oklog you handle the context on your own), as well as clearly documented behavior (hopefully).

Understood. And yeah. I really love the efforts you put into documentation and test!
[..]

My worry with oklog is that I'm not sure whether we're using oklog correctly, or if oklog is suited for our purposes. For example, it isn't clear for me when the interrupt function is called in oklog other than when the first task finishes, so how a task finishes? There seem to be a lot of burden shifted to the caller:

That's true oklog.run, doesn't worry about all of that. Though IIRC it provides helpers to add tasks that are ensured to finish to the group (ContextHandler, SignalHandler)

[..]

Since the our main usage will be to run servers, we have to assume that the servers won't stop unless explicitly requested. This is where I don't think oklog is suited for this: there is no explicit way to interrupt / stop the task from the group. If the way to stop the servers will be via context, why don't we cancel the parent / top context ourselves? Remember that oklog isn't doing anything with the context.

Ok, That makes sense I guess. Thanks a lot for taking the time to write it down. I still wonder a bit if we could (re-)use some already existing code. But since this isn't too big and really well documented I am fine adding it.

rhafer
rhafer previously requested changes Apr 22, 2024
case d := <-r.interruptedCh:
result = &Result{
RunnerID: r.ID,
RunnerError: fmt.Errorf("runner %s timed out after waiting for %s", r.ID, d.String()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should return some explicitly typed error here. Something that can be easily checked with errors.Is(...) to know if a task was stopped properly, or ran into a timeout.

@micbar micbar requested review from kobergj and rhafer April 22, 2024 13:00
Copy link
Contributor

@rhafer rhafer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I am fine with this now. But I'll leave the final decision to @kobergj, he's spend more time with it.

@rhafer rhafer dismissed their stale review April 22, 2024 13:19

addressed

Copy link
Contributor

@dragotin dragotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can not really comment on the details that were discussed, but a general remark is that the oklog module gets near to no substancial maintenance, and also does not seem to have a broader user base.

So IMHO this is a good move to pull the functionality into our codebase, especially with a code quality like this, with great documentation, tests etc. Very good job!

Copy link
Collaborator

@kobergj kobergj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still unresolved review points...

// Having notified that the context has been finished, we still need to
// wait for the rest of the results
for i := len(results); i < len(gr.runners); i++ {
result := <-ch
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jvillafanez this comment is still open? Could you add the timeout here?

@DeepDiver1975
Copy link
Member

We cannot be the first in need of something like this .... isn't there any existing lib out there?

@rhafer
Copy link
Contributor

rhafer commented Apr 29, 2024

We cannot be the first in need of something like this .... isn't there any existing lib out there?

There is stuff like golang.org/x/sync/errgroup which can do similar things, but not quite the same ( e.g. it also lacks an explicit Stop() method )

Copy link

@kobergj kobergj merged commit d8cae78 into master Apr 29, 2024
3 checks passed
@delete-merged-branch delete-merged-branch bot deleted the servers_startup branch April 29, 2024 11:52
ownclouders pushed a commit that referenced this pull request Apr 29, 2024
feat: add runners to startup the ocis' services
@micbar micbar mentioned this pull request Jun 19, 2024
24 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants