-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add runners to startup the ocis' services #8802
Conversation
Thanks for opening this pull request! The maintainers of this repository would appreciate it if you would create a changelog item based on your changes. |
c7119a3
to
a42c7a8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions/suggestions
d4ad735
to
208c53a
Compare
ocis-pkg/runner/grouprunner.go
Outdated
// Having notified that the context has been finished, we still need to | ||
// wait for the rest of the results | ||
for i := len(results); i < len(gr.runners); i++ { | ||
result := <-ch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
result := <-ch | |
select { | |
case result := <-ch: | |
results[result.RunnerID] = result | |
case time.Tick(time.Minute): | |
log(...) | |
os.Exit(1) |
something like this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be done via helper functions (https://github.com/owncloud/ocis/pull/8802/files#diff-4fc1db913125260b4b30987a3e771134f13a1d1326d41231f311296c717645a1R28) if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmh. I don't think this is enough. A GroupRunner
can be created with any Runner
. There is no guarantee this is a InterruptedTimeoutRunner
. If I just call this with a broken custom Runner
this will hang forever. We need the timeout here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a matter of deciding who's responsible of ensuring the program won't hang forever.
My assumption is that the responsible is the one using the package, because he knows the task and how to stop the task, so there shouldn't be a reason for him to provide a faulty task (otherwise, it's a bug that he needs to fix). The InterruptedTimeoutRunner
can help, knowing its limitations, to ensure we don't block the thread, but it's the developer's choice to use it or not.
Basically, if the task hangs, it's your fault (whoever is using the package). Code comments should be clear in this regard (if it isn't clear enough, we should add more info about it).
If we're going to be responsible, there are a couple of important things to notice:
- We can't ensure that the resources used by the task will be ever freed even if we return an error result. This needs to be clarified because it's something we CAN'T guarantee.
Note that, before, it was your responsibility to ensure this doesn't happen, but now it's ours, and we can't ensure it. - We'll add more complexity to the runners. The code is kind of delicate because we need to ensure it's thread-safe, so I'd rather move the complexity out of the way if possible. Adding more code also increases the probability of more bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
asically, if the task hangs, it's your fault (whoever is using the package). Code comments should be clear in this regard (if it isn't clear enough, we should add more info about it).
I strongly disagree. If the task is broken for some reason the programm MUST still exit. It cannot hang forever saying "this is your fault, I don't care". Since this is supposed to be the supervisor
of all tasks it MUST make sure its tasks finish after a certain amount of time.
We can't ensure that the resources used by the task will be ever freed even if we return an error result.
Why not? If we exit within the grouprunner, all resources of our spawned go routines should be freed.
We'll add more complexity to the runners.
I tend to disagree again. We could remove the complete InterruptedTimeoutRunner
and replace it with only one select
statement. This would reduce complexity in my opinion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jvillafanez this comment is still open? Could you add the timeout here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The runner has a guaranteed exit with the timeout, so we'll eventually get a result. A deadlock isn't possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Does it really need to be that complicated? We now need another channel and another go-routine to make sure a result is delivered. We could omit all that with just one single line here:
case <-time.After(r.interruptDur):
Wouldn't this be much simpler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
runner's Run
and RunAsync
method as well as group runner's RunAsync
method should behave the same way (returning a result after the timeout period has been reached). Just checking for the timeout there would mean that the timeout behavior would be exclusive for that method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking for the timeout there would mean that the timeout behavior would be exclusive for that method.
But that is exactly what it is. Only the GroupRunner
cares about the timeout because it needs to govern several Runners. One Runner
started alone doesn't necessarily need a timeout. It could deadlock forever if its creator wants it so. But the GroupRunner
needs to make sure it finishes in a reasonable amount of time.
40fca5e
to
5fe4718
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also add a changelog
ocis-pkg/runner/grouprunner.go
Outdated
// Having notified that the context has been finished, we still need to | ||
// wait for the rest of the results | ||
for i := len(results); i < len(gr.runners); i++ { | ||
result := <-ch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmh. I don't think this is enough. A GroupRunner
can be created with any Runner
. There is no guarantee this is a InterruptedTimeoutRunner
. If I just call this with a broken custom Runner
this will hang forever. We need the timeout here.
In reva gracefulShutdown has an As far as this PR is concerned, I vote for merging it to:
|
@jvillafanez Pardon my ignorance, but I still feel this needs some more explanation. Why do we need to replace oklog/run with this? What can this code do better than oklog/run? I am just trying to understand it, as this introduces an additional maintenance burden. |
Mostly explicit context handling and behavior (as far as I know, with oklog you handle the context on your own), as well as clearly documented behavior (hopefully). With the PR, once the context is marked as done anyhow, all the tasks associated with that context will be asked to stop. Our My worry with oklog is that I'm not sure whether we're using oklog correctly, or if oklog is suited for our purposes. For example, it isn't clear for me when the interrupt function is called in oklog other than when the first task finishes, so how a task finishes?
Since the our main usage will be to run servers, we have to assume that the servers won't stop unless explicitly requested. This is where I don't think oklog is suited for this: there is no explicit way to interrupt / stop the task from the group. If the way to stop the servers will be via context, why don't we cancel the parent / top context ourselves? Remember that oklog isn't doing anything with the context. |
Runners are guaranteed to provide a result after being interrupted either manually or when the context (in the The runner will execute the task indefinitely. Once the task is interrupted, a timeout will start based on the "interruption duration" provided when the runner was created. If the timeout is reached and the task hasn't provided a result yet, a timeout result will be returned instead. Note that the task might still be running in a different goroutine and consuming resources. For the case of group runners, there is no change. Since the runners are guaranteed to return after a while, the group runner won't block forever after being interrupted. Maximum waiting time should be the same as the maximum timeout duration among the runners in the group. |
I still don't understand why you don't want to add a timeout check to the group runner. I still think this is the best way to go. The problem with the current implementation is that it doesn't have a sane default. If I just pass |
The behavior should be consistent for both the "regular" runner and the group runner. Making the change in the group runner implies that using just the "regular" runner can cause problems.
I think the parameter needs to be included in the creation. Using a duration parameter in the I'll add an options parameter, the same way it's done across oCIS. This way, the duration can be optional and we can set a default duration if it isn't provided. |
a63bd8e
to
0d5756b
Compare
Understood. And yeah. I really love the efforts you put into documentation and test!
That's true oklog.run, doesn't worry about all of that. Though IIRC it provides helpers to add tasks that are ensured to finish to the group (ContextHandler, SignalHandler) [..]
Ok, That makes sense I guess. Thanks a lot for taking the time to write it down. I still wonder a bit if we could (re-)use some already existing code. But since this isn't too big and really well documented I am fine adding it. |
ocis-pkg/runner/runner.go
Outdated
case d := <-r.interruptedCh: | ||
result = &Result{ | ||
RunnerID: r.ID, | ||
RunnerError: fmt.Errorf("runner %s timed out after waiting for %s", r.ID, d.String()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should return some explicitly typed error here. Something that can be easily checked with errors.Is(...) to know if a task was stopped properly, or ran into a timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I am fine with this now. But I'll leave the final decision to @kobergj, he's spend more time with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can not really comment on the details that were discussed, but a general remark is that the oklog module gets near to no substancial maintenance, and also does not seem to have a broader user base.
So IMHO this is a good move to pull the functionality into our codebase, especially with a code quality like this, with great documentation, tests etc. Very good job!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still unresolved review points...
ocis-pkg/runner/grouprunner.go
Outdated
// Having notified that the context has been finished, we still need to | ||
// wait for the rest of the results | ||
for i := len(results); i < len(gr.runners); i++ { | ||
result := <-ch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jvillafanez this comment is still open? Could you add the timeout here?
We cannot be the first in need of something like this .... isn't there any existing lib out there? |
There is stuff like golang.org/x/sync/errgroup which can do similar things, but not quite the same ( e.g. it also lacks an explicit Stop() method ) |
8dee314
to
05f684a
Compare
Quality Gate passedIssues Measures |
feat: add runners to startup the ocis' services
Description
Add the
runner
package to startup the ocis' services. This is intended to replace the oklog librarySee technical details below
Related Issue
No opened issue. However, there is no clean way to stop the services right now.
Motivation and Context
The package is designed to fit our use case and can be improved if needed. For now, it should be easy to use.
How Has This Been Tested?
Currently just tested with the upcoming collaboration service. Adoption for all the services is expected at some point.
See the technical details for the expected code changes in the services
Screenshots (if appropriate):
Types of changes
Checklist:
Technical details (note that terminology might change)
In order to create a
Runner
, you need to provide a runnable task, and a stopper which will notify the task to stop.There are clear responsibilities:
Runner
will execute the runnable task, which will be running indefinitely. This means that the thread / goroutine should get blocked somewhere within the task (or go in an infinite cycle of run - stop - run - stop). The runner will consider the task as completed (either with a success or failure) once the runnable task ends.Stopper
will just notify the runnable task to end. It's still up to the runnable task to finish properly.In general, creating a
Runner
for a particular service should be as follows:For our go-micro service it might be slightly different. The http server might look like:
This is because the
httpServer.Server().Run()
method will wait for an OS signal, but we don't want the server to wait for it. So we need to use theStart()
method instead, but it doesn't block the thread. We'll have to wait in a custom channel. Luckily, theStop()
method returns when the httpServer has stopped, so sending the result through the channel would finish the task.Since practically all our services will need to run multiple servers in parallel (some of GRPC, HTTP, and / or debug), a
GroupRunner
is also provided.The
GroupRunner
will execute the runners in an all-or-none fashion. Note that theGroupRunner
will notify all of its runners to stop, but it won't force anything, which means that stopping theGroupRunner
might take some time.Both the "regular"
Runner
and theGroupRunner
have aRun(ctx context.Context)
method. The provided context will determine for how long the task will run. Once the context is marked as done, theRunner
(andGroupRunner
) will notify its tasks to stop using the provided stopper function. Again, this is just a notification, and the task might still take a while until it finishes.For our use case, the expectation is to use
signal.NotifyContext(...)
to use a context that will be done when we receive an OS termination signal (and / or other similar signals)Overall, the code should look something like:
Receiving an OS interrupt signal would mark the context as done, which would make the
GroupRunner
to call the stopper function on all its runner and make all the tasks in the group to eventually finish. TheRun
method would finish and the service could finish naturally.important note: if any runner in the group finishes (maybe because some error in the grpcServer cases it to stop, for example), all the runners in the group will also finish (their stopper will be called, which should make the task finish).
Advantages