feat: add runners to startup the ocis' services #8802

jvillafanez · 2024-04-08T13:41:00Z

Description

Add the runner package to start up the ocis' services. This is intended to replace the oklog library.

See the technical details below.

Related Issue

No open issue. However, there is currently no clean way to stop the services.

Motivation and Context

The package is designed to fit our use case and can be improved if needed. For now, it should be easy to use.

How Has This Been Tested?

Currently tested only with the upcoming collaboration service. Adoption for all the services is expected at some point.

See the technical details for the expected code changes in the services.

Screenshots (if appropriate):

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Technical debt
Tests only (no source changes)

Checklist:

Code changes
Unit tests added
Acceptance tests added
Documentation ticket raised:

Technical details (note that terminology might change)

In order to create a Runner, you need to provide a runnable task, and a stopper which will notify the task to stop.

There are clear responsibilities:

The Runner will execute the runnable task, which will be running indefinitely. This means that the thread / goroutine should get blocked somewhere within the task (or go in an infinite cycle of run - stop - run - stop). The runner will consider the task as completed (either with a success or failure) once the runnable task ends.
The Stopper will just notify the runnable task to end. It's still up to the runnable task to finish properly.

In general, creating a Runner for a particular service should be as follows:

runner := runner.NewRunner("aRandomId", func() error {
  server.Run() // Run blocks the goroutine until the server stops
  return nil   // the task must return an error value to satisfy func() error
}, func() {
  server.Stop()
})

For our go-micro service it might be slightly different. The http server might look like:

ch := make(chan error, 1)
runner := runner.NewRunner("aRandomId", func() error {
  httpServer.Server().Start()
  return <-ch 
}, func() {
  ch <- httpServer.Server().Stop()
})

This is because the httpServer.Server().Run() method will wait for an OS signal, but we don't want the server to wait for it. So we need to use the Start() method instead, but it doesn't block the thread. We'll have to wait in a custom channel. Luckily, the Stop() method returns when the httpServer has stopped, so sending the result through the channel would finish the task.
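The Start / Stop pattern described above can be sketched with a stand-in server. This is only an illustration: `fakeServer` and `newTask` are hypothetical names and not part of go-micro or the runner package. The fake's `Stop` waits for `Start` purely to keep the demo deterministic.

```go
package main

import "fmt"

// fakeServer is a hypothetical stand-in for a server whose Start does not block.
type fakeServer struct {
	started chan struct{}
}

// Start returns immediately, like the non-blocking Start described above.
func (s *fakeServer) Start() error {
	close(s.started)
	return nil
}

// Stop returns once the server has stopped. In this fake it simply waits
// until Start has been called, so the demo has no data race.
func (s *fakeServer) Stop() error {
	<-s.started
	return nil
}

// newTask builds the task / stopper pair that would be handed to a runner.
func newTask(s *fakeServer) (task func() error, stopper func()) {
	ch := make(chan error, 1) // buffered so the stopper never blocks
	task = func() error {
		if err := s.Start(); err != nil {
			return err
		}
		return <-ch // block here until the stopper delivers Stop's result
	}
	stopper = func() { ch <- s.Stop() }
	return task, stopper
}

func main() {
	task, stopper := newTask(&fakeServer{started: make(chan struct{})})

	done := make(chan error, 1)
	go func() { done <- task() }()

	stopper() // ask the task to end
	fmt.Println("task finished with:", <-done)
}
```

The buffered channel matters: if the stopper is called after the task has already returned, the send still succeeds instead of blocking forever.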

Since practically all our services will need to run multiple servers in parallel (some of GRPC, HTTP, and / or debug), a GroupRunner is also provided.
The GroupRunner will execute the runners in an all-or-none fashion. Note that the GroupRunner will notify all of its runners to stop, but it won't force anything, which means that stopping the GroupRunner might take some time.

Both the "regular" Runner and the GroupRunner have a Run(ctx context.Context) method. The provided context will determine for how long the task will run. Once the context is marked as done, the Runner (and GroupRunner) will notify its tasks to stop using the provided stopper function. Again, this is just a notification, and the task might still take a while until it finishes.

For our use case, the expectation is to use signal.NotifyContext(...) to use a context that will be done when we receive an OS termination signal (and / or other similar signals)

Overall, the code should look something like:

ctx, cancel := signal.NotifyContext(context.Background(), os.Interrupt)  // any other context is fine too
defer cancel()

gr := runner.NewGroupRunner()

gr.Add(runner.NewRunner("grpcServer", func() error {
.....
}, func() {
.....
}))

gr.Add(runner.NewRunner("debugServer", func() error {
.....
}, func() {
......
}))

gr.Run(ctx)

Receiving an OS interrupt signal marks the context as done, which makes the GroupRunner call the stopper function on all its runners so that all the tasks in the group eventually finish. The Run method then returns and the service can exit naturally.

Important note: if any runner in the group finishes (for example, because some error in the grpcServer causes it to stop), all the other runners in the group will also be stopped (their stoppers will be called, which should make their tasks finish).

Advantages

Clear definition of the behavior. Expectations and limitations are clearly documented in the code, especially the requirement for unique runner IDs.
Using the context package makes it clearer when the service should stop.
Results / errors are available if needed (might need minor improvements).

update-docs · 2024-04-08T13:41:05Z

Thanks for opening this pull request! The maintainers of this repository would appreciate it if you would create a changelog item based on your changes.

kobergj

Some questions/suggestions

ocis-pkg/runner/grouprunner.go

ocis-pkg/runner/runner.go

ocis-pkg/runner/types.go

kobergj · 2024-04-11T08:14:12Z

ocis-pkg/runner/grouprunner.go

+	// Having notified that the context has been finished, we still need to
+	// wait for the rest of the results
+	for i := len(results); i < len(gr.runners); i++ {
+		result := <-ch


Suggested change

-		result := <-ch
+		select {
+		case result := <-ch:
+			results[result.RunnerID] = result
+		case <-time.Tick(time.Minute):
+			log(...)
+			os.Exit(1)
+		}

something like this

It will be done via helper functions (https://github.com/owncloud/ocis/pull/8802/files#diff-4fc1db913125260b4b30987a3e771134f13a1d1326d41231f311296c717645a1R28) if needed.

Mmh. I don't think this is enough. A GroupRunner can be created with any Runner. There is no guarantee this is an InterruptedTimeoutRunner. If I just call this with a broken custom Runner this will hang forever. We need the timeout here.

I think it's a matter of deciding who's responsible for ensuring the program won't hang forever.

My assumption is that the responsibility lies with whoever uses the package, because they know the task and how to stop it, so there shouldn't be a reason for them to provide a faulty task (otherwise, it's a bug that they need to fix). The InterruptedTimeoutRunner can help, within its limitations, to ensure we don't block the thread, but it's the developer's choice whether to use it.
Basically, if the task hangs, it's your fault (whoever is using the package). Code comments should be clear in this regard (if it isn't clear enough, we should add more info about it).

If we're going to be responsible, there are a couple of important things to notice:

We can't ensure that the resources used by the task will ever be freed even if we return an error result. This needs to be clarified because it's something we CAN'T guarantee.
Note that, before, it was your responsibility to ensure this doesn't happen, but now it's ours, and we can't ensure it.

We'll add more complexity to the runners. The code is kind of delicate because we need to ensure it's thread-safe, so I'd rather move the complexity out of the way if possible. Adding more code also increases the probability of more bugs.

Basically, if the task hangs, it's your fault (whoever is using the package). Code comments should be clear in this regard (if it isn't clear enough, we should add more info about it).

I strongly disagree. If the task is broken for some reason the program MUST still exit. It cannot hang forever saying "this is your fault, I don't care". Since this is supposed to be the supervisor of all tasks it MUST make sure its tasks finish after a certain amount of time.

We can't ensure that the resources used by the task will ever be freed even if we return an error result.

Why not? If we exit within the grouprunner, all resources of our spawned goroutines should be freed.

We'll add more complexity to the runners.

I tend to disagree again. We could remove the complete InterruptedTimeoutRunner and replace it with only one select statement. This would reduce complexity in my opinion

@jvillafanez this comment is still open? Could you add the timeout here?

The runner has a guaranteed exit with the timeout, so we'll eventually get a result. A deadlock isn't possible.

I see. Does it really need to be that complicated? We now need another channel and another go-routine to make sure a result is delivered. We could omit all that with just one single line here:

case <-time.After(r.interruptDur):

Wouldn't this be much simpler?

The runner's Run and RunAsync methods, as well as the group runner's RunAsync method, should behave the same way (returning a result after the timeout period has been reached). Just checking for the timeout there would mean that the timeout behavior would be exclusive for that method.

Just checking for the timeout there would mean that the timeout behavior would be exclusive for that method.

But that is exactly what it is. Only the GroupRunner cares about the timeout because it needs to govern several Runners. One Runner started alone doesn't necessarily need a timeout. It could deadlock forever if its creator wants it so. But the GroupRunner needs to make sure it finishes in a reasonable amount of time.

kobergj

Please also add a changelog

ocis-pkg/runner/runner_suite_test.go

kobergj · 2024-04-17T14:51:01Z

ocis-pkg/runner/grouprunner.go

+	// Having notified that the context has been finished, we still need to
+	// wait for the rest of the results
+	for i := len(results); i < len(gr.runners); i++ {
+		result := <-ch


Mmh. I don't think this is enough. A GroupRunner can be created with any Runner. There is no guarantee this is an InterruptedTimeoutRunner. If I just call this with a broken custom Runner this will hang forever. We need the timeout here.

ocis-pkg/runner/helper.go

butonic · 2024-04-18T10:04:15Z

In reva gracefulShutdown has an os.Exit(0) call at the end, so to allow all services to end gracefully, we will need to fix that as well. Originally tracked in #6602

As far as this PR is concerned, I vote for merging it to:

get rid of a dependency to github.com/oklog/run
make the code to stop all servers more explicit - we can no longer forget to call cancel() when one of them ends

rhafer · 2024-04-18T11:18:14Z

@jvillafanez Pardon my ignorance, but I still feel this needs some more explanation. Why do we need to replace oklog/run with this? What can this code do better than oklog/run? I am just trying to understand it, as this introduces an additional maintenance burden.

jvillafanez · 2024-04-18T12:43:08Z

Mostly explicit context handling and behavior (as far as I know, with oklog you handle the context on your own), as well as clearly documented behavior (hopefully).

With the PR, once the context is marked as done, however that happens, all the tasks associated with that context will be asked to stop. Our Run(ctx) method will be checking the state so we can ask the task to stop once the context is done.
In addition, some of oklog's behavior isn't clear (at least to me), so this PR also clearly documents the behavior of each method (if there are doubts about the expected behavior, we should include a comment and / or fix a possible bug).

My worry with oklog is that I'm not sure whether we're using oklog correctly, or if oklog is suited for our purposes. For example, it isn't clear to me when the interrupt function is called in oklog other than when the first task finishes, so how does a task finish?
There seems to be a lot of burden shifted to the caller:

You need to provide a task that is guaranteed to finish somehow. If you want to use contexts, it's up to you how to manage them.
You'll get blocked if you provide a task with an undefined running time, such as a server. You'll have to rely on the server to have some mechanism to stop it, probably via context, signals, etc.
Running the tasks is always synchronous and it will always block. It isn't possible to run a task while waiting for another thing.

Since our main usage will be to run servers, we have to assume that the servers won't stop unless explicitly requested. This is where I don't think oklog is suited for this: there is no explicit way to interrupt / stop the task from the group. If the way to stop the servers will be via context, why don't we cancel the parent / top context ourselves? Remember that oklog isn't doing anything with the context.

jvillafanez · 2024-04-19T09:10:36Z

Runners are guaranteed to provide a result after being interrupted either manually or when the context (in the Run(ctx) method) is done.

The runner will execute the task indefinitely. Once the task is interrupted, a timeout will start based on the "interruption duration" provided when the runner was created. If the timeout is reached and the task hasn't provided a result yet, a timeout result will be returned instead. Note that the task might still be running in a different goroutine and consuming resources.

For the case of group runners, there is no change. Since the runners are guaranteed to return after a while, the group runner won't block forever after being interrupted. Maximum waiting time should be the same as the maximum timeout duration among the runners in the group.

kobergj · 2024-04-19T09:22:50Z

I still don't understand why you don't want to add a timeout check to the group runner. I still think this is the best way to go.

The problem with the current implementation is that it doesn't have a sane default. If I just pass 0 as interrupt duration, the task will stop immediately. If we want to go with it, we need a sane default value.

jvillafanez · 2024-04-19T10:28:42Z

The behavior should be consistent for both the "regular" runner and the group runner. Making the change in the group runner implies that using just the "regular" runner can cause problems.
It's true that we don't have plans to use "regular" runners on their own, but having consistency should make maintenance easier.

The problem with the current implementation is that it doesn't have a sane default. If I just pass 0 as interrupt duration, the task will stop immediately. If we want to go with it, we need a sane default value.

I think the parameter needs to be included in the creation. Using a duration parameter in Interrupt() could cause problems if the context is done (the Run method would also need to ask for the duration). If we also include the duration in the Run method, not only would we need to adjust code in the group runner, but the behavior could also differ from RunAsync (for both the "regular" and the group runner).

I'll add an options parameter, the same way it's done across oCIS. This way, the duration can be optional and we can set a default duration if it isn't provided.

rhafer · 2024-04-22T08:54:09Z

Mostly explicit context handling and behavior (as far as I know, with oklog you handle the context on your own), as well as clearly documented behavior (hopefully).

Understood. And yeah. I really love the efforts you put into documentation and test!
[..]

My worry with oklog is that I'm not sure whether we're using oklog correctly, or if oklog is suited for our purposes. For example, it isn't clear to me when the interrupt function is called in oklog other than when the first task finishes, so how does a task finish? There seems to be a lot of burden shifted to the caller:

That's true, oklog.run doesn't worry about all of that. Though IIRC it provides helpers to add tasks that are guaranteed to finish to the group (ContextHandler, SignalHandler).

[..]

Since our main usage will be to run servers, we have to assume that the servers won't stop unless explicitly requested. This is where I don't think oklog is suited for this: there is no explicit way to interrupt / stop the task from the group. If the way to stop the servers will be via context, why don't we cancel the parent / top context ourselves? Remember that oklog isn't doing anything with the context.

OK, that makes sense I guess. Thanks a lot for taking the time to write it down. I still wonder a bit if we could (re-)use some already existing code. But since this isn't too big and really well documented I am fine adding it.

rhafer · 2024-04-22T09:04:10Z

ocis-pkg/runner/runner.go

+	case d := <-r.interruptedCh:
+		result = &Result{
+			RunnerID:    r.ID,
+			RunnerError: fmt.Errorf("runner %s timed out after waiting for %s", r.ID, d.String()),


I think we should return some explicitly typed error here. Something that can be easily checked with errors.Is(...) to know if a task was stopped properly, or ran into a timeout.

rhafer

I think I am fine with this now. But I'll leave the final decision to @kobergj, he's spent more time with it.

addressed

dragotin

I cannot really comment on the details that were discussed, but a general remark is that the oklog module gets next to no substantial maintenance, and also does not seem to have a broad user base.

So IMHO this is a good move to pull the functionality into our codebase, especially with a code quality like this, with great documentation, tests etc. Very good job!

kobergj

Still unresolved review points...

kobergj · 2024-04-26T07:14:51Z

ocis-pkg/runner/grouprunner.go

+	// Having notified that the context has been finished, we still need to
+	// wait for the rest of the results
+	for i := len(results); i < len(gr.runners); i++ {
+		result := <-ch


@jvillafanez this comment is still open? Could you add the timeout here?

DeepDiver1975 · 2024-04-26T07:34:46Z

We cannot be the first in need of something like this .... isn't there any existing lib out there?

rhafer · 2024-04-29T07:51:46Z

We cannot be the first in need of something like this .... isn't there any existing lib out there?

There is stuff like golang.org/x/sync/errgroup which can do similar things, but not quite the same (e.g. it also lacks an explicit Stop() method).

sonarqubecloud · 2024-04-29T10:08:07Z

Quality Gate passed

Issues
15 New issues
0 Accepted issues

Measures
0 Security Hotspots
99.4% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

feat: add runners to startup the ocis' services

jvillafanez self-assigned this Apr 8, 2024

jvillafanez marked this pull request as draft April 9, 2024 08:03

jvillafanez force-pushed the servers_startup branch 2 times, most recently from c7119a3 to a42c7a8 Compare April 9, 2024 09:07

kobergj requested changes Apr 9, 2024


jvillafanez force-pushed the servers_startup branch from d4ad735 to 208c53a Compare April 10, 2024 12:57

kobergj reviewed Apr 11, 2024


jvillafanez force-pushed the servers_startup branch 2 times, most recently from 40fca5e to 5fe4718 Compare April 12, 2024 13:03

jvillafanez marked this pull request as ready for review April 15, 2024 09:07

kobergj requested changes Apr 17, 2024


jvillafanez added 11 commits April 19, 2024 14:47

feat: add runners to startup the ocis' services

ef32af6

refactor: reuse functions and name changes

da71059

fix: panic if there are duplicates in the group

0da6810

fix: ensure the task hasn't finished before interrupt it

6ddc0ad

fix: additional guarantees for concurrent calls

ff346c2

test: add unit tests

b6a6b61

feat: helper to ensure the task is interrupted and doesn't block

5ea30f7

test: unit tests for the helper

cb2e8e0

fix: ensure runners provide a result after being interrupted

df3c496

feat: make the interrupt duration optional and with a default

59051e2

docs: add changelog entry

0d5756b

jvillafanez force-pushed the servers_startup branch from a63bd8e to 0d5756b Compare April 19, 2024 12:47

rhafer previously requested changes Apr 22, 2024


fix: use custom timeout error if the runner times out

08c4763

micbar requested review from kobergj and rhafer April 22, 2024 13:00

rhafer reviewed Apr 22, 2024


dragotin reviewed Apr 24, 2024


jvillafanez mentioned this pull request Apr 25, 2024

List of os.Exit calls that need to be checked #8968

Open

kobergj requested changes Apr 26, 2024


fix: add group runner's timeout and make some channels buffered.

05f684a

jvillafanez force-pushed the servers_startup branch from 8dee314 to 05f684a Compare April 29, 2024 09:56

kobergj approved these changes Apr 29, 2024


kobergj merged commit d8cae78 into master Apr 29, 2024
3 checks passed

delete-merged-branch bot deleted the servers_startup branch April 29, 2024 11:52

ownclouders pushed a commit that referenced this pull request Apr 29, 2024

Merge pull request #8802 from owncloud/servers_startup

ad0990e

feat: add runners to startup the ocis' services

micbar mentioned this pull request Jun 19, 2024

6.0.0 Rolling Release #9391

Closed

24 tasks
