Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graceful Shutdown #6602

Closed
rhafer opened this issue Jun 22, 2023 · 7 comments · Fixed by #6840
Closed

Graceful Shutdown #6602

rhafer opened this issue Jun 22, 2023 · 7 comments · Fixed by #6840
Assignees
Labels
Interaction:Blue-Ticket Priority:p2-high Escalation, on top of current planning, release blocker Type:Epic Epic is the parent of user stories

Comments

@rhafer
Copy link
Contributor

rhafer commented Jun 22, 2023

Describe the bug

When running in single binary mode, sending a SIGTERM (or SIGINT or SIGQUIT) to the ocis process will trigger a signal handlers for each reva service (https://github.com/cs3org/reva/blob/master/cmd/revad/internal/grace/grace.go#L265) where the first Handler that finishes with just call os.Exit() and end the complete ocis process.

We need to find a way to cleanly shutdown an ocis, finishing all in-flight request in order to prevent things like: https://github.com/owncloud/enterprise/issues/5783 from happening during regluar operation.

@micbar micbar added the Priority:p2-high Escalation, on top of current planning, release blocker label Jun 22, 2023
@micbar micbar moved this from Qualification to Prio 2 in Infinite Scale Team Board Jun 22, 2023
@rhafer
Copy link
Contributor Author

rhafer commented Jun 22, 2023

Interesting read on this topic: https://www.rudderstack.com/blog/implementing-graceful-shutdown-in-go/

@micbar micbar added Type:Epic Epic is the parent of user stories and removed Type:Bug labels Jun 26, 2023
@exalate-issue-sync exalate-issue-sync bot changed the title ocis can't gracefully shutdown Ocis can shutdown gracefully Jun 26, 2023
@exalate-issue-sync exalate-issue-sync bot added p2-high and removed Priority:p2-high Escalation, on top of current planning, release blocker labels Jun 26, 2023
@exalate-issue-sync exalate-issue-sync bot changed the title Ocis can shutdown gracefully Graceful Shutdown Jun 26, 2023
@exalate-issue-sync exalate-issue-sync bot added the Priority:p2-high Escalation, on top of current planning, release blocker label Jun 26, 2023
@rhafer
Copy link
Contributor Author

rhafer commented Jun 26, 2023

We should probably first tackle this issue for the case when we're not using the ocis runtime. (I.e. fix the case when running the service as separated processes. This should at least avoid the race condition with the signal handlers of the different reva services.

We then just need to make the graceful shutdown behavior, which is currently triggered with a SIGQUIT, the default behavior for SIGTERM and SIGINT. (maybe based on a config setting). And get rid of those os.Exit() calls.

In second step we could then figure out how to disable the signal handlers in reva and trigger a proper reva shutdown from within the ocis runtime.

@micbar
Copy link
Contributor

micbar commented Jun 27, 2023

@rhafer I agree, we should focus on the single service runtime first.

IMO we could even look only on the storage users and storage system services for now.

Another idea might be to gather some information how severe the problem shows up in real life. We could run the inspector on the load test instance on the VP

@wkloucek Is that system still there?

@rhafer
Copy link
Contributor Author

rhafer commented Jun 27, 2023

To scan all personal and project spaces this might be helpful:

for dir in $(find  /var/lib/ocis/storage/users/spaces  -mindepth 2 -maxdepth 2); do
    idpath=${dir#/var/lib/ocis/storage/users/spaces}
    id=${idpath///}
    ocis decomposedfs check-treesize -r /var/lib/ocis/storage/users/ -n $id
done

@rhafer rhafer moved this from Prio 2 to In progress in Infinite Scale Team Board Jul 11, 2023
@rhafer rhafer self-assigned this Jul 11, 2023
@wkloucek
Copy link
Contributor

@rhafer I agree, we should focus on the single service runtime first.

IMO we could even look only on the storage users and storage system services for now.

There are also more reasons to NOT invest too much in the runtime: #6819

rhafer added a commit that referenced this issue Jul 18, 2023
Add STORAGE_USERS_GRACEFUL_SHUTDOWN_TIMEOUT setting to allow an graceful
shutdown of the storage-users service. This currently only applicable
when running storage-user as a sepearate service.

Setting STORAGE_USERS_GRACEFUL_SHUTDOWN_TIMEOUT to a non-zero value
gives the storage-users service a chance to cleanly shutdown and finish
any in progess tasks (e.g. metadata propagation) before exiting.

Partial-Fix: #6602
@rhafer
Copy link
Contributor Author

rhafer commented Jul 18, 2023

There are also more reasons to NOT invest too much in the runtime: #6819

The main problem here is the reva runtime on the ocis one.

@rhafer
Copy link
Contributor Author

rhafer commented Jul 18, 2023

#6840 together with cs3org/reva#4072 should allow a more graceful shutdown of the storage-users service, when running it standalone. It might make sense to introduce the *_GRACEFUL_SHUTDOWN_TIMEOUT setting to other reva-backed services as well (e.g. storage-system)

To do a proper shutdown when running everything in one process, we'll need to get rid of the exit calls inside reva and initiate a proper shutdown from ocis in a different way. (e.g. using the mechanism outlined in the above referenced blog post).

rhafer added a commit that referenced this issue Jul 18, 2023
Add STORAGE_USERS_GRACEFUL_SHUTDOWN_TIMEOUT setting to allow an graceful
shutdown of the storage-users service. This currently only applicable
when running storage-user as a sepearate service.

Setting STORAGE_USERS_GRACEFUL_SHUTDOWN_TIMEOUT to a non-zero value
gives the storage-users service a chance to cleanly shutdown and finish
any in progess tasks (e.g. metadata propagation) before exiting.

Partial-Fix: #6602
rhafer added a commit that referenced this issue Jul 18, 2023
Add STORAGE_USERS_GRACEFUL_SHUTDOWN_TIMEOUT setting to allow a graceful
shutdown of the storage-users service. This currently only applicable
when running storage-user as a sepearate service.

Setting STORAGE_USERS_GRACEFUL_SHUTDOWN_TIMEOUT to a non-zero value
gives the storage-users service a chance to cleanly shutdown and finish
any in progess tasks (e.g. metadata propagation) before exiting.

Partial-Fix: #6602
rhafer added a commit that referenced this issue Jul 18, 2023
Add STORAGE_USERS_GRACEFUL_SHUTDOWN_TIMEOUT setting to allow a graceful
shutdown of the storage-users service. This currently only applicable
when running storage-user as a sepearate service.

Setting STORAGE_USERS_GRACEFUL_SHUTDOWN_TIMEOUT to a non-zero value
gives the storage-users service a chance to cleanly shutdown and finish
any in progess tasks (e.g. metadata propagation) before exiting.

Partial-Fix: #6602
@github-project-automation github-project-automation bot moved this from In progress to Done in Infinite Scale Team Board Jul 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Interaction:Blue-Ticket Priority:p2-high Escalation, on top of current planning, release blocker Type:Epic Epic is the parent of user stories
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants