Fix container downtime #622
base: main
Conversation
I'd like to apologize for the additional commits. I made some Golang beginner mistakes. Whoops xD. Any feedback and advice regarding the changes is appreciated :) Thank you!
I will have a look as soon as I get some free time on my hands, but this definitely sounds like a really interesting change! 🥳
An initial thought: if I understand the change correctly, this would mean we could put every list in its own go-routine. That has the potential of considerably lowering the time needed for each check/update cycle even further. Do you share this thought, @darkaether?
Thank you!
This is indeed an excellent idea! However, there are some additional concerns to address when execution becomes concurrent, as is the case with goroutines:
For sure! My suggested additional improvement is outside the scope of this PR in any case, but it sure is interesting! 😅
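For illustration, here is a minimal sketch of the goroutine-per-set idea mentioned above, assuming the independent sets have already been computed. Everything below is hypothetical (the `processSet` callback and the package/import layout are assumptions, not code from this PR), and, as noted, it is outside the scope of this PR.

```go
package actions

import (
	"sync"

	log "github.com/sirupsen/logrus"

	"github.com/containrrr/watchtower/pkg/container"
)

// processSetsConcurrently is a hypothetical sketch: each independent set of
// linked containers gets its own goroutine, and the call returns once every
// set has been processed. processSet stands in for the per-set update logic.
func processSetsConcurrently(sets [][]container.Container, processSet func([]container.Container) error) {
	var wg sync.WaitGroup
	for _, set := range sets {
		wg.Add(1)
		go func(set []container.Container) {
			defer wg.Done()
			if err := processSet(set); err != nil {
				log.Error(err)
			}
		}(set)
	}
	wg.Wait()
}
```

Error handling, concurrent Docker API usage, and pulls of images shared between sets would still need the care discussed above.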
Any updates on this one? :)
I started to review this yesterday. Turns out the whole linking- and dependency mechanism is borked to some extent (not because of this PR), so it might be a bit longer than I expected.
Another thing is the lack of tests. I know that watchtower has far from 100% coverage (😁), but this is something that I think really should have some. Both to describe how the intended update strategy should work, and to make sure that it actually does what it aims to (without causing too much destruction in prod). Also, I can't tell if this would cause multiple pulls of the same image if several "sets" use the same one. If so, maybe the images could be fetched before the set split?
Indeed. Tests are definitely needed. I'll look into the current set of tests and see how I can incorporate the cases I can think of.
The modifications this PR makes only affect the procedure of stopping and restarting containers. The check for whether a container is stale and whether its image has been updated shouldn't be affected by this PR:

```go
func (client dockerClient) IsContainerStale(container Container) (bool, error) {
	ctx := context.Background()

	if !client.pullImages {
		log.Debugf("Skipping image pull.")
	} else if err := client.PullImage(ctx, container); err != nil {
		return false, err
	}

	return client.HasNewImage(ctx, container)
}
```

According to this snippet, the image is checked in the `Update` function before any of the splitting logic runs:

```go
func Update(client container.Client, params types.UpdateParams) error {
	log.Debug("Checking containers for updated images")

	if params.LifecycleHooks {
		lifecycle.ExecutePreChecks(client, params)
	}

	containers, err := client.ListContainers(params.Filter)
	if err != nil {
		return err
	}

	for i, container := range containers {
		stale, err := client.IsContainerStale(container)
		if err != nil {
			log.Infof("Unable to update container %s. Proceeding to next.", containers[i].Name())
			log.Debug(err)
			stale = false
		}
		containers[i].Stale = stale
	}

	checkDependencies(containers)

	var dependencySortedGraphs [][]container.Container
	// splitting related code after this line
```
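To make the intended flow more concrete, here is a hedged sketch of what the per-set processing after that point could look like. This is not the PR's actual code: `stopSet` and `restartSet` are hypothetical stand-ins for the real stop/restart logic, and the import paths are assumptions based on the repository layout.

```go
package actions

import (
	log "github.com/sirupsen/logrus"

	"github.com/containrrr/watchtower/pkg/container"
)

// updateOneSetAtATime is a hypothetical sketch: each independent,
// dependency-sorted set is stopped and restarted on its own, so containers
// in the other sets keep running until their set is processed.
func updateOneSetAtATime(
	sets [][]container.Container,
	stopSet func([]container.Container) error,
	restartSet func([]container.Container) error,
) {
	for _, set := range sets {
		if err := stopSet(set); err != nil {
			log.Error(err)
			continue
		}
		if err := restartSet(set); err != nil {
			log.Error(err)
		}
	}
}
```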
The new commits modularize some of the steps in the Update function to allow for easier testing. There are also new test cases added to demonstrate the proposed sorting logic. Any feedback and suggestions are welcome! Thanks! |
LGTM. Will not merge this in the version we're cutting today, as I'd like it to remain in `latest-dev` for a while to make sure it's stable before releasing it to `latest`. So, don't merge until v1.0.3 has been released and any critical bugs in the new version have been resolved.
Thank you for taking the time to review this PR! I really do appreciate it.
Just to be clear, I meant critical bugs in the new version, not in your code 👍 😉
Ah I see xD. However, this is still a pretty significant change. Hopefully there aren't any obscure logic bugs in my code :)
Thank you for addressing the issue. I'll make sure to go back to using watchtower after it's released!
Thank you so much for your contribution! 🎉 This PR has conflicts, however. Would you be willing to give it a go @darkaether, or would you prefer if I did it?
Hey, I had a go at solving the conflicts; however, quite a bit has changed since the time I last worked on the code. I might need some time to go over things before I can push the fully merged and tested code.
Hi again @simskij! I've managed to fix the conflicts and made some convenience changes as well to play nicely with the existing rolling restart mechanism. I have yet to test this on a live environment. Will get back to you once I'm done with that.
Any updates on this PR?
Quick question. I am planning for some of my containers to have a hypothetically long wind-down time.

watchtower/internal/actions/check.go, line 50 in cb62b16
watchtower/pkg/container/client.go, line 158 in 0f06539

According to the code above, watchtower waits for up to 10 minutes for the container to cleanly shut down. This is cool! But how will it work after this PR is merged? Will each container shutdown cycle be handled independently? What if we have 10 containers and each of them takes 10 minutes to shut down? Will it mess up the entire refresh cycle for all containers across the server?
Hey there @Rush! This is an interesting question. In my opinion, the only change this PR makes is to the order in which the containers are stopped and restarted. Theoretically, this shouldn't affect the actual process of stopping a container or the timeouts.
@darkaether Sorry for the delay, I wanted to get #674 merged before touching this. As soon as it's stable I'll make sure to merge this. If there are conflicts by then, I'll make sure to fix them. Thanks again for your patience 🙇
Looks like #674 is merged - can we have this one merged now? :)
This PR is my bad conscience. Sorry for that. We definitely want to have this merged as soon as possible.
So, I have been looking at rebasing this to the current main, and my thoughts are as follows:
Each has its uses and downsides, but compare that to this PR's "tree-based" update:
If I'm not missing anything here, I would argue that this would totally replace the current methods of updating. As such, my current rebase/refactor removes rolling restarts wholly. Any thoughts @simskij @darkaether?
Never mind, the only real benefit of using the tree is that you could start/stop multiple containers in parallel (since it can tell the difference between "siblings" and dependencies). But even if that is possible to do with goroutines, the gain is likely very small (and the complexity increase quite high). The only problem that remains is that the sorting is done on the list that is filtered for "should update", which means that a stray "monitor-only" container would mess up the graph, either leading to a crash, or an update just being ignored without anything being reported to the user.

Also, the current tests are a bit... verbose. I think adding a helper that takes a dot digraph like

```
digraph test1 {
    A -> B -> C;
    B -> D;
    E -> C;
}
```

and an expected stopping order (…)
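As a rough illustration of that helper idea, here is a sketch of a tiny parser for the subset of dot syntax shown above. The function name is hypothetical, and a real helper would additionally create mock containers with links matching the edges and assert the computed stopping order against the expected one:

```go
package actions

import (
	"regexp"
	"strings"
)

// parseDigraphEdges is a hypothetical test helper: it extracts the
// (from, to) edges from a tiny subset of dot digraph syntax, such as
// "digraph test1 { A -> B -> C; B -> D; E -> C; }".
func parseDigraphEdges(digraph string) [][2]string {
	body := digraph
	// Drop the "digraph name {" header and the closing brace.
	if i := strings.Index(body, "{"); i >= 0 {
		body = body[i+1:]
	}
	body = strings.TrimSuffix(strings.TrimSpace(body), "}")

	nonWord := regexp.MustCompile(`\W+`)
	var edges [][2]string
	for _, stmt := range strings.Split(body, ";") {
		// "A -> B -> C" yields the edges (A, B) and (B, C).
		var nodes []string
		for _, part := range strings.Split(stmt, "->") {
			if name := nonWord.ReplaceAllString(part, ""); name != "" {
				nodes = append(nodes, name)
			}
		}
		for i := 0; i+1 < len(nodes); i++ {
			edges = append(edges, [2]string{nodes[i], nodes[i+1]})
		}
	}
	return edges
}
```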
I agree with you regarding the digraphs. @darkaether, is this something you'd feel comfortable doing? If not, let us know and we'll try to assist as much as possible.
Okay, so I have made a really messy merge in: It currently passes all tests, but that's not saying that it works correctly. Next step would be to make the tests easier to read, have fewer dependencies on each other (there were tests that shared the same client and mutated it), and include handling of monitor-only containers.
Soon there will be a one-year anniversary of this PR - I wonder if maybe we can make it before that? 🐸
@Rush It doesn't work correctly the way it's done in the PR, due to:
and no one seems to want to work on it. Personally, I don't have these problems and don't feel confident enough in my simulations to do a good job at implementing it. If this were merged as is, it would solve some scenarios, but cause currently working ones to fail. That is what we want to avoid and why this PR has taken such a long time.
Hi everyone! Unfortunately, I have quite a number of things on my plate at the moment so I might not be able to cut out enough time to contribute much to this PR. If there is someone willing to take this PR to completion, that would be great! Otherwise, I'll be sure to revisit it in the near future once I get a bit of time on my hands. Regarding the technical aspect,
I agree with this. It was a case that I did not consider during implementation. Do we have a set of sample scenarios that we could document here to help anyone taking this forward understand the expected functionality (it would probably help with writing the test cases too)?
No, that's why I suggested the digraphs that could be used to create test cases "without code". It's much easier to find and test for the different scenarios when they can be visualized imho.
Another question here is of course: what did the user intend?
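To make one possible interpretation concrete (purely a hedged sketch, not a decision on what the user intends): sort the full, unfiltered container list so that links stay intact, and only drop the containers that must not be recreated, such as monitor-only ones, when it is time to act on each set. The function and the `allowUpdate` predicate are hypothetical.

```go
package actions

import (
	"github.com/containrrr/watchtower/pkg/container"
)

// keepUpdatableOnly is a hypothetical sketch: the dependency graph is built
// over all containers, and this step later removes the containers that should
// only be monitored, so they never get stopped or recreated.
func keepUpdatableOnly(sets [][]container.Container, allowUpdate func(container.Container) bool) [][]container.Container {
	result := make([][]container.Container, 0, len(sets))
	for _, set := range sets {
		var updatable []container.Container
		for _, c := range set {
			if allowUpdate(c) { // e.g. not marked monitor-only (assumed predicate)
				updatable = append(updatable, c)
			}
		}
		if len(updatable) > 0 {
			result = append(result, updatable)
		}
	}
	return result
}
```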
FYI. I have published @darkaether's work at https://hub.docker.com/r/rushpl/watchtower. Just start it like you would start watchtower normally, for example:
Based on some preliminary tests it seems to work well for my use-case.
@darkaether Actually, a small problem I see is that if I run 4 containers with the same image I get 4 separate notifications for "Found new rushpl/test:latest image". The notifications seem to trigger at slightly different times, which may indicate that there were 4 separate checks done for the very same image.
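For what it's worth, here is a hedged sketch of the deduplication idea discussed earlier (checking each image only once and reusing the result for every container that shares it). This is not code from the PR; `ImageName` is assumed to be the container's image-name accessor, and the import path is an assumption.

```go
package actions

import (
	"github.com/containrrr/watchtower/pkg/container"
)

// markStaleOncePerImage is a hypothetical sketch: the staleness check (and
// therefore the image pull and its notification) runs once per image name,
// and the cached result is reused for every container sharing that image.
func markStaleOncePerImage(containers []container.Container, isStale func(container.Container) (bool, error)) error {
	staleByImage := map[string]bool{}
	for i, c := range containers {
		image := c.ImageName() // assumed accessor for the container's image name
		stale, checked := staleByImage[image]
		if !checked {
			var err error
			if stale, err = isStale(c); err != nil {
				return err
			}
			staleByImage[image] = stale
		}
		containers[i].Stale = stale
	}
	return nil
}
```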
@piksel, maybe we could schedule a session to finally get this tested, fixed, and merged? I think it makes sense if we collab on it, as it will require some thorough collective thinking to get right, given our previous conversations around this.
Bump :) Maybe we could get this done before year's end?
Another bump 5 months later :)
Closes #272
The approach used in this solution replaces the single list of all containers with a list of lists, where each inner list represents an independent set of linked containers.
For example:
Existing code (sorted order):
The solution in this PR (sorted order):
Approach
This splitting of the containers into independent sets allows the update to work on only one set at a time. This ensures that the other containers are kept running until it is their time to be updated, which minimises the downtime for any container that does not depend on a container in the set currently being processed.
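To illustrate with hypothetical container names (the example lists from the original description are not reproduced here), suppose app1 links to db1 and app2 links to db2:

```go
package main

import "fmt"

func main() {
	// Existing behaviour: one flat, dependency-sorted list, e.g.
	//   [db1, db2, app1, app2]
	// where every stale container is stopped before any is restarted.
	//
	// This PR: a list of independent, dependency-sorted sets, e.g.
	//   [[db1, app1], [db2, app2]]
	// processed one at a time, so db2 and app2 keep running while the
	// first set is being updated.
	sets := [][]string{
		{"db1", "app1"},
		{"db2", "app2"},
	}
	fmt.Println(sets)
}
```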
Evaluation
The docker containers used to test the solution: