
Fix container downtime #622

Open
wants to merge 11 commits into base: main

Conversation

vrajashkr

Closes #272

This solution replaces the single list of all containers with a list of lists, where each inner list represents an independent set of linked containers.

For example:

Existing code (sorted order):

/k1 /k2 /k3 /l1 /l2 /l3 /t2 /t1

The solution in this PR (sorted order):

/k1 /k2 /k3  - an individual list containing one set of linked containers, with no dependency on any other list
/l1 /l2 /l3 
/t2 
/t1

Approach

Splitting the containers into independent sets lets the update work on one set at a time, so containers in the other sets keep running until it is their turn to be updated. This minimises downtime for any container that does not depend on a container in the set currently being processed.
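
To illustrate the grouping idea, here is a minimal sketch (not the PR's actual code) that treats linked containers as connected components of the link graph. The ctr type and its fields are hypothetical stand-ins for watchtower's container model:

package main

import "fmt"

// ctr is a hypothetical stand-in for watchtower's container type.
type ctr struct {
	name  string
	links []string // names of containers this one depends on
}

// splitIntoIndependentSets groups a dependency-sorted list into connected
// components of the link graph, so each set can be updated on its own.
func splitIntoIndependentSets(sorted []ctr) [][]ctr {
	// Union-find over container names.
	parent := map[string]string{}
	var find func(string) string
	find = func(n string) string {
		if parent[n] != n {
			parent[n] = find(parent[n])
		}
		return parent[n]
	}
	for _, c := range sorted {
		parent[c.name] = c.name
	}
	for _, c := range sorted {
		for _, l := range c.links {
			if _, ok := parent[l]; ok {
				parent[find(c.name)] = find(l) // union the two components
			}
		}
	}

	// Collect containers per component, preserving the sorted input order.
	groups := map[string][]ctr{}
	var order []string
	for _, c := range sorted {
		root := find(c.name)
		if len(groups[root]) == 0 {
			order = append(order, root)
		}
		groups[root] = append(groups[root], c)
	}
	sets := make([][]ctr, 0, len(order))
	for _, root := range order {
		sets = append(sets, groups[root])
	}
	return sets
}

func main() {
	sorted := []ctr{
		{name: "k1"}, {name: "k2", links: []string{"k1"}}, {name: "k3", links: []string{"k2"}},
		{name: "l1"}, {name: "l2", links: []string{"l1"}}, {name: "l3", links: []string{"l2"}},
		{name: "t2"}, {name: "t1"},
	}
	for _, set := range splitIntoIndependentSets(sorted) {
		var names []string
		for _, c := range set {
			names = append(names, c.name)
		}
		fmt.Println(names) // [k1 k2 k3], [l1 l2 l3], [t2], [t1]
	}
}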

Evaluation

The docker containers used to test the solution:

docker run -td --name t1 altariax0x01/mybuntu:latest bash
docker run -td --name t2 altariax0x01/mybuntu:latest bash

docker run -td --name l1 altariax0x01/mybuntu:latest bash
docker run -td --name l2 --link l1 altariax0x01/mybuntu:latest bash
docker run -td --name l3 --link l2 altariax0x01/mybuntu:latest bash

docker run -td --name k1 altariax0x01/mybuntu:latest bash
docker run -td --name k2 --link k1 altariax0x01/mybuntu:latest bash
docker run -td --name k3 --link k2 altariax0x01/mybuntu:latest bash

@vrajashkr vrajashkr requested a review from simskij as a code owner August 17, 2020 12:12
@vrajashkr
Author

I'd like to apologize for the additional commits. I made some Golang beginner mistakes. Whoops xD.

Any feedback and advice regarding the changes is appreciated :)

Thank you!

@simskij simskij requested a review from piksel August 21, 2020 19:15
@simskij
Member

simskij commented Aug 21, 2020

I will have a look as soon as I get some free time on my hands, but this definitely sounds like a really interesting change! 🥳

@simskij
Member

simskij commented Aug 21, 2020

An initial thought: if I understand the change correctly, this would mean we could put every list in its own go-routine. That has the potential of considerably lowering the time needed for each check/update cycle even further. Do you share this thought, @darkaether?

@vrajashkr
Author

I will have a look as soon as I get some free time on my hands, but this definitely sounds like a really interesting change!

Thank you!

An initial thought: if I understand the change correctly, this would mean we could put every list in its own go-routine. That has the potential of considerably lowering the time needed for each check/update cycle even further. Do you share this thought, @darkaether?

This is indeed an excellent idea! However, there are some additional concerns to address when execution becomes concurrent, as is the case with goroutines:

  1. The map used to store image IDs for cleanup is not thread-safe, so concurrent access to it could cause undefined behavior. Possible solutions include using a sync.Map or controlling access to the map with synchronisation primitives (sketched below).
  2. Whether the Docker daemon supports concurrent container launch/stop operations. I still need to look into this further.
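
As a minimal sketch of that first point (not watchtower code; processSet and the image IDs are purely illustrative), each independent set could run in its own goroutine while a plain sync.Mutex guards the shared cleanup map:

package main

import (
	"fmt"
	"sync"
)

// processSet is a hypothetical stand-in for updating one independent set of
// containers; it reports any image IDs that should be cleaned up afterwards.
func processSet(set []string, record func(imageID string)) {
	for _, name := range set {
		record("image-of-" + name) // illustrative only
	}
}

func main() {
	sets := [][]string{{"k1", "k2", "k3"}, {"l1", "l2", "l3"}, {"t2"}, {"t1"}}

	var (
		mu         sync.Mutex
		cleanupIDs = map[string]bool{} // shared across goroutines, guarded by mu
		wg         sync.WaitGroup
	)

	for _, set := range sets {
		wg.Add(1)
		go func(set []string) {
			defer wg.Done()
			processSet(set, func(imageID string) {
				mu.Lock() // serialise writes to the shared map
				defer mu.Unlock()
				cleanupIDs[imageID] = true
			})
		}(set)
	}
	wg.Wait()

	fmt.Println(len(cleanupIDs), "images queued for cleanup")
}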

@simskij
Member

simskij commented Aug 22, 2020

For sure! My suggested additional improvement is outside the scope of this PR in any case, but it sure is interesting! 😅

@Rush

Rush commented Sep 7, 2020

Any updates on this one? :)

@simskij
Member

simskij commented Sep 8, 2020

Any updates on this one? :)

I started to review this yesterday. Turns out the whole linking and dependency mechanism is borked to some extent (not because of this PR), so it might take a bit longer than I expected.

@piksel
Member

piksel commented Sep 8, 2020

Another thing is the lack of tests. I know that watchtower has far from 100% coverage (😁), but this is something that I think really should have some. Both to describe how the intended update strategy should work, and to make sure that it actually does what it aims to (without causing too much destruction in prod).

Also, I can't tell if this would cause multiple pulls of the same image if several "sets" use the same one. If so, maybe the images could be fetched before the set split?
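
To sketch that suggestion (the puller interface and pullDistinctImages helper below are assumptions, not watchtower APIs), each distinct image could be pulled once per cycle before the containers are split into sets:

package main

import (
	"context"
	"fmt"
)

// ctrInfo is a hypothetical stand-in holding a container's name and image reference.
type ctrInfo struct {
	name  string
	image string
}

// puller is a hypothetical interface for whatever performs the actual image pull.
type puller interface {
	PullImage(ctx context.Context, image string) error
}

type fakePuller struct{}

func (fakePuller) PullImage(ctx context.Context, image string) error {
	fmt.Println("pulling", image)
	return nil
}

// pullDistinctImages pulls every image referenced by the containers exactly once.
func pullDistinctImages(ctx context.Context, p puller, containers []ctrInfo) error {
	pulled := map[string]bool{}
	for _, c := range containers {
		if pulled[c.image] {
			continue // this image was already fetched earlier in the cycle
		}
		if err := p.PullImage(ctx, c.image); err != nil {
			return err
		}
		pulled[c.image] = true
	}
	return nil
}

func main() {
	containers := []ctrInfo{
		{name: "app1", image: "example/app:latest"},
		{name: "app2", image: "example/app:latest"},
		{name: "db", image: "example/db:latest"},
	}
	// Only two pulls happen here, even though three containers are listed.
	_ = pullDistinctImages(context.Background(), fakePuller{}, containers)
}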

@vrajashkr
Author

Another thing is the lack of tests. I know that watchtower has far from 100% coverage (grin), but this is something that I think really should have some. Both to describe how the intended update strategy should work, and to make sure that it actually does what it aims to (without causing too much destruction in prod).

Indeed. Tests are definitely needed. I'll look into the current set of tests and see how I can incorporate the cases I can think of.

Also, I can't tell if this would cause multiple pulls of the same image if several "sets" use the same one. If so, maybe the images could be fetched before the set split?

The modifications in this PR only affect the procedure for stopping and restarting containers. The check for whether a container is stale, and the image pull itself, shouldn't be affected by this PR.

// IsContainerStale pulls the container's image (unless pulling is disabled)
// and reports whether a newer image is available.
func (client dockerClient) IsContainerStale(container Container) (bool, error) {
	ctx := context.Background()

	if !client.pullImages {
		log.Debugf("Skipping image pull.")
	} else if err := client.PullImage(ctx, container); err != nil {
		return false, err
	}

	return client.HasNewImage(ctx, container)
}

According to this snippet, the image is checked in the IsContainerStale function. In update.go, this is called before the splitting happens (same as the existing implementation).

func Update(client container.Client, params types.UpdateParams) error {
	log.Debug("Checking containers for updated images")

	if params.LifecycleHooks {
		lifecycle.ExecutePreChecks(client, params)
	}

	containers, err := client.ListContainers(params.Filter)
	if err != nil {
		return err
	}

	for i, container := range containers {
		stale, err := client.IsContainerStale(container)
		if err != nil {
			log.Infof("Unable to update container %s. Proceeding to next.", containers[i].Name())
			log.Debug(err)
			stale = false
		}
		containers[i].Stale = stale
	}

	checkDependencies(containers)

	var dependencySortedGraphs [][]container.Container
	// splitting related code after this line

@vrajashkr
Author

The new commits modularize some of the steps in the Update function to allow for easier testing.
I've added comments to the new functions to hopefully make them easier to understand.

There are also new test cases added to demonstrate the proposed sorting logic.

Any feedback and suggestions are welcome! Thanks!

Member

@simskij simskij left a comment


LGTM. Will not merge this in the version we're cutting today, as I'd like it to remain in latest-dev for a while to make sure it's stable before releasing it to latest. So, don't merge until v1.0.3 has been released and any critical bugs in the new version have been resolved.

@vrajashkr
Author

LGTM. Will not merge this in the version we're cutting today, as I'd like it to remain in latest-dev for a while to make sure it's stable before releasing it to latest. So, don't merge until v1.0.3 has been released and any critical bugs have been resolved.

Thank you for taking the time to review this PR! I really do appreciate it.
Hopefully, there won't be any critical bugs, but if any do show up, I'm looking forward to crushing them :)

@simskij
Member

simskij commented Oct 3, 2020

Thank you for taking the time to review this PR! I really do appreciate it.
Hopefully, there won't be any critical bugs, but if any do show up, I'm looking forward to crushing them :)

Just to be clear, I meant critical bugs in the new version, not in your code 👍 😉

@vrajashkr
Author

Just to be clear, I meant critical bugs in the new version, not in your code

Ah I see xD. However, this is still a pretty significant change. Hopefully there aren't any obscure logic bugs in my code :)

@Rush

Rush commented Oct 3, 2020

Thank you for addressing the issue. I'll make sure to go back to using watchtower after it's released!

@simskij
Member

simskij commented Nov 20, 2020

Thank you so much for your contribution! 🎉

This PR has conflicts, however. Would you be willing to give it a go @darkaether, or would you prefer if I did it?

@vrajashkr
Author

vrajashkr commented Nov 22, 2020

Thank you so much for your contribution!

This PR has conflicts, however. Would you be willing to give it a go @darkaether, or would you prefer if I did it?

Hey, I had a go at resolving the conflicts; however, quite a bit has changed since I last worked on the code. I might need some time to go over things before I can push the fully merged and tested code.

@vrajashkr
Author

Hi again @simskij ! I've managed to fix the conflicts and made some convenience changes as well to play nicely with the existing rolling restart mechanism. I've yet to test this in a live environment; I will get back to you once I'm done with that.

@Rush

Rush commented Dec 14, 2020

Any updates on this PR?

@Rush

Rush commented Dec 15, 2020

Quick question. I am planning for some of my containers to have a hypothetically long wind-down time:

if err := client.StopContainer(c, 10*time.Minute); err != nil {

_ = client.waitForStopOrTimeout(c, timeout)

According to the code above, watchtower waits for up to 10 minutes for the container to cleanly shut down. This is cool!

But how will it work after this PR is merged? Will each container shutdown cycle be handled independently? What if we have 10 containers and each of them takes 10 minutes to shut down? Will it mess up the entire refresh cycle for all containers across the server?

@vrajashkr
Author

But how will it work after this PR is merged? Will each container shutdown cycle be handled independently? What if we have 10 containers and each of them takes 10 minutes to shut down? Will it mess up the entire refresh cycle for all containers across the server?

Hey there @Rush! This is an interesting question. In my opinion, the only change this PR makes is to the order in which the containers are stopped and restarted. Theoretically, this shouldn't affect the actual process of stopping a container or the timeouts.

@simskij
Member

simskij commented Dec 21, 2020

@darkaether Sorry for the delay, I wanted to get #674 merged before touching this. As soon as it's stable I'll make sure to merge this. If there are conflicts by then, I'll make sure to fix them.

Thanks again for your patience 🙇

@Rush

Rush commented Dec 30, 2020

Looks like #674 is merged - can we have this one merged now? :)

@simskij simskij closed this Mar 9, 2021
@simskij simskij deleted the branch containrrr:main March 9, 2021 13:28
@simskij simskij reopened this Mar 9, 2021
@simskij simskij changed the base branch from master to main March 9, 2021 18:58
@simskij
Member

simskij commented Apr 9, 2021

This PR is my bad conscience. Sorry for that. We definitely want to have this merged as soon as possible.

@simskij simskij self-assigned this Apr 9, 2021
@simskij simskij added this to the v1.3.0 milestone Apr 9, 2021
@piksel
Member

piksel commented Apr 25, 2021

So, I have been looking at rebasing this to the current main, and my thoughts are as follows:
These are the two current ways of updating containers:

  • Dependency Sorted Updates
    Pros: Handles linked containers
    Cons: Long downtime (gray boxes in graph)
    Graph: mermaid-diagram-20210425130041 (image attachment)

  • Rolling Updates
    Pros: Minimum downtime
    Cons: Cannot be used with linked containers
    Graph: mermaid-diagram-20210425130047 (image attachment)

Each has its use and its downside, but compare that to this PR's "tree based" update:

  • Dependency Tree Updates
    Pros: Minimum downtime, Handles linked containers
    Cons: ??
    Graph: mermaid-diagram-20210425130051 (image attachment)

If I'm not missing anything here, I would argue that this could totally replace the current methods of updating. As such, my current rebase/refactor removes rolling restarts entirely. Any thoughts, @simskij @darkaether?

@piksel
Member

piksel commented Apr 25, 2021

Okay, so, the update graph made in this PR was a lot simpler than I initially thought and currently has some problems. First of all, the links are made bidirectional, which means that even if a container only has links to, say, its database container, that database container would also be restarted every time the "main" app container was restarted. Add to this that other containers may link to the database, and it will still bring down a lot more than necessary.
This could still be added as a third update option, but I would rather modify this to return a dependency tree.

Never mind, the only real benefit of using the tree is that you could start/stop multiple containers in parallel (since it can tell the difference between "siblings" and dependencies). But even if that is possible to do with goroutines, the gain is likely very small (and the complexity increase quite high).
Also, the "what needs updating" calculation is still performed by the checkDependencies method; the container link map is only used for building the graph itself.

The only problem that remains is that the sorting is done on the list that is filtered for "should update", which means that a stray "monitor-only" container would mess up the graph, either leading to a crash, or an update just being ignored without anything being reported to the user.

Also, the current tests are a bit... verbose. I think adding a helper that takes a dot diagraph like

digraph test1 {
    A -> B -> C;
    B -> D;
    E -> C;
}

and an expected stopping order (EABC for updating C) would make more complex dependency tests easy to add (and read/visualize in graphviz ofc).
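
As a rough illustration of what such a helper could look like (a sketch only, not tied to watchtower's types; the parsing and the deterministic ordering are assumptions):

package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// parseDigraph extracts dependency edges from a minimal dot-style digraph,
// where "A -> B" means container A links to (depends on) container B.
// It assumes well-formed input and is only meant as a test-helper sketch.
func parseDigraph(src string) map[string][]string {
	deps := map[string][]string{}
	body := regexp.MustCompile(`\{([^}]*)\}`).FindStringSubmatch(src)[1]
	for _, stmt := range strings.Split(body, ";") {
		var nodes []string
		for _, n := range strings.Split(stmt, "->") {
			if n = strings.TrimSpace(n); n != "" {
				nodes = append(nodes, n)
			}
		}
		for i := 0; i+1 < len(nodes); i++ { // "A -> B -> C" yields A->B and B->C
			deps[nodes[i]] = append(deps[nodes[i]], nodes[i+1])
		}
	}
	return deps
}

// stopOrderFor returns one valid order in which containers must be stopped so
// that target can be updated: transitive dependents first, target last.
func stopOrderFor(deps map[string][]string, target string) []string {
	var order []string
	visited := map[string]bool{}
	var visit func(string)
	visit = func(n string) {
		if visited[n] {
			return
		}
		visited[n] = true
		var dependents []string // containers that link to n
		for parent, links := range deps {
			for _, l := range links {
				if l == n {
					dependents = append(dependents, parent)
				}
			}
		}
		sort.Strings(dependents) // deterministic order for test assertions
		for _, d := range dependents {
			visit(d)
		}
		order = append(order, n)
	}
	visit(target)
	return order
}

func main() {
	deps := parseDigraph(`digraph test1 { A -> B -> C; B -> D; E -> C; }`)
	// Prints ABEC: one valid stopping order for updating C (EABC is another).
	fmt.Println(strings.Join(stopOrderFor(deps, "C"), ""))
}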

@simskij
Member

simskij commented Apr 27, 2021

Also, the current tests are a bit... verbose. I think adding a helper that takes a dot digraph and an expected stopping order (EABC for updating C) would make more complex dependency tests easy to add (and read/visualize in graphviz ofc).

I agree with you regarding the digraphs. @darkaether, is this something you'd feel comfortable doing? If not, let us know and we'll try to assist as much as possible.

@piksel
Member

piksel commented Apr 28, 2021

Okay, so I have made a really messy merge in:
https://github.com/containrrr/watchtower/compare/feat/graph-updates
with the addition of ensureUpdateAllowedByLabels, which is meant to check whether a monitor-only label somewhere in the dependency chain would prevent the update from being performed. This would never happen right now, since the sorting is done on the already filtered list. The aim is to make a test for exactly that and then alter the behaviour to handle it.
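
Not the code in that branch, but a rough sketch of the kind of check being described; the node type and monitorOnly flag are hypothetical, and the link graph is assumed to be acyclic (as Docker links are):

package main

import "fmt"

// node is a hypothetical container with its linked dependencies.
type node struct {
	name        string
	monitorOnly bool    // e.g. derived from a monitor-only label
	deps        []*node // containers this one links to
}

// updateAllowed reports whether n and everything it depends on may be stopped
// and recreated, i.e. no monitor-only container sits in the dependency chain.
func updateAllowed(n *node) (bool, string) {
	if n.monitorOnly {
		return false, n.name
	}
	for _, d := range n.deps {
		if ok, blocker := updateAllowed(d); !ok {
			return false, blocker
		}
	}
	return true, ""
}

func main() {
	db := &node{name: "db", monitorOnly: true}
	app := &node{name: "app", deps: []*node{db}}
	if ok, blocker := updateAllowed(app); !ok {
		fmt.Printf("skipping update of %s: %s is monitor-only\n", app.name, blocker)
	}
}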

It currently passes all tests, but that's not to say that it works correctly. The next step would be to make the tests easier to read, have fewer dependencies on each other (there were tests that shared the same client and mutated it), and include handling of monitor-only containers.

@Rush

Rush commented Jul 27, 2021

Soon there will be a 1-year anniversary of this PR - I wonder if maybe we can make it before that? 🐸

@piksel
Member

piksel commented Jul 29, 2021

@Rush It doesn't work correctly the way it's done in the PR, due to:

The only problem that remains is that the sorting is done on the list that is filtered for "should update", which means that a stray "monitor-only" container would mess up the graph, either leading to a crash, or an update just being ignored without anything being reported to the user.

and no one seems to want to work on it. Personally, I don't have these problems and don't feel confident enough in my simulations to do a good job at implementing it.

If this was merged as is, it would solve some scenarios, but cause currently working ones to fail. That is what we want to avoid and why this PR has taken such a long time.

@vrajashkr
Author

Hi everyone!
Firstly, I sincerely apologize for my lack of response. I completely missed all my GitHub notifications.
It's been quite a while since I worked on this, and I've become very rusty in Golang.

Unfortunately, I have quite a number of things on my plate at the moment, so I might not be able to set aside enough time to contribute much to this PR.

If there is someone willing to take this PR to completion, that would be great! Otherwise, I'll be sure to revisit it in the near future once I get a bit of time on my hands.

Regarding the technical aspect,

The only problem that remains is that the sorting is done on the list that is filtered for "should update", which means that a stray "monitor-only" container would mess up the graph, either leading to a crash, or an update just being ignored without anything being reported to the user.

I agree with this. It was a case that I did not consider during implementation.

Do we have a set of sample scenarios that we could document here to help anyone taking this forward understand the expected functionality? (It would probably help with writing the test cases too.)

@piksel
Member

piksel commented Aug 8, 2021

Do we have a set of sample scenarios that we could document here to help anyone taking this forward understand the expected functionality? (It would probably help with writing the test cases too.)

No, that's why I suggested the digraphs that could be used to create test cases "without code". It's much easier to find and test for the different scenarios when they can be visualized, imho.

a stray "monitor-only" container would mess up the graph

Another question here is of course: what did the user intend?
The most obvious thing to do for these containers (part of a dependency chain, but "monitor only") is to stop & restart them, without any removal/recreation (if possible).
It would still potentially interrupt a transaction etc. and be just the thing the user wanted to prevent... 🤷‍♀️

@Rush

Rush commented Aug 16, 2021

FYI. I have published @darkaether's work at https://hub.docker.com/r/rushpl/watchtower.

Just start it like you would start watchtower normally:

For example

docker run --restart=always -d --name watchtower -v /var/run/docker.sock:/var/run/docker.sock -v /root/.docker/config.json:/config.json rushpl/watchtower --interval 60 --label-enable

Based on some preliminary tests it seems to work well for my use-case.

@Rush

Rush commented Aug 16, 2021

@darkaether Actually, a small problem I see is that if I run 4 containers with the same image, I get 4 separate notifications for "Found new rushpl/test:latest image". The notifications seem to trigger at slightly different times, which may indicate that there were 4 separate checks done for the very same image.

@simskij
Member

simskij commented Sep 29, 2021

@piksel, maybe we could schedule a session to finally get this tested, fixed, and merged? I think it makes sense if we collab on it, as it will require some thorough collective thinking to get right, given our previous conversations around this.

@Rush

Rush commented Dec 21, 2021

Bump :) maybe we could get this done before year's end?

@Rush

Rush commented May 26, 2022

Another bump 5 months later :)

Successfully merging this pull request may close these issues.

Long downtime during restart of multiple containers that are based on the same image