
Fix container downtime #622

Open
wants to merge 11 commits into base: main

Conversation

vrajashkr

Closes #272

This solution replaces the single list of all containers with a list of lists, where each inner list represents an independent set of linked containers.

For example:

Existing code (sorted order):

/k1 /k2 /k3 /l1 /l2 /l3 /t2 /t1

The solution in this PR (sorted order):

/k1 /k2 /k3  - an individual list containing one set of linked containers, with no dependency on any other list
/l1 /l2 /l3 
/t2 
/t1

Approach

Splitting the containers into independent sets lets the update work on one set at a time, so containers in the other sets keep running until it is their turn to be updated. This minimises downtime for any container that does not depend on a container in the set currently being processed.
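
To illustrate the grouping idea, here is a minimal sketch (not the PR's actual code) that treats linked containers as connected components of the link graph. The ctr type and its fields are hypothetical stand-ins for watchtower's container model:

package main

import "fmt"

// ctr is a hypothetical stand-in for watchtower's container type.
type ctr struct {
	name  string
	links []string // names of containers this one depends on
}

// splitIntoIndependentSets groups a dependency-sorted list into connected
// components of the link graph, so each set can be updated on its own.
func splitIntoIndependentSets(sorted []ctr) [][]ctr {
	// Union-find over container names.
	parent := map[string]string{}
	var find func(string) string
	find = func(n string) string {
		if parent[n] != n {
			parent[n] = find(parent[n])
		}
		return parent[n]
	}
	for _, c := range sorted {
		parent[c.name] = c.name
	}
	for _, c := range sorted {
		for _, l := range c.links {
			if _, ok := parent[l]; ok {
				parent[find(c.name)] = find(l) // union the two components
			}
		}
	}

	// Collect containers per component, preserving the sorted input order.
	groups := map[string][]ctr{}
	var order []string
	for _, c := range sorted {
		root := find(c.name)
		if len(groups[root]) == 0 {
			order = append(order, root)
		}
		groups[root] = append(groups[root], c)
	}
	sets := make([][]ctr, 0, len(order))
	for _, root := range order {
		sets = append(sets, groups[root])
	}
	return sets
}

func main() {
	sorted := []ctr{
		{name: "k1"}, {name: "k2", links: []string{"k1"}}, {name: "k3", links: []string{"k2"}},
		{name: "l1"}, {name: "l2", links: []string{"l1"}}, {name: "l3", links: []string{"l2"}},
		{name: "t2"}, {name: "t1"},
	}
	for _, set := range splitIntoIndependentSets(sorted) {
		var names []string
		for _, c := range set {
			names = append(names, c.name)
		}
		fmt.Println(names) // [k1 k2 k3], [l1 l2 l3], [t2], [t1]
	}
}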

Evaluation

The docker containers used to test the solution:

docker run -td --name t1 altariax0x01/mybuntu:latest bash
docker run -td --name t2 altariax0x01/mybuntu:latest bash

docker run -td --name l1 altariax0x01/mybuntu:latest bash
docker run -td --name l2 --link l1 altariax0x01/mybuntu:latest bash
docker run -td --name l3 --link l2 altariax0x01/mybuntu:latest bash

docker run -td --name k1 altariax0x01/mybuntu:latest bash
docker run -td --name k2 --link k1 altariax0x01/mybuntu:latest bash
docker run -td --name k3 --link k2 altariax0x01/mybuntu:latest bash

@vrajashkr vrajashkr requested a review from simskij as a code owner August 17, 2020 12:12
@vrajashkr
Author

I'd like to apologize for the additional commits. I made some Golang beginner mistakes. Whoops xD.

Any feedback and advice regarding the changes is appreciated :)

Thank you!

@simskij simskij requested a review from piksel August 21, 2020 19:15
@simskij
Member

simskij commented Aug 21, 2020

I will have a look as soon as I get some free time on my hands, but this definitely sounds like a really interesting change! 🥳

@simskij
Member

simskij commented Aug 21, 2020

An initial thought: if I understand the change correctly, this would mean we could put every list in its own go-routine. That has the potential of considerably lowering the time needed for each check/update cycle even further. Do you share this thought, @darkaether?

@vrajashkr
Author

I will have a look as soon as I get some free time on my hands, but this definitely sounds like a really interesting change!

Thank you!

An initial thought: if I understand the change correctly, this would mean we could put every list in its own go-routine. That has the potential of considerably lowering the time needed for each check/update cycle even further. Do you share this thought, @darkaether?

This is indeed an excellent idea! However, there are some additional concerns to address when execution becomes concurrent, as is the case with goroutines:

  1. The map used to store image IDs for cleanup is not thread-safe, so concurrent access to it could cause undefined behavior. Possible solutions include using a sync.Map or controlling access to the map with synchronisation primitives (sketched below).
  2. Whether the Docker daemon supports concurrent container launch/stop operations. I still need to look into this further.
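
As a minimal sketch of that first point (not watchtower code; processSet and the image IDs are purely illustrative), each independent set could run in its own goroutine while a plain sync.Mutex guards the shared cleanup map:

package main

import (
	"fmt"
	"sync"
)

// processSet is a hypothetical stand-in for updating one independent set of
// containers; it reports any image IDs that should be cleaned up afterwards.
func processSet(set []string, record func(imageID string)) {
	for _, name := range set {
		record("image-of-" + name) // illustrative only
	}
}

func main() {
	sets := [][]string{{"k1", "k2", "k3"}, {"l1", "l2", "l3"}, {"t2"}, {"t1"}}

	var (
		mu         sync.Mutex
		cleanupIDs = map[string]bool{} // shared across goroutines, guarded by mu
		wg         sync.WaitGroup
	)

	for _, set := range sets {
		wg.Add(1)
		go func(set []string) {
			defer wg.Done()
			processSet(set, func(imageID string) {
				mu.Lock() // serialise writes to the shared map
				defer mu.Unlock()
				cleanupIDs[imageID] = true
			})
		}(set)
	}
	wg.Wait()

	fmt.Println(len(cleanupIDs), "images queued for cleanup")
}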

@simskij
Member

simskij commented Aug 22, 2020

For sure! My suggested additional improvement is outside the scope of this PR in any case, but it sure is interesting! 😅

@Rush

Rush commented Sep 7, 2020

Any updates on this one? :)

@simskij
Member

simskij commented Sep 8, 2020

Any updates on this one? :)

I started to review this yesterday. Turns out the whole linking and dependency mechanism is borked to some extent (not because of this PR), so it might take a bit longer than I expected.

@piksel
Member

piksel commented Sep 8, 2020

Another thing is the lack of tests. I know that watchtower has far from 100% coverage (😁), but this is something that I think really should have some. Both to describe how the intended update strategy should work, and to make sure that it actually does what it aims to (without causing too much destruction in prod).

Also, I can't tell if this would cause multiple pulls of the same image if several "sets" use the same one. If so, maybe the images could be fetched before the set split?
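
To sketch that suggestion (the puller interface and pullDistinctImages helper below are assumptions, not watchtower APIs), each distinct image could be pulled once per cycle before the containers are split into sets:

package main

import (
	"context"
	"fmt"
)

// ctrInfo is a hypothetical stand-in holding a container's name and image reference.
type ctrInfo struct {
	name  string
	image string
}

// puller is a hypothetical interface for whatever performs the actual image pull.
type puller interface {
	PullImage(ctx context.Context, image string) error
}

type fakePuller struct{}

func (fakePuller) PullImage(ctx context.Context, image string) error {
	fmt.Println("pulling", image)
	return nil
}

// pullDistinctImages pulls every image referenced by the containers exactly once.
func pullDistinctImages(ctx context.Context, p puller, containers []ctrInfo) error {
	pulled := map[string]bool{}
	for _, c := range containers {
		if pulled[c.image] {
			continue // this image was already fetched earlier in the cycle
		}
		if err := p.PullImage(ctx, c.image); err != nil {
			return err
		}
		pulled[c.image] = true
	}
	return nil
}

func main() {
	containers := []ctrInfo{
		{name: "app1", image: "example/app:latest"},
		{name: "app2", image: "example/app:latest"},
		{name: "db", image: "example/db:latest"},
	}
	// Only two pulls happen here, even though three containers are listed.
	_ = pullDistinctImages(context.Background(), fakePuller{}, containers)
}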

@vrajashkr
Author

Another thing is the lack of tests. I know that watchtower has far from 100% coverage (grin), but this is something that I think really should have some. Both to describe how the intended update strategy should work, and to make sure that it actually does what it aims to (without causing too much destruction in prod).

Indeed. Tests are definitely needed. I'll look into the current set of tests and see how I can incorporate the cases I can think of.

Also, I can't tell if this would cause multiple pulls of the same image if several "sets" use the same one. If so, maybe the images could be fetched before the set split?

The modifications in this PR only affect the procedure for stopping and restarting containers. The check for whether a container is stale, and the image pull itself, shouldn't be affected by this PR.

// IsContainerStale pulls the container's image (unless pulling is disabled)
// and reports whether a newer image is available.
func (client dockerClient) IsContainerStale(container Container) (bool, error) {
	ctx := context.Background()

	if !client.pullImages {
		log.Debugf("Skipping image pull.")
	} else if err := client.PullImage(ctx, container); err != nil {
		return false, err
	}

	return client.HasNewImage(ctx, container)
}

According to this snippet, the image is checked in the IsContainerStale function. In update.go, this is called before the splitting happens (same as the existing implementation).

func Update(client container.Client, params types.UpdateParams) error {
	log.Debug("Checking containers for updated images")

	if params.LifecycleHooks {
		lifecycle.ExecutePreChecks(client, params)
	}

	containers, err := client.ListContainers(params.Filter)
	if err != nil {
		return err
	}

	for i, container := range containers {
		stale, err := client.IsContainerStale(container)
		if err != nil {
			log.Infof("Unable to update container %s. Proceeding to next.", containers[i].Name())
			log.Debug(err)
			stale = false
		}
		containers[i].Stale = stale
	}

	checkDependencies(containers)

	var dependencySortedGraphs [][]container.Container
	// splitting related code after this line

@vrajashkr
Author

The new commits modularize some of the steps in the Update function to allow for easier testing.
I've added comments to the new functions to hopefully make them easier to understand.

There are also new test cases added to demonstrate the proposed sorting logic.

Any feedback and suggestions are welcome! Thanks!

Member

@simskij simskij left a comment


LGTM. Will not merge this in the version we're cutting today, as I'd like it to remain in latest-dev for a while to make sure it's stable before releasing it to latest. So, don't merge until v1.0.3 has been released and any critical bugs in the new version have been resolved.

@vrajashkr
Author

LGTM. Will not merge this in the version we're cutting today, as I'd like it to remain in latest-dev for a while to make sure it's stable before releasing it to latest. So, don't merge until v1.0.3 has been released and any critical bugs have been resolved.

Thank you for taking the time to review this PR! I really do appreciate it.
Hopefully, there won't be any critical bugs, but if any do show up, I'm looking forward to crushing them :)

@simskij
Member

simskij commented Oct 3, 2020

Thank you for taking the time to review this PR! I really do appreciate it.
Hopefully, there won't be any critical bugs, but if any do show up, I'm looking forward to crushing them :)

Just to be clear, I meant critical bugs in the new version, not in your code 👍 😉

@vrajashkr
Author

Just to be clear, I meant critical bugs in the new version, not in your code

Ah I see xD. However, this is still a pretty significant change. Hopefully there aren't any obscure logic bugs in my code :)

@Rush

Rush commented Oct 3, 2020

Thank you for addressing the issue. I'll make sure to go back to using watchtower after it's released!

@simskij
Member

simskij commented Nov 20, 2020

Thank you so much for your contribution! 🎉

This PR has conflicts, however. Would you be willing to give it a go @darkaether, or would you prefer if I did it?

@vrajashkr
Author

vrajashkr commented Nov 22, 2020

Thank you so much for your contribution!

This PR has conflicts, however. Would you be willing to give it a go @darkaether, or would you prefer if I did it?

Hey, I had a go at resolving the conflicts; however, quite a bit has changed since I last worked on the code. I might need some time to go over things before I can push the fully merged and tested code.

@vrajashkr
Author

Hi again @simskij ! I've managed to fix the conflicts and made some convenience changes as well to play nicely with the existing rolling restart mechanism. I've yet to test this in a live environment; I will get back to you once I'm done with that.

@Rush

Rush commented Dec 14, 2020

Any updates on this PR?

@Rush

Rush commented Dec 15, 2020

Quick question. I am planning for some of my containers to have a hypothetically long wind-down time:

if err := client.StopContainer(c, 10*time.Minute); err != nil {

_ = client.waitForStopOrTimeout(c, timeout)

According to the code above, watchtower waits for up to 10 minutes for the container to cleanly shut down. This is cool!

But how will it work after this PR is merged? Will each container shutdown cycle be handled independently? What if we have 10 containers and each of them takes 10 minutes to shut down? Will it mess up the entire refresh cycle for all containers across the server?

@vrajashkr
Author

But how will it work after this PR is merged? Will each container shutdown cycle be handled independently? What if we have 10 containers and each of them takes 10 minutes to shut down? Will it mess up the entire refresh cycle for all containers across the server?

Hey there @Rush! This is an interesting question. In my opinion, the only change this PR makes is to the order in which the containers are stopped and restarted. Theoretically, this shouldn't affect the actual process of stopping a container or the timeouts.

@simskij
Member

simskij commented Dec 21, 2020

@darkaether Sorry for the delay, I wanted to get #674 merged before touching this. As soon as it's stable I'll make sure to merge this. If there are conflicts by then, I'll make sure to fix them.

Thanks again for your patience 🙇

@Rush

Rush commented Dec 30, 2020

Looks like #674 is merged - can we have this one merged now? :)

@simskij simskij closed this Mar 9, 2021
@simskij simskij deleted the branch containrrr:main March 9, 2021 13:28
@simskij simskij reopened this Mar 9, 2021
@simskij simskij changed the base branch from master to main March 9, 2021 18:58
@simskij
Member

simskij commented Apr 9, 2021

This PR is my bad conscience. Sorry for that. We definitely want to have this merged as soon as possible.

@simskij simskij self-assigned this Apr 9, 2021
@simskij simskij added this to the v1.3.0 milestone Apr 9, 2021
@piksel
Member

piksel commented Apr 25, 2021

So, I have been looking at rebasing this to the current main, and my thoughts are as follows:
These are the two current ways of updating containers:

  • Dependency Sorted Updates
    Pros: Handles linked containers
    Cons: Long downtime (gray boxes in graph)
    Graph: mermaid-diagram-20210425130041 (image attachment)

  • Rolling Updates
    Pros: Minimum downtime
    Cons: Cannot be used with linked containers
    Graph: mermaid-diagram-20210425130047 (image attachment)

Each has its use and its downside, but compare that to this PR's "tree based" update:

  • Dependency Tree Updates
    Pros: Minimum downtime, Handles linked containers
    Cons: ??
    Graph: mermaid-diagram-20210425130051 (image attachment)

If I'm not missing anything here, I would argue that this could totally replace the current methods of updating. As such, my current rebase/refactor removes rolling restarts entirely. Any thoughts, @simskij @darkaether?

@piksel
Member

piksel commented Apr 25, 2021

Okay, so, the update graph made in this PR was a lot simpler than I initially thought and currently has some problems. First of all, the links are made bidirectional, which means that even if a container only has links to, say, its database container, that database container would also be restarted every time the "main" app container was restarted. Add to this that other containers may link to the database, and it will still bring down a lot more than necessary.
This could still be added as a third update option, but I would rather modify this to return a dependency tree.

Never mind, the only real benefit of using the tree is that you could start/stop multiple containers in parallel (since it can tell the difference between "siblings" and dependencies). But even if that is possible to do with goroutines, the gain is likely very small (and the complexity increase quite high).
Also, the "what needs updating" calculation is still performed by the checkDependencies method; the container link map is only used for building the graph itself.

The only problem that remains is that the sorting is done on the list that is filtered for "should update", which means that a stray "monitor-only" container would mess up the graph, either leading to a crash, or an update just being ignored without anything being reported to the user.

Also, the current tests are a bit... verbose. I think adding a helper that takes a dot diagraph like

digraph test1 {
    A -> B -> C;
    B -> D;
    E -> C;
}

and an expected stopping order (EABC for updating C) would make more complex dependency tests easy to add (and read/visualize in graphviz ofc).
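
As a rough illustration of what such a helper could look like (a sketch only, not tied to watchtower's types; the parsing and the deterministic ordering are assumptions):

package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// parseDigraph extracts dependency edges from a minimal dot-style digraph,
// where "A -> B" means container A links to (depends on) container B.
// It assumes well-formed input and is only meant as a test-helper sketch.
func parseDigraph(src string) map[string][]string {
	deps := map[string][]string{}
	body := regexp.MustCompile(`\{([^}]*)\}`).FindStringSubmatch(src)[1]
	for _, stmt := range strings.Split(body, ";") {
		var nodes []string
		for _, n := range strings.Split(stmt, "->") {
			if n = strings.TrimSpace(n); n != "" {
				nodes = append(nodes, n)
			}
		}
		for i := 0; i+1 < len(nodes); i++ { // "A -> B -> C" yields A->B and B->C
			deps[nodes[i]] = append(deps[nodes[i]], nodes[i+1])
		}
	}
	return deps
}

// stopOrderFor returns one valid order in which containers must be stopped so
// that target can be updated: transitive dependents first, target last.
func stopOrderFor(deps map[string][]string, target string) []string {
	var order []string
	visited := map[string]bool{}
	var visit func(string)
	visit = func(n string) {
		if visited[n] {
			return
		}
		visited[n] = true
		var dependents []string // containers that link to n
		for parent, links := range deps {
			for _, l := range links {
				if l == n {
					dependents = append(dependents, parent)
				}
			}
		}
		sort.Strings(dependents) // deterministic order for test assertions
		for _, d := range dependents {
			visit(d)
		}
		order = append(order, n)
	}
	visit(target)
	return order
}

func main() {
	deps := parseDigraph(`digraph test1 { A -> B -> C; B -> D; E -> C; }`)
	// Prints ABEC: one valid stopping order for updating C (EABC is another).
	fmt.Println(strings.Join(stopOrderFor(deps, "C"), ""))
}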

@simskij
Member

simskij commented Apr 27, 2021

Also, the current tests are a bit... verbose. I think adding a helper that takes a dot digraph and an expected stopping order (EABC for updating C) would make more complex dependency tests easy to add (and read/visualize in graphviz ofc).

I agree with you regarding the digraphs. @darkaether, is this something you'd feel comfortable doing? If not, let us know and we'll try to assist as much as possible.

@piksel
Member

piksel commented Apr 28, 2021

Okay, so I have made a really messy merge in:
https://github.com/containrrr/watchtower/compare/feat/graph-updates
with the addition of ensureUpdateAllowedByLabels, which is meant to check whether a monitor-only label somewhere in the dependency chain would prevent the update from being performed. This would never happen right now, since the sorting is done on the already filtered list. The aim is to make a test for exactly that and then alter the behaviour to handle it.
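
Not the code in that branch, but a rough sketch of the kind of check being described; the node type and monitorOnly flag are hypothetical, and the link graph is assumed to be acyclic (as Docker links are):

package main

import "fmt"

// node is a hypothetical container with its linked dependencies.
type node struct {
	name        string
	monitorOnly bool    // e.g. derived from a monitor-only label
	deps        []*node // containers this one links to
}

// updateAllowed reports whether n and everything it depends on may be stopped
// and recreated, i.e. no monitor-only container sits in the dependency chain.
func updateAllowed(n *node) (bool, string) {
	if n.monitorOnly {
		return false, n.name
	}
	for _, d := range n.deps {
		if ok, blocker := updateAllowed(d); !ok {
			return false, blocker
		}
	}
	return true, ""
}

func main() {
	db := &node{name: "db", monitorOnly: true}
	app := &node{name: "app", deps: []*node{db}}
	if ok, blocker := updateAllowed(app); !ok {
		fmt.Printf("skipping update of %s: %s is monitor-only\n", app.name, blocker)
	}
}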

It currently passes all tests, but that's not to say that it works correctly. The next step would be to make the tests easier to read, have fewer dependencies on each other (there were tests that shared the same client and mutated it), and include handling of monitor-only containers.

@Rush

Rush commented Jul 27, 2021

Soon there will be a 1-year anniversary of this PR - I wonder if maybe we can make it before that? 🐸

@piksel
Member

piksel commented Jul 29, 2021

@Rush It doesn't work correctly the way it's done in the PR, due to:

The only problem that remains is that the sorting is done on the list that is filtered for "should update", which means that a stray "monitor-only" container would mess up the graph, either leading to a crash, or an update just being ignored without anything being reported to the user.

and no one seems to want to work on it. Personally, I don't have these problems and don't feel confident enough in my simulations to do a good job at implementing it.

If this was merged as is, it would solve some scenarios, but cause currently working ones to fail. That is what we want to avoid and why this PR has taken such a long time.

@vrajashkr
Author

Hi everyone!
Firstly, I sincerely apologize for my lack of response. I completely missed all my GitHub notifications.
It's been quite a while since I worked on this, and I've become very rusty in Golang.

Unfortunately, I have quite a number of things on my plate at the moment, so I might not be able to set aside enough time to contribute much to this PR.

If there is someone willing to take this PR to completion, that would be great! Otherwise, I'll be sure to revisit it in the near future once I get a bit of time on my hands.

Regarding the technical aspect,

The only problem that remains is that the sorting is done on the list that is filtered for "should update", which means that a stray "monitor-only" container would mess up the graph, either leading to a crash, or an update just being ignored without anything being reported to the user.

I agree with this. It was a case that I did not consider during implementation.

Do we have a set of sample scenarios that we could document here to help anyone taking this forward understand the expected functionality? (It would probably help with writing the test cases too.)

@piksel
Member

piksel commented Aug 8, 2021

Do we have a set of sample scenarios that we could document here to help anyone taking this forward understand the expected functionality? (It would probably help with writing the test cases too.)

No, that's why I suggested the digraphs that could be used to create test cases "without code". It's much easier to find and test for the different scenarios when they can be visualized, imho.

a stray "monitor-only" container would mess up the graph

Another question here is of course: what did the user intend?
The most obvious thing to do for these containers (part of a dependency chain, but "monitor only") is to stop & restart them, without any removal/recreation (if possible).
It would still potentially interrupt a transaction etc. and be just the thing the user wanted to prevent... 🤷‍♀️

@Rush

Rush commented Aug 16, 2021

FYI. I have published @darkaether's work at https://hub.docker.com/r/rushpl/watchtower.

Just start it like you would start watchtower normally:

For example

docker run --restart=always -d --name watchtower -v /var/run/docker.sock:/var/run/docker.sock -v /root/.docker/config.json:/config.json rushpl/watchtower --interval 60 --label-enable

Based on some preliminary tests it seems to work well for my use-case.

@Rush

Rush commented Aug 16, 2021

@darkaether Actually, a small problem I see is that if I run 4 containers with the same image, I get 4 separate notifications for "Found new rushpl/test:latest image". The notifications seem to trigger at slightly different times, which may indicate that there were 4 separate checks done for the very same image.

@simskij
Member

simskij commented Sep 29, 2021

@piksel, maybe we could schedule a session to finally get this tested, fixed, and merged? I think it makes sense if we collab on it, as it will require some thorough collective thinking to get right, given our previous conversations around this.

@Rush

Rush commented Dec 21, 2021

Bump :) maybe we could get this done before year's end?

@Rush

Rush commented May 26, 2022

Another bump 5 months later :)

Successfully merging this pull request may close these issues.

Long downtime during restart of multiple containers that are based on the same image