Skip to content
This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

automation toggled between two images #1127

Closed
rade opened this issue Jun 7, 2018 · 6 comments
Closed

automation toggled between two images #1127

rade opened this issue Jun 7, 2018 · 6 comments

Comments

@rade
Copy link
Contributor

rade commented Jun 7, 2018

For some unknown reason flux decided to deploy the most-recent-but-one image of itself, and then immediately re-deployed the most-recent image:
screenshot from 2018-06-07 15-00-58

There is no apparent trigger - the two images were built yesterday, 6 hours apart.

@rade rade added the bug label Jun 7, 2018
@squaremo squaremo added the vague label Jun 7, 2018
@squaremo
Copy link
Member

squaremo commented Jun 7, 2018

Please posts logs or other clues, this isn't enough to go on.

@rade
Copy link
Contributor Author

rade commented Jun 7, 2018

Scanning the logs around the time for actions relating to the two image ids...

ts=2018-06-07T13:22:56.658856019Z caller=images.go:67 component=sync-loop service=weave:deployment/weave-flux-agent container=flux currentimage=quay.io/weaveworks/flux:master-180d2e3 msg=\"added image to changes\" newimage=quay.io/weaveworks/flux:master-4fd139b\n
ts=2018-06-07T13:23:07.537069912Z caller=daemon.go:549 component=daemon event=\"Automated release of quay.io/weaveworks/flux:master-4fd139b\" logupstream=true\n
ts=2018-06-07T13:24:13.021477431Z caller=images.go:67 component=sync-loop service=weave:deployment/weave-flux-agent container=flux currentimage=quay.io/weaveworks/flux:master-4fd139b msg=\"added image to changes\" newimage=quay.io/weaveworks/flux:master-180d2e3\n
ts=2018-06-07T13:24:23.8406753Z   caller=daemon.go:548 component=daemon event=\"Automated release of quay.io/weaveworks/flux:master-180d2e3\" logupstream=true\n

@rade
Copy link
Contributor Author

rade commented Jul 23, 2018

We had a slew of these over the weekend, all for the same image, this time not flux.

Interesting observations:

  • the flip is always from the most-recent to the 2nd-most-recent (taking into account the tag filter), and then back
  • the flip back happens within less than 2 minutes
  • the flips were approx an hour apart. Never less, though sometimes a few hours were skipped. E.g. we had flips at 1:32, 2:35, 4:35, 5:35, 6:35, 11:35

The one hour is a smoking gun: @squaremo says that the memcached expiry is set to that.

rade added a commit that referenced this issue Jul 24, 2018
The ByCreatedDesc sort was broken in two ways:

1. If there were two images with a zero CreatedAt time, the sort order
would depend on the order in which they appeared in the
orginal (unsorted) list, thus making the order unstable.

2. Images with a zero CreatedAt time were considere more recent than
any images with a non-zero CreatedAt time, which is the exact opposite
of what we want.

These bugs _may_ be the cause of the 'flipping' between different
images we are seeing in #1127.
rade added a commit that referenced this issue Jul 24, 2018
The ByCreatedDesc sort was broken in two ways:

1. If there were two images with a zero CreatedAt time, the sort order
would depend on the order in which they appeared in the
orginal (unsorted) list, thus making the order unstable.

2. Images with a zero CreatedAt time were considered more recent than
any images with a non-zero CreatedAt time, which is the exact opposite
of what we want.

These bugs _may_ be the cause of the 'flipping' between different
images we are seeing in #1127.
@rade
Copy link
Contributor Author

rade commented Jul 24, 2018

We've had some more flipping. Slack said

9:16:44pm
CONTROLLER                STATUS   UPDATES
default:deployment/users  success  users: quay.io/weaveworks/users:master-20cda2fd -> master-7a30379f

9:17:56pm
CONTROLLER                STATUS   UPDATES
default:deployment/users  success  users: quay.io/weaveworks/users:master-7a30379f -> master-20cda2fd

Thanks to the logging introduced in #1249 have some more info. Here's everything logged for weaveworks/users:master. Note that log timestamps are UTC, which is an hour behind the slack timestamps.

ts=2018-07-24T18:17:54.275692505Z caller=images.go:74 component=sync-loop service=default:deployment/users container=users repo=quay.io/weaveworks/users pattern=* current=quay.io/weaveworks/users:master-7a30379f info="added update to automation run" new=quay.io/weaveworks/users:master-20cda2fd reason="latest master-20cda2fd (2018-07-24 18:06:14.75586547 +0000 UTC) > current master-7a30379f (2018-07-17 16:54:10.738705008 +0000 UTC)"
ts=2018-07-24T18:17:55.45088081Z caller=images.go:74 component=sync-loop service=default:deployment/users container=users repo=quay.io/weaveworks/users pattern=* current=quay.io/weaveworks/users:master-7a30379f info="added update to automation run" new=quay.io/weaveworks/users:master-20cda2fd reason="latest master-20cda2fd (2018-07-24 18:06:14.75586547 +0000 UTC) > current master-7a30379f (2018-07-17 16:54:10.738705008 +0000 UTC)"
ts=2018-07-24T18:17:56.65079391Z caller=images.go:74 component=sync-loop service=default:deployment/users container=users repo=quay.io/weaveworks/users pattern=* current=quay.io/weaveworks/users:master-7a30379f info="added update to automation run" new=quay.io/weaveworks/users:master-20cda2fd reason="latest master-20cda2fd (2018-07-24 18:06:14.75586547 +0000 UTC) > current master-7a30379f (2018-07-17 16:54:10.738705008 +0000 UTC)"
ts=2018-07-24T18:18:13.014144429Z caller=images.go:74 component=sync-loop service=default:deployment/users container=users repo=quay.io/weaveworks/users pattern=* current=quay.io/weaveworks/users:master-7a30379f info="added update to automation run" new=quay.io/weaveworks/users:master-20cda2fd reason="latest master-20cda2fd (2018-07-24 18:06:14.75586547 +0000 UTC) > current master-7a30379f (2018-07-17 16:54:10.738705008 +0000 UTC)"
ts=2018-07-24T18:18:17.040534784Z caller=daemon.go:577 component=daemon event="Automated release of <elided images>, quay.io/weaveworks/users:master-20cda2fd" logupstream=true
ts=2018-07-24T20:16:35.936462361Z caller=images.go:63 component=sync-loop service=default:deployment/users container=users repo=quay.io/weaveworks/users pattern=* current=quay.io/weaveworks/users:master-20cda2fd warning="image with zero created timestamp" image=quay.io/weaveworks/users action="skip container"
ts=2018-07-24T20:16:44.352342347Z caller=daemon.go:577 component=daemon event="Automated release of quay.io/weaveworks/users:master-7a30379f" logupstream=true
ts=2018-07-24T20:17:48.233113192Z caller=images.go:74 component=sync-loop service=default:deployment/users container=users repo=quay.io/weaveworks/users pattern=* current=quay.io/weaveworks/users:master-7a30379f info="added update to automation run" new=quay.io/weaveworks/users:master-20cda2fd reason="latest master-20cda2fd (2018-07-24 18:06:14.75586547 +0000 UTC) > current master-7a30379f (2018-07-17 16:54:10.738705008 +0000 UTC)"
ts=2018-07-24T20:17:56.393991052Z caller=daemon.go:577 component=daemon event="Automated release of quay.io/weaveworks/users:master-20cda2fd" logupstream=true

The release at 18:18:17 was for a genuinely new image. We then saw a zero created timestamp for that image at 20:16:35. It's a mystery what triggered the release of the previous image at 20:16:44. The subsequent re-release of the correct image at 20:17:56 puts the house back in order.

So it looks like our hunch about zero timestamps was correct, but it's unclear a) why we get them, and b) why the defenses against them in #1249 did not work. For b) we do see the newly introduced "skip container" action at 20:16:35, but that didn't prevent the release of the wrong image a few seconds later.

@rade
Copy link
Contributor Author

rade commented Jul 24, 2018

we do see the newly introduced "skip container" action at 20:16:35, but that didn't prevent the release of the wrong image a few seconds later.

That riddle is solved. See #1250.

So the only remaining mystery here then is how we ended up with a zero CreatedAt timestamp.

rade added a commit that referenced this issue Jul 24, 2018
An error while fetching an image manifest would return a nil
error (hence indicating success) with a unit value image.Info{}
struct.

That is bad news for the caller in Warmer.warm(), which will map an
image tag to that empty image.Info{}, polluting the cache entry for
the image+tag and image in memcached.

When we subsequently use this info to determine the latest suitable
tag, we encounter zero CreatedAt timestamps, which, prior to the
changes in #1247, #1249 and #1250 would cause the wrong images to be
released.

Fixes #1127.
@rade
Copy link
Contributor Author

rade commented Jul 24, 2018

So the only remaining mystery here then is how we ended up with a zero CreatedAt timestamp.

Solved in #1251.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants