automation toggled between two images #1127
Comments
Please post logs or other clues; this isn't enough to go on.
Scanning the logs around the time for actions relating to the two image ids...
We had a slew of these over the weekend, all for the same image, this time not flux. Interesting observations:
The one hour is a smoking gun: @squaremo says that the memcached expiry is set to that.
The ByCreatedDesc sort was broken in two ways: 1. If there were two images with a zero CreatedAt time, the sort order would depend on the order in which they appeared in the original (unsorted) list, thus making the order unstable. 2. Images with a zero CreatedAt time were considered more recent than any images with a non-zero CreatedAt time, which is the exact opposite of what we want. These bugs _may_ be the cause of the 'flipping' between different images we are seeing in #1127.
We've had some more flipping. Slack said
Thanks to the logging introduced in #1249 we have some more info. Here's everything logged for
The release at 18:18:17 was for a genuinely new image. We then saw a zero created timestamp for that image at 20:16:35. It's a mystery what triggered the release of the previous image at 20:16:44. The subsequent re-release of the correct image at 20:17:56 puts the house back in order. So it looks like our hunch about zero timestamps was correct, but it's unclear a) why we get them, and b) why the defenses against them in #1249 did not work. For b) we do see the newly introduced "skip container" action at 20:16:35, but that didn't prevent the release of the wrong image a few seconds later.
That riddle is solved. See #1250. So the only remaining mystery here then is how we ended up with a zero
An error while fetching an image manifest would return a nil error (hence indicating success) together with a zero-value image.Info{} struct. That is bad news for the caller in Warmer.warm(), which will map an image tag to that empty image.Info{}, polluting the cache entries for the image+tag and for the image in memcached. When we subsequently use this info to determine the latest suitable tag, we encounter zero CreatedAt timestamps, which, prior to the changes in #1247, #1249 and #1250, would cause the wrong images to be released. Fixes #1127.
Solved in #1251.
For some unknown reason flux decided to deploy the most-recent-but-one image of itself, and then immediately re-deployed the most-recent image:
There is no apparent trigger - the two images were built yesterday, 6 hours apart.