Skip to content
This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Ignore some errors during manifest loading #1559

Merged
merged 1 commit into from
Jun 8, 2020

Conversation

rndstr
Copy link
Contributor

@rndstr rndstr commented Nov 28, 2018

When loading manifests, we attempt to walk the base file tree twice. Once
while looking for directories with Helm charts in them, and then to load
all the .yaml/.yml files. The former keeps track of those directories
to exclude them wwhile doing the latter since we do not want to load
yamels from directories with Helm charts in them.

While walking those file trees any error aborted the whole loading
process and as a consequence, the API calls ListServices and
ListImages would return an error. Suddenly disappearing files such as
Git's gc lock file would trigger an error. Since the Git repo is cloned
before manifests are loaded, there was a race sometimes between having
that lock file being enumerated but then disappear while trying to
retrieve information about it.

This PR ignores all errors while enuemrating the Helm chart directories
since permissions are unlikely to change and disappearing files to not
modify the list of excluded directories.

When walking the yamels, only errors for files or directories that we
actually care about will be reported. That means any error for files
that do not have a yamel extension will be just ignored.


Follow-up to #1076
Fixes #1558

@rndstr rndstr force-pushed the issue/1558-chart-walking-skip-error branch from 029fbe2 to f0516a8 Compare November 28, 2018 02:50
@rndstr rndstr requested a review from squaremo November 28, 2018 15:57
@squaremo
Copy link
Member

Since the Git repo is cloned
before manifests are loaded, there was a race sometimes between having
that lock file being enumerated but then disappear while trying to
retrieve information about it.

I guess you must have seen this to report it specifically, but I am puzzled about how it occurs -- has the clone operation not finished before we look at the filesystem?

@rndstr
Copy link
Contributor Author

rndstr commented Nov 29, 2018

I'm also not sure how it happens but Git garbage collection or at least its lock file is being reported as existing but then cannot be accessed. Maybe fs caching?

I looked through the usages of git.Checkout whether it might be shared somewhere but I couldn't find anything.

Copy link
Member

@stefanprodan stefanprodan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@squaremo
Copy link
Member

Before papering over the problem, I would like to understand why it happens.

@2opremio
Copy link
Contributor

@rndstr can you reproduce? I would say we need a repro to be completely sure that what we are doing is right

@rndstr
Copy link
Contributor Author

rndstr commented Jan 25, 2019

@2opremio sorry I haven't yet gotten around to look more into this. the two independent reports of this happening were on GKE, @foot had it occurring occasionally.

@stefanprodan
Copy link
Member

I've seen reports of this on Slack, one user mentioned this happens once per week.

@2opremio
Copy link
Contributor

@stefanprodan @hiddeco are you still hearing from users having this problem?

@hiddeco
Copy link
Member

hiddeco commented Dec 12, 2019

I am not aware of any reports about this issue in the last couple of weeks.

@emas80
Copy link
Contributor

emas80 commented Jun 4, 2020

Hi, can this be related to #2927?
We are experiencing it very often, at least once a week, restarting the pod fixes.
I've been asked by the dev team to kill the Flux pods every day, because it is very annoying sometimes you don't get any feedback that something is actually wrong.

It was not happening before 1.18.1, I think that because of many optimizations Flux is not OOM killed anymore by Kubernetes like before so the same pod stays alive for weeks and this problem occurs.

@squaremo
Copy link
Member

squaremo commented Jun 8, 2020

https://git-scm.com/docs/git-gc#Documentation/git-gc.txt-gcautoDetach:

Make git gc --auto return immediately and run in background if the system supports it. Default is true.

and https://git-scm.com/docs/git-gc#_description:

When common porcelain operations that create objects are run, they will check whether the repository has grown substantially since the last maintenance, and if so run git gc automatically.

My guess is that in some circumstances, the git clone triggers a GC in the cloned repo, and this is run in the background.

@squaremo squaremo force-pushed the issue/1558-chart-walking-skip-error branch 2 times, most recently from 1673a8a to be859a1 Compare June 8, 2020 12:42
@squaremo
Copy link
Member

squaremo commented Jun 8, 2020

Rebased and removed the (now unnecessary) release note.

When loading manifests, we attempt to walk the base file tree twice. Once
while looking for directories with Helm charts in them, and then to load
all the .yaml/.yml files. The former keeps track of those directories
to exclude them while doing the latter since we do not want to load
yamels from directories with Helm charts in them.

While walking those file trees any error aborted the whole loading
process and as a consequence, the API calls `ListServices` and
`ListImages` would return an error. Suddenly disappearing files such as
Git's gc lock file would trigger an error. Since the Git repo is cloned
before manifests are loaded, there was a race sometimes between having
that lock file being enumerated but then disappear while trying to
retrieve information about it.

This PR ignores all errors while enumerating the Helm chart directories
since permissions are unlikely to change, and disappearing files do not
modify the list of excluded directories.

When walking the yamels afterwards, only errors for files or directories
that we actually care about will be reported. That means any error for
files that do not have a yamel extension will be just ignored.
@squaremo squaremo force-pushed the issue/1558-chart-walking-skip-error branch from be859a1 to 35f67d4 Compare June 8, 2020 14:46
Copy link
Member

@squaremo squaremo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revisiting this after 1. seeing the problem reported more and 2. understanding why it might happen, I am more sympathetic to the idea of only caring about the files of interest. Belated thanks @rndstr :corgi:

@squaremo squaremo merged commit 71420aa into master Jun 8, 2020
@squaremo squaremo deleted the issue/1558-chart-walking-skip-error branch June 8, 2020 18:30
squaremo added a commit that referenced this pull request Jul 14, 2020
The PR #1559 rearranged the filesystem walk during Load, so that it
only resulted in an error if there was a problem reading a YAML file
or non-chart directory (which might contain YAML files).

To decide whether a file is of interest, it first checks the stat to
see if it's a directory (in which case, recurse if not a chart ..) --
but if there's an error, that will be nil, and it will panic.

In general, you don't know if the file you can't read is (supposed to
be) a directory or a regular file, so there's no way to treat those
differently. Instead, this commit makes it check before walking that
the path supplied exists, then during the walk, ignore errors unless
it looks like a YAML file.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Manifest loading fails when ephemeral file disappears at wrong time
6 participants