Ignore some errors during manifest loading #1559

rndstr · 2018-11-28T02:47:07Z

When loading manifests, we attempt to walk the base file tree twice: once
to look for directories with Helm charts in them, and once to load
all the .yaml/.yml files. The former keeps track of those directories
to exclude them while doing the latter, since we do not want to load
YAML files from directories with Helm charts in them.

While walking those file trees any error aborted the whole loading
process and as a consequence, the API calls ListServices and
ListImages would return an error. Suddenly disappearing files such as
Git's gc lock file would trigger an error. Since the Git repo is cloned
before manifests are loaded, there was a race sometimes between having
that lock file being enumerated but then disappear while trying to
retrieve information about it.

This PR ignores all errors while enumerating the Helm chart directories,
since permissions are unlikely to change and disappearing files do not
modify the list of excluded directories.

When walking the YAML files, only errors for files or directories that we
actually care about will be reported. That means any error for a file
that does not have a .yaml/.yml extension will simply be ignored.

Follow-up to #1076
Fixes #1558

squaremo · 2018-11-29T10:35:38Z

> Since the Git repo is cloned
> before manifests are loaded, there was a race sometimes between having
> that lock file being enumerated but then disappear while trying to
> retrieve information about it.

I guess you must have seen this to report it specifically, but I am puzzled about how it occurs -- has the clone operation not finished before we look at the filesystem?

rndstr · 2018-11-29T23:20:19Z

I'm also not sure how it happens but Git garbage collection or at least its lock file is being reported as existing but then cannot be accessed. Maybe fs caching?

I looked through the usages of git.Checkout to see whether it might be shared somewhere, but I couldn't find anything.

stefanprodan

LGTM

squaremo · 2018-12-11T17:02:16Z

Before papering over the problem, I would like to understand why it happens.

2opremio · 2019-01-25T15:00:30Z

@rndstr can you reproduce? I would say we need a repro to be completely sure that what we are doing is right

rndstr · 2019-01-25T16:53:57Z

@2opremio sorry, I haven't yet gotten around to looking more into this. The two independent reports of this happening were on GKE; @foot had it occurring occasionally.

stefanprodan · 2019-01-25T17:05:57Z

I've seen reports of this on Slack, one user mentioned this happens once per week.

2opremio · 2019-12-12T13:29:03Z

@stefanprodan @hiddeco are you still hearing from users having this problem?

hiddeco · 2019-12-12T13:31:06Z

I am not aware of any reports about this issue in the last couple of weeks.

emas80 · 2020-06-04T07:40:32Z

Hi, can this be related to #2927?
We are experiencing it very often, at least once a week; restarting the pod fixes it.
I've been asked by the dev team to kill the Flux pods every day, because it is very annoying: sometimes you don't get any feedback that something is actually wrong.

It was not happening before 1.18.1. I think that, because of many optimizations, Flux is no longer OOM-killed by Kubernetes as it was before, so the same pod stays alive for weeks and this problem occurs.

squaremo · 2020-06-08T09:51:36Z

https://git-scm.com/docs/git-gc#Documentation/git-gc.txt-gcautoDetach:

> Make git gc --auto return immediately and run in background if the system supports it. Default is true.

and https://git-scm.com/docs/git-gc#_description:

> When common porcelain operations that create objects are run, they will check whether the repository has grown substantially since the last maintenance, and if so run git gc automatically.

My guess is that in some circumstances, the git clone triggers a GC in the cloned repo, and this is run in the background.

squaremo · 2020-06-08T14:45:01Z

Rebased and removed the (now unnecessary) release note.

When loading manifests, we attempt to walk the base file tree twice: once to look for directories with Helm charts in them, and once to load all the .yaml/.yml files. The former keeps track of those directories to exclude them while doing the latter, since we do not want to load YAML files from directories with Helm charts in them.

While walking those file trees any error aborted the whole loading process and as a consequence, the API calls `ListServices` and `ListImages` would return an error. Suddenly disappearing files such as Git's gc lock file would trigger an error. Since the Git repo is cloned before manifests are loaded, there was sometimes a race in which that lock file was enumerated but then disappeared before information about it could be retrieved.

This PR ignores all errors while enumerating the Helm chart directories, since permissions are unlikely to change and disappearing files do not modify the list of excluded directories.

When walking the YAML files afterwards, only errors for files or directories that we actually care about will be reported. That means any error for a file that does not have a .yaml/.yml extension will simply be ignored.

squaremo

Revisiting this after 1. seeing the problem reported more and 2. understanding why it might happen, I am more sympathetic to the idea of only caring about the files of interest. Belated thanks @rndstr :corgi:

The PR #1559 rearranged the filesystem walk during Load, so that it only resulted in an error if there was a problem reading a YAML file or non-chart directory (which might contain YAML files). To decide whether a file is of interest, it first checks the stat result to see if it's a directory (in which case, recurse if not a chart ..) -- but if there's an error, the stat result will be nil, and it will panic. In general, you don't know if the file you can't read is (supposed to be) a directory or a regular file, so there's no way to treat those differently. Instead, this commit makes it check before walking that the path supplied exists, then during the walk, ignore errors unless the path looks like a YAML file.

rndstr force-pushed the issue/1558-chart-walking-skip-error branch from 029fbe2 to f0516a8 Compare November 28, 2018 02:50

rndstr requested a review from squaremo November 28, 2018 15:57

stefanprodan approved these changes Dec 4, 2018

squaremo force-pushed the issue/1558-chart-walking-skip-error branch 2 times, most recently from 1673a8a to be859a1 Compare June 8, 2020 12:42

squaremo force-pushed the issue/1558-chart-walking-skip-error branch from be859a1 to 35f67d4 Compare June 8, 2020 14:46

squaremo approved these changes Jun 8, 2020

squaremo merged commit 71420aa into master Jun 8, 2020

squaremo deleted the issue/1558-chart-walking-skip-error branch June 8, 2020 18:30

squaremo mentioned this pull request Jul 14, 2020

Avoid panic when directory does not exist #3193

Merged

kingdonb mentioned this pull request Feb 23, 2021

"cannot clean working clone" error #2927

Closed
