-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Ignore some errors during manifest loading #1559
Conversation
029fbe2
to
f0516a8
Compare
I guess you must have seen this to report it specifically, but I am puzzled about how it occurs -- has the clone operation not finished before we look at the filesystem? |
I'm also not sure how it happens but Git garbage collection or at least its lock file is being reported as existing but then cannot be accessed. Maybe fs caching? I looked through the usages of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Before papering over the problem, I would like to understand why it happens. |
@rndstr can you reproduce? I would say we need a repro to be completely sure that what we are doing is right |
I've seen reports of this on Slack, one user mentioned this happens once per week. |
@stefanprodan @hiddeco are you still hearing from users having this problem? |
I am not aware of any reports about this issue in the last couple of weeks. |
Hi, can this be related to #2927? It was not happening before 1.18.1, I think that because of many optimizations Flux is not OOM killed anymore by Kubernetes like before so the same pod stays alive for weeks and this problem occurs. |
https://git-scm.com/docs/git-gc#Documentation/git-gc.txt-gcautoDetach:
and https://git-scm.com/docs/git-gc#_description:
My guess is that in some circumstances, the git clone triggers a GC in the cloned repo, and this is run in the background. |
1673a8a
to
be859a1
Compare
Rebased and removed the (now unnecessary) release note. |
When loading manifests, we attempt to walk the base file tree twice. Once while looking for directories with Helm charts in them, and then to load all the .yaml/.yml files. The former keeps track of those directories to exclude them while doing the latter since we do not want to load yamels from directories with Helm charts in them. While walking those file trees any error aborted the whole loading process and as a consequence, the API calls `ListServices` and `ListImages` would return an error. Suddenly disappearing files such as Git's gc lock file would trigger an error. Since the Git repo is cloned before manifests are loaded, there was a race sometimes between having that lock file being enumerated but then disappear while trying to retrieve information about it. This PR ignores all errors while enumerating the Helm chart directories since permissions are unlikely to change, and disappearing files do not modify the list of excluded directories. When walking the yamels afterwards, only errors for files or directories that we actually care about will be reported. That means any error for files that do not have a yamel extension will be just ignored.
be859a1
to
35f67d4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revisiting this after 1. seeing the problem reported more and 2. understanding why it might happen, I am more sympathetic to the idea of only caring about the files of interest. Belated thanks @rndstr :corgi:
The PR #1559 rearranged the filesystem walk during Load, so that it only resulted in an error if there was a problem reading a YAML file or non-chart directory (which might contain YAML files). To decide whether a file is of interest, it first checks the stat to see if it's a directory (in which case, recurse if not a chart ..) -- but if there's an error, that will be nil, and it will panic. In general, you don't know if the file you can't read is (supposed to be) a directory or a regular file, so there's no way to treat those differently. Instead, this commit makes it check before walking that the path supplied exists, then during the walk, ignore errors unless it looks like a YAML file.
When loading manifests, we attempt to walk the base file tree twice. Once
while looking for directories with Helm charts in them, and then to load
all the .yaml/.yml files. The former keeps track of those directories
to exclude them wwhile doing the latter since we do not want to load
yamels from directories with Helm charts in them.
While walking those file trees any error aborted the whole loading
process and as a consequence, the API calls
ListServices
andListImages
would return an error. Suddenly disappearing files such asGit's gc lock file would trigger an error. Since the Git repo is cloned
before manifests are loaded, there was a race sometimes between having
that lock file being enumerated but then disappear while trying to
retrieve information about it.
This PR ignores all errors while enuemrating the Helm chart directories
since permissions are unlikely to change and disappearing files to not
modify the list of excluded directories.
When walking the yamels, only errors for files or directories that we
actually care about will be reported. That means any error for files
that do not have a yamel extension will be just ignored.
Follow-up to #1076
Fixes #1558