Description of Problem
Hey guys, it's me again. I recently posted #1173 and, from what I can gather, it fixed the problem described in that PR (TL;DR: none of the notifier's processors could acquire the lock). However, it seems to have uncovered another issue that looks even more complicated to me. I'll try to describe it as best as I can and hopefully give you enough clues to untie this knot, because I'm at my wits' end.
After #1173 was merged and deployed to our environments, I verified that the original issue no longer reproduces. And indeed it doesn't, which is good. However, every once in a while our notifier pod gets OOM-killed by OpenShift. As an immediate remedy, I raised the pod's memory limit from 1.5 GiB to 4 GiB, and that helped. I also noticed that one of the processors keeps processing the update operation from the RHEL7-rhel-7-supplementary updater, seemingly without end.
There was no message in the pod's log that would really expose what's going on, so I decided to run the notifier locally, connected to all the other parts of our stage environment where the problem reproduces. First I ran it in a development container using the make local-dev-notifier-restart target. It ran for a couple of minutes and then crashed with this:
9:35AM DBG lock acquired by another processor. will not process UOID=78d96d7b-7fb5-41b8-aab4-31f6d753c5f7 component=notifier/processor/Processor.process processor_id=3 updater=RHEL7-rhel-7-supplementary
9:35AM INF notification ids in failed status component=notifier/delivery/Delivery.RunDelivery deliverer=amqp-direct- failed=0 id=1
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0xcde2c3]
goroutine 13 [running]:
github.com/quay/clair/v4/httptransport/client.(*HTTP).AffectedManifests(0xc000424960, 0xffcfe0, 0xc000424a50, 0xc072380000, 0x231fc, 0x32a16, 0x0, 0x0, 0x0)
/src/clair/httptransport/client/indexer.go:42 +0x403
github.com/quay/clair/v4/notifier.(*Processor).create(0xc00011a280, 0xffcfe0, 0xc000424a50, 0xc000251c50, 0xc, 0xe648604a4799e171, 0xcdd89dee4f2de3ab, 0xc000250170, 0xc, 0xc0004e36e0, ...)
/src/clair/notifier/processor.go:127 +0x7aa
github.com/quay/clair/v4/notifier.(*Processor).process.func1(0xc00011a280, 0xffcfe0, 0xc000424a50, 0xc000251c50, 0xc, 0xe648604a4799e171, 0xcdd89dee4f2de3ab, 0xc000250170, 0xc, 0xc0004e36e0, ...)
/src/clair/notifier/processor.go:100 +0x165
github.com/quay/clair/v4/notifier.(*Processor).process(0xc00011a280, 0xffcfe0, 0xc000424a50, 0xc0004ca000)
/src/clair/notifier/processor.go:101 +0x9f8
created by github.com/quay/clair/v4/notifier.(*Processor).Process
/src/clair/notifier/processor.go:59 +0x5d
exit status 2
I was also monitoring the container using Portainer and managed to snap some stats:
Then I decided to run the notifier directly on my system and debug the crash using the VS Code debugger. I set a breakpoint here and ran cmd/clair with CLAIR_MODE set to "notifier". The story was pretty much the same: it ran for a couple of minutes, gradually growing in memory consumption, and then after ~20 minutes (maybe more) it hit the breakpoint.
There are a couple of interesting things in the Clair state that I don't quite know how to interpret. First and most obviously, err is not nil. See here:
So, a couple of things to address here. The immediate cause of the crash above is our error handling: we attempt to access resp.StatusCode and resp.Status, but when err is non-nil, resp may be nil, so there is no guarantee those fields exist. That's where the panic comes from, and it's easy to fix. However, it doesn't deal with the error itself. I read this discussion on GH and my takeaway (might be wrong!) is that the server closes the connection when we see this error. That would make sense, because the update operation we're trying to process there is HUGE and contains 143k added vulns.
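For the panic itself, here's roughly the fix I have in mind, as a minimal sketch rather than the actual Clair code (the package and function names are placeholders): check err before touching resp at all, because the response is nil whenever the transport returns an error.

```go
// Sketch only, not the actual Clair code: check the transport error before
// dereferencing resp, since resp is nil whenever err is non-nil.
package notifier

import (
	"fmt"
	"net/http"
)

func postAffectedManifests(c *http.Client, req *http.Request) error {
	resp, err := c.Do(req)
	if err != nil {
		// Do NOT read resp.StatusCode or resp.Status here; resp may be nil.
		return fmt.Errorf("affected manifests request failed: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}
```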
As you can see, the request body is 328 MB:
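As a rough sanity check (my own arithmetic, not anything from the code): 328 MB spread over ~143,000 vulnerabilities works out to roughly 2.3 KB of serialized data per vulnerability, which seems plausible for full vulnerability records, so the body size does look consistent with the size of this particular update operation.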
Here comes the catch, though. As far as I can tell, we initialize the HTTP server like this, which means we don't override the default value of ReadTimeout; the zero value means no timeout at all.
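To illustrate what I mean about the defaults (a generic net/http sketch, not our actual server setup): leaving the timeout fields at their zero values means the server will keep reading a request body of any size for as long as it takes, so a read timeout shouldn't be what's cutting the connection.

```go
// Generic illustration of the net/http defaults mentioned above, not the
// actual Clair server initialization.
package example

import (
	"net/http"
	"time"
)

// defaultServer leaves ReadTimeout/WriteTimeout at their zero values, which
// in net/http means "no timeout": the server will read a request body of any
// size for any amount of time.
func defaultServer(addr string, h http.Handler) *http.Server {
	return &http.Server{Addr: addr, Handler: h}
}

// boundedServer shows what explicit limits would look like, purely for
// contrast; I'm not suggesting these particular values.
func boundedServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:         addr,
		Handler:      h,
		ReadTimeout:  5 * time.Minute,
		WriteTimeout: 5 * time.Minute,
	}
}
```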
Also, here is the content of the u variable acquired here, in case it's interesting:
So here I am. It looks like the server (the indexer in this case) closes the connection when we try to POST a huge chunk of data, but I don't know why that happens. Could you please look into it? I could really use your expertise.
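In case it helps to isolate this from the notifier, this is the kind of standalone reproduction I could try (a sketch under my own assumptions; the URL and payload size are placeholders, and I haven't verified it triggers the same error):

```go
// Standalone sketch: POST a large body to an endpoint and report how the
// request fails. The URL and payload size below are placeholders.
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// ~300 MiB of zeroes, roughly the size of the failing request body.
	body := bytes.NewReader(make([]byte, 300<<20))
	resp, err := http.Post("http://localhost:6060/indexer-endpoint", "application/json", body)
	if err != nil {
		// This is where I'd expect to see the connection being closed.
		fmt.Println("request error:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```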
PS: Sorry for posting the debug data as images; unfortunately, VS Code doesn't have a way to dump it as text.
Expected Outcome
Actual Outcome
Environment
Clair version/image: Running clair from commit 5a6f1c3
Host OS: Reproducing both locally on Fedora 33 and also deployed in OpenShift v3.11.104