Big POST request seems to be closed from indexer server side #1186

Closed
jan-zmeskal opened this issue Feb 16, 2021 · 2 comments
Comments

@jan-zmeskal
Contributor

Description of Problem

Hey guys, it's me again. I recently posted #1173 and, from what I can gather, it seems to have fixed the problem described in that PR (TL;DR: none of the notifier's processors could acquire a lock). However, it seems to have uncovered another issue that looks even more complicated to me. I'll try to describe it as best as I can, and hopefully it'll give you enough clues to untie this knot, because I'm at my wits' end.

So after #1173 was merged and deployed to our envs, I made sure that the original issue doesn't reproduce anymore. And indeed it hasn't, which is good. However, every once in a while our notifier pod got OOM-killed by OpenShift. As an immediate remedy, I raised the pod's memory limit from 1.5 GiB to 4 GiB, and that helped. However, I also noticed that one of the processors constantly keeps processing an update operation from the RHEL7-rhel-7-supplementary updater, seemingly without end.

There was no message in the pod's log that would really expose what's going on, so I decided to run the notifier locally, but connected to all the other parts of our stage env where the problem reproduces. First I just ran it in a development container using the make local-dev-notifier-restart target. It ran for a couple of minutes and then crashed with this:

9:35AM DBG lock acquired by another processor. will not process UOID=78d96d7b-7fb5-41b8-aab4-31f6d753c5f7 component=notifier/processor/Processor.process processor_id=3 updater=RHEL7-rhel-7-supplementary
9:35AM INF notification ids in failed status component=notifier/delivery/Delivery.RunDelivery deliverer=amqp-direct- failed=0 id=1
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0xcde2c3]
goroutine 13 [running]:
github.com/quay/clair/v4/httptransport/client.(*HTTP).AffectedManifests(0xc000424960, 0xffcfe0, 0xc000424a50, 0xc072380000, 0x231fc, 0x32a16, 0x0, 0x0, 0x0)
        /src/clair/httptransport/client/indexer.go:42 +0x403
github.com/quay/clair/v4/notifier.(*Processor).create(0xc00011a280, 0xffcfe0, 0xc000424a50, 0xc000251c50, 0xc, 0xe648604a4799e171, 0xcdd89dee4f2de3ab, 0xc000250170, 0xc, 0xc0004e36e0, ...)
        /src/clair/notifier/processor.go:127 +0x7aa
github.com/quay/clair/v4/notifier.(*Processor).process.func1(0xc00011a280, 0xffcfe0, 0xc000424a50, 0xc000251c50, 0xc, 0xe648604a4799e171, 0xcdd89dee4f2de3ab, 0xc000250170, 0xc, 0xc0004e36e0, ...)
        /src/clair/notifier/processor.go:100 +0x165
github.com/quay/clair/v4/notifier.(*Processor).process(0xc00011a280, 0xffcfe0, 0xc000424a50, 0xc0004ca000)
        /src/clair/notifier/processor.go:101 +0x9f8
created by github.com/quay/clair/v4/notifier.(*Processor).Process
        /src/clair/notifier/processor.go:59 +0x5d
exit status 2

I was also monitoring the container using Portainer and managed to snap some stats:
[screenshot: container resource stats from Portainer]

Then I decided I'd run the notifier directly on my system and debug the crash using the VS Code debugger. I set a breakpoint here and ran cmd/clair with CLAIR_MODE set to "notifier". The story was pretty much the same: it ran for a couple of minutes, gradually growing in memory consumption, and then after ~20 minutes (maybe more) it hit the breakpoint.

There are a couple of interesting things in the Clair state that I don't quite know how to interpret. First and most obviously, err is not nil. See here:
[screenshot: debugger view of the non-nil err value]
So, a couple of things to address here. The immediate cause of the crash mentioned above is our error handling: we attempt to access resp.StatusCode and resp.Status, but with err non-nil we have no guarantee those fields exist. That's where the panic comes from, and it's easy to fix. However, it doesn't deal with the error itself. I read this discussion on GH and my takeaway (might be wrong!) is that the server closes the connection when we see this error. That would make sense, because the update operation we're trying to process there is HUGE and contains 143k added vulns.
[screenshot: debugger view of the update operation with ~143k added vulnerabilities]
As you can see, the request body is 328 MB:
[screenshot: debugger view of the req variable]
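Back to the panic itself: here is a minimal sketch of the guard I have in mind, not the actual httptransport/client code (the doJSONPost helper and the package name are made up for illustration). The only point is the ordering: err has to be checked before resp is touched, because (*http.Client).Do can return a nil resp together with a non-nil err.

```go
package indexerclient // hypothetical package name, for illustration only

import (
	"fmt"
	"net/http"
)

// doJSONPost stands in for the AffectedManifests call path. With err non-nil,
// resp may be nil, so dereferencing resp.StatusCode / resp.Status before the
// err check is exactly the nil pointer dereference from the stack trace above.
func doJSONPost(c *http.Client, req *http.Request) error {
	resp, err := c.Do(req)
	if err != nil {
		// Return early; resp must not be used here.
		return fmt.Errorf("affected manifests request failed: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}
```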
Here comes the catch, though. As far as I can tell, we initialize the HTTP server like this, which means we don't override the default value of ReadTimeout, and the default is unlimited.
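To make that concrete: the zero value of http.Server.ReadTimeout means "no limit", so on paper the server shouldn't be timing out the body read. A setup with explicit bounds would look roughly like this (illustrative only; the handler, address, and durations are made up, not Clair's actual configuration):

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})

	srv := &http.Server{
		Addr:    ":8080", // made-up address
		Handler: mux,
		// All of these are zero (i.e. unlimited) unless set explicitly.
		// ReadTimeout covers reading the whole request, headers and body.
		ReadTimeout:       5 * time.Minute,
		ReadHeaderTimeout: 30 * time.Second,
		WriteTimeout:      5 * time.Minute,
	}
	log.Fatal(srv.ListenAndServe())
}
```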

Also, here is the content of the u variable acquired here, in case it's interesting:
[screenshot: debugger view of the u variable]

So here I am. It looks like the server (the indexer in this case) closes the connection when we try to POST a huge chunk of data, but I don't know why that's happening. Could you please look into it? I could really use your expertise.
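For illustration, here is a self-contained toy (a sketch of my own, not Clair code) of what a server-side close during a large or slow POST looks like from the client. The slowBody type and the deliberately tiny ReadTimeout exist only to force the close quickly; the resulting client error is the general shape of failure I'd expect if the indexer were cutting the connection.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// slowBody emits a few chunks, stalling long enough in the middle that the
// server's (deliberately tiny) ReadTimeout expires.
type slowBody struct{ step int }

func (s *slowBody) Read(p []byte) (int, error) {
	s.step++
	switch {
	case s.step > 3:
		return 0, io.EOF
	case s.step == 2:
		time.Sleep(2 * time.Second) // longer than the server's ReadTimeout
	}
	return copy(p, "chunk-of-a-very-large-request-body-"), nil
}

func main() {
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if _, err := io.Copy(io.Discard, r.Body); err != nil {
			// Reading the body failed (read deadline exceeded); abort the
			// handler so the server closes the connection without a response.
			panic(http.ErrAbortHandler)
		}
		w.WriteHeader(http.StatusOK)
	}))
	srv.Config.ReadTimeout = 500 * time.Millisecond
	srv.Start()
	defer srv.Close()

	resp, err := http.Post(srv.URL, "application/json", &slowBody{})
	if err != nil {
		// Typically something like "EOF", "broken pipe" or "connection reset
		// by peer", depending on timing and platform.
		fmt.Println("client error:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("unexpectedly got:", resp.Status)
}
```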

PS: Sorry for posting the debug data as images; unfortunately, VS Code doesn't have any way to dump them.

Expected Outcome

Actual Outcome

Environment

  • Clair version/image: Running clair from commit 5a6f1c3
  • Host OS: Reproduced both locally on Fedora 33 and on a deployment in OpenShift v3.11.104
hdonnay added a commit to hdonnay/clair that referenced this issue Feb 16, 2021
Fixes the worst of quay#1186

Signed-off-by: Hank Donnay <[email protected]>
hdonnay added a commit to hdonnay/clair that referenced this issue Feb 17, 2021
Fixes the worst of quay#1186

Backports: quay#1188
Signed-off-by: Hank Donnay <[email protected]>
@ldelossa
Contributor

@jan-zmeskal can we close this? The notifier efficiency work has merged. You shouldn't see this issue.

@jan-zmeskal
Contributor Author

@ldelossa You're right, closing.
