Description of Problem
Hey guys, it's me again. I recently posted #1173 and, from what I can gather, it fixed the problem described in that PR (TL;DR: none of the notifier's processors could acquire the lock). However, it seems to have uncovered another issue that looks even more complicated to me. I'll try to describe it as best as I can and hopefully give you enough clues to untie this knot, because I'm at my wits' end.
After #1173 was merged and deployed to our environments, I verified that the original issue no longer reproduces. And indeed it doesn't, which is good. However, every once in a while our notifier pod gets OOM-killed by OpenShift. As an immediate remedy, I raised the pod's memory limit from 1.5 GiB to 4 GiB, and that helped. I also noticed that one of the processors keeps processing the update operation from the RHEL7-rhel-7-supplementary updater, seemingly without end.
There was no message in the pod's log that would really expose what's going on, so I decided to run the notifier locally, connected to all the other parts of our stage environment where the problem reproduces. First I ran it in a development container using the make local-dev-notifier-restart target. It ran for a couple of minutes and then crashed with this:
9:35AM DBG lock acquired by another processor. will not process UOID=78d96d7b-7fb5-41b8-aab4-31f6d753c5f7 component=notifier/processor/Processor.process processor_id=3 updater=RHEL7-rhel-7-supplementary
9:35AM INF notification ids in failed status component=notifier/delivery/Delivery.RunDelivery deliverer=amqp-direct- failed=0 id=1
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0xcde2c3]
goroutine 13 [running]:
github.com/quay/clair/v4/httptransport/client.(*HTTP).AffectedManifests(0xc000424960, 0xffcfe0, 0xc000424a50, 0xc072380000, 0x231fc, 0x32a16, 0x0, 0x0, 0x0)
/src/clair/httptransport/client/indexer.go:42 +0x403
github.com/quay/clair/v4/notifier.(*Processor).create(0xc00011a280, 0xffcfe0, 0xc000424a50, 0xc000251c50, 0xc, 0xe648604a4799e171, 0xcdd89dee4f2de3ab, 0xc000250170, 0xc, 0xc0004e36e0, ...)
/src/clair/notifier/processor.go:127 +0x7aa
github.com/quay/clair/v4/notifier.(*Processor).process.func1(0xc00011a280, 0xffcfe0, 0xc000424a50, 0xc000251c50, 0xc, 0xe648604a4799e171, 0xcdd89dee4f2de3ab, 0xc000250170, 0xc, 0xc0004e36e0, ...)
/src/clair/notifier/processor.go:100 +0x165
github.com/quay/clair/v4/notifier.(*Processor).process(0xc00011a280, 0xffcfe0, 0xc000424a50, 0xc0004ca000)
/src/clair/notifier/processor.go:101 +0x9f8
created by github.com/quay/clair/v4/notifier.(*Processor).Process
/src/clair/notifier/processor.go:59 +0x5d
exit status 2
I was also monitoring the container using Portainer and managed to snap some stats:
Then I decided to run the notifier directly on my system and debug the crash using the VS Code debugger. I set a breakpoint here and ran cmd/clair with CLAIR_MODE set to "notifier". The story was pretty much the same: it ran for a couple of minutes, gradually growing in memory consumption, and then after ~20 minutes (maybe more) it hit the breakpoint.
There are a couple of interesting things in the Clair state that I don't quite know how to interpret. First and most obviously, err is not nil. See here:
So, a couple of things to address here. The immediate cause of the crash above is our error handling: we attempt to access resp.StatusCode and resp.Status, but when err is non-nil, resp may be nil, so there is no guarantee those fields exist. That's where the panic comes from, and it's easy to fix. However, it doesn't deal with the error itself. I read this discussion on GH and my takeaway (might be wrong!) is that the server closes the connection when we see this error. That would make sense, because the update operation we're trying to process there is HUGE and contains 143k added vulns.
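For the panic itself, here's roughly the fix I have in mind, as a minimal sketch rather than the actual Clair code (the package and function names are placeholders): check err before touching resp at all, because the response is nil whenever the transport returns an error.

```go
// Sketch only, not the actual Clair code: check the transport error before
// dereferencing resp, since resp is nil whenever err is non-nil.
package notifier

import (
	"fmt"
	"net/http"
)

func postAffectedManifests(c *http.Client, req *http.Request) error {
	resp, err := c.Do(req)
	if err != nil {
		// Do NOT read resp.StatusCode or resp.Status here; resp may be nil.
		return fmt.Errorf("affected manifests request failed: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}
```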
As you can see, the request body is 328 MB:
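As a rough sanity check (my own arithmetic, not anything from the code): 328 MB spread over ~143,000 vulnerabilities works out to roughly 2.3 KB of serialized data per vulnerability, which seems plausible for full vulnerability records, so the body size does look consistent with the size of this particular update operation.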
Here comes the catch, though. As far as I can tell, we initialize the HTTP server like this, which means we don't override the default value of ReadTimeout; the zero value means no timeout at all.
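To illustrate what I mean about the defaults (a generic net/http sketch, not our actual server setup): leaving the timeout fields at their zero values means the server will keep reading a request body of any size for as long as it takes, so a read timeout shouldn't be what's cutting the connection.

```go
// Generic illustration of the net/http defaults mentioned above, not the
// actual Clair server initialization.
package example

import (
	"net/http"
	"time"
)

// defaultServer leaves ReadTimeout/WriteTimeout at their zero values, which
// in net/http means "no timeout": the server will read a request body of any
// size for any amount of time.
func defaultServer(addr string, h http.Handler) *http.Server {
	return &http.Server{Addr: addr, Handler: h}
}

// boundedServer shows what explicit limits would look like, purely for
// contrast; I'm not suggesting these particular values.
func boundedServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:         addr,
		Handler:      h,
		ReadTimeout:  5 * time.Minute,
		WriteTimeout: 5 * time.Minute,
	}
}
```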
Also, here is the content of the u variable acquired here, in case it's interesting:
So here I am. It looks like the server (the indexer in this case) closes the connection when we try to POST a huge chunk of data, but I don't know why that happens. Could you please look into it? I could really use your expertise.
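In case it helps to isolate this from the notifier, this is the kind of standalone reproduction I could try (a sketch under my own assumptions; the URL and payload size are placeholders, and I haven't verified it triggers the same error):

```go
// Standalone sketch: POST a large body to an endpoint and report how the
// request fails. The URL and payload size below are placeholders.
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// ~300 MiB of zeroes, roughly the size of the failing request body.
	body := bytes.NewReader(make([]byte, 300<<20))
	resp, err := http.Post("http://localhost:6060/indexer-endpoint", "application/json", body)
	if err != nil {
		// This is where I'd expect to see the connection being closed.
		fmt.Println("request error:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```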
PS: Sorry for posting the debug data as images; unfortunately, VS Code doesn't have a way to dump it as text.
Expected Outcome
Actual Outcome
Environment
Clair version/image: Running clair from commit 5a6f1c3
Host OS: Reproducing both locally on Fedora 33 and also deployed in OpenShift v3.11.104