Deadlock/overload in gateway during spiking goroutine count #8381
Next steps options:
Various tools that may be useful:
I'm seeing logging backed up:
This appears to be backing everything else up a bit. However, I believe we observed the same problem without these log statements, so we can't immediately blame them. I'm going to drop these to "debug" so we can continue debugging.
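For context, the change is roughly along these lines (a sketch using the go-log API common across IPFS components; the subsystem name and message are illustrative, not the exact diff):

```go
package main

import (
	logging "github.com/ipfs/go-log/v2"
)

var log = logging.Logger("bitswap")

// handleWantlist stands in for a hot path that used to log at Info on every
// message; emitting at Debug keeps the data available for debugging without
// paying the cost at the default log level.
func handleWantlist(peer string, entries int) {
	log.Debugf("received wantlist from %s with %d entries", peer, entries)
}

func main() {
	// Verbose output can still be re-enabled while investigating.
	_ = logging.SetLogLevel("bitswap", "debug")
	handleWantlist("peer-A", 42)
}
```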
Let's try again with ipfs/go-bitswap#526.
We've been unable to confirm that logging is the bottleneck here. We deployed the change that reduced bitswap logging verbosity. We also suspected that logging to the serial console was creating the bottleneck, because journald writes it synchronously, but that was disabled and the issue remains. I also sometimes take stack dumps while a host is sick, and I don't see any goroutines blocked on logging (which doesn't rule it out entirely). The volume of logs is not very high, e.g. in dc12-b3 over the past hour we average ~48 log events per second into journald. So while we haven't ruled out logging entirely, we haven't found much additional evidence that supports it as the bottleneck. I've got two more things to try here: turn off peerlogs entirely on a host (this change is already queued up for deployment), and stress test journald just to sanity check its limits. After that I think we'll have enough data to conclude that logging is not the bottleneck.
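For the journald stress test, something along these lines should do (a sketch; it assumes stdout is captured by journald, e.g. by running the binary via systemd-run, and the rate is arbitrary):

```go
package main

import (
	"fmt"
	"time"
)

// Emit lines to stdout at a fixed rate. Run under systemd (e.g.
// `systemd-run ./journald-stress`) so journald captures stdout, then watch
// `journalctl -f` and journald CPU to see whether it keeps up at rates far
// above the ~48 events/sec we see in production.
func main() {
	const rate = 5000 // lines per second, arbitrary
	ticker := time.NewTicker(time.Second / rate)
	defer ticker.Stop()

	for i := 0; ; i++ {
		<-ticker.C
		fmt.Printf("journald stress line %d at %s\n", i, time.Now().Format(time.RFC3339Nano))
	}
}
```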
@guseggert : sounds good - thanks for the update. I assume we'll talk about what's next after we rule out logging. Also, do we have the pprof image of the dump? I assume that doesn't point in any particular direction.
Yes I have quite a few at this point :) Here's the original we took: stack.gz |
Coming back from our off-site. The logging was a red herring...we fixed a number of related issues which didn't improve the situation (although still important fixes):
During these goroutine spikes the hosts are CPU-bound and not IO-bound. CPU profiles show that the CPU is busy iterating and copying maps (Bitswap wantlists). Here are the most frequent stack traces:
@Stebalien has a PR here ipfs/go-bitswap#530 that reuses cached wantlists in certain scenarios. I deployed it to a box, but it did not help; the CPU still spins on a combination of map iteration and GC. Side note: mtail is showing up a lot in CPU profiles too, it uses an entire core to process the nginx access logs. I don't think it's the main culprit, but it's something to keep an eye on. I'm currently trying to make some optimizations to Bitswap so that it doesn't need to iterate and copy maps as much, which will take me some time since I am still ramping up on the codebase. Ultimately, though, we are missing the right kind of backpressure to prevent these situations where we take on more work than we can handle. There will always be room for optimizations, but the service should protect itself from falling over. In this case, it looks like we're hitting processing limits related to the number of requests, wantlist sizes, and number of peers...we can perhaps apply backpressure at that level. I also have a long list of metrics we can add to Bitswap to help future investigations; some of Bitswap's internals and mechanisms need more instrumentation.
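To illustrate the caching idea (this is not the go-bitswap code, just a minimal sketch of the pattern, with made-up names): instead of rebuilding a copy of the wantlist map on every read, keep a snapshot and invalidate it only when the wantlist changes.

```go
package main

import (
	"fmt"
	"sync"
)

type cid = string // stand-in for a real CID type

// Wantlist caches a snapshot of its entries so hot read paths don't
// iterate and copy the map on every call.
type Wantlist struct {
	mu       sync.Mutex
	entries  map[cid]int32 // cid -> priority
	snapshot []cid         // nil means stale
}

func NewWantlist() *Wantlist {
	return &Wantlist{entries: make(map[cid]int32)}
}

func (w *Wantlist) Add(c cid, priority int32) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.entries[c] = priority
	w.snapshot = nil // invalidate the cache only on writes
}

func (w *Wantlist) Remove(c cid) {
	w.mu.Lock()
	defer w.mu.Unlock()
	delete(w.entries, c)
	w.snapshot = nil
}

// Entries returns the cached snapshot, rebuilding it only when stale.
// Callers must treat the returned slice as read-only.
func (w *Wantlist) Entries() []cid {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.snapshot == nil {
		w.snapshot = make([]cid, 0, len(w.entries))
		for c := range w.entries {
			w.snapshot = append(w.snapshot, c)
		}
	}
	return w.snapshot
}

func main() {
	wl := NewWantlist()
	wl.Add("bafy...A", 1)
	fmt.Println(wl.Entries()) // built once
	fmt.Println(wl.Entries()) // served from the cache
}
```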
@guseggert can we broadcast fewer wants? I.e., one broadcast want per session? I agree we need better backpressure, but we should be able to cut down the work by several orders of magnitude. Can you also upload the new profiles?
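Roughly what I mean (a sketch only; the names and the selection heuristic are made up, this is not the real session code):

```go
package main

import (
	"fmt"
	"time"
)

type cid = string

// session tracks the blocks a single Bitswap session is waiting on.
type session struct {
	wanted map[cid]time.Time // cid -> when we started wanting it
}

// pickBroadcastWant chooses at most one outstanding want to broadcast to all
// peers (here, the longest-outstanding one) instead of broadcasting the whole
// wantlist; the rest would only be sent to peers the session already knows.
func (s *session) pickBroadcastWant() cid {
	var oldest cid
	var oldestAt time.Time
	for c, t := range s.wanted {
		if oldest == "" || t.Before(oldestAt) {
			oldest, oldestAt = c, t
		}
	}
	return oldest
}

func main() {
	s := &session{wanted: map[cid]time.Time{
		"bafy...A": time.Now().Add(-30 * time.Second),
		"bafy...B": time.Now(),
	}}
	fmt.Println("broadcasting only:", s.pickBroadcastWant())
}
```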
I haven't been able to focus on this much, but I do have some newer info. Bitswap is CPU-bound under certain workloads, and our public gateways do hit this bottleneck often, but I don't think it's the main contributor to the goroutine pileup...looking through metrics from past incidents on these hosts, the CPU bottleneck and the goroutine pileup are not strongly correlated. This morning I pushed out a change to Bitswap switching all mutexes to https://github.com/sasha-s/go-deadlock...I configured it to report goroutines stuck acquiring locks for more than 1 minute. The lib also tracks the last goroutine to acquire a stuck lock, and every log entry pointed to
Runtime behavior was quite different though: goroutines immediately started piling up at a different rate than normal, with no recovery, so I rolled back the change. Either I misconfigured it, or it has some sort of runtime impact that causes Bitswap to behave differently. Reading through the MarkUnresponsive code paths, I cannot find any deadlocks or work that could get stuck (take that with a grain of salt, I am still relatively new to the codebase). I'm suspicious of that result because the runtime behavior changed, but it could still be pointing in the right direction. For now I'm pausing my investigation into this goroutine issue and focusing instead on adding tracing, metrics, and synthetic monitors. On the CPU side, here's a CPU profile showing the bottleneck on a host at the 95% CPU utilization plateau: We should set scaling policies to keep the CPU from going that high; standard thresholds should work, and I think 80% would be a good threshold. In many regions we are routinely hitting the 95% plateau, which is likely impacting latency.
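For reference, the go-deadlock setup was along these lines (a sketch, not the exact deployed diff; the wrapped struct is illustrative):

```go
package main

import (
	"log"
	"time"

	"github.com/sasha-s/go-deadlock"
)

// protected shows the drop-in replacement: deadlock.Mutex instead of sync.Mutex.
type protected struct {
	lk   deadlock.Mutex
	data map[string]int
}

func main() {
	// Report goroutines stuck acquiring a lock for more than a minute, and
	// log instead of taking the library's default action when that happens.
	deadlock.Opts.DeadlockTimeout = time.Minute
	deadlock.Opts.OnPotentialDeadlock = func() {
		log.Println("potential deadlock: lock waited on for > 1m")
	}

	p := &protected{data: map[string]int{}}
	p.lk.Lock()
	p.data["x"] = 1
	p.lk.Unlock()
}
```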
@guseggert : will attach the large goroutine pprof.
@guseggert I've written up the idea I proposed in our discussion in ipfs/boxo#81 (well, a slight variant on it). We likely won't get to it for a while, but it's something to keep on the radar. |
@guseggert : can you please attach the large goroutine pprof, and then we'll park this for the time being.
@guseggert : what are the next steps for this? |
I think we encountered this bug in Thunderdome. I wrote some notes on it in Notion. I took a full profile, which I am investigating; you can get it at bafybeihlgdmnkkaawrge3yzqbocbkzf4mvbcgzxibq4tpwufklr4deqcyu The charts show elevated persistent goroutines and heap (yellow top line), and also an elevated number of peers.
I believe we encountered this issue yesterday; it brought our cluster to a crawl. Server 4 was actively selected by the cluster allocator during this time. We are running with a concurrency of 10 pins on all nodes, and our cluster has 17 nodes. The issue also, oddly, caused a large amount of read bandwidth, but the bandwidth appeared to go nowhere, as the network chart was nearly flat. Cluster: v1.0.4 (charts: Server 4, Server 5)
@Stebalien reported the deadlock below. This is caused either by a bug or by lack of graceful degradation handling where needed.
The node was up for 28 minutes, and this has been stuck for 19. Unfortunately, I can't find the goroutine that's actually blocking this because the stack trace was truncated.
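For next time: a full, untruncated goroutine dump (same format, including the wait durations) can be written with debug=2 via the standard runtime/pprof package; if the pprof HTTP handler is mounted, the same output is available at /debug/pprof/goroutine?debug=2. A minimal sketch:

```go
package main

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes every goroutine with its complete stack and wait
// reason/duration, in the same format as a crash traceback.
func dumpGoroutines(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// debug=2 produces full stacks rather than the aggregated summary.
	return pprof.Lookup("goroutine").WriteTo(f, 2)
}

func main() {
	if err := dumpGoroutines("goroutines.txt"); err != nil {
		panic(err)
	}
}
```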
Debugging fixtures for investigating this problem in #8380