
Deadlock/overload in gateway during spiking goroutine count #8381

Open
petar opened this issue Aug 25, 2021 · 15 comments

@petar
Contributor

petar commented Aug 25, 2021

@Stebalien reported the deadlock below. It is caused either by a bug or by a lack of graceful degradation where it is needed.

goroutine 13827044 [semacquire, 19 minutes]:
sync.runtime_SemacquireMutex(0xc0020fd8a4, 0xc064b9d800, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0020fd8a0)
	/usr/local/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
sync.(*RWMutex).Lock(0xc0020fd8a0)
	/usr/local/go/src/sync/rwmutex.go:98 +0x90
github.com/ipfs/go-bitswap/network.(*connectEventManager).Disconnected(0xc0020fd890, 0xc042f9e9f0, 0x22)
	pkg/mod/github.com/ipfs/[email protected]/network/connecteventmanager.go:49 +0x47
github.com/ipfs/go-bitswap/network.(*netNotifiee).Disconnected(0xc000c7e000, 0x23b4b40, 0xc000312a80, 0x23ae8e0, 0xc0539f1290)
	pkg/mod/github.com/ipfs/[email protected]/network/ipfs_impl.go:442 +0x98
github.com/libp2p/go-libp2p-swarm.(*Conn).doClose.func1.1(0x23a07a0, 0xc000c7e000)
	pkg/mod/github.com/libp2p/[email protected]/swarm_conn.go:85 +0x5b
github.com/libp2p/go-libp2p-swarm.(*Swarm).notifyAll.func1(0xc0636baa20, 0xc0558265d0, 0x23a07a0, 0xc000c7e000)
	pkg/mod/github.com/libp2p/[email protected]/swarm.go:559 +0x66
created by github.com/libp2p/go-libp2p-swarm.(*Swarm).notifyAll
	pkg/mod/github.com/libp2p/[email protected]/swarm.go:557 +0x118

The node was up for 28 minutes, and this has been stuck for 19. Unfortunately, I can't find the goroutine that's actually blocking this because the stack trace was truncated.

Debugging fixtures for investigating this problem are in #8380.
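
Since the dump above got cut off, here is a minimal sketch (file and function names are illustrative) of grabbing a full, untruncated goroutine dump from inside the process via runtime/pprof, which avoids the fixed-size buffer you get from runtime.Stack:

    // dumpgoroutines.go: write every goroutine's full stack to a file.
    package main

    import (
        "os"
        "runtime/pprof"
    )

    func dumpGoroutines(path string) error {
        f, err := os.Create(path)
        if err != nil {
            return err
        }
        defer f.Close()
        // debug=2 prints complete stacks in the same format as an unrecovered panic.
        return pprof.Lookup("goroutine").WriteTo(f, 2)
    }

    func main() {
        if err := dumpGoroutines("goroutines.txt"); err != nil {
            panic(err)
        }
    }

The same dump can usually be pulled from a running go-ipfs node over its API debug endpoint (/debug/pprof/goroutine?debug=2) without modifying code.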

@petar petar self-assigned this Aug 25, 2021
@petar petar changed the title from "Deadlock/overload in gateway Aug 23 incident" to "Deadlock/overload in gateway during spiking goroutine count" Aug 25, 2021
@BigLep BigLep assigned guseggert and unassigned petar Aug 31, 2021
@BigLep
Contributor

BigLep commented Aug 31, 2021

Next steps options:

  1. Post dump
  2. Post pprof image
  3. Substitute the lock in the code with a tracing lock and deploy to one box.
  4. See if we can simulate load to reproduce the issue locally. This would then enable us to see whether any code changes fix the issue.
  5. Pull in extra eyes on the code.

Various tools that may be useful:

  1. @whyrusleeping stackparse https://github.com/whyrusleeping/stackparse/
  2. https://github.com/linuxerwang/goroutine-inspect
  3. https://github.com/sasha-s/go-deadlock

@Stebalien
Member

I'm seeing logging backed up:

goroutine 98252740 [semacquire, 2 minutes]:
sync.runtime_SemacquireMutex(0xc00000e10c, 0xc2ab0fee00, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00000e108)
	/usr/local/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
go.uber.org/zap/zapcore.(*lockedWriteSyncer).Write(0xc00000e108, 0xc5f0da6000, 0xd9, 0x400, 0x2fd0ea0, 0x1f9d0c4, 0x6)
	pkg/mod/go.uber.org/[email protected]/zapcore/write_syncer.go:65 +0xe8
go.uber.org/zap/zapcore.(*ioCore).Write(0xc00018af30, 0x2, 0xc043a0aa2f8580d1, 0x7ef9a4ece61, 0x2fd0ea0, 0x1f9d0c4, 0x6, 0x20a7078, 0x2c, 0x1, ...)
	pkg/mod/go.uber.org/[email protected]/zapcore/core.go:90 +0x10c
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc5f0fcae40, 0xc4b621c080, 0x1, 0x2)
	pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:220 +0x12d
go.uber.org/zap.(*SugaredLogger).log(0xc000132108, 0x2, 0x20a7078, 0x2c, 0x0, 0x0, 0x0, 0xc5c9c5b750, 0x2, 0x2)
	pkg/mod/go.uber.org/[email protected]/sugar.go:227 +0x111
go.uber.org/zap.(*SugaredLogger).Errorw(...)
	pkg/mod/go.uber.org/[email protected]/sugar.go:191
github.com/ipfs/go-bitswap/internal/decision.(*Engine).ReceiveFrom(0xc000358d20, 0xc002671fb0, 0x26, 0xc317ef2240, 0x1, 0x1)
	pkg/mod/github.com/ipfs/[email protected]/internal/decision/engine.go:691 +0xbee
github.com/ipfs/go-bitswap.(*Bitswap).receiveBlocksFrom(0xc001812000, 0x23be9f0, 0xc000040078, 0xc002671fb0, 0x26, 0xc317ef21f0, 0x1, 0x1, 0x3002258, 0x0, ...)
	pkg/mod/github.com/ipfs/[email protected]/bitswap.go:487 +0x2e5
github.com/ipfs/go-bitswap.(*Bitswap).ReceiveMessage(0xc001812000, 0x23be9f0, 0xc000040078, 0xc002671fb0, 0x26, 0x23dfff8, 0xc25b8ee5a0)
	pkg/mod/github.com/ipfs/[email protected]/bitswap.go:547 +0x3c5
github.com/ipfs/go-bitswap/network.(*impl).handleNewStream(0xc0023d4f00, 0x23da598, 0xc5f746a7c0)
	pkg/mod/github.com/ipfs/[email protected]/network/ipfs_impl.go:423 +0x394
github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).SetStreamHandler.func1(0xc11cd16750, 0x13, 0x7fbda451a078, 0xc5f746a7c0, 0x1, 0x0)
	pkg/mod/github.com/libp2p/[email protected]/p2p/host/basic/basic_host.go:529 +0xa2
created by github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).newStreamHandler
	pkg/mod/github.com/libp2p/[email protected]/p2p/host/basic/basic_host.go:372 +0x63a

This appears to be backing everything else up a bit. However, I believe we observed the same problem without these log statements, so we can't immediately blame them. I'm going to drop these to "debug" so we can continue debugging.
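
For illustration, this is the shape of that demotion (the message and fields below are made up, not the actual go-bitswap statements): go-log loggers are zap SugaredLoggers, so the Errorw call seen blocking above becomes a Debugw call that is filtered out at the default error level:

    package main

    import (
        logging "github.com/ipfs/go-log/v2"
    )

    // Hypothetical logger standing in for go-bitswap's "engine" logger.
    var log = logging.Logger("example-bitswap")

    func main() {
        c, from := "bafy...example", "12D3Koo...example" // placeholder values

        // Before: emitted at error level on a hot path, contending on zap's
        // locked write syncer (the stack above).
        // log.Errorw("block not wanted", "cid", c, "peer", from)

        // After: only emitted when the subsystem is explicitly set to debug.
        log.Debugw("block not wanted", "cid", c, "peer", from)
    }

If we need these messages again during debugging, the subsystem can be raised back to debug at runtime (e.g. via go-log's SetLogLevel).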

@Stebalien
Member

Let's try again with ipfs/go-bitswap#526.

@guseggert
Contributor

guseggert commented Sep 3, 2021

We've been unable to confirm that logging is the bottleneck here. We deployed the change that reduced Bitswap logging verbosity. We also suspected that logging to the serial console was creating the bottleneck, since journald writes it synchronously, but that was disabled and the issue remains. I've also taken stack dumps while a host is sick and don't see any goroutines blocked on logging (which doesn't rule it out entirely). The volume of logs is not very high either: in dc12-b3 over the past hour we averaged ~48 log events per second into journald. So while we haven't ruled out logging entirely, we haven't found much additional evidence supporting it as the bottleneck.

I've got two more things to try here: turn off peerlogs entirely on a host (this change is already queued up for deployment), and stress-test journald just to sanity-check its limits. After that I think we'll have enough data to conclude that logging is not the bottleneck.

@BigLep
Contributor

BigLep commented Sep 3, 2021

@guseggert : sounds good - thanks for the update. I assume we'll talk about what's next after we rule out logging.

Also, do we have the pprof image of the dump? I assume it doesn't point in any particular direction.

@guseggert
Contributor

Yes I have quite a few at this point :) Here's the original we took: stack.gz

@guseggert
Contributor

guseggert commented Sep 23, 2021

Coming back from our off-site. The logging was a red herring: we fixed a number of related issues, none of which improved the situation (though they were still important fixes):

  • Stopped logging synchronously to the console, which has very limited bandwidth
  • Turned off peerlogs for all gateways (greatly reduces verbosity)
  • Stopped spamming journald

During these goroutine spikes the hosts are CPU-bound and not IO-bound. CPU profiles show that the CPU is busy iterating and copying maps (Bitswap wantlists). Here are the most frequent stack traces:

    local.runtime.mapiternext
    github.com/ipfs/go-bitswap/message.(*impl).Wantlist
    github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).sendMessage
    github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).sendIfReady
    github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).runQueue

    local.runtime.mapiternext
    local.runtime.mapiterinit
    github.com/ipfs/go-bitswap/message.(*impl).Wantlist
    github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).sendMessage
    github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).sendIfReady
    github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).runQueue

    local.runtime.mapiternext
    github.com/ipfs/go-bitswap/wantlist.(*Wantlist).Entries
    github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).extractOutgoingMessage
    github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).sendMessage
    github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).sendIfReady
    github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).runQueue

@Stebalien has a PR here, ipfs/go-bitswap#530, that reuses cached wantlists in certain scenarios. I deployed it to a box, but it did not help; the CPU still spins on a combination of map iteration and GC.
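
For reference, the general caching pattern under discussion looks roughly like this (my own sketch, not the go-bitswap#530 implementation; Entry is a simplified stand-in for a real wantlist entry): flatten the map once and reuse the slice until the next mutation, so the send loop stops re-iterating and re-copying on every message.

    package wantcache

    import "sync"

    // Entry is a simplified stand-in for a bitswap wantlist entry.
    type Entry struct {
        Key      string
        Priority int
    }

    type Wantlist struct {
        mu     sync.Mutex
        set    map[string]Entry
        cached []Entry // flattened snapshot; nil means stale
    }

    func New() *Wantlist {
        return &Wantlist{set: make(map[string]Entry)}
    }

    func (w *Wantlist) Add(e Entry) {
        w.mu.Lock()
        defer w.mu.Unlock()
        w.set[e.Key] = e
        w.cached = nil // any mutation invalidates the snapshot
    }

    func (w *Wantlist) Remove(key string) {
        w.mu.Lock()
        defer w.mu.Unlock()
        delete(w.set, key)
        w.cached = nil
    }

    // Entries returns a shared snapshot, rebuilding it only after a mutation.
    func (w *Wantlist) Entries() []Entry {
        w.mu.Lock()
        defer w.mu.Unlock()
        if w.cached == nil {
            w.cached = make([]Entry, 0, len(w.set))
            for _, e := range w.set {
                w.cached = append(w.cached, e)
            }
        }
        return w.cached
    }

The trade-off is that callers must treat the returned slice as read-only; if that can't be guaranteed, the copy is unavoidable and the win has to come from calling Entries less often.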

Side note: mtail shows up a lot in CPU profiles too; it uses an entire core to process the nginx access logs. I don't think it's the main culprit, but it's something to keep an eye on.

I'm currently trying to make some optimizations to Bitswap so that it doesn't need to iterate and copy maps as much, which will take me some time since I am still ramping up on the codebase.

Ultimately, though, we are missing the right kind of backpressure to prevent these situations where we take on more work than we can handle. There will always be room for optimizations, but the service should protect itself from falling over. In this case, it looks like we're hitting processing limits related to the number of requests, wantlist sizes, and the number of peers; we can perhaps apply backpressure at that level. I also have a long list of metrics we can add to Bitswap to help future investigations; some of Bitswap's internals and mechanisms need more instrumentation.
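
As a rough illustration of the backpressure I mean (my own sketch, not an existing Bitswap API): bound the number of inbound messages being processed at once, so a flood of peers and wants turns into waiting or shedding instead of unbounded goroutine and CPU growth.

    package backpressure

    import "context"

    // Limiter admits at most `capacity` concurrent units of work.
    type Limiter struct {
        slots chan struct{}
    }

    func NewLimiter(capacity int) *Limiter {
        return &Limiter{slots: make(chan struct{}, capacity)}
    }

    // Do blocks until a slot is free (or the context is cancelled), runs fn,
    // then releases the slot. Upstream callers see the delay and slow down
    // instead of piling more work onto an already saturated node.
    func (l *Limiter) Do(ctx context.Context, fn func()) error {
        select {
        case l.slots <- struct{}{}:
        case <-ctx.Done():
            return ctx.Err()
        }
        defer func() { <-l.slots }()
        fn()
        return nil
    }

A message handler would wrap its per-message work in limiter.Do(ctx, ...), with the capacity sized from the request/wantlist/peer limits mentioned above.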

@Stebalien
Member

@guseggert can we broadcast fewer wants? I.e., one broadcast want per session? I agree we need better backpressure, but we should be able to cut down the work by several orders of magnitude.

Can you also upload the new profiles?

@guseggert
Contributor

I haven't been able to focus on this much, but I do have some newer info. Bitswap is CPU-bound under certain workloads, and our public gateways do hit this bottleneck often, but I don't think it's the main contributor to the goroutine pileup: looking back through metrics from when hosts have experienced this, the two are not strongly correlated.

This morning I pushed out a change to Bitswap switching all mutexes to https://github.com/sasha-s/go-deadlock, configured to report goroutines stuck acquiring a lock for more than one minute. The library also tracks the last goroutine to acquire a stuck lock, and every log entry pointed to connectEventManager.MarkUnresponsive:

POTENTIAL DEADLOCK:
Previous place where the lock was grabbed
goroutine 39080090 lock 0xc002991150
pkg/mod/github.com/ipfs/[email protected]/network/connecteventmanager.go:67 network.(*connectEventManager).MarkUnresponsive ??? <<<<<

Runtime behavior was quite different, though: goroutines immediately started piling up at a different rate than normal, with no recovery, so I rolled back the change. Either I misconfigured it, or it has some runtime impact that causes Bitswap to behave differently. Reading through the MarkUnresponsive code paths, I cannot find any deadlocks or work that could get stuck (take that with a grain of salt; I am still relatively new to the codebase). I'm suspicious of the result because the runtime behavior changed, but it could still be pointing in the right direction.
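
For context, this is roughly how the instrumentation was wired up (a sketch, not the exact patch): go-deadlock's Mutex/RWMutex are drop-in replacements for sync's, with a global timeout after which it reports, rather than kills, stuck lock acquisitions.

    package main

    import (
        "os"
        "time"

        "github.com/sasha-s/go-deadlock"
    )

    // Stand-in for the real struct in go-bitswap/network; only the mutex type changes.
    type connectEventManager struct {
        lk deadlock.RWMutex // was sync.RWMutex
    }

    func main() {
        deadlock.Opts.DeadlockTimeout = time.Minute // report locks stuck for > 1 min
        deadlock.Opts.LogBuf = os.Stderr            // where "POTENTIAL DEADLOCK" reports go
        deadlock.Opts.OnPotentialDeadlock = func() {
            // The default exits the process; override it so the gateway keeps
            // running and we just collect the reports.
        }

        var m connectEventManager
        m.lk.Lock()
        defer m.lk.Unlock()
    }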

For now I'm pausing my investigation into this goroutine issue and focusing instead on adding tracing, metrics, and synthetic monitors.

On the CPU side, here's a CPU profile showing the bottleneck on a host at the 95% CPU utilization plateau:

cpuprofile-1.txt

We should set scaling policies to keep CPU from getting that high; standard thresholds should work, and I think 80% would be a good target. In many regions we are routinely hitting the 95% plateau, which is likely impacting latency.

@BigLep
Contributor

BigLep commented Oct 5, 2021

@guseggert : will attach the large goroutine pprof.

@BigLep BigLep added this to the go-ipfs 0.13 milestone Oct 5, 2021
@Stebalien
Member

@guseggert I've written up the idea I proposed in our discussion in ipfs/boxo#81 (well, a slight variant on it). We likely won't get to it for a while, but it's something to keep on the radar.

@BigLep
Contributor

BigLep commented Oct 11, 2021

@guseggert : can you please attach the large goroutine pprof? Then we'll park this for the time being.

@BigLep
Contributor

BigLep commented Mar 3, 2022

@guseggert : what are the next steps for this?

@BigLep BigLep modified the milestones: go-ipfs 0.13, TBD Mar 3, 2022
@BigLep BigLep moved this to 🥞 Todo in IPFS Shipyard Team Mar 3, 2022
@iand
Contributor

iand commented Sep 8, 2022

I think we encountered this bug in Thunderdome. I wrote some notes on it in Notion.

I took a full profile, which I am investigating. You can get it at bafybeihlgdmnkkaawrge3yzqbocbkzf4mvbcgzxibq4tpwufklr4deqcyu

Elevated persistent goroutines and heap (yellow top line):

[Snapshot_2022-09-07_16-28-57]

[Snapshot_2022-09-07_16-29-15]

Also an elevated number of peers:

[Snapshot_2022-09-07_16-29-26]

@jtsmedley

jtsmedley commented Oct 27, 2022

I believe we encountered this issue yesterday; it brought our cluster to a crawl. Server 4 was actively selected by the cluster allocator during this time. We are running with a concurrency of 10 pins on all nodes, and our cluster has 17 nodes. The issue also, oddly, caused a large amount of read bandwidth, but that bandwidth appeared to go nowhere, as the network chart was nearly flat.

Cluster: v1.0.4
Kubo: v0.15.0

Server 4

[Grafana screenshots: 2022-10-27-09-30-grafana.filebase.com]

Server 5

[Grafana screenshots: 2022-10-27-09-32-grafana.filebase.com]
