Deadlock/overload in gateway during spiking goroutine count #8381
Next steps options:
Various tools that may be useful:
I'm seeing logging backed up:
This appears to be backing everything else up a bit. However, I believe we observed the same problem without these log statements, so we can't immediately blame them. I'm going to drop these to "debug" so we can continue debugging.
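For context, the change is roughly along these lines (a sketch using the go-log API common across IPFS components; the subsystem name and message are illustrative, not the exact diff):

```go
package main

import (
	logging "github.com/ipfs/go-log/v2"
)

var log = logging.Logger("bitswap")

// handleWantlist stands in for a hot path that used to log at Info on every
// message; emitting at Debug keeps the data available for debugging without
// paying the cost at the default log level.
func handleWantlist(peer string, entries int) {
	log.Debugf("received wantlist from %s with %d entries", peer, entries)
}

func main() {
	// Verbose output can still be re-enabled while investigating.
	_ = logging.SetLogLevel("bitswap", "debug")
	handleWantlist("peer-A", 42)
}
```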
Let's try again with ipfs/go-bitswap#526.
We've been unable to confirm that logging is the bottleneck here. We deployed the change that reduced bitswap logging verbosity. We also suspected that logging to the serial console was creating the bottleneck, because journald writes it synchronously, but that was disabled and the issue remains. I also sometimes take stack dumps while a host is sick, and I don't see any goroutines blocked on logging (which doesn't rule it out entirely). The volume of logs is not very high, e.g. in dc12-b3 over the past hour we average ~48 log events per second into journald. So while we haven't ruled out logging entirely, we haven't found much additional evidence that supports it as the bottleneck. I've got two more things to try here: turn off peerlogs entirely on a host (this change is already queued up for deployment), and stress test journald just to sanity check its limits. After that I think we'll have enough data to conclude that logging is not the bottleneck.
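For the journald stress test, something along these lines should do (a sketch; it assumes stdout is captured by journald, e.g. by running the binary via systemd-run, and the rate is arbitrary):

```go
package main

import (
	"fmt"
	"time"
)

// Emit lines to stdout at a fixed rate. Run under systemd (e.g.
// `systemd-run ./journald-stress`) so journald captures stdout, then watch
// `journalctl -f` and journald CPU to see whether it keeps up at rates far
// above the ~48 events/sec we see in production.
func main() {
	const rate = 5000 // lines per second, arbitrary
	ticker := time.NewTicker(time.Second / rate)
	defer ticker.Stop()

	for i := 0; ; i++ {
		<-ticker.C
		fmt.Printf("journald stress line %d at %s\n", i, time.Now().Format(time.RFC3339Nano))
	}
}
```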
@guseggert : sounds good - thanks for the update. I assume we'll talk about what's next after we rule out logging. Also, do we have the pprof image of the dump? I assume that doesn't point in any particular direction.
Yes I have quite a few at this point :) Here's the original we took: stack.gz |
Coming back from our off-site. The logging was a red herring...we fixed a number of related issues which didn't improve the situation (although still important fixes):
During these goroutine spikes the hosts are CPU-bound and not IO-bound. CPU profiles show that the CPU is busy iterating and copying maps (Bitswap wantlists). Here are the most frequent stack traces:
@Stebalien has a PR here ipfs/go-bitswap#530 that reuses cached wantlists in certain scenarios. I deployed it to a box, but it did not help; the CPU still spins on a combination of map iteration and GC. Side note: mtail is showing up a lot in CPU profiles too, it uses an entire core to process the nginx access logs. I don't think it's the main culprit, but it's something to keep an eye on. I'm currently trying to make some optimizations to Bitswap so that it doesn't need to iterate and copy maps as much, which will take me some time since I am still ramping up on the codebase. Ultimately, though, we are missing the right kind of backpressure to prevent these situations where we take on more work than we can handle. There will always be room for optimizations, but the service should protect itself from falling over. In this case, it looks like we're hitting processing limits related to the number of requests, wantlist sizes, and number of peers...we can perhaps apply backpressure at that level. I also have a long list of metrics we can add to Bitswap to help future investigations; some of Bitswap's internals and mechanisms need more instrumentation.
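To illustrate the caching idea (this is not the go-bitswap code, just a minimal sketch of the pattern, with made-up names): instead of rebuilding a copy of the wantlist map on every read, keep a snapshot and invalidate it only when the wantlist changes.

```go
package main

import (
	"fmt"
	"sync"
)

type cid = string // stand-in for a real CID type

// Wantlist caches a snapshot of its entries so hot read paths don't
// iterate and copy the map on every call.
type Wantlist struct {
	mu       sync.Mutex
	entries  map[cid]int32 // cid -> priority
	snapshot []cid         // nil means stale
}

func NewWantlist() *Wantlist {
	return &Wantlist{entries: make(map[cid]int32)}
}

func (w *Wantlist) Add(c cid, priority int32) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.entries[c] = priority
	w.snapshot = nil // invalidate the cache only on writes
}

func (w *Wantlist) Remove(c cid) {
	w.mu.Lock()
	defer w.mu.Unlock()
	delete(w.entries, c)
	w.snapshot = nil
}

// Entries returns the cached snapshot, rebuilding it only when stale.
// Callers must treat the returned slice as read-only.
func (w *Wantlist) Entries() []cid {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.snapshot == nil {
		w.snapshot = make([]cid, 0, len(w.entries))
		for c := range w.entries {
			w.snapshot = append(w.snapshot, c)
		}
	}
	return w.snapshot
}

func main() {
	wl := NewWantlist()
	wl.Add("bafy...A", 1)
	fmt.Println(wl.Entries()) // built once
	fmt.Println(wl.Entries()) // served from the cache
}
```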
@guseggert can we broadcast fewer wants? I.e., one broadcast want per session? I agree we need better backpressure, but we should be able to cut down the work by several orders of magnitude. Can you also upload the new profiles?
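Roughly what I mean (a sketch only; the names and the selection heuristic are made up, this is not the real session code):

```go
package main

import (
	"fmt"
	"time"
)

type cid = string

// session tracks the blocks a single Bitswap session is waiting on.
type session struct {
	wanted map[cid]time.Time // cid -> when we started wanting it
}

// pickBroadcastWant chooses at most one outstanding want to broadcast to all
// peers (here, the longest-outstanding one) instead of broadcasting the whole
// wantlist; the rest would only be sent to peers the session already knows.
func (s *session) pickBroadcastWant() cid {
	var oldest cid
	var oldestAt time.Time
	for c, t := range s.wanted {
		if oldest == "" || t.Before(oldestAt) {
			oldest, oldestAt = c, t
		}
	}
	return oldest
}

func main() {
	s := &session{wanted: map[cid]time.Time{
		"bafy...A": time.Now().Add(-30 * time.Second),
		"bafy...B": time.Now(),
	}}
	fmt.Println("broadcasting only:", s.pickBroadcastWant())
}
```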
I haven't been able to focus on this much, but I do have some newer info. Bitswap is CPU-bound under certain workloads, and our public gateways do hit this bottleneck often, but I don't think it's the main contributor to the goroutine pileup...looking through metrics from past incidents on these hosts, the CPU bottleneck and the goroutine pileup are not strongly correlated. This morning I pushed out a change to Bitswap switching all mutexes to https://github.com/sasha-s/go-deadlock...I configured it to report goroutines stuck acquiring locks for more than 1 minute. The lib also tracks the last goroutine to acquire a stuck lock, and every log entry pointed to
Runtime behavior was quite different though: goroutines immediately started piling up at a different rate than normal, with no recovery, so I rolled back the change. Either I misconfigured it, or it has some sort of runtime impact that causes Bitswap to behave differently. Reading through the MarkUnresponsive code paths, I cannot find any deadlocks or work that could get stuck (take that with a grain of salt, I am still relatively new to the codebase). I'm suspicious of that result because the runtime behavior changed, but it could still be pointing in the right direction. For now I'm pausing my investigation into this goroutine issue and focusing instead on adding tracing, metrics, and synthetic monitors. On the CPU side, here's a CPU profile showing the bottleneck on a host at the 95% CPU utilization plateau: We should set scaling policies to keep the CPU from going that high; standard thresholds should work, and I think 80% would be a good threshold. In many regions we are routinely hitting the 95% plateau, which is likely impacting latency.
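For reference, the go-deadlock setup was along these lines (a sketch, not the exact deployed diff; the wrapped struct is illustrative):

```go
package main

import (
	"log"
	"time"

	"github.com/sasha-s/go-deadlock"
)

// protected shows the drop-in replacement: deadlock.Mutex instead of sync.Mutex.
type protected struct {
	lk   deadlock.Mutex
	data map[string]int
}

func main() {
	// Report goroutines stuck acquiring a lock for more than a minute, and
	// log instead of taking the library's default action when that happens.
	deadlock.Opts.DeadlockTimeout = time.Minute
	deadlock.Opts.OnPotentialDeadlock = func() {
		log.Println("potential deadlock: lock waited on for > 1m")
	}

	p := &protected{data: map[string]int{}}
	p.lk.Lock()
	p.data["x"] = 1
	p.lk.Unlock()
}
```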
@guseggert : will attach the large goroutine pprof.
@guseggert I've written up the idea I proposed in our discussion in ipfs/boxo#81 (well, a slight variant on it). We likely won't get to it for a while, but it's something to keep on the radar. |
@guseggert : can you please attach the large goroutine pprof, and then we'll park this for the time being.
@guseggert : what are the next steps for this? |
I think we encountered this bug in Thunderdome. I wrote some notes on it in Notion. I took a full profile, which I am investigating; you can get it at bafybeihlgdmnkkaawrge3yzqbocbkzf4mvbcgzxibq4tpwufklr4deqcyu The charts show elevated persistent goroutines and heap (yellow top line), and also an elevated number of peers.
I believe we encountered this issue yesterday; it brought our cluster to a crawl. Server 4 was actively selected by the cluster allocator during this time. We are running with a concurrency of 10 pins on all nodes, and our cluster has 17 nodes. The issue also, oddly, caused a large amount of read bandwidth, but the bandwidth appeared to go nowhere, as the network chart was nearly flat. Cluster: v1.0.4 (charts: Server 4, Server 5)
@Stebalien reported the deadlock below. This is caused either by a bug or by lack of graceful degradation handling where needed.
The node was up for 28 minutes, and this has been stuck for 19. Unfortunately, I can't find the goroutine that's actually blocking this because the stack trace was truncated.
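For next time: a full, untruncated goroutine dump (same format, including the wait durations) can be written with debug=2 via the standard runtime/pprof package; if the pprof HTTP handler is mounted, the same output is available at /debug/pprof/goroutine?debug=2. A minimal sketch:

```go
package main

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes every goroutine with its complete stack and wait
// reason/duration, in the same format as a crash traceback.
func dumpGoroutines(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// debug=2 produces full stacks rather than the aggregated summary.
	return pprof.Lookup("goroutine").WriteTo(f, 2)
}

func main() {
	if err := dumpGoroutines("goroutines.txt"); err != nil {
		panic(err)
	}
}
```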
Debugging fixtures for investigating this problem in #8380