Thread tracing causes process to hang #1910
Comments
I've been debugging this exact issue in Resque for the past few days without much luck and happened to be investigating an entirely different NR issue when I just came across this! 😍 I'm going to disable thread instrumentation and let it run for the weekend. I'll report back early next week if we had any more stuck processes.

With regards to the issue we're seeing, it's the same basic info as @parkerfinch but in Resque, not Sidekiq. Long-running jobs seem to be the ones affected; we never see the problem manifest on short jobs. CPU gets pinned at 100% after the job is done and the processes don't exit cleanly like they should. The only solution I've found is to manually kill the proc.

🤞🏼 it's as easy as disabling threading instrumentation.
Hi @parkerfinch and @travisbell,

Thanks very much for bringing this issue to our attention. We place a huge emphasis and priority on delivering maximum value while running at as low of an overhead as possible, and a 100% CPU spike certainly is out of alignment with our goal. I'm very glad @parkerfinch was able to pinpoint the new-in-9.0 defaulting of `instrumentation.thread.tracing` to enabled.

As @parkerfinch points out, thread tracing was added in v8.7 but not enabled by default, so as to avoid conflicting with existing custom user solutions for Thread instances. Then for v9.0 we were able to leverage semver and flip the default to enabled. For multithreaded apps, the Thread-based categorization can be super neat to see in the UI, but the functionality is certainly not worth a CPU spike and shouldn't be causing one.

We always welcome all the assistance we can get on this open source agent software. If you're able to produce a reproduction that doesn't require sharing any application-specific code or customer data, we'd love to use it. But we don't require it and will try to put together a repro ourselves.

We'll keep you posted with updates here on this GitHub issue. In the meantime, and depending on your support level with New Relic, you can also engage the New Relic Support Community or open a support ticket. When doing so, please provide a link to this GitHub issue.
Thanks @fallwith. Alright, 3 nights of running Resque without thread tracing and we didn't have a single hanging process. That's the first time in ~3 weeks I can say that, so I feel pretty confident this was our culprit. Hard to imagine it's a coincidence at this point, but I will reply back in case I see it happen again. To add some extra details about our ENV:
And to reiterate, it seems to affect long-running jobs and not short, quick ones. Long-running for us means ~2-3 hours.
Interesting that it only affects jobs running for 2-3 hours on your end; we've seen it with jobs running as quickly as 13 minutes. In case it helps, the general "shape" of the job that we're seeing this issue on is:
I'll admit that the work profile for the jobs we've seen it with does sound a lot like what you describe. The main area we saw this around was a job with a long-running loop (that does many tens of thousands of DB calls) and another job that is an export of sorts, i.e. the same tens of thousands of DB calls plus writing to a file for an hour or two. Maybe that likeness is enough for NR to at least get pointed in the right direction.
@parkerfinch @travisbell thank you both for the updates. We're working on identifying a root cause and will let you know as soon as we have one. So far we've done a lot of investigation on our end.

Our current hypothesis is that the agent's existing database and/or background job instrumentation may ultimately be to blame for the hang, given that you both have those ingredients in play, and that the thread tracing support is simply bringing the problem to the surface.

Questions for you:
I haven't been able to find any! But I wouldn't consider myself an expert on navigating the dashboard, so it's totally possible that I'm missing something.

I'm not sure how best to determine this. What should I expect to see if the New Relic agent is continuing to report data while the process is hanging?
Thanks, @parkerfinch! If the agent itself is hogging all of the CPU but still managing to operate, its activity can be seen by temporarily having it run in debug level logging mode, and/or by enabling the optional audit log. These are the two relevant settings: `log_level: debug` and `audit_log.enabled: true`. Those options aren't good for production use, but they are helpful for temporary debugging. With debug level logging, the agent's ongoing activity will show up in its log file.

On our end, we now have a repro that will spike the CPU and hang the monitored application for a period of time, but only when thread instrumentation is enabled. The application eventually recovers after the agent has finished going through a backlog of thread-related work, so it's not a complete and permanent hang we're seeing.

We are now evaluating a variety of ways to improve things while keeping thread instrumentation enabled, and we will share our findings and improvement PRs here as they become available. If anyone else has a repro that they can share that does not include any sensitive data, we'd be happy to test the fix against it. Otherwise, we'll have a feature branch prepped that others can pull down and test with.
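For anyone following along, here is a sketch of where those two settings could live in `newrelic.yml`; the nested form for `audit_log` is an assumption based on how the agent's dotted option names usually map into the YAML file:

```yaml
# newrelic.yml -- temporary debugging settings; not recommended for normal production use
common: &default_settings
  log_level: debug   # verbose agent logging
  audit_log:
    enabled: true    # optional audit log of agent activity
```

Environment-variable overrides such as `NEW_RELIC_LOG_LEVEL=debug` should also work, though that spelling is an assumption from the agent's usual `NEW_RELIC_*` naming convention.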
Same issue here. In our case, it was a long-running (~3 hours) Sidekiq job executing a large number of queries.
Below is a backtrace taken with the TTIN signal for a hung process.
Looking at this backtrace, I think #1927 is pointing in the right direction.
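As an aside for anyone else trying to capture a similar dump: Sidekiq prints a backtrace for every live thread when it receives `TTIN`, so something along these lines can snapshot a hung worker (the pid-file path below is only an illustration; use whatever pid source fits your deployment):

```ruby
# Send TTIN to a hung Sidekiq process so it dumps per-thread backtraces to its log.
# The pid file path is illustrative; substitute your own way of finding the pid.
pid = Integer(File.read("tmp/pids/sidekiq.pid").strip)
Process.kill("TTIN", pid)
```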
@wata727, thank you for sharing your encounter with a similar problem. This issue is our highest priority at the moment, and the maintainers are reviewing the change as we speak. The changes are at the core of how the agent functions, so we are taking more time to ensure they are robust and don't cause other unintended issues.
Hi @parkerfinch, @travisbell, @wata727! 👋 We have a potential fix for this issue that we'd like you to test. To do this, please update your Gemfile's reference to the `newrelic_rpm` gem so that it points at the fix branch.

Then bundle and make sure the agent that gets installed is coming from that branch. If you decide to try it out, please let us know how it goes.
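For reference, pointing the Gemfile at a branch of the agent repository looks roughly like this; the branch name below is a placeholder, not the actual fix branch referenced in the original comment:

```ruby
# Gemfile -- temporarily source the agent from a git branch for testing
gem "newrelic_rpm",
    git: "https://github.com/newrelic/newrelic-ruby-agent.git",
    branch: "FIX_BRANCH_NAME" # placeholder; use the branch provided by the maintainers
```

After `bundle update newrelic_rpm`, `bundle info newrelic_rpm` should confirm that the gem is being loaded from the git source.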
Hi @kaylareopelle, fantastic work. I don't mind bumping to do some testing for you guys. I'll deploy later today and report back what I find out.
2 full days of running the fix branch and we haven't seen a single hung process.
That's fantastic news, @travisbell. Thanks for testing the branch! This code will be included in our next release.
Work has been completed on this issue.
Hi @travisbell, @parkerfinch! The changes provided by #1927 have been released in v9.2.0 of the `newrelic_rpm` gem.
Description
We recently attempted to upgrade the version of the agent we use from 8.16 to 9.0 and ran into an issue where, after our code had completed a Sidekiq job, the process would be spinning at 100% CPU usage seemingly indefinitely.
We were able to prevent this issue by disabling thread tracing (setting `instrumentation.thread.tracing` to `false`), so it seems like this issue may have been present for a while but only manifested with the automatic enabling of thread tracing that was included in 9.0.

Notably, this only seems to occur for a particular job that we have. We use Sidekiq for job processing, and this issue would manifest for a particularly long-running Sidekiq job that made many database requests. Reducing the number of DB requests made by that job prevented this issue from manifesting.
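For anyone hitting this before a fix ships, a minimal sketch of the workaround described above, assuming the standard `newrelic.yml` layout (the nested YAML form of the dotted option name is an assumption based on the agent's usual conventions):

```yaml
# newrelic.yml -- workaround: turn thread tracing back off (the pre-9.0 default)
common: &default_settings
  instrumentation:
    thread:
      tracing: false
```

An environment-variable override such as `NEW_RELIC_INSTRUMENTATION_THREAD_TRACING=false` should be equivalent, though that spelling is likewise an assumption from the agent's usual `NEW_RELIC_*` naming.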
Expected Behavior
We expect the New Relic agent to publish the traces quickly and without using 100% of a CPU.
Steps to Reproduce
I wish I had more specific steps for you here; the job we're looking at takes over ten minutes to run, so it's a long debug cycle. We were not able to reproduce the issue when running the job outside of Sidekiq, so it seems like it may be some interaction between the threading used by Sidekiq and the thread tracing done by New Relic.
If this seems worth digging into, then I can look further into creating a reproducible example. Let me know!
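To make the job "shape" concrete, here is a minimal sketch of the kind of worker described in this thread: a long-running Sidekiq job issuing a very large number of database calls. The class name, argument, and the ActiveRecord call are hypothetical stand-ins rather than the actual job, and the snippet assumes an app where an ActiveRecord connection is already established.

```ruby
# A hypothetical stand-in for the problematic job shape: long-running, many DB round trips.
require "sidekiq"

class LongRunningReportJob
  include Sidekiq::Job # alias of Sidekiq::Worker in Sidekiq 6.3+

  def perform(query_count)
    query_count.times do
      # Stand-in for the "tens of thousands of DB calls" reported above;
      # assumes an ActiveRecord connection is available in the worker process.
      ActiveRecord::Base.connection.execute("SELECT 1")
    end
  end
end

# Enqueue, e.g.: LongRunningReportJob.perform_async(50_000)
```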
Your Environment
Running on Heroku 18 (Ubuntu 18.04).
New Relic Ruby Agent: 9.0.0
Ruby: 3.1.2
Sidekiq: 6.5.6
For Maintainers Only or Hero Triaging this bug
Suggested Priority (P1,P2,P3,P4,P5):
Suggested T-Shirt size (S, M, L, XL, Unknown):