Thread tracing causes process to hang #1910

Closed
parkerfinch opened this issue Mar 24, 2023 · 17 comments
Labels
bug, community (To tag external issues and PRs submitted by the community), P2-Bug (Priority 2 Bug)

Comments

@parkerfinch

Description

We recently attempted to upgrade the version of the agent we use from 8.16 to 9.0 and ran into an issue where, after our code had completed a Sidekiq job, the process would spin at 100% CPU usage, seemingly indefinitely.

We were able to prevent this issue by disabling thread tracing (setting instrumentation.thread.tracing to false), so it seems like the issue may have been present for a while but only manifested with the automatic enabling of thread tracing that came with 9.0.
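
For reference, a rough sketch of that workaround in newrelic.yml, assuming the usual nested YAML layout:

  # Sketch only: place under the shared common section (or the relevant
  # environment) in newrelic.yml
  common: &default_settings
    instrumentation:
      thread:
        tracing: false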

Notably, this only seems to occur for a particular job that we have. We use Sidekiq for job processing, and this issue would manifest for a particularly long-running Sidekiq job that made many database requests. Reducing the number of DB requests made by that job prevented this issue from manifesting.

Expected Behavior

We expect the New Relic agent to publish the traces quickly and without using 100% of a CPU.

Steps to Reproduce

I wish I had more specific steps for you here. The job we're looking at takes over ten minutes to run, so it's a long debug cycle. We were not able to reproduce the issue when running the job outside of Sidekiq, so the problem seems to involve some interaction between the threading used by Sidekiq and the thread tracing done by New Relic.

If this seems worth digging into, I can look further into creating a reproducible example. Let me know!

Your Environment

Running on Heroku 18 (Ubuntu 18.04).
New Relic Ruby Agent: 9.0.0
Ruby: 3.1.2
Sidekiq: 6.5.6

For Maintainers Only or Hero Triaging this bug

Suggested Priority (P1,P2,P3,P4,P5):
Suggested T-Shirt size (S, M, L, XL, Unknown):


github-actions bot added the community (To tag external issues and PRs submitted by the community) label on Mar 24, 2023
@travisbell

travisbell commented Mar 24, 2023

I've been debugging this exact issue in Resque for the past few days without much luck, and happened to be investigating an entirely different NR issue when I came across this! 😍 I'm going to disable thread instrumentation and let it run for the weekend. I'll report back early next week on whether we had any more stuck processes.

With regards to the issue we're seeing, it's the same basic info as @parkerfinch but in Resque, not Sidekiq. Long-running jobs seem to be the ones affected; we never see the problem manifest on short jobs. CPU gets pinned at 100% after the job is done, and the processes don't exit cleanly like they should. The only solution I've found is to manually kill the proc.

🤞🏼 it's as easy as disabling threading instrumentation.

@fallwith
Contributor

Hi @parkerfinch and @travisbell,

Thanks very much for bringing this issue to our attention. We place a huge emphasis and priority on delivering maximum value while running with as low an overhead as possible, and a 100% CPU spike is certainly out of alignment with that goal.

I'm very glad @parkerfinch was able to pinpoint the new-in-9.0 defaulting to true (enabled) for instrumentation.thread.tracing as a root cause. And @travisbell we eagerly await the results of your weekend run.

(As @parkerfinch points out) Thread tracing was added in v8.7 but not enabled by default, so as to avoid conflicting with existing custom user solutions for Thread instances. Then for v9.0 we were able to leverage semver and flip the default to enabled. For multithreaded apps, the Thread-based categorization can be super neat to see in the UI, but the functionality is certainly not worth a CPU spike and shouldn't be causing one.

We always welcome all the assistance we can get on this open source agent software. If you're able to produce a reproduction that doesn't require sharing any application specific code or customer data, we'd love to use it. But we don't require it and will try to put together a repro ourselves. We'll keep you posted with updates here on this GitHub issue.

In the meantime and depending on your support level with New Relic, you can also engage the New Relic Support Community or open a support ticket. When doing so, please provide a link to this GitHub issue.

@travisbell

travisbell commented Mar 27, 2023

Thanks @fallwith.

Alright, 3 nights of running Resque without thread tracing and we didn't have a single hanging process. That's the first time in ~3 weeks I can say that, so I feel pretty confident this was our culprit. It's hard to imagine it's a coincidence at this point, but I will reply back in case I see it happen again.

To add some extra details about our ENV:

  • New Relic Ruby Agent 9.0.0
  • Ruby 3.2.1
  • Resque 2.5.0
  • resque-multi-job-forks

And to reiterate, it seems to affect long-running jobs and not short, quick ones. Long-running for us means ~2-3 hours.

@parkerfinch
Author

> And to reiterate, it seems to affect long-running jobs and not short, quick ones. Long-running for us means ~2-3 hours.

Interesting that it only affects jobs running for 2-3 hours on your end; we've seen it with jobs running as quickly as 13 minutes.

In case it helps, the general "shape" of the job where we're seeing this issue is (rough sketch below):

  1. Do a bunch (tens of thousands) of small database reads (each returning a single row from a table) in a loop.
  2. Do one large write.
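
A rough, purely hypothetical sketch of that shape (not our actual code; the model names are made up):

  class ExampleReportJob
    include Sidekiq::Job

    def perform(record_ids)
      rows = record_ids.map do |id|
        # tens of thousands of tiny single-row reads in a loop
        Record.find(id)
      end

      # followed by one large write at the end
      Summary.create!(payload: rows.map(&:attributes))
    end
  end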

@travisbell

I'll admit that the work profile of the jobs we've seen it with does sound a lot like what you describe. The main area we saw this was a job with a long-running loop (that makes many tens of thousands of DB calls) and another job that is an export of sorts, i.e. the same tens of thousands of DB calls plus writing to a file for an hour or two.

Maybe that likeness is enough for NR to at least get pointed in the right direction.

@fallwith
Contributor

@parkerfinch @travisbell thank you both for the updates. We're working on identifying a root cause and will let you know as soon as we have one. So far we've done lots of Thread-focused testing with short-lived and long-lived processes running anywhere from a dozen threads to 100,000, and haven't yet managed to generate a hang or agent-induced CPU overhead.

Our current hypothesis is that the agent's database and/or background job instrumentation is key to the problem, given that you have both of those ingredients in play.

The agent's Thread instrumentation does 2 really cool things:

  • When instances of Thread are spawned, we record this to help users see what took place in which thread
  • For existing agent instrumentation for databases, background jobs, NoSQL stores, net clients, etc., activity taking place in a new thread would previously have been hidden from the agent. With instrumentation.thread.tracing enabled, the agent now does a much improved job of operating in an async context.

In short, our existing database and/or background job instrumentation may ultimately be to blame for the hang, with the thread support simply bringing that to the surface.
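
To make the second bullet concrete, here is a contrived Ruby sketch (the model name is hypothetical). With instrumentation.thread.tracing enabled, the agent records the thread spawn and ties the in-thread query back to the surrounding transaction; with it disabled, that work is invisible to the agent:

  threads = 10.times.map do |i|
    Thread.new do
      # e.g. an ActiveRecord query running off the main thread
      Account.find(i + 1)
    end
  end
  threads.each(&:join)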

Questions for you:

  • Before getting to the point where things hang, were you able to notice any additional database or Sidekiq/Resque data appearing in your New Relic dashboard that wasn't there before?
  • When things hang at 100% CPU, do both the running process and the New Relic agent hang, or does the agent continue to report data at that point?

@parkerfinch
Author

> Before getting to the point where things hang, were you able to notice any additional database or Sidekiq/Resque data appearing in your New Relic dashboard that wasn't there before?

I haven't been able to find any! But I wouldn't consider myself an expert on navigating the dashboard so it's totally possible that I'm missing something.

> When things hang at 100% CPU, do both the running process and the New Relic agent hang, or does the agent continue to report data at that point?

I'm not sure how best to determine this. What should I expect to see if the New Relic agent is continuing to report data if the process is hanging?

@fallwith
Contributor

fallwith commented Apr 4, 2023

Thanks, @parkerfinch! If the agent itself is hogging all of the CPU but still managing to operate, its activity can be seen by temporarily having it operate in debug-level logging mode and/or by enabling the optional audit log. These are the two relevant newrelic.yml config options:

  log_level: debug
  audit_log.enabled: true

Those options aren't good for production use, but helpful for temporary debugging.
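
For reference, roughly how those options nest in newrelic.yml (a sketch only, assuming the standard layout):

  # Temporary debugging settings; remove once done
  common: &default_settings
    log_level: debug
    audit_log:
      enabled: true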

With debug-level logging, log/newrelic_agent.log should continue to show constant activity while the agent is still running, and with the audit log enabled, log/newrelic_audit.log will receive new content every time data is sent up to New Relic's servers.

On our end, we now have a repro that will spike the CPU and hang the monitored application for a period of time only when thread instrumentation is enabled. The application eventually recovers after the agent has finished going through a backlog of thread related work, so it's not a complete and permanent hang we're seeing.

We are now evaluating a variety of ways to improve things while keeping thread instrumentation enabled, and we will share our findings and improvement PRs here as they become available.

If anyone else has a repro that they can share that does not include any sensitive data, we'd also be happy to test the fix against any repro shared with us. Otherwise, we'll have a feature branch prepped that others can pull down and test with.

@wata727

wata727 commented Apr 11, 2023

Same issue here. In our case, a long-running (~3 hours) Sidekiq job executing a large number of queries hit this issue.
Additional context:

  • New Relic Ruby Agent 9.1.0
  • Ruby 2.7.6
  • Sidekiq 6.5.8

Below is a backtrace taken with the TTIN signal for a hung process.
https://github.com/sidekiq/sidekiq/wiki/Signals#ttin

Thread TID-a8yu processor

/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/abstract_segment.rb:110:in `first'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/abstract_segment.rb:110:in `timings_overlap?'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/abstract_segment.rb:129:in `block (2 levels) in children_time_ranges'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/abstract_segment.rb:129:in `index'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/abstract_segment.rb:129:in `block in children_time_ranges'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/abstract_segment.rb:128:in `each'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/abstract_segment.rb:128:in `each_with_object'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/abstract_segment.rb:128:in `children_time_ranges'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/abstract_segment.rb:303:in `record_exclusive_duration'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/abstract_segment.rb:92:in `finalize'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/tracing.rb:52:in `block in finalize_segments'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/tracing.rb:52:in `each'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/tracing.rb:52:in `finalize_segments'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction.rb:556:in `commit!'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction.rb:535:in `finish'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/instrumentation/controller_instrumentation.rb:391:in `perform_action_with_newrelic_trace'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/instrumentation/sidekiq/server.rb:20:in `call'
/usr/local/lib/ruby/gems/2.7.0/gems/sidekiq-6.5.8/lib/sidekiq/middleware/chain.rb:179:in `block in invoke'

...<snip>...

Looking at this backtrace, I think #1927 is pointing in the right direction.

@angelatan2
Contributor

@wata727, thank you for sharing your encounter with a similar problem. This issue is our highest priority at the moment, and the maintainers are reviewing the change as we speak. The changes are at the core of how the agent functions, so we are taking extra time to ensure they are robust and don't cause other unintended issues.

@kaylareopelle
Contributor

Hi @parkerfinch, @travisbell, @wata727! 👋

We have a potential fix for this issue we'd like you to test.

To do this, please update your Gemfile's reference to newrelic_rpm to point at the fix's branch:

gem 'newrelic_rpm', git: 'https://github.com/newrelic/newrelic-ruby-agent/', branch: 'must_go_faster'

Then run bundle, make sure instrumentation.thread.tracing is set to true, and run the problematic jobs.
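
For example, a minimal newrelic.yml sketch (assuming the usual nested layout):

  # Enable thread tracing for the test run
  common: &default_settings
    instrumentation:
      thread:
        tracing: true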

If you decide to try it out, please let us know how it goes.

@travisbell

travisbell commented Apr 11, 2023

Hi @kaylareopelle, fantastic work. I don't mind bumping to do some testing for you guys. I'll deploy later today and report back what I find out.

@travisbell

2 full days of running the must_go_faster branch and we haven't had a single stuck process. Things seem to be working much better.

@kaylareopelle
Contributor

That's fantastic news, @travisbell. Thanks for testing the branch! This code will be included in our next release.

@workato-integration

Work has been completed on this issue.

@kaylareopelle
Contributor

Hi @travisbell, @parkerfinch! The changes provided by #1927 have been released in v9.2.0 of the newrelic_rpm gem.

@kford-newrelic moved this to Code Complete/Done in Ruby Engineering Board on Jul 12, 2023