Thread tracing causes process to hang #1910

Closed
parkerfinch opened this issue Mar 24, 2023 · 17 comments
Labels
bug, community (To tag external issues and PRs submitted by the community), P2-Bug (Priority 2 Bug)

Comments

@parkerfinch

Description

We recently attempted to upgrade the version of the agent we use from 8.16 to 9.0 and ran into an issue where, after our code had completed a Sidekiq job, the process would spin at 100% CPU usage, seemingly indefinitely.

We were able to prevent this issue by disabling thread tracing (setting instrumentation.thread.tracing to false), so it seems like the issue may have been present for a while but only manifested with the automatic enabling of thread tracing that came with 9.0.
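
For reference, a rough sketch of that workaround in newrelic.yml, assuming the usual nested YAML layout:

  # Sketch only: place under the shared common section (or the relevant
  # environment) in newrelic.yml
  common: &default_settings
    instrumentation:
      thread:
        tracing: false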

Notably, this only seems to occur for a particular job that we have. We use Sidekiq for job processing, and this issue would manifest for a particularly long-running Sidekiq job that made many database requests. Reducing the number of DB requests made by that job prevented this issue from manifesting.

Expected Behavior

We expect the New Relic agent to publish the traces quickly and without using 100% of a CPU.

Steps to Reproduce

I wish I had more specific steps for you here. The job we're looking at takes over ten minutes to run, so it's a long debug cycle. We were not able to reproduce the issue when running the job outside of Sidekiq, so the problem seems to involve some interaction between the threading used by Sidekiq and the thread tracing done by New Relic.

If this seems worth digging into, I can look further into creating a reproducible example. Let me know!

Your Environment

Running on Heroku 18 (Ubuntu 18.04).
New Relic Ruby Agent: 9.0.0
Ruby: 3.1.2
Sidekiq: 6.5.6

For Maintainers Only or Hero Triaging this bug

Suggested Priority (P1,P2,P3,P4,P5):
Suggested T-Shirt size (S, M, L, XL, Unknown):


github-actions bot added the community (To tag external issues and PRs submitted by the community) label on Mar 24, 2023
@travisbell

travisbell commented Mar 24, 2023

I've been debugging this exact issue in Resque for the past few days without much luck, and happened to be investigating an entirely different NR issue when I came across this! 😍 I'm going to disable thread instrumentation and let it run for the weekend. I'll report back early next week on whether we had any more stuck processes.

With regards to the issue we're seeing, it's the same basic info as @parkerfinch but in Resque, not Sidekiq. Long-running jobs seem to be the ones affected; we never see the problem manifest on short jobs. CPU gets pinned at 100% after the job is done, and the processes don't exit cleanly like they should. The only solution I've found is to manually kill the proc.

🤞🏼 it's as easy as disabling threading instrumentation.

@fallwith
Contributor

Hi @parkerfinch and @travisbell,

Thanks very much for bringing this issue to our attention. We place a huge emphasis and priority on delivering maximum value while running with as low an overhead as possible, and a 100% CPU spike is certainly out of alignment with that goal.

I'm very glad @parkerfinch was able to pinpoint the new-in-9.0 defaulting to true (enabled) for instrumentation.thread.tracing as a root cause. And @travisbell we eagerly await the results of your weekend run.

(As @parkerfinch points out) Thread tracing was added in v8.7 but not enabled by default, so as to avoid conflicting with existing custom user solutions for Thread instances. Then for v9.0 we were able to leverage semver and flip the default to enabled. For multithreaded apps, the Thread-based categorization can be super neat to see in the UI, but the functionality is certainly not worth a CPU spike and shouldn't be causing one.

We always welcome all the assistance we can get on this open source agent software. If you're able to produce a reproduction that doesn't require sharing any application specific code or customer data, we'd love to use it. But we don't require it and will try to put together a repro ourselves. We'll keep you posted with updates here on this GitHub issue.

In the meantime and depending on your support level with New Relic, you can also engage the New Relic Support Community or open a support ticket. When doing so, please provide a link to this GitHub issue.

@travisbell

travisbell commented Mar 27, 2023

Thanks @fallwith.

Alright, 3 nights of running Resque without thread tracing and we didn't have a single hanging process. That's the first time in ~3 weeks I can say that, so I feel pretty confident this was our culprit. It's hard to imagine it's a coincidence at this point, but I will reply back in case I see it happen again.

To add some extra details about our ENV:

  • New Relic Ruby Agent 9.0.0
  • Ruby 3.2.1
  • Resque 2.5.0
  • resque-multi-job-forks

And to reiterate, it seems to affect long-running jobs and not short, quick ones. Long-running for us means ~2-3 hours.

@parkerfinch
Author

> And to reiterate, it seems to affect long-running jobs and not short, quick ones. Long-running for us means ~2-3 hours.

Interesting that it only affects jobs running for 2-3 hours on your end; we've seen it with jobs running as quickly as 13 minutes.

In case it helps, the general "shape" of the job where we're seeing this issue is (rough sketch below):

  1. Do a bunch (tens of thousands) of small database reads (each returning a single row from a table) in a loop.
  2. Do one large write.
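
A rough, purely hypothetical sketch of that shape (not our actual code; the model names are made up):

  class ExampleReportJob
    include Sidekiq::Job

    def perform(record_ids)
      rows = record_ids.map do |id|
        # tens of thousands of tiny single-row reads in a loop
        Record.find(id)
      end

      # followed by one large write at the end
      Summary.create!(payload: rows.map(&:attributes))
    end
  end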

@travisbell

I'll admit that the work profile of the jobs we've seen it with does sound a lot like what you describe. The main area we saw this was a job with a long-running loop (that makes many tens of thousands of DB calls) and another job that is an export of sorts, i.e. the same tens of thousands of DB calls plus writing to a file for an hour or two.

Maybe that likeness is enough for NR to at least get pointed in the right direction.

@fallwith
Contributor

@parkerfinch @travisbell thank you both for the updates. We're working on identifying a root cause and will let you know as soon as we have one. So far we've done lots of Thread-focused testing with short-lived and long-lived processes running anywhere from a dozen threads to 100,000, and haven't yet managed to generate a hang or agent-induced CPU overhead.

Our current hypothesis is that the agent's database and/or background job instrumentation is key to the problem, given that you have both of those ingredients in play.

The agent's Thread instrumentation does 2 really cool things:

  • When instances of Thread are spawned, we record this to help users see what took place in which thread
  • For existing agent instrumentation for databases, background jobs, NoSQL stores, net clients, etc., activity taking place in a new thread would previously have been hidden from the agent. With instrumentation.thread.tracing enabled, the agent now does a much improved job of operating in an async context.

In short, our existing database and/or background job instrumentation may ultimately be to blame for the hang, with the thread support simply bringing that to the surface.
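
To make the second bullet concrete, here is a contrived Ruby sketch (the model name is hypothetical). With instrumentation.thread.tracing enabled, the agent records the thread spawn and ties the in-thread query back to the surrounding transaction; with it disabled, that work is invisible to the agent:

  threads = 10.times.map do |i|
    Thread.new do
      # e.g. an ActiveRecord query running off the main thread
      Account.find(i + 1)
    end
  end
  threads.each(&:join)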

Questions for you:

  • Before getting to the point where things hang, were you able to notice any additional database or Sidekiq/Resque data appearing in your New Relic dashboard that wasn't there before?
  • When things hang at 100% CPU, do both the running process and the New Relic agent hang, or does the agent continue to report data at that point?

@parkerfinch
Author

> Before getting to the point where things hang, were you able to notice any additional database or Sidekiq/Resque data appearing in your New Relic dashboard that wasn't there before?

I haven't been able to find any! But I wouldn't consider myself an expert on navigating the dashboard so it's totally possible that I'm missing something.

> When things hang at 100% CPU, do both the running process and the New Relic agent hang, or does the agent continue to report data at that point?

I'm not sure how best to determine this. What should I expect to see if the New Relic agent is continuing to report data if the process is hanging?

@fallwith
Contributor

fallwith commented Apr 4, 2023

Thanks, @parkerfinch! If the agent itself is hogging all of the CPU but still managing to operate, its activity can be seen by temporarily having it operate in debug-level logging mode and/or by enabling the optional audit log. These are the two relevant newrelic.yml config options:

  log_level: debug
  audit_log.enabled: true

Those options aren't good for production use, but helpful for temporary debugging.
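
For reference, roughly how those options nest in newrelic.yml (a sketch only, assuming the standard layout):

  # Temporary debugging settings; remove once done
  common: &default_settings
    log_level: debug
    audit_log:
      enabled: true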

With debug-level logging, log/newrelic_agent.log should continue to show constant activity while the agent is still running, and with the audit log enabled, log/newrelic_audit.log will receive new content every time data is sent up to New Relic's servers.

On our end, we now have a repro that will spike the CPU and hang the monitored application for a period of time only when thread instrumentation is enabled. The application eventually recovers after the agent has finished going through a backlog of thread related work, so it's not a complete and permanent hang we're seeing.

We are now evaluating a variety of ways to improve things while keeping thread instrumentation enabled, and we will share our findings and improvement PRs here as they become available.

If anyone else has a repro that they can share that does not include any sensitive data, we'd also be happy to test the fix against any repro shared with us. Otherwise, we'll have a feature branch prepped that others can pull down and test with.

@wata727

wata727 commented Apr 11, 2023

Same issue here. In our case, a long-running (~3 hours) Sidekiq job executing a large number of queries hit this issue.
Additional context:

  • New Relic Ruby Agent 9.1.0
  • Ruby 2.7.6
  • Sidekiq 6.5.8

Below is a backtrace taken with the TTIN signal for a hung process.
https://github.com/sidekiq/sidekiq/wiki/Signals#ttin

Thread TID-a8yu processor

/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/abstract_segment.rb:110:in `first'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/abstract_segment.rb:110:in `timings_overlap?'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/abstract_segment.rb:129:in `block (2 levels) in children_time_ranges'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/abstract_segment.rb:129:in `index'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/abstract_segment.rb:129:in `block in children_time_ranges'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/abstract_segment.rb:128:in `each'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/abstract_segment.rb:128:in `each_with_object'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/abstract_segment.rb:128:in `children_time_ranges'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/abstract_segment.rb:303:in `record_exclusive_duration'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/abstract_segment.rb:92:in `finalize'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/tracing.rb:52:in `block in finalize_segments'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/tracing.rb:52:in `each'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction/tracing.rb:52:in `finalize_segments'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction.rb:556:in `commit!'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/transaction.rb:535:in `finish'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/instrumentation/controller_instrumentation.rb:391:in `perform_action_with_newrelic_trace'
/usr/local/lib/ruby/gems/2.7.0/gems/newrelic_rpm-9.1.0/lib/new_relic/agent/instrumentation/sidekiq/server.rb:20:in `call'
/usr/local/lib/ruby/gems/2.7.0/gems/sidekiq-6.5.8/lib/sidekiq/middleware/chain.rb:179:in `block in invoke'

...<snip>...

Looking at this backtrace, I think #1927 is pointing in the right direction.

@angelatan2
Contributor

@wata727, thank you for sharing your encounter with a similar problem. This issue is our highest priority at the moment, and the maintainers are reviewing the change as we speak. The changes are at the core of how the agent functions, so we are taking extra time to ensure they are robust and don't cause other unintended issues.

@kaylareopelle
Contributor

Hi @parkerfinch, @travisbell, @wata727! 👋

We have a potential fix for this issue we'd like you to test.

To do this, please update your Gemfile's reference to newrelic_rpm to point at the fix's branch:

gem 'newrelic_rpm', git: 'https://github.com/newrelic/newrelic-ruby-agent/', branch: 'must_go_faster'

Then run bundle, make sure instrumentation.thread.tracing is set to true, and run the problematic jobs.
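
For example, a minimal newrelic.yml sketch (assuming the usual nested layout):

  # Enable thread tracing for the test run
  common: &default_settings
    instrumentation:
      thread:
        tracing: true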

If you decide to try it out, please let us know how it goes.

@travisbell

travisbell commented Apr 11, 2023

Hi @kaylareopelle, fantastic work. I don't mind bumping to do some testing for you guys. I'll deploy later today and report back what I find out.

@travisbell

2 full days of running the must_go_faster branch and we haven't had a single stuck process. Things seem to be working much better.

@kaylareopelle
Contributor

That's fantastic news, @travisbell. Thanks for testing the branch! This code will be included in our next release.

@workato-integration

Work has been completed on this issue.

@kaylareopelle
Contributor

Hi @travisbell, @parkerfinch! The changes provided by #1927 have been released in v9.2.0 of the newrelic_rpm gem.

@kford-newrelic moved this to Code Complete/Done in Ruby Engineering Board on Jul 12, 2023