[PROF-9926] Fix rpath for linking to libdatadog when loading from extension dir #3683
…ension dir

**What does this PR do?**

This PR is a follow-up to #3582. In that PR, we fixed loading the profiling native extension so that it could be loaded from the Ruby extensions directory (see the original PR for more details).

It turns out this was not enough! Specifically, the customer reported that they saw the following error

> Profiling was requested but is not supported, profiling disabled: There was an error loading the profiling
> native extension due to 'RuntimeError Failure to load datadog_profiling_native_extension.3.2.2_x86_64-linux
> due to libdatadog_profiling.so: cannot open shared object file: No such file or directory

Specifically, what this message tells is that we're finding the profiling native extension BUT it's failing to load BECAUSE the dynamic loader is not able to find its `libdatadog_profiling.so` dependency.

From debugging the issue with the customer, I suspect that what we're seeing here is a repeat of #2067 / #2125, that is, the paths where the profiler is compiled are changed at deployment, and so we also need to adjust the relative rpath to account for this.

I haven't yet confirmed with the customer that this is their issue, BUT I was able to reproduce the exact problem if I moved the installation of the library in the way I mention above (see "how to test the change", below).

**Motivation:**

Fix this weird corner case that made the profiler not load.

**Additional Notes:**

This is a really really weird corner case, so I'm happy to further describe what the issue is if my description above + the comments in the code are still too cryptic to understand.

**How to test the change?**

I've added test code for the helper, but actually validating the whole rpath thing is a bit annoying.

Here's how I triggered the issue myself, and then used it to validate the fix:

```
# Build fixed gem into folder, will be used later
$ bundle exec rake build
datadog 2.0.0.rc1 built to pkg/datadog-2.0.0.rc1.gem.

# Open a clean Ruby docker installation
$ docker run --network=host -ti -v `pwd`:/working ruby:3.2.2-bookworm /bin/bash

# I've created a minimal test gemfile ahead of time
/working/rpathtest# cat gems.rb
source 'https://rubygems.org'
gem 'datadog'

# Tell bundler to install the gem into a folder
/working/rpathtest# bundle config set --local path 'vendor/bundle'
/working/rpathtest# bundle install

# Confirm profiler works:
/working/rpathtest# DD_PROFILING_ENABLED=true bundle exec ddprofrb exec ruby -e "sleep 1"
# ... No errors loading profiler ...

# Now let's simulate the native extension being loaded from the
# extensions directory:
/working/rpathtest# find | grep \.so$ | grep datadog
./vendor/bundle/ruby/3.2.0/extensions/x86_64-linux/3.2.0/datadog-2.0.0.rc1/datadog_profiling_native_extension.3.2.2_x86_64-linux.so
./vendor/bundle/ruby/3.2.0/extensions/x86_64-linux/3.2.0/datadog-2.0.0.rc1/datadog_profiling_loader.3.2.2_x86_64-linux.so
./vendor/bundle/ruby/3.2.0/gems/libdatadog-9.0.0.1.0-x86_64-linux/vendor/libdatadog-9.0.0/x86_64-linux/libdatadog-x86_64-unknown-linux-gnu/lib/libdatadog_profiling.so
./vendor/bundle/ruby/3.2.0/gems/libdatadog-9.0.0.1.0-x86_64-linux/vendor/libdatadog-9.0.0/x86_64-linux-musl/libdatadog-x86_64-alpine-linux-musl/lib/libdatadog_profiling.so
./vendor/bundle/ruby/3.2.0/gems/datadog-2.0.0.rc1/lib/datadog_profiling_native_extension.3.2.2_x86_64-linux.so
./vendor/bundle/ruby/3.2.0/gems/datadog-2.0.0.rc1/lib/datadog_profiling_loader.3.2.2_x86_64-linux.so

/working/rpathtest# rm ./vendor/bundle/ruby/3.2.0/gems/datadog-2.0.0.rc1/lib/datadog_profiling_native_extension.3.2.2_x86_64-linux.so ./vendor/bundle/ruby/3.2.0/gems/datadog-2.0.0.rc1/lib/datadog_profiling_loader.3.2.2_x86_64-linux.so

# Confirm profiler still works:
/working/rpathtest# DD_PROFILING_ENABLED=true bundle exec ddprofrb exec ruby -e "sleep 1"
# ... No errors loading profiler ...

# Now let's simulate the folders being moved (the issue being fixed):
/working/rpathtest# cat /usr/local/bundle/config
---
BUNDLE_PATH: "vendor/bundle"

# Update this to vendor2...
/working/rpathtest# cat /usr/local/bundle/config
---
BUNDLE_PATH: "vendor2/bundle"

# and move the folder
/working/rpathtest# mv vendor/ vendor2

# Now we've triggered the exact same error message as reported by the
# customer
/working/rpathtest# DD_PROFILING_ENABLED=true bundle exec ddprofrb exec ruby -e "sleep 1"
W, [2024-06-05T15:51:12.488843 #517] WARN -- datadog: [datadog] Profiling was requested but is not supported, profiling disabled: There was an error loading the profiling native extension due to 'RuntimeError Failure to load datadog_profiling_native_extension.3.2.2_x86_64-linux due to libdatadog_profiling.so: cannot open shared object file: No such file or directory' at '/working/rpathtest/vendor2/bundle/ruby/3.2.0/gems/datadog-2.0.0.rc1/lib/datadog/profiling/load_native_extension.rb:41:in `<top (required)>''

# Now let's test the fix. Let's start by recreating the issue:

# Put the fixed version into the bundler cache...
/working/rpathtest# cp /working/pkg/datadog-2.0.0.rc1.gem vendor2/bundle/ruby/3.2.0/cache/datadog-2.0.0.rc1.gem

# force bundler to reinstall...
/working/rpathtest# rm -rf vendor2/bundle/ruby/3.2.0/gems/datadog-2.0.0.rc1/
/working/rpathtest# bundle install

# Force gem to be loaded from extension directory
/working/rpathtest# rm ./vendor2/bundle/ruby/3.2.0/gems/datadog-2.0.0.rc1/lib/datadog_profiling_native_extension.3.2.2_x86_64-linux.so ./vendor2/bundle/ruby/3.2.0/gems/datadog-2.0.0.rc1/lib/datadog_profiling_loader.3.2.2_x86_64-linux.so

# Confirm it works:
/working/rpathtest# DD_PROFILING_ENABLED=true bundle exec ddprofrb exec ruby -e "sleep 1"
# ... No errors loading profiler ...

# Let's now change the vendor folder again:
/working/rpathtest# cat /usr/local/bundle/config
---
BUNDLE_PATH: "vendor3/bundle"
/working/rpathtest# mv vendor2/ vendor3

# And it now doesn't fail:
/working/rpathtest# DD_PROFILING_ENABLED=true bundle exec ddprofrb exec ruby -e "sleep 1"
# ... No errors loading profiler ...

# And extra confirmation that the relative paths are working:
/working/rpathtest# ldd ./vendor3/bundle/ruby/3.2.0/extensions/x86_64-linux/3.2.0/datadog-2.0.0.rc1/datadog_profiling_native_extension.3.2.2_x86_64-linux.so
  libdatadog_profiling.so => /working/rpathtest/./vendor3/bundle/ruby/3.2.0/extensions/x86_64-linux/3.2.0/datadog-2.0.0.rc1/../../../../gems/libdatadog-9.0.0.1.0-x86_64-linux/vendor/libdatadog-9.0.0/x86_64-linux/libdatadog-x86_64-unknown-linux-gnu/lib/libdatadog_profiling.so (0x00007ff127c00000)
```
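To make the "relative rpath" idea concrete, here is a minimal Ruby sketch of how such an entry could be computed. It is not the helper added by this PR: the method name `relative_rpath` and its keyword arguments are hypothetical, and the directories are taken from the `find` output in the test session. The only real mechanism relied on is that the Linux dynamic loader expands `$ORIGIN` to the directory of the object being loaded.

```ruby
require "pathname"

# Hypothetical helper, for illustration only (not the gem's actual code):
# returns an rpath entry pointing from the directory that holds the native
# extension to the directory that holds libdatadog_profiling.so.
def relative_rpath(extension_dir:, libdatadog_lib_dir:)
  relative = Pathname.new(libdatadog_lib_dir).relative_path_from(Pathname.new(extension_dir))
  # $ORIGIN is expanded by the dynamic loader to the directory containing
  # the shared object being loaded, so the entry survives moving the tree.
  "$ORIGIN/#{relative}"
end

# Paths as they appear in the test session above.
root = "/working/rpathtest/vendor/bundle/ruby/3.2.0"
extension_dir = "#{root}/extensions/x86_64-linux/3.2.0/datadog-2.0.0.rc1"
lib_dir = "#{root}/gems/libdatadog-9.0.0.1.0-x86_64-linux/vendor/libdatadog-9.0.0/" \
  "x86_64-linux/libdatadog-x86_64-unknown-linux-gnu/lib"

puts relative_rpath(extension_dir: extension_dir, libdatadog_lib_dir: lib_dir)
```

The computed entry climbs four directories out of the extensions directory and then descends into the libdatadog gem, which is the same `../../../../gems/...` shape visible in the `ldd` output.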
I am trying to understand the situation. Here is where the various .so files are on my machine:
(I have a possibly slightly different configuration where I have a designated global directory for gems installed by bundler.) I think what happens with extensions in gems is the following:
So, looking at the set of files present in an "installed" gem, it's actually a mix of files that are used at runtime and temporary files used during the build process that are never cleaned up. Is it possible that the customer issue is actually due to some tool copying the "temporary" files, including the built .so, and not the "final" files? It probably wouldn't change the resulting logic that we would need, but it would at least provide an explanation for what is happening. A lot of Ruby libraries just examine the filesystem around themselves, assuming various files to be present that aren't used by the Ruby runtime, and I wouldn't be surprised if there are multiple tools out in the wild that end up copying or using temporary files, thinking those are permanently installed artifacts.
LGTM
Is there a way to test this further?
cc @p-datadog
I think your understanding is almost correct. In particular this part
I believe this was recently changed on rubygems, now there's some cleanup: rubygems/rubygems#3958
We have a meeting scheduled with the customer, perhaps we'll get some hints on how exactly they deploy their gems and what's causing their weird setup. I don't think their tool is copying any temporary files -- it's correctly copying the stuff that ends up under Overall I've tried to support this weird setup because with Ruby being so configurable, I suspect it may not be the last time we see it, and so a bit of complexity on our side may save us from future support tickets.
@r1viollet It's a good question. Perhaps I was a bit too quick to dismiss having automated testing for this. On closer thought, we could have a CI step that basically does what I showed above: install the gem, then move the files around, then see if the profiler can still start. I'll see if I can take a stab at it before merging this PR.
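As a rough illustration of what "install, then move the files around, then check" exercises, here is a self-contained Ruby sketch that simulates the move with empty placeholder files. It does not load any real native extension, and the directory names and the `simulate_move` method are made up for the example. It only shows why a path recorded as absolute breaks after the move while a path relative to the extension directory keeps resolving.

```ruby
require "fileutils"
require "pathname"
require "tmpdir"

# Simulates moving the vendor folder, using empty placeholder files.
# Returns [absolute_path_still_valid, relative_path_still_valid].
def simulate_move
  Dir.mktmpdir do |tmp|
    # Made-up, shortened layout standing in for the real bundler tree.
    ext_dir = File.join(tmp, "vendor/bundle/extensions/datadog")
    lib_dir = File.join(tmp, "vendor/bundle/gems/libdatadog/lib")
    FileUtils.mkdir_p([ext_dir, lib_dir])
    FileUtils.touch(File.join(lib_dir, "libdatadog_profiling.so"))

    # Record both kinds of reference before the move.
    absolute = File.join(lib_dir, "libdatadog_profiling.so")
    relative = Pathname.new(lib_dir).relative_path_from(Pathname.new(ext_dir))

    # Move the whole tree, as a deployment tool might.
    FileUtils.mv(File.join(tmp, "vendor"), File.join(tmp, "vendor2"))
    moved_ext_dir = File.join(tmp, "vendor2/bundle/extensions/datadog")

    [
      File.exist?(absolute),
      File.exist?(File.join(moved_ext_dir, relative.to_s, "libdatadog_profiling.so")),
    ]
  end
end

absolute_ok, relative_ok = simulate_move
puts "absolute path survives move: #{absolute_ok}"
puts "relative path survives move: #{relative_ok}"
```

A real CI check would still need to start the profiler after the move, since only the dynamic loader can confirm that the rpath embedded in the .so is actually honored.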
Thanks y'all for the reviews and the feedback. I'm working on adding testing for this, but rather than pile on this PR, I'll open a separate small PR just with the testing. Going ahead and merging this one! :)
…d relative rpath is needed

**What does this PR do?**

This PR adds a new test case that validates that DataDog/dd-trace-rb#3582 and DataDog/dd-trace-rb#3683 keep working fine.

**Motivation:**

As described in DataDog/dd-trace-rb#3683, this is a somewhat annoying thing to test, but important to avoid regressing.

**Additional Notes:**

You can actually see the evolution of both of those fixes in this test.

E.g. here's dd-trace-rb 1.21.1 (prior to DataDog/dd-trace-rb#3582):

```
W, [2024-06-12T09:34:08.759519 #7] WARN -- ddtrace: [ddtrace] (/app/vendor-moved/bundle/ruby/3.3.0/gems/ddtrace-1.21.1/lib/datadog/core/configuration/components.rb:115:in `startup!') Profiling was requested but is not supported, profiling disabled: There was an error loading the profiling native extension due to 'RuntimeError Failure to load datadog_profiling_native_extension.3.3.2_x86_64-linux due to /app/vendor-moved/bundle/ruby/3.3.0/gems/ddtrace-1.21.1/lib/datadog/profiling/../../datadog_profiling_native_extension.3.3.2_x86_64-linux.so: cannot open shared object file: No such file or directory' at '/app/vendor-moved/bundle/ruby/3.3.0/gems/ddtrace-1.21.1/lib/datadog/profiling/load_native_extension.rb:26:in `<top (required)>''
--- FAIL: TestScenarios/scenarios/ruby_extension_dir_and_rpath (14.86s)
```

In this version, we failed because we couldn't load the native extension.

Then here's dd-trace-rb 1.23.1 (without DataDog/dd-trace-rb#3683) and if we don't move the `vendor` folder (but still delete the so from the lib folder):

```
--- PASS: TestScenarios/scenarios/ruby_extension_dir_and_rpath (18.96s)
```

...but if we additionally move the vendor folder (aka what this PR does in the Dockerfile):

```
W, [2024-06-12T09:37:33.517188 #6] WARN -- ddtrace: [ddtrace] (/app/vendor-moved/bundle/ruby/3.3.0/gems/ddtrace-1.23.1/lib/datadog/core/configuration/components.rb:116:in `startup!') Profiling was requested but is not supported, profiling disabled: There was an error loading the profiling native extension due to 'RuntimeError Failure to load datadog_profiling_native_extension.3.3.2_x86_64-linux due to libdatadog_profiling.so: cannot open shared object file: No such file or directory' at '/app/vendor-moved/bundle/ruby/3.3.0/gems/ddtrace-1.23.1/lib/datadog/profiling/load_native_extension.rb:39:in `<top (required)>''
--- FAIL: TestScenarios/scenarios/ruby_extension_dir_and_rpath (3.25s)
```

Notice it fails BUT the error is now different from the one above: the error relates to loading `libdatadog_profiling.so`, not `datadog_profiling_native_extension.3.3.2_x86_64-linux.so`.

And with the change in DataDog/dd-trace-rb#3683 (which will be in 1.23.2):

```
--- PASS: TestScenarios/scenarios/ruby_extension_dir_and_rpath (9.60s)
```

**NOTE**: For this test, unlike other Ruby tests we have, we're pulling in the latest **released** gem version (e.g. with `gem 'datadog'` on the `gems.rb` file), not the latest from git (as we do for other Ruby tests). This is because gems get installed in different paths when bundler downloads them directly from git, and we want to validate the path when a stable version is installed.

This also means that this PR will show up as failed until the latest datadog release (which will be 2.2.0) gets released. (Or 1.23.2, but I left the test setup to test the latest 2.x releases, not the 1.x ones, although I used 1.x on my tests above to show the evolution of the issue).
What does this PR do?
This PR is a follow-up to #3582.
In that PR, we fixed loading the profiling native extension so that it could be loaded from the Ruby extensions directory (see the original PR for more details).
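It turns out this was not enough! Specifically, the customer reported that they saw the following error:
> Profiling was requested but is not supported, profiling disabled: There was an error loading the profiling
> native extension due to 'RuntimeError Failure to load datadog_profiling_native_extension.3.2.2_x86_64-linux
> due to libdatadog_profiling.so: cannot open shared object file: No such file or directory
Specifically, what this message tells us is that we're finding the profiling native extension BUT it's failing to load BECAUSE the dynamic loader is not able to find its `libdatadog_profiling.so` dependency.
From debugging the issue with the customer, I suspect that what we're seeing here is a repeat of #2067 / #2125, that is, the paths where the profiler is compiled are changed at deployment, and so we also need to adjust the relative rpath to account for this.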
I haven't yet confirmed with the customer that this is their issue, BUT I was able to reproduce the exact problem if I moved the installation of the library in the way I mention above (see "how to test the change", below).
Motivation:
Fix this weird corner case that made the profiler not load.
Additional Notes:
This is a really really weird corner case, so I'm happy to further describe what the issue is if my description above + the comments in the code are still too cryptic to understand.
I'm opening this to target the 1.x-stable branch, as I'm hoping the customer can test the fix. If it's successful, I'll also forward-port it to the 2.x branch.
How to test the change?
I've added test code for the helper, but actually validating the whole rpath thing is a bit annoying.
Here's how I triggered the issue myself, and then used it to validate the fix: