Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Emulated TLS across shared libraries is (more) broken with r23 #1551

Closed
smeenai opened this issue Jul 28, 2021 · 17 comments
Closed

[BUG] Emulated TLS across shared libraries is (more) broken with r23 #1551

smeenai opened this issue Jul 28, 2021 · 17 comments
Assignees
Labels

Comments

@smeenai
Copy link

smeenai commented Jul 28, 2021

Repro test case (here's a tarball for convenience):

CMakeLists.txt:

cmake_minimum_required(VERSION 3.13.4)
project(tlstest C)
add_library(tls SHARED tls.c)
target_link_options(tls PRIVATE -rtlib=compiler-rt)
add_executable(main main.c)
target_link_libraries(main PRIVATE tls)
target_link_options(main PRIVATE -rtlib=compiler-rt)

tls.h:

#pragma once
extern _Thread_local int tls_var;
int get(void);
void set(int val);

tls.c:

#include "tls.h"
_Thread_local int tls_var;
int get() { return tls_var; }
void set(int val) { tls_var = val; }

main.c:

#include "tls.h"
#include <stdio.h>
int main() {
    set(42);
    printf("get() is %d\n", get());
    printf("tls_var is %d\n", tls_var);
}

Build and run with NDK r22:

$ cmake -G Ninja \
  -DCMAKE_TOOLCHAIN_FILE=$HOME/local/android-ndk-r22/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=armeabi-v7a -DANDROID_PLATFORM=16 ..
$ ninja
$ adb push main libtls.so /data/local/tmp
$ adb shell LD_LIBRARY_PATH=/data/local/tmp /data/local/tmp/main
get() is 42
tls_var is 42

Build and run with NDK r23 beta 6:

$ cmake -G Ninja \
  -DCMAKE_TOOLCHAIN_FILE=$HOME/local/android-ndk-r23-beta6/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=armeabi-v7a -DANDROID_PLATFORM=16 ..
$ ninja
$ adb push main libtls.so /data/local/tmp
$ adb shell LD_LIBRARY_PATH=/data/local/tmp /data/local/tmp/main
get() is 42
tls_var is 0

The culprit is https://reviews.llvm.org/D93431, which gave __emutls_get_address hidden visibility. Each .so now gets its own copy of that symbol, which in turn has its own copy of the emutls_pthread_key (and other emutls state), so the same emutls address will return different values across different shared objects.

The example is obviously contrived, but we've observed an actual app breakage from this. I say "more broken" instead of just "broken" in the title because it was kinda coincidental that this was working with the previous setup. The compiler driver inserts compiler-rt late in the link command, so if a shared object foo happens to reference __emutls_get_address (and so exports it with the old setup), other shared objects depending on foo will reference the symbol from that shared object instead of pulling in their own copy from compiler-rt. You could have a shared object not depending on foo that'd still get its own copy of __emutls_get_address though, and would be subject to the same breakage.

I'm not sure what the right fix is here ... maybe providing a shared library with the emutls bits to guarantee a single copy of emutls state across all shared objects? (For the time being, I'm just marking __emutls_get_address as default visibility and that works for us, but that could still bite us with the scenario described above.)

@smeenai smeenai added the bug label Jul 28, 2021
@smeenai
Copy link
Author

smeenai commented Jul 28, 2021

(Note that https://reviews.llvm.org/D93431 brings Android behavior in line with all other platforms, so this exact same issue would affect all other platforms as well; it's not Android specific. I'm reporting it here since it's a new issue with r23, but I can also file an upstream LLVM bug report instead if y'all would prefer.)

@DanAlbert
Copy link
Member

Thanks for the detailed report 👍

@rprichard I think making the symbol WEAK probably solves the problem, including the pre-existing one?

@DanAlbert
Copy link
Member

I think this counts as medium as a source compat issue which does qualify this for an RC respin, but those are certainly guidelines meant to be interpreted based on severity rather than strictly followed. I suspect that the use case needed to trigger the bug (direct access to a global thread local in another shared library, rather than through a function) is probably uncommon enough that we don't need to delay promotion of RC1 to stable for this, but we of course should get this fixed in a minor release soon after. @smeenai do you disagree with that?

@stephenhines assuming this is a trivial patch and update to clang-r416183, any estimate on how long it would take to deliver to prebuilts? I was actually just about to send the latest build to QA, but if this can be turned around quickly perhaps we should.

@smeenai
Copy link
Author

smeenai commented Jul 29, 2021

We build compiler-rt ourselves and don't use the one from the NDK, so you don't need to delay anything on our behalf :) I agree that this is a pretty big edge case and likely quite rare.

@rprichard I think making the symbol WEAK probably solves the problem, including the pre-existing one?

As in making it default visibility and weak, so that the dynamic linker coalesces multiple copies at runtime? That won't work for anyone who's built with -Bsymbolic or -Bsymbolic-functions, but I don't know how common that is either. (Fangrui proposed -Bsymbolic-non-weak-functions in https://reviews.llvm.org/D102570, but that hasn't landed yet.)

@DanAlbert
Copy link
Member

That won't work for anyone who's built with -Bsymbolic or -Bsymbolic-functions, but I don't know how common that is either. (Fangrui proposed -Bsymbolic-non-weak-functions in https://reviews.llvm.org/D102570, but that hasn't landed yet.)

Good point. I think your earlier suggestion of having an extra shared library to provide this function is the only fix there without fangrui's patch.

-Bsymbolic-non-weak-functions honestly sounds like it should just be the default behavior. I'm checking with the rest of the team first but I think we'd probably make that be our default behavior if it lands, since it preserves the C++ specified behavior for new/delete but also does what everyone wants for everything else (no interposition by default, get interposition if you want it by making a weak symbol, smaller code size thanks to fewer relocations, etc). Filed #1552 for us to consider that. Thanks for the link!

@stephenhines
Copy link
Collaborator

We have (unfortunately not visible on public ci.android.com) https://android-build.googleplex.com/builds/branches/aosp-llvm-r416183/grid? so this should be pretty easy to cherry-pick/rebuild/test. https://android-review.googlesource.com/c/toolchain/llvm_android/+/1782411 is the cherry-pick.

@smeenai
Copy link
Author

smeenai commented Jul 30, 2021

I put up https://reviews.llvm.org/D107127 to add the weak default visibility to __emutls_get_address.

@smeenai
Copy link
Author

smeenai commented Jul 30, 2021

We have (unfortunately not visible on public ci.android.com) https://android-build.googleplex.com/builds/branches/aosp-llvm-r416183/grid? so this should be pretty easy to cherry-pick/rebuild/test. https://android-review.googlesource.com/c/toolchain/llvm_android/+/1782411 is the cherry-pick.

If I'm reading this right, it's cherry-picking https://reviews.llvm.org/D93431, which is the cause of this issue, not the solution :) https://reviews.llvm.org/D107127 is a possible fix.

Like I said earlier, we don't use compiler-rt from the NDK toolchain, so you don't have to worry about scrambling to update prebuilts or delay releases on our behalf. (I don't know how likely it is for other people to run into this scenario.)

@stephenhines
Copy link
Collaborator

We have (unfortunately not visible on public ci.android.com) https://android-build.googleplex.com/builds/branches/aosp-llvm-r416183/grid? so this should be pretty easy to cherry-pick/rebuild/test. https://android-review.googlesource.com/c/toolchain/llvm_android/+/1782411 is the cherry-pick.

If I'm reading this right, it's cherry-picking https://reviews.llvm.org/D93431, which is the cause of this issue, not the solution :) https://reviews.llvm.org/D107127 is a possible fix.

Ah, I misunderstood above. It's also way too easy for us these days to just cherry-pick things, so I figured it wouldn't hurt to upload it. Oh well. I abandoned it now.

Like I said earlier, we don't use compiler-rt from the NDK toolchain, so you don't have to worry about scrambling to update prebuilts or delay releases on our behalf. (I don't know how likely it is for other people to run into this scenario.)

I'll let @DanAlbert decide whether he wants to wait your other fix, especially if the bug isn't present unless you use a more recent compiler-rt than what is shipping. It seems like we couldn't reproduce this with stock NDK r23 (nor should anyone else unless they're also building their own newer compiler-rt).

@smeenai
Copy link
Author

smeenai commented Jul 30, 2021

I'll let @DanAlbert decide whether he wants to wait your other fix, especially if the bug isn't present unless you use a more recent compiler-rt than what is shipping. It seems like we couldn't reproduce this with stock NDK r23 (nor should anyone else unless they're also building their own newer compiler-rt).

The issue should be present in a stock NDK r23 ... my repro instructions are with a stock NDK r23 beta 6. From what I understand, r23 is using the clang-r416183b prebuilt, which contains https://reviews.llvm.org/D93431.

@DanAlbert
Copy link
Member

Discussed this a bit and decided to stick with my plan above: fix in r23b. @rprichard pointed out that for anyone using ndk-build or CMake this isn't a regression because they were getting --exclude-libs even before r23, which means this hits an even smaller number of folks than my already small estimation.

@smeenai I approved your change upstream (thanks!). Strongly considering making -Bsymbolic-non-weak-functions the default for all of Android as well, which solves even that subset of the problem. Thanks again for all the help and feedback 👍

smeenai added a commit to llvm/llvm-project that referenced this issue Jul 30, 2021
Multiple copies of emulated TLS state means inconsistent results when
accessing the same thread-local variable from different shared objects
(android/ndk#1551). Making `__emutls_get_address`
be a weak default visibility symbol should make the dynamic linker
ensure only a single copy gets used at runtime. This is best-effort, but
the more robust approach of putting emulated TLS into its own shared
object would (a) be a much bigger change, and (b) shared objects are
pretty heavyweight, and adding a new one to a space-constrained
environment isn't an easy sell. Given the expected rarity of direct
accesses to emulated TLS variables across different shared objects, the
best-effort approach should suffice.

Reviewed By: danalbert, rprichard

Differential Revision: https://reviews.llvm.org/D107127
@pirama-arumuga-nainar
Copy link
Collaborator

@DanAlbert
Copy link
Member

https://ci.android.com/builds/branches/aosp-master-ndk/grid?head=7732824&tail=7732824 has the fix. I haven't verified the repro case, but it has the cherry-pick that was requested.

chenguoyin-alibaba pushed a commit to riscv-android-src/platform-ndk that referenced this issue Oct 21, 2021
Test: ./checkbuild.py && ./run_tests.py
Bug: android/ndk#1540
Bug: android/ndk#1551
Bug: android/ndk#1555
Change-Id: I2d345546457d8c461b603be4ea9e725230f11af9
(cherry picked from commit 2af6bc5)
chenguoyin-alibaba pushed a commit to riscv-android-src/toolchain-llvm_android that referenced this issue Oct 21, 2021
Bug: android/ndk#1551

b8f04a670f2 [builtins] Try to ensure single copy of emulated TLS state

Test: N/A
Change-Id: I3d537a8415f428fafc4822b8bdbc33b70e09d261
@rprichard
Copy link
Collaborator

Fixed in r23b.

bob-beck pushed a commit to openbsd/src that referenced this issue Nov 12, 2021
…ng emutls

Our emulated TLS implementation relies on local state (e.g. for the pthread
key), and if we duplicate this state across different shared libraries,
accesses to the same TLS variable from different shared libraries will yield
different results (see android/ndk#1551 for an
example). __emutls_get_address is the only external entry point for emulated
TLS, and by making it default visibility and weak, we can rely on the dynamic
linker to coalesce multiple copies at runtime and ensure a single unique copy
of TLS state. This is a best effort;

Also bump the libc++abi minor because now it picks up the __emutls_get_address
symbol.

ok kettenis@
@juha-h
Copy link

juha-h commented Nov 13, 2021

I built ffmpeg libraries using 23.1.7779620 that according to this page https://developer.android.com/ndk/downloads is r23b, and still I get the crash, when my app starts:

11-13 11:18:09.442 23890 23890 E AndroidRuntime: java.lang.UnsatisfiedLinkError: dlopen failed: cannot locate symbol "__emutls_get_address" referenced by "/data/app/~~VV5up2PFDWp5FVE4Nd4CIA==/com.tutpro.baresip.plus-OoB1CvZdgkOqYxE_nLSsvQ==/base.apk!/lib/arm64-v8a/libavutil.so"...

There is no crash if I build ffmpeg libs using 23.0.7599858.

Should the fix mentioned in above be already included in 23.1.7779620?

mordak pushed a commit to mordak/llvm-project that referenced this issue Nov 13, 2021
…ng emutls

Our emulated TLS implementation relies on local state (e.g. for the pthread
key), and if we duplicate this state across different shared libraries,
accesses to the same TLS variable from different shared libraries will yield
different results (see android/ndk#1551 for an
example). __emutls_get_address is the only external entry point for emulated
TLS, and by making it default visibility and weak, we can rely on the dynamic
linker to coalesce multiple copies at runtime and ensure a single unique copy
of TLS state. This is a best effort;

Also bump the libc++abi minor because now it picks up the __emutls_get_address
symbol.

ok kettenis@
mordak pushed a commit to mordak/llvm-project that referenced this issue Nov 13, 2021
…ng emutls

Our emulated TLS implementation relies on local state (e.g. for the pthread
key), and if we duplicate this state across different shared libraries,
accesses to the same TLS variable from different shared libraries will yield
different results (see android/ndk#1551 for an
example). __emutls_get_address is the only external entry point for emulated
TLS, and by making it default visibility and weak, we can rely on the dynamic
linker to coalesce multiple copies at runtime and ensure a single unique copy
of TLS state. This is a best effort;

Also bump the libc++abi minor because now it picks up the __emutls_get_address
symbol.

ok kettenis@
mordak pushed a commit to mordak/llvm-project that referenced this issue Nov 13, 2021
…ng emutls

Our emulated TLS implementation relies on local state (e.g. for the pthread
key), and if we duplicate this state across different shared libraries,
accesses to the same TLS variable from different shared libraries will yield
different results (see android/ndk#1551 for an
example). __emutls_get_address is the only external entry point for emulated
TLS, and by making it default visibility and weak, we can rely on the dynamic
linker to coalesce multiple copies at runtime and ensure a single unique copy
of TLS state. This is a best effort;

Also bump the libc++abi minor because now it picks up the __emutls_get_address
symbol.

ok kettenis@
@DanAlbert
Copy link
Member

Should the fix mentioned in above be already included in 23.1.7779620?

Yes, but what you have is different than the issue reported here. It seems like the problem you're encountering was surfaced by the fix here, but with the information available I can't tell if the fix caused a bug or just made it visible. Can you file a new bug with all the information the template asks for so we can take a look?

@juha-h
Copy link

juha-h commented Nov 15, 2021

Done #1606.

github-actions bot pushed a commit to tstellar/llvm-project that referenced this issue Jan 25, 2022
Multiple copies of emulated TLS state means inconsistent results when
accessing the same thread-local variable from different shared objects
(android/ndk#1551). Making `__emutls_get_address`
be a weak default visibility symbol should make the dynamic linker
ensure only a single copy gets used at runtime. This is best-effort, but
the more robust approach of putting emulated TLS into its own shared
object would (a) be a much bigger change, and (b) shared objects are
pretty heavyweight, and adding a new one to a space-constrained
environment isn't an easy sell. Given the expected rarity of direct
accesses to emulated TLS variables across different shared objects, the
best-effort approach should suffice.

Reviewed By: danalbert, rprichard

Differential Revision: https://reviews.llvm.org/D107127

(cherry picked from commit b8f04a6)
github-actions bot pushed a commit to tstellar/llvm-project that referenced this issue Jan 25, 2022
Multiple copies of emulated TLS state means inconsistent results when
accessing the same thread-local variable from different shared objects
(android/ndk#1551). Making `__emutls_get_address`
be a weak default visibility symbol should make the dynamic linker
ensure only a single copy gets used at runtime. This is best-effort, but
the more robust approach of putting emulated TLS into its own shared
object would (a) be a much bigger change, and (b) shared objects are
pretty heavyweight, and adding a new one to a space-constrained
environment isn't an easy sell. Given the expected rarity of direct
accesses to emulated TLS variables across different shared objects, the
best-effort approach should suffice.

Reviewed By: danalbert, rprichard

Differential Revision: https://reviews.llvm.org/D107127

(cherry picked from commit b8f04a6)
github-actions bot pushed a commit to tstellar/llvm-project that referenced this issue Jan 25, 2022
Multiple copies of emulated TLS state means inconsistent results when
accessing the same thread-local variable from different shared objects
(android/ndk#1551). Making `__emutls_get_address`
be a weak default visibility symbol should make the dynamic linker
ensure only a single copy gets used at runtime. This is best-effort, but
the more robust approach of putting emulated TLS into its own shared
object would (a) be a much bigger change, and (b) shared objects are
pretty heavyweight, and adding a new one to a space-constrained
environment isn't an easy sell. Given the expected rarity of direct
accesses to emulated TLS variables across different shared objects, the
best-effort approach should suffice.

Reviewed By: danalbert, rprichard

Differential Revision: https://reviews.llvm.org/D107127

(cherry picked from commit b8f04a6)
github-actions bot pushed a commit to tstellar/llvm-project that referenced this issue Jan 28, 2022
Multiple copies of emulated TLS state means inconsistent results when
accessing the same thread-local variable from different shared objects
(android/ndk#1551). Making `__emutls_get_address`
be a weak default visibility symbol should make the dynamic linker
ensure only a single copy gets used at runtime. This is best-effort, but
the more robust approach of putting emulated TLS into its own shared
object would (a) be a much bigger change, and (b) shared objects are
pretty heavyweight, and adding a new one to a space-constrained
environment isn't an easy sell. Given the expected rarity of direct
accesses to emulated TLS variables across different shared objects, the
best-effort approach should suffice.

Reviewed By: danalbert, rprichard

Differential Revision: https://reviews.llvm.org/D107127

(cherry picked from commit b8f04a6)
mem-frob pushed a commit to draperlaboratory/hope-llvm-project that referenced this issue Oct 7, 2022
Multiple copies of emulated TLS state means inconsistent results when
accessing the same thread-local variable from different shared objects
(android/ndk#1551). Making `__emutls_get_address`
be a weak default visibility symbol should make the dynamic linker
ensure only a single copy gets used at runtime. This is best-effort, but
the more robust approach of putting emulated TLS into its own shared
object would (a) be a much bigger change, and (b) shared objects are
pretty heavyweight, and adding a new one to a space-constrained
environment isn't an easy sell. Given the expected rarity of direct
accesses to emulated TLS variables across different shared objects, the
best-effort approach should suffice.

Reviewed By: danalbert, rprichard

Differential Revision: https://reviews.llvm.org/D107127
bsdjhb pushed a commit to CTSRD-CHERI/compiler-rt that referenced this issue Jan 3, 2023
Multiple copies of emulated TLS state means inconsistent results when
accessing the same thread-local variable from different shared objects
(android/ndk#1551). Making `__emutls_get_address`
be a weak default visibility symbol should make the dynamic linker
ensure only a single copy gets used at runtime. This is best-effort, but
the more robust approach of putting emulated TLS into its own shared
object would (a) be a much bigger change, and (b) shared objects are
pretty heavyweight, and adding a new one to a space-constrained
environment isn't an easy sell. Given the expected rarity of direct
accesses to emulated TLS variables across different shared objects, the
best-effort approach should suffice.

Reviewed By: danalbert, rprichard

Differential Revision: https://reviews.llvm.org/D107127
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants