ACLiC compilation confuses compiled binaries with shared objects, breaking compilation in some cases #7366

Closed
1 task done
eguiraud opened this issue Mar 4, 2021 · 40 comments · Fixed by #9404

@eguiraud
Member

eguiraud commented Mar 4, 2021

  • Checked for duplicates

Describe the bug

Building ROOT on Arch Linux and then running ctest -R "(dataframe|datasource)" -- -j8 results in several test failures.
There are various possible failure modes; the command above typically results in errors like this:

1369: Processing /home/jalopezg/CERN/roottest/root/dataframe/test_snapshot.C+...
1369: Info in <TUnixSystem::ACLiC>: creating shared library /home/jalopezg/CERN/build/roottest/root/dataframe/test_snapshot_C.so
1369: /usr/bin/ld: /home/jalopezg/CERN/build/roottest/root/dataframe/test_readFcc: _ZSt4cout: invalid version 6 (max 0)
1369: /usr/bin/ld: /home/jalopezg/CERN/build/roottest/root/dataframe/test_readFcc: error adding symbols: bad value
1369: collect2: error: ld returned 1 exit status
1369: Error in <ACLiC>: Executing 'cd "/home/jalopezg/CERN/build/roottest/root/dataframe" ; c++ -fPIC -c -O3 -DNDEBUG -std=c++14 -Wno-implicit-fallthrough -Wno-noexcept-type -pipe -W -Woverloaded-virtual -fsigned-char -pthread  -I$ROOTSYS/include -I/home/jalopezg/CERN/build/roottest/root/dataframe -I"/home/jalopezg/CERN/build/etc/" -I"/home/jalopezg/CERN/build/etc//cling" -I"/home/jalopezg/CERN/build/include/" -I"/home/jalopezg/CERN/build/include" -I"/home/jalopezg/CERN/build/roottest/root/dataframe"   -D__ACLIC__ "/home/jalopezg/CERN/build/roottest/root/dataframe/test_snapshot_C_ACLiC_dict.cxx" ; c++ -O3 -DNDEBUG "/home/jalopezg/CERN/build/roottest/root/dataframe/test_snapshot_C_ACLiC_dict.o" -shared   "/usr/lib/libdl.so" "/usr/lib/libc.so" "/usr/lib/libgcc_s.so" "/usr/lib/libm.so" "/usr/lib/libstdc++.so" "/home/jalopezg/CERN/build/lib/libRint.so" "/home/jalopezg/CERN/build/lib/libCore.so" "/usr/lib/libpthread.so.0" "/usr/lib/libpcre.so.1" "/usr/lib/liblzma.so.5" "/usr/lib/liblz4.so.1" "/usr/lib/libz.so.1" "/usr/lib/libzstd.so.1" "/usr/lib/libnss_files.so.2" "/home/jalopezg/CERN/build/lib/libRIO.so" "/home/jalopezg/CERN/build/lib/libThread.so" "/home/jalopezg/CERN/build/lib/libCling.so" "/usr/lib/librt.so.1" "/usr/lib/libncursesw.so.6" "/usr/lib/ld-2.33.so" "/home/jalopezg/CERN/build/lib/libMathCore.so" "/home/jalopezg/CERN/build/lib/libImt.so" "/home/jalopezg/CERN/build/lib/libMultiProc.so" "/home/jalopezg/CERN/build/lib/libNet.so" "/usr/lib/libtbb.so.2" "/usr/lib/libssl.so.1.1" "/usr/lib/libcrypto.so.1.1" "/usr/lib/libnss_systemd.so.2" -o "/home/jalopezg/CERN/build/roottest/root/dataframe/test_snapshot_C.so" /home/jalopezg/CERN/build/lib/libROOTDataFrame.so /home/jalopezg/CERN/build/lib/libTreePlayer.so /home/jalopezg/CERN/build/lib/libTree.so /home/jalopezg/CERN/build/roottest/root/dataframe/test_readFcc /home/jalopezg/CERN/build/roottest/root/dataframe/par' failed!
1369: terminate called after throwing an instance of 'std::logic_error'
1369:   what():  basic_string::_M_construct null not valid

I did not manage to reproduce the problem outside of ctest and without running multiple tests concurrently -- it also seems that one of the tests "has" to be roottest-root-dataframe-test_snapshot_manytasks.

To Reproduce

The following repository provides a Dockerfile to reproduce the issue in a container https://gitlab.cern.ch/eguiraud/arch_aclic_bug .

Setup

ROOT master (but 6.22 also has issues) and Arch Linux.

I tried building from source (6.22 and master) as well as installing from the system repos -- failure modes are different but none can run all tests without problems.

Gentoo seems to also be affected.

@eguiraud
Member Author

I think there is a bug in how the ACLiC compilation command is built.

I see this compilation command for one of the RDF roottests:

c++ -O3 -DNDEBUG "/home/blue/ROOT/master/_build/roottest/root/dataframe/writeFcc_C_ACLiC_dict.o" -shared   "/usr/lib/libdl.so" "/usr/lib/libc.so" "/usr/lib/libgcc_s.so" "/usr/lib/libm.so" "/usr/lib/libstdc++.so" "/home/blue/ROOT/master/_build/lib/libRint.so" "/home/blue/ROOT/master/_build/lib/libCore.so" "/usr/lib/libpthread.so.0" "/usr/lib/libpcre.so.1" "/usr/lib/liblzma.so.5" "/usr/lib/libxxhash.so.0" "/usr/lib/liblz4.so.1" "/usr/lib/libz.so.1" "/usr/lib/libzstd.so.1" "/usr/lib/libnss_files.so.2" "/home/blue/ROOT/master/_build/lib/libRIO.so" "/home/blue/ROOT/master/_build/lib/libThread.so" "/home/blue/ROOT/master/_build/lib/libCling.so" "/usr/lib/librt.so.1" "/usr/lib/libncursesw.so.6" "/usr/lib/ld-2.33.so" "/home/blue/ROOT/master/_build/lib/libMathCore.so" "/home/blue/ROOT/master/_build/lib/libImt.so" "/home/blue/ROOT/master/_build/lib/libMultiProc.so" "/home/blue/ROOT/master/_build/lib/libNet.so" "/usr/lib/libtbb.so.2" "/usr/lib/libssl.so.1.1" "/usr/lib/libcrypto.so.1.1" "/usr/lib/libnss_systemd.so.2" "/usr/lib/libcrypt.so.2" "/usr/lib/libp11-kit.so.0" "/usr/lib/libffi.so.7" -o "/home/blue/ROOT/master/_build/roottest/root/dataframe/writeFcc_C.so" /home/blue/ROOT/master/_build/lib/libROOTDataFrame.so /home/blue/ROOT/master/_build/lib/libTreePlayer.so /home/blue/ROOT/master/_build/lib/libTree.so /home/blue/ROOT/master/_build/roottest/root/dataframe/branchoverwrite /home/blue/ROOT/master/_build/roottest/root/dataframe/test_readerarray

i.e.

c++ <bunch of irrelevant stuff> -o "/home/blue/ROOT/master/_build/roottest/root/dataframe/writeFcc_C.so" /home/blue/ROOT/master/_build/lib/libROOTDataFrame.so /home/blue/ROOT/master/_build/lib/libTreePlayer.so /home/blue/ROOT/master/_build/lib/libTree.so /home/blue/ROOT/master/_build/roottest/root/dataframe/branchoverwrite /home/blue/ROOT/master/_build/roottest/root/dataframe/test_readerarray

The last two arguments are bogus: neither /home/blue/ROOT/master/_build/roottest/root/dataframe/branchoverwrite nor /home/blue/ROOT/master/_build/roottest/root/dataframe/test_readerarray should be there.

Similarly for the failure reported in the issue description: two unexpected extra arguments are appended to the compiler invocation, and they end up causing obscure ACLiC compilation failures.

@eguiraud
Member Author

At TSystem.cxx:3340, in TSystem::CompileMacro, the following call:

gInterpreter->GetSharedLibDeps("/home/blue/ROOT/master/roottest/root/dataframe/writeFcc_C.so", /*tryDyld*/ true);

returns

"writeFcc_C.so libROOTDataFrame.so libTreePlayer.so libTree.so branchoverwrite test_readerarray "

where I'm pretty sure the last two entries shouldn't be present.
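
For anyone who wants to spot such entries quickly, here is a minimal diagnostic sketch. It assumes the dependency list is a whitespace-separated string like the one returned above, and it uses a rough heuristic (the name contains ".so") that is only meant for debugging on Linux. The helper name suspiciousDeps is hypothetical and not part of ROOT.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical diagnostic helper (not part of ROOT): split a whitespace-separated
// dependency list and return the entries whose names do not contain ".so".
// Assumption: on Linux, genuine shared libraries are named like libFoo.so or libfoo.so.1.
std::vector<std::string> suspiciousDeps(const std::string &deps)
{
   std::vector<std::string> suspicious;
   std::istringstream in(deps);
   std::string entry;
   while (in >> entry) {
      if (entry.find(".so") == std::string::npos)
         suspicious.push_back(entry);
   }
   return suspicious;
}
```

Applied to the string above, this would flag branchoverwrite and test_readerarray. It is a naming heuristic only, so it says nothing about what the files actually contain.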

@eguiraud
Member Author

In TInterpreter::GetSharedLibDeps, at some point we call interp->getDynamicLibraryManager()->searchLibrariesForSymbol("__libc_single_threaded@GLIBC_2.32", /*searchSystem*/false) and that returns "/home/blue/ROOT/master/_build/roottest/root/dataframe/branchoverwrite" in this broken usecase.

As branchoverwrite is a completely unrelated test, the linking of which actually breaks the ACLiC compilation of this macro, I guess we don't want to pick "__libc_single_threaded@GLIBC_2.32" from "/home/blue/ROOT/master/_build/roottest/root/dataframe/branchoverwrite"

eguiraud assigned eguiraud and vgvassilev and unassigned jalopezg-git and eguiraud May 27, 2021
@eguiraud
Member Author

Assigning to @vgvassilev as the problem seems to be that interp->getDynamicLibraryManager()->searchLibrariesForSymbol picks up unrelated executables as if they were libraries providing certain symbols

@vgvassilev
Member

@eguiraud, thanks for the detailed analysis. Unfortunately this works as designed and I am not sure if we can fix that behavior in a way that is reasonable when running the tests while keeping it working for the rest.

If the issue is just for the test suite, I'd propose splitting these tests into subfolders.

@pcanal
Member

pcanal commented May 27, 2021

@vgvassilev I am confused, why does getDynamicLibraryManager return the names of executables? What is the purpose and/or why would an executable be used to provide symbols for other executables?

@Axel-Naumann
Member

Axel-Naumann commented May 28, 2021

> Unfortunately this works as designed

That would mean the design is broken and needs to be reconsidered, possibly by re-introducing rootmap files. @pcanal FYI

@eguiraud
Member Author

Also note that this is (or seems to be) a relatively recent problem: we have only seen it on Arch and Gentoo for the past few months.

@vgvassilev
Member

> @vgvassilev I am confused, why does getDynamicLibraryManager returns the names of executables? What is the purpose and/or why would an executable be used to provide symbols for other executables?

Ah, if those are executables and not shared objects that’s indeed a bug that can be fixed.

@vgvassilev
Member

> In TInterpreter::GetSharedLibDeps, at some point we call interp->getDynamicLibraryManager()->searchLibrariesForSymbol("__libc_single_threaded@GLIBC_2.32", /*searchSystem*/false) and that returns "/home/blue/ROOT/master/_build/roottest/root/dataframe/branchoverwrite" in this broken usecase.
>
> As branchoverwrite is a completely unrelated test, the linking of which actually breaks the ACLiC compilation of this macro, I guess we don't want to pick "__libc_single_threaded@GLIBC_2.32" from "/home/blue/ROOT/master/_build/roottest/root/dataframe/branchoverwrite"

Is branchoverwrite the executable that runs TInterpreter::GetSharedLibDeps? If not, then for some reason, on Arch, the implementation thinks branchoverwrite is a shared object and not an executable.

@eguiraud
Member Author

branchoverwrite is a completely unrelated executable, see roottest/root/dataframe/CMakeLists.txt

@vgvassilev
Member

Unfortunately I do not have access to such a system but if you can build with -DLLVM_BUILD_TYPE=Debug the function isSharedLibrary here should return true for branchoverwrite.

If that is the case then there might be a problem with the ELF representation (or ACLiC is missing some compiler/linker flag).

@eguiraud
Member Author

The issue description links to a dockerfile that reproduces the problem

@vgvassilev
Member

The link seems broken for me.

@eguiraud
Member Author

Sorry, gitlab makes projects private by default. Now fixed 😓

@eguiraud
Member Author

Hi @vgvassilev , could you suggest a workaround for this? It prevents me from running certain tests locally 😅

@vgvassilev
Member

Sorry, I can look at that soon. I am not really sure whether that can be worked around.

@vgvassilev
Member

Some updates:

> Unfortunately I do not have access to such a system but if you can build with -DLLVM_BUILD_TYPE=Debug the function isSharedLibrary here should return true for branchoverwrite.
>
> If that is the case then there might be a problem with the Elf representation (or ACLiC is missing some compiler/linker flag).

(gdb) p Error
$6 = {_M_value = 0, _M_cat = 0x7ffff7917180 <(anonymous namespace)::system_category_instance>}
(gdb) list
260	    file_magic Magic;
261	    const std::error_code Error = identify_magic(libFullPath, Magic);
262	    if (exists)
263	      *exists = !Error;
264	
265	    return !Error &&
266	#ifdef __APPLE__
267	      (Magic == file_magic::macho_fixed_virtual_memory_shared_lib
268	       || Magic == file_magic::macho_dynamically_linked_shared_lib
269	       || Magic == file_magic::macho_dynamically_linked_shared_lib_stub
(gdb) p Magic
$7 = {V = llvm::file_magic::elf_shared_object}

There is nothing wrong with the implementation per se. For some reason the system builds the test_snapshot_manytasks executable with a file magic that corresponds to a shared object, although the file utility correctly recognizes the right kind:

/build_root/lib/libCore.so: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, BuildID[sha1]=e29464343a374abe0b0e4149350d4a57f8383b3a, with debug_info, not stripped
[root@39b36d564292 dataframe]# file -i /build_root/roottest/root/dataframe/test_snapshot_manytasks
/build_root/roottest/root/dataframe/test_snapshot_manytasks: application/x-pie-executable; charset=binary

@eguiraud
Member Author

Possibly relevant (this or one of the related questions): https://stackoverflow.com/questions/34519521/why-does-gcc-create-a-shared-object-instead-of-an-executable-binary-according-to

Looks like the file magic being checked is not sufficient to distinguish between shared objects and executables (anymore)?
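
To illustrate why a header-only check cannot tell the two apart, here is a small sketch that reads the e_type field from the first bytes of an ELF header. Both shared objects and PIE executables carry ET_DYN there, which matches the elf_shared_object magic seen in the gdb session above. The function name elfType and the constant names are made up for this example; the offsets and values follow the ELF layout.

```cpp
#include <cstdint>
#include <vector>

constexpr std::uint16_t kET_EXEC = 2; // fixed-address (non-PIE) executable
constexpr std::uint16_t kET_DYN = 3;  // shared object *or* PIE executable

// Hypothetical helper: return e_type from the leading bytes of a file,
// or -1 if the bytes do not look like an ELF header.
int elfType(const std::vector<unsigned char> &header)
{
   if (header.size() < 18 || header[0] != 0x7f || header[1] != 'E' ||
       header[2] != 'L' || header[3] != 'F')
      return -1;
   // EI_DATA (byte 5): 1 = little endian, 2 = big endian.
   if (header[5] == 1)
      return header[16] | (header[17] << 8);
   if (header[5] == 2)
      return (header[16] << 8) | header[17];
   return -1;
}
```

A result of kET_DYN therefore only says "position independent", not "library"; further information has to come from elsewhere in the file.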

@vgvassilev
Member

Link right on target. Yeah, it looks like PIE is there to enable address space layout randomization for executables. I am not sure that’s important for ACLiC binaries. A fix/workaround for the problem would be to pass -fno-pie to the ACLiC command.

vgvassilev added a commit to vgvassilev/root that referenced this issue Dec 9, 2021
Executables compiled with -fPIE are position independent and are almost
indistinguishable from shared objects. A reasonably reliable way to tell whether
a file is a `pie executable` is to check the `DF_1_PIE` flag in the dynamic
section of the ELF file.

The pseudo-code is:
```
if DT_FLAGS_1 dynamic section entry is present
  if DF_1_PIE is set in DT_FLAGS_1:
    print pie executable
  else
    print shared object
```

See https://stackoverflow.com/questions/34519521/why-does-gcc-create-a-shared-object-instead-of-an-executable-binary-according-to/34522357#34522357

Fixes root-project#7366

Patch by Alexander Penev (@alexander-penev)
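
The pseudo-code in the commit message can be sketched in C++ roughly as follows. This is not the code from the patch: it works on an already-parsed list of dynamic-section entries, the names DynEntry, DynKind and classifyDynamic are invented for the example, and treating a missing DT_FLAGS_1 entry as "shared object" is an assumption, since the pseudo-code does not cover that case. The numeric values of DT_FLAGS_1 and DF_1_PIE are the ones found in the usual elf.h.

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in for a 64-bit ELF dynamic-section entry (tag plus value).
struct DynEntry {
   std::int64_t tag;
   std::uint64_t val;
};

constexpr std::int64_t kDT_NULL = 0;             // terminates the dynamic section
constexpr std::int64_t kDT_FLAGS_1 = 0x6ffffffb; // entry holding the DF_1_* flags
constexpr std::uint64_t kDF_1_PIE = 0x08000000;  // set by the linker for PIE executables

enum class DynKind { SharedObject, PieExecutable };

// Hypothetical classifier following the pseudo-code above.
DynKind classifyDynamic(const std::vector<DynEntry> &dynamicSection)
{
   for (const auto &entry : dynamicSection) {
      if (entry.tag == kDT_NULL)
         break; // end of the dynamic section
      if (entry.tag == kDT_FLAGS_1)
         return (entry.val & kDF_1_PIE) ? DynKind::PieExecutable : DynKind::SharedObject;
   }
   // Assumption: no DT_FLAGS_1 entry at all is treated as a shared object.
   return DynKind::SharedObject;
}
```

A caller would first locate and decode the dynamic section of an ET_DYN file, then exclude anything classified as PieExecutable from the list of candidate libraries.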
@vgvassilev
Member

@eguiraud, could you test #9404 on your setup?

vgvassilev added a commit to vgvassilev/root that referenced this issue Dec 9, 2021
@jalopezg-git
Contributor

> @eguiraud, could you test #9404 on your setup?

@vgvassilev I have a similar setup (Arch Linux x86_64) and I can confirm that the patch fixes the problem. :-)

@eguiraud
Member Author

Hi, works for me as well! It would be nice to have confirmation for Gentoo as well if and when @amadio has time to try this out, but the patch is a fix for the original issue.

@amadio
Member

amadio commented Dec 13, 2021

Sorry, I missed this before, please ping me in the issue description or add me as assignee next time to ensure I see it. Gentoo has -fPIE enabled by default in GCC, and I've had to make changes in ROOT already because of some tests failing due to that. If the patch fixes it for Arch, it should fix it for Gentoo. I can pick up the patch in 6.24.06, but I think this is important enough that a new patch release would be nice to make sure the fix hits as many people as possible. If there's a quick way to test this, let me know, I can test and report on the results.

@eguiraud
Member Author

Thank you Guilherme, afaik it's enough to run ctest -j8 to test this, see your comment at #7936 (comment)

@amadio
Member

amadio commented Dec 13, 2021

Ok, I will build the latest master and (maybe) report which tests are failing for me in a new issue then. Thanks!

@amadio
Member

amadio commented Dec 13, 2021

No dataframe tests fail for me, but a bunch of others do fail, so if this is in master, it should have fixed the problem in Gentoo.

vgvassilev added a commit to vgvassilev/root that referenced this issue Dec 13, 2021
eguiraud pushed a commit to eguiraud/root that referenced this issue Dec 14, 2021
eguiraud added this to the 6.26/00 milestone Jan 5, 2022
eguiraud pushed a commit to eguiraud/root that referenced this issue Jan 5, 2022
vgvassilev added a commit that referenced this issue Jan 7, 2022
FonsRademakers pushed a commit to root-project/cling that referenced this issue Jan 7, 2022
@github-actions

Hi @vgvassilev,

It appears this issue is closed, but wasn't yet added to a project. Please add upcoming versions that will include the fix, or 'not applicable' otherwise.

Sincerely,
🤖

@eguiraud
Member Author

I guess this needs a backport to v6-26-00-patches?

@Axel-Naumann
Member

Reopening until it lands in 6.26 (also so we can set the project "fixed in 6.26")

Axel-Naumann reopened this Jan 10, 2022
eguiraud pushed a commit to eguiraud/root that referenced this issue Jan 24, 2022
@eguiraud
Member Author

A v6.26 backport is at #9676

rahulgrit pushed a commit to rahulgrit/root that referenced this issue Jan 25, 2022
eguiraud pushed a commit that referenced this issue Jan 25, 2022
@github-actions

Hi @eguiraud, @vgvassilev,

It appears this issue is closed, but wasn't yet added to a project. Please add upcoming versions that will include the fix, or 'not applicable' otherwise.

Sincerely,
🤖
