Ability to inject host site versions of libfabric/UCX #252

Open
ocaisa opened this issue May 30, 2023 · 6 comments

@ocaisa
Member

ocaisa commented May 30, 2023

We were looking into the case of the EFA fabric at AWS. What we provide in EESSI works with the fabric, but you get better performance with the libfabric version that AWS ships with the OS (Amazon Linux 2 in the case we investigated).

You can check this with:

LD_PRELOAD="/opt/amazon/efa/lib64/libfabric.so.1 /lib64/libefa.so.1" mpirun --mca pml cm osu_bibw

and compare that to

mpirun --mca pml cm osu_bibw
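A minimal sketch of a helper for this comparison, assuming you want the run to fail early when a host library is missing rather than silently fall back to the EESSI version. `build_preload` is a hypothetical name, not something EESSI provides.

```shell
# build_preload LIB...: print a colon-separated LD_PRELOAD value,
# or fail if any of the listed libraries does not exist on this host.
# (Hypothetical helper, for illustration only.)
build_preload() {
    preload=""
    for lib in "$@"; do
        if [ ! -e "$lib" ]; then
            echo "missing library: $lib" >&2
            return 1
        fi
        # Append with a ':' separator, but only once there is a first entry
        preload="${preload:+$preload:}$lib"
    done
    printf '%s\n' "$preload"
}

# Example, using the paths from the commands above:
# preload=$(build_preload /opt/amazon/efa/lib64/libfabric.so.1 /lib64/libefa.so.1) &&
#     LD_PRELOAD="$preload" mpirun --mca pml cm osu_bibw
```

The dynamic linker accepts both spaces and colons as separators in LD_PRELOAD, so the colon-separated value is equivalent to the quoted, space-separated form.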

As things stand, we've only built in capabilities to switch out the MPI library, but it may be better/easier to switch out the UCX/libfabric libraries.

@ocaisa
Member Author

ocaisa commented May 30, 2023

The issue with making those libraries available is that we don't control the ELF header of (an injected) libfabric.so.1, and that is the library that tries to load libefa.so.1. That means we can override the link to libfabric.so.1 but not the one to libefa.so.1.

The only thing I can think of right now is to point LD_LIBRARY_PATH at the same location as the overrides and keep a copy of the necessary library/libraries there. This sounds like another good reason to have the init scripts be a symlink.
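A rough sketch of that LD_LIBRARY_PATH idea, assuming the overrides live in a directory you can write to. `stage_host_libs` and `override_dir` are hypothetical names used only for illustration.

```shell
# stage_host_libs DEST LIB...: copy host libraries into the overrides
# directory so the dynamic linker can find them via LD_LIBRARY_PATH.
# (Hypothetical helper, for illustration only.)
stage_host_libs() {
    dest=$1
    shift
    mkdir -p "$dest" || return 1
    for lib in "$@"; do
        # -L copies the file a symlink points to, keeping the soname as filename
        cp -L "$lib" "$dest/" || return 1
    done
}

# Example (override_dir is an assumed location, adjust to your setup):
# stage_host_libs "$override_dir" /lib64/libefa.so.1
# export LD_LIBRARY_PATH="$override_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Copying rather than symlinking keeps the staged library self-contained, at the cost of having to refresh it when the host updates its EFA packages.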

@ocaisa
Member Author

ocaisa commented May 30, 2023

One solution may be to use the Gentoo Prefix equivalent of /etc/ld.so.preload (not ideal though, as these libraries are preloaded for everything that uses the prefix linker).

@ocaisa
Member Author

ocaisa commented May 30, 2023

Another option would be to ask AWS to build libfabric with RUNPATH support for /usr/lib{64}. It may sound weird but it would mean the host libefa.so.1 would be picked up before the one we ship in the compat layer.
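To see whether a given libfabric build already carries such a setting, one can inspect its dynamic section. The sketch below assumes binutils `readelf` is available on the host; `extract_runpath` is a hypothetical helper that only filters the `readelf -d` output.

```shell
# extract_runpath: read `readelf -d` output on stdin and print the value of
# any RPATH or RUNPATH entry (the text between the square brackets).
# (Hypothetical helper, for illustration only.)
extract_runpath() {
    sed -nE 's/.*\((RPATH|RUNPATH)\).*\[(.*)\].*/\2/p'
}

# Example:
# readelf -d /opt/amazon/efa/lib64/libfabric.so.1 | extract_runpath
```

No output means the library has neither an RPATH nor a RUNPATH entry, so its dependencies are resolved purely through the linker's default search order and environment.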

@ocaisa
Member Author

ocaisa commented Aug 2, 2023

In the older EESSI versions we saw some performance issues with GROMACS when injecting libfabric. With the latest EESSI release (2023.06), these issues no longer occur, and we can inject both libfabric and the AWS-provided OpenMPI (4.1.5, as opposed to the 4.1.1 used by the module GROMACS/2021.3-foss-2021a):

[EESSI pilot 2023.06] $ mpirun --mca pml cm gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 10000 -g logfile
...
                 (ns/day)    (hour/ns)
Performance:       55.402        0.433

[EESSI pilot 2023.06] $ LD_PRELOAD=/opt/amazon/efa/lib64/libfabric.so.1:/lib64/libefa.so.1 mpirun --mca pml cm gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 10000 -g logfile
...
                 (ns/day)    (hour/ns)
Performance:       57.946        0.414

[EESSI pilot 2023.06] $ LD_PRELOAD="/opt/amazon/openmpi/lib64/libmpi.so.40 /lib64/libhwloc.so.5 /lib64/libevent_core-2.0.so.5 /opt/amazon/openmpi/lib64/libopen-rte.so.40 /opt/amazon/openmpi/lib64/libopen-pal.so.40 /lib64/libnl-3.so.200 /lib64/libnl-route-3.so.200 /lib64/libevent_pthreads-2.0.so.5 /opt/amazon/efa/lib64/libfabric.so.1 /lib64/libefa.so.1" mpirun --mca pml cm gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 10000 -g logfile
...
                 (ns/day)    (hour/ns)
Performance:       58.356        0.411

So, by injecting the libraries we get about a 5% performance improvement (55.402 to 58.356 ns/day with both libfabric and the AWS OpenMPI injected).
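For reference, the relative gains can be computed directly from the ns/day figures in the three runs above. `pct_gain` is a hypothetical helper name.

```shell
# pct_gain BASE NEW: print the relative improvement of NEW over BASE in percent.
# (Hypothetical helper, for illustration only.)
pct_gain() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

pct_gain 55.402 57.946   # libfabric injected: prints 4.6
pct_gain 55.402 58.356   # libfabric and AWS OpenMPI injected: prints 5.3
```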

@ocaisa
Member Author

ocaisa commented Aug 2, 2023

LD_PRELOAD is clumsy and I would prefer to figure out another way to do the injection. I wonder if I can use patchelf to rewrite the ELF header of the AWS MPI library so that it finds all its required libraries. I tried setting the RPATH, but this worked too well, injecting more libraries than I actually wanted. I may need to replace the .so dependencies with their full paths.

@ocaisa
Member Author

ocaisa commented Aug 2, 2023

patchelf does indeed seem to provide a way forward:

# Force libmpi to resolve unavailable libraries to the system versions
cp /opt/amazon/openmpi/lib64/libmpi.so.40 .
patchelf --replace-needed libhwloc.so.5 /lib64/libhwloc.so.5 libmpi.so.40
patchelf --replace-needed libevent_core-2.0.so.5 /lib64/libevent_core-2.0.so.5 libmpi.so.40
patchelf --replace-needed libevent_pthreads-2.0.so.5 /lib64/libevent_pthreads-2.0.so.5 libmpi.so.40
# Do the same for libfabric (I'm forcing use of system `libefa` here) 
cp /opt/amazon/efa/lib64/libfabric.so.1 .
patchelf --replace-needed libefa.so.1 /lib64/libefa.so.1 libfabric.so.1 
patchelf --add-needed /lib64/libnl-route-3.so.200 libfabric.so.1  # Required by system libefa
patchelf --add-needed /lib64/libnl-3.so.200 libfabric.so.1        # Required by system libefa
# Make libmpi depend on the patched libfabric thereby effectively forcing a preload of that lib
patchelf --add-needed "$PWD/libfabric.so.1" libmpi.so.40

With those changes I was able to run:

LD_PRELOAD="$PWD/libmpi.so.40" mpirun --mca pml cm gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 10000 -g logfile

This means that if the modified libmpi is found first, we should be fully replacing the application's MPI and libfabric.

Due to our EasyBuild hook for rpath, we don't need LD_PRELOAD to force the executable to find the library:

mkdir -p /cvmfs/pilot.eessi-hpc.org/host_injections/2023.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system/lib
mv libfabric.so.1 libmpi.so.40 /cvmfs/pilot.eessi-hpc.org/host_injections/2023.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system/lib

and we can then see that EESSI resolves libmpi to the injected library with ldd $(which gmx_mpi).
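A small sketch for checking that resolution in a script rather than by eye. It only filters standard `ldd` output; `resolved_path` is a hypothetical helper name.

```shell
# resolved_path SONAME: read `ldd` output on stdin and print the path that
# SONAME resolves to (the field after "=>").
# (Hypothetical helper, for illustration only.)
resolved_path() {
    awk -v lib="$1" '$1 == lib && $2 == "=>" { print $3 }'
}

# Example:
# ldd "$(which gmx_mpi)" | resolved_path libmpi.so.40
# ldd "$(which gmx_mpi)" | resolved_path libfabric.so.1
```

If the injection works as described, the first command should print a path under the rpath_overrides directory created above. The dependencies recorded in the patched libraries themselves can be listed with `patchelf --print-needed libmpi.so.40`.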

trz42 pushed a commit to trz42/software-layer that referenced this issue Jan 23, 2024
remove obsolete test files, correct cpupath for skylake