
Heap corruption on Python when torch is imported before juliacall, but not the reverse #215

Closed
MilesCranmer opened this issue Sep 5, 2022 · 8 comments

Comments

@MilesCranmer
Contributor

MilesCranmer commented Sep 5, 2022

Here is my system information:

  • Python 3.8.9
  • Julia 1.8.0
  • macOS 12
  • M1 chip (ARM64)
  • Python from homebrew (not conda)

I have not tested this on other systems.

Here is the trigger:

>>> import torch
>>> from juliacall import Main as jl

and the error:

Python(65251,0x104cf8580) malloc: Heap corruption detected, free list is damaged at 0x600001c17280
*** Incorrect guard value: 1903002876
Python(65251,0x104cf8580) malloc: *** set a breakpoint in malloc_error_break to debug
[1]    65251 abort      ipython

However, I can run the following just fine:

>>> from juliacall import Main as jl
>>> import torch

Here are some related issues: JuliaPy/pyjulia#125, pytorch/pytorch#78829. In particular, check out the comment from @tttc3: pytorch/pytorch#78829 (comment).

@cjdoris
Collaborator

cjdoris commented Sep 6, 2022

Oof, fun error!

If the root cause is the same as in the comment you linked (which looks highly plausible) then I doubt it can be fixed from JuliaCall. It could/should be documented in a troubleshooting/FAQ section in the docs. We could maybe add a warning via an import hook, but that seems a bit much.
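
For reference, the import-hook idea could look roughly like the sketch below. This is just a minimal illustration, not something PythonCall ships; the class name is made up, and the hook would have to be installed before juliacall is imported (e.g. from sitecustomize.py) to do anything useful.

import sys
import warnings
from importlib.abc import MetaPathFinder


class TorchOrderWarner(MetaPathFinder):
    """Warn when juliacall is imported while torch is already loaded."""

    def find_spec(self, fullname, path, target=None):
        if fullname == "juliacall" and "torch" in sys.modules:
            warnings.warn(
                "torch was imported before juliacall; on some platforms this "
                "is known to corrupt the heap. Import juliacall first."
            )
        return None  # always defer to the normal import machinery


sys.meta_path.insert(0, TorchOrderWarner())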

@MilesCranmer
Contributor Author

Thanks - is there a way that solution 1 could be used here? (is this similar to how JuliaCall works?) i.e.,

Use RTLD_DEEPBIND when loading _C.cpython-310-x86_64-linux-gnu.so. This ensures that pytorch will look for the symbol within libtorch_cpu.so before looking at the globally imported ones from libjulia-internals.so. However, I do not know if this would have some other unintended consequences?

Unfortunately, solutions 2 and 3 don't seem to help me (maybe because I'm on a Mac).

@cjdoris
Collaborator

cjdoris commented Sep 6, 2022

I'm not sure what _C is; is it referring to C bindings for the Torch library? I assume using DEEPBIND would require a change to PyTorch.

@MilesCranmer
Contributor Author

Ah, yes, I think you are right... Thanks!

@MilesCranmer
Contributor Author

MilesCranmer commented Sep 6, 2022

Just added a warning for this to PySR until it gets solved - feel free to do something similar in PythonCall.jl! I think this will prevent users from getting discouraged, since a random segfault when starting Julia would otherwise be a mystery - especially if torch is imported by a package rather than imported directly.

import sys
import warnings


def check_for_conflicting_libraries():  # pragma: no cover
    """Check whether there are conflicting modules, and display warnings."""
    # See https://github.com/pytorch/pytorch/issues/78829: importing
    # pytorch before running `pysr.fit` causes a segfault.
    torch_is_loaded = "torch" in sys.modules
    if torch_is_loaded:
        warnings.warn(
            "`torch` was loaded before the Julia instance started. "
            "This may cause a segfault when running `PySRRegressor.fit`. "
            "To avoid this, please run `pysr.julia_helpers.init_julia()` *before* "
            "importing `torch`. "
            "For updates, see https://github.com/pytorch/pytorch/issues/78829"
        )

This simple sys.modules check seems to be enough:

sys.modules contains every module imported anywhere in the current interpreter process, so torch shows up in it even when it was imported by some other Python module rather than directly.
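
As a tiny illustration of that point (some_package here is a hypothetical package that happens to import torch internally):

import sys

print("torch" in sys.modules)  # False in a fresh interpreter
import some_package            # hypothetical: imports torch somewhere inside
print("torch" in sys.modules)  # True: the indirect import still shows up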

@cjdoris
Collaborator

cjdoris commented Sep 8, 2022

Good idea. I'm documenting it in a new Troubleshooting section in the docs.

@willow-ahrens

I am still seeing this error on macOS, but in my case it happens when loading juliacall after numba. Here's the failing CI: pydata/sparse#767. I'm willing to put some development effort into this, but it's a bit out of my wheelhouse. I suspect the issue is also with LLVM symbols. Does anyone know what it would take to fix this? Is a PR to numba required?

@MilesCranmer
Contributor Author

@tttc3 has the deepest understanding of this issue, although that was back in 2022, so things might have changed since then. They wrote up a super useful and detailed comment describing what they had looked at: pytorch/pytorch#78829 (comment). I will copy it here for visibility:

I've had a bit of a dig and think I've found the problem, at least for the Linux case. In the Linux scenario the issue occurs when Julia is loaded before pytorch within the same process.

Cause of the problem in Linux

What I think is happening is as follows:

  1. When Julia is started, it calls dlopen(libjulia-internals.so, RTLD_NOW | RTLD_GLOBAL). This causes the symbols in libjulia-internals.so to be globally exported to any subsequently loaded objects. Because one of the libraries linked by libjulia-internals.so is libLLVM-12jl.so, it causes Julia to export these LLVM symbols globally.
  2. When pytorch is imported it calls dlopen(_C.cpython-310-x86_64-linux-gnu.so, RTLD_NOW | RTLD_LOCAL), which links to libtorch_cpu.so. The version of libtorch_cpu.so that is packaged in the pip wheels contains LLVM symbols (note that the conda packages do not contain the LLVM symbols; I assume this is because the pip wheels are compiled using clang while conda uses gcc?). EDIT: It appears the reason some packages cause the issue and not others is a libLLVM dependency present in all official pytorch packages since version 1.10.0 that is not present in the conda-forge, pkgs/main, or anaconda channel builds. Hence, when using the 1.12.1 build from the conda-forge channel on Linux, the issue appears to go away.
  3. This is where the problem arises. When pytorch comes to use an LLVM symbol, such as _ZN4llvm2cl3optINS_15FunctionSummary23ForceSummaryHotnessTypeELb1ENS0_6parserIS3_EEED2Ev in the Linux log above, it uses the globally exported symbols from libjulia-internals.so instead of the local symbols from libtorch_cpu.so. Thus, when pytorch closes and calls _ZN4llvm2cl3optINS_15FunctionSummary23ForceSummaryHotnessTypeELb1ENS0_6parserIS3_EEED2Ev, it calls the implementation within libLLVM-12jl.so, causing the incorrect pointer to be freed.
  4. When Julia subsequently closes and calls the same _ZN4llvm2cl3optINS_15FunctionSummary23ForceSummaryHotnessTypeELb1ENS0_6parserIS3_EEED2Ev as pytorch, the method tries to free a pointer that has already been freed, leading to the observed error message.

Potential solutions

  1. Use RTLD_DEEPBIND when loading _C.cpython-310-x86_64-linux-gnu.so. This ensures that pytorch will look for the symbol within libtorch_cpu.so before looking at the globally imported ones from libjulia-internals.so. However, I do not know if this would have some other unintended consequences?
  2. Ensure that the libraries within the pip wheels are compiled the same way as in conda. EDIT: See the edit made above.
  3. Use Julia 1.8.0-rc1 or greater, where the symbols from libLLVM-12jl.so are no longer globally exported (the link to libjulia-internals.so has been removed and replaced with a link to libjulia-codegen.so, which is loaded with RTLD_LOCAL instead); I can confirm the issue does not appear (on Linux at least).

Although option 3 should work for this specific case, I would expect similar problems to arise if any other library globally exports shared symbol names before torch is loaded. I don't know if macOS handles things the same way, but it would be good to try Julia 1.8.0-rc1 and see if the issue goes away.

My guess is that if the issue is anything like the PyTorch one, then yes, a PR to numba is required. The potential fix would have been to change PyTorch to use RTLD_DEEPBIND. I haven't had this issue recently, so it might (?) have been fixed on the PyTorch side, or the fix in Julia 1.8.0-rc1 was enough to solve things, or I just haven't hit the right import order in a while. So I would start by trying to repeat @tttc3's analysis, but for numba, and try out their potential solutions.
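
If anyone wants to poke at the DEEPBIND idea from the Python side without patching PyTorch or numba, here is a minimal, untested sketch. It relies on sys.setdlopenflags, which only changes how Python itself dlopens extension modules, and RTLD_DEEPBIND is glibc-specific, so this is Linux-only and could have side effects of its own:

import os
import sys

# RTLD_DEEPBIND only exists on glibc/Linux; this is a no-op elsewhere
# (including macOS, where the original report comes from).
if hasattr(os, "RTLD_DEEPBIND"):
    sys.setdlopenflags(os.RTLD_NOW | os.RTLD_DEEPBIND)

import torch  # extension modules loaded after this point prefer their own symbols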
