
Heap corruption on Python when torch is imported before juliacall, but not the reverse #215

Closed
MilesCranmer opened this issue Sep 5, 2022 · 8 comments

Comments

@MilesCranmer
Contributor

MilesCranmer commented Sep 5, 2022

Here is my system information:

  • Python 3.8.9
  • Julia 1.8.0
  • macOS 12
  • M1 chip (ARM64)
  • Python from homebrew (not conda)

I have not tested this on other systems.

Here is the trigger:

>>> import torch
>>> from juliacall import Main as jl

and the error:

Python(65251,0x104cf8580) malloc: Heap corruption detected, free list is damaged at 0x600001c17280
*** Incorrect guard value: 1903002876
Python(65251,0x104cf8580) malloc: *** set a breakpoint in malloc_error_break to debug
[1]    65251 abort      ipython

However, I can run the following just fine:

>>> from juliacall import Main as jl
>>> import torch

Here are some related issues: JuliaPy/pyjulia#125, pytorch/pytorch#78829. In particular, check out the comment from @tttc3: pytorch/pytorch#78829 (comment).

@cjdoris
Collaborator

cjdoris commented Sep 6, 2022

Oof, fun error!

If the root cause is the same as in the comment you linked (which looks highly plausible) then I doubt it can be fixed from JuliaCall. It could/should be documented in a troubleshooting/FAQ section in the docs. We could maybe add a warning via an import hook, but that seems a bit much.
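
For reference, the import-hook idea could look roughly like the sketch below. This is just a minimal illustration, not something PythonCall ships; the class name is made up, and the hook would have to be installed before juliacall is imported (e.g. from sitecustomize.py) to do anything useful.

import sys
import warnings
from importlib.abc import MetaPathFinder


class TorchOrderWarner(MetaPathFinder):
    """Warn when juliacall is imported while torch is already loaded."""

    def find_spec(self, fullname, path, target=None):
        if fullname == "juliacall" and "torch" in sys.modules:
            warnings.warn(
                "torch was imported before juliacall; on some platforms this "
                "is known to corrupt the heap. Import juliacall first."
            )
        return None  # always defer to the normal import machinery


sys.meta_path.insert(0, TorchOrderWarner())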

@MilesCranmer
Contributor Author

Thanks - is there a way that solution 1 could be used here? (is this similar to how JuliaCall works?) i.e.,

Use RTLD_DEEPBIND when loading _C.cpython-310-x86_64-linux-gnu.so. This ensures that pytorch will look for the symbol within libtorch_cpu.so before looking at the globally imported ones from libjulia-internals.so. However, I do not know if this would have some other unintended consequences?

Unfortunately, solutions 2 and 3 don't seem to help me (maybe because I'm on a Mac).

@cjdoris
Collaborator

cjdoris commented Sep 6, 2022

I'm not sure what _C is; is it referring to C bindings for the Torch library? I assume using DEEPBIND would require a change to PyTorch.

@MilesCranmer
Contributor Author

Ah, yes, I think you are right... Thanks!

@MilesCranmer
Contributor Author

MilesCranmer commented Sep 6, 2022

Just added a warning for this to PySR until it gets solved - feel free to do something similar in PythonCall.jl! I think this will prevent users from getting discouraged, since a random segfault when starting Julia would otherwise be a mystery - especially if torch is imported by a package rather than imported directly.

import sys
import warnings


def check_for_conflicting_libraries():  # pragma: no cover
    """Check whether there are conflicting modules, and display warnings."""
    # See https://github.com/pytorch/pytorch/issues/78829: importing
    # pytorch before running `pysr.fit` causes a segfault.
    torch_is_loaded = "torch" in sys.modules
    if torch_is_loaded:
        warnings.warn(
            "`torch` was loaded before the Julia instance started. "
            "This may cause a segfault when running `PySRRegressor.fit`. "
            "To avoid this, please run `pysr.julia_helpers.init_julia()` *before* "
            "importing `torch`. "
            "For updates, see https://github.com/pytorch/pytorch/issues/78829"
        )

This simple sys.modules check seems to be enough:

sys.modules contains every module imported anywhere in the current interpreter process, so torch shows up in it even when it was imported by some other Python module rather than directly.
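
As a tiny illustration of that point (some_package here is a hypothetical package that happens to import torch internally):

import sys

print("torch" in sys.modules)  # False in a fresh interpreter
import some_package            # hypothetical: imports torch somewhere inside
print("torch" in sys.modules)  # True: the indirect import still shows up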

@cjdoris
Collaborator

cjdoris commented Sep 8, 2022

Good idea. I'm documenting it in a new Troubleshooting section in the docs.

@willow-ahrens

I am still seeing this error on macOS, but in my case it happens when loading juliacall after numba. Here's the failing CI: pydata/sparse#767. I'm willing to put some development effort into this, but it's a bit out of my wheelhouse. I suspect the issue is also with LLVM symbols. Does anyone know what it would take to fix this? Is a PR to numba required?

@MilesCranmer
Contributor Author

@tttc3 has the deepest understanding of this issue, although that was back in 2022, so things might have changed since then. They wrote up a super useful and detailed comment describing what they had looked at: pytorch/pytorch#78829 (comment). I will copy it here for visibility:

I've had a bit of a dig and think I've found the problem, at least for the Linux case. In the Linux scenario the issue occurs when Julia is loaded before pytorch within the same process.

Cause of the problem in Linux

What I think is happening is as follows:

  1. When Julia is started, it calls dlopen(libjulia-internals.so, RTLD_NOW | RTLD_GLOBAL). This causes the symbols in libjulia-internals.so to be globally exported to any subsequently loaded objects. Because one of the libraries linked by libjulia-internals.so is libLLVM-12jl.so, it causes Julia to export these LLVM symbols globally.
  2. When pytorch is imported it calls dlopen(_C.cpython-310-x86_64-linux-gnu.so, RTLD_NOW | RTLD_LOCAL), which links to libtorch_cpu.so. The version of libtorch_cpu.so that is packaged in the pip wheels contains LLVM symbols (note that the conda packages do not contain the LLVM symbols; I assume this is because the pip wheels are compiled using clang while conda uses gcc?). EDIT: It appears the reason some packages cause the issue and not others is a libLLVM dependency present in all official pytorch packages since version 1.10.0 that is not present in the conda-forge, pkgs/main, or anaconda channel builds. Hence, when using the 1.12.1 build from the conda-forge channel on Linux, the issue appears to go away.
  3. This is where the problem arises. When pytorch comes to use an LLVM symbol, such as _ZN4llvm2cl3optINS_15FunctionSummary23ForceSummaryHotnessTypeELb1ENS0_6parserIS3_EEED2Ev in the Linux log above, it uses the globally exported symbols from libjulia-internals.so instead of the local symbols from libtorch_cpu.so. Thus, when pytorch closes and calls _ZN4llvm2cl3optINS_15FunctionSummary23ForceSummaryHotnessTypeELb1ENS0_6parserIS3_EEED2Ev, it calls the implementation within libLLVM-12jl.so, causing the incorrect pointer to be freed.
  4. When Julia subsequently closes and calls the same _ZN4llvm2cl3optINS_15FunctionSummary23ForceSummaryHotnessTypeELb1ENS0_6parserIS3_EEED2Ev as pytorch, the method tries to free a pointer that has already been freed, leading to the observed error message.

Potential solutions

  1. Use RTLD_DEEPBIND when loading _C.cpython-310-x86_64-linux-gnu.so. This ensures that pytorch will look for the symbol within libtorch_cpu.so before looking at the globally imported ones from libjulia-internals.so. However, I do not know if this would have some other unintended consequences?
  2. Ensure that the libraries within the pip wheels are compiled the same way as in conda. EDIT: See the edit made above.
  3. Use Julia 1.8.0-rc1 or greater, where the symbols from libLLVM-12jl.so are no longer globally exported (the link to libjulia-internals.so has been removed and replaced with a link to libjulia-codegen.so, which is loaded with RTLD_LOCAL instead); I can confirm the issue does not appear (on Linux at least).

Although option 3 should work for this specific case, I would expect similar problems to arise if any other library globally exports shared symbol names before torch is loaded. I don't know if macOS handles things the same way, but it would be good to try Julia 1.8.0-rc1 and see if the issue goes away.

My guess is that if the issue is anything like the PyTorch one, then yes, a PR to numba is required. The potential fix would have been to change PyTorch to use RTLD_DEEPBIND. I haven't had this issue recently, so it might (?) have been fixed on the PyTorch side, or the fix in Julia 1.8.0-rc1 was enough to solve things, or I just haven't hit the right import order in a while. So I would start by trying to repeat @tttc3's analysis, but for numba, and try out their potential solutions.
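
If anyone wants to poke at the DEEPBIND idea from the Python side without patching PyTorch or numba, here is a minimal, untested sketch. It relies on sys.setdlopenflags, which only changes how Python itself dlopens extension modules, and RTLD_DEEPBIND is glibc-specific, so this is Linux-only and could have side effects of its own:

import os
import sys

# RTLD_DEEPBIND only exists on glibc/Linux; this is a no-op elsewhere
# (including macOS, where the original report comes from).
if hasattr(os, "RTLD_DEEPBIND"):
    sys.setdlopenflags(os.RTLD_NOW | os.RTLD_DEEPBIND)

import torch  # extension modules loaded after this point prefer their own symbols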
