Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Cython][FFI] Fix crash when call del operator for handle #17190

Merged
merged 3 commits into from
Jul 25, 2024

Conversation

echuraev
Copy link
Contributor

In case of cython when we create a set function for property then the following code will be generated:

static int __pyx_setprop_4test_9TestClass_handle(PyObject *o, PyObject *v, CYTHON_UNUSED void *x) {
  if (v) {
    return __pyx_pw_4test_9TestClass_6handle_3__set__(o, v);
  }
  else {
    PyErr_SetString(PyExc_NotImplementedError, "__del__");
    return -1;
  }
}

And when we call operator del for this handler, then the memory will be released and operator __set__ will be called for NULL object. In this case an exception that operator __del__ is not implemented will be generated. To avoid this problem we need to declare __del__ function for each property where we define operator __set__.

In case of cython when we create a set function for property then the
following code will be generated:
```
static int __pyx_setprop_4test_9TestClass_handle(PyObject *o, PyObject *v, CYTHON_UNUSED void *x) {
  if (v) {
    return __pyx_pw_4test_9TestClass_6handle_3__set__(o, v);
  }
  else {
    PyErr_SetString(PyExc_NotImplementedError, "__del__");
    return -1;
  }
}
```

And when we call operator `del` for this handler, then the memory will
be released and operator `__set__` will be called for NULL object. In
this case an exception that operator `__del__` is not implemented will
be generated. To avoid this problem we need to declare `__del__`
function for each property where we define operator `__set__`.
@echuraev echuraev force-pushed the echuraev/fix_cython_crash branch from 51c37d3 to 0044111 Compare July 23, 2024 13:02
@echuraev echuraev changed the title Fix crash when call del operator for handle [Cython][FFI] Fix crash when call del operator for handle Jul 23, 2024
@echuraev
Copy link
Contributor Author

cc: @tqchen, @vinx13, @Lunderberg, @masahi

@tqchen
Copy link
Member

tqchen commented Jul 23, 2024

can you show a test case that would trigger the error? I am still not very sure what can trigger such error. del function was not supposed to be called explicitly

One thing we need to be very careful about is whethere there would be double free issue, since the free function was called in dealloc as well

python/tvm/_ffi/_cython/packed_func.pxi Outdated Show resolved Hide resolved
python/tvm/_ffi/_cython/object.pxi Outdated Show resolved Hide resolved
python/tvm/_ffi/_cython/ndarray.pxi Outdated Show resolved Hide resolved
@echuraev
Copy link
Contributor Author

@tqchen thank you for your review! I'll work on your comments tomorrow. About the test case. Originally, problem appeared in disco. When we run model on multiple GPU and TVM was compiled with cython, then we have such crash:

Traceback (most recent call last):
  File "/opt/dlami/nvme/echuraev/Workspace/OctoML/ollm-lora/mlc-serve/mlc_serve/engine/staging_engine_worker.py", line 421, in run_generation_loop_worker
    model_module = model_module_loader(**model_module_loader_kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dlami/nvme/echuraev/Workspace/OctoML/ollm-lora/mlc-serve/mlc_serve/model/paged_cache_model.py", line 136, in __init__
    model, cache_manager = init_tvm_model(model_artifact_config, engine_config)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dlami/nvme/echuraev/Workspace/OctoML/ollm-lora/mlc-serve/mlc_serve/model/tvm_model.py", line 922, in init_tvm_model
    model = Model(model_artifact_config, dev)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dlami/nvme/echuraev/Workspace/OctoML/ollm-lora/mlc-serve/mlc_serve/model/tvm_model.py", line 112, in __init__
    self.mod, self.params, self.disco_session = get_tvm_model(config, dev)
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dlami/nvme/echuraev/Workspace/OctoML/ollm-lora/mlc-serve/mlc_serve/model/tvm_model_utils.py", line 275, in get_tvm_model
    return load_disco_module(config.model_artifact_path, lib_path, config.num_shards)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dlami/nvme/echuraev/Workspace/OctoML/ollm-lora/mlc-serve/mlc_serve/model/tvm_model_utils.py", line 93, in load_disco_module
    module, params = load_disco_weights_through_shard_loader(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dlami/nvme/echuraev/Workspace/OctoML/ollm-lora/mlc-serve/mlc_serve/model/tvm_model_utils.py", line 39, in load_disco_weights_through_shard_loader
    module = sess.load_vm_module(lib_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dlami/nvme/echuraev/Workspace/OctoML/ollm-lora/deps/tvm/python/tvm/runtime/disco/session.py", line 309, in load_vm_module
    return DModule(func(path, device, alloc_type), self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dlami/nvme/echuraev/Workspace/OctoML/ollm-lora/deps/tvm/python/tvm/runtime/disco/session.py", line 94, in __init__
    del dref.handle
        ^^^^^^^^^^^
NotImplementedError: __del__

The crash happen here when we try to delete dref.handle.

I simplified the problem and we can create a simple test.pyx without any TVM specific:

cdef class TestClass:
    property handle:
        def __get__(self):
            print("Getter")
            return None

        def __set__(self, value):
            print("Setter")
            return

The property in the code above has the same methods as exists in packed_func.pxi.

Next, create a setup.py file:

from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize("test.pyx"),
)

And build it by the following command:

python setup.py build_ext --inplace

After that, we can reproduce original issue:

$ python3
>>> import test
>>> t = test.TestClass()
>>> del t.handle
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: __del__
>>>

If we add an implementation of __del__ method to test.pyx than we won't have this exception:

  • Add this method to TestClass:
        def __del__(self):
            print("Dtor")
            return
  • Recompile the code and run the reproducer once again:
$ python3
>>> import test
>>> t = test.TestClass()
>>> del t.handle
Dtor
>>>

And after adding implementation of __del__ methods to the ffi files, the crash in disco also was fixed. If it is necessary, I can try to create a simple unit test tomorrow, but not sure how to integrate cython tests to the TVM infrastructure.

@echuraev
Copy link
Contributor Author

@tqchen I modified __del__ functions only to set NULL to handle. Looking at the code in disco/session.py, it looks like we shouldn't release memory after calling del operator, we just need to set NULL to handle of dref object and memory should be released by the last instance of this handle.

As a test for these changes, we can use two tests from disco/test_session.py. Without this fix, these tests are crashed when we use cython.

@tqchen
Copy link
Member

tqchen commented Jul 24, 2024

@echuraev thank you! I think in this particular case we should change https://github.com/apache/tvm/blob/main/python/tvm/runtime/disco/session.py#L95 to

 dref.handle = None

as in https://github.com/apache/tvm/blob/main/python/tvm/runtime/disco/session.py#L83

So we don't have to add del function.

@Lunderberg
Copy link
Contributor

I think I agree with @tqchen here. The TVMObjectFree function internally checks whether the handle is a nullptr, so setting dref.handle = None would not cause a segfault. Since no C++ object would have its reference count decremented, setting dref.handle = None would have the same desired result of removing the ownership of dref.handle from dref.

If there are other use cases of del my_obj.handle, then we probably should still add it to the cython definition, but it could be just a wrapper for my_obj.handle = None.

@echuraev
Copy link
Contributor Author

@tqchen, @Lunderberg thank you for your review. Removed implementation of __del__ methods and set dref.handle to None instead.

@echuraev
Copy link
Contributor Author

@tvm-bot rerun

Copy link
Contributor

@Lunderberg Lunderberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making the changes, and I think it looks good. Given that this would impact every use of session.load_module in cython, I'm guessing that it would have been caught by the unit tests, if they were run in CI. (Resolving failures exposed by the disco tests in PR #16975 has been on the back burner for me.)

@tqchen tqchen merged commit 08d7519 into apache:main Jul 25, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants