Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[internal] Switch from Rust-Cpython to PyO3 #13526

Merged
merged 41 commits into from
Nov 14, 2021

Conversation

Eric-Arellano
Copy link
Contributor

@Eric-Arellano Eric-Arellano commented Nov 8, 2021

See #12110 for the original motivation.

Why migrate

Beyond the APIs requiring much less boilerplate and being more ergonomic (e.g. the .call0() and call1() variants of .call()), I think the biggest benefit is clearer modeling for when the GIL is held.

With PyO3, the two core types are &PyAny and PyObject (aka Py<PyAny). &PyAny is always a reference with the same lifetime as the Python token, which represents when the GIL is used. Whenever you want to do Python operations in Rust, like .getattr(), you use &PyAny. PyObject and any Py<T> are GIL-independent.

In contrast, rust-cpython only had PyObject and &PyObject. It passed around Python as an argument to any Python functions like .getattr().

PyO3 is more disciplined about when we should have the GIL and when we should not. For example, #13586 and #13609 were necessary to get PyO3 working. This allows us to better reason about performance because we can know when the GIL is held or not.

--

An additional benefit is that using proc macros instead of declarative macros means PyCharm no longer hangs for me when working in the engine/ crate! PyCharm does much better w/ proc macros, and things feel zippy now like autocomplete working.

Migration strategy

I tried originally doing an incremental strategy, most recently with #12451, which proved technically infeasible.

This PR completely swaps Rust-Cpython with PyO3, but tries to minimize the diff. It only makes changes when it was too difficult to get working with PyO3. For example, interface.rs is far too big, but this keeps the same organization as before.

For now, we still have the native_engine_pyo3 extension along with native_engine. That will be consolidated in a followup.

Benchmark

Before:

❯ hyperfine -w 1 -r 10 './pants --no-pantsd list ::'
Benchmark #1: ./pants --no-pantsd list ::
  Time (mean ± σ):      2.170 s ±  0.025 s    [User: 1.786 s, System: 0.303 s]
  Range (min … max):    2.142 s …  2.227 s    10 runs

After:

❯ hyperfine -w 1 -r 10 './pants --no-pantsd list ::'
Benchmark #1: ./pants --no-pantsd list ::
  Time (mean ± σ):      2.193 s ±  0.060 s    [User: 1.798 s, System: 0.292 s]
  Range (min … max):    2.144 s …  2.348 s    10 runs

@Eric-Arellano Eric-Arellano force-pushed the pyo3 branch 4 times, most recently from 9e29e72 to 0544e75 Compare November 10, 2021 16:17
Eric-Arellano added a commit that referenced this pull request Nov 11, 2021
…3586)

Prework for #13526.

Before, we were parsing the Python objects (which requires the GIL) in the same iterator that created async blocks. That won't work when we port to PyO3 - iterating over the `file_items` collection uses `&'py PyAny` values, where the `'py` lifetime corresponds to how long we are holding the GIL (as represented by the `pyo3::Python` type). The `Python` token is not safe in async blocks because it does not implement `Send`, so we can't create the futures in this iterator.

This fix makes the code much more understandable as well. In the future, we are unblocked from defining `CreateDigest` in Rust with `PyO3` because parsing of the code is now completely decoupled from using it to create async futures. 

[ci skip-build-wheels]
[ci skip-build-wheels]
This will allow us to more clearly start porting things to PyO3 without confusing types.

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
Don't pass the GIL over a future. Also stop using the obfuscation of `Value`

Down to 7 errors!

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
Copy link
Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks awesome! Thanks a lot.

Comment on lines +167 to +169
def __eq__(self, other: PySnapshot | Any) -> bool: ...
def __hash__(self) -> int: ...
def __repr__(self) -> str: ...
Copy link
Member

@stuhood stuhood Nov 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is it necessary to declare these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not strictly necessary. MyPy was not complaining about anything, I was being proactive so users of this API know what is expected to work without having to look at Rust code.

I believe __repr__ and __eq__ are always meant to be implemented, just with defaults though that is the object instance.

Wdyt? Too noisy?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not necessary, would leave them off probably. But doesn't really matter.

Comment on lines 272 to 273
/// count, but that type only works when we are holding the GIL. While we could clone the
/// `PyObject` without the GIL, that would be an expensive deep clone.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, possibly. Sounds like it is effectively already equivalent to an Arc.

Comment on lines -95 to +89
pub fn getattr<T>(value: &PyObject, field: &str) -> Result<T, String>
pub fn getattr<'py, T>(value: &'py PyAny, field: &str) -> Result<T, String>
where
for<'a> T: FromPyObject<'a>,
T: FromPyObject<'py>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, makes sense. Before, T was not going to have its own lifetime (it was implicitly 'static), but now it does.

src/rust/engine/src/intrinsics.rs Outdated Show resolved Hide resolved
@jsirois
Copy link
Contributor

jsirois commented Nov 13, 2021

Others have the review covered so I'll step away. Thanks for checking coarse perf is not worse.

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
Eric-Arellano added a commit that referenced this pull request Nov 14, 2021
Prework for #13526. We were reacquiring the GIL directly inside `py.allow_threads(||)`. Rust-cpython allows that, but PyO3 does not. (PyO3 has the benefit of instilling much stronger confidence in reasoning when the GIL is held or not.)

We can easily work around this by having `WorkunitStore.with_latest_workunits` return `Vec<Workunit>` instead of `&[Workunit]`. No new clones are happening! We already were making this Vec. Only now we move the value out, rather than referencing it via a closure passed in as a callback.

That change allows for operating on the `Vec<Workunit>` safely outside of the `py.allow_threads(||)` to convert it into the FFI boundary.

## Benchmark

These benchmarks use the Toolchain BuildSense plugin + anonymous telemetrics streaming workunit handlers.

### No pantsd

Before:

```
❯ hyperfine -w 1 -r 10 './pants -ldebug --no-pantsd list ::'
  Time (mean ± σ):      4.611 s ±  0.206 s    [User: 2.940 s, System: 0.474 s]
  Range (min … max):    4.419 s …  5.153 s    10 runs
```

After:

```
❯ hyperfine -w 1 -r 10 './pants -ldebug --no-pantsd list ::'
  Time (mean ± σ):      4.655 s ±  0.113 s    [User: 2.975 s, System: 0.509 s]
  Range (min … max):    4.430 s …  4.827 s    10 runs
```

### Pantsd

Before:

```
❯ hyperfine -w 1 -r 10 './pants -ldebug --concurrent --pantsd list ::'
  Time (mean ± σ):      2.641 s ±  0.021 s    [User: 2.026 s, System: 0.323 s]
  Range (min … max):    2.602 s …  2.670 s    10 runs
```

After:

```
❯ hyperfine -w 1 -r 10 './pants -ldebug --concurrent --pantsd list ::'
  Time (mean ± σ):      2.651 s ±  0.023 s    [User: 2.036 s, System: 0.334 s]
  Range (min … max):    2.611 s …  2.693 s    10 runs
```

[ci skip-build-wheels]
# Conflicts:
#	src/rust/engine/src/externs/interface.rs

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
@Eric-Arellano
Copy link
Contributor Author

Eric-Arellano commented Nov 14, 2021

I can still get Pants to crash on Linux by running ./pants --pantsd --concurrent -ldebug help goals a couple times.

This was happening in CI and locally on Linux (not macOS) this morning, and the core dump isolated it to line 380 from https://github.com/PyO3/pyo3/blob/47747444c76c9fc6bdd7800c55f3d6dad28a3c9b/src/python.rs#L365-L380. I tried fixing that with #13609, so that we don't reacquire the GIL inside py.allow_threads(||). That seems to have fixed that particular code path.

❯ gdb ~/.pyenv/versions/3.7.9/bin/python -c /tmp/core-python.31414.ip-10-1-201-65.1636833815
(gdb) bt
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  <signal handler called>
#2  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#3  0x00007f13e2960921 in __GI_abort () at abort.c:79
#4  0x00007f13e29a96b5 in __libc_message (action=(do_abort | do_backtrace), fmt=0x7f13e2ad6709 "%s", fmt=0x7f13e2ad6709 "%s",
    action=(do_abort | do_backtrace)) at ../sysdeps/posix/libc_fatal.c:181
#5  0x00007f13e29a99fa in __GI___libc_fatal (message=message@entry=0x7f13e34cac70 "FATAL: exception not rethrown\n")
    at ../sysdeps/posix/libc_fatal.c:191
#6  0x00007f13e34c6fc0 in unwind_cleanup (reason=<optimized out>, exc=<optimized out>) at unwind.c:105
#7  0x00007f13dd15cd3f in panic_unwind::real_imp::cleanup () at library/panic_unwind/src/gcc.rs:78
#8  panic_unwind::__rust_panic_cleanup () at library/panic_unwind/src/lib.rs:96
#9  0x00007f13dc91630d in std::panicking::try::cleanup () at library/std/src/panicking.rs:384
#10 0x00007f13dcabe4f7 in std::panicking::try::do_catch (payload=0x7ed09bbfdd70 "", data=<optimized out>)
    at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:428
#11 std::panicking::try (f=...) at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:367
#12 std::panic::catch_unwind (f=...) at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panic.rs:129
#13 pyo3::python::python::allow_threads (f=..., self=...)
    at /data/home/eric/.cargo/registry/src/jackfan.us.kg-1ecc6299db9ec823/pyo3-0.15.0/src/python.rs:380
#14 engine::externs::interface::session_poll_workunits::{{closure}}::{{closure}} (session=<optimized out>)
    at src/externs/interface.rs:847
#15 engine::externs::interface::with_session (py_session=<optimized out>, f=...) at src/externs/interface.rs:1700
#16 engine::externs::interface::session_poll_workunits::{{closure}} (scheduler=<optimized out>) at src/externs/interface.rs:845
#17 engine::externs::interface::with_scheduler::{{closure}} () at src/externs/interface.rs:1684
#18 task_executor::Executor::enter (self=<optimized out>, f=...)
    at /data/home/eric/pants/src/rust/engine/task_executor/src/lib.rs:102
#19 engine::externs::interface::with_scheduler (py_scheduler=<optimized out>, f=...) at src/externs/interface.rs:1684
#20 engine::externs::interface::session_poll_workunits (scheduler_ptr=<optimized out>, session_ptr=<optimized out>,
    max_log_verbosity_level=<optimized out>, py=...) at src/externs/interface.rs:844
#21 engine::externs::interface::__pyo3_raw_session_poll_workunits::{{closure}} (_py=...) at src/externs/interface.rs:834
#22 pyo3::callback::handle_panic::{{closure}} ()
    at /data/home/eric/.cargo/registry/src/jackfan.us.kg-1ecc6299db9ec823/pyo3-0.15.0/src/callback.rs:247
#23 std::panicking::try::do_call (data=<optimized out>)
    at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:403
#24 std::panicking::try (f=...) at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:367
#25 std::panic::catch_unwind (f=...) at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panic.rs:129
#26 pyo3::callback::handle_panic (body=...)
    at /data/home/eric/.cargo/registry/src/jackfan.us.kg-1ecc6299db9ec823/pyo3-0.15.0/src/callback.rs:245
#27 engine::externs::interface::__pyo3_raw_session_poll_workunits (_slf=<optimized out>, _args=<optimized out>,
    _nargs=<optimized out>, _kwnames=<optimized out>) at src/externs/interface.rs:834
#28 0x000055c15a6625dc in _PyMethodDef_RawFastCallKeywords (kwnames=0x0, nargs=139723303197824, args=0x7f13ddd98d78,

The most recent core dump is this, which I am not sure what to make sense of:

(gdb) bt
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  <signal handler called>
#2  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#3  0x00007fae404ec921 in __GI_abort () at abort.c:79
#4  0x00007fae405356b5 in __libc_message (action=(do_abort | do_backtrace), fmt=0x7fae40662709 "%s", fmt=0x7fae40662709 "%s",
    action=(do_abort | do_backtrace)) at ../sysdeps/posix/libc_fatal.c:181
#5  0x00007fae405359fa in __GI___libc_fatal (message=message@entry=0x7fae41056c70 "FATAL: exception not rethrown\n")
    at ../sysdeps/posix/libc_fatal.c:191
#6  0x00007fae41052fc0 in unwind_cleanup (reason=<optimized out>, exc=<optimized out>) at unwind.c:105
#7  0x00007fae3ad1873f in panic_unwind::real_imp::cleanup () at library/panic_unwind/src/gcc.rs:78
#8  panic_unwind::__rust_panic_cleanup () at library/panic_unwind/src/lib.rs:96
#9  0x00007fae3a4d369d in std::panicking::try::cleanup () at library/std/src/panicking.rs:384
#10 0x00007fae3a67ee23 in std::panicking::try::do_catch (payload=0x7fae37acfd70 "", data=<optimized out>)
    at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:428
#11 std::panicking::try (f=...) at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:367
#12 std::panic::catch_unwind (f=...) at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panic.rs:129
#13 pyo3::callback::handle_panic (body=...)
    at /data/home/eric/.cargo/registry/src/jackfan.us.kg-1ecc6299db9ec823/pyo3-0.15.0/src/callback.rs:245
#14 engine::externs::interface::__pyo3_raw_session_poll_workunits (_slf=<optimized out>, _args=<optimized out>,
    _nargs=<optimized out>, _kwnames=<optimized out>) at src/externs/interface.rs:834
#15 0x000055af68d965dc in _PyMethodDef_RawFastCallKeywords (kwnames=0x0, nargs=140386293464032, args=0x7fae3b25fe38,
    self=<module at remote 0x7fae3b25fe30>, method=0x55af6a8f3c40) at Objects/call.c:659
#16 _PyCFunction_FastCallKeywords (func=<built-in method session_poll_workunits of module object at remote 0x7fae3b25fe30>,
    args=args@entry=0x7f6b0a788920, nargs=nargs@entry=3, kwnames=kwnames@entry=0x0) at Objects/call.c:732
#17 0x000055af68d81eaf in call_function (kwnames=0x0, oparg=3, pp_stack=<synthetic pointer>) at Python/ceval.c:4568
#18 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3093
#19 0x000055af68d791b7 in function_code_fastcall (co=<optimized out>, args=<optimized out>, nargs=2, globals=<optimized out>)
    at Objects/call.c:283
#20 0x000055af68d95d06 in _PyFunction_FastCallKeywords (func=<optimized out>, stack=<optimized out>, nargs=<optimized out>,
    kwnames=<optimized out>) at Objects/call.c:415
#21 0x000055af68d80c3d in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at Python/ceval.c:4616
#22 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3110
#23 0x000055af68e5453d in PyEval_EvalFrameEx (throwflag=0,
    f=Frame 0x7f6b0b23c620, for file /data/home/eric/pants/src/python/pants/engine/streaming_workunit_handler.py, line 251, in poll_workunits (self=<_InnerHandler(_target=None, _name='Thread-1', _args=(), _kwargs={}, _daemonic=True, _ident=140386235119360, _tstate_lock=<_thread.lock at remote 0x7f6b0a7aaa20>, _started=<Event(_cond=<Condition(_lock=<_thread.lock at remote 0x7f6b0a7aaae0>, acquire=<built-in method acquire of _thread.lock object at remote 0x7f6b0a7aaae0>, release=<built-in method release of _thread.lock object at remote 0x7f6b0a7aaae0>, _waiters=<collections.deque at remote 0x7f6b0a848520>) at remote 0x7f6b0a818910>, _flag=True) at remote 0x7f6b0a818b50>, _is_stopped=False, _initialized=True, _stderr=<builtins.PyStdioWrite at remote 0x7fae38facb30>, scheduler=<SchedulerSession(_scheduler=<Scheduler(include_trace_on_error=True, _visualize_to_dir=None, _visualize_run_count=0, _py_scheduler=<builtins.PySche---Type <return> to continue, or q <return> to quit---
duler at remote 0x7f6b0a79e570>) at remote 0x7f6b0a7afed0>, _py_session=<builtins.PySession at remo...(truncated))
    at Python/ceval.c:547
#24 _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>, locals=locals@entry=0x0, args=<optimized out>,
    argcount=1, kwnames=0x7fae39951ce8, kwargs=0x7f6b0a78a6d8, kwcount=1, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0,
    name='poll_workunits', qualname='_InnerHandler.poll_workunits') at Python/ceval.c:3930
#25 0x000055af68d95c2f in _PyFunction_FastCallKeywords (func=<optimized out>, stack=<optimized out>, nargs=<optimized out>,
    kwnames=<optimized out>) at Objects/call.c:433
#26 0x000055af68d81741 in call_function (kwnames=('finished',), oparg=<optimized out>, pp_stack=<synthetic pointer>)
    at Python/ceval.c:4616
#27 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3139
#28 0x000055af68d791b7 in function_code_fastcall (co=<optimized out>, args=<optimized out>, nargs=1, globals=<optimized out>)
    at Objects/call.c:283
#29 0x000055af68d95d06 in _PyFunction_FastCallKeywords (func=<optimized out>, stack=<optimized out>, nargs=<optimized out>,
    kwnames=<optimized out>) at Objects/call.c:415
#30 0x000055af68d80c3d in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at Python/ceval.c:4616
#31 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3110
#32 0x000055af68d791b7 in function_code_fastcall (co=<optimized out>, args=<optimized out>, nargs=1, globals=<optimized out>)
    at Objects/call.c:283
#33 0x000055af68d95d06 in _PyFunction_FastCallKeywords (func=<optimized out>, stack=<optimized out>, nargs=<optimized out>,
    kwnames=<optimized out>) at Objects/call.c:415
#34 0x000055af68d80c3d in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at Python/ceval.c:4616
#35 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3110
#36 0x000055af68d791b7 in function_code_fastcall (co=co@entry=0x7fae3fe739c0, args=<optimized out>, args@entry=0x7fae37acee00,
    nargs=1,
    globals=globals@entry={'__name__': 'threading', '__doc__': "Thread module emulating a subset of Java's threading model.", '__package__': '', '__loader__': <SourceFileLoader(name='threading', path='/data/home/eric/.pyenv/versions/3.7.9/lib/python3.7/threading.py') at remote 0x7fae3fe65710>, '__spec__': <ModuleSpec(name='threading', loader=<...>, origin='/data/home/eric/.pyenv/versions/3.7.9/lib/python3.7/threading.py', loader_state=None, submodule_search_locations=None, _set_fileattr=True, _cached='/data/home/eric/.pyenv/versions/3.7.9/lib/python3.7/__pycache__/threading.cpython-37.pyc', _initializing=False) at remote 0x7fae3fe65a10>, '__file__': '/data/home/eric/.pyenv/versions/3.7.9/lib/python3.7/threading.py', '__cached__': '/data/home/eric/.pyenv/versions/3.7.9/lib/python3.7/__pycache__/threading.cpython-37.pyc', '__builtins__': {'__name__': 'builtins', '__doc__': "Built-in functions, exceptions, and other objects.\n\nNoteworthy: None is the `nil' object; Ellipsis represents `...' in slices.", '__package__': None, '__loader__':...(truncated)) at Objects/call.c:283
#37 0x000055af68d95b87 in _PyFunction_FastCallDict (func=<function at remote 0x7fae3fe7ff80>, args=0x7fae37acee00,
    nargs=<optimized out>, kwargs=0x0) at Objects/call.c:322
#38 0x000055af68d98bf1 in _PyObject_FastCallDict (kwargs=0x0, nargs=1, args=0x7fae37acee00,
    callable=<function at remote 0x7fae3fe7ff80>) at Objects/call.c:98
#39 _PyObject_Call_Prepend (callable=<function at remote 0x7fae3fe7ff80>, obj=<optimized out>, args=(), kwargs=0x0)
    at Objects/call.c:906
#40 0x000055af68d96fbe in PyObject_Call (callable=<method at remote 0x7f6b0aa261e0>, args=<optimized out>, kwargs=<optimized out>)
---Type <return> to continue, or q <return> to quit---
    at Objects/call.c:245
#41 0x000055af68eedea3 in t_bootstrap (boot_raw=boot_raw@entry=0x7f6b0a7aaab0) at ./Modules/_threadmodule.c:994
#42 0x000055af68ea0d67 in pythread_wrapper (arg=<optimized out>) at Python/thread_pthread.h:174
#43 0x00007fae410496db in start_thread (arg=0x7fae37acf700) at pthread_create.c:463
#44 0x00007fae405cd71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

This is almost certainly an issue with our "streaming workunit handler" being async, as indicated by the core dump. I ran ./pants --no-pantsd -ldebug help goals 200 times and could not get it to crash, which does everything the same except for running those handlers sychronously. @stuhood are there perhaps alternative architectures to how we have Python launch threads which might make a difference?

@davidhewitt @mejrs, do you expect it to be safe to use Python::acquire_gil() inside Python.allow_threads()? I set up a simple repo doing exactly that, and I couldn't get it to crash. I'm not sure the difference here. https://github.com/Eric-Arellano/pyo3-allow-threads-panic

Edit: I have an idea how to isolate this failure to something more specific. session_poll_workunits() has a lot going on, so I'll try to get a more minimal repro.

Edit 2: tentatively looks like the problem is after exiting py.allow_threads(), so reclaiming the GIL. This happens after the Python program has already finished as a background thread. Gtg, will put up a minimal repro soon.

@davidhewitt
Copy link

Hmm, as far as I'm aware you're not doing something fundamentally impossible.

A quick search for "exception not rethrown" reaches this thread, which is "resolved", but it looks like the fix requires the nightly-only #[unwind(allowed)] attribute. Needless to say, we haven't used that anywhere in PyO3, and it's a bit unclear, should we be putting this on everything?!

I suspect that some C code is forcing unwinding (maybe pthread_cancel, as your work unit is async)?

The difference between PyO3 and Rust-CPython on that line you mark is that Rust-CPython doesn't use catch_unwind there. https://github.com/dgrunwald/rust-cpython/blob/167832ce8d8aeae60dd151323bbd82c1b4e17ee0/src/python.rs#L242

I wonder if I change to a guard pattern (which should be enough) it will fix your issue.

src/rust/engine/src/nodes.rs Outdated Show resolved Hide resolved
Comment on lines 272 to 273
/// count, but that type only works when we are holding the GIL. While we could clone the
/// `PyObject` without the GIL, that would be an expensive deep clone.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That entire comment block is more confusing/irrelevant than it is helpful, I think. I'd probably omit it entirely or, as you say, do away with the entire newtype.

So I'm honestly not sure what the value of wrapping in an Arc<> is?

In this context there is no value, I think.

It's already cheap to clone &PyAny as it simply increases Python's refcount, but that type only works when we are holding the GIL.

Copying and passing around &PyAny doesn't involve reference counts, it's a borrowed reference. Having it bound to the gil lifetime allows for cheap and safe working with borrowed references. But this struct needs to own a reference to it, which is what Py<T> represents.

While we could clone the PyObject without the GIL, that would be an expensive deep clone.

Cloning a Py<T> without the GIL just increases a reference count (deferred into the future, when the gil is next acquired). No "deep copying" takes place.

Seems like we're adding unnecessary locking.

Arc doesn't involve locking, just some math on atomic integers. Or I'm misunderstanding your comment.

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
@Eric-Arellano Eric-Arellano enabled auto-merge (squash) November 14, 2021 19:20
@Eric-Arellano Eric-Arellano merged commit 021c3c0 into pantsbuild:main Nov 14, 2021
@Eric-Arellano Eric-Arellano deleted the pyo3 branch November 14, 2021 20:24
Eric-Arellano added a commit that referenced this pull request Nov 14, 2021
No need for two distinct extensions anymore post #13526.
tdyas pushed a commit to tdyas/pants that referenced this pull request Nov 15, 2021
@stuhood stuhood mentioned this pull request Nov 18, 2021
jsirois added a commit to jsirois/pants that referenced this pull request Nov 28, 2021
This includes a fix we were waiting on in pantsbuild#13526:
  PyO3/pyo3#1990

Changelog here: https://pyo3.rs/v0.15.1/changelog.html

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
jsirois added a commit that referenced this pull request Nov 28, 2021
This includes a fix we were waiting on in #13526:
  PyO3/pyo3#1990

Changelog here: https://pyo3.rs/v0.15.1/changelog.html
Eric-Arellano added a commit that referenced this pull request Dec 4, 2021
`PyObject` (aka `Py<PyAny>`) already provides a GIL-independent reference to a Python object. Wrapping it in `Arc` is not necessary, as described at #13526 (comment).

Using `PyObject` directly removes a layer of indirection.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants