Enhancing Parallelism in PyO3: Exploring Multi-Process Architecture Over Sub-Interpreters #3479

letalboy · 2023-09-28T21:31:41Z

Hey team,

I've recently done some in-depth research on the project. After discussing it with @Aequitosh and going through the ffi API extensively, particularly the parts that interact with the C side of the code, I've made some observations. It appears that the references described as missing:

lack implementations for Py_NewInterpreter and Py_EndInterpreter

are already present in ffi. However, as I delved deeper into the Rust code and its interface with CPython, I realized that even if we implement sub-interpreters, they wouldn't support true multi-threading. So I wondered: why not employ multiple processes on the Rust side? We could give each process its own GIL and synchronize them through an IPC channel on the Rust side to ensure safety. Here's some sample code that demonstrates the idea:

use ipc_channel::ipc::{IpcOneShotServer, IpcSender};
use pyo3::prelude::*;
use std::env;
use std::process::Command;

fn worker(server_name: String) {
    // `with_gil` needs an initialized interpreter: enable PyO3's `auto-initialize`
    // feature or call `pyo3::prepare_freethreaded_python()` first.
    // Written against the PyO3 API current when this was posted; later releases
    // changed the `eval` signature.
    let sum: i32 = Python::with_gil(|py| {
        // Execute some Python code. Here, we're just computing a sum.
        py.eval("1 + 2 + 3 + 4 + 5", None, None)
            .expect("Failed to execute Python code")
            .extract()
            .expect("Failed to extract value")
    });

    // Send the result back to the main process using IPC.
    let tx: IpcSender<String> = IpcSender::connect(server_name).unwrap();
    tx.send(format!("Result from worker: {}", sum)).unwrap();
}

fn main() {
    let args: Vec<String> = env::args().collect();

    // Workers are re-invocations of this binary: `<exe> worker <server-name> <index>`.
    if args.len() >= 3 && args[1] == "worker" {
        worker(args[2].clone());
        return;
    }

    let num_workers = 4; // Spawn 4 worker processes as an example.
    let mut children = Vec::new();
    let mut servers = Vec::new();

    for i in 0..num_workers {
        // A one-shot server provides a name that the child process can connect to.
        let (server, name) = IpcOneShotServer::<String>::new().unwrap();
        servers.push(server);

        let child = Command::new(&args[0])
            .arg("worker")
            .arg(name)
            .arg(i.to_string())
            .spawn()
            .expect("Failed to start worker process");

        children.push(child);
    }

    for server in servers {
        // `accept` blocks until the worker connects and yields its first message.
        let (_rx, result) = server.accept().unwrap();
        println!("Received: {}", result);
    }

    // Wait for all child processes to complete.
    for mut child in children {
        child.wait().expect("Failed waiting for child");
    }
}

During my experiments, where I tried combining various parts of the currently implemented ffi to get multiple interpreters running, I noticed that while sub-interpreters do enhance performance, they don't offer genuine parallelism. However, the approach I've suggested above might!

I've seen some libraries utilize this multi-process model, and I believe it's feasible. Furthermore, this could align well with the current implementation of "relax." By ensuring only one GIL connection per process, we can prevent contention over the Global Interpreter Lock. We have a couple of options:

1. Centralize execution and modify #[pyfunction] to schedule a Python task in a pool with multiple sessions. Additionally, we could create a macro to wrap code inside functions, allowing us to execute Python code without having to transfer Python objects between threads, which is a known limitation.
2. Retain the current coding style but introduce a mechanism to import a session connection inside a thread. This lock would essentially send the code to execute in the pool and then return the result.

While the first option might entail significant changes and require considerable effort, the second might be relatively straightforward.

These insights aim to enhance our crate's overall efficiency. By adopting such an approach, libraries related to machine learning, biochemistry analysis, and deep space analysis algorithms could benefit from safer multithreading. This could potentially expedite numerous research projects and save millions or even billions in processing costs.

Of course, I acknowledge that I don't yet have the extensive experience with the PyO3 code that you core maintainers have. Consequently, I'm unsure whether everything I've proposed is both feasible and practical for PyO3. However, I believe we can agree that, if executed correctly, this could be a game-changer.

mejrs · 2023-09-30T21:48:07Z

why not employ multiple processes on the Rust side?

This is pretty much what the multiprocessing library does. It comes with the same problems and disadvantages; mainly that sending (python) objects to other processes is fairly expensive, and it doesn't work well for tasks that aren't trivially parallelizable.
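To make the cost concrete, here is a minimal, std-only sketch of what crossing a process boundary implies: the value has to be written out as bytes and rebuilt as a fresh copy on the other side. The `encode` and `decode` helpers are hypothetical stand-ins for whatever wire format an IPC layer would use; they are not part of PyO3 or the multiprocessing library.

```rust
// Hypothetical wire format: each i64 written as 8 little-endian bytes.
fn encode(values: &[i64]) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(values.len() * 8);
    for v in values {
        bytes.extend_from_slice(&v.to_le_bytes());
    }
    bytes
}

// Rebuild the values from the byte stream; this allocates a brand-new Vec.
fn decode(bytes: &[u8]) -> Vec<i64> {
    bytes
        .chunks_exact(8)
        .map(|chunk| {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(chunk);
            i64::from_le_bytes(buf)
        })
        .collect()
}

fn main() {
    let data: Vec<i64> = (0..1_000).collect();

    // Sender side: every element is copied into the byte buffer.
    let wire = encode(&data);

    // Receiver side: every element is copied again into a new allocation.
    let received = decode(&wire);

    println!(
        "sent {} values as {} bytes; receiver holds a separate copy: {}",
        data.len(),
        wire.len(),
        received.as_ptr() != data.as_ptr()
    );
}
```

Both directions are linear in the size of the data, and real Python objects also need their types and references encoded, which is where the expense comes from.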

I noticed that while sub-interpreters indeed enhance performance, they don't offer genuine parallelism

They do; subinterpreters can run independently.

1. Centralize execution and modify #[pyfunction] to schedule a Python task in a pool with multiple sessions. Additionally, we could create a macro to wrap code inside functions, allowing us to execute Python code without having to transfer Python objects between threads, which is a known limitation.
2. Retain the current coding style but introduce a mechanism to import a session connection inside a thread. This lock would essentially send the code to execute in the pool and then return the result.

While the first option might entail significant changes and require considerable effort, the second might be relatively straightforward.

I think the best option is some kind of closure-based API like what rayon does. I'm wary of more "magical" implicit things like letting the existing macros do it.
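As a rough illustration of what a closure-based shape could look like, here is a std-only sketch. `Session` and `SessionPool` are hypothetical names invented for this example, and `Session` is only a stand-in for per-worker interpreter state; none of this is PyO3 or rayon API. The caller hands a closure to the pool, the closure runs on a worker that owns its session, and only the result travels back.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for per-worker interpreter state; not a PyO3 type.
struct Session {
    id: usize,
    tasks_run: u32,
}

// A job is a boxed closure that runs against the worker's own session.
type Job = Box<dyn FnOnce(&mut Session) + Send + 'static>;

// Hypothetical pool: one thread per session, each fed by its own channel.
struct SessionPool {
    senders: Vec<mpsc::Sender<Job>>,
    handles: Vec<thread::JoinHandle<()>>,
    next: usize,
}

impl SessionPool {
    fn new(workers: usize) -> Self {
        let mut senders = Vec::new();
        let mut handles = Vec::new();
        for id in 0..workers {
            let (tx, rx) = mpsc::channel::<Job>();
            senders.push(tx);
            handles.push(thread::spawn(move || {
                // The session never leaves this thread; only closures and results move.
                let mut session = Session { id, tasks_run: 0 };
                for job in rx {
                    job(&mut session);
                }
            }));
        }
        SessionPool { senders, handles, next: 0 }
    }

    // Run `f` on the next worker (round-robin) and block until its result comes back.
    fn run<R, F>(&mut self, f: F) -> R
    where
        F: FnOnce(&mut Session) -> R + Send + 'static,
        R: Send + 'static,
    {
        let (result_tx, result_rx) = mpsc::channel();
        let job: Job = Box::new(move |session: &mut Session| {
            session.tasks_run += 1;
            let _ = result_tx.send(f(session));
        });
        let worker = self.next % self.senders.len();
        self.next += 1;
        self.senders[worker].send(job).expect("worker thread has exited");
        result_rx.recv().expect("worker dropped the result")
    }

    // Close the channels so the workers' loops end, then join the threads.
    fn shutdown(self) {
        drop(self.senders);
        for handle in self.handles {
            handle.join().expect("worker panicked");
        }
    }
}

fn main() {
    let mut pool = SessionPool::new(2);
    let sum = pool.run(|_session| (1..=5).sum::<i32>());
    let worker_id = pool.run(|session| session.id);
    let tasks_on_first = pool.run(|session| session.tasks_run);
    println!("sum = {sum}, second task ran on worker {worker_id}, first worker has run {tasks_on_first} tasks");
    pool.shutdown();
}
```

The explicit `pool.run(|session| ...)` call is the point of the design: the hand-off to another worker is visible at the call site instead of being hidden behind a macro.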

letalboy · 2023-10-01T04:47:16Z

Sending objects to other processes is fairly expensive in Python.

In my concept, we will not send them; instead, we will inject them. I have tested this idea using PyO3, and it not only works but also significantly improves the code efficiency in my application. However, since we cannot use multiple truly parallelized instances, we are still limited to using a single processing unit, which can change if we allocate a session for each one.

Subinterpreters can run independently.

Are you sure?

In the thread-state-and-the-global-interpreter-lock section of the C-API documentation, we can observe the following paragraph:

Note that the PyGILState_* functions assume there is only one global interpreter (created automatically by Py_Initialize()). Python supports the creation of additional interpreters (using Py_NewInterpreter()), but mixing multiple interpreters and the PyGILState_* API is unsupported.

This indicates that we cannot guarantee a safe state of PyGIL when working with multiple subinterpreters. Another point is that the session they run on top of has a master GIL, meaning that while we can allow concurrency, everything will still depend on the same GIL.

Also, the same page states:

However, when threads are created from C (for example by a third-party library with its own thread management), they don’t hold the GIL, nor is there a thread state structure for them.

This implies that even if we create threads on the Python side, we are still relying on one GIL. When I wrote the enhancement issue, I read the Python C-API docs extensively and dug deep into the C-API itself. What I realized is this: even if we use subinterpreters, we are still relying on one interpreter. I have this problem in native Python; only processes have an independent GIL and can fully exploit hardware resources.

And here, in subinterpreters, we see that they are planning to have a "per-interpreter GIL." If they have already done this, then I can agree with you that this issue isn't necessary. However, I can't find any information stating that each subinterpreter has a dedicated GIL. If they don't, then a mechanism using sessions (processes) will work better for large intensive tasks that hold the GIL for a long time, such as machine learning, large operations per unit, etc.
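The difference between the two designs can be modelled with plain locks. The sketch below is only an analogy built on the standard library: each `Mutex` stands in for a GIL, and `run_workers` is a hypothetical helper written for this example; it does not touch CPython or PyO3. With one shared lock, at most one worker is ever inside the critical section; with one lock per worker, nothing forces them to take turns.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// Each worker does `steps` units of work while holding the lock it was given.
// Returns (total units done, peak number of workers holding a lock at once).
fn run_workers(locks: Vec<Arc<Mutex<()>>>, steps: usize) -> (usize, usize) {
    let active = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let done = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = locks
        .into_iter()
        .map(|lock| {
            let (active, peak, done) = (active.clone(), peak.clone(), done.clone());
            thread::spawn(move || {
                for _ in 0..steps {
                    let _guard = lock.lock().unwrap();
                    // Record how many workers are inside a critical section right now.
                    let now = active.fetch_add(1, Ordering::SeqCst) + 1;
                    peak.fetch_max(now, Ordering::SeqCst);
                    done.fetch_add(1, Ordering::SeqCst);
                    active.fetch_sub(1, Ordering::SeqCst);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    (done.load(Ordering::SeqCst), peak.load(Ordering::SeqCst))
}

fn main() {
    let workers = 4;

    // Model 1: every worker shares one lock (a single GIL).
    let shared = Arc::new(Mutex::new(()));
    let one_lock = (0..workers).map(|_| Arc::clone(&shared)).collect();
    let (done, peak) = run_workers(one_lock, 1_000);
    println!("shared lock:     {done} units, peak concurrency {peak}");

    // Model 2: every worker has its own lock (a per-interpreter or per-process GIL).
    let own_locks = (0..workers).map(|_| Arc::new(Mutex::new(()))).collect();
    let (done, peak) = run_workers(own_locks, 1_000);
    println!("per-worker lock: {done} units, peak concurrency {peak}");
}
```

In the shared-lock run the peak is always 1. In the per-worker run it depends on scheduling and can reach the number of workers, which is the property that either separate processes or a per-interpreter GIL would provide.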

I'm not saying that your point is completely wrong. Indeed, subinterpreters will increase efficiency in concurrency, but for true parallelism, we need to rely on processes. Since PEP 3099 states that the GIL will not be removed from Python 3.x, we need an alternative to use everything the hardware has to offer. For that, I will continue my "Frankenstein" experiments on top of the PyO3 FFI, hoping to achieve something like this.

Regarding:

I think the best option is some kind of closure-based API like what rayon does. I'm wary of more "magical" implicit things like letting the existing macros do it.

I have tested something like this in my crate, RustPyNet, which is also in my profile as a public project. Here, I try to use a mechanism similar to rayon's to allow multiple sessions of PyO3 and have multiple interpreters in each one. It is in an experimental state, but I think it may work.

I acknowledge that I might be saying something incorrect here, since the docs are really extensive and the discussion about the GIL has been going on for some years now. But based on what I have read, these facts make me think that while subinterpreters are good for concurrency, for truly parallel work and large CPU-intensive operations they will still be limited by the GIL.

mejrs · 2023-10-01T09:50:30Z

See https://docs.python.org/3.12/c-api/init.html#a-per-interpreter-gil

letalboy · 2023-10-01T20:30:01Z

Nice to know about this! But this is only in 3.12, right? I understand that we need to focus on the future, since the versions we currently work on will be deprecated soon, but if we don't find a solution for earlier versions as well, many programs that can't be updated yet will fall outside this feature's scope. If it ends up like that, I will continue with RustPyNet and add a session-based Frankenstein of the FFI so that older versions can have it. I personally have a network-based IPC framework library built on top of PyO3 that needs real parallelism for better performance, for a private financial-market application and some telecom work, and I simply can't rely only on Python >= 3.12 because a lot of machine learning tooling is on 3.7.9 - 3.10.

I understand your point, and this is a much more practical approach, but from what I read in the reference you sent, it will only work in the newer versions. This means that until the rest of the libraries update, a lot of things that can't move to Python >= 3.12 will be left out of this change.

davidhewitt · 2023-10-02T07:10:36Z

Personally I think that a full multiprocess framework is an application-specific problem of high complexity which most users of PyO3 don't need. I can understand the value it brings to you and encourage you to continue to publish and support RustPyNet while it solves the problem for you and others. I just don't think there is justification for adding this complexity to PyO3.

Subinterpreter support is, on the other hand, something which most PyO3 users may want for their extension modules even if they don't actually need subinterpreters at all (their users might). The 3.12 per-interpreter GIL then also becomes a natural addition for Rust users looking for Python parallelism. Therefore I think the correct way to make progress is by solving #576.

While it is true that 3.12 is still unsupported by many projects, this will change in the time it takes us to make changes here.

I also think that when nogil python / PEP 703 is accepted then that'll be significantly easier for us to support than subinterpreters.

letalboy · 2023-10-02T07:52:18Z

I now understand your point. PyO3 is a crate designed to interface Rust with Python, acting like a pipe: it sends things from Rust to Python and vice versa, executing commands and receiving responses. Essentially, you all prefer not to add too much complexity to it, because projects built on top of it can be slowed down if this pipe carries excessive complexity, especially for library development, and because most modules won't need this support.

Given that, I will take your suggestion, which you also recommended some days ago, and continue with RustPyNet. The purpose is to bridge this gap for users who need to maximize hardware utility, like the examples I mentioned, and also to allow for backward compatibility, since David suggested that as well.

Let's stay in touch. I'm collaborating on a fork with @Aequitosh, and we are trying to add support for sub-interpreters. Now that I'm more aligned with the goals of PyO3, I'm confident that something productive can emerge from this. Meanwhile, I will enhance RustPyNet as I gain more knowledge about CPython and PyO3's FFI module.

In fact, the references in the ffi module that you asked for a PR to implement in the sub-interpreters issue already seem to be where they should be, in the ffi module's pylifecycle.rs file.

Having realized that, we moved on to planning the mechanism to support sub-interpreters and working out what it will impact.

So I think this issue can be closed. Now that I better understand the goals you have for PyO3, I will try to focus more on the direction you are heading, especially with sub-interpreters, while also keeping in mind the no-GIL idea, which could be a good one. If you want to close this issue, feel free to do so! I hope I can soon bring something new on the sub-interpreter topic and help add this support to PyO3! Thanks for the clarification and for the good docs shared in this conversation!

davidhewitt · 2023-10-11T21:29:27Z

Thanks @letalboy. As per above, I will close this issue for now. I look forward to hearing both about RustPyNet further and what discoveries you make regarding what we can do about sub-interpreters. 👍

letalboy added the enhancement label Sep 28, 2023

davidhewitt closed this as completed Oct 11, 2023

davidhewitt closed this as not planned Oct 11, 2023
