-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCX slower than TCP Socket #305
Comments
First of all, let me just say Python asyncio imposes horrible overhead, the runtime for a single task is over 10us, and a regular Python call runs in the range of 100ns or less (that means asyncio overhead > 100x!), so short-lived tasks (like small data transfers) are a terrible use of asyncio. Source: https://github.com/pentschev/python-overhead .
To be clear I was referring to synchronous Python sockets, not necessarily using asyncio, which I now see is the case in your socket implementation, so I do ultimately expect at least similar performance between UCXX async backend and asyncio's With that said, I remember you previously wrote the following numbers in rapidsai/ucx-py#1072 (comment):
Is that still true? Can you confirm if you're using InfiniBand interfaces in both cases? Can you also try the latest We're certainly not attempting to be necessarily way faster, nor necessarily even faster than Python's libraries in all cases, but we definitely do want to have similar performance provided we're using the same transports. What we do want with UCXX is to leverage transports that are not available with Python builtin libraries, such as CUDA IPC and GPUDirectRDMA with InfiniBand, as well as provide a seamless interface to transfer Python arrays (i.e., objects that implement In any case, we do want to improve performance where possible, but once again I do not necessarily expect we'll match or outperform Python sockets implementations if all you use are sockets via UCX. For example, in #309 I've added support for Python socket
ucxx-core/TCP
ucxx-core/RC
As you can see, we are just slightly faster than Python's ucxx-core/TCP
ucxx-core/RC
Similarly, if you run benchmarks with Finally, still on the topic of asyncio, the problems I've mentioned previously are something I've also encountered in Dask, so part of the UCXX effort attempts to try and overcome some of those issues, like the use of too many asyncio tasks. One of those attempts is to use what I've called multi-buffer transfers (see also the Please let us know if you happen to do some more experimenting based on the information above and how that goes so we can see whether UCXX can be made to perform better for you as well. |
Hi @pentschev Sorry for the late reply. Xoscar DesignIn xoscar, we reuse endpoints instead of creating them each time. Each process is an ActorPool with a ucxx endpoint, and processes communicate with each other through UCXX for data transmission. Refactoring the architecture is not easy, and I believe this might be the crux of the issue: all communications, including those between processes within the same worker, are conducted through UCXX. And UCXX is not good for small transfers? Benchmark for different progress modeGiven the architecture mention above, I did some tests. I installed the UCXX nightly from conda, and the speed of
|
Can you be more objective and provide numbers to what you refer as "high cost"? I'm not really sure what that means. Consuming a lot of CPU probably refers to
In C++ UCXX will do well for small transfers, I was merely stating facts about Python when used for small transfers (or any short-lived tasks in general), in particular async Python. I'm also saying that Python
It is strange that you're hitting that assertion all the time, I have not seen it in years or maybe ever. Can you provide a reproducer? Additionally, does it work if you comment it out? What does performance look like with that? If you have the chance I'd encourage you to test
So far I've been talking generally about UCXX because you haven't provided me with any numbers, only subjective descriptions such as "much slower". Can you provide observed numbers instead? If possible, can you provide a simple reproducer too? |
I started a thread in ucx-py, and now I have replaced ucx-py with ucxx, which resolved the blocking issue. However, in terms of performance, ucx is slower than TCP sockets. Over the past few days, I have done some analysis and profiling, and I found that ucx's
await endpoint.recv()
consumes a lot of CPU time.I am currently maintaining repositories (xorbits/xoscar) that are data science tools similar to Dask, which can scale workloads like pandas to a cluster. Among them, xoscar is the underlying actor framework used for inter-process communication, serialization, and more. As an underlying actor framework, xoscar is a little bit similar to Dask's
distributed
package. All communication and resource management is handled by xoscar. Similar to Dask, xorbits has a supervisor for management and workers responsible for computation. When a compute node starts xorbits/xoscar, xorbits sends messages such as the heartbeat of that compute node to the supervisor through xoscar's communication mechanism. These management messages are mostly small transfers, and these communications occur at short intervals (a few times within one second). Big transfers include shuffling dataframes. In summary, there are small transfers for management messages and big transfers for data shuffling.In xoscar, different actors communicate with each other through the concept of Channels, and we have currently implemented UCXChannel.
The ucx code is on this page.
I used py-spy to profile xorbits/xoscar/ucxx and found that this line of
await endpoint.recv()
is consuming a lot of CPU resources. The flame graph is as follows.@pentschev mentioned in the previous thread that it's not surprising for TCP sockets to be faster than UCX. So what I want to confirm is whether there are good solutions for scenarios like mine, where small transfers and big transfers are mixed together, to reduce the CPU load of
await endpoint.recv()
. Or do I need to modify the current design to have small transfers go through TCP and dataframe shuffling go through UCX?The text was updated successfully, but these errors were encountered: