
Proof of concept: offloading of CPU / IO / network to Twisted's threadpool benchmark #2137

Closed
lfdversluis opened this issue Apr 26, 2016 · 16 comments

@lfdversluis

Discussed with @synctext

As multiple threads in Python do not give performance benefits when they are all constrained by the GIL, a proof-of-concept benchmark should be constructed with a mixed workload of CPU-intensive tasks, IO-intensive tasks (database) and network requests that mimics Tribler's behavior.
From this benchmark we can observe the gains when these distinct elements are placed on the thread pool versus the current synchronous case. The results should indicate whether the overhead created is acceptable.
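As a rough illustration of the idea (a minimal sketch with hypothetical task names, not Tribler code), Twisted's deferToThread hands blocking work to the reactor's threadpool and returns a Deferred:

```python
from twisted.internet import reactor
from twisted.internet.defer import gatherResults
from twisted.internet.threads import deferToThread


def cpu_intensive_task(payload):
    # Stand-in for compression/crypto work; runs on a threadpool thread.
    return payload[::-1]


def io_intensive_task(query):
    # Stand-in for a blocking database query, also offloaded.
    return "result for " + query


def run_benchmark():
    # Each deferToThread call hands the blocking work to the reactor's
    # threadpool and immediately returns a Deferred for the result.
    d = gatherResults([
        deferToThread(cpu_intensive_task, b"payload"),
        deferToThread(io_intensive_task, "SELECT ..."),
    ])
    d.addCallback(lambda results: reactor.stop())
    return d


reactor.callWhenRunning(run_benchmark)
reactor.run()
```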


whirm commented Apr 26, 2016

Note that blocking calls into native code that releases the GIL WILL give performance improvements!

whirm added this to the Backlog milestone Apr 26, 2016

lfdversluis commented May 1, 2016

I have finished the basic client and server code during the weekend. So far the results look promising. This experiment ran 6 servers and 6 clients; every client connects to one server and then issues 10 different queries, so that each query the server receives is different and no database optimizations can bias the results.

| Client | Server | Duration (microseconds) |
| --- | --- | --- |
| Synchronous (blocking) | Synchronous | 61937940 |
| Asynchronous (non-blocking) | Synchronous | 51127318 |

This means that the asynchronous code (using Twisted's Agent) is ~17.4% faster. Note that parsing the response and inserting the data after the parsing stage are still completely synchronous; only the requests themselves were made (a)synchronous. Further experiments should determine at a finer grain whether doing these tasks asynchronously on the threadpool increases the gain further.
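For reference, a minimal sketch of such a non-blocking request with Twisted's Agent (the URL and port are placeholders, not the benchmark's actual endpoint):

```python
from twisted.internet import reactor
from twisted.web.client import Agent, readBody


def fetch(url):
    agent = Agent(reactor)
    # request() returns a Deferred immediately; the reactor stays free to
    # service other connections while the response is in flight.
    d = agent.request(b"GET", url)
    d.addCallback(readBody)  # collect the response body into a bytes object
    return d


d = fetch(b"http://localhost:8080/query?id=1")
d.addBoth(lambda result: reactor.stop())
reactor.run()
```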

As for the server, it will also be investigated whether doing tasks on the threadpool, versus the current synchronous implementation, further improves the speed.

@lfdversluis

And the results are in :D This time the server's IO and CPU methods were also tested blocking / non-blocking.

[Figure: benchmark of CPU, network and IO offloading combinations]

The colors follow a pattern:
If CPU is asynchronous (non-blocking): red (r) = 255, else 0
If network is asynchronous (non-blocking): green (g) = 255, else 0
If IO is asynchronous (non-blocking): blue (b) = 255, else 0

The color of a bar is then rgb(r, g, b).
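In code form (just an illustration of the legend, not benchmark code):

```python
def bar_color(cpu_async, net_async, io_async):
    # Each component that is offloaded (non-blocking) switches its
    # color channel fully on.
    return (255 * cpu_async, 255 * net_async, 255 * io_async)

bar_color(False, True, False)  # -> (0, 255, 0): only networking offloaded
```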

Comparing the black bar with the green bar shows a 15.8% improvement. Doing both the network and IO asynchronously yields a 14.4% improvement over the all-synchronous scenario.

An interesting observation is that offloading only the CPU work makes things worse, so we shouldn't do that: the overhead of thread switching and extra method calls costs more performance than is gained.


whirm commented May 3, 2016

Good work!
How much data is being compressed?

@lfdversluis

These results are without the zip inflation/deflation, as an error is thrown at random when unpacking. I am trying to fix that; however, I was also thinking of doing crypto on the JSON instead of zipping it. That should also be CPU intensive, and it is basically what we do in the tunnels now.


lfdversluis commented May 11, 2016

I have fixed the zip inflation/deflation issue and added encryption on the server side and decryption on the client side, applied to the zip being sent. The results are interesting:

[Figure: benchmark of CPU, network and IO offloading combinations, now with encryption]

Now the combination of CPU, IO and network being offloaded scores best, with a gain of ~10.62%. This is interesting, considering that offloading only CPU, or CPU + IO, is actually worse than doing everything synchronously.

So whenever networking is synchronous (blocking), offloading CPU or IO appears to worsen the situation; whenever networking is asynchronous, every scenario that includes it performs better.


whirm commented May 11, 2016

Is the zip stuff releasing the GIL?


lfdversluis commented May 11, 2016

No, and that probably explains why the CPU offloading takes more time.
I think the crypto stuff may not be releasing the GIL either.

At this point it is safe to assume that most of the gzip + line count code requires the GIL. A quick look at gzip.py tells me that, yes, that is the case.
Source: http://www.dalkescientific.com/writings/diary/archive/2012/01/19/concurrent.futures.html
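For what it's worth, a quick empirical check for whether a call releases the GIL is to time it serially versus from several threads (a micro-benchmark sketch, not part of this experiment; zlib.compress is used here as the test subject):

```python
import threading
import time
import zlib

DATA = b"x" * (16 * 1024 * 1024)  # 16 MiB of highly compressible data


def compress_once():
    zlib.compress(DATA)


def timed(fn):
    start = time.time()
    fn()
    return time.time() - start


def serial(n):
    for _ in range(n):
        compress_once()


def threaded(n):
    threads = [threading.Thread(target=compress_once) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


# If the compression call releases the GIL, the threaded run overlaps on
# multiple cores and finishes in well under n times the single-call time;
# if it holds the GIL, both timings come out roughly equal.
print("serial (4 calls):   %.2fs" % timed(lambda: serial(4)))
print("threaded (4 calls): %.2fs" % timed(lambda: threaded(4)))
```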

@lfdversluis

Apparently pycrypto does release the GIL now, but I used the cryptography package.

https://github.com/dlitz/pycrypto/blob/master/ChangeLog


whirm commented May 11, 2016

Let's see what happens if you use pycrypto then.

@lfdversluis

Do note that the tunnels use cryptography :) I will give pycrypto a go

@lfdversluis

Also, zlib releases the GIL apparently: https://docs.python.org/2/c-api/init.html


whirm commented May 11, 2016

> Do note that the tunnels use cryptography :) I will give pycrypto a go

Well, if we want to improve throughput and python-cryptography doesn't release the GIL, something will have to change... Let's talk about it when we have results from both libs.
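For reference, a minimal sketch of what the pycrypto path could look like (the key, IV, mode and payload are placeholder choices for illustration, not the benchmark's actual parameters):

```python
import os

from Crypto.Cipher import AES  # PyCrypto, not the cryptography package

KEY = os.urandom(16)  # dummy AES-128 key for the sketch
IV = os.urandom(16)   # dummy initialisation vector


def encrypt(plaintext):
    # CFB mode works on arbitrary-length byte strings, so no padding needed.
    return AES.new(KEY, AES.MODE_CFB, IV).encrypt(plaintext)


def decrypt(ciphertext):
    return AES.new(KEY, AES.MODE_CFB, IV).decrypt(ciphertext)


assert decrypt(encrypt(b"zipped payload")) == b"zipped payload"
```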


lfdversluis commented May 11, 2016

Alright, the results are in once more! Now with non-blocking zipping, and with pycrypto, which should release the GIL as well.

[Figure: benchmark of CPU, network and IO offloading combinations, with zlib and pycrypto]

Comparing the best (blue) one to the black one (everything blocking) shows a gain of ~18.93%, which is huge: nearly one fifth! It is interesting that the yellow, blue and white cases are very, very close; there is only a 43 ms difference between white and blue.

Also, I love zlib: very easy to use (no need for StringIO or BytesIO file-like objects).
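To illustrate the convenience difference, a small usage sketch contrasting the two modules (the payload is a dummy value):

```python
import gzip
import zlib
from io import BytesIO

payload = b'{"results": [1, 2, 3]}'

# zlib: one call each way, plain bytes in and out.
packed = zlib.compress(payload)
assert zlib.decompress(packed) == payload

# gzip: needs a file-like object to wrap the byte stream.
buf = BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(payload)
with gzip.GzipFile(fileobj=BytesIO(buf.getvalue()), mode="rb") as f:
    assert f.read() == payload
```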


lfdversluis commented May 16, 2016

I noticed that in the last benchmark the creation of the table on the client side was included in the time measurement, which it should not have been. Below is the corrected plot:

[Figure: corrected benchmark of CPU, network and IO offloading combinations]

Blue remains the best, but the all-synchronous case is now the worst. The gain between blue and black became even bigger: 22.45%. Comparing white (everything asynchronous) with black yields 17.97%.

@lfdversluis

@whirm The initial result is there, so this ticket can be closed 👍
