-
Notifications
You must be signed in to change notification settings - Fork 121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Further overhead reductions #203
base: main
Are you sure you want to change the base?
Conversation
It seems as if there are a decent number of test failures. Please consider this a draft pull request for now, until I can get them all fixed (hopefully in the next week or so). |
hey guys, any updates on this? might create a new branch in the autoprofiler fork and merge the current state of this pr to reduce overhead until you have some free time to spare for this |
This PR changes the data structures used to lower overhead. Specifically, a preshmap (pre-hashed map) is used for checking whether a thread has been seen before, which is significantly faster than a simple
threading.get_ident()
call every time. Additionally, a significant source of overhead is reduced by eliminating one of thehpTimer()
calls. Each call can take up to 50 cycles for nanosecond resolution (on Linux), and overall the two calls together summed to half of the overhead. By removing one of the calls, the accuracy hasn't been measurably reduced (since the callback is so much faster than whenhpTimer()
was originall added). The last major change is that the STL unordered_map container is no longer used, as it was causing significant overhead when hashing the keys, and a parallelized version can take advantage of multiple cores better. Preshed and cymem are both submodules because trying to include them from a pypi installation breaks on a pyenv virtual environment (on my system).Another fairly minor change is that with the release of Cython 3.0.0b1, the default for cdef functions has been changed to propagate python exceptions. This imposes a 3x speed penalty, raising it to unacceptable levels, so I added
noexcept
to function signatures to avoid this issue, since our cdef functions don't raise python exceptions.Here's a table showing the current state of overhead. Here's a comment explaining jsonpickle.
Note: Please test this on async code! I haven't had a chance to, and I don't know if the threading handling is entirely accurate with async code.