Speed up id generation #65
Conversation
lgtm!
Nice results! Are we confident that the generated numbers have enough entropy? It might be worth running a quick test to check whether any duplicate ids are generated in 10,000 runs.
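A quick collision check along those lines is easy to script; here's a minimal sketch, assuming the generator is importable from `py_zipkin.util` (adjust the import to match this branch):

```python
# Minimal duplicate-id check: generate 10,000 ids and count collisions.
# Assumes the generator lives in py_zipkin.util; adjust to match the branch.
from py_zipkin.util import generate_random_64bit_string


def count_duplicates(n):
    ids = [generate_random_64bit_string() for _ in range(n)]
    return len(ids) - len(set(ids))


if __name__ == "__main__":
    n = 10000
    print("%d duplicate ids in %d runs" % (count_duplicates(n), n))
```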
After reading through the docs for getrandbits, this branch lgtm. Feel free to ignore my comments :)
Not sure it will help with performance, but having a byte prefix of epoch seconds allows transparent conversion to AWS IDs and natural sort/partitioning by time. Here's an example implementation in Java.
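The referenced Java example isn't reproduced here, but a rough Python sketch of the idea, with hypothetical helper names, might look like the following (X-Ray trace ids take the form `1-<8 hex chars of epoch seconds>-<24 hex chars of random>`):

```python
import random
import time


def generate_epoch128_trace_id():
    # Hypothetical sketch: a 32-bit epoch-seconds prefix followed by 96
    # random bits, rendered as 32 lowercase hex characters.
    return "{:08x}{:024x}".format(int(time.time()), random.getrandbits(96))


def to_xray_trace_id(trace_id):
    # Hypothetical conversion: re-slice the 32 hex chars into the X-Ray
    # "1-<epoch>-<random>" layout.
    return "1-{}-{}".format(trace_id[:8], trace_id[8:])
```

For example, `to_xray_trace_id(generate_epoch128_trace_id())` produces an id of the form `1-xxxxxxxx-yyyyyyyyyyyyyyyyyyyyyyyy`.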
Most trace ID sampling only needs the lower 64 bits to be precisely random, and collision avoidance should be pretty good too, at least as good as AWS.
I've gone ahead and changed the 128-bit generator to set the most significant 32 bits as epoch seconds and the other 96 bits as random, so that we get X-Ray interop. This adds about 30% overhead (almost a microsecond on my machine); however, this is probably not a big deal given that trace ids are only generated once per trace.

There is also a small query-performance concern with the epoch128 format. We sometimes do sampling by prefix with something like traceId=b*, which gives 1/16 sampling. We could do a suffix query instead (e.g. traceId=*b), but that is apparently less efficient for some systems.

In practice this is not a huge deal, since you can always pass in your own trace ids (we do this using pyramid_zipkin: http://pyramid-zipkin.readthedocs.io/en/latest/configuring_zipkin.html#zipkin-trace-id-generator), but it's something to think about.
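As a concrete illustration of the "pass in your own trace ids" escape hatch, here's a minimal sketch of a custom generator wired through the pyramid_zipkin setting linked above; I'm assuming the callable takes the request and returns a hex string, so check the linked docs for the exact contract:

```python
import random


def fully_random_128bit_trace_id(request):
    # Keep all 128 bits random (no epoch-seconds prefix), so that prefix
    # sampling such as traceId=b* still yields an even 1/16 split.
    return "{:032x}".format(random.getrandbits(128))


# Assumed Pyramid settings wiring; see the pyramid_zipkin docs linked above.
settings = {
    "zipkin.trace_id_generator": fully_random_128bit_trace_id,
}
```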
Ah, interesting about prefix sampling. If your sampling is offset, would it be worse? E.g. would it cost a lot to do a "prefix" match offset by 8 chars?
@adriancole I'm not sure about that. For Splunk (which we use heavily), the docs say that wildcards at the beginning or middle of a term can cause performance issues: https://docs.splunk.com/Documentation/SplunkCloud/6.6.3/Search/Wildcards#Prefix_wildcards_might_cause_performance_issues. I imagine a regex-based search might be efficient in this manner. @thebostik, any idea on this?
Yeah, I was thinking of a fixed offset, not a wildcard, but regardless it is probably too niche anyway. Interesting to know the impact, though, so I can summarize it on the related issue.
For traces with lots of spans, generating span ids can add up. This branch changes generate_random_{64,128}bit_string from using os.urandom to using random.getrandbits, which is faster. It also adds benchmarking, which shows a speedup of over 2x. The boolean is whether to use 128-bit ids. I'm also not sure why the max is so high, maybe a GC pause?:
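For reference, a minimal sketch of the two approaches being compared, with a rough `timeit` micro-benchmark (not the branch's actual benchmark code):

```python
import os
import random
import timeit


def urandom_64bit_string():
    # Previous approach: 8 bytes from the OS entropy pool, hex-encoded.
    return os.urandom(8).hex()


def getrandbits_64bit_string():
    # New approach: 64 bits from Python's Mersenne Twister PRNG, which skips
    # the syscall and is cheaper per call.
    return "{:016x}".format(random.getrandbits(64))


if __name__ == "__main__":
    for fn in (urandom_64bit_string, getrandbits_64bit_string):
        print(fn.__name__, timeit.timeit(fn, number=100000))
```

Note that `random.getrandbits` is not cryptographically secure; that's generally fine for trace and span ids, but it's the trade-off behind the entropy question raised earlier in the thread.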