-
-
Notifications
You must be signed in to change notification settings - Fork 21.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve parse_utf8
performance
#99826
base: master
Are you sure you want to change the base?
Conversation
36704ca
to
af2cd34
Compare
6d2a454
to
64ef274
Compare
Regarding Agree that's for another day. Thanks for thoroughly testing this PR! Can you explain the changes on a high level? |
0c35969
to
984fc88
Compare
Nothing fancy, just do everything in one pass instead of two, and some small look ahead instead of one char per iteration with storing some state. |
Does this mean that the new implementation fails the previous tests? |
Previous implementation was decoding overlongs, new one do not (can be changed by removing few We also might want to decode unpaired surrogates, previous implementation was not doing it since #74760, but unpaired surrogates can be par of Windows file names, and we probably should be able to store/read them (not sure how common it is). Both cases should still set the parsing error flag (since it's not valid UTF-8). |
Seems like a user can manually name a file with an unpaired surrogate and windows will accept it. I see other issues in programming languages like zig and go mitigating this by using WTF-8 instead of UTF-8. This would be a small change, it consists mostly of accepting surrogates with a small transformation. This is probably better suited for another PR after this one though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation looks efficient and safe to me. I cannot comment on the correctness of encodings, though it seems you did you due diligence with adding some unit tests from the standard specification. Let's get this merged.
984fc88
to
e4f8a7f
Compare
Improve parseUtf8 performance.
All the timings are using the rdtsc instruction, the lower the better.
Timings using a 9mb GLTF file from this issue averaged on 6000 calls of the function. Old len is the current implementation with the length of the string passed in, Old -1 is the current implementation without the length of the string passed in. New is the proposed implementation.
Went from 3 083 895 cycles down to 1 816 624 cycles.
Timings using a 6mb Ascii file from here averaged on 15 000 calls of the function.
Went from 50 741 668 cycles down to 30 046 082 cycles.
Tested the output using utf8tests. Now passes all the section, except the Null Characters section (being able to have a 0 byte inside a longer string), but this is to be expected as changing this would probably break a lot of usage code.
Regarding tests, I had to remove the non standard ones, as it is now closer to standard. I replaced them with examples from the unicode standard specification.
Some tests from other people would be nice, as this seems pretty widely used through the engine.