Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve parse_utf8 performance #99826

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

kiroxas
Copy link
Contributor

@kiroxas kiroxas commented Nov 29, 2024

Improve parseUtf8 performance.

All the timings are using the rdtsc instruction, the lower the better.

Timings using a 9mb GLTF file from this issue averaged on 6000 calls of the function. Old len is the current implementation with the length of the string passed in, Old -1 is the current implementation without the length of the string passed in. New is the proposed implementation.
GLTF_FileComparison
Went from 3 083 895 cycles down to 1 816 624 cycles.

Timings using a 6mb Ascii file from here averaged on 15 000 calls of the function.
UTF8File_comparison
Went from 50 741 668 cycles down to 30 046 082 cycles.

Tested the output using utf8tests. Now passes all the section, except the Null Characters section (being able to have a 0 byte inside a longer string), but this is to be expected as changing this would probably break a lot of usage code.

Regarding tests, I had to remove the non standard ones, as it is now closer to standard. I replaced them with examples from the unicode standard specification.

Some tests from other people would be nice, as this seems pretty widely used through the engine.

@kiroxas kiroxas requested review from a team as code owners November 29, 2024 09:53
@kiroxas kiroxas force-pushed the improveParseUTF8Performance branch from 36704ca to af2cd34 Compare November 29, 2024 09:58
@Mickeon Mickeon added this to the 4.x milestone Nov 29, 2024
@kiroxas kiroxas force-pushed the improveParseUTF8Performance branch 2 times, most recently from 6d2a454 to 64ef274 Compare November 29, 2024 10:06
@Ivorforce
Copy link
Contributor

Regarding NULL termination, I'll link my proposal from just yesterday: godotengine/godot-proposals#11249

Agree that's for another day. Thanks for thoroughly testing this PR! Can you explain the changes on a high level?

@kiroxas kiroxas force-pushed the improveParseUTF8Performance branch 2 times, most recently from 0c35969 to 984fc88 Compare November 29, 2024 11:31
@kiroxas
Copy link
Contributor Author

kiroxas commented Nov 29, 2024

Regarding NULL termination, I'll link my proposal from just yesterday: godotengine/godot-proposals#11249

Agree that's for another day. Thanks for thoroughly testing this PR! Can you explain the changes on a high level?

Nothing fancy, just do everything in one pass instead of two, and some small look ahead instead of one char per iteration with storing some state.

@clayjohn
Copy link
Member

Regarding tests, I had to remove the non standard ones, as it is now closer to standard. I replaced them with examples from the unicode standard specification.

Does this mean that the new implementation fails the previous tests?

@bruvzg
Copy link
Member

bruvzg commented Nov 29, 2024

Does this mean that the new implementation fails the previous tests?

Previous implementation was decoding overlongs, new one do not (can be changed by removing few unicode = _replacement_char; lines).

We also might want to decode unpaired surrogates, previous implementation was not doing it since #74760, but unpaired surrogates can be par of Windows file names, and we probably should be able to store/read them (not sure how common it is).

Both cases should still set the parsing error flag (since it's not valid UTF-8).

@kiroxas
Copy link
Contributor Author

kiroxas commented Nov 29, 2024

Does this mean that the new implementation fails the previous tests?

Previous implementation was decoding overlongs, new one do not (can be changed by removing few unicode = _replacement_char; lines).

We also might want to decode unpaired surrogates, previous implementation was not doing it since #74760, but unpaired surrogates can be par of Windows file names, and we probably should be able to store/read them (not sure how common it is).

Both cases should still set the parsing error flag (since it's not valid UTF-8).

Seems like a user can manually name a file with an unpaired surrogate and windows will accept it. I see other issues in programming languages like zig and go mitigating this by using WTF-8 instead of UTF-8. This would be a small change, it consists mostly of accepting surrogates with a small transformation. This is probably better suited for another PR after this one though.

Copy link
Contributor

@Ivorforce Ivorforce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks efficient and safe to me. I cannot comment on the correctness of encodings, though it seems you did you due diligence with adding some unit tests from the standard specification. Let's get this merged.

core/string/ustring.cpp Outdated Show resolved Hide resolved
core/string/ustring.cpp Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants