Improve `parse_utf8` performance #99826

kiroxas · 2024-11-29T09:53:17Z

Improve `parse_utf8` performance.

All the timings are using the rdtsc instruction, the lower the better.

Timings use a 9 MB glTF file from this issue, averaged over 6000 calls of the function. "Old len" is the current implementation with the length of the string passed in, "Old -1" is the current implementation without the length of the string passed in, and "New" is the proposed implementation.

Went from 3 083 895 cycles down to 1 816 624 cycles.

Timings use a 6 MB ASCII file from here, averaged over 15 000 calls of the function.

Went from 50 741 668 cycles down to 30 046 082 cycles.

Tested the output using utf8tests. It now passes all the sections except the Null Characters section (being able to have a 0 byte inside a longer string), but this is to be expected, as changing this would probably break a lot of calling code.

Regarding tests, I had to remove the non-standard ones, as the implementation is now closer to the standard. I replaced them with examples from the Unicode standard specification.

Some tests from other people would be nice, as this seems pretty widely used through the engine.

Ivorforce · 2024-11-29T10:30:13Z

Regarding NULL termination, I'll link my proposal from just yesterday: godotengine/godot-proposals#11249

Agree that's for another day. Thanks for thoroughly testing this PR! Can you explain the changes on a high level?

tests/core/string/test_string.h

kiroxas · 2024-11-29T13:15:35Z

> Regarding NULL termination, I'll link my proposal from just yesterday: godotengine/godot-proposals#11249
>
> Agree that's for another day. Thanks for thoroughly testing this PR! Can you explain the changes on a high level?

Nothing fancy: everything is done in one pass instead of two, and each iteration looks ahead a few bytes instead of handling one char per iteration while storing some state.
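To make the idea concrete, here is a minimal sketch of a one-pass decoder with look-ahead. It is not the code from this PR: the function name `decode_utf8_single_pass`, the `had_error` flag, and the error policy (one U+FFFD per structurally broken byte, one U+FFFD per complete but invalid sequence) are all assumptions made for illustration. Reserving one output slot per input byte is one way to avoid a separate counting pass.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical sketch, not Godot's parse_utf8.
static const char32_t REPLACEMENT = 0xFFFD;

std::u32string decode_utf8_single_pass(const std::string &in, bool &had_error) {
    std::u32string out;
    out.reserve(in.size()); // Upper bound: at most one code point per byte.
    had_error = false;
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80) { // ASCII fast path.
            out.push_back(static_cast<char32_t>(b0));
            i += 1;
            continue;
        }
        // The lead byte gives the sequence length, the payload bits, and the
        // smallest code point that legitimately needs that length.
        std::size_t len = 0;
        char32_t cp = 0;
        char32_t min_cp = 0;
        if ((b0 & 0xE0) == 0xC0) {
            len = 2; cp = b0 & 0x1F; min_cp = 0x80;
        } else if ((b0 & 0xF0) == 0xE0) {
            len = 3; cp = b0 & 0x0F; min_cp = 0x800;
        } else if ((b0 & 0xF8) == 0xF0) {
            len = 4; cp = b0 & 0x07; min_cp = 0x10000;
        }
        // Look ahead at the continuation bytes in the same iteration,
        // so no decoder state is carried between iterations.
        bool ok = len != 0 && i + len <= n;
        for (std::size_t k = 1; ok && k < len; k++) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) {
                ok = false;
            } else {
                cp = (cp << 6) | (b & 0x3F);
            }
        }
        if (!ok) { // Stray, truncated, or broken sequence: skip one byte.
            out.push_back(REPLACEMENT);
            had_error = true;
            i += 1;
            continue;
        }
        // Complete sequence: reject overlongs, surrogates, and out-of-range values.
        const bool invalid = cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF);
        if (invalid) {
            had_error = true;
        }
        out.push_back(invalid ? REPLACEMENT : cp);
        i += len;
    }
    assert(out.size() <= in.size()); // Each iteration consumes >= 1 byte, emits 1 code point.
    return out;
}
```

Calling `decode_utf8_single_pass("h\xC3\xA9", err)` yields the two code points `h` and U+00E9 with `err` left `false`.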

clayjohn · 2024-11-29T16:59:59Z

> Regarding tests, I had to remove the non-standard ones, as it is now closer to standard. I replaced them with examples from the Unicode standard specification.

Does this mean that the new implementation fails the previous tests?

bruvzg · 2024-11-29T17:07:49Z

> Does this mean that the new implementation fails the previous tests?

The previous implementation decoded overlongs; the new one does not (this can be changed by removing a few `unicode = _replacement_char;` lines).

We also might want to decode unpaired surrogates. The previous implementation has not done so since #74760, but unpaired surrogates can be part of Windows file names, and we probably should be able to store/read them (not sure how common it is).

Both cases should still set the parsing error flag (since it's not valid UTF-8).
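As a rough illustration of the two policies, the sketch below classifies an already assembled code point. The names `DecodeResult`, `validate_code_point`, and the `keep_invalid` switch are hypothetical and do not come from `ustring.cpp`. With `keep_invalid` set, overlongs and surrogates keep their decoded value; either way the error flag is raised.

```cpp
#include <cassert>

// Hypothetical types and names, for illustration only.
struct DecodeResult {
    char32_t code_point;
    bool error;
};

static const char32_t REPLACEMENT_CHAR = 0xFFFD;

// Smallest code point that legitimately needs `len` bytes in UTF-8.
static char32_t min_for_length(int len) {
    return len == 1 ? 0x0 : len == 2 ? 0x80 : len == 3 ? 0x800 : 0x10000;
}

// `cp` is the value assembled from a sequence of `len` bytes.
DecodeResult validate_code_point(char32_t cp, int len, bool keep_invalid) {
    assert(len >= 1 && len <= 4);
    const bool overlong = cp < min_for_length(len);
    const bool surrogate = cp >= 0xD800 && cp <= 0xDFFF;
    if (cp > 0x10FFFF) {
        // Beyond the Unicode range: nothing meaningful to keep.
        return { REPLACEMENT_CHAR, true };
    }
    if (overlong || surrogate) {
        // Not valid UTF-8 in either mode, so the error flag is always set.
        return { keep_invalid ? cp : REPLACEMENT_CHAR, true };
    }
    return { cp, false };
}
```

For example, the overlong encoding `C0 AF` assembles to U+002F with `len == 2`: the strict mode returns U+FFFD, the lenient mode returns `/`, and both report an error.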

kiroxas · 2024-11-29T17:39:31Z

> > Does this mean that the new implementation fails the previous tests?
>
> The previous implementation decoded overlongs; the new one does not (this can be changed by removing a few `unicode = _replacement_char;` lines).
>
> We also might want to decode unpaired surrogates. The previous implementation has not done so since #74760, but unpaired surrogates can be part of Windows file names, and we probably should be able to store/read them (not sure how common it is).
>
> Both cases should still set the parsing error flag (since it's not valid UTF-8).

It seems a user can manually name a file with an unpaired surrogate and Windows will accept it. I see issues in other programming languages, such as Zig and Go, mitigating this by using WTF-8 instead of UTF-8. This would be a small change: it consists mostly of accepting surrogates with a small transformation. It is probably better suited for another PR after this one, though.
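For reference, a small sketch of what WTF-8 does with such a name, shown on the encoding side. This is an illustration only, with a hypothetical function name `utf16_to_wtf8`; it is not proposed engine code. Proper surrogate pairs are combined into one 4-byte sequence as in UTF-8, while an unpaired surrogate is written as a 3-byte sequence instead of being rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: converts possibly ill-formed UTF-16 to WTF-8.
std::string utf16_to_wtf8(const std::u16string &in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); i++) {
        char32_t cp = in[i];
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        if (is_high && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Valid pair: combine into a supplementary code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            i++;
        }
        assert(cp <= 0x10FFFF);
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            // Unpaired surrogates (U+D800..U+DFFF) land here and are kept.
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

A lone `0xD800` becomes the bytes `ED A0 80`, which a strict UTF-8 decoder rejects and a WTF-8 aware decoder would accept as a surrogate.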

Ivorforce

The implementation looks efficient and safe to me. I cannot comment on the correctness of encodings, though it seems you did your due diligence by adding some unit tests from the standard specification. Let's get this merged.

core/string/ustring.cpp

kiroxas requested review from a team as code owners November 29, 2024 09:53

kiroxas force-pushed the improveParseUTF8Performance branch from 36704ca to af2cd34 November 29, 2024 09:58

Mickeon added enhancement topic:core performance labels Nov 29, 2024

Mickeon added this to the 4.x milestone Nov 29, 2024

Mickeon added the needs testing label Nov 29, 2024

kiroxas force-pushed the improveParseUTF8Performance branch 2 times, most recently from 6d2a454 to 64ef274 November 29, 2024 10:06

bruvzg reviewed Nov 29, 2024

tests/core/string/test_string.h (outdated, resolved)

kiroxas force-pushed the improveParseUTF8Performance branch 2 times, most recently from 0c35969 to 984fc88 November 29, 2024 11:31

kiroxas mentioned this pull request Nov 29, 2024

Ensure parse_utf8 has length of string passed in when available #99834

Merged

kiroxas mentioned this pull request Dec 15, 2024

Rename String::copy_from functions to their respective encodings (parse_latin1, parse_wstring, parse_utf32). #100434

Merged

Ivorforce approved these changes Dec 15, 2024

core/string/ustring.cpp (outdated, resolved)

core/string/ustring.cpp (resolved)

improveParseUTF8Performance

e4f8a7f

kiroxas force-pushed the improveParseUTF8Performance branch from 984fc88 to e4f8a7f December 16, 2024 08:55

kiroxas requested a review from bruvzg December 18, 2024 09:08

Ivorforce mentioned this pull request Dec 20, 2024

UTF-8 Strings are incorrectly parsed as latin1. #100641

Open

Ivorforce mentioned this pull request Jan 3, 2025

String::sha256_text() shows up as a hotspot during game launch #86249

Open
