UTF-8 Strings are incorrectly parsed as latin1. #100641
Comments
The choice to make … parse as latin1 was intentional, mainly for performance.
I guessed as much (though it doesn't appear to be documented). The actual performance gain for this decision should be tested. I don't think it has a big impact, and correctness is always more important than performance.
I agree, there was talk of adding a check for correctness as well, but almost all string literals are latin1, or rather ASCII, so that's important to keep in mind too.
I think making people extra aware that raw strings are parsed as latin1 is important, as is doing a survey of how many actual string sequences in the codebase use other encodings, either using …

Having a slight performance drop to parse utf8 (I'd say it's almost guaranteed that it would hurt performance at least somewhat, though it might be barely noticeable) would be bad IMO if the feature would be essentially unused, which I suspect is the case in the existing codebase. Most cases, like names, types, etc., are strictly ASCII, and outside of certain string literals like property hints (which use special characters like the degree symbol) we shouldn't generally be using non-ASCII literals anyway, I'd say; it's not just that we currently happen not to.

Tl;dr: if something like 90% or more of the literals in the codebase are ASCII, any drop in performance from swapping to utf8 would be a net negative, as that feature would be nigh unused.
I agree, that's why I think just statically checking for non-ASCII in string literals would be best (and disallowing the default string constructor for anything else, forcing callers to use an explicit conversion).
Agreed, handling literals specially would be very useful. Most if not all dynamic input is processed as utf8 already, with a few exceptions that should probably use the explicit conversion anyway.
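As a sketch of that idea (this is not existing Godot code; `is_ascii_only` is a name invented for this example), a compile-time ASCII check on literals could look roughly like this:

```cpp
// Sketch only: a constexpr ASCII check usable in static_assert (C++14 or later).
constexpr bool is_ascii_only(const char *s) {
	for (; *s != '\0'; ++s) {
		if (static_cast<unsigned char>(*s) > 0x7F) {
			return false; // byte above 127: latin1 and UTF-8 no longer agree
		}
	}
	return true;
}

// ASCII literals pass; anything else fails to compile.
static_assert(is_ascii_only("Rotation Degrees"), "ASCII-only literal is fine");
// static_assert(is_ascii_only("90°"), "would fail: the degree sign is not ASCII");

int main() {
	return 0;
}
```

Anything that fails such a check would then have to go through an explicit UTF-8 conversion, matching the "disallow the default constructor for anything else" suggestion above.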
Does anyone have some data on this? Like how many static strings are parsed in the codebase, and what their average size is. Because if we're talking about 20-byte strings, the difference between these should be very small. I'm not even sure parse_utf8 parsing only ASCII would be that slow, but we can probably measure that. I don't have access to my computer for some time, but it would be nice to gather those metrics, to avoid guessing.
Not sure if there's any data, but you can estimate the amount of string literals by checking the translation files.
I think we can safely ignore all files from … I did some quick benchmarking for the cost of parsing.

So in the case of small strings, we're talking about a 200-300 cycle difference per string. This is not nothing, but even if you happen to do it 50 times in a frame, that is a 10k-cycle budget. If anyone wants to test this, here is the code, to drop in …
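The drop-in snippet referenced above is not preserved in this copy of the thread. As a rough stand-in, here is a self-contained sketch of the same kind of measurement; `latin1_parse` and `utf8_parse` below are simplified local functions written for this example (not the engine's methods), and it reports nanoseconds per string rather than cycles:

```cpp
#include <chrono>
#include <cstdio>
#include <cstring>
#include <string>

// latin1: every byte maps 1:1 to a code point, so this is just a widening copy.
static std::u32string latin1_parse(const char *s, size_t len) {
	std::u32string out(len, U'\0');
	for (size_t i = 0; i < len; i++) {
		out[i] = static_cast<unsigned char>(s[i]);
	}
	return out;
}

// Naive UTF-8 decoder (no validation), enough to show the extra branching per byte.
static std::u32string utf8_parse(const char *s, size_t len) {
	std::u32string out;
	out.reserve(len);
	for (size_t i = 0; i < len;) {
		unsigned char c = static_cast<unsigned char>(s[i]);
		if (c < 0x80) {
			out.push_back(c);
			i += 1;
		} else if (c < 0xE0) {
			out.push_back(((c & 0x1Fu) << 6) | (s[i + 1] & 0x3Fu));
			i += 2;
		} else if (c < 0xF0) {
			out.push_back(((c & 0x0Fu) << 12) | ((s[i + 1] & 0x3Fu) << 6) | (s[i + 2] & 0x3Fu));
			i += 3;
		} else {
			out.push_back(((c & 0x07u) << 18) | ((s[i + 1] & 0x3Fu) << 12) | ((s[i + 2] & 0x3Fu) << 6) | (s[i + 3] & 0x3Fu));
			i += 4;
		}
	}
	return out;
}

int main() {
	const char *lit = "Typical short ASCII literal";
	const size_t len = std::strlen(lit);
	const int runs = 1000000;
	size_t sink = 0; // keeps the optimizer from discarding the work

	auto bench = [&](auto parse, const char *name) {
		const auto t0 = std::chrono::steady_clock::now();
		for (int i = 0; i < runs; i++) {
			sink += parse(lit, len).size();
		}
		const auto t1 = std::chrono::steady_clock::now();
		const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / runs;
		std::printf("%-13s %.1f ns per string\n", name, ns);
	};

	bench(latin1_parse, "latin1 copy:");
	bench(utf8_parse, "utf8 decode:");
	std::printf("(checksum: %zu)\n", sink);
	return 0;
}
```

Built with optimizations (for example `g++ -O2 -std=c++17`), this gives a ballpark for the per-string overhead of UTF-8 decoding versus a plain latin1 copy on short ASCII literals.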
Tested versions
All versions are affected. This is old code.
Issue description
The current implementation of Godot has a bug where UTF-8 encoded strings are parsed as `latin1`. I'd like to hear what you all think is the best solution to this problem.

The `String(const char *)` constructor currently parses `char` strings using latin1 encoding:

godot/core/string/ustring.h, lines 615 to 617 at 89001f9

(`parse_latin1` was recently renamed from `copy_from`, because that's what the function does.)

This constructor is used for many strings, including static C strings. The encoding of static C strings is controlled by `-fexec-charset` on GCC / Clang [1] and `/execution-charset` on MSVC [2]. In the Godot codebase, it defaults to UTF-8 on GCC / Clang, and is explicitly set to UTF-8 on Windows.

`latin1` and `utf-8` are compatible for values below 128, and incompatible otherwise. Therefore, there is a mismatch between encodings that can be encountered for non-ASCII strings. It is likely that there are mismatches in other, non-static string use cases as well, because `latin1`-encoded strings are a somewhat rare encounter (though ASCII-only strings are pretty likely).

The mismatch has apparently led to problems a few times in the past (though I don't know how often). Most of the time, strings use ASCII-only characters, a range in which `latin1` and `utf-8` overlap.
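To make the mismatch concrete, here is a standalone sketch (not Godot code) using the degree sign, which also comes up in the comments as a property-hint character: its UTF-8 encoding is the two bytes 0xC2 0xB0, and a latin1 parser turns those bytes into "Â°" instead of "°".

```cpp
#include <cstdio>

int main() {
	// Bytes the compiler emits for the literal "°" when the execution charset is UTF-8.
	const char utf8_degree[] = "\xC2\xB0";

	// A latin1 parser maps each byte to one code point, so the two bytes become
	// U+00C2 U+00B0 ("Â°") instead of the single intended code point U+00B0 ("°").
	for (unsigned char c : utf8_degree) {
		if (c != '\0') {
			std::printf("latin1 reads byte 0x%02X as code point U+%04X\n",
					static_cast<unsigned>(c), static_cast<unsigned>(c));
		}
	}
	return 0;
}
```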
Possible Solutions
Here are some ideas that I came up with:

1. … (and force the use of `u8"String"` for utf-8 strings).
2. Set `-fexec-charset` to latin1 (and force the use of `u8"String"` for utf-8 strings). This would be the best solution in my opinion, but it doesn't appear to be possible to pass anything else except `UTF-8` to `-fexec-charset` right now, at least in Clang.
3. Parse the strings as utf-8. This would be slower than `latin1` parsing (though accelerated by Improve `parse_utf8` performance #99826). The exact difference would have to be measured and tested for significance.
4. … `parse_latin1`, and somewhat faster than `parse_utf8`.
Ideally, the default, implicit `String(char*)` constructor is removed, to avoid this problem in the future. Instead, every use of it should be replaced with an explicit call to `String::utf8` or similar (maybe with the exception of construction from string literals), so we can be sure intent was put behind choosing the encoding. This is a bigger task though, and not realistic for the near future.

Solution 1 does not address possible encoding mismatches of non-static string constructor calls. In either the second or third solution, users that actually want to parse `latin1` have to be switched to use that function explicitly.
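To illustrate the explicit-constructor idea, here is a hypothetical sketch; `MyString` and its factory functions are stand-ins invented for this example, not Godot's actual API:

```cpp
#include <cstdio>
#include <string>

struct MyString {
	std::u32string chars;

	MyString() = default;
	MyString(const char *) = delete; // encoding would be ambiguous, so force a named factory

	static MyString latin1(const char *bytes) {
		MyString s;
		for (; *bytes != '\0'; ++bytes) {
			s.chars.push_back(static_cast<unsigned char>(*bytes)); // latin1: 1 byte = 1 code point
		}
		return s;
	}

	static MyString utf8(const char *bytes); // would run a real UTF-8 decode (omitted in this sketch)
};

int main() {
	// MyString broken = "90°";                // does not compile: the implicit constructor is deleted
	MyString hint = MyString::latin1("speed"); // the encoding choice is visible at the call site
	std::printf("%zu code points\n", hint.chars.size());
	return 0;
}
```

The point is that the encoding decision becomes visible, and compiler-enforced, at every call site instead of being an implicit latin1 default.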
Footnotes

1. https://gcc.gnu.org/onlinedocs/gcc/Preprocessor-Options.html#index-fexec-charset
2. https://learn.microsoft.com/en-us/cpp/build/reference/execution-charset-set-execution-character-set?view=msvc-170