
UTF-8 Strings are incorrectly parsed as latin1. #100641

Open
Ivorforce opened this issue Dec 20, 2024 · 9 comments

Comments

@Ivorforce Contributor

Ivorforce commented Dec 20, 2024

Tested versions

All versions are affected. This is old code.

Issue description

The current implementation of Godot has a bug where UTF-8 encoded strings are parsed as latin1.

I'd like to hear what you all think is the best solution to this problem.

The String(const char *) constructor currently parses char strings using latin1 encoding:

godot/core/string/ustring.h, lines 615 to 617 in 89001f9:

String(const char *p_cstr) {
	parse_latin1(p_cstr);
}

(parse_latin1 was recently renamed from copy_from to reflect what the function actually does.)

This constructor is used for many strings, including static C strings. The encoding of static C strings is controlled by -fexec-charset on GCC / Clang [1] and /execution-charset on MSVC [2]. In the Godot codebase, it defaults to UTF-8 on GCC / Clang, and is explicitly set to UTF-8 on Windows. latin1 and UTF-8 are compatible for values below 128, and incompatible otherwise, so any non-ASCII string can hit a mismatch between encodings. Mismatches are likely in other, non-static string use-cases as well, because latin1-encoded strings are rarely encountered in practice (though ASCII-only strings are common).
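
A concrete illustration of the mismatch (my own example, assuming a UTF-8 execution charset; String::utf8 is the existing explicit UTF-8 constructor):

// With -fexec-charset=UTF-8, the literal "é" is stored as the two bytes 0xC3 0xA9.
String wrong = "é"; // String(const char *) -> parse_latin1 turns each byte into a
                    // code point: U+00C3 U+00A9, i.e. the mojibake "Ã©".
String right = String::utf8("é"); // Explicit UTF-8 parsing decodes both bytes into
                                  // the single code point U+00E9, i.e. "é".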

The mismatch has apparently led to problems a few times in the past (though I don't know how often). Most of the time, strings use only ASCII characters, a range in which latin1 and UTF-8 overlap.

Possible Solutions

Here are some ideas that I came up with:

  • Enforce ASCII-only for static C strings (and force the use of u8"String" for UTF-8 strings).
    • Using -fexec-charset. This would be the best solution in my opinion, but it doesn't appear to be possible to pass anything other than UTF-8 to -fexec-charset right now, at least in Clang.
    • Using external linter tools or options. I don't know if such a tool exists.
  • Use UTF-8 parsing in the default constructors. This would be somewhat slower than latin1 parsing (though accelerated by Improve parse_utf8 performance #99826). The exact difference would have to be measured and tested for significance.
  • Use ASCII-only parsing in the default constructors, logging errors if values > 127 are encountered (see the sketch after this list). This should be negligibly slower than parse_latin1, and somewhat faster than parse_utf8.
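
A minimal sketch of the third option (hypothetical code, not part of Godot; it assumes the usual ERR_PRINT macro and reuses parse_latin1 for the copy, which is valid because ASCII bytes map 1:1 under latin1):

// Hypothetical: ASCII-only parsing with error logging for out-of-range bytes.
void String::parse_ascii(const char *p_cstr) {
	for (const char *c = p_cstr; *c; c++) {
		if ((unsigned char)*c > 127) {
			ERR_PRINT("Non-ASCII byte in String constructor; use String::utf8() for UTF-8 data.");
			break;
		}
	}
	parse_latin1(p_cstr); // ASCII is a subset of latin1, so the copy itself is unchanged.
}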

Ideally, the default, implicit String(char *) constructor would be removed entirely, to avoid this problem in the future. Every use of it would instead be replaced with an explicit call to String::utf8 or similar (maybe with the exception of construction from string literals), so we can be sure intent was put behind choosing the encoding. That is a bigger task though, and not realistic for the near future.
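
Sketched out, that end state might look like this (hypothetical; String::utf8 exists today, the deleted constructor does not, and u8"..." only converts to const char * pre-C++20):

// Inside class String (hypothetical end state):
String(const char *p_cstr) = delete; // Force callers to state the encoding.

// Call sites then become explicit about their encoding:
String a = String::utf8(some_utf8_data); // Dynamic UTF-8 input.
String b = String::utf8(u8"résumé");     // Non-ASCII literal, u8-prefixed.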

Solution 1 does not address possible encoding mismatches of non-static string constructor calls.

With either the second or third solution, users who actually want to parse latin1 would have to call that function explicitly.

Footnotes

  [1] https://gcc.gnu.org/onlinedocs/gcc/Preprocessor-Options.html#index-fexec-charset

  [2] https://learn.microsoft.com/en-us/cpp/build/reference/execution-charset-set-execution-character-set?view=msvc-170

@Ivorforce changed the title UTF-8 Strings are parsed as latin1. → UTF-8 Strings are incorrectly parsed as latin1. Dec 20, 2024
@AThousandShips Member

The choice to make String(char *) only accept Latin1 is by design as a performance benefit, but needs to be communicated better to avoid issues

@Ivorforce Contributor Author

Ivorforce commented Dec 20, 2024

The choice to make String(char *) only accept Latin1 is by design as a performance benefit, but needs to be communicated better to avoid issues

I guessed as much (though it doesn't appear to be documented).

The actual performance gain from this decision should be measured. I don't think it has a big impact, and correctness is always more important than performance.

@AThousandShips Member

I agree; there was talk of adding a correctness check as well. But almost all string literals are latin1, or rather ASCII, so that's important to keep in mind too.

@AThousandShips Member

AThousandShips commented Dec 20, 2024

I think making people extra aware that raw strings are parsed as latin1 is important, along with doing a survey of how many string sequences in the codebase actually use other encodings (either via String::utf8 or U"..."). Changing the standard constructor would also at least partially break compatibility, as it is exposed to GDExtension, and would risk worsening performance.

Taking a slight performance hit to parse UTF-8 (I'd say it's almost guaranteed to hurt performance at least somewhat, though it might be barely noticeable) would be bad IMO if the capability went essentially unused, which I suspect is the case in the existing codebase. Most cases, like names and types, are strictly ASCII, and outside of certain string literals like property hints (which use special characters like the degree symbol), we generally shouldn't be using non-ASCII literals; it's not just that we currently don't.

TL;DR: If something like 90% or more of the literals in the codebase are ASCII, any drop in performance from switching to UTF-8 would be a net negative, as the feature would go nearly unused.

@Ivorforce Contributor Author

Ivorforce commented Dec 20, 2024

TL;DR: If something like 90% or more of the literals in the codebase are ASCII, any drop in performance from switching to UTF-8 would be a net negative, as the feature would go nearly unused.

I agree, that's why I think just statically checking for non-ASCII in string literals would be best (and disallowing the default string constructor for anything else, forcing callers to use String::utf8() or something similar explicitly). That would give us the best of both worlds. One possible shape for such a check is sketched below.
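
One possible shape for such a static check (a hypothetical sketch, not something Godot does today) is a constexpr validator wrapped in a macro, so a non-ASCII literal becomes a build error:

// Hypothetical compile-time ASCII check for string literals (works in C++17).
constexpr bool is_ascii(const char *s) {
	for (; *s; s++) {
		if ((unsigned char)*s > 127) {
			return false;
		}
	}
	return true;
}

// Wrapping a literal in ASCII_LITERAL() turns any non-ASCII byte into a build error.
#define ASCII_LITERAL(m_lit) ([] { static_assert(is_ascii(m_lit), "Non-ASCII literal; use String::utf8(u8\"...\") instead."); return m_lit; }())

String ok = ASCII_LITERAL("Hello"); // Compiles.
// String bad = ASCII_LITERAL("é"); // Fails to build.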

@AThousandShips Member

Agreed, handling literals specially would be very useful. Most if not all dynamic input is processed as UTF-8 already, with a few exceptions that should probably use the explicit ASCII path instead, to make intent clear.

@kiroxas Contributor

kiroxas commented Dec 20, 2024

Does anyone have some data on this? Like how many static strings are parsed in the codebase, and what's their average size? Because if we're talking about 20-byte strings, the difference between these should be very small. I'm not even sure parse_utf8 would be that slow when parsing ASCII-only input, but we can probably measure that. I won't have access to my computer for some time, but it would be nice to gather those metrics to avoid guessing.

@bruvzg Member

bruvzg commented Dec 20, 2024

Not sure if there's any data, but you can estimate the number of string literals by checking the translation files (editor/translations). That won't include console-only error messages and some other stuff, but it should give a lower bound. And since engine UI translation is done via .po files, string literals in the code are generally in English (ASCII only); there are only a few special cases using non-ASCII characters (µs, ×, ±, box-drawing characters, font preview sample strings, and a few special strings for pseudolocalization).

@kiroxas Contributor

kiroxas commented Dec 29, 2024

I think we can safely ignore all strings from the .po files, as they already use parse_utf8 anyway; this would not change anything for them.

I did some quick benchmarking of the cost of parse_utf8 vs parse_latin1.
I tested two kinds of strings: long ones (around 500 bytes) to amortize the startup cost (i.e. memory allocation), and short ones (around 20 bytes), which are probably more representative of the static strings in the codebase. (parse_utf8_PR is the implementation from here.)

Function        Cycles (Long)   Cycles (Short)
latin1          5               39
parse_utf8      11              60
parse_utf8_PR   8.5             55

So for small strings, we're talking about a 200-300 cycle difference per string. That's not nothing, but even if you do it 50 times in a frame, that's a budget of around 10k cycles.
The right question now is: is this cost too high a price for removing some unlikely errors? (Might be, might not be.)
If those static strings are not converted to Strings every frame, but only once at startup, then I think we should use parse_utf8 by default, as the difference would not be noticeable even for thousands of them.

If anyone wants to test this, here is the code, to drop into Main::iteration somewhere (the rdtsc intrinsic may need some tuning depending on your compiler):

// __rdtsc() needs <intrin.h> on MSVC, or <x86intrin.h> on GCC / Clang.
const char mySuperString[] = "Does anyone has some ?";

static uint64_t latin = 0;
static uint64_t utf = 0;
static uint64_t iterations = 0;

++iterations;

// Time the default constructor (parse_latin1).
uint64_t start = __rdtsc();
String normalString(mySuperString);
uint64_t end = __rdtsc();

// Time explicit UTF-8 parsing of the same bytes.
String utfString;
utfString.parse_utf8(mySuperString);
uint64_t utfEnd = __rdtsc();

latin += end - start;
utf += utfEnd - end;

// Average in floating point, so the integer division doesn't truncate.
real_t latins = (real_t)latin / (real_t)iterations;
real_t utfs = (real_t)utf / (real_t)iterations;
uint32_t size = sizeof(mySuperString) / sizeof(mySuperString[0]); // Includes the terminating NUL.

print_line(vformat("(%d:%d)  Normal : %f : %f, Utf: %f : %f", iterations, size, latins, latins / size, utfs, utfs / size));
