
UTF-8 Strings are incorrectly parsed as latin1. #100641

Open
Ivorforce opened this issue Dec 20, 2024 · 9 comments

Comments

@Ivorforce Contributor

Ivorforce commented Dec 20, 2024

Tested versions

All versions are affected. This is old code.

Issue description

The current implementation of Godot has a bug where UTF-8 encoded strings are parsed as latin1.

I'd like to hear what you all think is the best solution to this problem.

The String(const char *) constructor currently parses char strings using latin1 encoding:

godot/core/string/ustring.h, lines 615 to 617 in 89001f9:

String(const char *p_cstr) {
	parse_latin1(p_cstr);
}

(parse_latin1 was recently renamed from copy_from to reflect what the function actually does.)

This constructor is used for many strings, including static C strings. The encoding of static C strings is controlled by -fexec-charset on GCC / Clang [1] and /execution-charset on MSVC [2]. In the Godot codebase, it defaults to UTF-8 on GCC / Clang, and is explicitly set to UTF-8 on Windows. latin1 and UTF-8 are compatible for values below 128, and incompatible otherwise, so any non-ASCII string can hit a mismatch between encodings. Mismatches are likely in other, non-static string use-cases as well, because latin1-encoded strings are rarely encountered in practice (though ASCII-only strings are common).
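
A concrete illustration of the mismatch (my own example, assuming a UTF-8 execution charset; String::utf8 is the existing explicit UTF-8 constructor):

// With -fexec-charset=UTF-8, the literal "é" is stored as the two bytes 0xC3 0xA9.
String wrong = "é"; // String(const char *) -> parse_latin1 turns each byte into a
                    // code point: U+00C3 U+00A9, i.e. the mojibake "Ã©".
String right = String::utf8("é"); // Explicit UTF-8 parsing decodes both bytes into
                                  // the single code point U+00E9, i.e. "é".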

The mismatch has apparently led to problems a few times in the past (though I don't know how often). Most of the time, strings use only ASCII characters, a range in which latin1 and UTF-8 overlap.

Possible Solutions

Here are some ideas that I came up with:

  • Enforce ASCII-only for static C strings (and force the use of u8"String" for UTF-8 strings).
    • Using -fexec-charset. This would be the best solution in my opinion, but it doesn't appear to be possible to pass anything other than UTF-8 to -fexec-charset right now, at least in Clang.
    • Using external linter tools or options. I don't know if such a tool exists.
  • Use UTF-8 parsing in the default constructors. This would be somewhat slower than latin1 parsing (though accelerated by Improve parse_utf8 performance #99826). The exact difference would have to be measured and tested for significance.
  • Use ASCII-only parsing in the default constructors, logging errors if values > 127 are encountered (see the sketch after this list). This should be negligibly slower than parse_latin1, and somewhat faster than parse_utf8.
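
A minimal sketch of the third option (hypothetical code, not part of Godot; it assumes the usual ERR_PRINT macro and reuses parse_latin1 for the copy, which is valid because ASCII bytes map 1:1 under latin1):

// Hypothetical: ASCII-only parsing with error logging for out-of-range bytes.
void String::parse_ascii(const char *p_cstr) {
	for (const char *c = p_cstr; *c; c++) {
		if ((unsigned char)*c > 127) {
			ERR_PRINT("Non-ASCII byte in String constructor; use String::utf8() for UTF-8 data.");
			break;
		}
	}
	parse_latin1(p_cstr); // ASCII is a subset of latin1, so the copy itself is unchanged.
}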

Ideally, the default, implicit String(char *) constructor would be removed entirely, to avoid this problem in the future. Every use of it would instead be replaced with an explicit call to String::utf8 or similar (maybe with the exception of construction from string literals), so we can be sure intent was put behind choosing the encoding. That is a bigger task though, and not realistic for the near future.
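
Sketched out, that end state might look like this (hypothetical; String::utf8 exists today, the deleted constructor does not, and u8"..." only converts to const char * pre-C++20):

// Inside class String (hypothetical end state):
String(const char *p_cstr) = delete; // Force callers to state the encoding.

// Call sites then become explicit about their encoding:
String a = String::utf8(some_utf8_data); // Dynamic UTF-8 input.
String b = String::utf8(u8"résumé");     // Non-ASCII literal, u8-prefixed.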

Solution 1 does not address possible encoding mismatches of non-static string constructor calls.

With either the second or third solution, users who actually want to parse latin1 would have to call that function explicitly.

Footnotes

  [1] https://gcc.gnu.org/onlinedocs/gcc/Preprocessor-Options.html#index-fexec-charset

  [2] https://learn.microsoft.com/en-us/cpp/build/reference/execution-charset-set-execution-character-set?view=msvc-170

@Ivorforce changed the title UTF-8 Strings are parsed as latin1. → UTF-8 Strings are incorrectly parsed as latin1. Dec 20, 2024
@AThousandShips Member

The choice to make String(char *) only accept Latin1 is by design as a performance benefit, but needs to be communicated better to avoid issues

@Ivorforce Contributor Author

Ivorforce commented Dec 20, 2024

The choice to make String(char *) only accept Latin1 is by design as a performance benefit, but needs to be communicated better to avoid issues

I guessed as much (though it doesn't appear to be documented).

The actual performance gain from this decision should be measured. I don't think it has a big impact, and correctness is always more important than performance.

@AThousandShips Member

I agree; there was talk of adding a correctness check as well. But almost all string literals are latin1, or rather ASCII, so that's important to keep in mind too.

@AThousandShips Member

AThousandShips commented Dec 20, 2024

I think making people extra aware that raw strings are parsed as latin1 is important, along with doing a survey of how many string sequences in the codebase actually use other encodings (either via String::utf8 or U"..."). Changing the standard constructor would also at least partially break compatibility, as it is exposed to GDExtension, and would risk worsening performance.

Taking a slight performance hit to parse UTF-8 (I'd say it's almost guaranteed to hurt performance at least somewhat, though it might be barely noticeable) would be bad IMO if the capability went essentially unused, which I suspect is the case in the existing codebase. Most cases, like names and types, are strictly ASCII, and outside of certain string literals like property hints (which use special characters like the degree symbol), we generally shouldn't be using non-ASCII literals; it's not just that we currently don't.

TL;DR: If something like 90% or more of the literals in the codebase are ASCII, any drop in performance from switching to UTF-8 would be a net negative, as the feature would go nearly unused.

@Ivorforce Contributor Author

Ivorforce commented Dec 20, 2024

TL;DR: If something like 90% or more of the literals in the codebase are ASCII, any drop in performance from switching to UTF-8 would be a net negative, as the feature would go nearly unused.

I agree, that's why I think just statically checking for non-ASCII in string literals would be best (and disallowing the default string constructor for anything else, forcing callers to use String::utf8() or something similar explicitly). That would give us the best of both worlds. One possible shape for such a check is sketched below.
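
One possible shape for such a static check (a hypothetical sketch, not something Godot does today) is a constexpr validator wrapped in a macro, so a non-ASCII literal becomes a build error:

// Hypothetical compile-time ASCII check for string literals (works in C++17).
constexpr bool is_ascii(const char *s) {
	for (; *s; s++) {
		if ((unsigned char)*s > 127) {
			return false;
		}
	}
	return true;
}

// Wrapping a literal in ASCII_LITERAL() turns any non-ASCII byte into a build error.
#define ASCII_LITERAL(m_lit) ([] { static_assert(is_ascii(m_lit), "Non-ASCII literal; use String::utf8(u8\"...\") instead."); return m_lit; }())

String ok = ASCII_LITERAL("Hello"); // Compiles.
// String bad = ASCII_LITERAL("é"); // Fails to build.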

@AThousandShips Member

Agreed, handling literals specially would be very useful. Most if not all dynamic input is processed as UTF-8 already, with a few exceptions that should probably use the explicit ASCII path instead, to make intent clear.

@kiroxas Contributor

kiroxas commented Dec 20, 2024

Does anyone have some data on this? Like how many static strings are parsed in the codebase, and what's their average size? Because if we're talking about 20-byte strings, the difference between these should be very small. I'm not even sure parse_utf8 would be that slow when parsing ASCII-only input, but we can probably measure that. I won't have access to my computer for some time, but it would be nice to gather those metrics to avoid guessing.

@bruvzg Member

bruvzg commented Dec 20, 2024

Not sure if there's any data, but you can estimate the number of string literals by checking the translation files (editor/translations). That won't include console-only error messages and some other stuff, but it should give a lower bound. And since engine UI translation is done via .po files, string literals in the code are generally in English (ASCII only); there are only a few special cases using non-ASCII characters (µs, ×, ±, box-drawing characters, font preview sample strings, and a few special strings for pseudolocalization).

@kiroxas Contributor

kiroxas commented Dec 29, 2024

I think we can safely ignore all strings from the .po files, as they already use parse_utf8 anyway; this would not change anything for them.

I did some quick benchmarking of the cost of parse_utf8 vs parse_latin1.
I tested two kinds of strings: long ones (around 500 bytes) to amortize the startup cost (i.e. memory allocation), and short ones (around 20 bytes), which are probably more representative of the static strings in the codebase. (parse_utf8_PR is the implementation from here.)

Function        Cycles (Long)   Cycles (Short)
latin1          5               39
parse_utf8      11              60
parse_utf8_PR   8.5             55

So for small strings, we're talking about a 200-300 cycle difference per string. That's not nothing, but even if you do it 50 times in a frame, that's a budget of around 10k cycles.
The right question now is: is this cost too high a price for removing some unlikely errors? (Might be, might not be.)
If those static strings are not converted to Strings every frame, but only once at startup, then I think we should use parse_utf8 by default, as the difference would not be noticeable even for thousands of them.

If anyone wants to test this, here is the code, to drop into Main::iteration somewhere (the rdtsc intrinsic may need some tuning depending on your compiler):

// __rdtsc() needs <intrin.h> on MSVC, or <x86intrin.h> on GCC / Clang.
const char mySuperString[] = "Does anyone has some ?";

static uint64_t latin = 0;
static uint64_t utf = 0;
static uint64_t iterations = 0;

++iterations;

// Time the default constructor (parse_latin1).
uint64_t start = __rdtsc();
String normalString(mySuperString);
uint64_t end = __rdtsc();

// Time explicit UTF-8 parsing of the same bytes.
String utfString;
utfString.parse_utf8(mySuperString);
uint64_t utfEnd = __rdtsc();

latin += end - start;
utf += utfEnd - end;

// Average in floating point, so the integer division doesn't truncate.
real_t latins = (real_t)latin / (real_t)iterations;
real_t utfs = (real_t)utf / (real_t)iterations;
uint32_t size = sizeof(mySuperString) / sizeof(mySuperString[0]); // Includes the terminating NUL.

print_line(vformat("(%d:%d)  Normal : %f : %f, Utf: %f : %f", iterations, size, latins, latins / size, utfs, utfs / size));
