Maniacs - String Variables - String encoding interpretation #3298

Closed
elsemieni opened this issue Nov 22, 2024 · 2 comments

@elsemieni

Tested with Player Master jenkins-1807.
I know Maniacs support is still an extensive work in progress, but I guess it's better to report things than not.

While playing with String Variables I noticed that Maniacs RPG_RT and Player handle String Variables differently when they contain less common characters. Basically, if a string contains certain characters, Player reports a greater length than RPG_RT.

As an example, if I set
T[1]=ß
and then obtain its length, RPG_RT reports 1, but Player reports 2, as seen in this image:
(image)

I believe it could be related to the Unicode interpretation/conversion of characters.

@Ghabry

Ghabry commented Nov 22, 2024

Yeah, this is our implementation leaking through. Maniacs uses the local encoding (ß is 1 byte). We use UTF-8 (ß is 2 bytes).
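To make the byte counts concrete, here is a minimal Python sketch. It assumes Windows-1252 as the "local encoding" for a Western European game; the issue does not say which code page the test actually ran under.

```python
# Byte length of "ß" in a legacy code page versus UTF-8.
text = "ß"

# Assumption: the local encoding is Windows-1252 (Western European).
legacy_bytes = text.encode("cp1252")
utf8_bytes = text.encode("utf-8")

print(len(legacy_bytes))  # 1 (single byte 0xDF)
print(len(utf8_bytes))    # 2 (bytes 0xC3 0x9F)
```

A length function that counts bytes therefore gives different answers depending on which of the two encodings the string is stored in.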

There is also no easy way to fix this. Keeping the strings in the legacy encoding will break stuff as everything else in our Player is UTF-8 (e.g. assigning an actor name from a string).

The translation feature also expects everything to be UTF-8 (so you can read a redirected text file).

@Ghabry

Ghabry commented Nov 24, 2024

Actually, after some further testing, just returning the number of bytes is incorrect. A better approximation is returning the number of codepoints.

As an example, take the string "XXひらがなXX". When converted to Shift-JIS (Japanese encoding) this results in a size of 12 bytes (X = 1 byte, each Hiragana = 2 bytes).

When using 1252 as the encoding (Western European), where every character is 1 byte, Maniacs GetLen returns 12. (Note that you cannot run the game this way, as the characters cannot be displayed.)

When running with 932 (Shift-JIS) it gives 8.

So RPG_RT seems to use a wide character set when running with SJIS (= 2 bytes count as 1 character).
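The three numbers above can be reproduced with Python's built-in codecs. This is only a sketch that models the reported lengths; it makes no claim about how RPG_RT computes them internally.

```python
text = "XXひらがなXX"

# Stored form: Windows code page 932 (Shift-JIS).
sjis_bytes = text.encode("cp932")
print(len(sjis_bytes))  # 12 (X = 1 byte, each Hiragana = 2 bytes)

# Read the same bytes as code page 1252: every byte becomes one character.
as_western = sjis_bytes.decode("cp1252", errors="replace")
print(len(as_western))  # 12

# Read them as code page 932: each 2-byte pair becomes one character.
as_japanese = sjis_bytes.decode("cp932")
print(len(as_japanese))  # 8
```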

So just counting Unicode codepoints (one codepoint per encoded character) seems to give a pretty good approximation:

  • ß is one codepoint (WORKS)
  • XXひらがなXX has 8 codepoints (WORKS)

For Russian this should also work.
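A quick check of the Russian case, as a sketch. The sample word and the choice of Windows-1251 as the legacy Cyrillic code page are assumptions made for illustration; neither appears in the discussion above.

```python
text = "Привет"  # hypothetical sample string

# Assumption: a Russian game would run under Windows-1251.
print(len(text.encode("cp1251")))  # 6 bytes, one per letter
print(len(text.encode("utf-8")))   # 12 bytes, two per letter
print(len(text))                   # 6 codepoints
```

The codepoint count matches the single-byte legacy length, while the UTF-8 byte count does not.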


Conclusion:

Fortunately there are no games with any complex scripts, because RPG_RT cannot render them. So this codepoint trick should work in 99.9% of all cases :).

So just sending everything that is string index/length related through our utf-8 codepoint iterator code will fix most of it.
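As an illustration of what codepoint-based length and indexing over UTF-8 bytes could look like, here is a small Python sketch. The function names `utf8_length` and `utf8_substring` are hypothetical and are not Player's actual iterator code; the sketch assumes valid UTF-8 input and non-negative indices.

```python
def utf8_length(data: bytes) -> int:
    """Count codepoints by skipping continuation bytes (0b10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_substring(data: bytes, start: int, count: int) -> bytes:
    """Return `count` codepoints beginning at codepoint index `start`."""
    # Byte offsets at which each codepoint begins, plus the end offset.
    starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    starts.append(len(data))
    last = len(starts) - 1
    begin = starts[min(start, last)]
    end = starts[min(start + count, last)]
    return data[begin:end]


sample = "XXひらがなXX".encode("utf-8")
print(len(sample))                                   # 16 bytes
print(utf8_length(sample))                           # 8 codepoints
print(utf8_substring(sample, 2, 4).decode("utf-8"))  # ひらがな
```

Working on codepoint boundaries means a substring never cuts a multi-byte sequence in half, which plain byte indexing into UTF-8 could do.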

Using UTF-8 codepoints also prevents any data loss from reencoding the string (Great).
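The data-loss risk shows up when a string holds characters that the legacy code page cannot represent. A minimal sketch, assuming code page 1252 as the target and using replacement on error (the exact failure mode in a real converter may differ):

```python
text = "ひらがな"

# Round trip through a legacy code page that lacks Hiragana.
lossy = text.encode("cp1252", errors="replace")
print(lossy)                   # b'????'
print(lossy.decode("cp1252"))  # ????  (original text is gone)

# Round trip through UTF-8 preserves the string.
restored = text.encode("utf-8").decode("utf-8")
print(restored == text)        # True
```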

@Ghabry Ghabry added this to the 0.8.1 milestone Nov 24, 2024