Polish diacritic characters from map files are not displayed correctly in game #1741

Open
Hirotaro opened this issue Feb 21, 2025 · 10 comments

Comments

@Hirotaro

The issue is connected to map files.
Map files contain the map names.

For Western EU languages, special characters are displayed correctly.
For Polish, special characters like ĄĆĘŁŃÓŚŹŻ are rendered as ? when read from a map file.

It may be that the CODEPAGE used for reading names from map files is set to CP850, while extended Latin characters that include Polish and Czech diacritic characters require CP852.
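A quick way to sanity-check this hypothesis is to ask which OEM codepages can represent the Polish capitals at all. The sketch below relies only on Python's built-in codecs; the helper name `unencodable` is made up for illustration and is not part of the RttR code base.

```python
POLISH_CAPITALS = "ĄĆĘŁŃÓŚŹŻ"


def unencodable(text: str, codec: str) -> str:
    """Return the characters of `text` that `codec` cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.append(ch)
    return "".join(missing)


for codec in ("cp437", "cp850", "cp852"):
    result = unencodable(POLISH_CAPITALS, codec) or "(nothing)"
    print(f"{codec} cannot encode: {result}")
```

Only CP852 covers the whole set; CP850 has Ó but none of the other letters.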

Note: All texts from LUA or MO/PO are displayed correctly.

Example:
Image

Fuchsia - issue while reading from Map Files.
Green - correctly rendered from MO/PO text base.

@Flamefire
Member

Flamefire commented Feb 21, 2025

This is actually a limitation of the map format. The names are stored in the map in an OEM codepage and decoded to ANSI when read, and not all characters are available in the ANSI codepage.

Where exactly are those maps from? Can you attach one of them?

It might be impossible to fix this issue at all if e.g. BlueByte delivered different maps to different languages using the local OEM codepage. There would be no way for the decoder to know which one to use.

@Hirotaro
Author

Hirotaro commented Feb 21, 2025

In the original Polish release those were displayed correctly - at that time I believe the files were encoded in ASCII (DOS).
It may require CP852 instead of CP850 to be used ref: https://www.ascii-codes.com/cp852.html

If the read function uses ANSI though, it probably should use 1250 (Eastern Europe Latin-2), not 1252 (Western Latin-1 only)

Screenshots and example maps attached.

Original S2:
Image

S2 RttR:
Image

Map: In mountains "W górach"

Map: The Turtle "Żółw"

@Spikeone
Member

or we could use this one: #1638

although I'm interested in what would happen if you created a new map with the Polish editor using those characters

@Flamefire
Member

I guess we need to check where the ANSI chars are converted to UTF-8. But when using the 1250 codepage we might run into the same issue, just with another language. See the comparison: with 1250 we'd lose e.g. ù, ò and ê, which might be relevant for French users.
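The trade-off between the two ANSI codepages can be listed mechanically. The following is a minimal sketch using Python's built-in `cp1250` and `cp1252` codecs; `charset` is a hypothetical helper, and unassigned bytes are simply skipped.

```python
def charset(codec: str) -> set[str]:
    """Collect every character a single-byte codec defines in 0x80-0xFF."""
    chars = set()
    for byte in range(0x80, 0x100):
        try:
            chars.add(bytes([byte]).decode(codec))
        except UnicodeDecodeError:
            pass  # this byte is unassigned in the codepage
    return chars


cp1250 = charset("cp1250")
cp1252 = charset("cp1252")

lost = cp1252 - cp1250    # available today, gone after a switch to 1250
gained = cp1250 - cp1252  # newly available after a switch to 1250

print("lost with 1250:  ", "".join(sorted(lost)))
print("gained with 1250:", "".join(sorted(gained)))
```

Neither codepage is a superset of the other, so switching the target only moves the problem to a different group of languages.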

Is there any hint in the map about the encoding/codepage? @Spikeone do you happen to know if the maps are the same in different languages? If only the metadata is different between e.g. the German and Polish map release we can check if there are any differences that allow us to infer the encoding.

@Spikeone
Member

@Flamefire sadly, so far I wasn't aware that there are French or Polish (original) versions out there at all - although I may remember that someone once told me about the Polish version.

@Hirotaro do you happen to know the source for the version?

@Hirotaro
Author

Hirotaro commented Feb 21, 2025

@Spikeone Yes, the Settlers II PL version and other Ubisoft games of that time were officially prepared and released by CD PROJEKT Sp. z o.o. (currently known as CD PROJEKT RED S.A.). In 1990-2015 CD PROJEKT (Publishing, aka 'Blue') was responsible for hundreds of official Polish, Czech and Hungarian game releases.

Right now the GOG.COM version of the game (GOG is part of CD PROJEKT group) includes PL files in their release of the digital Settlers 2 Gold Edition.

I worked at CD PROJEKT for 10 years, taking care of localization for most of that time.
And I do own one copy of the Settlers 1-4 Saga release I worked on personally :)

Image
Image
Image

@Flamefire
Member

Thanks for the information! I checked the map "turtle" which in German is "Schildkröte" and the Polish one and the only difference is indeed the Name:

  • German: 53 63 68 69 6C 64 6B 72 94 74 65 (ö = 0x94)
  • Polish: AF F3 B3 77 00 00 00 00 00 00 00 (Żółw)

According to https://settlers2.net/archives/language-packs IBM CP437, OEM CP850 or CP852 is used, while another page on the same website states CP437.

Other German characters: ü=0x81, ß=0xe1, Ö=0x99

That matches all 3 codepages but not CP 1250.

So it looks like the original game used either CP437, CP850 or CP852, but the Polish one used CP1250.
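The byte comparison above can be reproduced by decoding both name fields with each candidate codepage. This is only a sketch with Python's built-in codecs: the byte strings are copied from the hex dumps above (the trailing 00 padding of the Polish name is left out), and `decodings` is a hypothetical helper name.

```python
GERMAN = bytes.fromhex("53 63 68 69 6C 64 6B 72 94 74 65")
POLISH = bytes.fromhex("AF F3 B3 77")  # trailing 00 padding omitted

CANDIDATES = ("cp437", "cp850", "cp852", "cp1250", "cp1252")


def decodings(raw: bytes) -> dict[str, str]:
    """Decode the same raw bytes with every candidate codepage."""
    return {codec: raw.decode(codec, errors="replace") for codec in CANDIDATES}


for label, raw in (("German", GERMAN), ("Polish", POLISH)):
    for codec, text in decodings(raw).items():
        print(f"{label:7} {codec:7} {text}")
```

The German name comes out as "Schildkröte" under all three OEM codepages, while the Polish bytes read as "Żółw" only under CP1250.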

The map format is expected to be in OEM format and we convert it to Windows-1252 during reading and back during writing.

Unfortunately our conversion code isn't documented well enough to know which of the 3 OEM codepages is actually used. In fact, I wasn't able to find any codepage that the mapping fits completely. There are also some unmapped characters such as Ź (0x80) which would have a mapping in CP852 but not in CP850 or CP437.

With all that being said: I don't see how we can reasonably implement support for the maps you posted, as those seem to use CP1250, which would break the currently supported maps.
If you have any ideas I'd be glad to hear them.

@Hirotaro
Author

Hirotaro commented Feb 22, 2025

In that case it seems to be unsolvable on the map file side, as we would have to alter the format to add language or codepage data to the header.
Even if we wanted to strip diacritics from the names, i.e. convert Ł => L or Ż => Z, the code would still need to know which codepage each character was encoded in...
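To illustrate why stripping diacritics does not sidestep the problem, here is a small sketch using Python's `unicodedata`. The function `strip_diacritics` and the `EXTRA` table are hypothetical; the table exists because letters such as Ł have no Unicode decomposition and need an explicit mapping.

```python
import unicodedata

# Letters without a Unicode decomposition need an explicit replacement.
EXTRA = str.maketrans({"Ł": "L", "ł": "l", "ß": "ss"})


def strip_diacritics(raw: bytes, codec: str) -> str:
    """Decode `raw` with `codec`, then reduce letters to their base form."""
    text = raw.decode(codec, errors="replace").translate(EXTRA)
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


POLISH = bytes.fromhex("AF F3 B3 77")  # "Żółw" as stored in the Polish map

print(strip_diacritics(POLISH, "cp1250"))  # right guess: plain letters
print(strip_diacritics(POLISH, "cp850"))   # wrong guess: still garbage
```

The result is only usable when the codec argument is the one the map was actually written with, which is exactly the information the file does not carry.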

I do not know how editor works though - it is probably also too expensive as a new feature, but...

You could consider adding an optional LUA file per map, where modders/creators could add the title and description in all the languages they want (like campaign files). These could then be displayed in game directly from the corresponding LUA instead of the WLD if the LUA exists (otherwise the WLD name would be displayed).
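The lookup order proposed here could look roughly like the sketch below. Everything in it is an assumption made for illustration: a plain dict stands in for whatever the optional per-map LUA file would provide, and `display_name` is not an existing RttR function.

```python
from typing import Mapping, Optional


def display_name(
    wld_name: str,
    translations: Optional[Mapping[str, str]],
    lang: str,
) -> str:
    """Prefer a translated title; fall back to the name stored in the WLD."""
    if translations:
        # A missing or empty entry for `lang` also falls back to the WLD name.
        return translations.get(lang) or wld_name
    return wld_name


# Hypothetical content of a per-map translation file.
turtle_titles = {"en": "The Turtle", "pl": "Żółw", "de": "Schildkröte"}

print(display_name("?ó?w", turtle_titles, "pl"))  # translated title
print(display_name("?ó?w", None, "pl"))           # no file: WLD name as-is
```

Maps without a translation file would behave exactly as they do today, so the change would be opt-in per map.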

But it may be too complex, and it is not that high a priority, as this is not a critical error or crash, more a matter of quality/polish.

P.S. In that case I could deliver translations for all RttR maps and for the Roman Campaign maps in English, Polish and Czech. As my German is not the best nowadays, I could try to gather the names from the originals and prepare v0.1 translations for review.

@Flamefire
Member

I do not know how editor works though - it is probably also too expensive as a new feature, but...

Our editor doesn't seem to handle non-ASCII at all. That should be fixed.
We should also make our mapping consistent. It might be intended to be a portable version of the Windows function OemToChar, but that seems to just specify CP_OEM, which might be any OEM codepage:

CP_OEMCP, comes from MS-DOS, is used for the Windows console, contains glyphs to create text interfaces (draw boxes) and has a number between 437 and 874. Example of a French setup: ANSI is cp1252 and OEM is cp850.

I assume our function was intended for CP437. With some checking I found that CP850 matches better than CP852. Also, CP850 has Ì, which CP437 doesn't, and we map ANSI 0xCC to OEM 0x49, i.e. I, instead of CP850's 0xDE.
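One way to run this kind of check is to derive a reference ANSI-to-OEM table from standard codecs and compare it with the hand-written one. The sketch below assumes that "ANSI" means Windows-1252 and uses `?` for unmappable characters; `ansi_to_oem_table` and `coverage` are hypothetical helpers, and the actual RttR table is not reproduced here.

```python
def ansi_to_oem_table(oem_codec: str, fallback: int = ord("?")) -> dict[int, int]:
    """Map each Windows-1252 byte in 0x80-0xFF to its byte in `oem_codec`."""
    table = {}
    for byte in range(0x80, 0x100):
        try:
            ch = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            continue  # byte is unassigned in Windows-1252
        try:
            table[byte] = ch.encode(oem_codec)[0]
        except UnicodeEncodeError:
            table[byte] = fallback  # no equivalent in the OEM codepage
    return table


def coverage(oem_codec: str) -> int:
    """Count the Windows-1252 bytes that have a real OEM equivalent."""
    table = ansi_to_oem_table(oem_codec)
    return sum(1 for oem in table.values() if oem != ord("?"))


for codec in ("cp437", "cp850", "cp852"):
    table = ansi_to_oem_table(codec)
    print(f"{codec}: 0xCC -> {table[0xCC]:#04x}, coverage {coverage(codec)}")
```

Comparing the existing table against each generated one byte by byte would show which codepage it was most likely modelled on, and where it deviates.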

I don't see a disadvantage to using CP850 over CP437, as it supports letters where the latter has symbols. We can still include a way to translate map names, in which case people can use the English maps, which IIRC are freely available.
To make it easier for standard maps I guess we can "hack" the S2 map names into the source code such that they can be translated via launchpad.

@Hirotaro
Author

Yes, hack/hardcoded solution for original maps could be a solution. Maybe not very subtle, though for sure effective.


3 participants