
Strz type support for UTF-16 and UTF-32 #187

Closed
dreckard opened this issue Jun 20, 2017 · 34 comments

@dreckard

This issue is very similar to #13 and there is a lot of relevant discussion there.

Observed

  • If strz is used with UTF-16 or UTF-32, a single null byte is enough to cause termination, rather than the required 2 (UTF-16) or 4 (UTF-32) bytes.

Expected

  • If the encoding is given as UTF-16 or UTF-32, the strz parse only terminates when the corresponding number of consecutive null bytes appears.

To describe the use case: I am writing a descriptor for the output of an internal Windows tool which reserves a fixed length for the string and then null-terminates it. While this practice is a bit inefficient and I can't point to any widely used formats that do the same, I suspect that it is not uncommon for Windows programs, since it is analogous to doing so with UTF-8 in a cross-platform or Unix scenario.

On Windows, an example C struct might look like this:

struct TextBlob
{
	UCHAR DataPrefix[n];
	WCHAR WideString[64];
	UCHAR DataSuffix[n];
};
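For illustration, here is a minimal Python sketch (hypothetical data, not taken from the zip below) of how such a fixed-length, null-padded UTF-16 field is usually recovered, assuming the padding really is all zero bytes:

```python
# WCHAR WideString[64] occupies 64 * 2 = 128 bytes; pad with zeros.
raw = "Windows".encode("utf-16-le").ljust(128, b"\x00")

# Decode the whole field, then cut at the first U+0000. This only works
# when the trailing bytes are valid UTF-16, which is not guaranteed in
# general (see the later discussion about invalid trailing garbage).
text = raw.decode("utf-16-le").split("\u0000", 1)[0]
print(text)  # -> Windows
```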

For an example data blob and ksy file that minimally reproduce the issue (at least with the Web IDE), please see the following zip:
utf16_test.zip


GreyCat commented Jun 20, 2017

OK, to collect in one place all the arguments that were mentioned in previous discussions on this topic:

  • Implementing "read until a combination of multiple bytes is encountered" is non-trivial. For example, none of our supported languages has such a method in its stdlib (though almost all of them include something like "read until a single byte is encountered").
  • Actually, "read until a sequence of 00 00 is encountered" is plainly wrong for this case: "01 00 00 02 00 00" should be read as the two UTF-16LE chars "\u0001\u0200", not as just "\u0001". I.e. "00 00" must appear as a whole character, 2-byte aligned.
  • It's entirely possible (although not really usual) to run into this kind of problem outside of the C + Windows world. Although the majority of applications don't seem to use zero-terminated wide-char strings, the family of functions like wcslen, wcscat, etc. (akin to the much more popular strlen, strcat, etc.) has technically existed in C since the C99 standard.
  • No popular (and publicly available) format has been found using this so far.
  • It looks like this issue is only relevant to the UTF-16 representation of strings, and probably 99.9% of the time it is actually used on a fixed-size string to trim remaining garbage, i.e. it should be solved around the bytes_terminate function, but probably applied in a somewhat different manner.

The current archetypical application generates something like this:

foo = bytes_to_str(
    bytes_terminate(
        bytes_strip_right(
            io.read_bytes(20),
            43 // padding byte
        ),
        64, // terminator byte
        false
    ),
    "UTF-8"
);

Obviously, applying a terminating function that operates on strings (not just byte arrays) requires us to convert the byte array to a string first with "bytes_to_str", i.e. executing something like this:

foo = str_terminate(
    bytes_to_str(
        io.read_bytes(20),
        "UTF-16LE"
    ),
    0x0000 // terminator char
);

The big catch is that the trailing garbage might contain something that is invalid in the chosen encoding. Unfortunately, contrary to popular opinion, this is true even for UTF-16. For example, if we're working with C++ or PHP (and an iconv-based implementation, which internally converts everything that should be treated as a string to UTF-8), this string will trigger an error on conversion:

61 00|00 00|00 D8 61 00

Although we would expect the result "a" (as 0x61 is the ASCII code for "a", and the string then terminates with 00 00), it will trigger something like an InvalidByteSequenceError. This is because 00 D8 61 00 is not a valid UTF-16 sequence: 00 D8 is the lead of a surrogate pair and must be followed by the tail of a surrogate pair, so 61 00 is illegal here.
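Both pitfalls above can be reproduced with a short Python snippet (Python is used here purely for illustration; the byte strings are the ones from this comment):

```python
data = b"\x01\x00\x00\x02\x00\x00"  # two UTF-16LE chars: U+0001, U+0200

# A naive search for the byte sequence 00 00 matches at offset 1, in the
# middle of a character, and truncates the string to a single byte.
naive = data[:data.index(b"\x00\x00")]
assert naive == b"\x01"

# An aligned scan steps 2 bytes at a time, so 00 00 can only match on a
# character boundary; here the trailing aligned 00 00 pair at offset 4 is
# the terminator, and both characters survive.
pos = next((i for i in range(0, len(data), 2)
            if data[i:i + 2] == b"\x00\x00"), len(data))
assert data[:pos].decode("utf-16-le") == "\u0001\u0200"

# Trailing garbage can be invalid in the chosen encoding: 00 D8 is an
# unpaired lead surrogate, so decoding the whole buffer fails.
garbage = bytes.fromhex("6100 0000 00d8 6100")
try:
    garbage.decode("utf-16-le")
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised
```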


sirg3 commented Jun 20, 2017

No popular (and publicly available) format has been found using this so far.

Windows version info resources, found in executables and .res files, use them.


GreyCat commented Jun 21, 2017

VS_VERSIONINFO per se has no variable-length strings. However, it includes StringFileInfo → StringTable → String, which actually includes them.

What's even more peculiar: it seems that double-null-terminated strings are literally used to store two-level lists: https://devblogs.microsoft.com/oldnewthing/20091008-00/?p=16443. I believe they effectively become quad-null-terminated strings when laid out in UTF-16. Can anyone confirm or deny that?


dreckard commented Jun 22, 2017

Good clarification on the byte alignment requirement. I believe that is correct regarding the quad-null termination for WCHAR string lists. The only other documentation I have seen is for some of the APIs that return data in this style, such as GetLogicalDriveStrings, but I don't see any other way it could be interpreted.

The Windows KUSER_SHARED_DATA struct's NtSystemRoot is another example of a null-terminated UTF-16 (single) string. The struct is fairly commonly parsed in memory dump analysis, at least.

typedef struct _KUSER_SHARED_DATA
{
     ULONG TickCountLowDeprecated;
     ULONG TickCountMultiplier;
     KSYSTEM_TIME InterruptTime;
     KSYSTEM_TIME SystemTime;
     KSYSTEM_TIME TimeZoneBias;
     WORD ImageNumberLow;
     WORD ImageNumberHigh;
     WCHAR NtSystemRoot[260];
     ULONG MaxStackTraceDepth;
     ...
}

http://www.geoffchappell.com/studies/windows/km/ntoskrnl/structs/kuser_shared_data.htm


GreyCat commented Jun 22, 2017

I've implemented a simple Windows resource parser as an exercise: https://github.com/kaitai-io/windows_resource_file.ksy/blob/master/windows_resource_file.ksy — for now it uses a hack to parse strings (especially because there's an extra twist there: the same byte space is used to designate numeric and string IDs).

While doing so, I've noticed that there are at least 2 very distinct cases we're talking about here:

  1. Pure "read until double 00 terminator, convert to string" — encountered at least in NAME and TYPE fields of
  2. "Read fixed number of bytes, then trim it using double 00 terminator logic, then convert to string"

That means my original proposal is wrong. We indeed need both (1) and (2) implemented, not only (2).


arekbulski commented Feb 7, 2018

Construct has both (fixed-length string with filler, and C-string). Admittedly, it took me 2 years to get it implemented. I have never seen a protocol that actually used the first one either, and I can't even imagine why anyone would design such a protocol in the first place, but it was requested on the Construct forum more than once, so I guess someone out there actually needs it. So here we are.

@GreyCat If you give me the go-ahead, I will add 2 methods to the Python runtime to effectuate this, but you would need to update the compiler (translator). I could then update C# too. The restriction is that these 2 methods need to know what the encoding is, or rather what the unit size is (2 for UTF-16, 4 for UTF-32, 1 for UTF-8).


GreyCat commented Feb 8, 2018

Let's start with inventing some ksy syntax that covers all the cases discussed in this ticket.


arekbulski commented Feb 11, 2018

Fixed string: "type: str, size: N, terminator: 0, encoding: utf16, [unitsize: T=2]" (EDIT: added terminator)

Reads N bytes, then successively strips the last T bytes if they are null, down to an empty string. If the encoding is recognizable, like UTF-*, the unit size can be inferred. N must be a multiple of T.

CString: "type: str, terminator: 0, encoding: utf16, [unitsize: T=2]"
CString: "type: strz, encoding: utf16, [unitsize: T=2]"

Reads T bytes at a time, until a chunk is all null bytes. If the first chunk is all nulls, it's an empty string. If the encoding is recognizable, like UTF-*, the unit size can be inferred. A terminator other than 0 should be a compile error, because the UTF encodings, at least, support only one way of terminating a string.

By recognizable encodings I mean those:

construct.possiblestringencodings = {'U16': 2, 'utf_8': 1, 'utf32': 4, 'utf_32_le': 4, 'utf8': 1, 'utf_32_be': 4, 'utf_32': 4, 'utf_16_be': 2, 'U32': 4, 'utf16': 2, 'ascii': 1, 'utf_16': 2, 'utf_16_le': 2, 'U8': 1}

@arekbulski arekbulski self-assigned this Feb 11, 2018

arekbulski commented Mar 4, 2018

There is also a problem when both size and terminator are used.

meta:
  id: test1
seq:
  - id: value
    type: str
    terminator: 0
    size: 10
    encoding: utf-8

This compiles into the following. The problem is that bytes_terminate only supports single-byte terminators.

(KaitaiStream.bytes_terminate(self._io.read_bytes(10), 0, False)).decode(u"utf-8")

Extract from the runtime:

    def bytes_terminate(data, term, include_term):
        new_len = 0
        max_len = len(data)
        while new_len < max_len and data[new_len] != term:  # <-- single-byte indexing, not unit-sized slicing
            new_len += 1  # <-- fixed step of 1, not the unit size
        if include_term and new_len < max_len:
            new_len += 1
        return data[:new_len]


GreyCat commented Mar 4, 2018

bytes_terminate, as its name suggests, does not support any characters at all, only bytes. What problem do you see here?


arekbulski commented Mar 4, 2018

To support UTF-16/32, bytes_terminate would need to slice data[new_len:new_len+unit] and then increment with new_len += unit. In other words, the current implementation does not support UTF-16/32. That's what this issue is about. And by "characters" I meant single bytes, sorry.
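The modification described above might look like this in Python (a sketch in which the unit size is inferred from the terminator length, and the terminator becomes a byte string rather than a single byte; this is not the actual runtime code):

```python
def bytes_terminate_multi(data, term, include_term):
    # `term` is a byte string whose length is the unit size (2 for
    # UTF-16, 4 for UTF-32). Slices are compared in unit-sized steps,
    # so the terminator only matches on character boundaries.
    unit = len(term)
    new_len = 0
    max_len = len(data)
    while new_len < max_len and data[new_len:new_len + unit] != term:
        new_len += unit
    if include_term and new_len < max_len:
        new_len += unit
    return data[:new_len]
```

For example, bytes_terminate_multi(b"\x01\x00\x00\x02\x00\x00", b"\x00\x00", False) keeps both UTF-16LE characters instead of truncating inside the first one.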


GreyCat commented Mar 4, 2018

It doesn't, that's true. My main concern here is that technically we should not deal with bytes at all: if we're dealing with encodings like UTF-16 in, for example, C++, that would call for the char16_t type instead of uint8_t and a relevant reinterpret_cast.

If we stick with a bytes-centric implementation, however, from the runtime's point of view we can probably cover all possible cases by specifying something like bytesTerminateMulti:

byte[] bytesTerminateMulti(byte[] bytes, byte[] term, int unitSize, boolean includeTerm)

and the same thing for bytes_strip_right_multi:

public static byte[] bytesStripRight(byte[] bytes, int unitSize, byte[] padBytes) {

@arekbulski

Good idea. It would be better to have a separate (multi) method instead of generalizing the existing (single-byte) method, because the multi version will perform worse than the single one.

Would the unitSize parameter actually be needed? I think it's just len(term) and len(padBytes).


GreyCat commented Mar 6, 2018

Specifying unitSize is needed because we need to support both aligned 00 00 and unaligned variants. For example, 30 00 40 00 50 50 50 00 00 12 34 56 should become 30 00 40 00 50 50 50 (or 0\0@\0PPP in ASCII) after termination, if unitSize is 1 and term is [0, 0].
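This aligned vs. unaligned distinction, and why unitSize must be independent of len(term), can be demonstrated with a small sketch (hypothetical function; the data is the hex example from this comment):

```python
def terminate_multi(data, term, unit_size):
    # Scan for `term`, stepping `unit_size` bytes at a time; with
    # unit_size=1 the terminator may match at unaligned offsets.
    for pos in range(0, len(data), unit_size):
        if data[pos:pos + len(term)] == term:
            return data[:pos]
    return data

data = bytes.fromhex("300040005050500000123456")

# unit_size=1: the pair 00 00 first matches at the unaligned offset 7.
assert terminate_multi(data, b"\x00\x00", 1) == bytes.fromhex("30004000505050")

# unit_size=2: no 00 00 pair falls on a 2-byte boundary in this data,
# so nothing is terminated and the whole buffer is returned.
assert terminate_multi(data, b"\x00\x00", 2) == data
```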

Also, I think that this area more than anything warrants plenty of language-specific APIs. For example, in C/C++ you don't want to pass an array around; you'd want a single uint16_t integer, and then one just reinterpret_casts parts of the array and compares them to that integer. This obviously doesn't make much sense for higher-level languages, in which "parsing 2 bytes as an integer" is an expensive operation.

@arekbulski

I thought it should be a unit-aligned version only. The use case for aligned would obviously be strings, but what would be the use case for unaligned?


KOLANICH commented Mar 7, 2018

BTW, do we really have to carve out strz strings ourselves in C++? I mean, string functions like the basic_string constructor from a pointer expect a certain termination, so can't we just pass a pointer and then advance by the size() of the string parsed by C++?

@arekbulski

It probably only works with a single-null terminator, right? This issue is about supporting UTF-16 and UTF-32.


KOLANICH commented Mar 7, 2018

wchar_t is usually (though it is not standardised) a utf-16 encoded code point. wstring is a string of wchar_ts. Arrays of wchar_t are assumed to be terminated with 2 null bytes.

@arekbulski

So someone uses it with UTF32 encoding, then what?


GreyCat commented Mar 7, 2018

wchar_t is usually (though it is not standardised) a utf-16 encoded code point.

I'd say that from a practical point of view, the only system that uses UTF-16 wide chars is Windows. Virtually everyone else uses UTF-32 there.

@arekbulski

Python unicode strings use 1/2/4 bytes per character, depending on actual text.


GreyCat commented Mar 7, 2018

The question @KOLANICH raised was about wchar_t in C++. Python, Java, C#, Go and everything else is irrelevant to that.

Let's get back to original topic.

The usecase for aligned would be obviously strings, but what would be the usecase for unaligned?

I believe someone further up in this issue provided some examples of why that would be useful. I recall some strings in UTF-16 wanting a 4-byte [0, 0, 0, 0] terminator and the like.


generalmimon commented Jul 20, 2024

My gut feeling is that eventually we'd want to converge on an implementation akin to #538. Everything we do regarding string termination — e.g. strz, terminator, etc. — is just a shortcut to one of the possible options in that engine.

Logically, from the standpoint of least possible friction, having some syntactic sugar which will make strz behave as you've explained absolutely seems to make sense to me.

@GreyCat Makes sense, thanks for commenting. I'll probably keep it simple then and only implement type: strz + encoding: UTF-{16,32}* for now, with no other options. As you mentioned, this is what most people really need.

Anything more complex will be the subject of #538.


To be clear, does that mean that we will not pursue #158, and will instead move that to #538 as well? If so, perhaps we should close #158 in favor of #538 for clarity.

On second thought, I'm not quite sure what all #158 suggests; it sounds like a bunch of different, vaguely specified things:

Wildcard bytes, regular expressions, number ranges and other helpers could also be of assistance in defining terminators in other file formats.

So maybe it's not completely covered by #538 after all.


GreyCat commented Jul 20, 2024

To be clear, does that mean that we will not pursue #158, and will instead move that to #538 as well? If so, perhaps we should close #158 in favor of #538 for clarity.

Generally, I would agree. I believe that design with "scanners" concept (#538) in general is solid, it's just a question of careful implementation.

Wildcard bytes, regular expressions, number ranges and other helpers could also be of assistance in defining terminators in other file formats.

All these things absolutely can be covered with custom scanning procedures. We can certainly provide a library of built-in scanning procedures in different languages, but ultimately it will be up to the user to pick and use one that fits their purpose.

generalmimon added a commit to kaitai-io/kaitai_struct_python_runtime that referenced this issue Jul 23, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_java_runtime that referenced this issue Jul 24, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_ruby_runtime that referenced this issue Jul 30, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_php_runtime that referenced this issue Jul 31, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_go_runtime that referenced this issue Aug 2, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_javascript_runtime that referenced this issue Aug 6, 2024
See kaitai-io/kaitai_struct#187

bytesTerminateMulti() follows the existing implementation in Java:
https://github.com/kaitai-io/kaitai_struct_java_runtime/blob/20af3acef2959778853bb659756cf18092ae2420/src/main/java/io/kaitai/struct/KaitaiStream.java#L353-L373

readBytesTermMulti() has an original implementation written specifically
for the JavaScript runtime, but the semantics should be exactly the same
as in other languages.
generalmimon added a commit to kaitai-io/kaitai_struct_perl_runtime that referenced this issue Aug 6, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_compiler that referenced this issue Aug 7, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_compiler that referenced this issue Aug 7, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_compiler that referenced this issue Aug 7, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_compiler that referenced this issue Aug 7, 2024
See kaitai-io/kaitai_struct#187

The new methods used in this commit have been implemented in
kaitai-io/kaitai_struct_lua_runtime@aa9fa84

Unfortunately, it cannot be said that this adds UTF-16 and UTF-32
support to `type: strz`, because the Lua runtime library doesn't support
these encodings yet. It only supports ASCII and UTF-8, see
https://github.com/kaitai-io/kaitai_struct_lua_runtime/blob/a7b8a9144f0978cd75d3b176c61a773be56370c0/string_decode.lua#L35-L43

So this is more of a promise of future support for
kaitai-io/kaitai_struct#187 once UTF-16 and
UTF-32 encodings are implemented.
generalmimon added a commit to kaitai-io/kaitai_struct_compiler that referenced this issue Aug 7, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_compiler that referenced this issue Aug 7, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_compiler that referenced this issue Aug 7, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_compiler that referenced this issue Aug 7, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_compiler that referenced this issue Aug 7, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_compiler that referenced this issue Aug 7, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_compiler that referenced this issue Aug 7, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_compiler that referenced this issue Aug 7, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_tests that referenced this issue Aug 7, 2024
@generalmimon

This is now implemented in all 11 target languages that Kaitai Struct supports (C++/STL, C#, Go, Java, JavaScript, Lua, Nim, Perl, PHP, Python, Ruby).

Unfortunately, it cannot be used in Lua yet, because our Lua runtime library doesn't support the UTF-16* or UTF-32* encodings at all. The mechanism has been implemented there as well, though, so it will become available once the UTF-{16,32}* encodings are implemented (see kaitai-io/kaitai_struct_compiler@b0cbf6d).

The corresponding tests at https://ci.kaitai.io/ are called StrPadTermUtf16 (for the size + type: strz case) and TermStrzUtf16V{1,2,3,4} (for parsing type: strz directly from the stream, without any known size). As expected, they don't pass in Lua (Encoding UTF-16LE not supported).

The fact that they don't pass in Nim at the moment is an infrastructure issue that will disappear after the Docker image for Nim gets updated (by the CI pipeline at https://github.com/kaitai-io/kaitai_struct_docker_images). Unlike all other languages, the Nim Docker image bundles the runtime library inside (src/nim/1.6.0-linux-x86_64/Dockerfile:7) instead of using the latest runtime library mapped at /runtime in the Docker container. As a result, it uses the old version of the Nim runtime library that doesn't implement the necessary methods.

I also ran into this problem when testing locally and fixed it by adding ( cd "$NIM_RUNTIME_DIR" && nimble develop -y ) at ci-nim:7. I don't think ci-nim is the right place for resolving this properly, though. Instead, this should ideally be performed when building the Docker image.

When tested locally, the new tests passed in Nim.

@generalmimon generalmimon added this to the v0.11 milestone Jan 2, 2025