Strz type support for UTF-16 and UTF-32 #187
Ok, to collect all the arguments that were mentioned in previous discussions on this topic in one place:
Current archetypical application generates something like this:

```
foo = bytes_to_str(
  bytes_terminate(
    bytes_strip_right(
      io.read_bytes(20),
      43 // padding byte
    ),
    64, // terminator byte
    false
  ),
  "UTF-8"
);
```

Obviously, applying a function like "bytes_terminate" that operates on strings rather than byte arrays (here "str_terminate") requires us to convert the byte array to a string first with "bytes_to_str", i.e. executing something like this:

```
foo = str_terminate(
  bytes_to_str(
    io.read_bytes(20),
  ),
  0x0000, // terminator char
);
```

The big catch is that the trailing garbage might contain something that is invalid in the chosen encoding. Unfortunately, contrary to popular opinion, this is true even for UTF16. For example, if we're working with C++ or PHP (and an iconv-based implementation, which converts everything that should be treated as a string internally to UTF-8), this string will trigger an error on conversion:
Although we would expect result |
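The failure mode described in this comment can be reproduced in Python as well. The following is only a minimal sketch: the buffer contents and both helper names are made up for illustration, and Python's built-in codec stands in for iconv. It shows that decoding the whole field before terminating fails on an invalid tail, while terminating on a 2-byte boundary first and decoding afterwards succeeds.

```python
# Hypothetical 8-byte field: "hi" in UTF-16LE, a 2-byte null terminator,
# then trailing garbage that is not valid UTF-16 (a lone low surrogate).
field = "hi".encode("utf-16-le") + b"\x00\x00" + b"\x00\xdc"


def decode_then_terminate(buf):
    # Decode the whole buffer first, then cut at the first U+0000 character.
    text = buf.decode("utf-16-le")  # raises on the invalid tail
    return text.split("\x00", 1)[0]


def terminate_then_decode(buf, unit_size=2):
    # Cut at the first all-zero code unit on a unit boundary, then decode.
    null_unit = b"\x00" * unit_size
    for pos in range(0, len(buf) - unit_size + 1, unit_size):
        if buf[pos:pos + unit_size] == null_unit:
            return buf[:pos].decode("utf-16-le")
    return buf.decode("utf-16-le")


try:
    decode_then_terminate(field)
    decode_first_failed = False
except UnicodeDecodeError:
    decode_first_failed = True

result = terminate_then_decode(field)
```

This is why the byte-level approach needs a terminator that is aware of the code unit size: the string layer never gets a chance to see the terminator if the garbage after it cannot be decoded.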
Windows version info resources, found in executables and .res files, use them. |
VS_VERSIONINFO per se has no variable-length strings. However, it includes StringFileInfo → StringTable → String, which actually includes them. What's even more peculiar, actually, it seems that there are literally double-null strings to store two-level lists: https://devblogs.microsoft.com/oldnewthing/20091008-00/?p=16443 — I believe they effectively become quad-null terminated strings when laid out in UTF-16. Can anyone confirm/deny that? |
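To make the "quad-null" layout concrete, here is a small sketch of splitting a double-null-terminated string list stored as UTF-16LE. The function name and the sample data are hypothetical, and little-endian byte order is an assumption. Each string ends with one 2-byte null and the list ends with an empty string, so the last non-empty entry is followed by four zero bytes.

```python
def split_multi_sz_utf16le(buf):
    """Split a UTF-16LE double-null-terminated string list into strings."""
    strings = []
    start = 0
    # Step through the buffer one 2-byte code unit at a time.
    for pos in range(0, len(buf) - 1, 2):
        if buf[pos:pos + 2] == b"\x00\x00":
            if pos == start:
                break  # an empty string marks the end of the list
            strings.append(buf[start:pos].decode("utf-16-le"))
            start = pos + 2
    return strings


# Invented sample: two entries followed by the list terminator.
sample = (
    "C:\\".encode("utf-16-le") + b"\x00\x00"
    + "D:\\".encode("utf-16-le") + b"\x00\x00"
    + b"\x00\x00"
)
drives = split_multi_sz_utf16le(sample)
```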
Good clarification on the byte alignment requirement. I believe that is correct regarding the quad-null termination for WCHAR string lists. The only other documentation I have seen is in some of the APIs that return data in this style such as GetLogicalDriveStrings, but I don't see any other way it could be interpreted. The Windows

```c
typedef struct _KUSER_SHARED_DATA
{
    ULONG TickCountLowDeprecated;
    ULONG TickCountMultiplier;
    KSYSTEM_TIME InterruptTime;
    KSYSTEM_TIME SystemTime;
    KSYSTEM_TIME TimeZoneBias;
    WORD ImageNumberLow;
    WORD ImageNumberHigh;
    WCHAR NtSystemRoot[260];
    ULONG MaxStackTraceDepth;
    ...
}
```

http://www.geoffchappell.com/studies/windows/km/ntoskrnl/structs/kuser_shared_data.htm |
I've implemented a simple Windows resource parser as an exercise: https://github.com/kaitai-io/windows_resource_file.ksy/blob/master/windows_resource_file.ksy. For now it uses a hack to parse strings (especially because there's an extra twist there: the same byte space is used to designate numeric and string IDs). While doing so, I've noticed that there are at least 2 very distinct cases we're talking about here:
That means my original proposal is wrong. We indeed need both (1) and (2) implemented, not only (2). |
Construct has both (fixed-length string with filler, and c-string). Admittedly it took me 2 years to get it implemented. I have never seen a protocol that actually used the first either, and I can't even imagine why anyone would design such a protocol in the first place, but it was requested on the Construct forum more than once, so I guess someone out there actually needs it, so here we are. @GreyCat If you give me the go-ahead, I will add 2 methods to the Python runtime to implement this, but you would need to update the compiler (translator). I could update C# then too. The restriction is that these 2 methods need to know what the encoding is, or rather what the unit size is (2 for UTF16, 4 for UTF32, 1 for UTF8). |
Let's start with inventing some ksy syntax that covers all the cases discussed in this ticket. |
Fixed string: "type: str, size: N, terminator: 0, encoding: utf16, [unitsize: T=2]" (EDIT: added terminator)
Reads N bytes, then successively strips the last T bytes if they are null, down to an empty string. If the encoding is recognizable, like UTF*, unitsize can be inferred. N must be a multiple of T.

CString: "type: str, terminator: 0, encoding: utf16, [unitsize: T=2]"
Reads T bytes at a time, until that chunk is all null bytes. If the first chunk is nulls, it's an empty string. If the encoding is recognizable, like UTF*, unitsize can be inferred. A terminator other than 0 should be a compile error, because at least the UTF encodings support only one way of terminating a string.

By recognizable encodings I mean these: construct.possiblestringencodings = {'U16': 2, 'utf_8': 1, 'utf32': 4, 'utf_32_le': 4, 'utf8': 1, 'utf_32_be': 4, 'utf_32': 4, 'utf_16_be': 2, 'U32': 4, 'utf16': 2, 'ascii': 1, 'utf_16': 2, 'utf_16_le': 2, 'U8': 1} |
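The two reading strategies proposed here could look roughly like this in Python. This is a sketch of the semantics as worded in this comment, not the Construct or Kaitai Struct implementation. The function names are hypothetical and `unit_size` is passed explicitly instead of being inferred from the encoding.

```python
import io


def read_fixed_str(stream, size, encoding, unit_size):
    """Read `size` bytes, then strip trailing all-null units of `unit_size` bytes."""
    if size % unit_size != 0:
        raise ValueError("size must be a multiple of unit_size")
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("not enough bytes for fixed-size string")
    null_unit = b"\x00" * unit_size
    # The length stays a multiple of unit_size, so stripping is unit-aligned.
    while data.endswith(null_unit):
        data = data[:-unit_size]
    return data.decode(encoding)


def read_cstring(stream, encoding, unit_size):
    """Read `unit_size` bytes at a time until an all-null unit is found."""
    null_unit = b"\x00" * unit_size
    chunks = []
    while True:
        unit = stream.read(unit_size)
        if len(unit) < unit_size:
            raise EOFError("terminator not found")
        if unit == null_unit:
            return b"".join(chunks).decode(encoding)
        chunks.append(unit)


# Invented data: a 10-byte fixed field holding "ab", then a c-string "cd".
buf = io.BytesIO(
    "ab".encode("utf-16-le") + b"\x00" * 6
    + "cd".encode("utf-16-le") + b"\x00\x00"
)
fixed = read_fixed_str(buf, 10, "utf-16-le", 2)
cstr = read_cstring(buf, "utf-16-le", 2)
```

Reading whole units keeps a zero byte that is half of a non-null character from being mistaken for part of the terminator.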
There is also a problem when both size and terminator are used.

```yaml
meta:
  id: test1
seq:
  - id: value
    type: str
    terminator: 0
    size: 10
    encoding: utf-8
```

Compiles into the following. Problem is, bytes_terminate only supports a single-byte terminator.

```python
(KaitaiStream.bytes_terminate(self._io.read_bytes(10), 0, False)).decode(u"utf-8")
```

Extract from the runtime:

```python
def bytes_terminate(data, term, include_term):
    new_len = 0
    max_len = len(data)
    while new_len < max_len and data[new_len] != term:  #<--- indexing not slicing
        new_len += 1  #<--- non variable
    if include_term and new_len < max_len:
        new_len += 1
    return data[:new_len]
```

|
|
To support UTF16/32, bytes_terminate would need to slice |
It doesn't, that's true. My main concern here is that technically we should not deal with bytes at all: if we're dealing with encodings like UTF16, for example, in C++, that would call for If we stick with a bytes-centric implementation, however, from the runtime's point of view, we can probably cover all possible cases by specifying something like `byte[] bytesTerminateMulti(byte[] bytes, byte[] term, int unitSize, boolean includeTerm)` and the same thing about `public static byte[] bytesStripRight(byte[] bytes, int unitSize, byte[] padBytes) {` |
Good idea, it would be better to have a separate (multi) method instead of generalizing the existing (single) method, because multi will be slower than single. Would the unitsize parameter actually be needed? I think it's just len(term) and len(padbytes). |
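A multi-byte terminate helper that takes the unit size from `len(term)`, as suggested here, might be sketched as follows. This is an illustration written for this discussion, not the code that ended up in any runtime, and it assumes that the terminator should only match on unit-aligned offsets.

```python
def bytes_terminate_multi(data, term, include_term):
    """Cut `data` at the first unit-aligned occurrence of `term`."""
    unit_size = len(term)
    pos = data.find(term)
    while pos != -1:
        if pos % unit_size == 0:
            end = pos + unit_size if include_term else pos
            return data[:end]
        # Unaligned match: resume the search at the next unit boundary.
        pos = data.find(term, pos + unit_size - pos % unit_size)
    return data  # no terminator found, keep everything


# "A" (41 00) followed by U+0100 (00 01) contains an unaligned 00 00 pair
# at offset 1, which must not be treated as the terminator.
data = "A\u0100".encode("utf-16-le") + b"\x00\x00" + b"junk"
text = bytes_terminate_multi(data, b"\x00\x00", False).decode("utf-16-le")
```

Keeping this separate from the single-byte version leaves the existing fast path untouched, at the cost of one more runtime method per language.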
Specifying Also, I think that this place is more than anything warrants plenty of language-specific APIs. For example, in C/C++, you don't want to pass an array around, you'd want one |
I thought it should be only unit-aligned version. The usecase for aligned would be obviously strings, but what would be the usecase for unaligned? |
BTW, do we really have to carve |
It probably only works with single-null terminator, right? This issue is about supporting UTF16 and 32. |
|
So someone uses it with UTF32 encoding, then what? |
I'd say that from a practical point of view, the only system that uses UTF16 wide chars is Windows. Virtually everyone else uses UTF32 there. |
Python unicode strings use 1/2/4 bytes per character, depending on actual text. |
The question @KOLANICH raised was about Let's get back to original topic.
I believe someone further up in this issue provided some examples of why that would be useful. I recall some strings in UTF16 wanted a 4-byte [0, 0, 0, 0] terminator and stuff like that. |
@GreyCat Makes sense, thanks for commenting. I'll probably keep it simple then and only implement Anything more complex will be the subject of #538. To be clear, does that mean that we will not pursue #158, and will instead move that to #538 as well? If so, perhaps we should close #158 in favor of #538 for clarity. On second thought, I'm not quite sure what all #158 suggests, it sounds like a bunch of different vaguely specified things:
So maybe it's not completely covered by #538 after all. |
Generally, I would agree. I believe that design with "scanners" concept (#538) in general is solid, it's just a question of careful implementation.
All these things absolutely can be covered with custom scanning procedures. We can do a library of built-in scanning procedures in different languages, that's for sure, but ultimately it will be up to the user to pick and use one that fits their purpose. |
See kaitai-io/kaitai_struct#187 Based on the existing Python implementation: kaitai-io/kaitai_struct_python_runtime@07aea9c
See kaitai-io/kaitai_struct#187 bytes_terminate_multi() is similar to the existing implementation in Java: https://github.com/kaitai-io/kaitai_struct_java_runtime/blob/deb426e24ff1b75d537b7d903f5a971cae540987/src/main/java/io/kaitai/struct/KaitaiStream.java#L353-L365 read_bytes_term_multi() is similar to the existing implementation in Python: https://github.com/kaitai-io/kaitai_struct_python_runtime/blob/07aea9c6cdb1cc5be8677004680382602d7323f3/kaitaistruct.py#L434-L457
See kaitai-io/kaitai_struct#187 bytes_terminate_multi() is essentially identical to the existing implementation in Java: https://github.com/kaitai-io/kaitai_struct_java_runtime/blob/deb426e24ff1b75d537b7d903f5a971cae540987/src/main/java/io/kaitai/struct/KaitaiStream.java#L353-L365 read_bytes_term_multi() is similar to the existing implementation in Python: https://github.com/kaitai-io/kaitai_struct_python_runtime/blob/07aea9c6cdb1cc5be8677004680382602d7323f3/kaitaistruct.py#L434-L457
See kaitai-io/kaitai_struct#187 Based on the existing Python implementation: kaitai-io/kaitai_struct_python_runtime@812ae7e...9bdaeb3
See kaitai-io/kaitai_struct#187 BytesTerminateMulti() is similar to the existing implementation in Java: https://github.com/kaitai-io/kaitai_struct_java_runtime/blob/20af3acef2959778853bb659756cf18092ae2420/src/main/java/io/kaitai/struct/KaitaiStream.java#L353-L373 ReadBytesTermMulti() is similar to the existing implementation in Python: https://github.com/kaitai-io/kaitai_struct_python_runtime/blob/9bdaeb32a844dd1ed44c83ef969e483e5d7b736e/kaitaistruct.py#L434-L457
See kaitai-io/kaitai_struct#187 bytesTerminateMulti() follows the existing implementation in Java: https://github.com/kaitai-io/kaitai_struct_java_runtime/blob/20af3acef2959778853bb659756cf18092ae2420/src/main/java/io/kaitai/struct/KaitaiStream.java#L353-L373 readBytesTermMulti() has an original implementation written specifically for the JavaScript runtime, but the semantics should be exactly the same as in other languages.
See kaitai-io/kaitai_struct#187 bytesTerminateMulti() follows the existing implementation in Java: https://github.com/kaitai-io/kaitai_struct_java_runtime/blob/20af3acef2959778853bb659756cf18092ae2420/src/main/java/io/kaitai/struct/KaitaiStream.java#L353-L373 readBytesTermMulti() is similar to the existing implementation in Python: https://github.com/kaitai-io/kaitai_struct_python_runtime/blob/9bdaeb32a844dd1ed44c83ef969e483e5d7b736e/kaitaistruct.py#L434-L457
See kaitai-io/kaitai_struct#187 The new methods used in this commit have been implemented in kaitai-io/kaitai_struct_python_runtime@07aea9c
See kaitai-io/kaitai_struct#187 The new methods used in this commit have been implemented in kaitai-io/kaitai_struct_java_runtime@6a321d1
See kaitai-io/kaitai_struct#187 The new methods used in this commit have been implemented in kaitai-io/kaitai_struct_lua_runtime@aa9fa84 Unfortunately, it cannot be said that this adds UTF-16 and UTF-32 support to `type: strz`, because the Lua runtime library doesn't support these encodings yet. It only supports ASCII and UTF-8, see https://github.com/kaitai-io/kaitai_struct_lua_runtime/blob/a7b8a9144f0978cd75d3b176c61a773be56370c0/string_decode.lua#L35-L43 So this is more of a promise of future support for kaitai-io/kaitai_struct#187 once UTF-16 and UTF-32 encodings are implemented.
See kaitai-io/kaitai_struct#187 The new methods used in this commit have been implemented in kaitai-io/kaitai_struct_cpp_stl_runtime@57ab3fa
See kaitai-io/kaitai_struct#187 The new methods used in this commit have been implemented in kaitai-io/kaitai_struct_ruby_runtime@a203e14
See kaitai-io/kaitai_struct#187 The new methods used in this commit have been implemented in kaitai-io/kaitai_struct_php_runtime@78f00d9
See kaitai-io/kaitai_struct#187 The new methods used in this commit have been implemented in kaitai-io/kaitai_struct_csharp_runtime@6556820
See kaitai-io/kaitai_struct#187 The new methods used in this commit have been implemented in kaitai-io/kaitai_struct_go_runtime@83c47c9
See kaitai-io/kaitai_struct#187 The new methods used in this commit have been implemented in kaitai-io/kaitai_struct_javascript_runtime@a911d62
See kaitai-io/kaitai_struct#187 The new methods used in this commit have been implemented in kaitai-io/kaitai_struct_perl_runtime@66262b7
See kaitai-io/kaitai_struct#187 The new methods used in this commit have been implemented in kaitai-io/kaitai_struct_nim_runtime@f42f323
See kaitai-io/kaitai_struct#187 The set of TermStrzUtf16* tests follows the TermStrz{,2,3,4} tests on the `serialization` branch - see https://github.com/kaitai-io/kaitai_struct_tests/tree/629484b021cf5835e9bfee40bc621f0108120b7c/formats
This is now implemented in all 11 target languages that Kaitai Struct supports (C++/STL, C#, Go, Java, JavaScript, Lua, Nim, Perl, PHP, Python, Ruby). Unfortunately, it cannot be used in Lua yet, because our Lua runtime library doesn't support UTF-16* or UTF-32* encodings at all, but it has been implemented there as well, so it will be available once the UTF-{16,32}* encodings are implemented (see kaitai-io/kaitai_struct_compiler@b0cbf6d). The corresponding tests at https://ci.kaitai.io/ are called The fact that they don't pass in Nim at the moment is an infrastructure issue that will disappear after the Docker image for Nim gets updated (by the CI pipeline at https://github.com/kaitai-io/kaitai_struct_docker_images). Unlike all other languages, the Nim Docker image bundles the runtime library inside ( I also ran into this problem when testing locally and fixed it by adding When tested locally, the new tests passed in Nim. |
This issue is very similar to #13 and there is a lot of relevant discussion there.
Observed
When strz is used with UTF-16 or UTF-32, a single null byte is enough to result in termination rather than 2 (U16) or 4 (U32) bytes.
Expected
strz parse only terminates when the corresponding number of consecutive null bytes appear.
To describe the use case, I am writing a descriptor for the output of an internal Windows tool which reserves a fixed length for the string then null terminates it. While this practice is a bit inefficient and I can't point to any widely used formats that do the same, I suspect that it is not uncommon for Windows programs, since it is analogous to doing so with UTF-8 in a cross-platform or Unix scenario.
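The observed behavior can be illustrated with the single-byte terminate logic quoted from the Python runtime on this page. The sketch below copies that logic into a standalone function and uses invented field contents. The high byte of the first ASCII character is already zero, so the string is cut after one byte and the remainder is no longer valid UTF-16.

```python
def bytes_terminate(data, term, include_term):
    # Single-byte version, following the runtime extract quoted in this thread.
    new_len = 0
    max_len = len(data)
    while new_len < max_len and data[new_len] != term:
        new_len += 1
    if include_term and new_len < max_len:
        new_len += 1
    return data[:new_len]


# Hypothetical field contents: "AB" in UTF-16LE plus a 2-byte terminator.
raw = "AB".encode("utf-16-le") + b"\x00\x00"
cut = bytes_terminate(raw, 0, False)  # stops at the zero byte inside "A"

try:
    cut.decode("utf-16-le")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False  # a single leftover byte is truncated UTF-16
```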
On Windows, an example C struct might look like this:
For an example data blob and ksy file that minimally reproduce the issue (at least with the web IDE), please see the following zip:
utf16_test.zip