Encoding value casing (eg. ASCII) #391

koczkatamas · 2018-03-25T13:51:07Z

Currently the JS runtime only allows the encoding value to be ASCII, but not ascii:

https://github.com/kaitai-io/kaitai_struct_javascript_runtime/blob/8fedb4d2f278f55d0d8fcf24a9bbf1c1ea2b150f/KaitaiStream.js#L511-L513

I checked the formats repo and 9 file formats used lowercase ascii at 23 occurrences.

Should we fix the formats or should the runtimes accept lowercase ascii too?

(I presume we should fix the runtimes, but maybe add a KSY styling rule which casing should be used generally?)

The text was updated successfully, but these errors were encountered:

KOLANICH · 2018-03-25T19:51:31Z

I vote for matching against lower case strings, but using toLowerCase on encoding string first.

GreyCat · 2018-03-25T21:06:40Z

Totally agree, runtimes should treat these strings in case-insensitive way (actually, I thought all of them behave that way already — and they almost do so), and we should declare some canonical way to spell them in the style guide. I'd go with lowercasing everything, as it's simple and

Also, I'd declare a canonical way to spell widespread encoding names, i.e.:

utf8 vs utf-8
utf-16-le vs utf16-le vs utf-16le vs utf16le
iso8859-1 vs iso-8859-1 vs iso88591

And, while we're there, some encoding names should be definitely banned, for example, utf16 and ucs2 (as it lacks information on endianness and relies of current platform's native endianness) and I'm somewhat reluctant about ucs2le and ucs2be (as it's kind of hard to find true UCS2 encoding parser, not UTF16 one).

KOLANICH · 2018-03-25T21:16:10Z

And, while we're there, some encoding names should be definitely banned, for example, utf16 and ucs2 (as it lacks information on endianness and relies of current platform's native endianness) and I'm somewhat reluctant about ucs2le and ucs2be (as it's kind of hard to find true UCS2 encoding parser, not UTF16 one).

I'm against it.
1 If we specify endianness in meta I guess we mean that text encodings would also have that endianness.
2 platform endianness is not that bad and may be useful somewhere, for example if some system service, for example kernel, produces binary messages in machine endianness and we need to read these messages from the languages other than C/C++ (which has the natural way - just to include the header used to build that service and use the type defined in it).

arekbulski · 2018-03-25T21:45:59Z

Also, I'd declare a canonical way to spell widespread encoding names

On Python _ and - are interchangable, like in: utf_8 utf-8 utf8. See python docs.

some encoding names should be definitely banned, for example, utf16

If the issue is that this encoding has no endianness, then I should also point out the same python docs. UTF32 and UTF16 I think as well, have BOM markers within the stream that define the endianness. So the encodings are valid and endianness is known at runtime (well, at parsing time).

If we specify endianness in meta I guess we mean that text encodings would also have that endianness.

It would be better not to couple those, not without some guarantees that is the case.

koczkatamas · 2018-03-25T22:05:24Z

Okay, I will fix the JS runtime then and check other runtimes if they have the same issue (I don't expect them).

Quick search gave me the following "canonical" names for encodings (if I am not mistaken): UTF-8, UTF-16LE, ISO-8859-1. So I presume we could use their lowercase variants: utf-8, utf-16le and iso-8859-1, etc.

((About modifying the existing formats in the repo: I think it won't be necessary now to do it, but it will be enough later when we will(?) enforce the style guide for formats in the repo.))

GreyCat · 2018-03-26T12:20:03Z

There is actually #116 where I've started doing massive encodings table for it a long time ago: https://docs.google.com/spreadsheets/d/1l87kGi9_U4Xrgaw2CGaTc9-_f5UEf1nf-68Dk_e1_iA/edit?usp=sharing

So, yeah, I believe we should stick to "canonical" spellings as per IANA "preferred MIME name", or just IANA's "name", if it's there.

koczkatamas · 2018-03-26T12:39:13Z

US-ASCII looks pretty weird for me, I would keep ascii as an alias (compatibility reasons and simply it's easier to write down). But the others look okay.

GreyCat · 2018-03-26T13:21:24Z

There's no universal agreement in the world for all these names, so the very point of that table is to find some common ground and the foundation. While it's probably a very long term goal, what I propose now is to:

add a "white list" of encodings that are well-known, canonically named and are guaranteed to work in all languages
add a "grey list" of encodings which are known to be aliases and non-canonical spellings; for these, ksc would map them into canonical names in all languages, and issue a warning to the user
add a "black list" of encodings that should not be allowed; using these should break the build and clear error should be given to the user

koczkatamas · 2018-03-26T21:54:34Z

Fixed the issue in the JS runtime (kaitai-io/kaitai_struct_javascript_runtime@dd658ee) and released new npm+Github version (0.8.0-SNAPSHOT.11).

Checked the other runtimes and I did not find similar issue in them (usually they just pass the encoding value to the platforms native text decoder; in Lua it already lowercases the encoding before it compares to "ascii").

From my side this issue can be closed, but we can continue to use it for what @GreyCat just described in the previous comment.

GreyCat · 2018-03-26T21:56:21Z

Let's close this one ;) I'll add relevant issues separately.

GreyCat closed this as completed Mar 26, 2018

GreyCat added this to the v0.9 milestone Mar 26, 2018

GreyCat added the bug found label Mar 26, 2018

GreyCat mentioned this issue Mar 26, 2018

Encodings: black / grey / white list #393

Closed

3 tasks

KOLANICH mentioned this issue May 25, 2019

Add Python Pickle format kaitai-io/kaitai_struct_formats#150

Merged

1 task

generalmimon mentioned this issue Jan 15, 2021

Mii Structs kaitai-io/kaitai_struct_formats#400

Closed

KOLANICH mentioned this issue Feb 13, 2022

Add EDF parser kaitai-io/kaitai_struct_formats#581

Open

generalmimon mentioned this issue Mar 28, 2022

add utf16 string types to golang generator kaitai-io/kaitai_struct_compiler#244

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Encoding value casing (eg. ASCII) #391

Encoding value casing (eg. ASCII) #391

koczkatamas commented Mar 25, 2018

KOLANICH commented Mar 25, 2018

GreyCat commented Mar 25, 2018

KOLANICH commented Mar 25, 2018 •

edited

Loading

arekbulski commented Mar 25, 2018

koczkatamas commented Mar 25, 2018

GreyCat commented Mar 26, 2018

koczkatamas commented Mar 26, 2018

GreyCat commented Mar 26, 2018

koczkatamas commented Mar 26, 2018

GreyCat commented Mar 26, 2018

Encoding value casing (eg. ASCII) #391

Encoding value casing (eg. ASCII) #391

Comments

koczkatamas commented Mar 25, 2018

KOLANICH commented Mar 25, 2018

GreyCat commented Mar 25, 2018

KOLANICH commented Mar 25, 2018 • edited Loading

arekbulski commented Mar 25, 2018

koczkatamas commented Mar 25, 2018

GreyCat commented Mar 26, 2018

koczkatamas commented Mar 26, 2018

GreyCat commented Mar 26, 2018

koczkatamas commented Mar 26, 2018

GreyCat commented Mar 26, 2018

KOLANICH commented Mar 25, 2018 •

edited

Loading