Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoding value casing (eg. ASCII) #391

Closed
koczkatamas opened this issue Mar 25, 2018 · 10 comments
Closed

Encoding value casing (eg. ASCII) #391

koczkatamas opened this issue Mar 25, 2018 · 10 comments
Milestone

Comments

@koczkatamas
Copy link
Member

Currently the JS runtime only allows the encoding value to be ASCII, but not ascii:

https://github.com/kaitai-io/kaitai_struct_javascript_runtime/blob/8fedb4d2f278f55d0d8fcf24a9bbf1c1ea2b150f/KaitaiStream.js#L511-L513

I checked the formats repo and 9 file formats used lowercase ascii at 23 occurrences.

Should we fix the formats or should the runtimes accept lowercase ascii too?

(I presume we should fix the runtimes, but maybe add a KSY styling rule which casing should be used generally?)

@KOLANICH
Copy link

I vote for matching against lower case strings, but using toLowerCase on encoding string first.

@GreyCat
Copy link
Member

GreyCat commented Mar 25, 2018

Totally agree, runtimes should treat these strings in case-insensitive way (actually, I thought all of them behave that way already — and they almost do so), and we should declare some canonical way to spell them in the style guide. I'd go with lowercasing everything, as it's simple and

Also, I'd declare a canonical way to spell widespread encoding names, i.e.:

  • utf8 vs utf-8
  • utf-16-le vs utf16-le vs utf-16le vs utf16le
  • iso8859-1 vs iso-8859-1 vs iso88591

And, while we're there, some encoding names should be definitely banned, for example, utf16 and ucs2 (as it lacks information on endianness and relies of current platform's native endianness) and I'm somewhat reluctant about ucs2le and ucs2be (as it's kind of hard to find true UCS2 encoding parser, not UTF16 one).

@KOLANICH
Copy link

KOLANICH commented Mar 25, 2018

And, while we're there, some encoding names should be definitely banned, for example, utf16 and ucs2 (as it lacks information on endianness and relies of current platform's native endianness) and I'm somewhat reluctant about ucs2le and ucs2be (as it's kind of hard to find true UCS2 encoding parser, not UTF16 one).

I'm against it.
1 If we specify endianness in meta I guess we mean that text encodings would also have that endianness.
2 platform endianness is not that bad and may be useful somewhere, for example if some system service, for example kernel, produces binary messages in machine endianness and we need to read these messages from the languages other than C/C++ (which has the natural way - just to include the header used to build that service and use the type defined in it).

@arekbulski
Copy link
Member

Also, I'd declare a canonical way to spell widespread encoding names

On Python _ and - are interchangable, like in: utf_8 utf-8 utf8. See python docs.

some encoding names should be definitely banned, for example, utf16

If the issue is that this encoding has no endianness, then I should also point out the same python docs. UTF32 and UTF16 I think as well, have BOM markers within the stream that define the endianness. So the encodings are valid and endianness is known at runtime (well, at parsing time).

If we specify endianness in meta I guess we mean that text encodings would also have that endianness.

It would be better not to couple those, not without some guarantees that is the case.

@koczkatamas
Copy link
Member Author

Okay, I will fix the JS runtime then and check other runtimes if they have the same issue (I don't expect them).

Quick search gave me the following "canonical" names for encodings (if I am not mistaken): UTF-8, UTF-16LE, ISO-8859-1. So I presume we could use their lowercase variants: utf-8, utf-16le and iso-8859-1, etc.

((About modifying the existing formats in the repo: I think it won't be necessary now to do it, but it will be enough later when we will(?) enforce the style guide for formats in the repo.))

@GreyCat
Copy link
Member

GreyCat commented Mar 26, 2018

There is actually #116 where I've started doing massive encodings table for it a long time ago: https://docs.google.com/spreadsheets/d/1l87kGi9_U4Xrgaw2CGaTc9-_f5UEf1nf-68Dk_e1_iA/edit?usp=sharing

So, yeah, I believe we should stick to "canonical" spellings as per IANA "preferred MIME name", or just IANA's "name", if it's there.

@koczkatamas
Copy link
Member Author

US-ASCII looks pretty weird for me, I would keep ascii as an alias (compatibility reasons and simply it's easier to write down). But the others look okay.

@GreyCat
Copy link
Member

GreyCat commented Mar 26, 2018

There's no universal agreement in the world for all these names, so the very point of that table is to find some common ground and the foundation. While it's probably a very long term goal, what I propose now is to:

  • add a "white list" of encodings that are well-known, canonically named and are guaranteed to work in all languages
  • add a "grey list" of encodings which are known to be aliases and non-canonical spellings; for these, ksc would map them into canonical names in all languages, and issue a warning to the user
  • add a "black list" of encodings that should not be allowed; using these should break the build and clear error should be given to the user

@koczkatamas
Copy link
Member Author

Fixed the issue in the JS runtime (kaitai-io/kaitai_struct_javascript_runtime@dd658ee) and released new npm+Github version (0.8.0-SNAPSHOT.11).

Checked the other runtimes and I did not find similar issue in them (usually they just pass the encoding value to the platforms native text decoder; in Lua it already lowercases the encoding before it compares to "ascii").

From my side this issue can be closed, but we can continue to use it for what @GreyCat just described in the previous comment.

@GreyCat
Copy link
Member

GreyCat commented Mar 26, 2018

Let's close this one ;) I'll add relevant issues separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants