Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Win32API-based encoding option to C++ runtime #61

Merged
merged 4 commits into from
Jul 14, 2023

Conversation

GreyCat
Copy link
Member

@GreyCat GreyCat commented Jul 12, 2023

This adds capability to convert string encodings in KS runtime C++ on Windows using only system-supplied Win32 APIs. This offers alternative to bringing external iconv library (which is the only way we support string conversion in C++ now).

Other changes worth mentioning:

  • This assumes standardized encoding names, as runtime now does conversion from encoding name in string format into Windows codepage (which is an integer).
    • Actual discussion around standard for encoding names is happening in List of supported encodings kaitai_struct#116, and there's a table there which I've started filling out to decide on canonical encoding names.
    • In future, ksc will make sure that runtime always gets passed a canonical name, and will either issue a warning or an error to the user for specifying non-canonical name in their .ksy files.
    • Adjusted the tests to match these now-canonical names.
  • This conversion (string->integer) is costly (and mostly unnecessary to do in runtime), so in future I will offer a way to do convert encoding string in compile-time.
  • To be able to do that, I've started switching from std::string src_enc to const char* src_enc, which paves the way to optimize it further.
  • Added UTF-16BE test.
  • Reformatted most of the encoding tests to spell out exact codepoints which are returned.

@GreyCat GreyCat force-pushed the add_win32api_encodings branch from 058cfdc to 2939b29 Compare July 12, 2023 14:16
@generalmimon
Copy link
Member

@GreyCat:

  • This assumes standardized encoding names, as runtime now does conversion from encoding name in string format into Windows codepage (which is an integer).

Well... why not - given that STRING_ENCODING_TYPE = "WIN32API" is a new feature, it can provide partial support, but without accompanying compiler changes to canonicalize the encoding names, this implementation will be considerably stricter than the vast majority of target languages we support. What I have in mind is that the encoding key should for example:

  • be case-insensitive (e.g. if ASCII works, then ascii or Ascii should too, with or without eventual warnings)
  • tolerate typographical mistakes in hyphens (i.e. not only canonical spellings like UTF-8 and UTF-16BE should be accepted, but also utf8, utf_8, UTF16BE, UTF16-BE or UTF-16-BE, etc.)
  • as you know, sometimes there are several distinct similarly popular names for one encoding, e.g. Shift_JIS vs. SJIS, etc.

I support the idea of having a clear idea what encodings Kaitai Struct supports and what their canonical names are (kaitai-io/kaitai_struct#393 (comment)), but I think we should do it in reverse order (first have the compiler canonicalize, then we can potentially take advantage of accepting only canonical names in runtime libraries).

@GreyCat
Copy link
Member Author

GreyCat commented Jul 13, 2023

@generalmimon That's exactly what I've started with, but then I've realized that implementing all that in one single branch in C++ is roughly same amount of work as implementing this in the compiler, so I went ahead and done kaitai-io/kaitai_struct_compiler#254. Please take a look too?

tests/unittest.cpp Outdated Show resolved Hide resolved
@GreyCat GreyCat requested a review from generalmimon July 13, 2023 19:19
kaitai/kaitaistream.h Show resolved Hide resolved
kaitai/kaitaistream.cpp Outdated Show resolved Hide resolved
kaitai/kaitaistream.cpp Outdated Show resolved Hide resolved
@GreyCat GreyCat merged commit 1f2837e into master Jul 14, 2023
@GreyCat GreyCat deleted the add_win32api_encodings branch July 14, 2023 11:29
static std::string bytes_to_str(std::string src, std::string src_enc);
static std::string bytes_to_str(const std::string src, const char *src_enc);
Copy link
Member

@generalmimon generalmimon Jul 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GreyCat As evidenced by our CI dashboard (https://ci.kaitai.io/), this change breaks half of our tests for target cpp_stl_11/msvc141_windows_x64 (see kaitai-io/ci_artifacts@cf98b22#diff-ddeb848e50b284745134d0b3371b48f0cf63c52ed1d2f1134ff625c1217b57a3). The ci.json diff starts as (I simplified it a bit and only showed the first 2 tests that regressed, out of a total of 62):

diff --git 1/ci_json-cpp_stl_11-msvc141_windows_x64-f8948e3f-good.json 2/ci_json-cpp_stl_11-msvc141_windows_x64-cf98b222-bad.json
index d9b49a7..fe807ae 100644
--- 1/ci_json-cpp_stl_11-msvc141_windows_x64-f8948e3f-good.json
+++ 2/ci_json-cpp_stl_11-msvc141_windows_x64-cf98b222-bad.json
@@ -1,13 +1,13 @@
 {
   "$meta": {
     "lang": "cpp_stl",
-    "timestamp": "2023-07-08T22:54:28Z",
+    "timestamp": "2023-07-14T19:03:17Z",
     "ci": {
       "ci": "appveyor",
-      "build_id": "47500100",
-      "job_id": "a5x62yaygwv8fuxc",
+      "build_id": "47552134",
+      "job_id": "1rj2gtnxn51dxib8",
       "job_number": "3",
-      "url": "https://ci.appveyor.com/project/kaitai-io/ci-targets/builds/47500100/job/a5x62yaygwv8fuxc"
+      "url": "https://ci.appveyor.com/project/kaitai-io/ci-targets/builds/47552134/job/1rj2gtnxn51dxib8"
     }
   },
   "BcdUserTypeBe": "passed",
@@ -29,13 +29,13 @@
   "BitsUnalignedB64Le": "passed",
   "BufferedStruct": "passed",
   "BytesPadTerm": "passed",
-  "CastNested": "passed",
+  "CastNested": "format_build_failed",
   "CastToImported": "passed",
   "CastToTop": "passed",
   "CombineBool": "passed",
   "CombineBytes": "passed",
   "CombineEnum": "passed",
-  "CombineStr": "passed",
+  "CombineStr": "format_build_failed",
   "Debug0": "passed",
   "DebugArrayUser": "passed",
   "DebugEnumName": "passed",
@@ ... @@

The error is the following (https://ci.appveyor.com/project/kaitai-io/ci-targets/builds/47552134/job/1rj2gtnxn51dxib8?fullLog=true#L492):

  cast_nested.cpp
C:\projects\ci-targets\tests\compiled\cpp_stl_11\cast_nested.cpp(90):
  error C2664: 'std::string kaitai::kstream::bytes_to_str(const std::string,const char *)':
    cannot convert argument 2 from 'std::string' to 'const char *'
    [C:\projects\ci-targets\tests\compiled\cpp_stl_11\bin\ks_tests.vcxproj]
  C:\projects\ci-targets\tests\compiled\cpp_stl_11\cast_nested.cpp(90):
    note: No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called

Line compiled/cpp_stl_11/cast_nested.cpp:90 is:

    m_value = kaitai::kstream::bytes_to_str(m__io->read_bytes_term(0, false, true, true), std::string("ASCII"));

But this apparently isn't a MSVC-specific issue. I've just checked cpp_stl_11/gcc4.8 and the impact looks very much the same (see kaitai-io/ci_artifacts@4ab4ae1#diff-ddeb848e50b284745134d0b3371b48f0cf63c52ed1d2f1134ff625c1217b57a3), again the same 62 regressions in total:

diff --git 1/ci_json-cpp_stl_11-gcc4.8-00689cd1-good.json 2/ci_json-cpp_stl_11-gcc4.8-4ab4ae15-bad.json
index fd80bd6..942318f 100644
--- 1/ci_json-cpp_stl_11-gcc4.8-00689cd1-good.json
+++ 2/ci_json-cpp_stl_11-gcc4.8-4ab4ae15-bad.json
@@ -1,13 +1,13 @@
 {
   "$meta": {
     "lang": "cpp_stl",
-    "timestamp": "2023-07-08T22:37:56Z",
+    "timestamp": "2023-07-14T18:46:24Z",
     "ci": {
       "ci": "github",
-      "run_id": "5496725152",
-      "run_number": "254",
-      "job_id": "14883558372",
-      "url": "https://github.com/kaitai-io/ci_targets/actions/runs/5496725152/jobs/10016910686"
+      "run_id": "5557284492",
+      "run_number": "255",
+      "job_id": "15054460788",
+      "url": "https://github.com/kaitai-io/ci_targets/actions/runs/5557284492/jobs/10150864944"
     }
   },
   "BcdUserTypeBe": "passed",
@@ -29,13 +29,13 @@
   "BitsUnalignedB64Le": "passed",
   "BufferedStruct": "passed",
   "BytesPadTerm": "passed",
-  "CastNested": "passed",
+  "CastNested": "format_build_failed",
   "CastToImported": "passed",
   "CastToTop": "passed",
   "CombineBool": "passed",
   "CombineBytes": "passed",
   "CombineEnum": "passed",
-  "CombineStr": "passed",
+  "CombineStr": "format_build_failed",
   "Debug0": "passed",
   "DebugArrayUser": "passed",
   "DebugEnumName": "passed",
@@ ... @@

The cpp_stl_11/gcc4.8 target isn't visible at https://ci.kaitai.io/ though, because the CI dashboard is configured to display cpp_stl_11/gcc4.8_linux, see src/App.vue:63:

  ["cpp_stl_11", "gcc4.8_linux"],

I assume pushing to cpp_stl_11/gcc4.8 was an oversight? I guess keeping the _linux suffix makes sense - we want to test C++ on different platforms, so it's a good idea to specify the platform in all C++ target names.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants