Add schema validation code to C generator #1098

tuxji · 2023-10-19T22:13:38Z

Modify C generator to compare floating point and integer numbers with enumerations and ranges specified in DFDL schemas. Test validation code with additional simple type root elements in simple.dfdl.xsd schema and additional TDML tests in simple_errors.tdml.

Also allow C generator to compare hexBinary elements to values in enumerations since Daffodil does the same thing and C generator already compares hexBinary elements to values in fixed attributes.

Allow C generator to ignore dfdl:assert expressions in DFDL schemas and generate code anyway instead of throwing an exception.

Allow C generator to get schema version from DFDL schemas and put it into the generated C code to version the generated code as well.

Found out Daffodil's default alignment properties (alignment="1" and alignmentUnits="bytes") makes Daffodil append extra fill bits to elements with odd sizes not a multiple of 8 bits. Because compatibility between Daffodil and DaffodilC depends on defining both dfdl:alignment="1" (default) and dfdl:alignmentUnits="bits" (non-default), define a known good binary data format in one place (network/format.dfdl.xsd) and include it in rest of daffodil-codegen-c's test schemas.

DAFFODIL-2853

BUILD.md: Document how to install iwyu and libcriterion-dev (used only for Daffodil C code generator development and maintenance).

README.md: Document how to check formatting of or reformat Daffodil.

c/files/.clang-format: Update to clang-format 14. Format declarations more concisely by removing AlignConsecutiveDeclarations (this is why some reformatted C files lose some extra whitespace).

c/files/libcli/daffodil_getopt.c: Skip and ignore -r and -s options so one can run "c/daffodil parse" with the same options as "daffodil parse -r root -s schema -o outfile infile" without having to remove these options.

c/files/libcli/daffodil_main.c: Initialize ParserOrUnparserState fields as well as PState/UState fields. Make sure any diagnostics will fail unparse as well as parse if validate mode is on.

c/files/libruntime/errors.[ch]: Add ERR_ENUM_MATCH and ERR_OUTSIDE_RANGE error messages to diagnose elements not matching any of their enumerations or having values outside their allowed ranges. Remove daffodil_program_version since we use daffodil_version in daffodil_version.h instead.

c/files/libruntime/infoset.h: Move common PState/UState fields to ParseOrUnparseState to allow parse and unparse functions to call common validate functions.

c/files/libruntime/parsers.[ch]: Use ParserOrUnparserState fields as well as PState/UState fields. Replace parse_check_bounds and parse_validate_fixed functions with validate_array_bounds and validate_fixed_attribute functions in validators.[ch].

c/files/libruntime/unparsers.[ch]: Use ParserOrUnparserState fields as well as UState fields. Replace unparse_check_bounds and unparse_validate_fixed functions with validate_array_bounds and validate_fixed_attribute functions in validators.[ch].

c/files/libruntime/validators.[ch]: Move common validation functions here from parsers.[ch] and unparsers.[ch]. Add new float, hexBinary, int, and universal validation functions to check that elements match their allowed enumerations (requires passing array of floating point numbers, hexBinary structs, or integer numbers) and fit within their allowed ranges. Validation functions create diagnostics instead of errors; the CLI or a caller is responsible for printing the diagnostics and exiting with the appropriate status code.

c/files/tests/bits.c: Use ParserOrUnparserState fields as well as PState/UState fields.

c/files/tests/extras.c: Regenerate iwyu comments.

c/DaffodilCCodeGenerator.scala: Accept but ignore dfdl:assert statements (for now).

c/DaffodilCExamplesGenerator.scala: Generate code from simple's all-in-one root element in order to show generated code for all simple types, enums, and ranges.

c/generators/AlignmentFillCodeGenerator.scala: Use ParserOrUnparserState fields as well as PState/UState fields.

c/generators/BinaryBooleanCodeGenerator.scala: Use ParserOrUnparserState fields as well as PState/UState fields.

c/generators/BinaryValueCodeGenerator.scala: Generate C code to validate enumerations and ranges of primitive elements. Remove escape characters in enumeration values added by Facets.escapeForRegex (turn "1.0" back into "1.0"). Avoid unsigned >= 0 comparisons to prevent gcc warnings. Use ParserOrUnparserState fields as well as PState/UState fields. Call correct function to validate enums depending on element's primType. Generate extra C initialization code for hexBinary enumerations to define arrays of hexBinary structs and pass them to validate_hexbinary_enumeration.

c/generators/CodeGeneratorState.scala: Remove unnecessary immutable (built-in Set is immutable). Use ParserOrUnparserState fields as well as PState/UState fields. Get actual schema version from Daffodil and assign its value to schema_version in generated_code.c. Declare schema_version in generated_code.h. Call validate_array_bounds instead of parse/unparse_check_bounds. Make cStructFieldAccess work correctly for DFDL expressions like "/rr:ReqstReply/..." used by NFS schemas.

c/generators/HexBinaryCodeGenerator.scala: Use ParserOrUnparserState fields as well as PState/UState fields. Call validate_fixed_attribute instead of parse/unparse_validate_fixed.

examples/**/generated_code.[ch]: Regenerate to show changes in C generator such as defining schema_version, using pu fields, and calling validation functions.

c/data/simple*.dat: Remove (contents now inside simple.tdml).

c/ex_nums.dfdl.xsd: Define schema version to be "1.0.2". Include network format instead of defining own binary format. Adjust elements' own format properties as needed to match original data file (explicitness is better than using defaults).

c/infosets/simple*.xml: Remove (contents now inside simple.tdml).

c/nested.dfdl.xsd: Include network format instead of defining own binary format.

c/network/format.dfdl.xsd: Define network format in only one place, making sure to use vitally needed alignment properties but keep the format as minimal as possible to make it easier to include in a wide variety of schemas.

c/padtest.dfdl.xsd: Include network format instead of defining own binary format.

c/simple.dfdl.xsd: Define schema version to be "1.0.0". Include network format instead of defining own binary format. Add new elements to test enumerations and ranges of primitive types. Use custom simple types in order to define simple type root elements, enumerations of simple type root elements, ranges of simple type elements, and all-in-one root element as compactly as possible.

c/simple.tdml: Make comments clearer how to run tests and include all data and infosets in test cases instead of using files.

c/simple_errors.tdml: Make comments clearer how to run tests and add new test cases to validate enumerations and ranges of primitive types.

c/variablelen.dfdl.xsd: Include network format instead of defining own binary format.

core/dsom/SchemaDocument.scala: Modify SchemaDocument to capture schema versions from XML schema definitions and provide to codegen-c.

tdml/TestDaffodilC.scala: Add future-proofing test to check test schema compiles without any warnings (would catch "relative location deprecated" warning if DFDLGeneralFormat url was still relative). Fix inconsistent use of "dp"/"tdp" and "isError"/"isProcessingError".

c/TestSimpleErrors.scala: Call new test cases in simple_errors.tdml.

stevedlawrence

+1 👍 just a bunch of minor comments/suggestions

stevedlawrence · 2023-10-20T11:54:32Z

README.md

+
+Reformat source and sbt files if necessary:
+
+    sbt scalafmtAll scalafmtSbt


Is it worth adding commands for what you use to format h C code?

Maybe for a future PR, it might be worth adding an sbt task that executes those commands with the right options (e.g. sbt cfmtAll) so we don't have to remember them. Could also be useful to add a step in our github actions lint job to run that and error if git diff reports and changes.

Yes, we can write a sbt task which runs clang-format on the C files and errors if it finds and reports any diffs. However, some people may get an error running the task (especially if the task's purpose is to run all the language formatters from a single command) since clang-format is not commonly installed on many (most?) machines.

Makes sense. As long as the error is something useful that should be fine. I recommend doing it later though, doesn't have to be part of this PR.

stevedlawrence · 2023-10-20T12:01:32Z

README.md

+
+    sbt scalafmtAll scalafmtSbt
+
+### Build


I wonder if we need another build section for building and locally publishing the jars? Anyone developing something integrating Daffodil may want that. Maybe something like this?

Build the Daffodil jars and publish to the a Maven or Ivy repository Maven: sbt publishM2 Ivy: sbt publishLocal

Yes, I'll document the publish commands too and explain Ivy is also used by sbt to build schema-only projects.

stevedlawrence · 2023-10-20T12:02:33Z

...il-codegen-c/src/main/resources/org/apache/daffodil/codegen/c/files/libcli/daffodil_getopt.c

+            break;
+        case 's':
+            // Ignore "-s schema" option/optarg
+            break;


I'm not sure I understand this, why of options to just ignore them? Seems that could be very confusing.

A workflow I sometimes use with code generators, DFDL schemas, and TDML tests is to run daffodil generate on the DFDL schemas, run a pair of daffodil parse and c/daffodil parse commands on the same data files, and check that they both produce the same exit status and similar infosets. This manual workflow is finer-grained than and sometimes easier to debug than running daffodil test on TDML tests using the same data and infosets. I thought it could be nice if I could run the same c/daffodil parse command as the daffodil parse command by appending only c/ to daffodil without having to change any of the options.

I thought I could hide this behavior from users (the usage message doesn't mention the -r and -s options) while making life easier for developers and testers who know about this hidden behavior for greater compatibility between daffodil and c/daffodil. I can revert this source file if you think users would be surprised and confused that c/daffodil allows -r and -s options.

I see, makes sense. As long as the optinos aren't in the usage I think it's fine. But it would be worth adding that reasoning as a comment.

stevedlawrence · 2023-10-20T12:03:32Z

...il-codegen-c/src/main/resources/org/apache/daffodil/codegen/c/files/libcli/daffodil_getopt.h

@@ -42,7 +42,7 @@ extern struct daffodil_parse_cli
    const char *infoset_converter;
    const char *infile;
    const char *outfile;
-    bool        validate;
+    bool validate;


I like this change 👍 . I've never been a huge fan of aligning variable names in C.

stevedlawrence · 2023-10-20T12:05:08Z

...odil-codegen-c/src/main/resources/org/apache/daffodil/codegen/c/files/libcli/daffodil_main.c

-#include "errors.h"           // for continue_or_exit, print_diagnostics, Error, Diagnostics, Error::(anonymous)
-#include "infoset.h"          // for get_infoset, walk_infoset, PState, UState, parse_data, unparse_infoset, InfosetBase, VisitEventHandler
+#include "errors.h"           // for continue_or_exit, Diagnostics, print_diagnostics, Error
+#include "infoset.h"          // for ParserOrUnparserState, PState, UState, get_infoset, walk_infoset, parse_data, unparse_infoset, InfosetBase, VisitEventHandler


Just curious, do you find it useful to include the functions each header includes? Seems like it's just noise to me. I guess this is auto-generated so not a big deal, but personally I think it would be nice if this could be disabled. Especially since some of these lines get pretty long.

Maybe just add a comment above the auto-maintained part indicating that these comments and the include list are auto-maintained by iwyu. I know the whole concept of this was news to me when I first read this code.

My viewpoint (coming back from developing Java to developing C) is that it would be nice if C header lines tell you which symbols they import just like Java import lines tell you which symbols they import. However, I can tell iwyu not to create these comments if you all don't like them. Then the remaining use case for iwyu is to sort and canonicalize C header lines just like Java/Scala tools sort and canonicalize Java/Scala import lines.

Good idea, I've added a // auto-maintained by iwyu comment to each iwyu-maintained part.

stevedlawrence · 2023-10-20T13:26:33Z

...gen-c/src/main/scala/org/apache/daffodil/codegen/c/generators/BinaryValueCodeGenerator.scala

+    if (e.typeDef.optRestriction.exists(_.hasEnumeration)) {
+      // Split the enumeration values into an array of strings ready to be inserted into the C code
+      val enumerationValues = e.typeDef.optRestriction.flatMap(_.enumerationValues).get
+      val enums = enumerationValues.split('|').map(_.replaceAll("\\\\(\\W)", "$1"))


I think maybe Daffodil generates a regex to check enumerations, so you're having undo that regex? Instead, I think you can access the raw enum values like this:

val enums = e.typeDef.optRestriction.get.enumerations.map(_.enumValueRaw)

I'll try that expression, thanks.

stevedlawrence · 2023-10-20T13:32:31Z

...gen-c/src/main/scala/org/apache/daffodil/codegen/c/generators/BinaryValueCodeGenerator.scala

+      s"{$arraysName[$index], ${s.length / 2}, false}"
+    }
+    val hexEnumsInit = enums.map(_.grouped(2).map("0x" + _).mkString("{", ", ", "}"))
+    val enumsLenMax = enums.map(_.length / 2).max


Suggest hexEnumsLenMax since this is specific to hex? Or move all the hex specific stuff to the HexBinary cases so we don't have to calculate this stuff where it doesn't apply.

First change is easier, so will accept that change.

stevedlawrence · 2023-10-20T13:36:01Z

...gen-c/src/main/scala/org/apache/daffodil/codegen/c/generators/BinaryValueCodeGenerator.scala

+    val (enumsInit, extraInit, primType, valType, fieldArg) = e.optPrimType.get match {
+      case PrimType.Double | PrimType.Float => (enums, "", "double", "floatpt", field)
+      case PrimType.HexBinary => (hexEnums, arraysInit, "HexBinary", "hexbinary", s"&$field")
+      case _ => (enums, "", "int64_t", "integer", field)


Is there value in listing the integer types? I imagine this failing on things like Date/Time/Integer primitives. THough I guess those primitives would fail somewhere much earlier since they aren't supported?

I don't think I ever checked dates, times, or arbitrary length integers longer than 64 bits. I am sure they would fail somewhere else too since the code generator doesn't support them at all (it won't be able to find the right primitive C type, for example).

stevedlawrence · 2023-10-20T14:00:33Z

...gen-c/src/main/scala/org/apache/daffodil/codegen/c/generators/BinaryValueCodeGenerator.scala

+          "{",
+          ", ",
+          "}",
+        )};\n"""


Are there any concerns that there might be a bunch of hex binary enums with one being much larger than the other, and so this array would have a lot of empty space? I guess the other option would be to create a bunch of separate variables (e.g. hexEnum1, hexEnum2, hexEnum3, ... 0), maybe that's not worth it. Or alternatively, concat the hex arrays into a single long array, and then then hexEnums have to correctly index into that array? Maybe that's too much complexity for a case that isn't that likely? I imagine most hex enums are all the same length, or at least close in length.

I have never seen enumeration facets on hexBinary, so I wouldn't sweat it. I had to check the DFDL spec to see if they were even allowed!

I have not seen any schemas with hex binary enums yet, so let's put off any changes until we see hex enums actually used. I only found out through experimentation that daffodil supports hex enums and thought daffodilC should try to support them too. I think your imagination that most hex enums are close in length is probably right anyway.

stevedlawrence · 2023-10-20T14:08:38Z

daffodil-core/src/main/scala/org/apache/daffodil/core/dsom/SchemaDocument.scala

@@ -212,6 +212,8 @@ final class SchemaDocument private (xmlSDoc: XMLSchemaDocument)
  final override lazy val optLexicalParent = Some(xmlSDoc)
  final override lazy val optXMLSchemaDocument = Some(xmlSDoc)

+  final lazy val version = (xml \ "@version").text


I didn't know XSD had a version attribute. Coudld be useful!

Though, I wonder if this should be an Option so we can tell if it's provided or not? In the code gen case, it can just to optVersion.getOrElse("") so it wouldn't change much.

Somewhere on WWW is a diatribe on why using namespace URIs with versions in them is so wrong wrong wrong as a versioning means for XSD schemas.

How do you change the line to return an Option[String]?

Somewhere on WWW is a diatribe on why using namespace URIs with versions in them is so wrong wrong wrong as a versioning means for XSD schemas.

The built-in XML Schema version attribute is orthogonal to and separate from any namespace URI. We plan to take advantage of it to assign semantic versions (strings like "3.6.0") to our filters.

How do you change the line to return an Option[String]?

(xml \ "@version") returns a NodeSeq, which has an implicit conversion to Seq[Node], so you can use any Seq functions on it. So something like this should work:

final lazy val optVersion = (xml \ "@version").headOption.map(_.text)

Alas, I merged the PR before I checked my emails and saw your reply, It's too late to put the change into the PR now, but let's keep it in mind for another PR:

final lazy val optVersion = (xml \ "@version").headOption.map(_.text)

mbeckerle

+1

This is an amazing change set. Rather big, and long to review, but really great.

I have a bunch of minor suggestions and a few questions that would be good to understand better, but generally speaking really good.

mbeckerle · 2023-10-20T17:02:51Z

BUILD.md


-    sudo dnf install clang gcc git java-11-openjdk-devel llvm make mxml-devel pkgconf
+    sudo dnf install clang gcc git iwyu java-11-openjdk-devel llvm make mxml-devel pkgconf

 If you want to use clang instead of gcc, you'll have to set your


Question: why give people the choice?

We don't have to provide this flexibility unless we or users depend on it for some reason.

This works, so "it aint broke" so no need to change anything. I just wanted to comment that we don't really have to provide options like this unless they're really really required.

If you're talking about the paragraph If you want to use clang instead of gcc, ..., I originally added that paragraph because you told me you didn't want to require GNU-licensed tools to develop or use any part of Daffodil. I even made sure that the GitHub Action workflow (main.yml) would set these environment variables to build Daffodil with clang and lvvm-ar instead of gcc and ar. I can reword the paragraph to say you can set these environment variables to use clang and llvm-ar instead of gcc and ar if you have a policy forbidding GNU-licensed tools.

Yeah, gcc is GPLv3 which is category X, so I had concerns about implying it was required. I did find this open Jira LEGAL issue that discusses this:

https://issues.apache.org/jira/browse/LEGAL-390?jql=project+%3D+LEGAL+AND+text+%7E+%22gcc%22+ORDER+BY+updated+DESC%2C+priority+DESC

Sounds like the resolution it's it's fine to use. I'd suggest we still do mention clang support so that people can use it if they want and don't have to figure it all out. But maybe that's an addendum or something, and everything else implies the use of gcc if the goal is to simplify this?

mbeckerle · 2023-10-20T17:08:12Z

...odil-codegen-c/src/main/resources/org/apache/daffodil/codegen/c/files/libcli/daffodil_main.c

-#include "errors.h"           // for continue_or_exit, print_diagnostics, Error, Diagnostics, Error::(anonymous)
-#include "infoset.h"          // for get_infoset, walk_infoset, PState, UState, parse_data, unparse_infoset, InfosetBase, VisitEventHandler
+#include "errors.h"           // for continue_or_exit, Diagnostics, print_diagnostics, Error
+#include "infoset.h"          // for ParserOrUnparserState, PState, UState, get_infoset, walk_infoset, parse_data, unparse_infoset, InfosetBase, VisitEventHandler


Maybe just add a comment above the auto-maintained part indicating that these comments and the include list are auto-maintained by iwyu. I know the whole concept of this was news to me when I first read this code.

mbeckerle · 2023-10-20T17:18:45Z

daffodil-codegen-c/src/main/resources/org/apache/daffodil/codegen/c/files/libruntime/parsers.c

@@ -21,7 +21,7 @@
 #include <stdbool.h>  // for bool, false, true
 #include <stdio.h>    // for fread, fgetc, EOF
 #include <stdlib.h>   // for free, malloc
-#include "errors.h"   // for Error, eof_or_error, Error::(anonymous), ERR_LEFTOVER_DATA, add_diagnostic, get_diagnostics, ERR_ARRAY_BOUNDS, ERR_FIXED_VALUE, ERR_HEXBINARY_ALLOC, ERR_PARSE_BOOL, Diagnostics
+#include "errors.h"   // for Error, eof_or_error, ERR_LEFTOVER_DATA, Error::(anonymous), ERR_HEXBINARY_ALLOC, ERR_PARSE_BOOL


Hmm. iwyu specified what "errors.h" was needed for here, but above where SL commented on ERR_ZZZ there's no such comment on the include of "errors.h". What are the rules here about what iwyu maintains?

What you're noticing is that iwyu never puts a comment on a C source file's own header file. That is, parsers.c "uses" a few symbols from errors.h, but errors.c "implements" nearly all of errors.h's symbols. Even iwyu put a comment anyway, the comment would be too long trying to list all the header symbols implemented by the source file.

mbeckerle · 2023-10-20T17:19:31Z

daffodil-codegen-c/src/main/resources/org/apache/daffodil/codegen/c/files/libruntime/infoset.h

-    const Error *error;         // any error which stops parse
-    uint8_t      unreadBits;    // any buffered bits not read yet
-    uint8_t      numUnreadBits; // number of buffered bits not read yet
+    ParserOrUnparserState pu; // common mutable state


Yessssss. embedded by value, not by reference. 👍

mbeckerle · 2023-10-20T17:27:21Z

daffodil-codegen-c/src/main/resources/org/apache/daffodil/codegen/c/files/libruntime/parsers.c

 void
-parse_check_bounds(const char *name, size_t count, size_t minOccurs, size_t maxOccurs, PState *pstate)
+parse_fill_bits(size_t end_bitPos0b, PState *pstate)


This name confused me. It has to be "fill" as in short for the alignmentFill region.

Do you mind changing the name to parse_alignment_bits?

Sure, I've changed the name.

mbeckerle · 2023-10-20T18:48:55Z

daffodil-codegen-c/src/test/resources/org/apache/daffodil/codegen/c/ex_nums.dfdl.xsd


-    <!-- Representation property bindings -->
+  <!-- Network order binary format (net:format) -->


I know folks at Sun (back in the day) and IETF wanted to define "Network Byte Order", but they failed because Intel little-endian CPUs took over the world for a while and everybody serialized data little endian because it was easiest. Nobody sees the TCP/IP headers of packets. (or cares) They see their payload data, which is little-endian a huge fraction of the time.

So please say big or little endian here.

OK, saying Network order big endian format (net:format).

mbeckerle · 2023-10-20T18:50:08Z

daffodil-codegen-c/src/test/resources/org/apache/daffodil/codegen/c/network/format.dfdl.xsd

+  limitations under the License.
+-->
+
+<!-- Network order binary format: <https://en.wikipedia.org/wiki/Endianness#Networking> -->


Really. Just tell people this is big endian.

OK, Network order big endian format.

mbeckerle · 2023-10-20T18:53:44Z

daffodil-codegen-c/src/test/resources/org/apache/daffodil/codegen/c/network/format.dfdl.xsd

+  <annotation>
+    <appinfo source="http://www.ogf.org/dfdl/">
+      <dfdl:defineFormat name="format">
+        <dfdl:format


You really should define byteOrder here as bigEndian and bitOrder as mostSignificantBitFirst, because it matters in order to understand this format.

DFDLGeneralFormat.dfdl.xsd is not supposed to change, but it's also supposed to be mostly so you don't have to remember obscure properties Daffodil requires that your format otherwise wouldn't bother to mention.

For properties that really define the essence of the format, you should have them explicitly regardless of whether they match what is in DFDLGeneralFormat or not.

All right, adding byteOrder as bigEndian and bitOrder as mostSignificantBitFirst to the format.

mbeckerle · 2023-10-20T18:59:53Z

daffodil-codegen-c/src/test/resources/org/apache/daffodil/codegen/c/simple.dfdl.xsd

+  </simpleType>
+  <simpleType name="simple-hexBinaryPrefixed"
+              dfdl:prefixLengthType="si:prefixedCount"
+              dfdl:lengthKind="prefixed"


BTW: there was a DFDL clarification recently that a prefixedLengthType can have facets, and those are checked so as to be able to have say, a 32-bit integer as a lengthKind, but specify only sane values like < 1Million and even minimum lengths are to be checked.

This is super necessary if reading from a TCP socket. Last thing you want to do is a blocking read for 4Gigs of data.

Is generated C code going to implement this prefix value checking with your range validators?

I didn't know about this requirement until now. It would be a good idea to implement prefix value checking, but not in this PR (going on vacation soon).

mbeckerle · 2023-10-20T19:06:16Z

daffodil-core/src/main/scala/org/apache/daffodil/core/dsom/SchemaDocument.scala

@@ -212,6 +212,8 @@ final class SchemaDocument private (xmlSDoc: XMLSchemaDocument)
  final override lazy val optLexicalParent = Some(xmlSDoc)
  final override lazy val optXMLSchemaDocument = Some(xmlSDoc)

+  final lazy val version = (xml \ "@version").text


Somewhere on WWW is a diatribe on why using namespace URIs with versions in them is so wrong wrong wrong as a versioning means for XSD schemas.

tuxji

Thanks for the review, see responses below.

tuxji · 2023-10-20T17:25:05Z

README.md

+
+Reformat source and sbt files if necessary:
+
+    sbt scalafmtAll scalafmtSbt


Yes, we can write a sbt task which runs clang-format on the C files and errors if it finds and reports any diffs. However, some people may get an error running the task (especially if the task's purpose is to run all the language formatters from a single command) since clang-format is not commonly installed on many (most?) machines.

tuxji · 2023-10-20T17:32:07Z

README.md

+
+    sbt scalafmtAll scalafmtSbt
+
+### Build


Yes, I'll document the publish commands too and explain Ivy is also used by sbt to build schema-only projects.

tuxji · 2023-10-20T17:59:17Z

...il-codegen-c/src/main/resources/org/apache/daffodil/codegen/c/files/libcli/daffodil_getopt.c

+            break;
+        case 's':
+            // Ignore "-s schema" option/optarg
+            break;


A workflow I sometimes use with code generators, DFDL schemas, and TDML tests is to run daffodil generate on the DFDL schemas, run a pair of daffodil parse and c/daffodil parse commands on the same data files, and check that they both produce the same exit status and similar infosets. This manual workflow is finer-grained than and sometimes easier to debug than running daffodil test on TDML tests using the same data and infosets. I thought it could be nice if I could run the same c/daffodil parse command as the daffodil parse command by appending only c/ to daffodil without having to change any of the options.

I thought I could hide this behavior from users (the usage message doesn't mention the -r and -s options) while making life easier for developers and testers who know about this hidden behavior for greater compatibility between daffodil and c/daffodil. I can revert this source file if you think users would be surprised and confused that c/daffodil allows -r and -s options.

tuxji · 2023-10-20T18:10:03Z

...odil-codegen-c/src/main/resources/org/apache/daffodil/codegen/c/files/libcli/daffodil_main.c

-#include "errors.h"           // for continue_or_exit, print_diagnostics, Error, Diagnostics, Error::(anonymous)
-#include "infoset.h"          // for get_infoset, walk_infoset, PState, UState, parse_data, unparse_infoset, InfosetBase, VisitEventHandler
+#include "errors.h"           // for continue_or_exit, Diagnostics, print_diagnostics, Error
+#include "infoset.h"          // for ParserOrUnparserState, PState, UState, get_infoset, walk_infoset, parse_data, unparse_infoset, InfosetBase, VisitEventHandler


My viewpoint (coming back from developing Java to developing C) is that it would be nice if C header lines tell you which symbols they import just like Java import lines tell you which symbols they import. However, I can tell iwyu not to create these comments if you all don't like them. Then the remaining use case for iwyu is to sort and canonicalize C header lines just like Java/Scala tools sort and canonicalize Java/Scala import lines.

tuxji · 2023-10-20T18:24:37Z

...odil-codegen-c/src/main/resources/org/apache/daffodil/codegen/c/files/libcli/daffodil_main.c

+        continue_or_exit(ustate.pu.error);
+
+        // Any diagnostics will fail the unparse if validate mode is on
+        if (daffodil_parse.validate && ustate.pu.diagnostics)


You're right on both points. It would be even easier to merge daffodil_parse_cli and daffodil_unparse_cli into daffodil_pu_cli than it would be to add validate to daffodil_unparse_cli.

I'll wait for Mike and you to respond to the rest of my PR comments before I do another pass on the PR, so please weigh in on how much you want the other suggested changes.

tuxji · 2023-10-20T19:41:31Z

...gen-c/src/main/scala/org/apache/daffodil/codegen/c/generators/BinaryValueCodeGenerator.scala

+    if (e.typeDef.optRestriction.exists(_.hasEnumeration)) {
+      // Split the enumeration values into an array of strings ready to be inserted into the C code
+      val enumerationValues = e.typeDef.optRestriction.flatMap(_.enumerationValues).get
+      val enums = enumerationValues.split('|').map(_.replaceAll("\\\\(\\W)", "$1"))


I'll try that expression, thanks.

tuxji · 2023-10-20T19:44:18Z

...gen-c/src/main/scala/org/apache/daffodil/codegen/c/generators/BinaryValueCodeGenerator.scala

+      s"{$arraysName[$index], ${s.length / 2}, false}"
+    }
+    val hexEnumsInit = enums.map(_.grouped(2).map("0x" + _).mkString("{", ", ", "}"))
+    val enumsLenMax = enums.map(_.length / 2).max


First change is easier, so will accept that change.

tuxji · 2023-10-20T19:48:30Z

...gen-c/src/main/scala/org/apache/daffodil/codegen/c/generators/BinaryValueCodeGenerator.scala

+          "{",
+          ", ",
+          "}",
+        )};\n"""


I have not seen any schemas with hex binary enums yet, so let's put off any changes until we see hex enums actually used. I only found out through experimentation that daffodil supports hex enums and thought daffodilC should try to support them too. I think your imagination that most hex enums are close in length is probably right anyway.

tuxji · 2023-10-20T19:52:09Z

...gen-c/src/main/scala/org/apache/daffodil/codegen/c/generators/BinaryValueCodeGenerator.scala

+    val (enumsInit, extraInit, primType, valType, fieldArg) = e.optPrimType.get match {
+      case PrimType.Double | PrimType.Float => (enums, "", "double", "floatpt", field)
+      case PrimType.HexBinary => (hexEnums, arraysInit, "HexBinary", "hexbinary", s"&$field")
+      case _ => (enums, "", "int64_t", "integer", field)


I don't think I ever checked dates, times, or arbitrary length integers longer than 64 bits. I am sure they would fail somewhere else too since the code generator doesn't support them at all (it won't be able to find the right primitive C type, for example).

tuxji · 2023-10-20T19:58:57Z

daffodil-core/src/main/scala/org/apache/daffodil/core/dsom/SchemaDocument.scala

@@ -212,6 +212,8 @@ final class SchemaDocument private (xmlSDoc: XMLSchemaDocument)
  final override lazy val optLexicalParent = Some(xmlSDoc)
  final override lazy val optXMLSchemaDocument = Some(xmlSDoc)

+  final lazy val version = (xml \ "@version").text


How do you change the line to return an Option[String]?

tuxji

Thanks again for the comments, responses below.

tuxji · 2023-10-21T03:29:47Z

daffodil-codegen-c/src/test/resources/org/apache/daffodil/codegen/c/network/format.dfdl.xsd

+  <annotation>
+    <appinfo source="http://www.ogf.org/dfdl/">
+      <dfdl:defineFormat name="format">
+        <dfdl:format


All right, adding byteOrder as bigEndian and bitOrder as mostSignificantBitFirst to the format.

tuxji · 2023-10-21T03:30:14Z

daffodil-codegen-c/src/test/resources/org/apache/daffodil/codegen/c/network/format.dfdl.xsd

+  limitations under the License.
+-->
+
+<!-- Network order binary format: <https://en.wikipedia.org/wiki/Endianness#Networking> -->


OK, Network order big endian format.

tuxji · 2023-10-21T03:30:59Z

daffodil-codegen-c/src/test/resources/org/apache/daffodil/codegen/c/ex_nums.dfdl.xsd


-    <!-- Representation property bindings -->
+  <!-- Network order binary format (net:format) -->


OK, saying Network order big endian format (net:format).

tuxji · 2023-10-21T03:34:09Z

daffodil-codegen-c/src/test/resources/org/apache/daffodil/codegen/c/simple.dfdl.xsd

+  </simpleType>
+  <simpleType name="simple-hexBinaryPrefixed"
+              dfdl:prefixLengthType="si:prefixedCount"
+              dfdl:lengthKind="prefixed"


I didn't know about this requirement until now. It would be a good idea to implement prefix value checking, but not in this PR (going on vacation soon).

tuxji · 2023-10-21T03:39:24Z

daffodil-core/src/main/scala/org/apache/daffodil/core/dsom/SchemaDocument.scala

@@ -212,6 +212,8 @@ final class SchemaDocument private (xmlSDoc: XMLSchemaDocument)
  final override lazy val optLexicalParent = Some(xmlSDoc)
  final override lazy val optXMLSchemaDocument = Some(xmlSDoc)

+  final lazy val version = (xml \ "@version").text


Somewhere on WWW is a diatribe on why using namespace URIs with versions in them is so wrong wrong wrong as a versioning means for XSD schemas.

The built-in XML Schema version attribute is orthogonal to and separate from any namespace URI. We plan to take advantage of it to assign semantic versions (strings like "3.6.0") to our filters.

Modify C generator to compare floating point and integer numbers with enumerations and ranges specified in DFDL schemas. Test validation code with additional simple type root elements in simple.dfdl.xsd schema and additional TDML tests in simple_errors.tdml. Also allow C generator to compare hexBinary elements to values in enumerations since Daffodil does the same thing and C generator already compares hexBinary elements to values in fixed attributes. Allow C generator to ignore dfdl:assert expressions in DFDL schemas and generate code anyway instead of throwing an exception. Allow C generator to get schema version from DFDL schemas and put it into the generated C code to version the generated code as well. Found out Daffodil's default alignment properties (alignment="1" and alignmentUnits="bytes") makes Daffodil append extra fill bits to elements with odd sizes not a multiple of 8 bits. Because compatibility between Daffodil and DaffodilC depends on defining both dfdl:alignment="1" (default) and dfdl:alignmentUnits="bits" (non-default), define a known good binary data format in one place (network/format.dfdl.xsd) and include it in rest of daffodil-codegen-c's test schemas. Finally, make changes requested by PR review. DAFFODIL-2853 BUILD.md: Document how to install iwyu and libcriterion-dev (used only for Daffodil C code generator development and maintenance). README.md: Document how to check formatting of or reformat Daffodil. Also document how to put the Daffodil jars in the maven and ivy caches. c/files/.clang-format: Update to clang-format 14. Format declarations more concisely by removing AlignConsecutiveDeclarations (this is why some reformatted C files lose some extra whitespace). c/files/Makefile: Shorten iwyu's maximum line length from 999 to 111. c/files/**/**.[ch]: Add "auto-maintained by iwyu" comment to headers. c/files/libcli/cli_errors.[ch]: Rename CLI_ZZZ, ERR_ZZZ, and FIELD_ZZZ to CLI__NUM_CODES, ERR__NUM_CODES, and FIELD__NO_ARGS. c/files/libcli/daffodil_getopt.[ch]: Merge daffodil_parse_cli and daffodil_unparse_cli structs and objects into daffodil_pu_cli and daffodil_pu. Skip and ignore -r and -s options so one can run "c/daffodil parse" with the same options as "daffodil parse -r root -s schema -o outfile infile" without having to remove these options. c/files/libcli/daffodil_main.c: Merge daffodil_parse_cli and daffodil_unparse_cli structs and objects into daffodil_pu_cli and daffodil_pu. Initialize ParserOrUnparserState fields as well as PState/UState fields. Make sure any diagnostics will fail unparse as well as parse if validate mode is on. c/files/libruntime/errors.[ch]: Add ERR_ENUM_MATCH and ERR_OUTSIDE_RANGE error messages to diagnose elements not matching any of their enumerations or having values outside their allowed ranges. Rename ERR_ZZZ and FIELD_ZZZ to ERR__NUM_CODES and FIELD__NO_ARGS. Rename ERR_ENUM_MATCH, ERR_FIXED_VALUE, and ERR_OUTSIDE_RANGE to ERR_RESTR_ENUM, ERR_RESTR_FIXED, and ERR_RESTR_RANGE. Remove `daffodil_program_version` since we use `daffodil_version` in daffodil_version.h instead. c/files/libruntime/infoset.h: Move common PState/UState fields to ParseOrUnparseState to allow parse and unparse functions to call common validate functions. c/files/libruntime/parsers.[ch]: Use ParserOrUnparserState fields as well as PState/UState fields. Replace parse_check_bounds and parse_validate_fixed functions with validate_array_bounds and validate_fixed_attribute functions in validators.[ch]. Rename parse_align and parse_fill_bits to parse_align_to and parse_alignment_bits. c/files/libruntime/unparsers.[ch]: Use ParserOrUnparserState fields as well as UState fields. Replace unparse_check_bounds and unparse_validate_fixed functions with validate_array_bounds and validate_fixed_attribute functions in validators.[ch]. Rename unparse_align and unparse_fill_bits to unparse_align_to and unparse_alignment_bits. c/files/libruntime/validators.[ch]: Move common validation functions here from parsers.[ch] and unparsers.[ch]. Add new float, hexBinary, int, and universal validation functions to check that elements match their allowed enumerations (requires passing array of floating point numbers, hexBinary structs, or integer numbers) and fit within their allowed ranges. Validation functions create diagnostics instead of errors; the CLI or a caller is responsible for printing the diagnostics and exiting with the appropriate status code. c/files/tests/bits.c: Use ParserOrUnparserState fields as well as PState/UState fields. c/files/tests/extras.c: Regenerate iwyu comments. c/DaffodilCCodeGenerator.scala: Accept but ignore dfdl:assert statements (for now). c/DaffodilCExamplesGenerator.scala: Generate code from simple's all-in-one root element in order to show generated code for all simple types, enums, and ranges. c/generators/AlignmentFillCodeGenerator.scala: Rename parse_align and unparse_align to parse_align_to and unparse_align_to. Use ParserOrUnparserState fields as well as PState/UState fields. c/generators/BinaryBooleanCodeGenerator.scala: Use ParserOrUnparserState fields as well as PState/UState fields. c/generators/BinaryValueCodeGenerator.scala: Generate C code to validate enumerations and ranges of primitive elements. Get raw enumeration values into a Seq[String]. Avoid unsigned >= 0 comparisons to prevent gcc warnings. Use ParserOrUnparserState fields as well as PState/UState fields. Call correct function to validate enums depending on element's primType. Generate extra C initialization code for hexBinary enumerations to define arrays of hexBinary structs and pass them to validate_hexbinary_enumeration. c/generators/CodeGeneratorState.scala: Remove unnecessary immutable (built-in Set is immutable). Use ParserOrUnparserState fields as well as PState/UState fields. Rename parse_fill_bits and unparse_fill_bits to parse_alignment_bits and unparse_alignment_bits. Get actual schema version from Daffodil and assign its value to `schema_version` in generated_code.c. Declare `schema_version` in generated_code.h. Call validate_array_bounds instead of parse/unparse_check_bounds. Make cStructFieldAccess work correctly for DFDL expressions like "/rr:ReqstReply/..." used by NFS schemas. c/generators/HexBinaryCodeGenerator.scala: Use ParserOrUnparserState fields as well as PState/UState fields. Call validate_fixed_attribute instead of parse/unparse_validate_fixed. examples/**/generated_code.[ch]: Regenerate to show changes in C generator such as adding new "auto-maintained by iwyu" comment, calls to renamed functions such as parse_align_to and unparse_align_to, defining schema_version, using pu fields, and calling validation functions. c/data/simple*.dat: Remove (contents now inside simple.tdml). c/ex_nums.dfdl.xsd: Define schema version to be "1.0.2". Include network format instead of defining own binary format. Adjust elements' own format properties as needed to match original data file (explicitness is better than using defaults). c/infosets/simple*.xml: Remove (contents now inside simple.tdml). c/nested.dfdl.xsd: Include network format instead of defining own binary format. c/network/format.dfdl.xsd: Define network format in only one place, making sure to use vitally needed alignment properties but keep the format as minimal as possible to make it easier to include in a wide variety of schemas. Change "Network order binary format" comment to "Network order big endian format" as requested. Define bitOrder and byteOrder explicitly in format schema instead of inheriting them from DFDLGeneralFormat schema. c/padtest.dfdl.xsd: Include network format instead of defining own binary format. c/simple.dfdl.xsd: Define schema version to be "1.0.0". Include network format instead of defining own binary format. Add new elements to test enumerations and ranges of primitive types. Use custom simple types in order to define simple type root elements, enumerations of simple type root elements, ranges of simple type elements, and all-in-one root element as compactly as possible. c/simple.tdml: Make comments clearer how to run tests and include all data and infosets in test cases instead of using files. c/simple_errors.tdml: Make comments clearer how to run tests and add new test cases to validate enumerations and ranges of primitive types. c/variablelen.dfdl.xsd: Include network format instead of defining own binary format. core/dsom/SchemaDocument.scala: Modify SchemaDocument to capture schema versions from XML schema definitions and provide to codegen-c. tdml/TestDaffodilC.scala: Add future-proofing test to check test schema compiles without any warnings (would catch "relative location deprecated" warning if DFDLGeneralFormat url was still relative). Fix inconsistent use of "dp"/"tdp" and "isError"/"isProcessingError". c/TestSimpleErrors.scala: Call new test cases in simple_errors.tdml.

tuxji requested review from mbeckerle and stevedlawrence October 19, 2023 22:13

tuxji self-assigned this Oct 19, 2023

stevedlawrence approved these changes Oct 20, 2023

View reviewed changes

mbeckerle approved these changes Oct 20, 2023

View reviewed changes

tuxji commented Oct 20, 2023

View reviewed changes

tuxji commented Oct 21, 2023

View reviewed changes

tuxji force-pushed the ji/2853/validate-enums-ranges branch from 20b9d96 to 604d509 Compare October 22, 2023 18:53

tuxji merged commit be3038e into apache:main Oct 22, 2023

tuxji deleted the ji/2853/validate-enums-ranges branch October 22, 2023 18:54


		Reformat source and sbt files if necessary:

		sbt scalafmtAll scalafmtSbt


		<!-- Representation property bindings -->
		<!-- Network order binary format (net:format) -->

Add schema validation code to C generator #1098

Add schema validation code to C generator #1098

Conversation

tuxji commented Oct 19, 2023

stevedlawrence left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mbeckerle left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tuxji left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tuxji left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment