Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add schema validation code to C generator #1098

Merged
merged 1 commit into from
Oct 22, 2023

Conversation

tuxji
Copy link
Contributor

@tuxji tuxji commented Oct 19, 2023

Modify C generator to compare floating point and integer numbers with enumerations and ranges specified in DFDL schemas. Test validation code with additional simple type root elements in simple.dfdl.xsd schema and additional TDML tests in simple_errors.tdml.

Also allow C generator to compare hexBinary elements to values in enumerations since Daffodil does the same thing and C generator already compares hexBinary elements to values in fixed attributes.

Allow C generator to ignore dfdl:assert expressions in DFDL schemas and generate code anyway instead of throwing an exception.

Allow C generator to get schema version from DFDL schemas and put it into the generated C code to version the generated code as well.

Found out Daffodil's default alignment properties (alignment="1" and alignmentUnits="bytes") makes Daffodil append extra fill bits to elements with odd sizes not a multiple of 8 bits. Because compatibility between Daffodil and DaffodilC depends on defining both dfdl:alignment="1" (default) and dfdl:alignmentUnits="bits" (non-default), define a known good binary data format in one place (network/format.dfdl.xsd) and include it in rest of daffodil-codegen-c's test schemas.

DAFFODIL-2853

BUILD.md: Document how to install iwyu and libcriterion-dev (used only for Daffodil C code generator development and maintenance).

README.md: Document how to check formatting of or reformat Daffodil.

c/files/.clang-format: Update to clang-format 14. Format declarations more concisely by removing AlignConsecutiveDeclarations (this is why some reformatted C files lose some extra whitespace).

c/files/libcli/daffodil_getopt.c: Skip and ignore -r and -s options so one can run "c/daffodil parse" with the same options as "daffodil parse -r root -s schema -o outfile infile" without having to remove these options.

c/files/libcli/daffodil_main.c: Initialize ParserOrUnparserState fields as well as PState/UState fields. Make sure any diagnostics will fail unparse as well as parse if validate mode is on.

c/files/libruntime/errors.[ch]: Add ERR_ENUM_MATCH and ERR_OUTSIDE_RANGE error messages to diagnose elements not matching any of their enumerations or having values outside their allowed ranges. Remove daffodil_program_version since we use daffodil_version in daffodil_version.h instead.

c/files/libruntime/infoset.h: Move common PState/UState fields to ParseOrUnparseState to allow parse and unparse functions to call common validate functions.

c/files/libruntime/parsers.[ch]: Use ParserOrUnparserState fields as well as PState/UState fields. Replace parse_check_bounds and parse_validate_fixed functions with validate_array_bounds and validate_fixed_attribute functions in validators.[ch].

c/files/libruntime/unparsers.[ch]: Use ParserOrUnparserState fields as well as UState fields. Replace unparse_check_bounds and unparse_validate_fixed functions with validate_array_bounds and validate_fixed_attribute functions in validators.[ch].

c/files/libruntime/validators.[ch]: Move common validation functions here from parsers.[ch] and unparsers.[ch]. Add new float, hexBinary, int, and universal validation functions to check that elements match their allowed enumerations (requires passing array of floating point numbers, hexBinary structs, or integer numbers) and fit within their allowed ranges. Validation functions create diagnostics instead of errors; the CLI or a caller is responsible for printing the diagnostics and exiting with the appropriate status code.

c/files/tests/bits.c: Use ParserOrUnparserState fields as well as PState/UState fields.

c/files/tests/extras.c: Regenerate iwyu comments.

c/DaffodilCCodeGenerator.scala: Accept but ignore dfdl:assert statements (for now).

c/DaffodilCExamplesGenerator.scala: Generate code from simple's all-in-one root element in order to show generated code for all simple types, enums, and ranges.

c/generators/AlignmentFillCodeGenerator.scala: Use ParserOrUnparserState fields as well as PState/UState fields.

c/generators/BinaryBooleanCodeGenerator.scala: Use ParserOrUnparserState fields as well as PState/UState fields.

c/generators/BinaryValueCodeGenerator.scala: Generate C code to validate enumerations and ranges of primitive elements. Remove escape characters in enumeration values added by Facets.escapeForRegex (turn "1.0" back into "1.0"). Avoid unsigned >= 0 comparisons to prevent gcc warnings. Use ParserOrUnparserState fields as well as PState/UState fields. Call correct function to validate enums depending on element's primType. Generate extra C initialization code for hexBinary enumerations to define arrays of hexBinary structs and pass them to validate_hexbinary_enumeration.

c/generators/CodeGeneratorState.scala: Remove unnecessary immutable (built-in Set is immutable). Use ParserOrUnparserState fields as well as PState/UState fields. Get actual schema version from Daffodil and assign its value to schema_version in generated_code.c. Declare schema_version in generated_code.h. Call validate_array_bounds instead of parse/unparse_check_bounds. Make cStructFieldAccess work correctly for DFDL expressions like "/rr:ReqstReply/..." used by NFS schemas.

c/generators/HexBinaryCodeGenerator.scala: Use ParserOrUnparserState fields as well as PState/UState fields. Call validate_fixed_attribute instead of parse/unparse_validate_fixed.

examples/**/generated_code.[ch]: Regenerate to show changes in C generator such as defining schema_version, using pu fields, and calling validation functions.

c/data/simple*.dat: Remove (contents now inside simple.tdml).

c/ex_nums.dfdl.xsd: Define schema version to be "1.0.2". Include network format instead of defining own binary format. Adjust elements' own format properties as needed to match original data file (explicitness is better than using defaults).

c/infosets/simple*.xml: Remove (contents now inside simple.tdml).

c/nested.dfdl.xsd: Include network format instead of defining own binary format.

c/network/format.dfdl.xsd: Define network format in only one place, making sure to use vitally needed alignment properties but keep the format as minimal as possible to make it easier to include in a wide variety of schemas.

c/padtest.dfdl.xsd: Include network format instead of defining own binary format.

c/simple.dfdl.xsd: Define schema version to be "1.0.0". Include network format instead of defining own binary format. Add new elements to test enumerations and ranges of primitive types. Use custom simple types in order to define simple type root elements, enumerations of simple type root elements, ranges of simple type elements, and all-in-one root element as compactly as possible.

c/simple.tdml: Make comments clearer how to run tests and include all data and infosets in test cases instead of using files.

c/simple_errors.tdml: Make comments clearer how to run tests and add new test cases to validate enumerations and ranges of primitive types.

c/variablelen.dfdl.xsd: Include network format instead of defining own binary format.

core/dsom/SchemaDocument.scala: Modify SchemaDocument to capture schema versions from XML schema definitions and provide to codegen-c.

tdml/TestDaffodilC.scala: Add future-proofing test to check test schema compiles without any warnings (would catch "relative location deprecated" warning if DFDLGeneralFormat url was still relative). Fix inconsistent use of "dp"/"tdp" and "isError"/"isProcessingError".

c/TestSimpleErrors.scala: Call new test cases in simple_errors.tdml.

@tuxji tuxji self-assigned this Oct 19, 2023
Copy link
Member

@stevedlawrence stevedlawrence left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 👍 just a bunch of minor comments/suggestions


Reformat source and sbt files if necessary:

sbt scalafmtAll scalafmtSbt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth adding commands for what you use to format h C code?

Maybe for a future PR, it might be worth adding an sbt task that executes those commands with the right options (e.g. sbt cfmtAll) so we don't have to remember them. Could also be useful to add a step in our github actions lint job to run that and error if git diff reports and changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can write a sbt task which runs clang-format on the C files and errors if it finds and reports any diffs. However, some people may get an error running the task (especially if the task's purpose is to run all the language formatters from a single command) since clang-format is not commonly installed on many (most?) machines.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. As long as the error is something useful that should be fine. I recommend doing it later though, doesn't have to be part of this PR.


sbt scalafmtAll scalafmtSbt

### Build
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need another build section for building and locally publishing the jars? Anyone developing something integrating Daffodil may want that. Maybe something like this?

Build the Daffodil jars and publish to the a Maven or Ivy repository

Maven:

    sbt publishM2

Ivy:

    sbt publishLocal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll document the publish commands too and explain Ivy is also used by sbt to build schema-only projects.

break;
case 's':
// Ignore "-s schema" option/optarg
break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this, why of options to just ignore them? Seems that could be very confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A workflow I sometimes use with code generators, DFDL schemas, and TDML tests is to run daffodil generate on the DFDL schemas, run a pair of daffodil parse and c/daffodil parse commands on the same data files, and check that they both produce the same exit status and similar infosets. This manual workflow is finer-grained than and sometimes easier to debug than running daffodil test on TDML tests using the same data and infosets. I thought it could be nice if I could run the same c/daffodil parse command as the daffodil parse command by appending only c/ to daffodil without having to change any of the options.

I thought I could hide this behavior from users (the usage message doesn't mention the -r and -s options) while making life easier for developers and testers who know about this hidden behavior for greater compatibility between daffodil and c/daffodil. I can revert this source file if you think users would be surprised and confused that c/daffodil allows -r and -s options.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, makes sense. As long as the optinos aren't in the usage I think it's fine. But it would be worth adding that reasoning as a comment.

@@ -42,7 +42,7 @@ extern struct daffodil_parse_cli
const char *infoset_converter;
const char *infile;
const char *outfile;
bool validate;
bool validate;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this change 👍 . I've never been a huge fan of aligning variable names in C.

#include "errors.h" // for continue_or_exit, print_diagnostics, Error, Diagnostics, Error::(anonymous)
#include "infoset.h" // for get_infoset, walk_infoset, PState, UState, parse_data, unparse_infoset, InfosetBase, VisitEventHandler
#include "errors.h" // for continue_or_exit, Diagnostics, print_diagnostics, Error
#include "infoset.h" // for ParserOrUnparserState, PState, UState, get_infoset, walk_infoset, parse_data, unparse_infoset, InfosetBase, VisitEventHandler
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, do you find it useful to include the functions each header includes? Seems like it's just noise to me. I guess this is auto-generated so not a big deal, but personally I think it would be nice if this could be disabled. Especially since some of these lines get pretty long.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just add a comment above the auto-maintained part indicating that these comments and the include list are auto-maintained by iwyu. I know the whole concept of this was news to me when I first read this code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My viewpoint (coming back from developing Java to developing C) is that it would be nice if C header lines tell you which symbols they import just like Java import lines tell you which symbols they import. However, I can tell iwyu not to create these comments if you all don't like them. Then the remaining use case for iwyu is to sort and canonicalize C header lines just like Java/Scala tools sort and canonicalize Java/Scala import lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I've added a // auto-maintained by iwyu comment to each iwyu-maintained part.

if (e.typeDef.optRestriction.exists(_.hasEnumeration)) {
// Split the enumeration values into an array of strings ready to be inserted into the C code
val enumerationValues = e.typeDef.optRestriction.flatMap(_.enumerationValues).get
val enums = enumerationValues.split('|').map(_.replaceAll("\\\\(\\W)", "$1"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe Daffodil generates a regex to check enumerations, so you're having undo that regex? Instead, I think you can access the raw enum values like this:

val enums = e.typeDef.optRestriction.get.enumerations.map(_.enumValueRaw)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try that expression, thanks.

s"{$arraysName[$index], ${s.length / 2}, false}"
}
val hexEnumsInit = enums.map(_.grouped(2).map("0x" + _).mkString("{", ", ", "}"))
val enumsLenMax = enums.map(_.length / 2).max
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest hexEnumsLenMax since this is specific to hex? Or move all the hex specific stuff to the HexBinary cases so we don't have to calculate this stuff where it doesn't apply.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First change is easier, so will accept that change.

val (enumsInit, extraInit, primType, valType, fieldArg) = e.optPrimType.get match {
case PrimType.Double | PrimType.Float => (enums, "", "double", "floatpt", field)
case PrimType.HexBinary => (hexEnums, arraysInit, "HexBinary", "hexbinary", s"&$field")
case _ => (enums, "", "int64_t", "integer", field)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there value in listing the integer types? I imagine this failing on things like Date/Time/Integer primitives. THough I guess those primitives would fail somewhere much earlier since they aren't supported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I ever checked dates, times, or arbitrary length integers longer than 64 bits. I am sure they would fail somewhere else too since the code generator doesn't support them at all (it won't be able to find the right primitive C type, for example).

"{",
", ",
"}",
)};\n"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any concerns that there might be a bunch of hex binary enums with one being much larger than the other, and so this array would have a lot of empty space? I guess the other option would be to create a bunch of separate variables (e.g. hexEnum1, hexEnum2, hexEnum3, ... 0), maybe that's not worth it. Or alternatively, concat the hex arrays into a single long array, and then then hexEnums have to correctly index into that array? Maybe that's too much complexity for a case that isn't that likely? I imagine most hex enums are all the same length, or at least close in length.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have never seen enumeration facets on hexBinary, so I wouldn't sweat it. I had to check the DFDL spec to see if they were even allowed!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not seen any schemas with hex binary enums yet, so let's put off any changes until we see hex enums actually used. I only found out through experimentation that daffodil supports hex enums and thought daffodilC should try to support them too. I think your imagination that most hex enums are close in length is probably right anyway.

@@ -212,6 +212,8 @@ final class SchemaDocument private (xmlSDoc: XMLSchemaDocument)
final override lazy val optLexicalParent = Some(xmlSDoc)
final override lazy val optXMLSchemaDocument = Some(xmlSDoc)

final lazy val version = (xml \ "@version").text
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know XSD had a version attribute. Coudld be useful!

Though, I wonder if this should be an Option so we can tell if it's provided or not? In the code gen case, it can just to optVersion.getOrElse("") so it wouldn't change much.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhere on WWW is a diatribe on why using namespace URIs with versions in them is so wrong wrong wrong as a versioning means for XSD schemas.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you change the line to return an Option[String]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhere on WWW is a diatribe on why using namespace URIs with versions in them is so wrong wrong wrong as a versioning means for XSD schemas.

The built-in XML Schema version attribute is orthogonal to and separate from any namespace URI. We plan to take advantage of it to assign semantic versions (strings like "3.6.0") to our filters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you change the line to return an Option[String]?

(xml \ "@version") returns a NodeSeq, which has an implicit conversion to Seq[Node], so you can use any Seq functions on it. So something like this should work:

final lazy val optVersion = (xml \ "@version").headOption.map(_.text)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alas, I merged the PR before I checked my emails and saw your reply, It's too late to put the change into the PR now, but let's keep it in mind for another PR:

final lazy val optVersion = (xml \ "@version").headOption.map(_.text)

Copy link
Contributor

@mbeckerle mbeckerle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

This is an amazing change set. Rather big, and long to review, but really great.

I have a bunch of minor suggestions and a few questions that would be good to understand better, but generally speaking really good.


sudo dnf install clang gcc git java-11-openjdk-devel llvm make mxml-devel pkgconf
sudo dnf install clang gcc git iwyu java-11-openjdk-devel llvm make mxml-devel pkgconf

If you want to use clang instead of gcc, you'll have to set your
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: why give people the choice?

We don't have to provide this flexibility unless we or users depend on it for some reason.

This works, so "it aint broke" so no need to change anything. I just wanted to comment that we don't really have to provide options like this unless they're really really required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're talking about the paragraph If you want to use clang instead of gcc, ..., I originally added that paragraph because you told me you didn't want to require GNU-licensed tools to develop or use any part of Daffodil. I even made sure that the GitHub Action workflow (main.yml) would set these environment variables to build Daffodil with clang and lvvm-ar instead of gcc and ar. I can reword the paragraph to say you can set these environment variables to use clang and llvm-ar instead of gcc and ar if you have a policy forbidding GNU-licensed tools.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, gcc is GPLv3 which is category X, so I had concerns about implying it was required. I did find this open Jira LEGAL issue that discusses this:

https://issues.apache.org/jira/browse/LEGAL-390?jql=project+%3D+LEGAL+AND+text+%7E+%22gcc%22+ORDER+BY+updated+DESC%2C+priority+DESC

Sounds like the resolution it's it's fine to use. I'd suggest we still do mention clang support so that people can use it if they want and don't have to figure it all out. But maybe that's an addendum or something, and everything else implies the use of gcc if the goal is to simplify this?

#include "errors.h" // for continue_or_exit, print_diagnostics, Error, Diagnostics, Error::(anonymous)
#include "infoset.h" // for get_infoset, walk_infoset, PState, UState, parse_data, unparse_infoset, InfosetBase, VisitEventHandler
#include "errors.h" // for continue_or_exit, Diagnostics, print_diagnostics, Error
#include "infoset.h" // for ParserOrUnparserState, PState, UState, get_infoset, walk_infoset, parse_data, unparse_infoset, InfosetBase, VisitEventHandler
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just add a comment above the auto-maintained part indicating that these comments and the include list are auto-maintained by iwyu. I know the whole concept of this was news to me when I first read this code.

@@ -21,7 +21,7 @@
#include <stdbool.h> // for bool, false, true
#include <stdio.h> // for fread, fgetc, EOF
#include <stdlib.h> // for free, malloc
#include "errors.h" // for Error, eof_or_error, Error::(anonymous), ERR_LEFTOVER_DATA, add_diagnostic, get_diagnostics, ERR_ARRAY_BOUNDS, ERR_FIXED_VALUE, ERR_HEXBINARY_ALLOC, ERR_PARSE_BOOL, Diagnostics
#include "errors.h" // for Error, eof_or_error, ERR_LEFTOVER_DATA, Error::(anonymous), ERR_HEXBINARY_ALLOC, ERR_PARSE_BOOL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. iwyu specified what "errors.h" was needed for here, but above where SL commented on ERR_ZZZ there's no such comment on the include of "errors.h". What are the rules here about what iwyu maintains?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you're noticing is that iwyu never puts a comment on a C source file's own header file. That is, parsers.c "uses" a few symbols from errors.h, but errors.c "implements" nearly all of errors.h's symbols. Even iwyu put a comment anyway, the comment would be too long trying to list all the header symbols implemented by the source file.

const Error *error; // any error which stops parse
uint8_t unreadBits; // any buffered bits not read yet
uint8_t numUnreadBits; // number of buffered bits not read yet
ParserOrUnparserState pu; // common mutable state
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yessssss. embedded by value, not by reference. 👍

void
parse_check_bounds(const char *name, size_t count, size_t minOccurs, size_t maxOccurs, PState *pstate)
parse_fill_bits(size_t end_bitPos0b, PState *pstate)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name confused me. It has to be "fill" as in short for the alignmentFill region.

Do you mind changing the name to parse_alignment_bits?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I've changed the name.


<!-- Representation property bindings -->
<!-- Network order binary format (net:format) -->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know folks at Sun (back in the day) and IETF wanted to define "Network Byte Order", but they failed because Intel little-endian CPUs took over the world for a while and everybody serialized data little endian because it was easiest. Nobody sees the TCP/IP headers of packets. (or cares) They see their payload data, which is little-endian a huge fraction of the time.

So please say big or little endian here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, saying Network order big endian format (net:format).

limitations under the License.
-->

<!-- Network order binary format: <https://en.wikipedia.org/wiki/Endianness#Networking> -->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really. Just tell people this is big endian.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, Network order big endian format.

<annotation>
<appinfo source="http://www.ogf.org/dfdl/">
<dfdl:defineFormat name="format">
<dfdl:format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You really should define byteOrder here as bigEndian and bitOrder as mostSignificantBitFirst, because it matters in order to understand this format.

DFDLGeneralFormat.dfdl.xsd is not supposed to change, but it's also supposed to be mostly so you don't have to remember obscure properties Daffodil requires that your format otherwise wouldn't bother to mention.

For properties that really define the essence of the format, you should have them explicitly regardless of whether they match what is in DFDLGeneralFormat or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right, adding byteOrder as bigEndian and bitOrder as mostSignificantBitFirst to the format.

</simpleType>
<simpleType name="simple-hexBinaryPrefixed"
dfdl:prefixLengthType="si:prefixedCount"
dfdl:lengthKind="prefixed"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW: there was a DFDL clarification recently that a prefixedLengthType can have facets, and those are checked so as to be able to have say, a 32-bit integer as a lengthKind, but specify only sane values like < 1Million and even minimum lengths are to be checked.

This is super necessary if reading from a TCP socket. Last thing you want to do is a blocking read for 4Gigs of data.

Is generated C code going to implement this prefix value checking with your range validators?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know about this requirement until now. It would be a good idea to implement prefix value checking, but not in this PR (going on vacation soon).

@@ -212,6 +212,8 @@ final class SchemaDocument private (xmlSDoc: XMLSchemaDocument)
final override lazy val optLexicalParent = Some(xmlSDoc)
final override lazy val optXMLSchemaDocument = Some(xmlSDoc)

final lazy val version = (xml \ "@version").text
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhere on WWW is a diatribe on why using namespace URIs with versions in them is so wrong wrong wrong as a versioning means for XSD schemas.

Copy link
Contributor Author

@tuxji tuxji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review, see responses below.


Reformat source and sbt files if necessary:

sbt scalafmtAll scalafmtSbt
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can write a sbt task which runs clang-format on the C files and errors if it finds and reports any diffs. However, some people may get an error running the task (especially if the task's purpose is to run all the language formatters from a single command) since clang-format is not commonly installed on many (most?) machines.


sbt scalafmtAll scalafmtSbt

### Build
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll document the publish commands too and explain Ivy is also used by sbt to build schema-only projects.

break;
case 's':
// Ignore "-s schema" option/optarg
break;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A workflow I sometimes use with code generators, DFDL schemas, and TDML tests is to run daffodil generate on the DFDL schemas, run a pair of daffodil parse and c/daffodil parse commands on the same data files, and check that they both produce the same exit status and similar infosets. This manual workflow is finer-grained than and sometimes easier to debug than running daffodil test on TDML tests using the same data and infosets. I thought it could be nice if I could run the same c/daffodil parse command as the daffodil parse command by appending only c/ to daffodil without having to change any of the options.

I thought I could hide this behavior from users (the usage message doesn't mention the -r and -s options) while making life easier for developers and testers who know about this hidden behavior for greater compatibility between daffodil and c/daffodil. I can revert this source file if you think users would be surprised and confused that c/daffodil allows -r and -s options.

#include "errors.h" // for continue_or_exit, print_diagnostics, Error, Diagnostics, Error::(anonymous)
#include "infoset.h" // for get_infoset, walk_infoset, PState, UState, parse_data, unparse_infoset, InfosetBase, VisitEventHandler
#include "errors.h" // for continue_or_exit, Diagnostics, print_diagnostics, Error
#include "infoset.h" // for ParserOrUnparserState, PState, UState, get_infoset, walk_infoset, parse_data, unparse_infoset, InfosetBase, VisitEventHandler
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My viewpoint (coming back from developing Java to developing C) is that it would be nice if C header lines tell you which symbols they import just like Java import lines tell you which symbols they import. However, I can tell iwyu not to create these comments if you all don't like them. Then the remaining use case for iwyu is to sort and canonicalize C header lines just like Java/Scala tools sort and canonicalize Java/Scala import lines.

continue_or_exit(ustate.pu.error);

// Any diagnostics will fail the unparse if validate mode is on
if (daffodil_parse.validate && ustate.pu.diagnostics)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right on both points. It would be even easier to merge daffodil_parse_cli and daffodil_unparse_cli into daffodil_pu_cli than it would be to add validate to daffodil_unparse_cli.

I'll wait for Mike and you to respond to the rest of my PR comments before I do another pass on the PR, so please weigh in on how much you want the other suggested changes.

if (e.typeDef.optRestriction.exists(_.hasEnumeration)) {
// Split the enumeration values into an array of strings ready to be inserted into the C code
val enumerationValues = e.typeDef.optRestriction.flatMap(_.enumerationValues).get
val enums = enumerationValues.split('|').map(_.replaceAll("\\\\(\\W)", "$1"))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try that expression, thanks.

s"{$arraysName[$index], ${s.length / 2}, false}"
}
val hexEnumsInit = enums.map(_.grouped(2).map("0x" + _).mkString("{", ", ", "}"))
val enumsLenMax = enums.map(_.length / 2).max
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First change is easier, so will accept that change.

"{",
", ",
"}",
)};\n"""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not seen any schemas with hex binary enums yet, so let's put off any changes until we see hex enums actually used. I only found out through experimentation that daffodil supports hex enums and thought daffodilC should try to support them too. I think your imagination that most hex enums are close in length is probably right anyway.

val (enumsInit, extraInit, primType, valType, fieldArg) = e.optPrimType.get match {
case PrimType.Double | PrimType.Float => (enums, "", "double", "floatpt", field)
case PrimType.HexBinary => (hexEnums, arraysInit, "HexBinary", "hexbinary", s"&$field")
case _ => (enums, "", "int64_t", "integer", field)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I ever checked dates, times, or arbitrary length integers longer than 64 bits. I am sure they would fail somewhere else too since the code generator doesn't support them at all (it won't be able to find the right primitive C type, for example).

@@ -212,6 +212,8 @@ final class SchemaDocument private (xmlSDoc: XMLSchemaDocument)
final override lazy val optLexicalParent = Some(xmlSDoc)
final override lazy val optXMLSchemaDocument = Some(xmlSDoc)

final lazy val version = (xml \ "@version").text
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you change the line to return an Option[String]?

Copy link
Contributor Author

@tuxji tuxji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for the comments, responses below.

<annotation>
<appinfo source="http://www.ogf.org/dfdl/">
<dfdl:defineFormat name="format">
<dfdl:format
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right, adding byteOrder as bigEndian and bitOrder as mostSignificantBitFirst to the format.

limitations under the License.
-->

<!-- Network order binary format: <https://en.wikipedia.org/wiki/Endianness#Networking> -->
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, Network order big endian format.


<!-- Representation property bindings -->
<!-- Network order binary format (net:format) -->
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, saying Network order big endian format (net:format).

</simpleType>
<simpleType name="simple-hexBinaryPrefixed"
dfdl:prefixLengthType="si:prefixedCount"
dfdl:lengthKind="prefixed"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know about this requirement until now. It would be a good idea to implement prefix value checking, but not in this PR (going on vacation soon).

@@ -212,6 +212,8 @@ final class SchemaDocument private (xmlSDoc: XMLSchemaDocument)
final override lazy val optLexicalParent = Some(xmlSDoc)
final override lazy val optXMLSchemaDocument = Some(xmlSDoc)

final lazy val version = (xml \ "@version").text
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhere on WWW is a diatribe on why using namespace URIs with versions in them is so wrong wrong wrong as a versioning means for XSD schemas.

The built-in XML Schema version attribute is orthogonal to and separate from any namespace URI. We plan to take advantage of it to assign semantic versions (strings like "3.6.0") to our filters.

Modify C generator to compare floating point and integer numbers with
enumerations and ranges specified in DFDL schemas.  Test validation
code with additional simple type root elements in simple.dfdl.xsd
schema and additional TDML tests in simple_errors.tdml.

Also allow C generator to compare hexBinary elements to values in
enumerations since Daffodil does the same thing and C generator
already compares hexBinary elements to values in fixed attributes.

Allow C generator to ignore dfdl:assert expressions in DFDL schemas
and generate code anyway instead of throwing an exception.

Allow C generator to get schema version from DFDL schemas and put it
into the generated C code to version the generated code as well.

Found out Daffodil's default alignment properties (alignment="1" and
alignmentUnits="bytes") makes Daffodil append extra fill bits to
elements with odd sizes not a multiple of 8 bits.  Because
compatibility between Daffodil and DaffodilC depends on defining both
dfdl:alignment="1" (default) and dfdl:alignmentUnits="bits"
(non-default), define a known good binary data format in one place
(network/format.dfdl.xsd) and include it in rest of
daffodil-codegen-c's test schemas.

Finally, make changes requested by PR review.

DAFFODIL-2853

BUILD.md: Document how to install iwyu and libcriterion-dev (used only
for Daffodil C code generator development and maintenance).

README.md: Document how to check formatting of or reformat Daffodil.
Also document how to put the Daffodil jars in the maven and ivy
caches.

c/files/.clang-format: Update to clang-format 14.  Format declarations
more concisely by removing AlignConsecutiveDeclarations (this is why
some reformatted C files lose some extra whitespace).

c/files/Makefile: Shorten iwyu's maximum line length from 999 to 111.

c/files/**/**.[ch]: Add "auto-maintained by iwyu" comment to headers.

c/files/libcli/cli_errors.[ch]: Rename CLI_ZZZ, ERR_ZZZ, and FIELD_ZZZ
to CLI__NUM_CODES, ERR__NUM_CODES, and FIELD__NO_ARGS.

c/files/libcli/daffodil_getopt.[ch]: Merge
daffodil_parse_cli and daffodil_unparse_cli structs and objects into
daffodil_pu_cli and daffodil_pu.  Skip and ignore -r and -s options
so one can run "c/daffodil parse" with the same options as "daffodil
parse -r root -s schema -o outfile infile" without having to remove
these options.

c/files/libcli/daffodil_main.c: Merge daffodil_parse_cli and
daffodil_unparse_cli structs and objects into daffodil_pu_cli and
daffodil_pu.  Initialize ParserOrUnparserState fields as well as
PState/UState fields.  Make sure any diagnostics will fail unparse as
well as parse if validate mode is on.

c/files/libruntime/errors.[ch]: Add ERR_ENUM_MATCH and
ERR_OUTSIDE_RANGE error messages to diagnose elements not matching any
of their enumerations or having values outside their allowed ranges.
Rename ERR_ZZZ and FIELD_ZZZ to ERR__NUM_CODES and FIELD__NO_ARGS.
Rename ERR_ENUM_MATCH, ERR_FIXED_VALUE, and ERR_OUTSIDE_RANGE to
ERR_RESTR_ENUM, ERR_RESTR_FIXED, and ERR_RESTR_RANGE. Remove
`daffodil_program_version` since we use `daffodil_version` in
daffodil_version.h instead.

c/files/libruntime/infoset.h: Move common PState/UState fields to
ParseOrUnparseState to allow parse and unparse functions to call
common validate functions.

c/files/libruntime/parsers.[ch]: Use ParserOrUnparserState fields as
well as PState/UState fields.  Replace parse_check_bounds and
parse_validate_fixed functions with validate_array_bounds and
validate_fixed_attribute functions in validators.[ch].  Rename
parse_align and parse_fill_bits to parse_align_to and
parse_alignment_bits.

c/files/libruntime/unparsers.[ch]: Use ParserOrUnparserState fields as
well as UState fields.  Replace unparse_check_bounds and
unparse_validate_fixed functions with validate_array_bounds and
validate_fixed_attribute functions in validators.[ch].  Rename
unparse_align and unparse_fill_bits to unparse_align_to and
unparse_alignment_bits.

c/files/libruntime/validators.[ch]: Move common validation functions
here from parsers.[ch] and unparsers.[ch].  Add new float, hexBinary,
int, and universal validation functions to check that elements match
their allowed enumerations (requires passing array of floating point
numbers, hexBinary structs, or integer numbers) and fit within their
allowed ranges.  Validation functions create diagnostics instead of
errors; the CLI or a caller is responsible for printing the
diagnostics and exiting with the appropriate status code.

c/files/tests/bits.c: Use ParserOrUnparserState fields as well as
PState/UState fields.

c/files/tests/extras.c: Regenerate iwyu comments.

c/DaffodilCCodeGenerator.scala:  Accept but ignore dfdl:assert statements
(for now).

c/DaffodilCExamplesGenerator.scala: Generate code from simple's
all-in-one root element in order to show generated code for all simple
types, enums, and ranges.

c/generators/AlignmentFillCodeGenerator.scala: Rename parse_align and
unparse_align to parse_align_to and unparse_align_to.  Use
ParserOrUnparserState fields as well as PState/UState fields.

c/generators/BinaryBooleanCodeGenerator.scala: Use
ParserOrUnparserState fields as well as PState/UState fields.

c/generators/BinaryValueCodeGenerator.scala: Generate C code to
validate enumerations and ranges of primitive elements.  Get raw
enumeration values into a Seq[String].  Avoid unsigned >= 0
comparisons to prevent gcc warnings.  Use ParserOrUnparserState fields
as well as PState/UState fields.  Call correct function to validate
enums depending on element's primType.  Generate extra C
initialization code for hexBinary enumerations to define arrays of
hexBinary structs and pass them to validate_hexbinary_enumeration.

c/generators/CodeGeneratorState.scala: Remove unnecessary immutable
(built-in Set is immutable).  Use ParserOrUnparserState fields as well
as PState/UState fields.  Rename parse_fill_bits and unparse_fill_bits
to parse_alignment_bits and unparse_alignment_bits.  Get actual schema
version from Daffodil and assign its value to `schema_version` in
generated_code.c.  Declare `schema_version` in generated_code.h.  Call
validate_array_bounds instead of parse/unparse_check_bounds.  Make
cStructFieldAccess work correctly for DFDL expressions like
"/rr:ReqstReply/..." used by NFS schemas.

c/generators/HexBinaryCodeGenerator.scala: Use ParserOrUnparserState
fields as well as PState/UState fields.  Call validate_fixed_attribute
instead of parse/unparse_validate_fixed.

examples/**/generated_code.[ch]: Regenerate to show changes in C
generator such as adding new "auto-maintained by iwyu" comment, calls
to renamed functions such as parse_align_to and unparse_align_to,
defining schema_version, using pu fields, and calling validation
functions.

c/data/simple*.dat: Remove (contents now inside simple.tdml).

c/ex_nums.dfdl.xsd: Define schema version to be "1.0.2".  Include
network format instead of defining own binary format.  Adjust
elements' own format properties as needed to match original data file
(explicitness is better than using defaults).

c/infosets/simple*.xml: Remove (contents now inside simple.tdml).

c/nested.dfdl.xsd: Include network format instead of defining own
binary format.

c/network/format.dfdl.xsd: Define network format in only one place,
making sure to use vitally needed alignment properties but keep the
format as minimal as possible to make it easier to include in a wide
variety of schemas.  Change "Network order binary format" comment to
"Network order big endian format" as requested.  Define bitOrder and
byteOrder explicitly in format schema instead of inheriting them from
DFDLGeneralFormat schema.

c/padtest.dfdl.xsd: Include network format instead of defining own
binary format.

c/simple.dfdl.xsd: Define schema version to be "1.0.0".  Include
network format instead of defining own binary format.  Add new
elements to test enumerations and ranges of primitive types.  Use
custom simple types in order to define simple type root elements,
enumerations of simple type root elements, ranges of simple type
elements, and all-in-one root element as compactly as possible.

c/simple.tdml: Make comments clearer how to run tests and include all
data and infosets in test cases instead of using files.

c/simple_errors.tdml: Make comments clearer how to run tests and add
new test cases to validate enumerations and ranges of primitive types.

c/variablelen.dfdl.xsd: Include network format instead of defining own
binary format.

core/dsom/SchemaDocument.scala: Modify SchemaDocument to capture
schema versions from XML schema definitions and provide to codegen-c.

tdml/TestDaffodilC.scala: Add future-proofing test to check test
schema compiles without any warnings (would catch "relative location
deprecated" warning if DFDLGeneralFormat url was still relative).  Fix
inconsistent use of "dp"/"tdp" and "isError"/"isProcessingError".

c/TestSimpleErrors.scala: Call new test cases in simple_errors.tdml.
@tuxji tuxji force-pushed the ji/2853/validate-enums-ranges branch from 20b9d96 to 604d509 Compare October 22, 2023 18:53
@tuxji tuxji merged commit be3038e into apache:main Oct 22, 2023
@tuxji tuxji deleted the ji/2853/validate-enums-ranges branch October 22, 2023 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants