Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redo XSD Datetime, Date, Time, Duration parser and serializers #2929

Merged
merged 14 commits into from
Oct 16, 2024

Conversation

ashleysommer
Copy link
Contributor

This PR introduces a XSD-specific datetime parser/serializer module that acknowledges there are subtle but important differences between iso8601 and XSD date strings.

RDFLib had been relying on some undocumented (and faulty) parsing behaviour in the old version of isodate that coincidentally allowed parsing of strings that are XSD-compliant but not iso8601 compliant.

The new version of isodate fixes this issue, that means many tests were now failing in RDFLib due to XSD strings no longer parsing (eg, dates with a timezone "2024-01-31+10:00" that are XSD compliant but not iso compliant, the old version of isodate dropped the tz off the end by accident, so it worked fine, the new version does not allow it).

The preferred solution would be to switch to the new datetime.fromisoformat() function in Python stdlib in Python 3.11+, but there are three issues to solve with that plan:

  1. We still need a solution for Python 3.8, 3.9, 3.10 users
  2. The stdlib parser throws similar errors to the new isodate version, it also doesn't support non-standard iso formats.
  3. There is no stdlib replacement for the isodate's Duration class and corresponding duration_isoformat and parse_duration functions.
    • stdlib timedelta doesn't support deltas with years and months components, needed in XSD durations.
    • stdlib timedelta doesn't have to_isoformat() or fromisoformat() formats, needed to parse XSD duration strings (based on iso8601 duration strings).

So this PR addresses all of the above.

  1. A new xsd_datetime.py module is added to RDFLib to house all of this XSD-datetime/duration-specific parsing/serializing logic.
  2. The Duration class and duration_isoformat() and parse_duration() functions are absorbed from isodate into RDFLib.
  3. For Python 3.11+, the dependency on external isoformat library is removed
    • All parsing of XSD_Date, XSD_Datetime, XSD_Time are done by the stdlib fromisoformat() function
    • XSD-specific quirks (eg, Timezones in dates) are handled in the xsd_datetime.py module before calling into the stdlib function.
  4. For Python < 3.11, we now depend on the new v0.7.2 release of isodate
    • All parsing of XSD_Date, XSD_Datetime, XSD_Time are done by the updated isodate utility
    • XSD-specific quirks (eg, Timezones in dates) are handled in the xsd_datetime.py module before calling into isodate.
    • This update removes our last subdependency on six in our dependency tree!
  5. Shortcut serializers are added to xsd_datetime.py based on strftime that can output XSD-specific XSD_Duration and XSD_Datetime strings.

…n, XSD_Date, XSD_DatetTime, XSD_Time, XSD_gYear, XSD_gYearMonth. Based on isoformat for Python <3.11, and builtin fromisoformat for Python 3.11+
@ashleysommer ashleysommer force-pushed the xsd_datetime_parse branch 2 times, most recently from 9aa77bd to e9d5998 Compare October 15, 2024 13:28
@ashleysommer ashleysommer force-pushed the xsd_datetime_parse branch 2 times, most recently from 118642e to d605532 Compare October 15, 2024 13:56
@coveralls
Copy link

coveralls commented Oct 15, 2024

Coverage Status

coverage: 90.299% (-0.3%) from 90.64%
when pulling 56bb776 on xsd_datetime_parse
into cc25f16 on main.

Copy link
Contributor

@edmondchuc edmondchuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

And thanks for providing the background on isodate and the nuances between ISO 8601 and XSD dates, made it very easy to review.

@ashleysommer
Copy link
Contributor Author

Finally after battling with the type checker, the linter, the ruff formatter, the doctring tests, the autodocs generator errors, then the linter again, then the type checker again, then the autodocs generator again, and finally the ruff formatter again, this is now passing all tests.

@@ -167,7 +167,7 @@ def test_function(expression: str, expected_result: Identifier) -> None:
if isinstance(expected_result, type):
assert isinstance(actual_result, expected_result)
else:
assert expected_result == actual_result
assert actual_result == expected_result
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change was made to help the test suite output.
PyTest always assumes when asserting equality that LHS is actual, and RHS is expected, but this test harness compared them the other way around, so PyTest output was backwards that made debugging more difficult.

("P1Y"^^xsd:yearMonthDuration"2019-05-28T12:14:45Z"^^xsd:dateTime)
("P1Y"^^xsd:yearMonthDuration"2019-05-28"^^xsd:date)
("P1Y"^^xsd:yearMonthDuration "2019-05-28T12:14:45Z"^^xsd:dateTime)
("P1Y"^^xsd:yearMonthDuration "2019-05-28"^^xsd:date)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was an error I found in the datetime test function, there was a missing space between the literal values, causing incorrect tests to be run on durations.

n = "Z"
elif n.startswith("UTC"):
# Replace tzname like "UTC-05:00" with simply "-05:00" to match Jena tz fn
n = n[3:]
Copy link
Contributor Author

@ashleysommer ashleysommer Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a required change because previous implementation used tzinfo instances (timezone instances) from isodate, but the new implementation always uses timezones from the stdlib. The SPARQL builtin TZ() will return the "timezone name" if known, in python this uses the tzname() function.
The difference is that stdlib timezones and isodate timezones have different tzname() generation pattern. In isodate an unnamed timezone with "-5H" offset will be have tzname of "-05:00", but in stdlib an unnamed timezone with "-5H" offset will have tzname of "UTC-05:00" (more correct IMHO).

So for consistency and backwards compatibility, this change was added to normalize the tzname output, I tested Jena's TZ() SPARQL function too, and it also outputs "-05:00" like isodate did.

Copy link
Member

@nicholascar nicholascar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just make sure to add yourself, @ashleysommer, to the contributor notes at the top of xsd_datetime.py

@ashleysommer ashleysommer merged commit 9c469b5 into main Oct 16, 2024
22 checks passed
@ashleysommer ashleysommer deleted the xsd_datetime_parse branch October 16, 2024 00:20
amercader added a commit to ckan/ckanext-dcat that referenced this pull request Nov 21, 2024
Rdflib 7.1 introduced changes in [date parsing][1] that made some base
profile tests fail. Basically the previous rdflib versions incomplete
dates like

    <time:inXSDDateTimeStamp
         rdf:datatype="http://www.w3.org/2001/XMLSchema#date">
            1904
    </time:inXSDDateTimeStamp>

were expanded to `1904-01-01`. Of course this is not a valid date and
should be expressed using `gYear`:

    <time:inXSDDateTimeStamp
         rdf:datatype="http://www.w3.org/2001/XMLSchema#gYear">
            1904
    </time:inXSDDateTimeStamp>

and we should be expecting `1904`.
This should play nice with the time properties we are generating in CKAN as
they already handle automatically `gYear`, `gYearMonth`, `date` and
`dateTime`.
Sites importing external DCAT representations that use the wrong
encoding might need to check their parsers.

[1] RDFLib/rdflib#2929
Markus92 pushed a commit to Health-RI/ckanext-healthdcat that referenced this pull request Nov 26, 2024
Rdflib 7.1 introduced changes in [date parsing][1] that made some base
profile tests fail. Basically the previous rdflib versions incomplete
dates like

    <time:inXSDDateTimeStamp
         rdf:datatype="http://www.w3.org/2001/XMLSchema#date">
            1904
    </time:inXSDDateTimeStamp>

were expanded to `1904-01-01`. Of course this is not a valid date and
should be expressed using `gYear`:

    <time:inXSDDateTimeStamp
         rdf:datatype="http://www.w3.org/2001/XMLSchema#gYear">
            1904
    </time:inXSDDateTimeStamp>

and we should be expecting `1904`.
This should play nice with the time properties we are generating in CKAN as
they already handle automatically `gYear`, `gYearMonth`, `date` and
`dateTime`.
Sites importing external DCAT representations that use the wrong
encoding might need to check their parsers.

[1] RDFLib/rdflib#2929
Markus92 pushed a commit to Health-RI/ckanext-healthdcat that referenced this pull request Dec 4, 2024
Rdflib 7.1 introduced changes in [date parsing][1] that made some base
profile tests fail. Basically the previous rdflib versions incomplete
dates like

    <time:inXSDDateTimeStamp
         rdf:datatype="http://www.w3.org/2001/XMLSchema#date">
            1904
    </time:inXSDDateTimeStamp>

were expanded to `1904-01-01`. Of course this is not a valid date and
should be expressed using `gYear`:

    <time:inXSDDateTimeStamp
         rdf:datatype="http://www.w3.org/2001/XMLSchema#gYear">
            1904
    </time:inXSDDateTimeStamp>

and we should be expecting `1904`.
This should play nice with the time properties we are generating in CKAN as
they already handle automatically `gYear`, `gYearMonth`, `date` and
`dateTime`.
Sites importing external DCAT representations that use the wrong
encoding might need to check their parsers.

[1] RDFLib/rdflib#2929
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants