New principle: Discourage polyglot formats #239
Comments
One might consider that a value of JSON-LD over plain JSON is that it provides a means of validating that the described data corresponds to some schema. JSON-LD provides a means to parse the JSON to ensure that it adheres to some normative constraints and, through an RDF interpretation, that its semantic meaning is consistent with the vocabularies in use. Providing a means to specify data without a corresponding way to validate that data, beyond pure JSON structure, does not help producers who want to ensure that such data actually says what it means in a reproducible, platform-neutral way. As it is, HTML itself is widely processed by different kinds of processors (e.g., web browsers and search engines) for different purposes, and the mere presence of the script tag in HTML specifically provides a means of inserting non-HTML data, be it JavaScript or some other data format.
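For example, here is a minimal, hypothetical snippet of that script-tag mechanism in action. Browsers treat a `script` element whose `type` is not a JavaScript MIME type as an inert data block, while search engines parse the same block as JSON-LD:

```html
<!-- Not executed by browsers: script elements with a non-JavaScript
     type are inert "data blocks". Search engines, however, parse
     this block as JSON-LD. -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "An example headline"
}
</script>
```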
"Web developers tend to only test their websites with one web browser and inadvertently introduce code/markup that doesn't work on other browsers..." ... how is this assertion (and solution to the assertion) different than the one being put forward? Aren't the only solutions to this either 1) a monoculture, or 2) testing?
Can you point to the data that shows that this is happening on a regular basis for JSON-LD? Typically what happens in this case is that the developer goes, "oh, my bad, I'll fix that.", and they update their document.

If the W3C TAG would like to go down this path -- and I suggest strongly that it does not; it'll be a waste of everyone's time -- there will be a strong expectation that the W3C TAG bring a solid body of evidence wrt. how the polyglot nature of JSON-LD has resulted in what you're claiming above.

To be clear -- companies deploying JSON-LD tend to deploy BOTH JSON and JSON-LD simultaneously and don't seem to have an issue with doing so. They also interop with other organizations doing so. When we find that someone has a bad `@context`, …
I think there's also 3) interoperability. Admittedly it requires the interoperability to be quite good, but that's the goal of Web Platform Tests.
Does this argument also apply to any language that supports typecasting, or to OAS 3.0, which allows JSON Schema to be expressed as either JSON or YAML?
To which the reply should obviously be: Why did you include the `@context`? Where are your unit tests? Do you also forward …?

If the system is truly polyglot, the lazy developer can just remove the `@context`. Being anti-polyglot-formats seems a bit like being anti "ability to do multiple things well at the same time"... I agree, most folks can't... should we lower the bar, or just make being excellent optional?
I thought about this a bit more, and I think I see the point @hober is making. Let's consider 2 examples of polyglot data formats in action.

**Schema.org and the Google Knowledge Graph Search API**

https://developers.google.com/knowledge-graph/reference/rest/v1

When querying this JSON-LD API, the JSON content is returned as `application/json`. However, the content is valid JSON-LD... Perhaps it would be better if Google did not return this data as JSON-LD, since the content type is JSON... or maybe it would be better to set the content type header correctly to JSON-LD... Most developers unfamiliar with JSON-LD would be surprised by a different content type header, and would have no problem ignoring the `@context`.

This specific case was discussed extensively when we were considering DID Document representations, and is the reason the implementation guide contains this section: https://w3c.github.io/did-imp-guide/#data-model-and-representations

When writing this section, and during the great debate wrt the DID Core abstract data model, I tried contacting the folks at Google a number of times, and never got a response... In the absence of a reply I advocated for handling things the same way Google does, which the working group did not accept unanimously, even though many folks agreed with the approach; some preferred to make …

See https://w3c.github.io/did-core/#representations and the normative requirements... you will note that …

**Mapping Smart Health Cards into the W3C VC Data Model**

Per the spec https://w3c.github.io/vc-data-model/#contexts:

> The value of the `@context` property MUST be an ordered set where the first item is a URI with the value `https://www.w3.org/2018/credentials/v1`.
However, Smart Health Cards claims to implement the standard yet does not follow this requirement. The section on "mapping smart health cards to the VC Data Model" would not be necessary if VCI followed the normative requirements of the spec: https://spec.smarthealth.cards/credential-modeling/#mapping-into-the-w3c-vc-data-model

This means that all Vaccination Credential JWTs are not actually standards-compliant W3C Verifiable Credentials UNTIL they are mapped to that data format... This is a bit like claiming to support USB-C because you can buy a Thunderbolt adapter that supports USB-C. In this case, the polyglot format is even worse, since per the VCI spec:
This means that https://smarthealth.cards#fhirBundle … So this is a custom covid vaccination format using JSON, JSON-LD, FHIR, the VC Data Model, and a custom compression scheme for JWTs which is not described in the VC Data Model.

In this case, it would probably be better to just not claim to conform to the W3C VC Data Model standard; then there would be no need to add all this mostly incorrect JSON-LD, and you could encode the FHIR JSON directly into the JWT claim field, since JWTs are understood to carry arbitrary serialized data that has no registered semantics beyond the reserved terms. This would also help clearly communicate that other VC-JWT implementations, which are clearly not interoperable with smart health cards, are not supposed to be interoperable with them.

Here is some example HealthKit code that relies on this "polyglot" format: https://developer.apple.com/documentation/healthkit/samples/accessing_data_from_a_smart_health_card
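To make the layering concrete, here is an abbreviated sketch of the `vc` claim inside a Smart Health Card payload, based on my reading of the spec linked above (values are illustrative, and the FHIR bundle entries are elided):

```json
{
  "vc": {
    "type": [
      "https://smarthealth.cards#health-card",
      "https://smarthealth.cards#immunization",
      "https://smarthealth.cards#covid19"
    ],
    "credentialSubject": {
      "fhirVersion": "4.0.1",
      "fhirBundle": { "resourceType": "Bundle", "type": "collection", "entry": [] }
    }
  }
}
```

A JWT library, a JSON-LD processor, and a FHIR parser would each claim a different slice of this one document, which is exactly the layering problem described above.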
Please refer to these notes for the context of my previous comment: https://www.w3.org/2021/09/21-did10-minutes.html I incorrectly asserted the link had been removed initially. Here is the link to the TAG review of DID Core, which did contain references to this design principle issue: w3ctag/design-reviews#556 (comment) Was this issue raised after the TAG review, but not shared with the WG? Hard to tell the timing from the comment thread.
Wondering how to move this issue forward... Seems like the problem might be identifying and labeling when a format has become "polyglot"... For example, the first time a spec normatively requires that the same data model be parseable by 2 independent parsers seems like a natural point to first raise the alarm.
Does this issue apply to all uses of structured suffixes? Or only uses of multiple structured suffixes? Or not to structured suffixes at all?
https://datatracker.ietf.org/doc/draft-ietf-mediaman-suffixes/ It would be good to get comments on this IETF draft on the IETF list from people in W3C who think this issue should remain open... or it would be good to close the issue.
FWIW, I asked the lists for clarity on this, and referenced this issue: https://mailarchive.ietf.org/arch/msg/media-types/JxzT03Dhe7Nt8cPAfjbDx3WVQRM/ Hopefully IETF can clarify how this is supposed to work.
I wrote down some additional thoughts the other day.
@hober have you commented on https://datatracker.ietf.org/doc/draft-ietf-mediaman-suffixes/ ? It would be nice to see guidance on creating media types given to the group that manages the media type registries; afaik, W3C is not the keeper of media types, even if we have expertise on how they can interact uncomfortably with the web.
Very interesting read. I agree with a lot of what you write... except for one important premise: I don't think that JSON-LD qualifies as a polyglot format. Here's why. You write (emphasis is yours):
Any JSON-LD processor needs to first parse the document as JSON, and operates on the result of that parsing. It is not a "JSON or JSON-LD" situation, but a "JSON then JSON-LD" situation. How does it differ from any other JSON format? All JSON formats aim to encode things beyond the objects, arrays, numbers and strings that result from JSON parsing. For example, GeoJSON is about points, lines and polygons, which are built from the result of pure-JSON parsing. And yes, I can use a generic JSON tool (such as jq), which knows nothing about points, lines, or polygons, to process my GeoJSON documents and do some useful stuff with them. Does that make GeoJSON a polyglot? @dlongley and @iherman develop similar arguments here and here. Finally, you conclude your post by
I couldn't agree more with the last part of your sentence. And in fact, a JSON-LD context is exactly that, and nothing more: a mapping between some JSON format and an RDF data model.
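A minimal sketch of that point, with a hypothetical document and terms (the context below is nothing more than a lookup table from JSON keys to IRIs):

```json
{
  "@context": {
    "name": "https://schema.org/name",
    "homepage": { "@id": "https://schema.org/url", "@type": "@id" }
  },
  "@id": "https://example.com/#me",
  "name": "Alice",
  "homepage": "https://example.com/"
}
```

A plain JSON consumer sees ordinary keys and strings; a JSON-LD processor applies the mapping and arrives at RDF along the lines of `<https://example.com/#me> <https://schema.org/name> "Alice"`.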
I'd say the key to defining a polyglot media type is relying on multiple structured suffixes. Here is why:

- `type/subtype+suffix` (single data type format)
- `type/subtype+suffix1+suffix2` (polyglot)

Why use multiple structured suffixes unless you want to signal multiple ways to process? If there are multiple ways to process, how do you know that 2 parties using 2 different processing schemes come to the same conclusion? Instead of solving this problem, maybe don't create it in the first place.
@OR13, you write:

> `type/subtype+suffix` (single data type format)
Is it really a single datatype? Reading Section 2 of RFC 6839, which defines structured syntax suffixes (and in particular the `+json` suffix), …
The whole point of structured syntax suffixes, single or multiple, is to "signal multiple ways to process". Following your logic, we should abandon syntax suffixes altogether.
I said something similar up this thread, a long time ago, and on the lists. I do think polyglot data formats and multiple suffixes are 2 sides of the same coin... And just because a design is dangerous doesn't mean it's bad all the time. I think there is a difference between claiming a structured suffix supports generic processing and claiming it supports 3 or 5 or 26 equivalent representations of the same information. Seeing `+json` doesn't signal anything other than JSON. Seeing `+ld+json` signals RDF and JSON. Why stop at 2, though? Is this design principle limited to only 2? How much will it cost implementers to use the data format correctly as the number increases? Design principles should apply generically (not just to specific technologies like XML), and they should address scale. I'd like to see the design principle updated to cover multiple suffixes generally... JSON-LD won't be the last time this comes up.
Pardon my nitpick, but re:

> Seeing `+ld+json` signals RDF and JSON.
is one way of putting it. Others can correct me, but I see it signalling JSON-LD (re "JSON then JSON-LD"), where JSON-LD is a concrete RDF syntax. It doesn't signal other concrete RDF syntaxes. RDF is a (constructed) language.
Several of us on the TAG discussed this in the context of #453 and a few things popped out. Most pertinent to this was the idea that a single format can be confused about which model it produces. XML was defined in terms of the infoset but is often processed into a DOM. This dualism turns out to be dangerous, as it means that the same document can produce subtly different interpretations in applications.
I suggest that the W3C TAG involve communities affected by this discussion as you deliberate, namely, any community that has specified a suffix for their media type (+jwt, +xml, +json, etc.), which, AFAICT, (arguably) defines a polyglot format. This issue is now being misrepresented as a position of the TAG: ietf-wg-mediaman/suffixes#23 (comment)

I realize that @hober published the "Polyglot formats" document as an individual, and not as a TAG member or Apple representative, but it's being represented as "a TAG thing" in mailing list discussions... and this is prompting individuals in WGs such as the VCWG, DIDWG, MEDIAMAN, and the JSON-LD WG to protect against side effects that the TAG would cause by moving forward with a recommendation to discourage polyglot formats. These mitigations include abandoning suffixes, not because it's the right technical decision, but because of the uncertainty that this issue is creating across (at least) the WGs listed above.
This is a generalization of @dbaron's concern in #128.
Polyglot formats tend to lead to interoperability problems, so we should discourage defining them.
Polyglot formats are formats which are defined such that they can be processed by two or more different kinds of processors with roughly equivalent results. For instance, it's possible to write a computer program that is simultaneously valid C and valid C++. Another example is Polyglot Markup, an abandoned attempt to define a markup syntax that was simultaneously valid HTML and XHTML, and whose documents would produce roughly equivalent DOM trees when parsed with an HTML parser or an XML parser.
Authors tend to test their documents with only one kind of processor, so they inadvertently introduce errors which would only be caught by the other kind of processor. In the case of Polyglot Markup, this happened when authors introduced XML errors into their documents but only tested with an HTML parser. Consumers using an XML parser would, instead of seeing the document, see an XML parser error screen.
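A minimal sketch of that failure mode, using a hypothetical fragment: an HTML parser handles this happily, but an XML parser rejects the document outright, because `&nbsp;` is not one of XML's predefined entities and the `<br>` element is never closed.

```html
<!-- Fine for an HTML parser; two fatal errors for an XML parser:
     an undefined entity (&nbsp;) and an unclosed element (<br>). -->
<p>Price: 10&nbsp;EUR<br>VAT included</p>
```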
If the polyglot format contains fields that are only used by one kind of processor, such fields are likely to experience bit rot when authors only routinely test their documents with the other kind of processor. (For instance, if authors routinely use JSON parsers to test their JSON-LD, the `@context` section is likely to experience bit rot. Downstream consumers of the document who use a JSON-LD processor will start encountering bugs. They'll report it upstream, and that person will say "it works for me, it must be a bug in your software".)
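A minimal sketch of that bit-rot scenario, using a hypothetical document: both fields below look fine to a JSON parser, but `"nickname"` has no mapping in the `@context`, so a standard JSON-LD processor silently drops it during expansion. The author who tests with a JSON parser sees no problem; the consumer who uses a JSON-LD processor loses data.

```json
{
  "@context": { "name": "https://schema.org/name" },
  "name": "Alice",
  "nickname": "Ali"
}
```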