Proposal: mime types for dc:description? #1650

Closed
acabal opened this issue Apr 27, 2021 · 16 comments
Labels
EPUB33 Issues addressed in the EPUB 3.3 revision Spec-EPUB3 The issue affects the core EPUB 3.3 Recommendation Topic-PackageDoc The issue affects package documents Type-FeatureRequest The issue requests new functionality be added

Comments

@acabal

acabal commented Apr 27, 2021

I searched through the issue archive and didn't find anything about this, so forgive me if this has already been discussed.

In an epub's metadata, it's often desirable to include rich formatting, like HTML, in elements that have a more prose-like format like the epub's description.

Since <dc:description> can only contain plain text, at Standard Ebooks we make <dc:description> a short plain text sentence, and then we add an additional element, <meta property="se:long-description" refines="#description">, which includes a much longer description. This longer description is an HTML fragment, which is escaped since <meta> cannot have children.

Here's an example:

<dc:description id="description">An amateur sleuth visiting a country house solves the mystery of who shot the man in the locked room.</dc:description>
<meta id="long-description" property="se:long-description" refines="#description">
	&lt;p&gt;&lt;i&gt;The Red House Mystery&lt;/i&gt; is a detective novel by &lt;a href="https://standardebooks.org/ebooks/a-a-milne"&gt;&lt;abbr&gt;A. A.&lt;/abbr&gt; Milne&lt;/a&gt;, better known for his children’s writing...&lt;/p&gt;
</meta>

Since we're using a custom property within our own se namespace, it's reasonable to assume that reading systems that know to extract it will also know to expect escaped HTML. No problem.
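
As a rough illustration of that extraction step, here is a minimal Python sketch using only the standard library. The OPF fragment is shortened from the example above, and the function name `read_long_description` is hypothetical. Any conforming XML parser already unescapes the entities, so the reading system ends up with an HTML string that it still has to hand to its own renderer.

```python
import xml.etree.ElementTree as ET

# Shortened version of the Standard Ebooks example above.
OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:description id="description">An amateur sleuth solves a mystery.</dc:description>
    <meta id="long-description" property="se:long-description" refines="#description">
      &lt;p&gt;&lt;i&gt;The Red House Mystery&lt;/i&gt; is a detective novel.&lt;/p&gt;
    </meta>
  </metadata>
</package>"""

NS = {"opf": "http://www.idpf.org/2007/opf"}


def read_long_description(opf_xml):
    """Return the unescaped HTML fragment, or None if the property is absent."""
    root = ET.fromstring(opf_xml)
    for meta in root.iterfind(".//opf:meta", NS):
        if meta.get("property") == "se:long-description":
            # The XML parser has already turned &lt;p&gt; back into <p>.
            return (meta.text or "").strip()
    return None


html_fragment = read_long_description(OPF)
print(html_fragment)
```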

But, would it be valuable to epub in general to be able to specify the mime type of the description, so that publishers could include HTML descriptions that reading systems would know to parse/render as HTML?

For example:

<dc:description>A plain text description.</dc:description>
<dc:description mime-type="text/html">&lt;p&gt;An HTML description.&lt;/p&gt;</dc:description>

Now the dc namespace doesn't include a mime-type attribute, but as far as I can tell (and I may be looking in the wrong place!) neither does the opf namespace include attributes like property or refines.

A possible implementation in the spec might default to assuming element contents to be plain text in the absence of a mime-type attribute, to fall back to plain text rendering if the reading system can't render the mime type, and to only allow some subset of mime types, like (text/plain, text/html, text/markdown, application/xhtml+xml). It might allow multiple <dc:description> elements as long as each has a different mime type.
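
To make those rules concrete, here is a minimal Python sketch of how a reading system might choose a description. Everything in it is an assumption for illustration: the `mime-type` attribute is the one proposed in this issue and is not part of any spec, and `pick_description` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"
# The subset of media types floated in the proposal.
ALLOWED = {"text/plain", "text/html", "text/markdown", "application/xhtml+xml"}

OPF = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:description>A plain text description.</dc:description>
  <dc:description mime-type="text/html">&lt;p&gt;An HTML description.&lt;/p&gt;</dc:description>
</metadata>"""


def pick_description(metadata_xml, renderable):
    """Return (mime_type, text) for the preferred description, or None.

    `renderable` lists the media types this reader can render, in order
    of preference. Plain text is always the fallback.
    """
    root = ET.fromstring(metadata_xml)
    by_type = {}
    for el in root.iter(DC + "description"):
        mime = el.get("mime-type", "text/plain")  # no attribute means plain text
        if mime in ALLOWED and mime not in by_type:
            by_type[mime] = (el.text or "").strip()
    for mime in renderable:
        if mime in by_type:
            return mime, by_type[mime]
    if "text/plain" in by_type:
        return "text/plain", by_type["text/plain"]
    return None


print(pick_description(OPF, ["text/html"]))
print(pick_description(OPF, ["text/markdown"]))
```

A reader that can render HTML gets the HTML description; one that only understands Markdown finds no match and falls back to the plain text one.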

This is not a fully-formed proposal, just something to spark discussion!

@OriIdan

OriIdan commented Apr 27, 2021 via email

@iherman
Member

iherman commented Apr 27, 2021

I am a bit bothered by the fact that the HTML text must be escaped, which makes it very user unfriendly... This is not a problem with markdown, but I am not sure markdown is widely used among epub authors.

What about using the <link> element, which can refer to a separate file and already has a media type attribute? What is missing from the spec is a suitable link relationship; the specification may add a new link relationship for description to cover this. It may also make the job of reading systems easier, because they can fall back on an existing mechanism...
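
A minimal sketch of what that could look like from the reading system side, in Python. The `rel="description"` value and the file path are hypothetical, since no such link relationship exists in the spec; the `href`, `rel`, and `media-type` attributes are the ones `<link>` already has.

```python
import xml.etree.ElementTree as ET

OPF_NS = "{http://www.idpf.org/2007/opf}"

# rel="description" is an assumed, not-yet-defined link relationship.
OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:description>A plain text description.</dc:description>
    <link rel="description" href="meta/description.xhtml"
          media-type="application/xhtml+xml"/>
  </metadata>
</package>"""


def find_description_links(opf_xml):
    """Return (href, media_type) pairs for links carrying the assumed rel value."""
    root = ET.fromstring(opf_xml)
    found = []
    for link in root.iter(OPF_NS + "link"):
        rels = (link.get("rel") or "").split()  # rel is a space-separated list
        if "description" in rels:
            found.append((link.get("href"), link.get("media-type")))
    return found


print(find_description_links(OPF))
```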

@iherman iherman added Topic-PackageDoc The issue affects package documents Type-FeatureRequest The issue requests new functionality be added labels Apr 27, 2021
@OriIdan

OriIdan commented Apr 27, 2021 via email

@acabal
Author

acabal commented Apr 27, 2021

It could also be included as CDATA. I think there's something to be said for having all the ebook's metadata in one location, instead of in various linked files somewhere in the epub. And linking a file would invite publishers to create huge/complex HTML pages for descriptions which I think we probably want to avoid.
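
For what it's worth, the two forms are equivalent to an XML parser, so the choice only affects authoring comfort. A quick Python check with the standard library (the element content is made up for the example):

```python
import xml.etree.ElementTree as ET

ESCAPED = "<meta>&lt;p&gt;An &lt;i&gt;HTML&lt;/i&gt; description.&lt;/p&gt;</meta>"
CDATA = "<meta><![CDATA[<p>An <i>HTML</i> description.</p>]]></meta>"

# Both parse to the same character data; the consumer cannot tell them apart.
escaped_text = ET.fromstring(ESCAPED).text
cdata_text = ET.fromstring(CDATA).text

print(escaped_text == cdata_text)
```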

@iherman
Member

iherman commented Apr 28, 2021

@OriIdan:

I don't think it is a problem for reading systems to unescape the description.

True. My worry was on the EPUB author's side, though, mainly when the description becomes longer. I realize that the idea is to have shorter descriptions also as HTML fragments, but these things may have the tendency of getting longer and longer over time if the facility is there...

@acabal:

It could also be included as CDATA.

True, and that is probably better.

I think there's something to be said for having all the ebook's metadata in one location, instead of in various linked files somewhere in the epub.

Formally, that is already the case. If you include more complex metadata (say, ONIX data) then the idea is to use <link>. (I acknowledge, though, that ONIX data are rarely included in the EPUB file itself.)

And linking a file would invite publishers to create huge/complex HTML pages for descriptions which I think we probably want to avoid.

Probably... but what constitutes huge is unclear and I am not sure where I would draw the line...


Anyway, to be clear, I am not against your proposal, just exploring alternatives. See what others have to say...

@mattgarrish
Member

One of the big hurdles is going to be that reading systems don't generally want author formatting. Just look at the table of contents where we perpetually argue over whether reading systems should be forced to use the formatting that authors put within the nav doc links. If we can't get support there, I'm less optimistic we'll get support within package metadata.

@iherman has pointed out the preferred method for including metadata in an alternative form. That's exactly why we introduced linked records. We could make the linking more granular to specific fields, but if there's a precedent to be learned from the original exercise, it's again that reading systems aren't too interested in metadata expressed in alternative formats.

There's also a danger that making alternative forms look like standard issue metadata will have unwanted effects. If a reading system picks up a description, it may very well render all the escaped markup. It all depends on which dc:description the developer picks out, as this sort of a change in formatting wouldn't be anticipated by any current reading systems.

But I freely admit I've become a bit jaded when it comes to epub metadata! 😄

This might be another issue for the CG to explore. Even if they can't introduce new attributes, exploring the current possibilities and/or rallying some reading system support for a new attribute is going to be necessary before we look at adding this into the core.

@dauwhe
Contributor

dauwhe commented Apr 28, 2021

Is there implementer interest in reading more metadata from EPUBs themselves? Very few reading systems will display more than author and title.

This also makes it difficult to author such metadata, when you can't see how it will display. Anecdotal evidence suggests that even text-only metadata embedded in EPUBs is often incorrect. I can only imagine how much worse it would be with escaped HTML.

Oh, and what Matt said.

@dauwhe dauwhe added the Agenda+ Issues that should be discussed during the next working group call. label Apr 28, 2021
@OriIdan

OriIdan commented Apr 28, 2021 via email

@acabal
Author

acabal commented Apr 28, 2021

One of the big hurdles is going to be that reading systems don't generally want author formatting. Just look at the table of contents where we perpetually argue over whether reading systems should be forced to use the formatting that authors put within the nav doc links. If we can't get support there, I'm less optimistic we'll get support within package metadata.

I wasn't around for that discussion, but the nice thing about having a properly structured ToC (i.e. <nav> with correctly nested <ol>s) is that the RS can decide what to do. It could strip CSS and process the nested structure and you still have a correct ToC to display in the RS's style, or it could keep CSS and just show the page as part of the reading order. I think either option is valid--given that the ToC was done right to begin with!

@iherman has pointed out the preferred method for including metadata in an alternative form. That's exactly why we introduced linked records. We could make the linking more granular to specific fields, but if there's a precedent to be learned from the original exercise, it's again that reading systems aren't too interested in metadata expressed in alternative formats.

My concern there is that if HTML descriptions are allowed as external files, then the long-term conclusion of that is huge JS-enabled beasts with big CSS stylesheets, complex unsemantic structure, and maintenance burden as HTML/CSS change over time--just like what the modern web has "evolved" into.

I think this touches on the ToC issue above. In epub, ToCs are external files that have the full power of HTML/CSS, and as such there's argument on how to render them because there's so much author freedom.

Limiting a rich description to being the contents of a metadata element, and explicitly calling it a "fragment" of HTML with no <head>, stylesheet, or style attributes allowed, could be enough to keep things simple and allow the RS to style the fragment in its own house style. Basically, limiting what can be in there to what might be generated by Markdown. There's no argument on how to render that (well, at least not CommonMark) because it's simple on purpose and not meant to be a full-blown layout language.

If a reading system picks up a description, it may very well render all the escaped markup. It all depends on which dc:description the developer picks out, as this sort of a change in formatting wouldn't be anticipated by any current reading systems.

Perhaps, but that's what mime-type is for--the reading system would inspect the mime type and know whether unescaping is necessary. The spec could require that in the absence of mime-type, the RS default to rendering plain text, which would be backwards compatible with epub < next and puts the burden of spec compliance on the author and not the RS.

The spec could also require that exactly one plain text <dc:description> occur, and that it be the first in document order, before any additional descriptions are included. That would probably cover 99% of backwards compatibility, as presumably RS's that are assuming a single description will pick the first one encountered.
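
A validator could check that ordering rule roughly as follows. This is a sketch under the assumptions of this proposal: the `mime-type` attribute is hypothetical, and `check_descriptions` is an illustrative name, not anything EPUBCheck provides.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"


def check_descriptions(metadata_xml):
    """Return a list of problems; an empty list means the rule is satisfied."""
    root = ET.fromstring(metadata_xml)
    # iter() walks the tree in document order.
    types = [el.get("mime-type", "text/plain")
             for el in root.iter(DC + "description")]
    problems = []
    if types and types.count("text/plain") != 1:
        problems.append("expected exactly one plain text dc:description")
    if types and types[0] != "text/plain":
        problems.append("the plain text dc:description must come first")
    return problems


GOOD = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:description>A plain text description.</dc:description>
  <dc:description mime-type="text/html">&lt;p&gt;An HTML description.&lt;/p&gt;</dc:description>
</metadata>"""

BAD = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:description mime-type="text/html">&lt;p&gt;An HTML description.&lt;/p&gt;</dc:description>
  <dc:description>A plain text description.</dc:description>
</metadata>"""

print(check_descriptions(GOOD))
print(check_descriptions(BAD))
```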

@iherman
Member

iherman commented Apr 28, 2021

Limiting a rich description to being the contents of a metadata element, and explicitly calling it a "fragment" of HTML with no <head>, stylesheet, or style attributes allowed, could be enough to keep things simple and allow the RS to style the fragment in its own house style.

What this amounts to is to define some sort of an HTML "Profile", which is more complex than what one would expect (a precise specification should then define the detailed processing in terms of what the HTML spec defines and, in view of the extreme complexity of the latter, that is not an easy task). I know that we did consider such a route in another Working Group and we quickly shied away from it.

(Note that security considerations would force us to do such profiling: otherwise it would allow adding JavaScript and even event handlers into the description content, which is a security risk.)
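
To give a feel for what even a tiny profile implies for a reading system, here is a deliberately naive allow-list filter in Python, built on the standard library's `html.parser`. The tag list is an arbitrary assumption, and this is an illustration of the idea, not a vetted sanitizer: a real profile would have to be defined against the HTML parsing algorithm itself.

```python
from html import escape
from html.parser import HTMLParser

# Assumed profile: a handful of inline/paragraph tags, no attributes at all.
ALLOWED_TAGS = {"p", "i", "b", "em", "strong", "abbr"}
# Elements whose content is dropped entirely, not just the tags.
DROP_CONTENT = {"script", "style"}


class FragmentFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            # Attributes (onclick, style, href, ...) are never copied over.
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def filter_fragment(fragment):
    parser = FragmentFilter()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)


print(filter_fragment('<p onclick="evil()">Hi <script>alert(1)</script><i>there</i></p>'))
```

Note that dropping all attributes also drops links, which shows how quickly a safe profile starts to shrink the feature.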

Basically, limiting what can be in there to what might be generated by Markdown. There's no argument on how to render that (well, at least not CommonMark) because it's simple on purpose and not meant to be a full-blown layout language.

If we went down the line of enriching the dc:description content, then I wonder whether using markdown (or a subset thereof) is not a better option in the first place. It is essentially plain text, so if an RS does not recognize the media type attribute and displays the content verbatim, it is still less messy than the escaped HTML source...

But the reservations of @dauwhe and @mattgarrish are certainly compelling to me.

@dauwhe
Contributor

dauwhe commented Apr 28, 2021

If we went down the line of enriching the dc:description content, then I wonder whether using markdown (or a subset thereof) is not a better option in the first place. It is essentially plain text, so if an RS does not recognize the media type attribute and displays the content verbatim, it is still less messy than the escaped HTML source...

I see so many complications. You're requiring reading systems to add a markdown parser. You're requiring content authors to learn an entirely new vocabulary, meant to be an authoring aid for, well, HTML. This is adding a lot of complexity to the ecosystem. EPUBCheck would have to somehow write a markdown validator.

@mattgarrish
Member

Cataloging systems also read the EPUB metadata and need the description.

This is why we allow linked records in formats that such systems would expect - marc, mods, onix, etc.

This proposal, as I understand it, is to add formatting for displaying the description, not for improving the cataloguing of it.

I wonder whether using markdown (or a subset thereof) is not a better option in the first place.

Except that means introducing overrides to metadata handling for whitespace. Reading systems are supposed to trim and compact whitespace in metadata elements, but now in the presence of this attribute they'd have to preserve. We can always make more and more rules to make anything work, but in the process the more brittle processing becomes.
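
A small Python sketch of the conflict, assuming a reading system that trims and compacts whitespace as described above (the Markdown sample and the helper name are made up for the example):

```python
MARKDOWN = """A detective novel.

- Locked room
- Country house
"""


def normalize_metadata_whitespace(value):
    """Trim the value and collapse every whitespace run to a single space."""
    return " ".join(value.split())


normalized = normalize_metadata_whitespace(MARKDOWN)
print(normalized)
```

Once the line breaks are gone, the paragraph break and the list structure cannot be recovered, so a Markdown-aware reader would need to skip this step whenever the attribute is present.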

That would probably cover 99% of backwards compatibility as presumably RS's that are assuming a single description will pick the first one encountered.

This is what the community group would need to look at. I believe there are some reading systems that process descriptions (or that's my memory of the epub 3.1 reading system survey), but never assume how. We've also had problems with the wrong authors being processed and you'd assume that would be pretty straightforward, too.

The spec could require that in the absence of mime-type, the RS default to rendering plain text

That wouldn't help legacy systems, though, as it's the change due to the presence of the attribute that they won't recognize.

A lesson learned the hard way in IDPF is that adding new features doesn't translate into quick reading system support (or any reading system support). For W3C, we need to prove uptake of all new features.

I'm not unsympathetic to the limitations of epub metadata, but this would be a pretty radical change. That's why it needs more incubation.

@wareid
Contributor

wareid commented Apr 28, 2021

Ivan already mentioned my main concern with this: security. Yes, we'd require the creation of a profile of HTML to do this, but I would say that, as an RS, this still presents the risk of opening up a security concern within the internal metadata. The profile that could counteract this would likely be highly restricted (i.e. style tags only), and I imagine we'd end up back where this issue stems from if we did that.

@iherman
Member

iherman commented May 7, 2021

The issue was discussed in a meeting on 2021-05-07

List of resolutions:


3. mimetypes for dc:description

See github issue #1650.

Dave Cramer: the use-case here is to provide rich formatting for epub metadata in the package file
… there were various proposals on how to do this (e.g. escaped markup, markdown)
… but I have concerns about all of these
… they increase burden on RS to parse info out of package file
… plus, we already have an alternative to this via linked metadata
… and few RS use metadata other than title and author, not overwhelming demand

Laurent Le Meur: Atom provides the ability to do this
… but the results are often poor

Ivan Herman: See Atom syndication format

Dave Cramer: book descriptions in our ONIX feed use escaped markup
… once found something that was nested 37 levels deep, found JavaScript fragments, found OpenOffice XML fragments
… very messy

Bill Kasdorf: on the trade book side of things, it's not common to embed things in metadata
… although it does tend to happen in other areas of publishing
… perhaps we can leave this to a successor format to epub
… e.g. in educational publishing, Macmillan wasn't even familiar with ONIX

Ivan Herman: for those areas of publishing, would linking to metadata be acceptable?
… that is what we have today

Bill Kasdorf: if that publisher's ecosystem supports that, then yes

Ivan Herman: the only minor thing that came up is that if we say we prefer the linked element solution, then we may want to add ability to use link to more elements

Dave Cramer: I'm also not aware of an RS that supports linked metadata elements

George Kerscher: what about where the book is ingested and that link is used to provide information about that book
… it's not the RS itself that is using that link, but the link is being used elsewhere in the supply chain
… does that pass our implementation test?

Dave Cramer: I think so, but I'm not aware of this happening

Laurent Le Meur: from what I heard, epub metadata is not used in ingestion, they use mostly ONIX

Wendy Reid: some epub metadata is used on ingestion, e.g. epub 3 vs 2, FXL vs reflow
… generally identification and classification of the type of epub

George Kerscher: VitalSource uses epub metadata, RedShelf uses it, CG is working on ways to expose epub metadata, we're working with libraries (EBSCO, Proquest) to expose it

Tzviya Siegman: our research showed that the field that was used most was the identifier, and that was used to correlate to ONIX, for example
… covers were extracted, sometimes the author field
… as far as linked metadata, I was a proponent, but we're not seeing that used in the real world right now, even in scholarly

Bill Kasdorf: all GCA certified publishers are putting accessibility metadata in their epubs

Dave Cramer: agreed that metadata in epubs is useful, but not sure that we should complicate that metadata

Tzviya Siegman: +1 dauwhe

George Kerscher: Micah Bowers is working on a way to create citations and bibliographic references, and that work depends on the metadata in epubs

Ivan Herman: so it seems the WG is not in favour of complicating the way we do metadata in epubs, with the exception of maybe adding a new relationship

Wendy Reid: the new relationship piece can be a new issue or PR, I think

Proposed resolution: Close issue 1650, add new property to linked relationship (Wendy Reid)

Ivan Herman: +1

Ben Schroeter: 0

George Kerscher: +1 add the relationship, then close

Laurent Le Meur: +1 to close, 0 for a new link

Toshiaki Koike: 0

Dan Lazin: 0

Bill Kasdorf: 0

Masakazu Kitahara: 0

Tzviya Siegman: can we clarify what we are adding?

Dave Cramer: https://w3c.github.io/epub-specs/epub33/core/#app-link-vocab

Tzviya Siegman: +1

Dave Cramer: little afraid that this will open a can of worms with other people wanting their own vocab added

Dave Cramer: +1 to close issue, -1 on new link relation

Proposed resolution: Close issue 1650 (Wendy Reid)

Wendy Reid: +1

Dave Cramer: +1

Matthew Chan: +1

Laurent Le Meur: +1

Ivan Herman: +1

Ben Schroeter: +1

Toshiaki Koike: +1

Masakazu Kitahara: +1

Gregorio Pellegrino: +1

Bill Kasdorf: +1

Resolution #2: Close issue 1650

Ivan Herman: should I come up with separate PR about the new relationship?

Dave Cramer: a new issue, I think, where we can further discuss it

@iherman iherman removed the Agenda+ Issues that should be discussed during the next working group call. label May 7, 2021
@iherman
Member

iherman commented May 7, 2021

See also issue raised in #1666

@iherman iherman closed this as completed May 7, 2021
@acabal
Author

acabal commented May 7, 2021

Thank you!

@mattgarrish mattgarrish added the EPUB33 Issues addressed in the EPUB 3.3 revision label May 11, 2021
@mattgarrish mattgarrish added the Spec-EPUB3 The issue affects the core EPUB 3.3 Recommendation label Sep 14, 2022