Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Are data URLs always disallowed? #1564

Closed
mattgarrish opened this issue Mar 9, 2021 · 17 comments · Fixed by #1582
Closed

Are data URLs always disallowed? #1564

mattgarrish opened this issue Mar 9, 2021 · 17 comments · Fixed by #1582
Labels
EPUB33 Issues addressed in the EPUB 3.3 revision Spec-EPUB3 The issue affects the core EPUB 3.3 Recommendation Topic-PackageDoc The issue affects package documents

Comments

@mattgarrish
Copy link
Member

The question of whether a data URL is allowed in an a tag came up in the epubcheck tracker.

My understanding is this is disallowed because you can't navigate to non-spine resources.

I've also understood that embedding resources using data URLs is forbidden since they're also not listed in the manifest.

Assuming both these impressions are correct, we should probably explicitly note somewhere for extra clarity that the data: scheme cannot be used.

@iherman
Copy link
Member

iherman commented Mar 9, 2021

Assuming both these impressions are correct, we should probably explicitly note somewhere for extra clarity that the data: scheme cannot be used.

I fully agree with this statement. But the question is whether the premise should be true or not.

What happens if, for example, one uses a data URL in a CSS file (it is, in fact, a fairly useful trick, see in the example below)? Consistency would require to make that invalid, too. I would rather consider having an exception for data URL-s not in the sense of forbidding them, but more in terms of dropping the requirement that their corresponding resources should be listed in the manifest or spine. (I guess the question is whether this would create issues with RSs.)


Here is a, I think, typical usage of data url-s in CSS: adding some sort of a watermark to a page:

body {
   background-image: url("data:image/svg+xml;utf8,<svg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 150 100' width='100%' height='50%' opacity='.06'> <g transform='translate(73,45) rotate(-40)'> ... </g> </svg>");
   background-position: top center;
    background-repeat: repeat;
}

@gregoriopellegrino
Copy link
Contributor

I think that also for bitmap images (attribute src with data) works on most reading Systems, for example:

<img src=" ...">

@mattgarrish mattgarrish added the Topic-PackageDoc The issue affects package documents label Mar 9, 2021
@mattgarrish
Copy link
Member Author

but more in terms of dropping the requirement that their corresponding resources should be listed in the manifest or spine

Well, I wouldn't go quite this far. I think it's okay to embed resources using data URLs since we can still check whether they are core media types or not (we could limit the use to CMTs).

But allowing top-level navigation to them seems like a security issue. Aren't browsers beginning to restrict this, too?

It also gets us back into the reason why navigating to any resource not in the spine has been disallowed (i.e., reading systems have no reference to where to take the user next).

@iherman
Copy link
Member

iherman commented Mar 9, 2021

But allowing top-level navigation to them seems like a security issue. Aren't browsers beginning to restrict this, too?

I have just checked, and that is correct. The error console in the browser says:

Not allowed to navigate top frame to data URL:...

The question is how we would specify where data URL-s are allowed and where they aren't. I have not found (yet) any explicit statement about in the HTML standard which we could refer to :-(

I have found this on the MDN page:

Note: Data URLs are treated as unique opaque origins by modern browsers, rather than inheriting the origin of the settings object responsible for the navigation.

And there are some other security related issues in the text.

Would be good to have some feedbacks from RS people...

@xworld21
Copy link

(I hope you don't mind the intrusion!)

I think these are two orthogonal issues, one is whether "embedded resources" (in CSS, link, but in principle even an iframe or an object) should be accepted in the spec, the other is how to treat navigation links <a href="data:...">.

The trouble I have found with navigation links is that different software interprets it differently, and it's all very buggy. I have done some tests with different apps and a physical e-reader using an <a href="data:..."> link containing an XHTML page (with this file). My e-reader behaved sensibly, and asked me to launch the browser. One Android app tried to launch the link but complained that there was no handler in the system. All of the other apps I have tried (Android and macOS) either opened the link as a different page (and got lost and/or did bizarre things), crashed, or deactivated the link. Bugs notwithstanding, the two behaviours (launching the browser vs opening the link within the book) are not comparable, so right now it's pretty hard to create an EPUB that can use data URL links in a meaningful way.

Embedded resources on the other hand seem non-ambiguous in terms of how they should be used.

For reference, this was the discussion about disabling navigation to data URLs in Chrome. There are some use interesting use cases for not killing the links altogether, some could apply to an eBook (for instance an author might want to instruct the users to copy and paste the links in the browser manually).

@iherman
Copy link
Member

iherman commented Mar 16, 2021

(I hope you don't mind the intrusion!)

Absolutely not. On the contrary, we welcome any relevant comment from the community, so, please, keep them coming! Thank you.

@iherman
Copy link
Member

iherman commented Mar 16, 2021

@xworld21 your separation of concepts is very helpful.

  • I would expect that disallowing top level navigation links to a data URL would probably be o.k. By disallowing data URLs in the spine explicitly (currently it is not, so it is, spec wise, possible to repeat a data url in the spine) we get there and this would probably be in line with current RS deployment. I would not think it would break any existing EPUB
  • What is indeed unclear is what should happen with embedded resource references, like img, or svg in a css file. @mattgarrish I believe that, in theory, those resources are supposed to be listed in the manifest, aren't they?

@mattgarrish
Copy link
Member Author

I believe that, in theory, those resources are supposed to be listed in the manifest, aren't they?

Yes, that's why I pointed out at the start that any use of data: URLs is currently invalid per the specifications. But...

I think it's okay to embed resources using data URLs since we can still check whether they are core media types or not (we could limit the use to CMTs).

Because data: URLs identify their media type, we can still check whether they are CMTs. Since you can't provide a fallback to a CMT using data: URLs, it would be invalid to embed a foreign resource this way. (The absence of a media type would also be invalid since text/plain is not a core media type.)

@iherman
Copy link
Member

iherman commented Mar 16, 2021

Because data: URLs identify their media type, we can still check whether they are CMTs. Since you can't provide a fallback to a CMT using data: URLs, it would be invalid to embed a foreign resource this way. (The absence of a media type would also be invalid since text/plain is not a core media type.)

That is fine but, per the spec today, using a data URL in a, say, <img> would be acceptable only if the same URL is listed in the manifest, right? Which is, sort of, meaningless, it beats the purpose of a data URL...

I may miss something.

@xworld21
Copy link

Since you can't provide a fallback to a CMT using data

I suppose the same rules for Foreign Resources would apply, i.e. when an element offers intrinsic fallback, then one MAY use data: URLs of non-CMT?

While if there is no fallback mechanism (e.g. CSS does not do fallback), then the data has to be CMT, because you wouldn't be able to use a manifest fallback. Or to be pedantic, you could by duplicating the URL in the manifest, but as @iherman says, that would make using data URLs pointless.

  • By disallowing data URLs in the spine explicitly

Just a clarification (for me): would that imply that data URLs in <a href=""> must be treated as remote resources (like external websites), or that they must be banned altogether?

@mattgarrish
Copy link
Member Author

I suppose the same rules for Foreign Resources would apply, i.e. when an element offers intrinsic fallback, then one MAY use data: URLs of non-CMT?

Yes, I was thinking specifically of where manifest fallbacks are the only means of providing an alternative, like with img or CSS.

would that imply that data URLs in <a href=""> must be treated as remote resources (like external websites), or that they must be banned altogether?

I'd expect they'd be banned in both cases, but it would be good to hear from some reading system devs on this. Spawning a browser instance is akin to doing a window.open, so it sounds equally problematic to opening within the reading order (i.e., you still end up with a top-level browsing context).

@iherman
Copy link
Member

iherman commented Mar 16, 2021

would that imply that data URLs in <a href=""> must be treated as remote resources (like external websites), or that they must be banned altogether?

I'd expect they'd be banned in both cases, but it would be good to hear from some reading system devs on this. Spawning a browser instance is akin to doing a window.open, so it sounds equally problematic to opening within the reading order (i.e., you still end up with a top-level browsing context).

I would agree with this. I guess most of the security issues raised by data URL-s are related to this usage anyway.

@iherman
Copy link
Member

iherman commented Mar 16, 2021

I was wondering how to make a minimal change on the content document to allow for the reasonable usage of data URLs (see #1564 (comment) or #1564 (comment)) but avoiding the possible security issues that we listed. What about doing these two changes in the spec:

  1. data urls are explicitly listed as possible “external” resources (alongside, e.g., audio or video resources) in §2.2.2 Resource Locations. (I realize that the term “external” is a bit of a misnomer in that case.)
  2. data urls are explicitly disallowed in <a> elements for SVG and XHTML content elements. This means adding an extra item in §3.1.2. XHTML Requirements and in §3.2.2 SVG Requirements

In my (possibly erroneous reading) of the spec (1) means that there is no obligation to add the data url as part of the manifest, because it is not considered to be “located” in the EPUB Container; and (2) puts an extra constraint that avoids a problem with navigation links (but it is o.k. to use data urls as “embedded resources”).

How does that sound?

@mattgarrish
Copy link
Member Author

I realize that the term “external” is a bit of a misnomer in that case.

Ya, I'm not sure that's where I'd be looking for information about data URLs. The resource is technically following the existing rule of being located in the container; it's just embedded directly within another resource.

My first impression is that if go this route we should specifically explain data URLs in a new subsection under Section 3 since they're relevant to most of the other subsections there. That and/or cover them in the manifest item section as an exception to the requirement to list every resource.

But this likely to take a lot of smaller changes, too. It's probably going to affect the definition of publication resource, the foreign resource requirements, etc.

@iherman
Copy link
Member

iherman commented Mar 17, 2021

+1 to having a separate, explicit text. I am not sure how it would affect the other sections...

@iherman
Copy link
Member

iherman commented Mar 19, 2021

The issue was discussed in a meeting on 2021-03-18

List of resolutions:

  • Resolution No. 2: embedded Data URLs are OK, but data URLs in <a> are forbidden; reading systems must not navigate to such URLs.
View the transcript

3. Data URLs

See github issue #1564.

Dave Cramer: where should data URLs be allowed in epub?
… seems to make sense that they shouldn't be allowed in <a> that are not spine elements
… there are some interesting uses of data urls in CSS (e.g. for backgrounds)
… so where are these okay, and where shouldn't they be okay?

Matt Garrish: if we allow them to be embedded, basically, we still can check if they are core media types
… so we still have a sense of what they are
… if they aren't core media types, you can't really provide a fallback to data URL, so they would be disallowed that way
… i have no particular issue with the embedding part
… but the <a> part opens security issues

Dave Cramer: so is it good enough to forbid the <a> usage, but to allow as embedded resource? Would that break epub check?

Matt Garrish: that would be possible

Brady Duga: i don't think we spec away the security issues here
… i agree that not putting data urls in your <a> is good, and that RS should ignore them and not do anything
… if we do spec something, what do we tell RS to do?
… this seems more like a comment to authors that they shouldn't do this thing

Dave Cramer: the problem is that i want this in epub check, and to do that we need the spec to say something
… by putting it in epub check many fewer RSes will have to deal with this at all
… can we just say that "RS must not perform navigation"?

Matt Garrish: do we have to spec out what the RS does?

Brady Duga: i'm okay putting this in just for epub check purposes...

Dave Cramer: and i'm okay writing a test for it, e.g. an RS fails if it tries to follow the <a>
… can we resolve on this?

Proposed resolution: embedded Data URLs are OK, but data URLs in <a> are forbidden; reading systems must not navigate to such URLs. (Dave Cramer)

Wendy Reid: +1

Toshiaki Koike: +1

Brady Duga: +1

Masakazu Kitahara: +1

Matthew Chan: +1

Dave Cramer: is the link element an issue here?

Matt Garrish: i wouldn't think that it would matter, because there is no requirement for the RS to do anything

Ben Schroeter: 0

Brady Duga: what about in SVGs?

Wendy Reid: we'll do some research

Matt Garrish: i can take a look at it, yes

Resolution #2: embedded Data URLs are OK, but data URLs in <a> are forbidden; reading systems must not navigate to such URLs.

@iherman
Copy link
Member

iherman commented Mar 26, 2021

The issue was discussed in a meeting on 2021-03-26

List of resolutions:

View the transcript

2. Data URLs PR

See github pull request #1582.

See github issue #1564.

Matt Garrish: this is from what we discussed last week's call
… where do we allow data URLs, etc.
… browsers are starting to disallow navigation to data URLs (to circumvent security issues)
… so last week we said we would follow that path, but still allow embedding content via data URLs
… plus we were going to say that we wouldn't allow data URLs as spine docs
… how to phrase this was kind of complicated
… the language uses "top level browsing"
… but some RSes don't use a browser based model
… so we say that in those cases, they still have to treat the data URLs similar to the way that a browser would

Brady Duga: its hard to get the language right so that we don't disallow things by mistake, or allow things by mistake
… e.g. when an RS wants to provide the ability to let user pop out content embedded via data URL in a separate window
… wonder if there is another group working on specifying how this should work?
… or are we on the bleeding edge here? Would rather not be the ones who set this standard

Matt Garrish: this very much seems to be raised and discussed in the browser realm
… potentially a part of the HTML spec
… but nothing I can point to specifically

Ivan Herman: this PR does a lot that is not contentious, i think
… i propose to merge PR as it is, subject to the nitty gritty about the exact terminology
… but then open an issue saying that this terminology still has to be checked with TAG and HTML community
… and link to our language
… that way we specifically flag this for TAG input
… the new issue would be more focused
… and in the meantime we capture everything else in the PR

Matt Garrish: in that light, do we maybe want to keep the authoring side requirements, but leave out the RS expectations side?

Ivan Herman: pref trying to do something now, even with it being incomplete, raise a separate issue about the terminology, link that issue to the text and ask for external help, e.g., from the TAG.

Brady Duga: would refer this to TAG, yes

Ivan Herman: also, let's mark issues as explicitly "referred to TAG for input"

Dave Cramer: i'm okay with merging for now, on understanding that this is not final final

Proposed resolution: Merge PR 1582, open issue for TAG review (Wendy Reid)

Ivan Herman: +1

Wendy Reid: +1

Matt Garrish: +1

Charles LaPierre: +1

Matthew Chan: +1

Brady Duga: +`

Brady Duga: +1

Bill Kasdorf: +1

Resolution #2: Merge PR 1582, open issue for TAG review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
EPUB33 Issues addressed in the EPUB 3.3 revision Spec-EPUB3 The issue affects the core EPUB 3.3 Recommendation Topic-PackageDoc The issue affects package documents
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants