What base URLs to use for URL parsing in EPUB? #1888
Solution 0: Current situation
Description
The URL of the root directory is largely undefined, and the URL of the package document is defined (for zipped EPUBs) as the URL of the ZIP plus a fragment.
Examples
See the problem statement above.
Features
Solution 1: Leave the definition of base URLs up to the Reading System
Description
Remove what the spec currently says about how to obtain the Package URL. Leave it up to Reading Systems to define the base URLs of the Root Directory and the Package Document. This is only a mild improvement to the current spec. It doesn't prevent the existing flawed approach.
Examples
Anything can happen, depending on the implementation-defined approach.
Features
Solution 2: Use a reserved special URL as the container root's base URL
Description
The URL of the container root is a well-chosen special URL. Possibly, the host can include a string unique to each EPUB instance. One downside is that parsed URLs look like Web resources, but they are not. If we control the domain, we can ensure it will not conflict with actual resources.
Examples
The URL of the root directory of an EPUB is arbitrarily defined as
Features
Solution 3: Use a proprietary non-special scheme
Description
Define an EPUB-specific URL scheme (for example One downside is that registering a new scheme may not be a good practice (scheme squatting).
Examples
The URL of the root directory of an EPUB is defined as
Features
Solution 4: use a
# | URL string | Base URL | Resulting URL |
---|---|---|---|
1 | / | file://1234/ | file://1234/ |
2 | doc.xhtml | file://1234/EPUB/package.opf | file://1234/EPUB/doc.xhtml |
3 | doc.xhtml | file://4242/EPUB/package.opf | file://4242/EPUB/doc.xhtml |
4 | ../../../secret | file://1234/EPUB/package.opf | file://1234/secret |
5 | /secret | file://1234/EPUB/package.opf | file://1234/secret |
Features
unambiguous | contained | unique | origin-safe |
---|---|---|---|
Yes ✅ | Yes ✅ | Yes (almost) ✅ | Not-defined (unique opaque origin is recommended) ❌ |
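The rows of the table above can be reproduced with a few lines of Python. This is only a sketch: `urllib.parse.urljoin` follows RFC 3986 rather than the URL Standard's parser, so it is an assumption that the two agree on these simple inputs (a conforming URL Standard parser may, for instance, treat an all-digit host such as `1234` differently). The helper name `resolve` is made up for the illustration.

```python
from urllib.parse import urljoin

# (URL string, base URL) pairs copied from the table above.
ROWS = [
    ("/", "file://1234/"),
    ("doc.xhtml", "file://1234/EPUB/package.opf"),
    ("doc.xhtml", "file://4242/EPUB/package.opf"),
    ("../../../secret", "file://1234/EPUB/package.opf"),
    ("/secret", "file://1234/EPUB/package.opf"),
]


def resolve(url_string: str, base: str) -> str:
    """Resolve a relative URL string against a base URL (RFC 3986 rules)."""
    return urljoin(base, url_string)


if __name__ == "__main__":
    for url_string, base in ROWS:
        print(f"{url_string:18} + {base} -> {resolve(url_string, base)}")
```

Rows 4 and 5 show the "contained" property: excess `..` segments and a leading `/` both stop at the root of the per-EPUB host instead of escaping it.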
Solution 5: use a special syntax for ZIP entries URLs
Description
The URL of container resources is defined as This is not a standard practice, but there is precedent (like
Examples
The URL of the root can be for example
Features
Tip: you can use this standard-conforming live URL viewer to check the result of parsing a URL string with a base URL. |
One thing is that usually epub publications have internal metadata that uniquely identify the book. This is why searching for an ePub publication, even in urls, should be in a query form IMHO. Also there is the ! redirection symbol that could be useful, as a standard means of showing that some "jump" has to be done from that point, even with a different method of retrieval. For example, the [chapter01]! inside the epubcfi fragment means that an explicit "jumping" operation has to be done with the methods that are available on the reader or the system, like loading the effective file with corresponding idref. I would like to point out that the "active" nature of the redirection operation could be useful when changing "domain", like from filesystem to zip, or to package, to chapter, to paragraph. |
Solution 6: use a
# | URL string | Base URL | Resulting URL |
---|---|---|---|
1 | / | http://localhost:49152/ | http://localhost:49152/ |
2 | doc.xhtml | http://localhost:49152/EPUB/package.opf | http://localhost:49152/EPUB/doc.xhtml |
3 | doc.xhtml | http://localhost:50505/EPUB/package.opf | http://localhost:50505/EPUB/doc.xhtml |
4 | ../../../secret | http://localhost:49152/EPUB/package.opf | http://localhost:49152/secret |
5 | /secret | http://localhost:49152/EPUB/package.opf | http://localhost:49152/secret |
Features
unambiguous | contained | unique | origin-safe |
---|---|---|---|
Yes ✅ | Yes ✅ | Yes (limited to 16k instances per machine) ✅ | Yes ✅ |
It seems this is slightly off topic for the current issue: we're not trying to define what a URL to an existing EPUB should look like. This is not our responsibility. We're trying to identify if we can define the URL of the root directory of the EPUB OCF container in such a way that parsing relative URL strings becomes unambiguous.
I understand this is similar to solution 5 defined above?
I'm not sure I fully understand your proposal. Maybe examples would help? 😊 |
You seem to want to merge the internal "world" of an ePub publication with the URL standard, but this can be obtained only with an explicit role of the ! redirection symbol, IMHO. The quoted text clearly states that a unique solution is not available, and it would be preventing a lot of uses in fact. And also it is not uniquely defined on systems themselves, for example there are https:// or file:/// that can both be used in a browser, or in an application. I think our two issues are not unrelated, and yes it is necessary to create a standard scheme to open the ePub, view or to access its internal content, it seems that we are saying the very same thing. The virtual nature of the nodes that are in a URL demands a generalization. A book could be identified by a URN but all happens today on the internet or devices so URL does make sense. It seems to demand the introduction of the "virtual nodes" of domain specification with the ! symbol. Because even if some implementations could yield the internal content when a complete https:// URL is used, still it is the system that decides to allow that "jump", as if it were an internet resource access. For example my app does not do that. Also my app is on a device, not on the internet. eBooks are possibly located in different locations, according to what system can search, deliver or view them. The redirection symbol ! could be useful to solve the conundrum of different uses of the URLs, different origins and destinations, asking the system to find or open the virtual root that best suits the user's request according to the system features. A standard form should allow the retrieval of the ePub with definite metadata, like isbn or HTML-encoded title, and epubcfi, even in the case the user just wants to open it in any app that's available on a device. It's like a sort of search function that is embedded in the URL, because in most cases the user has an app where the ePub can be found in the library.
So the common case would be a legit and simple URL with a scheme that informs the device an app has to be launched to open that ePub at that cfi location, or that it has to be searched on a webservice, or directly opened in a server folder on the internet. These cases have increasingly specific URLs as to host, authority and path. It seems that there are some differences between the major OSs, like Android and iOS just to mention, so https:// is sort of universal. But of course there are other simple or complex forms of URL that can be used. First and foremost the concern about the authority and host parts of a URL should be addressed. It should allow both a specific app and a generic app to handle the request. I put down many URL forms, like for example: |
What if we replace the fragment identifier part with the ZIP path, so we end up with something like: "When an EPUB Publication is zipped, the base URL of the Package Document is obtained by combining the base URL for the EPUB Container with the ZIP path to Package Document. The resulting absolute URLs obtained by combining the Package Document's base URL with the relative URL references in the Package Document MUST resolve at or below the base URL for the EPUB Container." Would that still require us to get into schemes? Or am I missing another problem? |
The main problem with this approach is that we enable Schrödinger’s EPUBs, which are both conforming and not-conforming at the same time, until a processor decides how it effectively combines the EPUB URL and package path. EPUBCheck cannot tell if an EPUB is conforming. An author cannot tell if their EPUB will work in an interoperable manner. |
What decision making is left, though? If you treat the path to the package document as a relative URL instead of a fragment identifier then what complicates generating an absolute URL from that? I get that having the EPUB file in the base URL makes a mess, but what if the base URL excludes that on the assumption that you're unpacking that file into the directory where it's currently located?
So from: https://example.org/acme/mobydick.epub
The base URL of the EPUB is: https://example.org/acme/
Then from parsing 'https://example.org/acme/' with 'EPUB/package.opf' you get https://example.org/acme/EPUB/package.opf
From that the base URL of the package document is 'https://example.org/acme/EPUB/'
And with that why can't you continue to check the relative URLs in the package document to ensure they are at 'https://example.org/acme/' or below? There's still the problem of the root directory of the EPUB being virtual, but that's something we can only warn reading systems about: if they don't maintain it they may allow references to leak outside the EPUB. |
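As a rough sketch of the derivation just described (not anything the spec defines), the steps can be written out in Python. The function names are hypothetical, `urljoin` applies RFC 3986 resolution rather than the URL Standard's parser, and the "at or below" test is assumed here to be a plain string-prefix comparison.

```python
from urllib.parse import urljoin


def container_base(epub_url: str) -> str:
    """Directory URL the EPUB is imagined to be unpacked into."""
    return urljoin(epub_url, ".")


def package_url(epub_url: str, package_path: str) -> str:
    """Treat the ZIP path of the Package Document as a relative URL."""
    return urljoin(container_base(epub_url), package_path)


def is_at_or_below(url: str, root: str) -> bool:
    """Naive containment test: prefix match on the serialized URLs."""
    return url.startswith(root)


if __name__ == "__main__":
    epub = "https://example.org/acme/mobydick.epub"
    root = container_base(epub)
    package = package_url(epub, "EPUB/package.opf")
    for ref in ("nav.xhtml", "../video/cat.mp4", "../../secret"):
        resolved = urljoin(package, ref)
        print(ref, "->", resolved, is_at_or_below(resolved, root))
```

With these assumptions `../../secret` is rejected, but `../video/cat.mp4` resolves to `https://example.org/acme/video/cat.mp4` and passes the prefix test, even though nothing says such a file is part of the EPUB.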
This should be enough to dismiss this approach, because you are mixing two "worlds". Outside an EPUB container there is "nothing", it is not connected with other files that are around in that directory where that ePub and other ones were unpacked.
I think the root directory is just a virtual limit, cannot be combined with an external path, even if some implementations just unpack the folder and then the content is internally accessed as files in a folder. But it cannot be a specification. Instead it should be avoided and forbidden. This is why I asked for a way to access the ePub content that is modern and flexible with special "virtual node" features for those systems that can be compatible with them. |
@mattgarrish wrote:
Should I worry that this means that https://example.org/acme/mrsdalloway.epub (and every other EPUB on example.org) would be same-origin with Moby-Dick? |
It seems I didn't understand your proposal. I thought it was intentionally open to any kind of combination. But instead you're saying that the URL of the Package Document is the result of applying the URL parser to the path of the Package Document relative to the root directory, with the URL of the EPUB publication as the base URL. Correct? I can see several issues with that. See below.
OK for the examples. Just some nitpicking to be perfectly clear: speaking of base URL of a document is not standard terminology.
One issue is editorial. It may not be easy to formally specify "being at some URL or below". Another issue is that this approach is not contained (as defined in the problem statement) and can potentially create conflicts. To take your example with the EPUB at I'm not even considering URLs that leak outside the container. Even if we put some spec work to forbid it, authors will do it so we have to handle that case in the RS spec. Finally, the approach is not unambiguous (as defined in the problem statement).
In fact, can we even assume that an EPUB publication has a URL? 🤔 (serious question). |
In fact the approach pretty much shares the characteristics of solution 5 (the "zip URL + !" approach). Two EPUBs can be same-origin indeed. I'd be interested to hear more about what people think about the objectives listed in the problem statement. So I second @dauwhe's question 👀😊. |
I kind of like the localhost with unique port idea, largely because it also gives us the origin properties we want--an EPUB is same-origin with itself, but cross-origin with all other EPUBs. Since it's in some sense fiction, do we need to worry about only having the 16k ports? |
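The origin property mentioned here can be illustrated with a small sketch. It assumes the usual (scheme, host, port) tuple origin for `http` and `https` URLs and does not model opaque origins; `origin` and `same_origin` are names invented for the example.

```python
from urllib.parse import urlsplit

# Default ports, so that an omitted port compares equal to an explicit one.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str):
    """Return the (scheme, host, port) tuple of an http(s) URL."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def same_origin(a: str, b: str) -> bool:
    return origin(a) == origin(b)


if __name__ == "__main__":
    # Two resources of the same EPUB instance share an origin.
    print(same_origin("http://localhost:49152/EPUB/package.opf",
                      "http://localhost:49152/EPUB/doc.xhtml"))
    # Two EPUB instances on different ports do not.
    print(same_origin("http://localhost:49152/EPUB/doc.xhtml",
                      "http://localhost:50505/EPUB/doc.xhtml"))
```

By the same comparison, two EPUBs addressed through their real location, such as `https://example.org/acme/mobydick.epub` and `https://example.org/acme/mrsdalloway.epub`, come out same-origin, which is the concern raised earlier in the thread.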
I think you're right: any solution based on the actual URL of the EPUB publication will not be contained and probably not be totally unambiguous (using this issue's terminology). |
That was only for example. The abstract container is a virtual file system, so it doesn't really matter what scheme/url constitutes the base of the EPUB. You assign whatever base URL you want for the root directory and you can operate on the ZIP paths as though they are relative URLs. That's how this can't be a conflict:
The video is not in the abstract container. You can't generate URLs for the zipped content and then go looking at things around the zipped file. (If you're going to unzip the content where resources already exist, well then you've made your own mess we can't solve.) The generated URLs don't correspond to physical resources, so all they can tell you is whether the resource would conceptually fall within an actual file system representation of the abstract container. If you want to know if the resource is in the container, you still have to take the path segment corresponding to the root directory and below and see if there's a matching resource in the ZIP container. |
This is a conflict to evaluate conformance to the spec saying that "EPUB Creators (…) MUST ensure each URL is unique within the manifest scope after resolution to an absolute URL". Unless we work around it by saying "only for URLs defined as relative URL strings" maybe? but that sounds flimsy. And it doesn't entirely solve the ambiguity (see below).
Yeah, I'm not convinced that works, see the last example in my previous comment with the That the container is virtual makes me prefer a solution which uses by design a virtual space (like solution 2 (reserved space on w3.org or another safe domain), solution 3 ( |
I think that the ePub publication is considered a "system" so the folder structure refers to a root, like / on a Linux system. Base url is root. Then, connecting the external world URL to the internal ePub system is like connecting two computers. Moreover, the fact that the ePub publication is a self-contained "mini-PC" has the only purpose, I think, to help ePub readers to handle the content with a WebView component, like WebKit, that relies on the filesystem. And it is very likely that an ePub reader can manage the ePub publication as a zipped file, or as an unpacked zipped file. If the ePub is unpacked it is just for internal convenience, it does not become part of an "official" filesystem structure. If the ePub is not unpacked it is accessed in memory, special API methods have to be used to intercept the WebView (browser) resource requests, like an image, a css file and so on, because AFAIK the WebView does not access directly the zip as the filesystem. Considering the EPUB folder at the same level of So IMHO there is no need to find a way to connect the ePub "mini-PC" with a global system with the "URL way". The URLs in ePubs have just not to exceed the root level, readers just should check it and ignore leaking URLs, presenting an error or informative dialog about that occurrence. I think that the ! symbol is like "mounting" but I think it could be more general and useful, stopping there the path parsing, and starting a sort of query part of the url. The zip + ! approach seems to be reasonable here, although this issue seems to be about internal ePub URL validation and not how to have meaningful URLs for apps. |
Oh, that's right. I forgot about that part. Ya, I've been focusing on interpretation solely within the abstract container (i.e., let developers pick any method that works for what they need to know and how they're obtaining the content). Hm... I'm coming around to your point now... 😄 |
First, thanks to @rdeltour for putting all this together! Just some random thoughts while reading through the proposed alternatives
In a later comment, you say:
Absolutely, +1 to that
As far as I am concerned, I would prefer to drop solution 3, which leaves us with reserved domain, |
(I would propose that we add something roughly like the following to the start of 6.1.3, and change the name of the section:)
URLs and the OCF Abstract Container
In order to explain the behavior of EPUB with respect to URLs and the web security model, we find it useful to imagine that the Root Directory of the OCF Abstract Container has a defined URL. EPUB Reading Systems will not present the contents of an EPUB to users with such a URL; it is merely a concise way of describing a complex set of behaviors. The URL of the Root Directory is defined as follows:
This has the following implications:
Example:
I think (?) the dynamic port doesn't help solve #1843. If you are a webserver admin and you want to let a given (known) ebook iframe your site, but prohibit other ebooks from doing so, how can you specify a dynamic origin? |
I don't think there are enough ports to have a unique port per book. The number of available dynamic ports on Unix is around 16K or so, so even if this is just local to a single user a large library would fail (and yes, there are some very large libraries out there). Is there even a spec we can reference for dynamic ports? |
I was assuming that a new port would be assigned when you open a book, and that the assumption we're then making is that you have fewer than 16,000 books simultaneously open. The next time you open the same book, you probably get a different port. But that still runs into the predictability problem I mentioned above. Also I might be misunderstanding. |
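To make the "one port per open book" reading concrete, here is a hypothetical allocator. `PortAllocator`, `open_book`, and `close_book` are invented names, not anything a Reading System is known to implement. The range 49152 to 65535 is the IANA dynamic/private port range (RFC 6335), assumed here because it matches the 16k figure discussed above.

```python
from collections import deque

# IANA dynamic/private ports: 65536 - 49152 = 16384 values.
DYNAMIC_PORTS = range(49152, 65536)


class PortAllocator:
    """Hands out one imaginary localhost port per currently open book."""

    def __init__(self) -> None:
        self._free = deque(DYNAMIC_PORTS)  # ports not in use, oldest first
        self._open = {}                    # book id -> assigned port

    def open_book(self, book_id: str) -> str:
        """Return the container root URL for a book, assigning a port if needed."""
        if book_id not in self._open:
            if not self._free:
                raise RuntimeError("no dynamic port left for another open book")
            self._open[book_id] = self._free.popleft()
        return f"http://localhost:{self._open[book_id]}/"

    def close_book(self, book_id: str) -> None:
        """Release the port; it goes to the back of the queue."""
        self._free.append(self._open.pop(book_id))
```

Because released ports are reused last, reopening a book normally yields a different root URL, which illustrates the predictability problem: an origin assigned this way cannot be known in advance by a server administrator.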
Leave it to an ebook spec to use metaphor! :) I think though, that this runs into the same issue as xml namespaces. People expect these URLs to work, though perhaps localhost would be enough to dissuade them. On the other hand, maybe not. But if we use ports to localhost in a URL, it seems like people might expect them to be ports to a host in a URL, and be subject to those rules. We could instead not use ports, eg http://localhost-unique-id, or even just http://unique-id. Or maybe http://this-is-not-a-custom-scheme-we-promise-unique-id.
That might be fine, but we need to say that. And then we need to define what it means for a book to be "open", which is probably trickier than it sounds, and it already sounds tricky to me. [Edit to remove things that looked like custom tags] |
I think defining the container’s URL "as if" is reasonable for interpreting the conformance statements in the core spec. Specifically, the "as if" approach (like in @dauwhe's proposal / solution 6) allows us to:
In the RS spec, we could give more leeway on the implementation, as long as the RS must:
(*) the only difficulty is to be 100% safe with authored absolute URLs. Say I write an EPUB which purposely contains an exhaustive list of localhost URLs with dynamic ports. It is theoretically hard to totally avoid conflicts, unless we forbid "localhost" absolute URLs. |
Correct, as far as I understand.
The issue with non-dynamic origins is that they are no longer unique per instance. If you and I have a copy of the same EPUB, and they share the same origin, does it create vulnerabilities? Let's discuss this in #1843, and possibly come back here if #1843 implies new requirements for the current issue #1888? |
Ok, I have stared at it a bit more and it seems reasonable. Maybe we can find a way to emphasize the imaginary nature of these constructs? Maybe |
I made a little EPUB to test the internal URLs used in some JS-supporting readers: my-url.zip (rename the The logic is based on a (very naïve) javascript run in a content document ( I only tested in iBooks (which uses URLs with a custom Unfortunately, none seem to behave like the proposed "as if" case. That experimentation is kinda flawed and limited, but it may be informative 😊 |
What follows is just IMHO. I could not read past issues but just browse some referenced ones, and I see that many of them revolve around the same problem: the root of the ePub publication. When creating a modern ePub reader (RS) I think that developers or engineers rely on the powerful features of a filesystem or a WebView (like WebKit). Modern EPUB3 RS do not rely on parsing the paths and resources on their own. So the ePub was created as a mini-website, likely to allow easy creation of RS and to exploit the WebKit or browser features directly and in a straightforward way. Connecting with a virtual URL for validating is not compatible with every way the ePub content is handled (unpacking or reading from zip are very different from each other). And leaks are leaks, they are errors, they cannot be avoided by means of creating a complete URL that is validated, or even corrected. I found this in another thread #1688,
You seem to deal both with validating the ePub content and with providing a way of handling the files in practice. As it was also said "what's an open ePub?", and what's an archived ePub?, and is its content accessible from outside or has it to be "asked" to the RS or archive to "open" it or "provide" it? As you can see many uses and cases are possible, so consider just the above mentioned case: if an ePub publication has references to a common documentation system, I think the publication per se is a closed "world" and it cannot reasonably be allowed or persuaded to read content from a filesystem, unless it is a special system but then it is not within the ePub specifications. But if the local system is where the external content has to be accessed, and https:// URLS are not wanted, then here it is where the ! redirection would be useful, having a way of "opening" an ePub that could or could not be available in the local system but could be searched elsewhere as a fallback. I see that
So is the epubcfi fragment invalid when it is used in URLs? |
I think this is the main point: it is "imaginary", "virtual", and user provided script should not rely on the existence of those localhost URLs. I do not have an issue with the sentence you propose, but I think we should also add something describing these restrictions on scripts. |
@rdeltour, see also my comment in #1888 (comment): I would think it should be made explicit that such scripts may be unpredictable in an EPUB environment. May that be the problem with the test? For testing, I believe what should be done is to write tests along the lines of the requirement you have put in the issue itself (unambiguous, contained, unique, and origin-safe) (and not relying on scripts). The goal is
That little experimentation is limited for sure. But if the two RS I tested do not implement custom URL parsing logic and if they rely on the JS URL API, then at least in those two cases, their implementation does not 100% fit the behavior of the "as if" case.
I agree with the principle. I don't know if/how that's practically testable 😊 (especially for the same-origin requirement). The goal is
Right. For (1), let's keep in mind that there's the core and RS spec. All the criteria may not be needed in both specs. We may not need to define or require them in the same place. For instance:
(I'm thinking out loud here, exploring possibilities. I'm not quite sure yet how to best articulate this 😊). |
I'm working on a proposal, stay tuned 😉 |
So, here's a proposal. Blockquotes starting with "📝 Comment:" are not part of the proposal, but my comments. [In the EPUB 3.3 spec] 1.4 Terminology
To get the Path of a file in the OCF Abstract Container:
6.1.3 URLs in the OCF Abstract Container
The container root URL is the URL [URL] of the Root Directory. It is implementation specific, but MUST verify the following:
The container URL of a file or directory in the OCF Abstract Container is the result of parsing the file's Path with the container root URL as base.
In the OCF Abstract Container, when a file uses a URL string to reference another file in the container, the string MUST be a path-relative-scheme-less-URL string, optionally followed by U+0023 (#) and a URL-fragment string.
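A rough approximation of that requirement can be checked with the standard library. This sketch is not the URL Standard's grammar for path-relative-scheme-less-URL strings; it only rejects the obvious cases (a scheme, an authority, or a leading `/`), and the function name is made up for the example.

```python
from urllib.parse import urlsplit


def looks_path_relative(url_string: str) -> bool:
    """Approximate check: no scheme, no authority, no leading slash."""
    parts = urlsplit(url_string)
    return (
        parts.scheme == ""
        and parts.netloc == ""
        and not parts.path.startswith("/")
    )


if __name__ == "__main__":
    for candidate in ("doc.xhtml", "../images/cover.png#top", "/secret",
                      "//example.org/doc.xhtml", "https://example.org/doc.xhtml"):
        print(f"{candidate!r}: {looks_path_relative(candidate)}")
```

Note that a string such as `../../../secret` passes this check: being path-relative says nothing about whether the parsed URL stays inside the container, which is a separate requirement.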
6.1.5.x Parsing URLs in the
URL component | Value |
---|---|
scheme | http or https |
host | localhost |
port | a dynamic port uniquely assigned to the EPUB instance |
For example:
Container File | File Path | URL |
---|---|---|
Root Directory | empty string | http://localhost:49152/ |
Package Document | OPS/package.opf | http://localhost:49152/OPS/package.opf |
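The definition of a container URL (parsing the file's Path with the container root URL as base) might look like this in Python. It is a sketch under assumptions: `container_url` is a hypothetical helper, `urljoin` implements RFC 3986 resolution rather than the URL Standard, and percent-encoding the path with `quote` is my own assumption about how characters such as spaces would be handled.

```python
from urllib.parse import quote, urljoin


def container_url(root_url: str, file_path: str) -> str:
    """Resolve a container file Path against the container root URL.

    The empty Path (the Root Directory itself) maps to the root URL.
    """
    # Assumption: percent-encode characters that are not valid in a URL path.
    return urljoin(root_url, quote(file_path))


if __name__ == "__main__":
    root = "http://localhost:49152/"
    for path in ("", "OPS/package.opf", "OPS/chapter 1.xhtml"):
        print(f"{path!r} -> {container_url(root, path)}")
```

The root URL used here is the example value from the table; any other URL satisfying the proposal's constraints could be substituted.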
📝 Comment:
The localhost + port solution has limitations. Notably, it cannot be used if we start requiring that the origin is preserved when opening the same EPUB another time; or it cannot guarantee other EPUB instances opened at another time will not be same-origin. I still think it is relevant as an example; do we need to make these limitations explicit? or at least say it is known to be limited? or more strongly say it is a mere analogy and RS can very well use another solution?
Comments warmly welcome, especially from RS folks 😊.
@rdeltour this sounds great to me. I particularly like the way you describe the root in abstract, leaving it to the implementation whether they use localhost or anything else. As a non-implementer it sounds perfect to me as some sort of mental model. But, as you say, the real answer should come from the RS folks. Would it help if I attempted to fold that into the spec in the form of a PR, so that people could see the changes as part of the spec (with also a diff file)? I am happy to make an attempt, although I might get it wrong here and there. CC: @wareid @bduga @HadrienGardeur @hober @llemeurfr @danielweck @fchasen @rickj @mteixeira-wwn |
I am not sure if it is clear what "EPUB Publication instance" means. Do we rely on the model whereby a RS makes some sort of (virtual or physical) copies of books it receives? I believe that is what happens with most of the Reading Systems even on a Mac, but I do not know whether this can be considered as a rule. If not, then what happens if two RS-s read the same EPUB file on my disc? I guess these should be considered as separate "instances"... |
From spec point of view, that is true; the new statements make the second bullet items of the second list obsolete. I might still prefer to leave something in the section referring to the new text. |
For good or for worse the current spec puts the |
Yeah, I wondered that too. I started jotting this down in markdown, with interspersed comments, so it ended up as a comment and not in a PR. Also, I wanted to see if the group agreed with the direction. But feel free to turn that into a PR! (Or I can do it if you prefer, but not until next week).
Right, I agree. I copied that term from the existing "Scripting" section, but this is rather (intentionally?) vague. I think this would be worth clarifying, especially if we want to further think about (and ideally specify) the origin-related requirements. |
I am working on it. May be ready later today or tomorrow at the latest. |
(or: The Big Mystery of Spooky EPUB Relative URLs 👻🎃)
TL;DR: EPUB 3.3 now normatively references the URL Standard. But URL parsing is ambiguous in some cases, because base URLs are not clearly defined.
Current situation
In an EPUB, files reference each other via relative URL strings (see Relative URLs, in Open Container Format).
In the URL standard, to parse a relative URL string into URL records, the URL parser needs a base URL.
The base URL used to parse a URL string is defined by host languages (like in CSS, or HTML). Typically, it is the URL of the document containing the URL string.
EPUB defines what base URL to use for URL parsing in two cases: documents in the META-INF directory, and the Package Document.
Parsing a URL in documents located in the META-INF directory
For documents in the META-INF directory, URL strings must be parsed using the root directory as the base URL (see Relative URLs, in Open Container Format).
The problem is that the Root Directory is not defined as a URL, but quite abstractly as "the base of the OCF Abstract Container". The spec also says the root directory is "virtual in nature". In fact, an RS may or may not generate a physical directory for the root directory (see OCF ZIP Container RS processing).
Parsing a URL in the Package Document
For Package Documents, URL strings must be parsed using the URL of the Package Document as the base URL (see Parsing Relative URLs, in Package Documents RS processing).
Here again, the URL of the Package Document is not well-defined. But the spec says (in the same section) that for zipped EPUBs, the URL of the package document is obtained "from the URL of the EPUB Container together with a fragment identifier that specifies the path to Package Document (relative to the Root Directory)".
Problems
The URL of the container’s root directory is undefined
The current specification leaves many questions unanswered:
The current way to obtain the URL of the Package Document is flawed
Parsing a relative URL in the Package Document always results in a URL of a resource outside the container.
Examples:
For instance, for an EPUB mobydick.epub located at https://example.org/acme-publishing/mobydick.epub, the URL of the Package Document would be something like https://example.org/acme-publishing/mobydick.epub#path=/EPUB/package.opf. So this is how a few relative URL string examples are parsed:
URL string | Base URL | Resulting URL |
---|---|---|
nav.xhtml | https://example.org/acme/mobydick.epub#path=/EPUB/package.opf | https://example.org/acme/nav.xhtml |
nav.xhtml | https://example.org/acme/tomsawyer.epub#package-doc=/EPUB/package.opf | https://example.org/acme/nav.xhtml |
../video/cat.mp4 | https://example.org/acme/mobydick.epub#package-doc=/EPUB/package.opf | https://example.org/video/cat.mp4 |
/secret | https://example.org/acme/mobydick.epub#package-doc=/EPUB/package.opf | https://example.org/secret |
../../../secret | https://example.org/acme/mobydick.epub#package-doc=/EPUB/package.opf | https://example.org/secret |
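The examples above can be checked with a short Python sketch. It assumes `urllib.parse.urljoin` (RFC 3986 resolution) gives the same results as the URL Standard's parser for these inputs; the constant and function names are invented for the illustration.

```python
from urllib.parse import urljoin

EPUB_URL = "https://example.org/acme/mobydick.epub"
# Package Document URL as currently specified: EPUB URL plus a fragment.
PACKAGE_URL = EPUB_URL + "#package-doc=/EPUB/package.opf"


def resolve_in_package(url_string: str) -> str:
    """Resolve a relative URL string found in the Package Document."""
    return urljoin(PACKAGE_URL, url_string)


def stays_in_epub(resolved: str) -> bool:
    """True only if the result still designates the EPUB file itself."""
    return resolved == EPUB_URL or resolved.startswith(EPUB_URL + "#")


if __name__ == "__main__":
    for ref in ("nav.xhtml", "../video/cat.mp4", "/secret", "../../../secret"):
        resolved = resolve_in_package(ref)
        print(f"{ref:18} -> {resolved}  (inside EPUB: {stays_in_epub(resolved)})")
```

The fragment of the base URL plays no part in resolution, so every relative reference is resolved against `https://example.org/acme/` and lands next to, or above, the EPUB file rather than inside it.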
To summarize:
Possible Solutions
The ideal solution would ensure parsed URLs are unambiguous, contained, unique, and origin-safe.
Note: the ideal solution might not exist, or might not be practical to use, to implement, or to specify. But the goals listed above may help us evaluate a solution.
Possible solutions will be listed below as individual comments, for easier referencing in the discussion.
Comments and ideas welcome! 😊
I may have missed important things…