Clarification needed on official specs about the nature of the ePub publications and of the ePub Reader Systems in regard to root filesystems - proposal included. #1910

P5music · 2021-11-13T12:17:08Z

Clarification needed on official specs about the nature of the ePub publications and of the ePub Reader Systems in regard to root filesystems.

Summary:
This issue is about some needed clarifications in the wake of the #1898 proposal.

About:
-the nature of ePub publications and ePub readers
-the root of ePub publications
-the base Url of ePub publications
-the mount point on filesystems vs websites "root"
-the BASE tag for relative URLs and the meaning of the /file.ext syntax in references
-the need of design and source changes for some ePub readers
-the possible high-level data exchange between the WebKit module and high-level applications (like encoding images with base64 encoding and so on).
-HTML injection and base url

Please read carefully, no offence is intended to anyone.

In computer systems there is the root filesystem at the / mount point that is special.

If you run a command like
cd ../../../../../.. and so on, you will end up at the root even if you exceed the available number of levels.

For example if you are at root level
cd ../../../../../.. is going to point to the root itself.

You can mount other filesystems at some mount point.

If you mount a filesystem at a certain depth level, that point is part of the main filesystem now.

So in that case commands with exceeding number of ../ will not stop at the mount point but will continue to go up.

/home/pc/mainpath/mainfolder/secondarypath

/home/pc/mainpath/mainfolder/secondarypath/secondlevelfolder/file.xhtml

If I issue these commands:

cd /home/pc/mainpath/mainfolder/secondarypath/secondlevelfolder
I am now at secondlevelfolder of the mounted filesystem.
cd ..
I am now in secondarypath
cd ..
I am now in mainfolder, I am higher than the mount point
cd ..
I am now in mainpath
cd ../../..
I am now at root
cd ..
I am still at root.

Now
cd /home/pc/mainpath
I am at mainpath
cd ../../..
takes me to root
but also
cd ../../../..
would take me to root.

Now cd /home/pc/mainpath/mainfolder/secondarypath/secondlevelfolder
cd ../../../..
does not take me to the mount point but goes up.

So the mount point is not the root, and the (strange) rule about exceeding ../ sequences cannot be relied on there.
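
The walk-through above can be reproduced with a small sketch. It uses Python's `posixpath.normpath`, which resolves `..` purely lexically (no real filesystem or mount is involved, and symlinks are ignored), so it only approximates what a shell does. The helper name `go_up` is hypothetical; the paths are the ones used above.

```python
import posixpath

# The mount point from the example above.
MOUNT_POINT = "/home/pc/mainpath/mainfolder/secondarypath"


def go_up(start: str, levels: int) -> str:
    """Return the path reached by applying `cd ..` `levels` times from `start`."""
    return posixpath.normpath(posixpath.join(start, *[".."] * levels))


# One ".." more than the depth of /home/pc/mainpath: clamped at the root.
at_root = go_up("/home/pc/mainpath", 4)

# Four ".." from inside the mounted tree: ends above the mount point.
above_mount = go_up(MOUNT_POINT + "/secondlevelfolder", 4)

print(at_root)      # /
print(above_mount)  # /home/pc
```

Only the real root absorbs the extra `..`; the mount point is passed through like any other folder.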

Let's consider what happens in websites:
https://github.com/w3c/../P5music
is the same URL as
https://github.com/w3c/../../../../../../../../P5music

So webservers do the same at root level.
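
As a rough illustration of the two URLs above, here is a sketch using Python's `urllib.parse.urljoin`, which resolves relative references following RFC 3986. In this sketch the resolution is done by the library call itself and no request is sent to any server. The base URL is taken from the example; the variable names are only illustrative.

```python
from urllib.parse import urljoin

BASE = "https://github.com/w3c/"

# One ".." goes from /w3c/ up to the root of the site.
one_up = urljoin(BASE, "../P5music")

# Extra ".." segments are dropped once the root is reached.
many_up = urljoin(BASE, "../../../../../../../../P5music")

print(one_up)   # https://github.com/P5music
print(many_up)  # https://github.com/P5music
```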

If you run a localhost webserver on your PC, the same rule applies: the webserver will respond to external requests by applying that rule to the "root" of the website hosted on localhost.
So we are talking about responding to requests, be it a request to a filesystem from a terminal or an external request to a webserver.
You can even send a request from your browser to the localhost webserver on your PC, and that rule applies: you will not be allowed to go above the "root" of the website that localhost is hosting.

We can now ask what a Reader System for ePubs is:

-Is it a host?
-Does it respond to external requests?
-If the client is on a device (like a tablet) and the server is on another device (like a remote computer), is the client reader a host?
Does the client have to parse the URLs, or is it the server that parses the URLs and responds to the client?
-If the client unzips the ePub on the local filesystem, is it a host? Does it respond to external requests? Does it have to parse the ePub URLs and apply the "exceeding ../" rule?
-Same question for readers that read from the zip archive in R.A.M. or from disk.

According to the new proposal #1898, any reader, be it a client or an application, has to parse the URLs to apply the above-mentioned rule, and the ePub is like a root filesystem, an unmounted filesystem.

Is it possible to add to the specifications what exact nature is intended for ePub publications and ePub readers, including the following cases:
-client application (reader) and server (webservice)
-client application that reads a zipped ePub in R.A.M. or from disk
-client application that unzips the ePub on the filesystem
and any other cases that may exist that I am not aware of?

What exact nature is intended for ePubs
-are they unmountable filesystems? (are they real root filesystems)

Please take explicitly into account that the rule of exceeding ../ is naturally enforced only at root.
When a WebView requests a resource from the filesystem with a relative URL, and the URL is exceeding, it has to be parsed to prevent leaking (going above the ePub "root").

The WebView can have a baseUrl; by default it is the folder where the loaded XHTML file resides.
Indeed, if you inject HTML code without loading it from a file on the filesystem, it is like "floating". If the XHTML file is loaded, instead, the baseUrl is automatically set to the folder the XHTML comes from.
If the baseUrl is explicitly provided, all relative URLs are calculated from that explicit point.

But this is not a root, so the leaking is possible. This leads to enforcing the exceeding ../ rule at code level with a major source code change.
It's like filtering exceeding URLs because the filesystem would not filter that, would not stop the exceeding ../ going up.
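
A minimal sketch of what such code-level filtering could look like for a reader that unzips the ePub to disk. The function name `resolve_in_container` and its parameters are hypothetical, and the sketch assumes plain path references (no percent-encoding, query, or fragment handling). It resolves the reference against a virtual `/` first, so excess `..` segments are dropped at the container root, and only then maps the result below the real unzip folder.

```python
import posixpath


def resolve_in_container(container_dir: str, doc_path: str, ref: str) -> str:
    """Map a relative reference to an on-disk path, clamping ".." at the container root.

    container_dir: hypothetical folder where the ePub was unzipped.
    doc_path: path of the referencing document, relative to the container root.
    ref: the relative path found in that document.
    """
    # Resolve against a virtual "/" so that excess ".." segments are
    # dropped at the container root, as they would be at a real root.
    virtual = posixpath.normpath(
        posixpath.join("/", posixpath.dirname(doc_path), ref)
    )
    # Re-attach the virtual path below the real unzip folder.
    return posixpath.join(container_dir, virtual.lstrip("/"))


CONTAINER = "/home/pc/mainpath/mainfolder/secondarypath"
DOC = "secondlevelfolder/file.xhtml"

normal = resolve_in_container(CONTAINER, DOC, "../images/image1.png")
exceeding = resolve_in_container(CONTAINER, DOC, "../../../../images/image1.png")

print(normal)     # /home/pc/mainpath/mainfolder/secondarypath/images/image1.png
print(exceeding)  # /home/pc/mainpath/mainfolder/secondarypath/images/image1.png
```

Both references end up inside the container folder, which is the behavior the filesystem alone would not provide.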

This is technically feasible, but it is also possible that some readers will ignore the parsing task just to avoid major design changes in how the WebView reads resources from the filesystem.
The WebView has very optimized low-level code. Usually, feeding the WebView directly with data performs worse, especially from high-level languages exchanging data with a separate module such as the WebView.

Indeed, when the explicit baseUrl is provided you have to feed the WebView, because relative URLs are "wrong" now.
The use of the BASE tag could also be considered, but the same applies. The entire publication would have to use paths relative to the base path, while usually the relative paths are relative to the folder of the currently displayed page, so
../images/image1.png
means "go up one level, enter the images folder, load image1.png".
If a BASE element is defined, the meaning is different. And /file.jpg is relative to the base and not the root.
That alone should be enough to reconsider the new proposal.
Since many cases are possible, parsing paths is a really difficult and tricky task for a reader.
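
To make the effect of a different base concrete, here is a small sketch with `urllib.parse.urljoin` (RFC 3986 resolution). The URLs are hypothetical: `PAGE_URL` stands for the address of the displayed page and `BASE_URL` for a value an author might put in a BASE element. Under these rules, a path-absolute reference such as `/file.jpg` keeps the scheme and authority of the base and replaces its path.

```python
from urllib.parse import urljoin

# Hypothetical URLs, for illustration only.
PAGE_URL = "http://localhost:12345/EPUB/xhtml/page1.xhtml"
BASE_URL = "http://localhost:12345/EPUB/"

REF = "../images/image1.png"

# Resolved against the page: up from xhtml/, then into images/.
from_page = urljoin(PAGE_URL, REF)

# Resolved against the explicit base: the same reference lands elsewhere.
from_base = urljoin(BASE_URL, REF)

# A path-absolute reference ignores the path part of the base.
absolute_ref = urljoin(BASE_URL, "/file.jpg")

print(from_page)     # http://localhost:12345/EPUB/images/image1.png
print(from_base)     # http://localhost:12345/images/image1.png
print(absolute_ref)  # http://localhost:12345/file.jpg
```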

Take also into account that some systems, like emulators and simulators, might not even enforce the rule at root, because the root of the simulator is not the real filesystem root of the computer running the simulator.

Since the above mentioned new proposal arbitrarily makes the ePub3 publication like a "host" root and not a simply mounted filesystem, it should also be specified with all subsequent implications.

An alternate proposal could be to just forbid exceeding relative URLs, so it is good for both webservers and filesystems, which are slightly different "worlds".
Enforcing a URL syntax clearly intended for a webserver's host-level website root at an intermediate point of a filesystem would require the program to intervene at the "connection point" and filter what is now explicitly allowed, namely exceeding relative URLs, as they are allowed at a host-level website root. That could be difficult at least, or even impossible.

I could be wrong on some of the points above, so at least they should be assessed and discussed.

Regards

iherman · 2021-11-15T09:27:02Z

@P5music, I agree that it is not said very explicitly in the documents, but the overwhelming conceptual model for an EPUB instance is that it is a "Website in a Box". The overview document makes this point:

"The Package Document defines a layer on top of the traditional structuring of a typical Web site to facilitate the authoring of digital publications." (see §2.1 in the Overview document)

The very fact that the EPUB content is defined in terms of Web Standards makes this fairly clear, too. I would agree that we may want to emphasize this conceptual model better in the documents (cc @dauwhe @mattgarrish).

However, if we accept this conceptual model then the analogy to file system structures becomes much less relevant: reading systems are to think in terms of websites with some very significant specificities and not in terms of file systems. The discussions in #1898, #1888, or #1374 are all aiming at making some of the technical details more precise.

How a Reading System implements that conceptual model is not for the Standard to define. It is up to the RS implementers to decide whether it works by unzipping the content into a file system or whether it uses streaming; the behavior, in terms of, say, relative URL-s (which is where the discussion started) should be identical. What this WG has to provide are test cases to help RS-s to achieve interoperability in this respect, and not to specify implementation patterns. It is up to us (and any help is welcome from within and outside the WG!) to produce the right test cases and see whether the RS-s can implement the model based on the specification.

rdeltour · 2021-11-15T13:09:27Z

@P5music, if I understand correctly your proposal boils down to:

An alternate proposal could be to just forbid exceeding relative URLs

I see two points to discuss there:

-is it a legitimate improvement to forbid "exceeding" relative URLs? (note: in previous issues we also referred to those as "leaky" relative URLs, for lack of a better term).
-is it sufficient to solve the initial problem stated in "What base URLs to use for URL parsing in EPUB?" (#1888)?

Should we make "exceeding relative URLs" non-conforming?

In the #1898 proposal, these URLs are conforming, but are informatively discouraged in a note.

That said, I can see how making them non-conforming could be an additional safeguard. If they are reported as EPUBCheck errors or warnings, specifically, then authors will be more forcefully discouraged from using them than with the current informative note.

It should be possible to craft some normative definition of "exceeding relative URLs" and explicitly forbid that in the spec.

Would that solve (part of) your concerns?

Would it be sufficient to solve the problem stated in #1888?

Unfortunately, not.

If we don't say anything about how the root URL is defined, i.e. if the RS implementors can choose whatever they want as the container root URL, then it creates interoperability / unpredictability issues.

If you don't understand why, I would ask you to read carefully the problem stated in the opening comment of #1888 (especially section "The current way to obtain the URL of the Package Document is flawed", which exposes a particularly problematic solution).
If after reading that you still don't see why that would be an issue, then I can try to explain it with other words or examples, let me know.

So, once we agree that forbidding exceeding URLs alone does not solve all of the original issue, do you have an alternative proposal to solve it?

I'm aware that #1898 is restrictive for RS implementors. There may be another way to fix the original problem in a less-restrictive manner. Ideas welcome!

P5music · 2021-11-15T14:26:42Z

@iherman
@rdeltour

Thank you for the response.

I can contribute with this:

The "Website in a Box" metaphor is useful but it goes too far, it is bigger than the ePub3.

I think that only the filesystem part of the website was indeed intended when the analogy was created.
So it could be just a matter of using a lighter version of the metaphor.

As I said, if exceeding URLs are forbidden, that will work for websites, filesystems, and zipped ePub readers too.
So it does not solve the spooky problem of root URL definition, yes,
but it is because you are still embracing the "Website in a Box" metaphor too much.

Websites are more complex than ePub3 publications, and it is bad for ePub publications to be compared to them.
ePub3 has no PHP or equivalent server-side functions or obligations.
The ePub3 is more similar to a client like a browser or WebView. Why?
Because it can have Javascript inside the page, not <? > PHP tags,
and very often the RS manages the ePub archive itself on the device, unzipping it or reading it directly from the compressed archive.

So my proposal is to forbid exceeding URLs
AND
to drop the spooky URL problem.

What is the ePub3 publication?
IMHO it is a book. It is a publication in the form of a digital archive, so it naturally lives on a filesystem, be it on a remote computer or a mobile device.

I also pondered about some further questions:

-when displaying a local ePub, the browser and the WebView have an internal representation of URLs that is going to pop up every now and then as a possible issue. They have a file path like
file:///C:/main_path/main_folder/secondary_path/secondary_folder/file.html
that is different from
http://localhost:12345/secondary_path/secondary_folder/file.html
WebViews and browsers are not going to comply with the new URL form in their internal representations, whatever high-level parsing of URLs is done.

-The spine or manifest references are parsed by the RS, but the XHTML references (images, CSS) should be left to the internal handling of the browser and the WebView.
But they can also be called from the XHTML, with possible conflicts. And the XHTML pages can also have a BASE tag; I am concerned that some ePub authors might set the BASE tag reference to the
http://localhost:port/ form
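
The difference between the two URL forms in the first question can be sketched with `urllib.parse.urljoin` (RFC 3986 resolution). The two base URLs are the ones quoted above; the reference `REF` is a hypothetical one with a single `..` too many for the `http://localhost` form, and the sketch assumes the localhost root corresponds to `main_folder`.

```python
from urllib.parse import urljoin

FILE_BASE = "file:///C:/main_path/main_folder/secondary_path/secondary_folder/file.html"
HTTP_BASE = "http://localhost:12345/secondary_path/secondary_folder/file.html"

# Hypothetical reference: one ".." more than the depth below the localhost root.
REF = "../../../image1.png"

# With the file: form, the extra ".." climbs above main_folder.
from_file = urljoin(FILE_BASE, REF)

# With the localhost form, the extra ".." is dropped at the site root.
from_http = urljoin(HTTP_BASE, REF)

print(from_file)  # file:///C:/main_path/image1.png
print(from_http)  # http://localhost:12345/image1.png
```

The same reference therefore points inside or outside the publication folder depending on which internal URL form the reader happens to use.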

I hope this is useful.
Thanks
Regards

rdeltour · 2021-11-15T16:05:39Z

So my proposal is to forbid exceeding URLs
AND
to drop the spooky URL problem.

What do you mean by "drop the spooky URL problem"?

P5music · 2021-11-15T16:45:24Z

@rdeltour

I just mean that
the root URL problem is something that only webservers have to deal with, not the ePub3 publication in itself.
I see that websites already enforce the recursive parsing of URLs, as you demonstrated.
So I think that there is no concern about the root URL form when dealing with the ePub3 publication in itself, because it is a concept that is not relevant there; it does not apply.
Maybe you want the specs to be more explicit about clients and servers, but as you said the webservers are enforcing that rule by default.
The clients do not deal with complete URLs at all, because they are not servers: they do not serve ePub publications in response to an external request, they just have to display
the digital archive that is the ePub book on the device screen.

According to my thinking, the ePub3 publication is just a file, an archive, representing a book to be read.
Inside it, a reader can find a filesystem structure, and it is very easy to display the publication with a WebView because
the ePub3 publication is something like
a web page when you save it on the computer from the web.
It becomes a special local page with resources folder and the link references are now local, so they can be viewed offline.

Regards

iherman · 2021-12-03T11:16:23Z

I have the impression that this issue has been overtaken by events and the discussion moved elsewhere. Propose closing...

mattgarrish added the Topic-OCF label (The issue affects the OCF section of the core EPUB 3 specification) Nov 16, 2021

iherman added the Status-ProposeClosing label (The issue is no longer relevant, or has already been fixed, and should be closed.) Dec 3, 2021

dauwhe closed this as completed Dec 8, 2021

mattgarrish added the Status-Invalid label (The issue is not applicable to the EPUB specifications) and removed the Status-ProposeClosing label Dec 17, 2021
