Add MIME charset based on Emacs coding system #18
base: master
Conversation
Thanks for taking the time to put this together. There are some issues
with this patch:
* I don't know what the fst function is or where it comes from. It's not
defined by Emacs, and a quick search brings up nothing.
* Reading the file into a multibyte buffer defeats the purpose of
detecting its format/encoding. Emacs will have already decoded it into
UTF-8 regardless of the original encoding, and so it no longer matters.
Worse, if it's a binary file, the decoding won't make sense. The goal is
to pass the raw bytes along unmolested by Emacs, and in a single load. I
perhaps should have used insert-file-contents-literally (back in 2012
when I wrote that line), but it appears to do the same thing when the
buffer is unibyte.
* You've singled out UTF-8, which is appropriate since it would be the
most common, but there are actually 20 different values by my count that
all mean UTF-8 to Emacs:
utf-8 utf-8-auto utf-8-auto-dos utf-8-auto-mac utf-8-auto-unix utf-8-dos
utf-8-emacs utf-8-emacs-dos utf-8-emacs-mac utf-8-emacs-unix utf-8-hfs
utf-8-hfs-dos utf-8-hfs-mac utf-8-hfs-unix utf-8-mac utf-8-unix
utf-8-with-signature utf-8-with-signature-dos utf-8-with-signature-mac
utf-8-with-signature-unix
Transforming an Emacs coding system symbol into a MIME charset needs to
be a lot more robust. (Perhaps this is what "fst" is about?)
Re. your concern about binary files, I'm not sure... One could retrieve the charset only for text/ MIME types, or guess based on the presence of null bytes.
Thanks for the tip about (3). I didn't know that's how
coding-system-type worked! I'm now satisfied with that part.
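To spell that part out, the check reduces to something like this (a sketch; the function name is just illustrative, not anything defined by simple-httpd):
;; All twenty utf-8-* variants share the coding type `utf-8', so a single
;; comparison covers them.
(defun utf-8-coding-system-p (coding-system)
  "Return non-nil if CODING-SYSTEM is any UTF-8 variant."
  (eq (coding-system-type coding-system) 'utf-8))
;; (utf-8-coding-system-p 'utf-8-with-signature-dos)  ; => t
;; (utf-8-coding-system-p 'latin-1)                   ; => nil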
What I mean with (2) is that Emacs always uses UTF-8 internally for
multibyte strings and buffers. It remembers the original encoding so
that it can re-encode it that way when saving. However, for this server
the buffer is sent out with process-send-region, which uses the
connection process' coding system, not the buffer's file coding system.
Since the connection process is set to "binary," Emacs will send out the
buffer's raw UTF-8 content regardless of the original file's encoding.
So that's what I meant about the original encoding being irrelevant.
Whatever it was, the most efficient action at this point would be to
just send it as UTF-8 and forget the original encoding.
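In code, the mechanism I'm describing amounts to roughly this (a sketch, not simple-httpd's literal code; `proc' stands for the connection process):
;; The connection is binary, so the buffer's internal UTF-8 text is sent
;; out byte-for-byte, whatever the original file's encoding was.
(set-process-coding-system proc 'binary 'binary)
(process-send-region proc (point-min) (point-max))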
Less efficient would be changing the connection process' coding system
first so it re-encodes the data back to the original encoding as it's
sent out. I'm not currently sure if this would always work as expected.
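As a sketch, that would look something like:
;; Re-encode on the way out using the coding system Emacs detected when
;; the file was read (the third argument is the outgoing coding system).
(set-process-coding-system proc 'binary buffer-file-coding-system)
(process-send-region proc (point-min) (point-max))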
Even less efficient would be reading the entire file a second time into
a unibyte buffer, then sending the raw data (as before your patch). But
this time with the Emacs-guessed charset discovered from the initial
decode.
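Something like this, as a sketch (`path' and `proc' are just placeholders):
;; First pass: a normal decoded read, only to learn the coding system.
(let ((coding (with-temp-buffer
                (insert-file-contents path)
                last-coding-system-used)))
  ;; Second pass: re-read the raw bytes into a unibyte buffer for sending.
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally path)
    ;; The raw bytes are now in the buffer; `coding' is what Emacs guessed
    ;; on the first read and could be reported as the charset.
    (process-send-region proc (point-min) (point-max))))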
HTTP coding is kind of a mess from Emacs' point of view because a stream
of HTTP data is typically a mix of different encodings. Headers are
ISO-8859-1 while content is generally either binary or UTF-8. To deal
with this, simple-httpd uses raw binary for the connection process and
calls decode-coding-string to manually decode just the HTTP header.
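The relevant step looks roughly like this (a sketch; `raw' stands for the undecoded bytes received from the connection):
;; Split off the header and decode only that part as ISO-8859-1 (latin-1);
;; anything that follows stays as raw bytes.
(let* ((end (string-match "\r\n\r\n" raw))
       (header (decode-coding-string (substring raw 0 end) 'latin-1)))
  header)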
Aaah, that makes sense. Sorry, I misunderstood completely. Thank you for teaching me a little more about Emacs encoding handling. If I understand the rest correctly (i.e. with or without this patch, Emacs will push UTF-8 into the pipes), would it make sense to rewrite this patch to instead just set charset=utf-8 on everything? It's not optimal (since recoding is not a reversible process in general), but it should be strictly better than the current situation where UTF-8 is used with no indication of it, right?
Without your patch, Emacs passes along any file's raw bytes without any
translation, just as if it were a binary file. A UTF-16 text file will
be served as UTF-16 — though without any indication of the "charset"
from the Emacs server. It just leaves it for the client to figure out
the encoding. I personally don't think it's the server's business to
translate between encodings since 1) it doesn't know the client's
purpose (the original encoding might be important) and 2) it could be
wrong about the original encoding.
Try it out yourself. Serve a UTF-16 text file with some non-ASCII code
points via simple-httpd and observe the results in various browsers and
other clients (curl -v | hexdump -C). I tested with GREEK SMALL LETTER
PI and PILE OF POO (to test a supplemental plane).
There's a certain irony that, in my tests, UTF-16 actually works better
than UTF-8 (without BOM / "signature") in both Firefox and Chromium.
Both detect that it's UTF-16 and display the text properly. A UTF-8 file
gets mangled because both browsers assume ASCII by default (despite RFC
6657). Both detect UTF-8 when a BOM is present (eww). I got a variety of
different results from various terminal browsers: lynx, links, elinks,
w3m.
In practice, for websites this is not an issue. An HTML document should
generally specify its encoding (e.g. <meta charset="...">) — despite how
strange it is for the encoding to be specified in itself — so it doesn't
matter if the server uses "charset" or not. JavaScript can't do this,
but JavaScript included from an HTML document adopts that document's
encoding (at least with Firefox and Chromium).
Side note #1: If you didn't know, Emacs extends Unicode such that it can
represent raw bytes within an encoded stream, which would otherwise be
illegal:
http://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html
Because of this, any binary file that was incorrectly decoded via any
coding system can be unambiguously and safely converted back to the
original binary data, given the coding system that was used to
mistakenly decode it.
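A quick way to see this (a sketch that should evaluate to t):
;; #xff #xfe is not valid UTF-8, so decoding turns those bytes into
;; Emacs's special raw-byte characters; re-encoding restores them exactly.
(let* ((raw (unibyte-string #xff #xfe #x41))
       (decoded (decode-coding-string raw 'utf-8))
       (restored (encode-coding-string decoded 'utf-8)))
  (equal raw restored))  ; => t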
Side note #2: It looks like simple-httpd doesn't properly respond to
"connection: close" as used by simpler browsers, which I should fix.
Perhaps there's some value in calling an external program like enca to
make a guess at the encoding if it's available. Unfortunately, enca
either incompletely identifies or fails to identify the encoding of some
of the files I tested, so it would need to be some other program. On the
other hand,
this is the "simple" httpd server, and I prefer to do the simple,
non-surprising thing of leaving out "charset" and letting the client
sort it out.
You haven't said _why_ you want this change. I see you paired this with
a PR over in impatient-mode. Are you having trouble with a particular
document? An issue with htmlize? The beauty of MIME types being stored
in an alist is that you can trivially override it for your own purposes:
(push '("html" . "text/html;charset=utf-8") httpd-mime-types)
But, IMHO, much better is to use a meta tag in the document if possible.
If the client uses HTTP/1.0 or "Connection: close" then the server *must* close the connection after the response has been sent.
Ah, yes, I forgot the buffer was not multibyte pre-patch. Sorry. Now that part makes sense as well. I agree fully that the server is out of line if it attempts to recode the content before passing it along to the user (besides the philosophical issue, it would wreck any checksums or HMACs associated with the content).
Yes, although I'm starting to suspect something else is wrong, because my problem contradicts your other observations. In particular, your observation that JavaScript included from an HTML document adopts the document's encoding appeared not to be the case for me. Firefox assumed (as you said) that a JavaScript file included in my HTML was ASCII encoded rather than UTF-8, and then when it encountered GREEK SMALL LETTER MU in an identifier it errored out. The HTML file had UTF-8 specified as a meta tag and it appeared to understand UTF-8 sequences. That said, I am 100% content with the solution you suggested of pushing a new MIME type onto the alist. I wish I had thought of that to begin with, and not tried to be clever. If it's okay with you, I can close this pull request as well as its companion.
Unless you've come up with another way to tackle this issue, closing
both PRs is fine with me. Again, thanks for taking the time to look into
this.
Firefox assumed (as you said) that a JavaScript file included in my
HTML was ASCII encoded rather than UTF-8, and then when it encountered
GREEK SMALL LETTER MU in an identifier it errored out.
Here's the test I ran before putting that claim in my response, just to
be absolutely sure my understanding was accurate. Both Firefox and
Chromium had the same result — though Chromium required a hard refresh.
index.html:
<!DOCTYPE html>
<meta charset="utf-8">
<script src="test.js"></script>
test.js (encoded with UTF-8, obviously):
alert('π');
When I visit this page via simple-httpd, the response headers include
"Content-Type: text/html" and "Content-Type: text/javascript" as
expected (no charset) but the alert box still shows "π" correctly. When
I comment/remove the meta tag, the alert box displays "Ï€" since the
JavaScript was interpreted as ISO-8859-1.
Companion to, but obviously independent of, PR #6 on impatient-mode.