Add MIME charset based on Emacs coding system #18
base: master
Conversation
Thanks for taking the time to put this together. There are some issues
with this patch:
* I don't know what the fst function is or where it comes from. It's not
defined by Emacs, and a quick search brings up nothing.
* Reading the file into a multibyte buffer defeats the purpose of
detecting its format/encoding. Emacs will have already decoded it into
UTF-8 regardless of the original encoding, and so it no longer matters.
Worse, if it's a binary file, the decoding won't make sense. The goal is
to pass the raw bytes along unmolested by Emacs, and in a single load. I
perhaps should have used insert-file-contents-literally (back in 2012
when I wrote that line), but it appears to do the same thing when the
buffer is unibyte.
* You've singled out UTF-8, which is appropriate since it would be the
most common, but there are actually 20 different values by my count that
all mean UTF-8 to Emacs:
utf-8 utf-8-auto utf-8-auto-dos utf-8-auto-mac utf-8-auto-unix utf-8-dos
utf-8-emacs utf-8-emacs-dos utf-8-emacs-mac utf-8-emacs-unix utf-8-hfs
utf-8-hfs-dos utf-8-hfs-mac utf-8-hfs-unix utf-8-mac utf-8-unix
utf-8-with-signature utf-8-with-signature-dos utf-8-with-signature-mac
utf-8-with-signature-unix
Transforming an Emacs coding system symbol into a MIME charset needs to
be a lot more robust. (Perhaps this is what "fst" is about?)
Re. your concern about binary files, I'm not sure... One could retrieve the charset only for text/ MIME types, or guess based on the presence of null bytes.
Thanks for the tip about (3). I didn't know that's how
coding-system-type worked! I'm now satisfied with that part.
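To spell that part out, the check reduces to something like this (a sketch; the function name is just illustrative, not anything defined by simple-httpd):
;; All twenty utf-8-* variants share the coding type `utf-8', so a single
;; comparison covers them.
(defun utf-8-coding-system-p (coding-system)
  "Return non-nil if CODING-SYSTEM is any UTF-8 variant."
  (eq (coding-system-type coding-system) 'utf-8))
;; (utf-8-coding-system-p 'utf-8-with-signature-dos)  ; => t
;; (utf-8-coding-system-p 'latin-1)                   ; => nil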
What I mean with (2) is that Emacs always uses UTF-8 internally for
multibyte strings and buffers. It remembers the original encoding so
that it can re-encode it that way when saving. However, for this server
the buffer is sent out with process-send-region, which uses the
connection process' coding system, not the buffer's file coding system.
Since the connection process is set to "binary," Emacs will send out the
buffer's raw UTF-8 content regardless of the original file's encoding.
So that's what I meant about the original encoding being irrelevant.
Whatever it was, the most efficient action at this point would be to
just send it as UTF-8 and forget the original encoding.
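In code, the mechanism I'm describing amounts to roughly this (a sketch, not simple-httpd's literal code; `proc' stands for the connection process):
;; The connection is binary, so the buffer's internal UTF-8 text is sent
;; out byte-for-byte, whatever the original file's encoding was.
(set-process-coding-system proc 'binary 'binary)
(process-send-region proc (point-min) (point-max))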
Less efficient would be changing the connection process' coding system
first so it re-encodes the data back to the original encoding as it's
sent out. I'm not currently sure if this would always work as expected.
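As a sketch, that would look something like:
;; Re-encode on the way out using the coding system Emacs detected when
;; the file was read (the third argument is the outgoing coding system).
(set-process-coding-system proc 'binary buffer-file-coding-system)
(process-send-region proc (point-min) (point-max))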
Even less efficient would be reading the entire file a second time into
a unibyte buffer, then sending the raw data (as before your patch). But
this time with the Emacs-guessed charset discovered from the initial
decode.
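Something like this, as a sketch (`path' and `proc' are just placeholders):
;; First pass: a normal decoded read, only to learn the coding system.
(let ((coding (with-temp-buffer
                (insert-file-contents path)
                last-coding-system-used)))
  ;; Second pass: re-read the raw bytes into a unibyte buffer for sending.
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally path)
    ;; The raw bytes are now in the buffer; `coding' is what Emacs guessed
    ;; on the first read and could be reported as the charset.
    (process-send-region proc (point-min) (point-max))))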
HTTP coding is kind of a mess from Emacs' point of view because a stream
of HTTP data is typically a mix of different encodings. Headers are
ISO-8859-1 while content is generally either binary or UTF-8. To deal
with this, simple-httpd uses raw binary for the connection process and
calls decode-coding-string to manually decode just the HTTP header.
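The relevant step looks roughly like this (a sketch; `raw' stands for the undecoded bytes received from the connection):
;; Split off the header and decode only that part as ISO-8859-1 (latin-1);
;; anything that follows stays as raw bytes.
(let* ((end (string-match "\r\n\r\n" raw))
       (header (decode-coding-string (substring raw 0 end) 'latin-1)))
  header)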
Aaah, that makes sense. Sorry, I misunderstood completely. Thank you for teaching me a little more about Emacs encoding handling. If I understand the rest correctly (i.e. with or without this patch, Emacs will push UTF-8 into the pipes), would it make sense to rewrite this patch to instead just set charset=utf-8 on everything? It's not optimal (since recoding is not a reversible process in general), but it should be strictly better than the current situation where UTF-8 is used with no indication of it, right?
Without your patch, Emacs passes along any file's raw bytes without any
translation, just as if it were a binary file. A UTF-16 text file will
be served as UTF-16 — though without any indication of the "charset"
from the Emacs server. It just leaves it for the client to figure out
the encoding. I personally don't think it's the server's business to
translate between encodings since 1) it doesn't know the client's
purpose (the original encoding might be important) and 2) it could be
wrong about the original encoding.
Try it out yourself. Serve a UTF-16 text file with some non-ASCII code
points via simple-httpd and observe the results in various browsers and
other clients (curl -v | hexdump -C). I tested with GREEK SMALL LETTER
PI and PILE OF POO (to test a supplemental plane).
There's a certain irony that, in my tests, UTF-16 actually works better
than UTF-8 (without BOM / "signature") in both Firefox and Chromium.
Both detect that it's UTF-16 and display the text properly. A UTF-8 file
gets mangled because both browsers assume ASCII by default (despite RFC
6657). Both detect UTF-8 when a BOM is present (eww). I got a variety of
different results from various terminal browsers: lynx, links, elinks,
w3m.
In practice, for websites this is not an issue. An HTML document should
generally specify its encoding (e.g. <meta charset="...">) — despite how
strange it is for the encoding to be specified in itself — so it doesn't
matter if the server uses "charset" or not. JavaScript can't do this,
but JavaScript included from an HTML document adopts that document's
encoding (at least with Firefox and Chromium).
Side note #1: If you didn't know, Emacs extends Unicode such that it can
represent raw bytes within an encoded stream, which would otherwise be
illegal:
http://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html
Because of this, any binary file that was incorrectly decoded via any
coding system can be unambiguously and safely converted back to the
original binary data, given the coding system that was used to
mistakenly decode it.
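A quick way to see this (a sketch that should evaluate to t):
;; #xff #xfe is not valid UTF-8, so decoding turns those bytes into
;; Emacs's special raw-byte characters; re-encoding restores them exactly.
(let* ((raw (unibyte-string #xff #xfe #x41))
       (decoded (decode-coding-string raw 'utf-8))
       (restored (encode-coding-string decoded 'utf-8)))
  (equal raw restored))  ; => t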
Side note #2: It looks like simple-httpd doesn't properly respond to
"connection: close" as used by simpler browsers, which I should fix.
Perhaps there's some value in calling an external program like enca to
make a guess at the encoding if it's available. Unfortunately, enca
either incompletely identifies or fails to identify the encoding of some
of the files I tested, so it would need to be some other program. On the
other hand,
this is the "simple" httpd server, and I prefer to do the simple,
non-surprising thing of leaving out "charset" and letting the client
sort it out.
You haven't said _why_ you want this change. I see you paired this with
a PR over in impatient-mode. Are you having trouble with a particular
document? An issue with htmlize? The beauty of MIME types being stored
in an alist is that you can trivially override it for your own purposes:
(push '("html" . "text/html;charset=utf-8") httpd-mime-types)
But, IMHO, much better is to use a meta tag in the document if possible.
If the client uses HTTP/1.0 or "Connection: close" then the server *must* close the connection after the response has been sent.
Ah, yes, I forgot the buffer was not multibyte pre-patch. Sorry. Now that part makes sense as well. I agree fully that the server is out of line if it attempts to recode the content before passing it along to the user (besides the philosophical issue, it would wreck any checksums or HMACs associated with the content).
Yes, although I'm starting to suspect something else is wrong, because my problem contradicts your other observations. In particular, your observation that JavaScript included from an HTML document adopts the document's encoding appeared not to be the case for me. Firefox assumed (as you said) that a JavaScript file included in my HTML was ASCII encoded rather than UTF-8, and then when it encountered GREEK SMALL LETTER MU in an identifier it errored out. The HTML file had UTF-8 specified as a meta tag and it appeared to understand UTF-8 sequences. That said, I am 100% content with the solution you suggested of pushing a new MIME type onto the alist. I wish I had thought of that to begin with, and not tried to be clever. If it's okay with you, I can close this pull request as well as its companion.
Unless you've come up with another way to tackle this issue, closing
both PRs is fine with me. Again, thanks for taking the time to look into
this.
Firefox assumed (as you said) that a JavaScript file included in my
HTML was ASCII encoded rather than UTF-8, and then when it encountered
GREEK SMALL LETTER MU in an identifier it errored out.
Here's the test I ran before putting that claim in my response, just to
be absolutely sure my understanding was accurate. Both Firefox and
Chromium had the same result — though Chromium required a hard refresh.
index.html:
<!DOCTYPE html>
<meta charset="utf-8">
<script src="test.js"></script>
test.js (encoded with UTF-8, obviously):
alert('π');
When I visit this page via simple-httpd, the response headers include
"Content-Type: text/html" and "Content-Type: text/javascript" as
expected (no charset) but the alert box still shows "π" correctly. When
I comment/remove the meta tag, the alert box displays "Ï€" since the
JavaScript was interpreted as ISO-8859-1.
Companion to, but obviously independent of, PR #6 on impatient-mode.