Specify HTML numeric character reference fallback encoding for multipart upload filename characters not representable in form charset #3223

Closed
bsittler opened this issue Nov 11, 2017 · 8 comments

Comments

@bsittler

bsittler commented Nov 11, 2017

Specify HTML numeric character reference fallback encoding for multipart upload filename characters not representable in form acceptCharset/form charset.

Rationale:

  • Consistency: this will make filename fallback character replacement consistent with encoding of form element names and values in multipart uploads when a source character is not representable in the acceptCharset/form charset. @annevk points out that this is exactly the "html" error handling of the Encoding Standard. https://encoding.spec.whatwg.org/#concept-encoding-process
  • Predictability: this is consistent with existing behavior in at least two browsers (Firefox and Edge). I have also started an intent-to-implement-and-ship thread for this behavior for Chrome. Edit: this proposal was accepted; I'm now working to implement it in Chrome.
  • Reduced data loss: this change reduces the risk of user confusion and website malfunction when several files uploaded through <input type=file multiple> have distinct local filenames but identical representations after user agent-specific fallback character replacement. With this behavior standardized, web pages may even be able to portably recover useful user-visible representations of the original filenames. Some ambiguity remains with that approach, since a local filename could itself contain parts that match numeric character references. Moving to UTF-8 for the form submission of course resolves the ambiguity and should be the only recommended solution for newly built web pages.
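
To illustrate the data-loss point, here is a minimal Python sketch. It uses Python's built-in `replace` and `xmlcharrefreplace` codec error handlers as stand-ins for a user agent's `?` substitution and for the Encoding Standard's "html" error mode. The filenames and the choice of ISO-8859-1 are made up for the example, and Python's codec tables are not guaranteed to match the Encoding Standard byte for byte.

```python
# Hypothetical filenames selected through one <input type=file multiple>.
names = ["星.txt", "★.txt"]

# '?' substitution: both names collapse to the same bytes.
lossy = [n.encode("iso-8859-1", errors="replace") for n in names]

# Decimal numeric character references: the names stay distinct.
ncr = [n.encode("iso-8859-1", errors="xmlcharrefreplace") for n in names]

print(lossy)  # [b'?.txt', b'?.txt']
print(ncr)    # [b'&#26143;.txt', b'&#9733;.txt']
```

With `?` substitution a server cannot tell the two uploads apart by name; with numeric character references it can.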

Accidentally filed here too: w3c/html#1077

@annevk
Member

annevk commented Nov 14, 2017

Currently we have:

For each character in the entry's name and value that cannot be expressed using the selected character encoding, replace the character by a string consisting of a U+0026 AMPERSAND character (&), a U+0023 NUMBER SIGN character (#), one or more ASCII digits representing the code point of the character in base ten, and finally a U+003B SEMICOLON character (;).

If we use https://encoding.spec.whatwg.org/#encode this will happen automatically. The problem is that HTML passes strings to the RFC "algorithms" which are supposed to handle all the encoding requirements.

A proper fix would require replacing the RFC I think.
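
The replacement rule quoted in this comment can be sketched in a few lines of Python. This is only an illustration, not the spec's algorithm: the helper name `encode_with_ncr_fallback` is hypothetical, it relies on Python's codecs rather than the Encoding Standard's encoders, and it assumes the target encoding can represent the ASCII characters used in the reference itself.

```python
def encode_with_ncr_fallback(text: str, encoding: str) -> bytes:
    """Encode text, replacing unencodable characters with &#NNNN; references."""
    pieces = []
    for ch in text:
        try:
            ch.encode(encoding)  # probe: is this character representable?
            pieces.append(ch)
        except UnicodeEncodeError:
            # U+0026 '&', U+0023 '#', code point in base ten, U+003B ';'
            pieces.append(f"&#{ord(ch)};")
    # Everything left is representable, so a strict encode cannot fail here.
    return "".join(pieces).encode(encoding)


manual = encode_with_ncr_fallback("a≈b", "iso-8859-1")
builtin = "a≈b".encode("iso-8859-1", errors="xmlcharrefreplace")
print(manual)             # b'a&#8776;b'
print(manual == builtin)  # True
```

Python's built-in `xmlcharrefreplace` error handler produces the same result in one step, which mirrors the point that an encode operation with this error mode makes the replacement happen automatically.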

@annevk
Member

annevk commented Nov 14, 2017

Replacing the RFC is #3040 and https://www.w3.org/Bugs/Public/show_bug.cgi?id=16909.

@bsittler
Author

bsittler commented Nov 14, 2017

An example: if the filename were ABC~‾¥≈¤・・•∙·☼★星🌟星★☼·∙•・・¤≈¥‾~XYZ&#8776;.txt and the form charset were ISO-2022-JP, the expected encoded name would be ABC~⊖(J~\&#8776;&#164;⊖$B!&!&⊖(B&#8226;&#8729;&#183;&#9788;⊖$B!z@1⊖(B&#127775;⊖$B@1!z⊖(B&#9788;&#183;&#8729;&#8226;⊖$B!&!&⊖(B&#164;&#8776;⊖(J\~⊖(B~XYZ&#8776;.txt a.k.a. ABC~␛(J~\&#8776;&#164;␛$B!&!&␛(B&#8226;&#8729;&#183;&#9788;␛$B!z@1␛(B&#127775;␛$B@1!z␛(B&#9788;&#183;&#8729;&#8226;␛$B!&!&␛(B&#164;&#8776;␛(J\~␛(B~XYZ&#8776;.txt (here ⊖ or ␛ pictorially represents the ESC C0 control character, 0x1B).

Beware the information-losing unification of halfwidth and fullwidth forms, in addition to the information loss due to the inability to distinguish literal numeric character reference-like sequences from replacement ones. The actual solution to those problems is UTF-8, but so long as non-UTF-8 charsets are still in use, this will at least reduce information loss compared to e.g. ? substitution.
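
The ambiguity between literal and replacement references can be shown with a small Python sketch. It assumes Python's `iso2022_jp` codec and `xmlcharrefreplace` error handler as approximations of the ISO-2022-JP encoder with "html" error handling; the two short filenames are made up, and the sketch does not try to reproduce the full example above (Python's codec, for instance, does not unify halfwidth and fullwidth forms).

```python
import html

CHARSET = "iso2022_jp"  # Python's codec name, used here as an approximation

# One name contains U+2248 (unrepresentable); the other contains the
# literal characters "&#8776;".
replaced = "ABC≈.txt".encode(CHARSET, errors="xmlcharrefreplace")
literal = "ABC&#8776;.txt".encode(CHARSET, errors="xmlcharrefreplace")

# Both names produce the same bytes on the wire.
print(replaced == literal)  # True

# A server that unescapes references can recover only one of the originals.
recovered = html.unescape(replaced.decode(CHARSET))
print(recovered)  # ABC≈.txt
```

A server applying this kind of recovery would misreport the second filename, which is why UTF-8 remains the only unambiguous option.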

@bsittler
Author

bsittler commented Dec 6, 2017

Given that this issue is still open, should the tests I'm adding in https://crrev.com/c/811625 be marked .tentative?

chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Dec 6, 2017
Tests multipart form POSTs with file inputs where the selected "file"
was constructed using the `File` constructor and added to a
`DataTransferItemList` (this avoids the user gesture requirement which
otherwise would consign this to manual testing.) For the non-ASCII
filenames with non-UTF-8 accept-charsets this also verifies fallback
encoding/replacement of unrepresentable characters using numeric
character references. whatwg/html#2861

Coverage for fallback encoding is still tentative because filename
fallback encoding is not yet
standardized. whatwg/html#3223

Bug: 661819
Change-Id: Ic646f76b0c8a0792d1214a7848d2238bcc3a76e7
@domenic
Member

domenic commented Dec 6, 2017

Yeah. Were you interested in updating the spec too?

@bsittler
Author

bsittler commented Dec 6, 2017

Sure! How does a non-editor do that? Edit: n/m, I see CONTRIBUTING.md now

chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Dec 6, 2017
Tests multipart form POSTs with file inputs where the selected "file"
was constructed using the `File` constructor and added to a
`DataTransferItemList` (this avoids the user gesture requirement which
otherwise would consign this to manual testing.) For the non-ASCII
filenames with non-UTF-8 accept-charsets this also verifies fallback
encoding/replacement of unrepresentable characters using numeric
character references. whatwg/html#2861

Coverage for fallback encoding is still tentative because filename
fallback encoding is not yet standardized.
whatwg/html#3223

Bug: 661819
Change-Id: Ic646f76b0c8a0792d1214a7848d2238bcc3a76e7
Reviewed-on: https://chromium-review.googlesource.com/811625
Reviewed-by: Victor Costan <[email protected]>
Reviewed-by: Joshua Bell <[email protected]>
Commit-Queue: Benjamin Wiley Sittler <[email protected]>
Cr-Commit-Position: refs/heads/master@{#522197}
@bsittler
Author

bsittler commented Dec 6, 2017

@annevk @domenic I have attempted to change HTML to match in #3276 - would you be suitable reviewers?

MXEBot pushed a commit to mirror/chromium that referenced this issue Dec 7, 2017
@annevk
Member

annevk commented Mar 1, 2021

@andreubotella ended up fixing this in #6282.

@annevk annevk closed this as completed Mar 1, 2021

3 participants