-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add explicit-endian String::from_utf16 variants #95967
Conversation
Thank you for submitting a new PR for the library teams! If this PR contains a stabilization of a library feature that has not already completed FCP in its tracking issue, introduces new or changes existing unstable library APIs, or changes our public documentation in ways that create new stability guarantees then please comment with |
(rust-highfive has picked a reviewer for you, use r? to override) |
Surely it would be more consistent with |
The point of this API surface is to take |
Ah, I see, it's to deserialize bytes from say a file or other stream. That makes sense. |
(triaging my open PRs) I've inlined this into the project that needed it, and it's been working fine there. (Though tbf I haven't exercised that codepath much.) I think this is right on the edge of interesting enough to include in std; the forwarding to the aligned version when possible is beneficial but easy to overlook. The one thing that maybe should be added is a note to use I'm happy to do either, just S-waiting-on-review. |
match (cfg!(target_endian = "little"), unsafe { v.align_to::<u16>() }) { | ||
(true, (&[], v, &[])) => Self::from_utf16(v), | ||
_ => decode_utf16(v.array_chunks::<2>().copied().map(u16::from_le_bytes)) | ||
.collect::<Result<_, _>>() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it would be better to use String::with_capacity()
+ String::push()
instead of collect::<Result<_, _>>()
until #48994 is fixed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing this would match the implementation in from_utf16
. However, I expect these functions will be much less performance critical, so adding the extra code is perhaps unneeded? I'm not sure.
Ping from triage @joshtriplett. It's been 10 months since last activity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great to me. Please open a tracking issue.
maybe these should be named with
_bytes
in the name to match the convention fromu16
I find the names currently in the PR to be all right and would be happy landing them as is. I'm sure we'll discuss the naming further as part of stabilization, though.
Thanks!
This comment has been minimized.
This comment has been minimized.
Tracking issue linked: #116258 |
This comment has been minimized.
This comment has been minimized.
Co-authored-by: David Tolnay <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
@bors r+ |
Rollup of 4 pull requests Successful merges: - rust-lang#95967 (Add explicit-endian String::from_utf16 variants) - rust-lang#116530 (delay a bug when encountering an ambiguity in MIR typeck) - rust-lang#116611 (Document `diagnostic_namespace` feature) - rust-lang#116612 (Remove unused dominator iterator) r? `@ghost` `@rustbot` modify labels: rollup
Rollup merge of rust-lang#95967 - CAD97:from-utf16, r=dtolnay Add explicit-endian String::from_utf16 variants This adds the following APIs under `feature(str_from_utf16_endian)`: ```rust impl String { pub fn from_utf16le(v: &[u8]) -> Result<String, FromUtf16Error>; pub fn from_utf16le_lossy(v: &[u8]) -> String; pub fn from_utf16be(v: &[u8]) -> Result<String, FromUtf16Error>; pub fn from_utf16be_lossy(v: &[u8]) -> String; } ``` These are versions of `String::from_utf16` that explicitly take [UTF-16LE and UTF-16BE](https://unicode.org/faq/utf_bom.html#gen7). Notably, we can do better than just the obvious `decode_utf16(v.array_chunks::<2>().copied().map(u16::from_le_bytes)).collect()` in that: - We handle the case where the byte slice is not an even number of bytes, and - In the case that the UTF-16 is native endian and the slice is aligned, we can forward to `String::from_utf16`. If the Unicode Consortium actively defines how to handle character replacement when decoding a UTF-16 bytestream with a trailing odd byte, I was unable to find reference. However, the behavior implemented here is fairly self-evidently correct: replace the single errant byte with the replacement character.
48: Pull upstream master 2023 10 12 r=tshepang a=Dajamante * rust-lang/rust#113487 * rust-lang/rust#116506 * rust-lang/rust#116448 * rust-lang/rust#116640 * rust-lang/rust#116627 * rust-lang/rust#116597 * rust-lang/rust#116436 * rust-lang/rust#116315 * rust-lang/rust#116219 * rust-lang/rust#113218 * rust-lang/rust#115937 * rust-lang/rust#116014 * rust-lang/rust#116623 * rust-lang/rust#112818 * rust-lang/rust#115948 * rust-lang/rust#116622 * rust-lang/rust#116621 * rust-lang/rust#116612 * rust-lang/rust#116611 * rust-lang/rust#116530 * rust-lang/rust#95967 * rust-lang/rust#116578 * rust-lang/rust#113915 * rust-lang/rust#116605 * rust-lang/rust#116574 * rust-lang/rust#116560 * rust-lang/rust#116559 * rust-lang/rust#116503 * rust-lang/rust#116444 * rust-lang/rust#116250 * rust-lang/rust#109422 * rust-lang/rust#116598 * rust-lang/rust#116596 * rust-lang/rust#116595 * rust-lang/rust#116589 * rust-lang/rust#116586 * rust-lang/rust#116551 * rust-lang/rust#116409 * rust-lang/rust#116548 * rust-lang/rust#116366 * rust-lang/rust#109882 * rust-lang/rust#116497 * rust-lang/rust#116532 * rust-lang/rust#116569 * rust-lang/rust#116561 * rust-lang/rust#116556 * rust-lang/rust#116549 * rust-lang/rust#116543 * rust-lang/rust#116537 * rust-lang/rust#115882 * rust-lang/rust#116142 * rust-lang/rust#115238 * rust-lang/rust#116533 * rust-lang/rust#116096 * rust-lang/rust#116468 * rust-lang/rust#116515 * rust-lang/rust#116454 * rust-lang/rust#116183 * rust-lang/rust#116514 * rust-lang/rust#116509 * rust-lang/rust#116487 * rust-lang/rust#116486 * rust-lang/rust#116450 * rust-lang/rust#114623 * rust-lang/rust#116416 * rust-lang/rust#116437 * rust-lang/rust#100806 * rust-lang/rust#116330 * rust-lang/rust#116310 * rust-lang/rust#115583 * rust-lang/rust#116457 * rust-lang/rust#116508 * rust-lang/rust#109214 * rust-lang/rust#116318 * rust-lang/rust#116501 * rust-lang/rust#116500 * rust-lang/rust#116458 * rust-lang/rust#116400 * rust-lang/rust#116277 * rust-lang/rust#114709 * rust-lang/rust#116492 * rust-lang/rust#116484 * rust-lang/rust#116481 * rust-lang/rust#116474 * rust-lang/rust#116466 * rust-lang/rust#116423 * rust-lang/rust#116297 * rust-lang/rust#114564 * rust-lang/rust#114811 * rust-lang/rust#116489 * rust-lang/rust#115304 Co-authored-by: Peter Hall <[email protected]> Co-authored-by: Emanuele Vannacci <[email protected]> Co-authored-by: Neven Villani <[email protected]> Co-authored-by: Alex Macleod <[email protected]> Co-authored-by: Tamir Duberstein <[email protected]> Co-authored-by: Eduardo Sánchez Muñoz <[email protected]> Co-authored-by: koka <[email protected]> Co-authored-by: bors <[email protected]> Co-authored-by: Philipp Krones <[email protected]> Co-authored-by: Camille GILLOT <[email protected]> Co-authored-by: Esteban Küber <[email protected]> Co-authored-by: Ralf Jung <[email protected]>
48: Pull upstream master 2023 10 12 r=tshepang a=Dajamante * rust-lang/rust#113487 * rust-lang/rust#116506 * rust-lang/rust#116448 * rust-lang/rust#116640 * rust-lang/rust#116627 * rust-lang/rust#116597 * rust-lang/rust#116436 * rust-lang/rust#116315 * rust-lang/rust#116219 * rust-lang/rust#113218 * rust-lang/rust#115937 * rust-lang/rust#116014 * rust-lang/rust#116623 * rust-lang/rust#112818 * rust-lang/rust#115948 * rust-lang/rust#116622 * rust-lang/rust#116621 * rust-lang/rust#116612 * rust-lang/rust#116611 * rust-lang/rust#116530 * rust-lang/rust#95967 * rust-lang/rust#116578 * rust-lang/rust#113915 * rust-lang/rust#116605 * rust-lang/rust#116574 * rust-lang/rust#116560 * rust-lang/rust#116559 * rust-lang/rust#116503 * rust-lang/rust#116444 * rust-lang/rust#116250 * rust-lang/rust#109422 * rust-lang/rust#116598 * rust-lang/rust#116596 * rust-lang/rust#116595 * rust-lang/rust#116589 * rust-lang/rust#116586 * rust-lang/rust#116551 * rust-lang/rust#116409 * rust-lang/rust#116548 * rust-lang/rust#116366 * rust-lang/rust#109882 * rust-lang/rust#116497 * rust-lang/rust#116532 * rust-lang/rust#116569 * rust-lang/rust#116561 * rust-lang/rust#116556 * rust-lang/rust#116549 * rust-lang/rust#116543 * rust-lang/rust#116537 * rust-lang/rust#115882 * rust-lang/rust#116142 * rust-lang/rust#115238 * rust-lang/rust#116533 * rust-lang/rust#116096 * rust-lang/rust#116468 * rust-lang/rust#116515 * rust-lang/rust#116454 * rust-lang/rust#116183 * rust-lang/rust#116514 * rust-lang/rust#116509 * rust-lang/rust#116487 * rust-lang/rust#116486 * rust-lang/rust#116450 * rust-lang/rust#114623 * rust-lang/rust#116416 * rust-lang/rust#116437 * rust-lang/rust#100806 * rust-lang/rust#116330 * rust-lang/rust#116310 * rust-lang/rust#115583 * rust-lang/rust#116457 * rust-lang/rust#116508 * rust-lang/rust#109214 * rust-lang/rust#116318 * rust-lang/rust#116501 * rust-lang/rust#116500 * rust-lang/rust#116458 * rust-lang/rust#116400 * rust-lang/rust#116277 * rust-lang/rust#114709 * rust-lang/rust#116492 * rust-lang/rust#116484 * rust-lang/rust#116481 * rust-lang/rust#116474 * rust-lang/rust#116466 * rust-lang/rust#116423 * rust-lang/rust#116297 * rust-lang/rust#114564 * rust-lang/rust#114811 * rust-lang/rust#116489 * rust-lang/rust#115304 Co-authored-by: Emanuele Vannacci <[email protected]> Co-authored-by: Neven Villani <[email protected]> Co-authored-by: Alex Macleod <[email protected]> Co-authored-by: Tamir Duberstein <[email protected]> Co-authored-by: Eduardo Sánchez Muñoz <[email protected]> Co-authored-by: koka <[email protected]> Co-authored-by: bors <[email protected]> Co-authored-by: Philipp Krones <[email protected]> Co-authored-by: Camille GILLOT <[email protected]> Co-authored-by: Esteban Küber <[email protected]> Co-authored-by: Ralf Jung <[email protected]> Co-authored-by: ShE3py <[email protected]>
This adds the following APIs under
feature(str_from_utf16_endian)
:These are versions of
String::from_utf16
that explicitly take UTF-16LE and UTF-16BE. Notably, we can do better than just the obviousdecode_utf16(v.array_chunks::<2>().copied().map(u16::from_le_bytes)).collect()
in that:String::from_utf16
.If the Unicode Consortium actively defines how to handle character replacement when decoding a UTF-16 bytestream with a trailing odd byte, I was unable to find reference. However, the behavior implemented here is fairly self-evidently correct: replace the single errant byte with the replacement character.