Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking issue for char encoding methods #27784

Closed
alexcrichton opened this issue Aug 13, 2015 · 71 comments
Closed

Tracking issue for char encoding methods #27784

alexcrichton opened this issue Aug 13, 2015 · 71 comments
Labels
B-unstable Blocker: Implemented in the nightly compiler and unstable. E-help-wanted Call for participation: Help is requested to fix this issue. final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@alexcrichton
Copy link
Member

This is a tracking issue for the unstable unicode feature and the char::encode_utf{8,16} methods.

The interfaces here are a little wonky but are done for performance. It's not clear whether these need to be exported or not or if there's a better method to do so through iterators.

@alexcrichton alexcrichton added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. B-unstable Blocker: Implemented in the nightly compiler and unstable. labels Aug 13, 2015
@SimonSapin
Copy link
Contributor

How about returning enums like enum OneOrTwo { One(u16), Two(u16, u16) } or enum Utf16Encoding { SingleCodeUnit(u16), SurrogatePair(u16, u16) }?

@alexcrichton
Copy link
Member Author

Certainly possible, but there's also the question of ergonomics here in terms of what to do with that after you've got the information.

@SimonSapin
Copy link
Contributor

I think that one form or another of this functionality that doesn’t require allocation should be exposed. Returning an iterator is nicer than taking &mut [_], but I don’t know about performance.

@nagisa
Copy link
Member

nagisa commented Aug 13, 2015

I’ve suggested taking &mut [u8; 4]/&mut [u16; 2] and returning usize once. The major downside is inability to convert from slice to array.

Taking anything else than slice also makes following use case not as elegant as it is now:

let mut buffer = Vec::with_capacity(alot);
let mut idx = 0;
loop {
     idx += some_char().encode_utf8(&mut buffer[idx..]).unwrap();
}

bors added a commit that referenced this issue Aug 27, 2015
* Rename `Utf16Items` to `Utf16Decoder`. "Items" is meaningless.
* Generalize it to any `u16` iterator, not just `[u16].iter()`
* Make it yield `Result` instead of a custom `Utf16Item` enum that was isomorphic to `Result`. This enable using the `FromIterator for Result` impl.
* Replace `Utf16Item::to_char_lossy` with a `Utf16Decoder::lossy` iterator adaptor.

This is a [breaking change], but only for users of the unstable `rustc_unicode` crate.

I’d like this functionality to be stabilized and re-exported in `std` eventually, as the "low-level equivalent" of `String::from_utf16` and `String::from_utf16_lossy` like #27784 is the low-level equivalent of #27714.

CC @aturon, @alexcrichton
@SimonSapin
Copy link
Contributor

How about returning something that both is an iterator and dereferences to a slice?

struct Utf8Char {
    bytes: [u8; 4],
    position: usize,
}

impl Deref for Utf8Char {
    type Target = [u8];
    fn deref(&self) -> &[u8] { &self.bytes[self.position..] }
}

impl Iterator for Utf8Char {
    type Item = u8;
    fn next(&mut self) -> Option<u8> {
        if self.position < self.bytes.len() {
            let byte = self.bytes[self.position];
            self.position += 1;
            Some(byte)
        } else {
            None
        }
    }
}

(“Short” code points have zeros as padding at the start of the array.)

… and similarly for UTF-16, but with [u16; 2] instead of [u8; 4].

@BurntSushi
Copy link
Member

@SimonSapin That looks really sweet to me!

@BurntSushi
Copy link
Member

@SimonSapin In your deref method, I think that should be &self.bytes[..self.position], right?

@BurntSushi
Copy link
Member

@SimonSapin What do you think about also exposing decode_{utf8,utf16} methods? Basically, if you have some bytes and want the next encoded char out of it, today I think you need to decode into a string and then call chars, which is a bit roundabout (and does extra work I believe).

@SimonSapin
Copy link
Contributor

No, deref returns the slice that hasn’t been consumed by the iterator yet. For code points that have less than 4 bytes to begin with, padding is at the start of the array, not the end.

We already have char::decode_utf16 that takes and returns iterators.

For UTF-8 I do want to expose a decoder that’s more low-level than what we currently have, but I’m not sure what it should look like. I have some experiments at https://github.com/SimonSapin/rust-utf8

@BurntSushi
Copy link
Member

padding is at the start of the array, not the end.

Ah! That was what I missed. Thanks for the clarification.

I have some experiments at https://github.com/SimonSapin/rust-utf8

Interesting. That is much more complex than I had thought it would be! (I hadn't considered returning additional info about incomplete sequences.)

@SimonSapin
Copy link
Contributor

Most of the complexity comes from self-imposed constraints:

  • Support “chunked” decoding so you can start processing, say, an HTML document before it’s finished downloading from the network. The bytes for a single char can be split across chunks.
  • Make it possible to emit &str slices that borrow &[u8] input bytes whenever possible, to avoid copying too many bytes.

I don’t know how much of that should be in the standard library.

But when the standard library gets performance improvement like #30740 (and perhaps more in the future with SIMD or something?), ideally they’d be in a low-level algorithm that everything else builds on top of.

@alexcrichton
Copy link
Member Author

🔔 This issue is now entering its cycle-long final comment period for stabilization 🔔

The API proposed by @SimonSapin seems reasonable, perhaps in the form of:

impl char {
    fn encode_utf8(&self) -> EncodeUtf8;
}

struct EncodeUtf8 {
    // ...
}

impl Iterator for EncodeUtf8 {
    type Item = u8;
    // ...
}

impl EncodeUtf8 {
    #[unstable(...)]
    pub fn as_slice(&self) -> &[u8] { /* ... */ }
}

@alexcrichton alexcrichton added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed I-nominated labels Mar 11, 2016
alexcrichton added a commit to alexcrichton/rust that referenced this issue Mar 22, 2016
Currently these have non-traditional APIs which take a buffer and report how
much was filled in, but they're not necessarily ergonomic to use. Returning an
iterator which *also* exposes an underlying slice shouldn't result in any
performance loss as it's just a lazy version of the same implementation, and
it's also much more ergonomic!

cc rust-lang#27784
bors added a commit that referenced this issue Mar 22, 2016
…turon

std: Change `encode_utf{8,16}` to return iterators

Currently these have non-traditional APIs which take a buffer and report how
much was filled in, but they're not necessarily ergonomic to use. Returning an
iterator which *also* exposes an underlying slice shouldn't result in any
performance loss as it's just a lazy version of the same implementation, and
it's also much more ergonomic!

cc #27784
bors added a commit that referenced this issue Mar 22, 2016
…turon

std: Change `encode_utf{8,16}` to return iterators

Currently these have non-traditional APIs which take a buffer and report how
much was filled in, but they're not necessarily ergonomic to use. Returning an
iterator which *also* exposes an underlying slice shouldn't result in any
performance loss as it's just a lazy version of the same implementation, and
it's also much more ergonomic!

cc #27784
@Stebalien
Copy link
Contributor

Why add an as_slice method instead of implementing Deref and Index? I was under the impression that &thing[..] should always be prefered to thing.as_slice().

@daschl
Copy link

daschl commented Oct 10, 2016

@alexcrichton not sure if this is in scope, but I have the use case where I want to write a "char" into a u8 array at a given offset. Right now I've copied the main part from stdlib and do it like so:

#[inline]
fn write_char_into_array(&self, offset: usize, ch: &char, array: &mut [u8]) -> bool {
    if (ch.len_utf8() + offset) > array.len() {
        return false;
    }

    let code = *ch as u32;
    if code < MAX_ONE_B {
        array[offset] = code as u8;
    } else if code < MAX_TWO_B {
        array[offset] = (code >> 6 & 0x1F) as u8 | TAG_TWO_B;
        array[offset + 1] = (code & 0x3F) as u8 | TAG_CONT;
    } else if code < MAX_THREE_B {
        array[offset] = (code >> 12 & 0x0F) as u8 | TAG_THREE_B;
        array[offset + 1] = (code >> 6 & 0x3F) as u8 | TAG_CONT;
        array[offset + 2] = (code & 0x3F) as u8 | TAG_CONT;
    } else {
        array[offset] = (code >> 18 & 0x07) as u8 | TAG_FOUR_B;
        array[offset + 1] = (code >> 12 & 0x3F) as u8 | TAG_CONT;
        array[offset + 2] = (code >> 6 & 0x3F) as u8 | TAG_CONT;
        array[offset + 3] = (code & 0x3F) as u8 | TAG_CONT;
    }
    true
}

Again not sure if this is in scope, but I wanted to bring it up in case such a use case makes sense to integrate.

@SimonSapin
Copy link
Contributor

@daschl you can deal with the offset by slicing:

fn write_char_into_array(&self, offset: usize, ch: &char, array: &mut [u8]) -> bool {
    let slice = &mut array[offset..];
    if ch.len_utf8() > slice.len() {
        return false
    }
    ch.encode_utf8(slice);
    true
}

@bluss
Copy link
Member

bluss commented Oct 10, 2016

@alexcrichton All of your suggestions are good. &'a mut str seems like the most powerful one, and it allows recovering all the other ones.

I don't know if this data point is interesting, but ArrayString ended up copying this code too. When I shaped it after how it's used it ends up looking like

pub fn encode_utf8(ch: char, buf: &mut [u8]) -> Result<usize, EncodeError>

impl: https://github.com/bluss/arrayvec/blob/d43c959fa8afa912c497104b60a64892e1178f9d/src/char_ext.rs#L28-L51

use: https://github.com/bluss/arrayvec/blob/d43c959fa8afa912c497104b60a64892e1178f9d/src/array_string.rs#L114-L119

@daschl
Copy link

daschl commented Oct 10, 2016

@SimonSapin ah good to know, thanks! lets hope it gets stable soon then

@bluss
Copy link
Member

bluss commented Oct 10, 2016

I would vote for no panic. That also simplifies the bounds checking (the impl linked above has all bounds checks elided, since buf.len() is being tested manually).

@alexcrichton
Copy link
Member Author

@rfcbot fcp merge

These methods have been around for awhile, I propose we merge!

@alexcrichton
Copy link
Member Author

er, stabilize*

@rfcbot
Copy link

rfcbot commented Nov 1, 2016

Team member @alexcrichton has proposed to merge this. The next step is review by the rest of the tagged teams:

Concerns:

Once these reviewers reach consensus, this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@brson
Copy link
Contributor

brson commented Nov 4, 2016

It's not obvious to me that the current API is the one we want since there has been ongoing discussion. @bluss recently suggested it should not panic, and the current implementation does panic.

@brson
Copy link
Contributor

brson commented Nov 4, 2016

@rfcbot concern panic-vs-not-panic

@alexcrichton
Copy link
Member Author

I'm personally convinced by @SimonSapin's points above, which is that there is a statically known size (4 and 2) to encode all utf8 and utf16 characters. Callers basically always need to pass in that size buffer, so it seems more like a programmer error if you pass in a small buffer than a runtime error that should be handled.

@SimonSapin
Copy link
Contributor

To reiterate: you can use 2 or 4 to keep things simple or to use a statically-sized array, but you can also use char::len_utf8 or char::len_utf16 to find the precise length that is needed.

@bluss
Copy link
Member

bluss commented Nov 4, 2016

I don't want to cause a gridlock, it was just what my perspective was in that moment. The fixed capacity use case is more uncommon, and the motivation for panic is following the usual conventions, so it certainly makes sense.

@brson
Copy link
Contributor

brson commented Nov 10, 2016

OK, it sounds like there's no major opposition to the api as is.

@brson
Copy link
Contributor

brson commented Nov 10, 2016

@rfcbot resolved panic-vs-not-panic

@Kimundi
Copy link
Member

Kimundi commented Nov 11, 2016

Apart from the panicking. I'm a bit confused right now about what the actual API/signature is going to be. The one that returns &mut str ?

@alexcrichton
Copy link
Member Author

@Kimundi

Yeah encode_utf8 looks like:

fn encode_utf8(self, dst: &mut [u8]) -> &mut str

and encode_utf16 looks like:

fn encode_utf16(self, dst: &mut [u16]) -> &mut [u16]

@Kimundi
Copy link
Member

Kimundi commented Nov 11, 2016

Alright!

@rfcbot
Copy link

rfcbot commented Nov 12, 2016

🔔 This is now entering its final comment period, as per the review above. 🔔

psst @alexcrichton, I wasn't able to add the final-comment-period label, please do so.

@alexcrichton alexcrichton added the final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. label Nov 12, 2016
@rfcbot
Copy link

rfcbot commented Nov 22, 2016

The final comment period is now complete.

bors added a commit that referenced this issue Dec 18, 2016
Library stabilizations/deprecations for 1.15 release

Stabilized:

- `std::iter::Iterator::{min_by, max_by}`
- `std::os::*::fs::FileExt`
- `std::sync::atomic::Atomic*::{get_mut, into_inner}`
- `std::vec::IntoIter::{as_slice, as_mut_slice}`
- `std::sync::mpsc::Receiver::try_iter`
- `std::os::unix::process::CommandExt::before_exec`
- `std::rc::Rc::{strong_count, weak_count}`
- `std::sync::Arc::{strong_count, weak_count}`
- `std::char::{encode_utf8, encode_utf16}`
- `std::cell::Ref::clone`
- `std::io::Take::into_inner`

Deprecated:

- `std::rc::Rc::{would_unwrap, is_unique}`
- `std::cell::RefCell::borrow_state`

Closes #23755
Closes #27733
Closes #27746
Closes #27784
Closes #28356
Closes #31398
Closes #34931
Closes #35601
Closes #35603
Closes #35918
Closes #36105
gwenn added a commit to gwenn/rustyline that referenced this issue Dec 26, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B-unstable Blocker: Implemented in the nightly compiler and unstable. E-help-wanted Call for participation: Help is requested to fix this issue. final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests