Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add utf16_units #14613

Closed
wants to merge 1 commit into from
Closed

Add utf16_units #14613

wants to merge 1 commit into from

Conversation

schmee
Copy link
Contributor

@schmee schmee commented Jun 2, 2014

Closes #14358.

The tests are not yet moved to utf16_iter, so this probably won't compile. I'm submitting this PR anyway so it can be reviewed and since it was mentioned in #14611. EDIT: Tests now use utf16_iter.

This deprecates .to_utf16. x.to_utf16() should be replaced by either x.utf16_iter().collect::<Vec<u16>>() (the type annotation may be optional), or just x.utf16_iter() directly, if it can be used in an iterator context.

[breaking-change]

cc @huonw

@@ -720,7 +720,7 @@ fn with_envp<T>(env: Option<&[(CString, CString)]>, cb: |*mut c_void| -> T) -> T
let kv = format!("{}={}",
pair.ref0().as_str().unwrap(),
pair.ref1().as_str().unwrap());
blk.push_all(kv.to_utf16().as_slice());
blk.extend(kv.to_utf16());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this method mean to be extend(kv.utf16_iter()) ?

@alexcrichton
Copy link
Member

Out of curiosity, what does the perf look like with an iterator vs a hard-coded to_utf16 loop?

extra: u16
}

impl<'a> Iterator<u16> for Utf16Codeunits<'a> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this override size_hint as well?

@schmee
Copy link
Contributor Author

schmee commented Jun 3, 2014

Is there any special reason why from_utf8 returns Option<&'a str> while from_utf16 returns Option<String>? This tripped me up a bit when working on the tests. Some of the {to, from}_utf{8,16} works on str and some on String, perhaps this could be unified somehow?

@huonw
Copy link
Member

huonw commented Jun 3, 2014

str and String both use UTF-8 internally, so from_utf8 can just check if the &[u8] buffer is valid UTF-8, and then cast it to a &str: the conversion can be done by just reinterpreting the same chunk of memory as a different type.

On the other hand, a UTF-16 vector does not have the same memory layout as a string, the u16s have to be decoded and built up into a dynamic String. (Theoretically from_utf16 can/should return an Iterator<char>, rather than actually building the String internally.)

/// External iterator for a string's UTF16 codeunits.
/// Use with the `std::iter` module.
#[deriving(Clone)]
pub struct Utf16Codeunits<'a> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Utf16CodeUnits would be more consistent with Rust's style guide

@schmee
Copy link
Contributor Author

schmee commented Jun 16, 2014

Fixed tests, squashed and rebased. ping @huonw

@schmee
Copy link
Contributor Author

schmee commented Jun 17, 2014

It feels unnecessary to have both to_utf16 and utf16_iter, since the iterator is strictly more flexible. How about removing to_utf16, renaming utf16_iter to to_utf16, and also make from_utf16 into an iterator so the semantics match. I can refactor the call sites as needed.

@huonw
Copy link
Member

huonw commented Jun 17, 2014

How about removing to_utf16,

This PR is doing this already: deprecation is how functions are removed (deprecated things are properly deleted a month or so after they were deprecated).

renaming utf16_iter to to_utf16

Using .to_utf16 defies the naming conventions (it's not an expensive cast).

also make from_utf16 into an iterator so the semantics match

This is a good idea. Theoretically both of these can just be iterator adaptors, that is

fn from_utf16<I: Iterator<u16>>(it: I) -> FromUtf16<I> { ... }
fn to_utf16<I: Iterator<char>>(it: I) -> ToUtf16<I> { ... }

impl<I: Iterator<u16>> Iterator<char> for FromUtf16<I> { ... }
impl<I: Iterator<char>> Iterator<u16> for ToUtf16<I> { ... }

and this would then be called like to_utf16(string.chars()). I don't think this is necessary to do in this PR.

@huonw
Copy link
Member

huonw commented Jun 17, 2014

(BTW, I edited a [breaking-change] description into the PR text, since this "removes" a method from strings.)

bors added a commit that referenced this pull request Jun 17, 2014
Closes #14358.

The tests are not yet moved to `utf16_iter`, so this probably won't compile. I'm submitting this PR anyway so it can be reviewed and since it was mentioned in #14611.

This deprecates `.to_utf16`. `x.to_utf16()` should be replaced by either `x.utf16_iter().collect::<Vec<u16>>()` (the type annotation may be optional), or just `x.utf16_iter()` directly, if it can be used in an iterator context.

[breaking-change]

cc @huonw
@schmee
Copy link
Contributor Author

schmee commented Jun 17, 2014

@huonw r?

let (low, high) = self.chars.size_hint();
// we could be entirely valid surrogates (2 elements per
// char), or entirely non-surrogates (1 element per char)
(low / 2, high)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realised this is wrong: it should be (low, high * 2), since every char gets either one u16 or two u16, meaning the length of this iterator is between 1 or 2 times as long as the underlying one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and it needs to use CheckedMul (and I guess it needs to handle the Option): high.and_then(|n| n.checked_mul(&2)).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh boy, that's embarrassing... I've pushed a fix.

@schmee
Copy link
Contributor Author

schmee commented Jun 24, 2014

This is tricky, since it touches a bunch of Windows-related things in libnative that I can't check. So it builds fine locally but fails when bors runs the tests. I spent a couple of hours trying to build Rust on Windows but there is no way I'll figure that out. I've changed all occurrences of to_utf16 to utf16_iter and I hope I got it right...

@huonw r? Sorry for spamming about this, didn't anticipate the libnative errors...

@bill-myers
Copy link
Contributor

Shouldn't this be called utf16() or something like utf16_units() without _iter in the name?

@huonw
Copy link
Member

huonw commented Jun 30, 2014

I asked on IRC. r=me if the name changes to utf16_units. (git ls-files -z | xargs -0 sed -i s/utf16_iter/utf16_units/g should do the renaming throughout the whole codebase.)

This deprecates `.to_utf16`. `x.to_utf16()` should be replaced by either
`x.utf16_units().collect::<Vec<u16>>()` (the type annotation may be optional), or
just `x.utf16_units()` directly, if it can be used in an iterator context.

Closes rust-lang#14358

[breaking-change]
@schmee schmee changed the title Add utf16_iter Add utf16_units Jun 30, 2014
@schmee
Copy link
Contributor Author

schmee commented Jun 30, 2014

@huonw Renamed. I have tried building this locally, but as reported in #15250 and #15279, I have not been able to make it through the test suite even with master! So no guarantees, but I really hope this works now. r?

bors added a commit that referenced this pull request Jun 30, 2014
Closes #14358.

~~The tests are not yet moved to `utf16_iter`, so this probably won't compile. I'm submitting this PR anyway so it can be reviewed and since it was mentioned in #14611.~~ EDIT: Tests now use `utf16_iter`.

This deprecates `.to_utf16`. `x.to_utf16()` should be replaced by either `x.utf16_iter().collect::<Vec<u16>>()` (the type annotation may be optional), or just `x.utf16_iter()` directly, if it can be used in an iterator context.

[breaking-change]

cc @huonw
@bors bors closed this Jun 30, 2014
@schmee
Copy link
Contributor Author

schmee commented Jul 1, 2014

Yay, finally!

@schmee schmee deleted the utf16-iterator branch July 1, 2014 06:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

StrAllocating.to_utf16 should be an Iterator<u16>
7 participants