Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some stabilization and conventions changes to std::char #18603

Merged
merged 19 commits into from
Nov 22, 2014

Conversation

brson
Copy link
Contributor

@brson brson commented Nov 4, 2014

  • Deprecate the free functions in favor of methods, except the two ctors from_u32 and from_digit, whose methods are deprecated.
  • Mark the Char and UnicodeChar traits experimental until we decide for sure that we won't have some sort of inherent methods for primitives.
  • The UnicodeChar methods related to numerics are now called e.g. is_numeric to match the 'numeric' unicode character class, and the *_digit_radix methods on Char now just called *_digit.
  • len_utf8_bytes -> len_utf8
  • Converted methods to take self by-value
  • Converted escape_default and escape_unicode to iterators over chars.
  • Renamed is_XID_start, is_XID_continue to is_xid_start, is_xid_continue to match conventions

This also converts encode_utf8 and encode_utf16 to return iterators. I suspect this is not the final form of these methods. Perf is worse (numbers in the commit). Many of the uses ended up being awkward, copying into a buffer then writing that buffer to a Writer. It might be more appropriate for these to return Readers instead, but that type is defined in std.

Note: although I did add the from_u32 ctor to the Char trait, I deprecated it again later, preferring the free ctors.

I've been sitting on this for a while.

cc @aturon

@rust-highfive
Copy link
Collaborator

warning Warning warning

  • These commits modify unsafe code. Please review it carefully!

fn encode_utf8(self) -> Utf8CodeUnits {
let code = self as u32;
let (len, buf) = if code < MAX_ONE_B {
(1, [code as u8, 0, 0, 0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this would be faster by writing it 'backwards', e.g. this would be (3, [0, 0, 0, code as u8]), and the next one would be (2, [0, 0, (code >> 6u & 0x1F ...), (...)]).

The next function could then become:

self.buf.get(self.pos).map(|c| { self.pos += 1; c })

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Another possibility is store these in reverse in the buffer and decrement position instead, e.g. [(code & 0x3F_u32 ...), (code >> 6u ...), 0, 0] with next like if self.pos != 0 { self.pos -= 1; Some(self.buf[self.pos]) } else { None }. The one above has the slight advantage of doing exactly one bounds check.)

@aturon
Copy link
Member

aturon commented Nov 4, 2014

@brson I've looked over this and it looks great overall! I left a few minor comments.

I guess the main question is whether we're comfortable moving to an iterator for encoding despite the current perf loss.

The fact that the code using encode_blah seems to get worse is an important sign, I think, that iterators might not be the right choice. I would be vote for leaving the encode_foo methods in their original form for now. We're hoping to move Reader and Writer into core soon, and perhaps we could revisit this question then. I think having a custom buffer-based API is also fine, though.

I do feel differently about the escape functions, though, which were just using internal iteration; I think we should keep your changes for those.

@brson
Copy link
Contributor Author

brson commented Nov 6, 2014

@aturon rebased to remove the encoding iterators. I think your points are addressed.

bors added a commit that referenced this pull request Nov 11, 2014
* Deprecate the free functions in favor of methods, except the two ctors `from_u32` and `from_digit`, whose methods are deprecated.
* Mark the `Char` and `UnicodeChar` traits experimental until we decide for sure that we won't have some sort of inherent methods for primitives.
* The `UnicodeChar` methods related to numerics are now called e.g. `is_numeric` to match the 'numeric' unicode character class, and the `*_digit_radix` methods on `Char` now just called `*_digit`.
* `len_utf8_bytes` -> `len_utf8`
* Converted methods to take self by-value
* Converted `escape_default` and `escape_unicode` to iterators over chars.
* Renamed `is_XID_start`, `is_XID_continue` to `is_xid_start`, `is_xid_continue` to match conventions

This also converts `encode_utf8` and `encode_utf16` to return iterators. I suspect this is not the final form of these methods. Perf is worse (numbers in the commit). Many of the uses ended up being awkward, copying into a buffer then writing that buffer to a `Writer`. It might be more appropriate for these to return `Reader`s instead, but that type is defined in `std`.

Note: although I *did* add the `from_u32` ctor to the `Char` trait, I deprecated it again later, preferring the free ctors.

I've been sitting on this for a while.

cc @aturon
@aturon
Copy link
Member

aturon commented Nov 21, 2014

@brson ping

brson added 16 commits November 21, 2014 12:49
This is the only free function not part of the trait.
'Numeric' is the proper name of the unicode character class,
and this frees up the word 'digit' for ascii use in libcore.

Since I'm going to rename `Char::is_digit_radix` to
`is_digit`, I am not leaving a deprecated method in place,
because that would just cause name clashes, as both
`Char` and `UnicodeChar` are in the prelude.

[breaking-change]
This fits the naming of `to_digit` and `from_digit`. Leave
the old name deprecated.
"bytes" is redundant.

Deprecate the old.
Missing method to pair with len_utf8.
For now we are preferring free functions for primitive ctors,
so they are marked 'unstable' pending final decision. The
methods on `Char` are 'deprecated'.
The `Char` trait itself may go away in favor of primitive inherent
methods. Still some questions about whether the preconditions are
following the final error handling conventions.
Methods on primitmive Copy types generally should take `self`.

[breaking-change]
Extension traits for primitive types should be by-value.

[breaking-change]
Free functions deprecated. UnicodeChar experimental pending
final decisions about prelude.
bors added a commit that referenced this pull request Nov 21, 2014
* Deprecate the free functions in favor of methods, except the two ctors `from_u32` and `from_digit`, whose methods are deprecated.
* Mark the `Char` and `UnicodeChar` traits experimental until we decide for sure that we won't have some sort of inherent methods for primitives.
* The `UnicodeChar` methods related to numerics are now called e.g. `is_numeric` to match the 'numeric' unicode character class, and the `*_digit_radix` methods on `Char` now just called `*_digit`.
* `len_utf8_bytes` -> `len_utf8`
* Converted methods to take self by-value
* Converted `escape_default` and `escape_unicode` to iterators over chars.
* Renamed `is_XID_start`, `is_XID_continue` to `is_xid_start`, `is_xid_continue` to match conventions

This also converts `encode_utf8` and `encode_utf16` to return iterators. I suspect this is not the final form of these methods. Perf is worse (numbers in the commit). Many of the uses ended up being awkward, copying into a buffer then writing that buffer to a `Writer`. It might be more appropriate for these to return `Reader`s instead, but that type is defined in `std`.

Note: although I *did* add the `from_u32` ctor to the `Char` trait, I deprecated it again later, preferring the free ctors.

I've been sitting on this for a while.

cc @aturon
@bors bors closed this Nov 22, 2014
@bors bors merged commit 75ffadf into rust-lang:master Nov 22, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants