Some stabilization and conventions changes to std::char #18603

brson · 2014-11-04T01:59:59Z

* Deprecate the free functions in favor of methods, except the two ctors `from_u32` and `from_digit`, whose methods are deprecated.
* Mark the `Char` and `UnicodeChar` traits experimental until we decide for sure that we won't have some sort of inherent methods for primitives.
* The `UnicodeChar` methods related to numerics are now called e.g. `is_numeric` to match the 'numeric' unicode character class, and the `*_digit_radix` methods on `Char` are now just called `*_digit`.
* `len_utf8_bytes` -> `len_utf8`
* Converted methods to take `self` by-value
* Converted `escape_default` and `escape_unicode` to iterators over chars.
* Renamed `is_XID_start`, `is_XID_continue` to `is_xid_start`, `is_xid_continue` to match conventions
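
For readers on a present-day toolchain, the renamed surface listed above can be sketched roughly as follows. This assumes current stable Rust, where these methods ended up as inherent methods on `char` rather than on the `Char` and `UnicodeChar` traits discussed here. The helper names (`ctors`, `classify`, `lengths`, `escapes`) are hypothetical and exist only for illustration.

```rust
/// The two constructors remain free functions in `std::char`.
fn ctors() -> (Option<char>, Option<char>) {
    (std::char::from_u32(0x41), std::char::from_digit(7, 10))
}

/// `is_numeric` follows the Unicode 'numeric' class;
/// `is_digit` takes a radix (formerly `is_digit_radix`).
fn classify(c: char) -> (bool, bool) {
    (c.is_numeric(), c.is_digit(10))
}

/// `len_utf8` (formerly `len_utf8_bytes`) and its pair `len_utf16`.
/// Both take `self` by value.
fn lengths(c: char) -> (usize, usize) {
    (c.len_utf8(), c.len_utf16())
}

/// The escape methods return iterators over chars,
/// so they can be collected into a `String`.
fn escapes(c: char) -> (String, String) {
    (c.escape_default().collect(), c.escape_unicode().collect())
}
```

Calling `classify('½')` shows the split between the two predicates: the character is numeric in the Unicode sense but is not a base-10 digit.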

This also converts encode_utf8 and encode_utf16 to return iterators. I suspect this is not the final form of these methods. Perf is worse (numbers in the commit). Many of the uses ended up being awkward, copying into a buffer then writing that buffer to a Writer. It might be more appropriate for these to return Readers instead, but that type is defined in std.

Note: although I did add the from_u32 ctor to the Char trait, I deprecated it again later, preferring the free ctors.

I've been sitting on this for a while.

cc @aturon

rust-highfive · 2014-11-04T02:00:04Z

Warning

These commits modify unsafe code. Please review it carefully!

huonw · 2014-11-04T02:07:00Z

src/libcore/char.rs

+    fn encode_utf8(self) -> Utf8CodeUnits {
+        let code = self as u32;
+        let (len, buf) = if code < MAX_ONE_B {
+            (1, [code as u8, 0, 0, 0])


I wonder if this would be faster by writing it 'backwards', e.g. this would be (3, [0, 0, 0, code as u8]), and the next one would be (2, [0, 0, (code >> 6u & 0x1F ...), (...)]).

The next function could then become:

self.buf.get(self.pos).map(|c| { self.pos += 1; c })

(Another possibility is to store these in reverse in the buffer and decrement the position instead, e.g. [(code & 0x3F_u32 ...), (code >> 6u ...), 0, 0] with next like if self.pos != 0 { self.pos -= 1; Some(self.buf[self.pos]) } else { None }. The one above has the slight advantage of doing exactly one bounds check.)
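
A minimal sketch of the first ("backwards", right-aligned) layout, written in present-day Rust syntax rather than the 2014 syntax quoted above. The struct name `Utf8CodeUnits` comes from the diff; the function name `encode_utf8_right_aligned` and the literal thresholds (standing in for constants such as `MAX_ONE_B`) are assumptions for illustration, not the code from this PR.

```rust
/// Iterator over the UTF-8 bytes of a single char.
/// The bytes sit at the end of `buf`; `pos` is the index of the next byte.
struct Utf8CodeUnits {
    buf: [u8; 4],
    pos: usize,
}

/// Hypothetical free function: encode `c` right-aligned in a 4-byte buffer,
/// so the start position is `4 - len`.
fn encode_utf8_right_aligned(c: char) -> Utf8CodeUnits {
    let code = c as u32;
    let (pos, buf) = if code < 0x80 {
        (3, [0, 0, 0, code as u8])
    } else if code < 0x800 {
        (2, [0, 0, (code >> 6) as u8 | 0xC0, (code & 0x3F) as u8 | 0x80])
    } else if code < 0x1_0000 {
        (1, [
            0,
            (code >> 12) as u8 | 0xE0,
            ((code >> 6) & 0x3F) as u8 | 0x80,
            (code & 0x3F) as u8 | 0x80,
        ])
    } else {
        (0, [
            (code >> 18) as u8 | 0xF0,
            ((code >> 12) & 0x3F) as u8 | 0x80,
            ((code >> 6) & 0x3F) as u8 | 0x80,
            (code & 0x3F) as u8 | 0x80,
        ])
    };
    Utf8CodeUnits { buf, pos }
}

impl Iterator for Utf8CodeUnits {
    type Item = u8;

    fn next(&mut self) -> Option<u8> {
        // A single checked lookup: `get` returns None once `pos` reaches 4.
        let byte = self.buf.get(self.pos).copied()?;
        self.pos += 1;
        Some(byte)
    }
}
```

Because the end of the buffer doubles as the end of the sequence, `next` needs no separate length field and no second comparison. Whether this is actually faster would still have to be measured.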

aturon · 2014-11-04T20:40:22Z

@brson I've looked over this and it looks great overall! I left a few minor comments.

I guess the main question is whether we're comfortable moving to an iterator for encoding despite the current perf loss.

The fact that the code using encode_blah seems to get worse is an important sign, I think, that iterators might not be the right choice. I would vote for leaving the encode_foo methods in their original form for now. We're hoping to move Reader and Writer into core soon, and perhaps we could revisit this question then. I think having a custom buffer-based API is also fine, though.

I do feel differently about the escape functions, though, which were just using internal iteration; I think we should keep your changes for those.
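
For comparison, a buffer-based encoding call might look like the sketch below. It assumes present-day stable Rust, where `char::encode_utf8` takes a caller-supplied buffer and `std::io::Write` plays the role of the `Writer` mentioned in this thread. The helper name `write_char` is hypothetical.

```rust
use std::io::{self, Write};

/// Encode `c` into a small stack buffer, then write the encoded bytes to `out`.
fn write_char<W: Write>(out: &mut W, c: char) -> io::Result<()> {
    let mut buf = [0u8; 4]; // 4 bytes is enough for any char in UTF-8
    let encoded = c.encode_utf8(&mut buf);
    out.write_all(encoded.as_bytes())
}
```

With this shape the caller owns the buffer and the bytes go to the writer in one call, instead of being pulled one at a time from an iterator and copied into an intermediate buffer first.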

brson · 2014-11-06T02:17:40Z

@aturon rebased to remove the encoding iterators. I think your points are addressed.

aturon · 2014-11-21T06:32:53Z

@brson ping

'Numeric' is the proper name of the unicode character class, and this frees up the word 'digit' for ascii use in libcore. Since I'm going to rename `Char::is_digit_radix` to `is_digit`, I am not leaving a deprecated method in place, because that would just cause name clashes, as both `Char` and `UnicodeChar` are in the prelude. [breaking-change]

huonw reviewed Nov 4, 2014

brson force-pushed the stdchar branch from 6075c87 to 009ae4a Compare November 6, 2014 02:13

brson force-pushed the stdchar branch from 14f018f to 8c93705 Compare November 11, 2014 19:10

brson added 16 commits November 21, 2014 12:49

core: Add from_u32 to the Char trait

41fb8f7

This is the only free function not part of the trait.

core: Mark Char trait experimental

070e691

char: Mark the MAX constant stable

ac2f379

core: Rename Char::is_digit_radix to is_digit

acb5fef

This fits the naming of `to_digit` and `from_digit`. Leave the old name deprecated.

core: Rename Char::len_utf8_bytes to Char::len_utf8

0150fa4

"bytes" is redundant. Deprecate the old.

core: Add Char::len_utf16

f6607a2

Missing method to pair with len_utf8.

core: Add stability attributes to char::from_digit and from_u32

4dd1724

For now we are preferring free functions for primitive ctors, so they are marked 'unstable' pending final decision. The methods on `Char` are 'deprecated'.

core: Deprecated remaining free functions in char

95c3f61

Prefer the methods.

core: Mark remaining Char methods unstable

b577e4c

The `Char` trait itself may go away in favor of primitive inherent methods. Still some questions about whether the preconditions are following the final error handling conventions.

Fix various deprecation warnings from char changes

5928f6c

core: Convert Char methods to by-val self

ca1820b

Methods on primitive Copy types generally should take `self`. [breaking-change]

core: Convert Char::escape_default, escape_unicode to iterators

aad2461

[breaking-change]

unicode: Convert UnicodeChar methods to by-value

d6ee804

Extension traits for primitive types should be by-value. [breaking-change]

unicode: Add stability attributes to u_char

76ddd2b

Free functions deprecated. UnicodeChar experimental pending final decisions about prelude.

unicode: Rename is_XID_start to is_xid_start, is_XID_continue to is_x…

f39c29d

…id_continue

brson added 3 commits November 21, 2014 13:18

unicode: Remove unused non_snake_case allows.

73622f8

core: Update docs for escape_unicode, escape_default

879af89

core: Convert a 'failure' to 'panic' in docs

75ffadf

brson force-pushed the stdchar branch from 8c93705 to 75ffadf Compare November 21, 2014 21:18

bors closed this Nov 22, 2014

bors merged commit 75ffadf into rust-lang:master Nov 22, 2014

aturon mentioned this pull request Dec 16, 2014

Stabilization metabug: 1.0-alpha #19260

Closed
