-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Some stabilization and conventions changes to std::char #18603
Conversation
fn encode_utf8(self) -> Utf8CodeUnits { | ||
let code = self as u32; | ||
let (len, buf) = if code < MAX_ONE_B { | ||
(1, [code as u8, 0, 0, 0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this would be faster by writing it 'backwards', e.g. this would be (3, [0, 0, 0, code as u8])
, and the next one would be (2, [0, 0, (code >> 6u & 0x1F ...), (...)])
.
The next
function could then become:
self.buf.get(self.pos).map(|c| { self.pos += 1; c })
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Another possibility is store these in reverse in the buffer and decrement position instead, e.g. [(code & 0x3F_u32 ...), (code >> 6u ...), 0, 0]
with next
like if self.pos != 0 { self.pos -= 1; Some(self.buf[self.pos]) } else { None }
. The one above has the slight advantage of doing exactly one bounds check.)
@brson I've looked over this and it looks great overall! I left a few minor comments. I guess the main question is whether we're comfortable moving to an iterator for encoding despite the current perf loss. The fact that the code using I do feel differently about the escape functions, though, which were just using internal iteration; I think we should keep your changes for those. |
@aturon rebased to remove the encoding iterators. I think your points are addressed. |
* Deprecate the free functions in favor of methods, except the two ctors `from_u32` and `from_digit`, whose methods are deprecated. * Mark the `Char` and `UnicodeChar` traits experimental until we decide for sure that we won't have some sort of inherent methods for primitives. * The `UnicodeChar` methods related to numerics are now called e.g. `is_numeric` to match the 'numeric' unicode character class, and the `*_digit_radix` methods on `Char` now just called `*_digit`. * `len_utf8_bytes` -> `len_utf8` * Converted methods to take self by-value * Converted `escape_default` and `escape_unicode` to iterators over chars. * Renamed `is_XID_start`, `is_XID_continue` to `is_xid_start`, `is_xid_continue` to match conventions This also converts `encode_utf8` and `encode_utf16` to return iterators. I suspect this is not the final form of these methods. Perf is worse (numbers in the commit). Many of the uses ended up being awkward, copying into a buffer then writing that buffer to a `Writer`. It might be more appropriate for these to return `Reader`s instead, but that type is defined in `std`. Note: although I *did* add the `from_u32` ctor to the `Char` trait, I deprecated it again later, preferring the free ctors. I've been sitting on this for a while. cc @aturon
@brson ping |
This is the only free function not part of the trait.
'Numeric' is the proper name of the unicode character class, and this frees up the word 'digit' for ascii use in libcore. Since I'm going to rename `Char::is_digit_radix` to `is_digit`, I am not leaving a deprecated method in place, because that would just cause name clashes, as both `Char` and `UnicodeChar` are in the prelude. [breaking-change]
This fits the naming of `to_digit` and `from_digit`. Leave the old name deprecated.
"bytes" is redundant. Deprecate the old.
Missing method to pair with len_utf8.
For now we are preferring free functions for primitive ctors, so they are marked 'unstable' pending final decision. The methods on `Char` are 'deprecated'.
Prefer the methods.
The `Char` trait itself may go away in favor of primitive inherent methods. Still some questions about whether the preconditions are following the final error handling conventions.
Methods on primitmive Copy types generally should take `self`. [breaking-change]
[breaking-change]
Extension traits for primitive types should be by-value. [breaking-change]
Free functions deprecated. UnicodeChar experimental pending final decisions about prelude.
* Deprecate the free functions in favor of methods, except the two ctors `from_u32` and `from_digit`, whose methods are deprecated. * Mark the `Char` and `UnicodeChar` traits experimental until we decide for sure that we won't have some sort of inherent methods for primitives. * The `UnicodeChar` methods related to numerics are now called e.g. `is_numeric` to match the 'numeric' unicode character class, and the `*_digit_radix` methods on `Char` now just called `*_digit`. * `len_utf8_bytes` -> `len_utf8` * Converted methods to take self by-value * Converted `escape_default` and `escape_unicode` to iterators over chars. * Renamed `is_XID_start`, `is_XID_continue` to `is_xid_start`, `is_xid_continue` to match conventions This also converts `encode_utf8` and `encode_utf16` to return iterators. I suspect this is not the final form of these methods. Perf is worse (numbers in the commit). Many of the uses ended up being awkward, copying into a buffer then writing that buffer to a `Writer`. It might be more appropriate for these to return `Reader`s instead, but that type is defined in `std`. Note: although I *did* add the `from_u32` ctor to the `Char` trait, I deprecated it again later, preferring the free ctors. I've been sitting on this for a while. cc @aturon
from_u32
andfrom_digit
, whose methods are deprecated.Char
andUnicodeChar
traits experimental until we decide for sure that we won't have some sort of inherent methods for primitives.UnicodeChar
methods related to numerics are now called e.g.is_numeric
to match the 'numeric' unicode character class, and the*_digit_radix
methods onChar
now just called*_digit
.len_utf8_bytes
->len_utf8
escape_default
andescape_unicode
to iterators over chars.is_XID_start
,is_XID_continue
tois_xid_start
,is_xid_continue
to match conventionsThis also converts
encode_utf8
andencode_utf16
to return iterators. I suspect this is not the final form of these methods. Perf is worse (numbers in the commit). Many of the uses ended up being awkward, copying into a buffer then writing that buffer to aWriter
. It might be more appropriate for these to returnReader
s instead, but that type is defined instd
.Note: although I did add the
from_u32
ctor to theChar
trait, I deprecated it again later, preferring the free ctors.I've been sitting on this for a while.
cc @aturon