-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tracking issue for concrete errors in the standard library #27734
Comments
Nominating for 1.5 FCP discussion. |
The /// Error returned from `String::from`
#[unstable(feature = "str_parse_error", reason = "may want to be replaced with \
Void if it ever exists",
issue = "27734")]
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
pub struct ParseError(());
#[stable(feature = "rust1", since = "1.0.0")]
impl FromStr for String {
type Err = ParseError;
#[inline]
fn from_str(s: &str) -> Result<String, ParseError> {
Ok(String::from(s))
}
} This does indeed seem to be a good candidate for a void type. However, it is actually a unit type. Was that intended? For |
The currently documented semantics are not very helpful:
I’d like to know precisely. Taking a step back, I’d like to reproduce the functionality of |
Never mind. For the use case I’m thinking of I also want to read the input bytes in chunks, and support decoding a UTF-8 byte sequence for a single code point split across chunks. I’ll probably build it in an external library and then propose for eventual inclusion in libcore, where it could (I hope) share code with Still, the "not necessarily at it precisely" part of Could |
This issue is now entering its cycle-long FCP for stabilization in 1.5 @SimonSapin yeah one thing we're particularly interested in are what exact guarantees we can provide in from the Curious to hear your thoughts! |
It can never be instantiated, so signify this by having it actually be an empty `enum`. cc rust-lang#27734
I think that It should be backward-compatible to add more methods (and private fields) later to provide more info about the error and how to decode the rest of the input. But I’m not 100% sure yet what that should look like. For example, if the input ends with 1~3 bytes that look like the beginning of a valid but incomplete UTF-8 sequence, some users might want re-assemble that code point if more input is coming later. |
Ok cool, thanks @SimonSapin! I'd be totally down for adding more information later if we can! |
It can never be instantiated, so signify this by having it actually be an empty `enum`. cc #27734
Alright, this: https://github.com/SimonSapin/rust-utf8/blob/d5192ce94e/lib.rs#L72-L300 has all the UTF-8 decoding functionality I’d like to have while being as flexible as possible in terms of how the data flows. The downside is that it’s quite a bit of API surface, and users need some amount of boilerplate code. ( Trying to cram all that info in
The |
Wow, thanks for doing that @SimonSapin! It does seem like that level of detail may want to belong outside of std rather than in, but I'm curious if we could perhaps contain at least some of the information? Would there be a way that some UTF-8 decoding would be necessary downstream but we could allow doing the "easy thing" of replacing with the replacement character on the fly? For example with |
Yeah, since this crate duplicates most of its logic with code in libcore and libcollections I felt that moving into core would help reduce duplication, and I like the idea of re-implementing "higher level" existing functions on top of it. But given the complexity of the API you’re probably right that this is better in an external library.
I don’t understand this part, sorry. Extrapolating a bit: yes, there are many ways the API can be simplified if you make some assumptions about how its used. For example, if you assume lossy decoding, But the whole idea here is to make it as general as possible with no assumption. Then more specialized things like
|
Aha! The insight of returning a string for the replacement character was what I was missing, that seems like a clever way to handle what I was thinking of! |
It might make sense to extend impl Utf8Error {
/// Explain what’s going on after `self.valid_up_to()`
fn details() -> Utf8ErrorDetails { … }
}
enum Utf8ErrorDetails {
/// There was an invalid byte sequence.
///
/// In lossy decoding, it should be replaced by a single U+FFFD replacement character.
/// The given index indicates where in the input to continue decoding.
/// It is always 1, 2, or 3 more than `Utf8Error::valid_up_to()`.
Invalid { continue_at_index: usize },
/// The input (after `Utf8Error::valid_up_to()`) ends with a sequence
/// that may or may not represent a valid code point
/// if more subsequent input will be available.
/// The given length (always 1, 2, or 3) is the maximum number of additional bytes
/// needed to determine whether it is indeed valid.
IncompleteSuffix { extra_bytes_needed: usize },
} I think this is enough to build upon and do everything my library does, but it’s not convenient: when reading incrementally and getting |
Probably |
I made a nicer API, with a stateful decoder returning iterators: https://github.com/SimonSapin/rust-utf8/tree/ba1fc675a5 Here is the public API: /// A stateful, incremental, zero-copy UTF-8 decoder with error handling.
///
/// Example: a simplified version of `String::from_utf8_lossy` can be written as:
///
/// ```rust
/// fn string_from_utf8_lossy(input: &[u8]) -> String {
/// let mut string = String::new();
/// let mut decoder = utf8::Decoder::new();
/// for piece in decoder.feed(input) {
/// string.push_str(&piece)
/// }
/// if let Some(piece) = decoder.end() {
/// string.push_str(&piece)
/// }
/// string
/// }
/// ```
pub struct Decoder { … }
impl Decoder {
/// Create a new decoder.
pub fn new() -> Decoder { … }
/// Feed some input bytes to the decoder.
/// Returns an iterator of decoded pieces which dereference to `&str`.
///
/// If the input ends with an incomplete but potentially valid UTF-8 sequence,
/// record that partial sequence for use in the next `feed()` call.
///
/// Panics if the iterator returned by an earlier `feed()` call was not exhausted.
pub fn feed<'d, 'i>(&'d mut self, input_chunk: &'i [u8]) -> ChunkDecoder<'d, 'i> { … }
/// Consume the decoder and indicate the end of the input.
/// If the previous input chunk given to `feed()` ended with an incomplete UTF-8 sequence,
/// this returns `Some(DecodedPiece::Error)`
/// which dereference to a `"\u{FFFD}"` replacement character.
///
/// Panics if the iterator returned by an earlier `feed()` call was not exhausted.
pub fn end(self) -> Option<DecodedPiece<'static>> { … }
}
/// An iterator decoding one chunk of input.
pub struct ChunkDecoder<'d, 'i> { … }
impl<'d, 'i> ChunkDecoder<'d, 'i> {
/// Return whether `next()` would return `None`.
pub fn eof(&self) -> bool { … }
}
impl<'d, 'i> Iterator for ChunkDecoder<'d, 'i> {
type Item = DecodedPiece<'i>;
fn next(&mut self) -> Option<Self::Item> { … }
}
/// One piece of the incremental UTF-8 decoding.
///
/// Dereferences to `&str`,
/// replacing each error with a single `"\u{FFFD}"` replacement character.
pub enum DecodedPiece<'a> {
/// A string slice of consecutive well-formed UTF-8 input
InputSlice(&'a str),
/// A single `char` that was split across more than one input chunks.
AcrossChunks(StrChar),
/// Represents one decoding error.
/// Dereferences to a single `"\u{FFFD}"` replacement character for lossy decoding.
Error,
}
impl<'a> Deref for DecodedPiece<'a> {
type Target = str;
fn deref(&self) -> &str { … }
}
/// Like `char`, but represented in memory as UTF-8 and dereferences to `&str`.
#[derive(Copy, Clone)]
pub struct StrChar { … }
impl Deref for StrChar {
type Target = str;
fn deref(&self) -> &str { … }
} @alexcrichton, Is this closer to something you might want to include? |
Some micro-benchmarks:
If I cheat and overallocate by 2x:
( |
Oh wow, thanks for the continued exploration here! I suspect an API like |
On the subject of It's probably outside the scope of this issue, but it might be useful to have a similar, standard |
The libs team discussed this during triage today and the decision was to stabilize. |
@DanielKeep see here, your suggestion is in the "alternatives" section. @alexcrichton Surely you're not planning to stabilise |
This commit stabilizes and deprecates library APIs whose FCP has closed in the last cycle, specifically: Stabilized APIs: * `fs::canonicalize` * `Path::{metadata, symlink_metadata, canonicalize, read_link, read_dir, exists, is_file, is_dir}` - all moved to inherent methods from the `PathExt` trait. * `Formatter::fill` * `Formatter::width` * `Formatter::precision` * `Formatter::sign_plus` * `Formatter::sign_minus` * `Formatter::alternate` * `Formatter::sign_aware_zero_pad` * `string::ParseError` * `Utf8Error::valid_up_to` * `Iterator::{cmp, partial_cmp, eq, ne, lt, le, gt, ge}` * `<[T]>::split_{first,last}{,_mut}` * `Condvar::wait_timeout` - note that `wait_timeout_ms` is not yet deprecated but will be once 1.5 is released. * `str::{R,}MatchIndices` * `str::{r,}match_indices` * `char::from_u32_unchecked` * `VecDeque::insert` * `VecDeque::shrink_to_fit` * `VecDeque::as_slices` * `VecDeque::as_mut_slices` * `VecDeque::swap_remove_front` - (renamed from `swap_front_remove`) * `VecDeque::swap_remove_back` - (renamed from `swap_back_remove`) * `Vec::resize` * `str::slice_mut_unchecked` * `FileTypeExt` * `FileTypeExt::{is_block_device, is_char_device, is_fifo, is_socket}` * `BinaryHeap::from` - `from_vec` deprecated in favor of this * `BinaryHeap::into_vec` - plus a `Into` impl * `BinaryHeap::into_sorted_vec` Deprecated APIs * `slice::ref_slice` * `slice::mut_ref_slice` * `iter::{range_inclusive, RangeInclusive}` * `std::dynamic_lib` Closes rust-lang#27706 Closes rust-lang#27725 cc rust-lang#27726 (align not stabilized yet) Closes rust-lang#27734 Closes rust-lang#27737 Closes rust-lang#27742 Closes rust-lang#27743 Closes rust-lang#27772 Closes rust-lang#27774 Closes rust-lang#27777 Closes rust-lang#27781 cc rust-lang#27788 (a few remaining methods though) Closes rust-lang#27790 Closes rust-lang#27793 Closes rust-lang#27796 Closes rust-lang#27810 cc rust-lang#28147 (not all parts stabilized)
This commit stabilizes and deprecates library APIs whose FCP has closed in the last cycle, specifically: Stabilized APIs: * `fs::canonicalize` * `Path::{metadata, symlink_metadata, canonicalize, read_link, read_dir, exists, is_file, is_dir}` - all moved to inherent methods from the `PathExt` trait. * `Formatter::fill` * `Formatter::width` * `Formatter::precision` * `Formatter::sign_plus` * `Formatter::sign_minus` * `Formatter::alternate` * `Formatter::sign_aware_zero_pad` * `string::ParseError` * `Utf8Error::valid_up_to` * `Iterator::{cmp, partial_cmp, eq, ne, lt, le, gt, ge}` * `<[T]>::split_{first,last}{,_mut}` * `Condvar::wait_timeout` - note that `wait_timeout_ms` is not yet deprecated but will be once 1.5 is released. * `str::{R,}MatchIndices` * `str::{r,}match_indices` * `char::from_u32_unchecked` * `VecDeque::insert` * `VecDeque::shrink_to_fit` * `VecDeque::as_slices` * `VecDeque::as_mut_slices` * `VecDeque::swap_remove_front` - (renamed from `swap_front_remove`) * `VecDeque::swap_remove_back` - (renamed from `swap_back_remove`) * `Vec::resize` * `str::slice_mut_unchecked` * `FileTypeExt` * `FileTypeExt::{is_block_device, is_char_device, is_fifo, is_socket}` * `BinaryHeap::from` - `from_vec` deprecated in favor of this * `BinaryHeap::into_vec` - plus a `Into` impl * `BinaryHeap::into_sorted_vec` Deprecated APIs * `slice::ref_slice` * `slice::mut_ref_slice` * `iter::{range_inclusive, RangeInclusive}` * `std::dynamic_lib` Closes #27706 Closes #27725 cc #27726 (align not stabilized yet) Closes #27734 Closes #27737 Closes #27742 Closes #27743 Closes #27772 Closes #27774 Closes #27777 Closes #27781 cc #27788 (a few remaining methods though) Closes #27790 Closes #27793 Closes #27796 Closes #27810 cc #28147 (not all parts stabilized)
@DanielKeep yeah we were thinking that we may want an official @canndrew nah we stabilized |
Ah good, I was worried there.
Of course then there'll be two incompatible ways of writing a diverging function: |
This is a tracking issue for the unstable
str_parse_error
andutf8_error
features in the standard library. Some various small error types popped up onceFromStr
had an associated error type and haven't been considered too closely for stabilization.Some thoughts are:
The text was updated successfully, but these errors were encountered: