Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking issue for concrete errors in the standard library #27734

Closed
alexcrichton opened this issue Aug 12, 2015 · 21 comments · Fixed by #29254
Closed

Tracking issue for concrete errors in the standard library #27734

alexcrichton opened this issue Aug 12, 2015 · 21 comments · Fixed by #29254
Labels
B-unstable Blocker: Implemented in the nightly compiler and unstable. final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Milestone

Comments

@alexcrichton
Copy link
Member

This is a tracking issue for the unstable str_parse_error and utf8_error features in the standard library. Some various small error types popped up once FromStr had an associated error type and haven't been considered too closely for stabilization.

Some thoughts are:

  • Are the names of these types appropriate?
  • Are we covered in terms of future extensibility?
  • Are the relevant portions of the error exposed well?
@alexcrichton alexcrichton added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. B-unstable Blocker: Implemented in the nightly compiler and unstable. labels Aug 12, 2015
@aturon
Copy link
Member

aturon commented Sep 23, 2015

Nominating for 1.5 FCP discussion.

@BurntSushi
Copy link
Member

The str_parse_error feature seems to be for this code:

/// Error returned from `String::from`
#[unstable(feature = "str_parse_error", reason = "may want to be replaced with \
                                                  Void if it ever exists",
           issue = "27734")]
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
pub struct ParseError(());

#[stable(feature = "rust1", since = "1.0.0")]
impl FromStr for String {
    type Err = ParseError;
    #[inline]
    fn from_str(s: &str) -> Result<String, ParseError> {
        Ok(String::from(s))
    }
}

This does indeed seem to be a good candidate for a void type. However, it is actually a unit type. Was that intended?

For utf8_error, it looks like it's just the valid_up_to method, which seems pretty reasonable to me.

@SimonSapin
Copy link
Contributor

For utf8_error, it looks like it's just the valid_up_to method, which seems pretty reasonable to me.

The currently documented semantics are not very helpful:

Starting at the index provided, but not necessarily at it precisely, an invalid UTF-8 encoding sequence was found.

I’d like to know precisely.

Taking a step back, I’d like to reproduce the functionality of String::from_utf8_lossy, but without allocating memory / copying data. Emitting a stream of &str slices and '\u{FFFD}' replacement characters, but have the correct number of replacement characters. Could enough information be added on Utf8Error to make this doable by calling str::from_utf8 repeatedly? Or should this be a new API entirely?

@SimonSapin
Copy link
Contributor

Never mind. For the use case I’m thinking of I also want to read the input bytes in chunks, and support decoding a UTF-8 byte sequence for a single code point split across chunks. I’ll probably build it in an external library and then propose for eventual inclusion in libcore, where it could (I hope) share code with str::from_utf8.

Still, the "not necessarily at it precisely" part of valid_up_to is not very satisfying. Perhaps it could be documented as "the maximum index such that str::from_utf8(input_bytes[..index]) returns Some(_)" ? If I’m reading the from_utf8 code correctly, that’s what it is already.

Could Utf8Error even have a method that returns that prefix &str slice, rather than just an index? Or would adding the lifetime parameter be a breaking change to stable API?

@alexcrichton
Copy link
Member Author

This issue is now entering its cycle-long FCP for stabilization in 1.5


@SimonSapin yeah one thing we're particularly interested in are what exact guarantees we can provide in from the Utf8Error type, especially around the indexes returned. I believe that adding a lifetime parameter would be backwards-incompatible now, so that's probably out of the question. Beyond that though, it sounds like you think the wording around the documentation should be tweaked, and perhaps another piece of information could be added along the lines of "this is the byte where something bad happened"? Or perhaps also a "skip to this position to keep decoding" kinda position?

Curious to hear your thoughts!

@alexcrichton alexcrichton added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed I-nominated labels Sep 24, 2015
alexcrichton added a commit to alexcrichton/rust that referenced this issue Sep 24, 2015
It can never be instantiated, so signify this by having it actually be an empty
`enum`.

cc rust-lang#27734
@SimonSapin
Copy link
Contributor

I think that Utf8Error::valid_up_to can be stabilized with #28637

It should be backward-compatible to add more methods (and private fields) later to provide more info about the error and how to decode the rest of the input. But I’m not 100% sure yet what that should look like. For example, if the input ends with 1~3 bytes that look like the beginning of a valid but incomplete UTF-8 sequence, some users might want re-assemble that code point if more input is coming later.

@alexcrichton
Copy link
Member Author

Ok cool, thanks @SimonSapin! I'd be totally down for adding more information later if we can!

bors added a commit that referenced this issue Sep 26, 2015
It can never be instantiated, so signify this by having it actually be an empty
`enum`.

cc #27734
@SimonSapin
Copy link
Contributor

Alright, this: https://github.com/SimonSapin/rust-utf8/blob/d5192ce94e/lib.rs#L72-L300 has all the UTF-8 decoding functionality I’d like to have while being as flexible as possible in terms of how the data flows. The downside is that it’s quite a bit of API surface, and users need some amount of boilerplate code. (StrChar in particular is a micro-optimization. char could be used instead, but it’d have to be converted from UTF-8 to UTF-32 only (most likely) to be converted back to UTF-8 again.)

Trying to cram all that info in Utf8Error doesn’t sound nice. The current API has lots of surface, but I haven’t managed to reduce it while maintaining flexibility. (E.g. not require that all input byte slices have the same lifetime.) Any suggestion?

PushLossyDecoder above in the same file is slightly more convenient API where the user “pushes” bytes into the decoder which in turns pushes &str slices somewhere else. Compare with the data flow of iterator adaptors where the users “pulls” data from Iterator::next which in turn pulls from another iterator.

The tests directory also has re-implementations of std::str::from_utf8 and String::from_utf8_lossy.

@alexcrichton
Copy link
Member Author

Wow, thanks for doing that @SimonSapin! It does seem like that level of detail may want to belong outside of std rather than in, but I'm curious if we could perhaps contain at least some of the information? Would there be a way that some UTF-8 decoding would be necessary downstream but we could allow doing the "easy thing" of replacing with the replacement character on the fly? For example with DecodeStepStatus::Error is there a way to know where to start next after emitting a replacement character?

@SimonSapin
Copy link
Contributor

Yeah, since this crate duplicates most of its logic with code in libcore and libcollections I felt that moving into core would help reduce duplication, and I like the idea of re-implementing "higher level" existing functions on top of it. But given the complexity of the API you’re probably right that this is better in an external library.

Would there be a way that some UTF-8 decoding would be necessary downstream but we could allow doing the "easy thing" of replacing with the replacement character on the fly?

I don’t understand this part, sorry.

Extrapolating a bit: yes, there are many ways the API can be simplified if you make some assumptions about how its used. For example, if you assume lossy decoding, "\u{FFFD}" (a &'static str coerced to &'a str) can be returned the same as a "normal" decoded str slice, so errors don’t need to exist in the API at all.

But the whole idea here is to make it as general as possible with no assumption. Then more specialized things like PushLossyDecoder can be built on top.

For example with DecodeStepStatus::Error is there a way to know where to start next after emitting a replacement character?

DecodeStepStatus::Error has a remaining_input_after_error field for exactly this :) Use that as the parameter to decode_step in the next iteration of a loop.

@alexcrichton
Copy link
Member Author

Aha! The insight of returning a string for the replacement character was what I was missing, that seems like a clever way to handle what I was thinking of!

@steveklabnik steveklabnik added this to the 1.5 milestone Oct 1, 2015
@SimonSapin
Copy link
Contributor

It might make sense to extend Utf8Error like this: (names and docs to be bikeshedded)

impl Utf8Error {
    /// Explain what’s going on after `self.valid_up_to()`
    fn details() -> Utf8ErrorDetails {}
}

enum Utf8ErrorDetails {
    /// There was an invalid byte sequence.
    ///
    /// In lossy decoding, it should be replaced by a single U+FFFD replacement character.
    /// The given index indicates where in the input to continue decoding.
    /// It is always 1, 2, or 3 more than `Utf8Error::valid_up_to()`.
    Invalid { continue_at_index: usize },

    /// The input (after `Utf8Error::valid_up_to()`) ends with a sequence
    /// that may or may not represent a valid code point
    /// if more subsequent input will be available.
    /// The given length (always 1, 2, or 3) is the maximum number of additional bytes
    /// needed to determine whether it is indeed valid.
    IncompleteSuffix { extra_bytes_needed: usize },
}

I think this is enough to build upon and do everything my library does, but it’s not convenient: when reading incrementally and getting IncompleteSuffix, you have to concatenate chunk_n[valid_up_to..] and chunk_n_plus_1[..extra_bytes_needed] (or add more chunks if chunk_n_plus_1 is not long enough), call str::from_utf8 on these 2 to 4 bytes, and deal with the case when that returns Utf8ErrorDetails::Invalid.

@SimonSapin
Copy link
Contributor

Probably extra_bytes_needed can even be dropped, just use a [u8; 4] buffer for dealing with sequences split across chunks that might be a valid code point.

@SimonSapin
Copy link
Contributor

I made a nicer API, with a stateful decoder returning iterators: https://github.com/SimonSapin/rust-utf8/tree/ba1fc675a5

Here is the public API:

/// A stateful, incremental, zero-copy UTF-8 decoder with error handling.
///
/// Example: a simplified version of `String::from_utf8_lossy` can be written as:
///
/// ```rust
/// fn string_from_utf8_lossy(input: &[u8]) -> String {
///     let mut string = String::new();
///     let mut decoder = utf8::Decoder::new();
///     for piece in decoder.feed(input) {
///         string.push_str(&piece)
///     }
///     if let Some(piece) = decoder.end() {
///         string.push_str(&piece)
///     }
///     string
/// }
/// ```
pub struct Decoder {}

impl Decoder {
    /// Create a new decoder.
    pub fn new() -> Decoder {}

    /// Feed some input bytes to the decoder.
    /// Returns an iterator of decoded pieces which dereference to `&str`.
    ///
    /// If the input ends with an incomplete but potentially valid UTF-8 sequence,
    /// record that partial sequence for use in the next `feed()` call.
    ///
    /// Panics if the iterator returned by an earlier `feed()` call was not exhausted.
    pub fn feed<'d, 'i>(&'d mut self, input_chunk: &'i [u8]) -> ChunkDecoder<'d, 'i> {}

    /// Consume the decoder and indicate the end of the input.
    /// If the previous input chunk given to `feed()` ended with an incomplete UTF-8 sequence,
    /// this returns `Some(DecodedPiece::Error)`
    /// which dereference to a `"\u{FFFD}"` replacement character.
    ///
    /// Panics if the iterator returned by an earlier `feed()` call was not exhausted.
    pub fn end(self) -> Option<DecodedPiece<'static>> {}
}

/// An iterator decoding one chunk of input.
pub struct ChunkDecoder<'d, 'i> {}

impl<'d, 'i> ChunkDecoder<'d, 'i> {
    /// Return whether `next()` would return `None`.
    pub fn eof(&self) -> bool {}
}

impl<'d, 'i> Iterator for ChunkDecoder<'d, 'i> {
    type Item = DecodedPiece<'i>;
    fn next(&mut self) -> Option<Self::Item> {}
}

/// One piece of the incremental UTF-8 decoding.
///
/// Dereferences to `&str`,
/// replacing each error with a single `"\u{FFFD}"` replacement character.
pub enum DecodedPiece<'a> {
    /// A string slice of consecutive well-formed UTF-8 input
    InputSlice(&'a str),

    /// A single `char` that was split across more than one input chunks.
    AcrossChunks(StrChar),

    /// Represents one decoding error.
    /// Dereferences to a single `"\u{FFFD}"` replacement character for lossy decoding.
    Error,
}

impl<'a> Deref for DecodedPiece<'a> {
    type Target = str;
    fn deref(&self) -> &str {}
}

/// Like `char`, but represented in memory as UTF-8 and dereferences to `&str`.
#[derive(Copy, Clone)]
pub struct StrChar {}

impl Deref for StrChar {
    type Target = str;
    fn deref(&self) -> &str {}
}

@alexcrichton, Is this closer to something you might want to include?

@SimonSapin
Copy link
Contributor

Some micro-benchmarks:

test bench_our_string_from_utf8_lossy ... bench:       1,065 ns/iter (+/- 52) = 138 MB/s
test bench_std_string_from_utf8_lossy ... bench:         770 ns/iter (+/- 17) = 190 MB/s

If I cheat and overallocate by 2x:

test bench_our_string_from_utf8_lossy ... bench:         848 ns/iter (+/- 45) = 173 MB/s
test bench_std_string_from_utf8_lossy ... bench:         776 ns/iter (+/- 36) = 189 MB/s

("\u{FFFD}" is 3 bytes, sometimes longer than the error it replaces.)

@alexcrichton
Copy link
Member Author

Oh wow, thanks for the continued exploration here! I suspect an API like Decoder may wish to live outside the standard library for now, but I'd be fine extending Utf8Error with the details method like you mentioned above.

@DanielKeep
Copy link
Contributor

On the subject of <String as FromStr>::ParseError potentially being a void type, I've found some value in doing that in conv. Specifically, there's a single enum NoError {} type that implements Error. This is leveraged by an UnwrapOk extension trait implemented for<T> Result<T, NoError> to provide an unwrap_ok method that is basically unwrap, but cannot panic. As such, it doesn't need landing pads and specifically indicates that the unwrap in this case is effectively a no-op (and not a code smell like a regular unwrap call).

It's probably outside the scope of this issue, but it might be useful to have a similar, standard NoError type in Rust.

@alexcrichton
Copy link
Member Author

The libs team discussed this during triage today and the decision was to stabilize.

@canndrew
Copy link
Contributor

@DanielKeep see here, your suggestion is in the "alternatives" section.

@alexcrichton Surely you're not planning to stabilise <String as FromStr>::ParseError as () right?

alexcrichton added a commit to alexcrichton/rust that referenced this issue Oct 25, 2015
This commit stabilizes and deprecates library APIs whose FCP has closed in the
last cycle, specifically:

Stabilized APIs:

* `fs::canonicalize`
* `Path::{metadata, symlink_metadata, canonicalize, read_link, read_dir, exists,
   is_file, is_dir}` - all moved to inherent methods from the `PathExt` trait.
* `Formatter::fill`
* `Formatter::width`
* `Formatter::precision`
* `Formatter::sign_plus`
* `Formatter::sign_minus`
* `Formatter::alternate`
* `Formatter::sign_aware_zero_pad`
* `string::ParseError`
* `Utf8Error::valid_up_to`
* `Iterator::{cmp, partial_cmp, eq, ne, lt, le, gt, ge}`
* `<[T]>::split_{first,last}{,_mut}`
* `Condvar::wait_timeout` - note that `wait_timeout_ms` is not yet deprecated
  but will be once 1.5 is released.
* `str::{R,}MatchIndices`
* `str::{r,}match_indices`
* `char::from_u32_unchecked`
* `VecDeque::insert`
* `VecDeque::shrink_to_fit`
* `VecDeque::as_slices`
* `VecDeque::as_mut_slices`
* `VecDeque::swap_remove_front` - (renamed from `swap_front_remove`)
* `VecDeque::swap_remove_back` - (renamed from `swap_back_remove`)
* `Vec::resize`
* `str::slice_mut_unchecked`
* `FileTypeExt`
* `FileTypeExt::{is_block_device, is_char_device, is_fifo, is_socket}`
* `BinaryHeap::from` - `from_vec` deprecated in favor of this
* `BinaryHeap::into_vec` - plus a `Into` impl
* `BinaryHeap::into_sorted_vec`

Deprecated APIs

* `slice::ref_slice`
* `slice::mut_ref_slice`
* `iter::{range_inclusive, RangeInclusive}`
* `std::dynamic_lib`

Closes rust-lang#27706
Closes rust-lang#27725
cc rust-lang#27726 (align not stabilized yet)
Closes rust-lang#27734
Closes rust-lang#27737
Closes rust-lang#27742
Closes rust-lang#27743
Closes rust-lang#27772
Closes rust-lang#27774
Closes rust-lang#27777
Closes rust-lang#27781
cc rust-lang#27788 (a few remaining methods though)
Closes rust-lang#27790
Closes rust-lang#27793
Closes rust-lang#27796
Closes rust-lang#27810
cc rust-lang#28147 (not all parts stabilized)
bors added a commit that referenced this issue Oct 25, 2015
This commit stabilizes and deprecates library APIs whose FCP has closed in the
last cycle, specifically:

Stabilized APIs:

* `fs::canonicalize`
* `Path::{metadata, symlink_metadata, canonicalize, read_link, read_dir, exists,
   is_file, is_dir}` - all moved to inherent methods from the `PathExt` trait.
* `Formatter::fill`
* `Formatter::width`
* `Formatter::precision`
* `Formatter::sign_plus`
* `Formatter::sign_minus`
* `Formatter::alternate`
* `Formatter::sign_aware_zero_pad`
* `string::ParseError`
* `Utf8Error::valid_up_to`
* `Iterator::{cmp, partial_cmp, eq, ne, lt, le, gt, ge}`
* `<[T]>::split_{first,last}{,_mut}`
* `Condvar::wait_timeout` - note that `wait_timeout_ms` is not yet deprecated
  but will be once 1.5 is released.
* `str::{R,}MatchIndices`
* `str::{r,}match_indices`
* `char::from_u32_unchecked`
* `VecDeque::insert`
* `VecDeque::shrink_to_fit`
* `VecDeque::as_slices`
* `VecDeque::as_mut_slices`
* `VecDeque::swap_remove_front` - (renamed from `swap_front_remove`)
* `VecDeque::swap_remove_back` - (renamed from `swap_back_remove`)
* `Vec::resize`
* `str::slice_mut_unchecked`
* `FileTypeExt`
* `FileTypeExt::{is_block_device, is_char_device, is_fifo, is_socket}`
* `BinaryHeap::from` - `from_vec` deprecated in favor of this
* `BinaryHeap::into_vec` - plus a `Into` impl
* `BinaryHeap::into_sorted_vec`

Deprecated APIs

* `slice::ref_slice`
* `slice::mut_ref_slice`
* `iter::{range_inclusive, RangeInclusive}`
* `std::dynamic_lib`

Closes #27706
Closes #27725
cc #27726 (align not stabilized yet)
Closes #27734
Closes #27737
Closes #27742
Closes #27743
Closes #27772
Closes #27774
Closes #27777
Closes #27781
cc #27788 (a few remaining methods though)
Closes #27790
Closes #27793
Closes #27796
Closes #27810
cc #28147 (not all parts stabilized)
@alexcrichton
Copy link
Member Author

@DanielKeep yeah we were thinking that we may want an official enum Void {} type one day, and in that situation this current ParseError can just become a reexport of that name.

@canndrew nah we stabilized enum ParseError {} (a new empty enum)

@canndrew
Copy link
Contributor

nah we stabilized enum ParseError {} (a new empty enum)

Ah good, I was worried there.

we were thinking that we may want an official enum Void {} type one day,

Of course then there'll be two incompatible ways of writing a diverging function: fn foo() -> Void and fn foo() -> !.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B-unstable Blocker: Implemented in the nightly compiler and unstable. final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants