Implement RFC 3348, `c"foo"` literals #108801
r? @wesleywiser (rustbot has picked a reviewer for you, use r? to override)
    Char(char),
}

pub fn unescape_c_string<F>(src: &str, mode: Mode, callback: &mut F)
If `c"..."` requires different unescaping from some other existing strings, then something is going wrong, in general.
Perhaps the implementation for `c"..."` and the stuff from rust-lang/rfcs#3349 should be decoupled.
It has to be different because returning a `char` doesn't cover all cases for C string literals. If the RFC that you mentioned is accepted, then byte string literals can't have units represented as characters either. We need to differentiate Unicode characters that should be encoded as UTF-8: `c"À"` is `C3 80` while the code point is `0xC0`, and `c"\xC0"` would encode to `[0xC0]` directly. Before this PR, byte strings pass these byte values as `char`s, which are then converted into `u8`s, while C strings need to pass characters that must be encoded as UTF-8 as `char`s and bytes that must be appended as `u8`s.
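The distinction can be sketched outside the compiler. The `Unit` enum and the `encode` function below are hypothetical names chosen for illustration, not the PR's actual lexer API. The sketch only assumes that each unescaped unit must say whether it is a Unicode character (to be UTF-8 encoded) or a raw byte (to be appended unchanged).

```rust
/// One unescaped unit of a C string literal (hypothetical type, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Unit {
    /// A raw byte, e.g. produced by a `\xC0` escape; appended unchanged.
    Byte(u8),
    /// A Unicode scalar value, e.g. a literal `À`; appended as UTF-8.
    Char(char),
}

/// Turns a sequence of units into the bytes of the C string body
/// (without the trailing NUL).
fn encode(units: &[Unit]) -> Vec<u8> {
    let mut out = Vec::new();
    for unit in units {
        match *unit {
            // Raw bytes go in as they are.
            Unit::Byte(b) => out.push(b),
            // Characters are expanded to their UTF-8 encoding (1 to 4 bytes).
            Unit::Char(c) => {
                let mut buf = [0u8; 4];
                out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            }
        }
    }
    out
}
```

With this model, `encode(&[Unit::Char('À')])` yields the two bytes `C3 80`, whereas `encode(&[Unit::Byte(0xC0)])` yields the single byte `C0`. A plain `char` callback could not tell these two cases apart.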
Sorry, I don't understand what you are saying.
Both byte and C strings support non-UTF-8 data, so (Rust) `char`s are out of the question.
I'm concerned about the difference between byte strings and C strings: both produce arbitrary non-UTF-8 `[u8]`, and any differences between them should eventually be eliminated (that's the point of rust-lang/rfcs#3349, from what I remember).
What you are saying is true, but currently both byte strings and normal strings emit `char`s in their implementation. Byte strings just use the code points to represent the byte values, but that would need to be changed to an enum (just like how this PR changes it for C string literals) if we were to implement that RFC.
My understanding is that most of this complication comes from the fact that the C-str RFC explicitly states that it supports both `\u` and `\x` escapes in `c""` literals. Is that correct?
@compiler-errors Not necessarily about the `\u` escape, but more about the `\x` escape, which has a different meaning in byte strings and characters. nnethercote's comment on the RFC mentioned above suggested that a table should make this clearer:
| | Example | # sets* | Characters | Escapes |
|---|---|---|---|---|
| Character | `'H'` | 0 | All Unicode | Quote & ASCII & Unicode |
| String | `"hello"` | 0 | All Unicode | Quote & ASCII & Unicode |
| Raw string | `r#"hello"#` | <256 | All Unicode | N/A |
| Byte | `b'H'` | 0 | All ASCII | Quote & Byte |
| Byte string | `b"hello"` | 0 | All ASCII | Quote & Byte |
| Raw byte string | `br#"hello"#` | <256 | All ASCII | N/A |
| C string | `c"hello"` | 0 | All Unicode | Quote & Byte & Unicode |
Note that since normal strings accept Unicode, we can emit `char`s that correspond to the actual characters. But for byte strings this is different. Byte strings allow bytes that are not encoded as UTF-8 (e.g. `\xFF` is allowed in byte strings but not in normal strings). How do we unescape them currently? We emit the code point (e.g. `\xFF` -> `ÿ`, U+00FF) for byte strings and then interpret the values later on.
That means that a `ÿ` character emitted for a normal string means "ÿ", with code point U+00FF, encoded in UTF-8 as `0xC3 0xBF`. But the same value emitted for a byte string would mean the byte `0xFF` only. C strings are explicitly allowed to have both, therefore it is necessary to use an enum to convey either the character encoded as UTF-8 or the byte value.
@fee1-dead In the entry in the "Characters" column, in the "C string" row, do you really mean "all bytes except NUL"? IIRC Rust files are required to be valid UTF-8, and RFC 3348 has changed nothing about that. At least I found nothing in the RFC's text indicating that. The goal was more about the escapes column: the encoded result can be a non-valid Unicode string, but the literal itself still has to be valid UTF-8. Otherwise this would mean that programs processing Rust source code cannot assume UTF-8 validity of the source code any more. In other words, any program that uses Rust's `String` type to represent a slice of Rust code (including Rust's proc macro infrastructure!) would fail for specific snippets containing C strings that have invalid UTF-8.
I think that entry should rather read "All Unicode" or "All UTF-8".
@est31: corrected, thanks.
Hey! It looks like you've submitted a new PR for the library teams! If this PR contains changes to any Examples of
Some changes occurred in src/tools/clippy cc @rust-lang/clippy
☔ The latest upstream changes (presumably #108998) made this pull request unmergeable. Please resolve the merge conflicts.
r? compiler
…r-errors Implement RFC 3348, `c"foo"` literals RFC: rust-lang/rfcs#3348 Tracking issue: rust-lang#105723
Rollup of 6 pull requests

Successful merges:
- rust-lang#103056 (Fix `checked_{add,sub}_duration` incorrectly returning `None` when `other` has more than `i64::MAX` seconds)
- rust-lang#108801 (Implement RFC 3348, `c"foo"` literals)
- rust-lang#110773 (Reduce MIR dump file count for MIR-opt tests)
- rust-lang#110876 (Added default target cpu to `--print target-cpus` output and updated docs)
- rust-lang#111068 (Improve check-cfg implementation)
- rust-lang#111238 (btree_map: `Cursor{,Mut}::peek_prev` must agree)

Failed merges:
- rust-lang#110694 (Implement builtin # syntax and use it for offset_of!(...))

r? `@ghost`
`@rustbot` modify labels: rollup
Looks like rustfmt doesn't know about the new literals, sadly.
@fee1-dead this should be feature gated under Extreme confusion:
use c literals in compiler and library Use c literals rust-lang#108801 in compiler and library currently blocked on: * <strike>rustfmt: don't know how to format c literals</strike> nope, nightly one works. * <strike>bootstrap</strike> r? `@ghost` `@rustbot` blocked
…=compiler-errors Revert the lexing of `c"…"` string literals Fixes \[after beta-backport\] rust-lang#113235. Further progress is tracked in rust-lang#113333. This PR *manually* reverts parts of rust-lang#108801 (since a git-revert would've been too coarse-grained & messy) and git-reverts rust-lang#111647. CC `@fee1-dead` (rust-lang#108801) `@klensy` (rust-lang#111647) r? `@compiler-errors` `@rustbot` label F-c_str_literals beta-nominated
…ilstrieb Stabilize C string literals

RFC: https://rust-lang.github.io/rfcs/3348-c-str-literal.html
Tracking issue: rust-lang#105723
Documentation PR (reference manual): rust-lang/reference#1423

# Stabilization report

Stabilizes C string and raw C string literals (`c"..."` and `cr#"..."#`), which are expressions of type [`&CStr`](https://doc.rust-lang.org/stable/core/ffi/struct.CStr.html). Both new literals require Rust edition 2021 or later.

```rust
const HELLO: &core::ffi::CStr = c"Hello, world!";
```

C strings may contain any byte other than `NUL` (`b'\x00'`), and their in-memory representation is guaranteed to end with `NUL`.

## Implementation

Originally implemented by PR rust-lang#108801, which was reverted due to unintentional changes to lexer behavior in Rust editions < 2021.

The current implementation landed in PR rust-lang#113476, which restricts C string literals to Rust edition >= 2021.

## Resolutions to open questions from the RFC

* Adding C character literals (`c'.'`) of type `c_char` is not part of this feature.
  * Support for `c"..."` literals does not prevent `c'.'` literals from being added in the future.
* C string literals should not be blocked on making `&CStr` a thin pointer.
  * It's possible to declare constant expressions of type `&'static CStr` in stable Rust (as of v1.59), so C string literals are not adding additional coupling on the internal representation of `CStr`.
* The unstable `concat_bytes!` macro should not accept `c"..."` literals.
  * C strings have two equally valid `&[u8]` representations (with or without terminal `NUL`), so allowing them to be used in `concat_bytes!` would be ambiguous.
* Adding a type to represent C strings containing valid UTF-8 is not part of this feature.
  * Support for a hypothetical `&Utf8CStr` may be explored in the future, should such a type be added to Rust.
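The "two equally valid `&[u8]` representations" mentioned in the stabilization report can be illustrated with the existing `CStr` API. This is a minimal sketch: `byte_views` is a hypothetical helper, and the value is built with `CStr::from_bytes_with_nul` rather than a `c"..."` literal so that the example does not depend on the edition or on the new syntax.

```rust
use std::ffi::CStr;

/// Returns both byte views of a C string:
/// first without, then with the terminating NUL.
fn byte_views(s: &CStr) -> (&[u8], &[u8]) {
    (s.to_bytes(), s.to_bytes_with_nul())
}
```

For a `CStr` holding `Hello, world!`, the first slice has 13 bytes and the second has 14, ending in `0`. A macro such as `concat_bytes!` would have to pick one of the two, which is the ambiguity the report refers to.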
RFC: rust-lang/rfcs#3348
Tracking issue: #105723