Not wrap strings at escape backslash #4524

davidBar-On · 2020-11-13T18:14:39Z

Suggested resolution for issue #4471 - not wrapping string at escape backslash (\), including some additional test cases.

The approach used is that if wrapping position is a \, then if it may be escape code wrap at the previous position. Whether the \ may be escape code is identified by counting the number of preceding backslashes and make sure to have even (or zero) number of backslashes before the wrapping position. This is to handle escaped backslashes ("\\").

The approach of just handling the \ as non-punctuation character is probably less preferable, since strings like \u{1f} may be wrapped at the { which is not good from readability point of view. However, if this is acceptable, then handling the \ as non-punctuation character is easier approach.

calebcartwright · 2020-12-04T05:14:25Z

Thank you for the PR @davidBar-On . I think the overall idea of not breaking within a unicode escape is indeed the right approach to address the linked issue, but we'll need to iterate on the implementation a bit.

\u{1f} may be wrapped at the { which is not good from readability point of view

It is indeed true that we would not want to wrap there, but more so because that's still breaking the sequence just in at a different index than the one originally reported; it still results in breakage/changing of the original escape.

Unfortunately I've not been able to set aside the time yet to really dig into this, but I suspect the problem is that rustfmt isn't really checking to see whether the grapheme is actually an escape sequence in conjunction with the punctuation check, so the leading \ in the sequence grapheme causing an incorrect assumption that it's a true punctuation and can be wrapped at that index (\ vs \u{....})

There's some historical/legacy behavior here that we need to be careful to continue to honor which first attempt to break around whitespace before falling back to punctuation, however, we need to incorporate some kind of escape sequence check too.

I think a better implementation would be the inclusion of a new check for an escape sequence (e.g. for this case I think something like grapheme.as_bytse().starts_with(b"\u{") would suffice) instead of the additional logic/closure/input scanning originally proposed. It might be best to start small and just cover unicode escapes to resolve the reported issue, and we can always iterate on it later to account for other escapes, e.g. ascii/byte sequences

davidBar-On · 2020-12-05T20:29:19Z

It is indeed true that we would not want to wrap there, but more so because that's still breaking the sequence just in at a different index than the one originally reported; it still results in breakage/changing of the original escape.

The main issue, as I understand it, is not that the sequence is being broken at a point which makes the string less readable, but that break immediately after escape \ makes the code incorrect. In this case the added \ for continuing the string at the next line cause the line to end with \\, which is interpreted as escaped \ and therefore both the continuation and the escape \ are gone.

I think a better implementation would be the inclusion of a new check for an escape sequence (e.g. for this case I think something like grapheme.as_bytse().starts_with(b"\u{")would suffice)

The problem is that the issue may happen with other sequences that include \. For example the last test case I added in strings-lit.rs:

const ASCII: &str = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\\\nyyyyyyyyyyyy";

is currently formatted as (when rustfmt-format_strings: true is used):

const ASCII: &str =
    "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\\\\
     nyyyyyyyyyyyy";

Which is also not correct. Therefore it seems than strings should never be broken immediately after an uneven number of consecutive \. This is the reason for the code I suggested.

... instead of the additional logic/closure/input scanning originally proposed.

Just for my information, what is the main issue with the additional code? Is it compiling time? The complexity of rustfmt? Other?

It might be best to start small and just cover unicode escapes to resolve the reported issue, and we can always iterate on it later to account for other escapes, e.g. ascii/byte sequences

If adding the code I suggested is problematic, what about handling the \ as non-punctuation character? This will solve the issue for any occurance of \ and the change seem to be simple: changing the two occurrences of is_punctuation() in break_string() to "(is_punctuation(grapheme) && grapheme != '\')", or event make the change in is_punctuation() itself.

calebcartwright · 2020-12-07T02:14:17Z

As you're probably already aware, \u followed by a unicode char code enclosed within braces is a unicode escape . You'll notice how this is manifested in the grapheme clusters with a string lit outside rustfmt (just make sure you grab the unicode-segmentation crate)

let graphemes = "id\u{1f}".graphemes(false).collect::<Vec<&str>>();
println!("{:?}", graphemes);

prints:

["i", "d", "\u{1f}"]

Unfortunately, I believe this context is lost for rustfmt since we're obtaining the lit value via the span and the unescaping happens really early in the lexing/parsing flow. I imagine this is why the graphemes coming to rustfmt are apparently functionally the single chars you are seeing (e.g. "\", "u", "{", etc. instead of the atomic unicode escape) which get processed individually.

This I would suggest is the real underlying problem.

We should be able to break the string before/after a complete escape (unicode or otherwise) but never within.

The main issue, as I understand it, is not that the sequence is being broken at a point which makes the string less readable, but that break immediately after escape \ makes the code incorrect. In this case the added \ for continuing the string at the next line cause the line to end with \, which is interpreted as escaped \ and therefore both the continuation and the escape \ are gone.

I'm not sure if this is a statement or a question, but feel like this has been answered/stated a few times now. We should not break within an escape.

The problem is that the issue may happen with other sequences that include . For example the last test case I added in strings-lit.rs:

I am aware, as indicated in my previous comment that you were responding to with this, about dealing with other types of escapes later on 😉

Just for my information, what is the main issue with the additional code? Is it compiling time? The complexity of rustfmt? Other?

I previously spoke to the root cause above, and we should always strive to solve the underlying problem instead of adding work to deal with symptoms/side effects of the underlying problem. However, if my diagnosis is correct here then that comes from rustc so there's not much we can do about that (it's focused on compiler things naturally, not pretty formatting) and may have to work around it anyway, unfortunately.

That being said, this looks to me like a minor adjustment to the index used for breaking, and I don't feel like we should have to iterate through the string (potentially to the end) to get a count of escape chars for the odd/even and split index determination.

Instead of break_not_at_backslash, could we not just do something like this line at the top of the existing break_at?

let index = if input[index] == "\\" { index - 1 } else { index };

we know rustfmt isn't going to break at a 0 index, so if it's found a punctuation char that happens to be a \ at an index for splitting then I'd think we could just move the index back one

calebcartwright · 2020-12-07T19:59:42Z

Instead of break_not_at_backslash, could we not just do something like this line at the top of the existing break_at?
let index = if input[index] == "\\" { index - 1 } else { index };
we know rustfmt isn't going to break at a 0 index, so if it's found a punctuation char that happens to be a \ at an index for splitting then I'd think we could just move the index back one

Actually there's edge cases where that could fail aren't there, such as an escaped backslash immediately followed by a character that would become an escape with the second backslash being right at the max width index (e.g. foo\\notfoo). Sigh this is fun 🙃

davidBar-On · 2020-12-07T20:54:18Z

Actually there's edge cases where that could fail aren't there, such as an escaped backslash immediately followed by a character that would become an escape with the second backslash being right at the max width index (e.g. foo\notfoo). Sigh this is fun 🙃

Yes, I also think this is the case. This is why the code I suggested is scanning back until the first non-\ to count the number of consecutive \. When the count is even the break place is not changed. When the count is odd the break is at index -1.

calebcartwright · 2020-12-08T03:44:17Z

This is why the code I suggested is scanning back until the first non-\ to count the number of consecutive . When the count is even the break place is not changed. When the count is odd the break is at index -1

Gotcha, now I'm seeing why you ended up with the proposed solution, thanks for your patience! It's still a bummer that we don't get to see the "real" graphemes associated with the raw string in source, but in the absence of such an option I do think we'll probably be best served with what's been proposed here.

Couple minor nits that I'll leave inline for your review, but otherwise we should be good to go

calebcartwright · 2020-12-08T03:51:26Z

src/formatting/string.rs

+    // Don't allow break at `\` as it is used to concat string lines.
+    let break_not_at_backslash = |index /* grapheme at index is included */| {
+        if input[0..=index].iter().last().unwrap().ne(&"\\") {
+            break_at(index)
+        } else {
+            let backslash_count = match input[0..index]
+                .iter()
+                .rposition(|grapheme| grapheme.ne(&"\\"))
+            {
+                // Break at the reverse-first non `\`
+                Some(non_backslash_index) => index - non_backslash_index,
+                // Only `\` to the right - break after even number to not break after escape `\`
+                None => index % 2,
+            };
+            break_at(index - backslash_count % 2)
+        }
+    };
+


Ultimately we're trying to adjust the target index for breaking if we're at an escape, so I'm thinking we might be better off from a readability perspective by moving this logic within break_at instead of having this separate closure to avoid having different calls below/potentially using the wrong one later on.

That feels even more relevant since this is primarily needed as a workaround/fix for the grapheme issue discussed previously.

Thoughts?

O.k. Will move the added closure logic into break_at.

calebcartwright · 2020-12-08T03:52:43Z

src/formatting/string.rs

@@ -268,6 +268,24 @@ fn break_string(max_width: usize, trim_end: bool, line_end: &str, input: &[&str]
        }
    };

+    // Don't allow break at `\` as it is used to concat string lines.
+    let break_not_at_backslash = |index /* grapheme at index is included */| {
+        if input[0..=index].iter().last().unwrap().ne(&"\\") {


input[index] instead of creating the iterator and unwrapping just to get the same thing?

OK. Found that explicit casting should be added: (input[index] as &str). Otherwise there is an error cannot infer type, but I assume adding the casting is o.k.

calebcartwright · 2020-12-08T03:57:39Z

src/formatting/string.rs

+            let backslash_count = match input[0..index]
+                .iter()
+                .rposition(|grapheme| grapheme.ne(&"\\"))
+            {
+                // Break at the reverse-first non `\`
+                Some(non_backslash_index) => index - non_backslash_index,
+                // Only `\` to the right - break after even number to not break after escape `\`
+                None => index % 2,
+            };
+            break_at(index - backslash_count % 2)


Minor nit here because the variable names were originally throwing me off. What about something like this:

let index_offset = match input[0..index] .iter() .rposition(|grapheme| grapheme.ne(&"\\")) { // Break at the reverse-first non `\` Some(non_backslash_index) => index - non_backslash_index, // Only `\` to the left - break after even number to not break after escape `\` None => index, } % 2; break_at(index - index_offset)

or with the binop modulo exprs within both arm bodies:

let index_offset = match input[0..index] .iter() .rposition(|grapheme| grapheme.ne(&"\\")) { // Break at the reverse-first non `\` Some(non_backslash_index) => (index - non_backslash_index) % 2, // Only `\` to the left - break after even number to not break after escape `\` None => index % 2, }; break_at(index - index_offset)

I now see the problem with the names. Will change the code to the second suggestion, as it seems more clear to read.

calebcartwright · 2020-12-08T03:57:58Z

src/formatting/string.rs

+            {
+                // Break at the reverse-first non `\`
+                Some(non_backslash_index) => index - non_backslash_index,
+                // Only `\` to the right - break after even number to not break after escape `\`


This would be to the left, not the right no?

Yes, should be left .... Will change the comment.

davidBar-On · 2020-12-08T12:59:23Z

Submitted to modifications according to the comments.

calebcartwright

Couple final nits, otherwise LGTM!

calebcartwright · 2020-12-09T01:06:05Z

src/formatting/string.rs

+        let index = if (input[index_in] as &str).ne("\\") {
+            index_in
+        } else {
+            let index_offset = match input[0..index_in]
+                .iter()
+                .rposition(|grapheme| grapheme.ne(&"\\"))
+            {
+                // There is a non-`\` to the left
+                Some(non_backslash_index) => (index_in - non_backslash_index) % 2,
+                // Only `\` to the left
+                None => index_in % 2,
+            };
+            // Make sure break is after even number (including zero) of `\`
+            index_in - index_offset
+        };


Admittedly another nit, but I think keeping index as the closure param name and just shadowing it with the assignment on 231 would be fine, especially since it's a usize

OK. submitted the change.

calebcartwright · 2020-12-09T01:06:41Z

src/formatting/string.rs

+    let break_at = |index_in /* grapheme at index is included */| {
+        // Ensure break is not after an escape '\' as it will "escape" the `\` that is added
+        // for concatenating the two parts of the broken line.
+        let index = if (input[index_in] as &str).ne("\\") {


Suggested change

let index = if (input[index_in] as &str).ne("\\") {

let index = if input[index_in] != "\\" {

OK. Submitted the change (and I have still much to learn about rust, since I was not even aware that != and ne are not the same).

calebcartwright · 2020-12-09T23:40:56Z

Awesome, thanks so much!

…with escape backslash

Issue rust-lang#4471 - not wrap strings at backslash

5a371a0

calebcartwright added the pr-not-reviewed label Nov 14, 2020

calebcartwright removed the pr-not-reviewed label Dec 4, 2020

calebcartwright reviewed Dec 8, 2020

View reviewed changes

calebcartwright added the pr-waiting-on-author label Dec 8, 2020

davidBar-On added 2 commits December 8, 2020 13:17

Merge branch 'master' into issue-4471

bb8a8b9

Changes according to comments to the fix PR rust-lang#4524

e8eb088

calebcartwright removed the pr-waiting-on-author label Dec 8, 2020

calebcartwright reviewed Dec 9, 2020

View reviewed changes

Some more modifications to improve the code per comments received

c6e1673

calebcartwright merged commit 2ce75b1 into rust-lang:master Dec 9, 2020

karyon added backport-triage 1x-backport:pending Fixed/resolved in source but not yet backported to a 1x branch and release and removed backport-triage labels Oct 27, 2021

calebcartwright mentioned this pull request Dec 21, 2021

format_strings breaks after "\\" resulting in invalid output #5138

Open

davidBar-On added a commit to davidBar-On/rustfmt that referenced this pull request Aug 4, 2022

Backport rust-lang#4524 and fix rust-lang#5138 - proper wrap strings …

2474b49

…with escape backslash

davidBar-On added a commit to davidBar-On/rustfmt that referenced this pull request Aug 4, 2022

Backport rust-lang#4524 and fix rust-lang#5138 - proper wrap strings …

12e16ba

…with escape backslash

davidBar-On mentioned this pull request Aug 4, 2022

Backport #4524 and fix #5138 - proper wrap strings with escape backslash #5483

Open

ahl mentioned this pull request Dec 7, 2023

fix #4471 and #5138 #5976

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Not wrap strings at escape backslash #4524

Not wrap strings at escape backslash #4524

davidBar-On commented Nov 13, 2020

calebcartwright commented Dec 4, 2020

davidBar-On commented Dec 5, 2020

calebcartwright commented Dec 7, 2020

calebcartwright commented Dec 7, 2020

davidBar-On commented Dec 7, 2020

calebcartwright commented Dec 8, 2020

calebcartwright Dec 8, 2020

davidBar-On Dec 8, 2020

calebcartwright Dec 8, 2020

davidBar-On Dec 8, 2020

calebcartwright Dec 8, 2020

davidBar-On Dec 8, 2020

calebcartwright Dec 8, 2020

davidBar-On Dec 8, 2020

davidBar-On commented Dec 8, 2020

calebcartwright left a comment

calebcartwright Dec 9, 2020

davidBar-On Dec 9, 2020

calebcartwright Dec 9, 2020

davidBar-On Dec 9, 2020

calebcartwright commented Dec 9, 2020

	let index = if (input[index_in] as &str).ne("\\") {
	let index = if input[index_in] != "\\" {

Not wrap strings at escape backslash #4524

Not wrap strings at escape backslash #4524

Conversation

davidBar-On commented Nov 13, 2020

calebcartwright commented Dec 4, 2020

davidBar-On commented Dec 5, 2020

calebcartwright commented Dec 7, 2020

calebcartwright commented Dec 7, 2020

davidBar-On commented Dec 7, 2020

calebcartwright commented Dec 8, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

davidBar-On commented Dec 8, 2020

calebcartwright left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

calebcartwright commented Dec 9, 2020