Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not wrap strings at escape backslash #4524

Merged
merged 4 commits into from
Dec 9, 2020

Conversation

davidBar-On
Copy link
Contributor

Suggested resolution for issue #4471 - not wrapping string at escape backslash (\), including some additional test cases.

The approach used is that if wrapping position is a \, then if it may be escape code wrap at the previous position. Whether the \ may be escape code is identified by counting the number of preceding backslashes and make sure to have even (or zero) number of backslashes before the wrapping position. This is to handle escaped backslashes ("\\").

The approach of just handling the \ as non-punctuation character is probably less preferable, since strings like \u{1f} may be wrapped at the { which is not good from readability point of view. However, if this is acceptable, then handling the \ as non-punctuation character is easier approach.

@calebcartwright
Copy link
Member

Thank you for the PR @davidBar-On . I think the overall idea of not breaking within a unicode escape is indeed the right approach to address the linked issue, but we'll need to iterate on the implementation a bit.

\u{1f} may be wrapped at the { which is not good from readability point of view

It is indeed true that we would not want to wrap there, but more so because that's still breaking the sequence just in at a different index than the one originally reported; it still results in breakage/changing of the original escape.

Unfortunately I've not been able to set aside the time yet to really dig into this, but I suspect the problem is that rustfmt isn't really checking to see whether the grapheme is actually an escape sequence in conjunction with the punctuation check, so the leading \ in the sequence grapheme causing an incorrect assumption that it's a true punctuation and can be wrapped at that index (\ vs \u{....})

There's some historical/legacy behavior here that we need to be careful to continue to honor which first attempt to break around whitespace before falling back to punctuation, however, we need to incorporate some kind of escape sequence check too.

I think a better implementation would be the inclusion of a new check for an escape sequence (e.g. for this case I think something like grapheme.as_bytse().starts_with(b"\u{") would suffice) instead of the additional logic/closure/input scanning originally proposed. It might be best to start small and just cover unicode escapes to resolve the reported issue, and we can always iterate on it later to account for other escapes, e.g. ascii/byte sequences

@davidBar-On
Copy link
Contributor Author

It is indeed true that we would not want to wrap there, but more so because that's still breaking the sequence just in at a different index than the one originally reported; it still results in breakage/changing of the original escape.

The main issue, as I understand it, is not that the sequence is being broken at a point which makes the string less readable, but that break immediately after escape \ makes the code incorrect. In this case the added \ for continuing the string at the next line cause the line to end with \\, which is interpreted as escaped \ and therefore both the continuation and the escape \ are gone.

I think a better implementation would be the inclusion of a new check for an escape sequence (e.g. for this case I think something like grapheme.as_bytse().starts_with(b"\u{")would suffice)

The problem is that the issue may happen with other sequences that include \. For example the last test case I added in strings-lit.rs:

const ASCII: &str = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\\\nyyyyyyyyyyyy";

is currently formatted as (when rustfmt-format_strings: true is used):

const ASCII: &str =
    "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\\\\
     nyyyyyyyyyyyy";

Which is also not correct. Therefore it seems than strings should never be broken immediately after an uneven number of consecutive \. This is the reason for the code I suggested.

... instead of the additional logic/closure/input scanning originally proposed.

Just for my information, what is the main issue with the additional code? Is it compiling time? The complexity of rustfmt? Other?

It might be best to start small and just cover unicode escapes to resolve the reported issue, and we can always iterate on it later to account for other escapes, e.g. ascii/byte sequences

If adding the code I suggested is problematic, what about handling the \ as non-punctuation character? This will solve the issue for any occurance of \ and the change seem to be simple: changing the two occurrences of is_punctuation() in break_string() to "(is_punctuation(grapheme) && grapheme != '\')", or event make the change in is_punctuation() itself.

@calebcartwright
Copy link
Member

As you're probably already aware, \u followed by a unicode char code enclosed within braces is a unicode escape . You'll notice how this is manifested in the grapheme clusters with a string lit outside rustfmt (just make sure you grab the unicode-segmentation crate)

let graphemes = "id\u{1f}".graphemes(false).collect::<Vec<&str>>();
println!("{:?}", graphemes);

prints:

["i", "d", "\u{1f}"]

Unfortunately, I believe this context is lost for rustfmt since we're obtaining the lit value via the span and the unescaping happens really early in the lexing/parsing flow. I imagine this is why the graphemes coming to rustfmt are apparently functionally the single chars you are seeing (e.g. "\", "u", "{", etc. instead of the atomic unicode escape) which get processed individually.

This I would suggest is the real underlying problem.

We should be able to break the string before/after a complete escape (unicode or otherwise) but never within.

The main issue, as I understand it, is not that the sequence is being broken at a point which makes the string less readable, but that break immediately after escape \ makes the code incorrect. In this case the added \ for continuing the string at the next line cause the line to end with \, which is interpreted as escaped \ and therefore both the continuation and the escape \ are gone.

I'm not sure if this is a statement or a question, but feel like this has been answered/stated a few times now. We should not break within an escape.

The problem is that the issue may happen with other sequences that include . For example the last test case I added in strings-lit.rs:

I am aware, as indicated in my previous comment that you were responding to with this, about dealing with other types of escapes later on 😉

Just for my information, what is the main issue with the additional code? Is it compiling time? The complexity of rustfmt? Other?

I previously spoke to the root cause above, and we should always strive to solve the underlying problem instead of adding work to deal with symptoms/side effects of the underlying problem. However, if my diagnosis is correct here then that comes from rustc so there's not much we can do about that (it's focused on compiler things naturally, not pretty formatting) and may have to work around it anyway, unfortunately.

That being said, this looks to me like a minor adjustment to the index used for breaking, and I don't feel like we should have to iterate through the string (potentially to the end) to get a count of escape chars for the odd/even and split index determination.

Instead of break_not_at_backslash, could we not just do something like this line at the top of the existing break_at?

let index = if input[index] == "\\" { index - 1 } else { index };

we know rustfmt isn't going to break at a 0 index, so if it's found a punctuation char that happens to be a \ at an index for splitting then I'd think we could just move the index back one

@calebcartwright
Copy link
Member

Instead of break_not_at_backslash, could we not just do something like this line at the top of the existing break_at?

let index = if input[index] == "\\" { index - 1 } else { index };

we know rustfmt isn't going to break at a 0 index, so if it's found a punctuation char that happens to be a \ at an index for splitting then I'd think we could just move the index back one

Actually there's edge cases where that could fail aren't there, such as an escaped backslash immediately followed by a character that would become an escape with the second backslash being right at the max width index (e.g. foo\\notfoo). Sigh this is fun 🙃

@davidBar-On
Copy link
Contributor Author

Actually there's edge cases where that could fail aren't there, such as an escaped backslash immediately followed by a character that would become an escape with the second backslash being right at the max width index (e.g. foo\notfoo). Sigh this is fun 🙃

Yes, I also think this is the case. This is why the code I suggested is scanning back until the first non-\ to count the number of consecutive \. When the count is even the break place is not changed. When the count is odd the break is at index -1.

@calebcartwright
Copy link
Member

This is why the code I suggested is scanning back until the first non-\ to count the number of consecutive . When the count is even the break place is not changed. When the count is odd the break is at index -1

Gotcha, now I'm seeing why you ended up with the proposed solution, thanks for your patience! It's still a bummer that we don't get to see the "real" graphemes associated with the raw string in source, but in the absence of such an option I do think we'll probably be best served with what's been proposed here.

Couple minor nits that I'll leave inline for your review, but otherwise we should be good to go

Comment on lines 271 to 288
// Don't allow break at `\` as it is used to concat string lines.
let break_not_at_backslash = |index /* grapheme at index is included */| {
if input[0..=index].iter().last().unwrap().ne(&"\\") {
break_at(index)
} else {
let backslash_count = match input[0..index]
.iter()
.rposition(|grapheme| grapheme.ne(&"\\"))
{
// Break at the reverse-first non `\`
Some(non_backslash_index) => index - non_backslash_index,
// Only `\` to the right - break after even number to not break after escape `\`
None => index % 2,
};
break_at(index - backslash_count % 2)
}
};

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ultimately we're trying to adjust the target index for breaking if we're at an escape, so I'm thinking we might be better off from a readability perspective by moving this logic within break_at instead of having this separate closure to avoid having different calls below/potentially using the wrong one later on.

That feels even more relevant since this is primarily needed as a workaround/fix for the grapheme issue discussed previously.

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

O.k. Will move the added closure logic into break_at.

@@ -268,6 +268,24 @@ fn break_string(max_width: usize, trim_end: bool, line_end: &str, input: &[&str]
}
};

// Don't allow break at `\` as it is used to concat string lines.
let break_not_at_backslash = |index /* grapheme at index is included */| {
if input[0..=index].iter().last().unwrap().ne(&"\\") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

input[index] instead of creating the iterator and unwrapping just to get the same thing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Found that explicit casting should be added: (input[index] as &str). Otherwise there is an error cannot infer type, but I assume adding the casting is o.k.

Comment on lines 276 to 285
let backslash_count = match input[0..index]
.iter()
.rposition(|grapheme| grapheme.ne(&"\\"))
{
// Break at the reverse-first non `\`
Some(non_backslash_index) => index - non_backslash_index,
// Only `\` to the right - break after even number to not break after escape `\`
None => index % 2,
};
break_at(index - backslash_count % 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit here because the variable names were originally throwing me off. What about something like this:

            let index_offset = match input[0..index]
                .iter()
                .rposition(|grapheme| grapheme.ne(&"\\"))
            {
                // Break at the reverse-first non `\`
                Some(non_backslash_index) => index - non_backslash_index,
                // Only `\` to the left - break after even number to not break after escape `\`
                None => index,
            } % 2;
            break_at(index - index_offset)

or with the binop modulo exprs within both arm bodies:

            let index_offset = match input[0..index]
                .iter()
                .rposition(|grapheme| grapheme.ne(&"\\"))
            {
                // Break at the reverse-first non `\`
                Some(non_backslash_index) => (index - non_backslash_index) % 2,
                // Only `\` to the left - break after even number to not break after escape `\`
                None => index % 2,
            };
            break_at(index - index_offset)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now see the problem with the names. Will change the code to the second suggestion, as it seems more clear to read.

{
// Break at the reverse-first non `\`
Some(non_backslash_index) => index - non_backslash_index,
// Only `\` to the right - break after even number to not break after escape `\`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be to the left, not the right no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, should be left .... Will change the comment.

@davidBar-On
Copy link
Contributor Author

Submitted to modifications according to the comments.

Copy link
Member

@calebcartwright calebcartwright left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple final nits, otherwise LGTM!

Comment on lines 231 to 245
let index = if (input[index_in] as &str).ne("\\") {
index_in
} else {
let index_offset = match input[0..index_in]
.iter()
.rposition(|grapheme| grapheme.ne(&"\\"))
{
// There is a non-`\` to the left
Some(non_backslash_index) => (index_in - non_backslash_index) % 2,
// Only `\` to the left
None => index_in % 2,
};
// Make sure break is after even number (including zero) of `\`
index_in - index_offset
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admittedly another nit, but I think keeping index as the closure param name and just shadowing it with the assignment on 231 would be fine, especially since it's a usize

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. submitted the change.

let break_at = |index_in /* grapheme at index is included */| {
// Ensure break is not after an escape '\' as it will "escape" the `\` that is added
// for concatenating the two parts of the broken line.
let index = if (input[index_in] as &str).ne("\\") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let index = if (input[index_in] as &str).ne("\\") {
let index = if input[index_in] != "\\" {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Submitted the change (and I have still much to learn about rust, since I was not even aware that != and ne are not the same).

@calebcartwright
Copy link
Member

Awesome, thanks so much!

@calebcartwright calebcartwright merged commit 2ce75b1 into rust-lang:master Dec 9, 2020
@karyon karyon added backport-triage 1x-backport:pending Fixed/resolved in source but not yet backported to a 1x branch and release and removed backport-triage labels Oct 27, 2021
davidBar-On added a commit to davidBar-On/rustfmt that referenced this pull request Aug 4, 2022
davidBar-On added a commit to davidBar-On/rustfmt that referenced this pull request Aug 4, 2022
@ahl ahl mentioned this pull request Dec 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1x-backport:pending Fixed/resolved in source but not yet backported to a 1x branch and release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants