-
Notifications
You must be signed in to change notification settings - Fork 898
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Not wrap strings at escape backslash #4524
Conversation
Thank you for the PR @davidBar-On . I think the overall idea of not breaking within a unicode escape is indeed the right approach to address the linked issue, but we'll need to iterate on the implementation a bit.
It is indeed true that we would not want to wrap there, but more so because that's still breaking the sequence just in at a different index than the one originally reported; it still results in breakage/changing of the original escape. Unfortunately I've not been able to set aside the time yet to really dig into this, but I suspect the problem is that rustfmt isn't really checking to see whether the grapheme is actually an escape sequence in conjunction with the punctuation check, so the leading There's some historical/legacy behavior here that we need to be careful to continue to honor which first attempt to break around whitespace before falling back to punctuation, however, we need to incorporate some kind of escape sequence check too. I think a better implementation would be the inclusion of a new check for an escape sequence (e.g. for this case I think something like |
The main issue, as I understand it, is not that the sequence is being broken at a point which makes the string less readable, but that break immediately after escape
The problem is that the issue may happen with other sequences that include const ASCII: &str = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\\\nyyyyyyyyyyyy"; is currently formatted as (when const ASCII: &str =
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\\\\
nyyyyyyyyyyyy"; Which is also not correct. Therefore it seems than strings should never be broken immediately after an uneven number of consecutive
Just for my information, what is the main issue with the additional code? Is it compiling time? The complexity of rustfmt? Other?
If adding the code I suggested is problematic, what about handling the |
As you're probably already aware, let graphemes = "id\u{1f}".graphemes(false).collect::<Vec<&str>>();
println!("{:?}", graphemes); prints: ["i", "d", "\u{1f}"] Unfortunately, I believe this context is lost for rustfmt since we're obtaining the lit value via the span and the unescaping happens really early in the lexing/parsing flow. I imagine this is why the graphemes coming to rustfmt are apparently functionally the single chars you are seeing (e.g. "\", "u", "{", etc. instead of the atomic unicode escape) which get processed individually. This I would suggest is the real underlying problem. We should be able to break the string before/after a complete escape (unicode or otherwise) but never within.
I'm not sure if this is a statement or a question, but feel like this has been answered/stated a few times now. We should not break within an escape.
I am aware, as indicated in my previous comment that you were responding to with this, about dealing with other types of escapes later on 😉
I previously spoke to the root cause above, and we should always strive to solve the underlying problem instead of adding work to deal with symptoms/side effects of the underlying problem. However, if my diagnosis is correct here then that comes from rustc so there's not much we can do about that (it's focused on compiler things naturally, not pretty formatting) and may have to work around it anyway, unfortunately. That being said, this looks to me like a minor adjustment to the index used for breaking, and I don't feel like we should have to iterate through the string (potentially to the end) to get a count of escape chars for the odd/even and split index determination. Instead of let index = if input[index] == "\\" { index - 1 } else { index }; we know rustfmt isn't going to break at a 0 index, so if it's found a punctuation char that happens to be a |
Actually there's edge cases where that could fail aren't there, such as an escaped backslash immediately followed by a character that would become an escape with the second backslash being right at the max width index (e.g. |
Yes, I also think this is the case. This is why the code I suggested is scanning back until the first non- |
Gotcha, now I'm seeing why you ended up with the proposed solution, thanks for your patience! It's still a bummer that we don't get to see the "real" graphemes associated with the raw string in source, but in the absence of such an option I do think we'll probably be best served with what's been proposed here. Couple minor nits that I'll leave inline for your review, but otherwise we should be good to go |
src/formatting/string.rs
Outdated
// Don't allow break at `\` as it is used to concat string lines. | ||
let break_not_at_backslash = |index /* grapheme at index is included */| { | ||
if input[0..=index].iter().last().unwrap().ne(&"\\") { | ||
break_at(index) | ||
} else { | ||
let backslash_count = match input[0..index] | ||
.iter() | ||
.rposition(|grapheme| grapheme.ne(&"\\")) | ||
{ | ||
// Break at the reverse-first non `\` | ||
Some(non_backslash_index) => index - non_backslash_index, | ||
// Only `\` to the right - break after even number to not break after escape `\` | ||
None => index % 2, | ||
}; | ||
break_at(index - backslash_count % 2) | ||
} | ||
}; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ultimately we're trying to adjust the target index for breaking if we're at an escape, so I'm thinking we might be better off from a readability perspective by moving this logic within break_at
instead of having this separate closure to avoid having different calls below/potentially using the wrong one later on.
That feels even more relevant since this is primarily needed as a workaround/fix for the grapheme issue discussed previously.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
O.k. Will move the added closure logic into break_at
.
src/formatting/string.rs
Outdated
@@ -268,6 +268,24 @@ fn break_string(max_width: usize, trim_end: bool, line_end: &str, input: &[&str] | |||
} | |||
}; | |||
|
|||
// Don't allow break at `\` as it is used to concat string lines. | |||
let break_not_at_backslash = |index /* grapheme at index is included */| { | |||
if input[0..=index].iter().last().unwrap().ne(&"\\") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
input[index]
instead of creating the iterator and unwrapping just to get the same thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Found that explicit casting should be added: (input[index] as &str)
. Otherwise there is an error cannot infer type
, but I assume adding the casting is o.k.
src/formatting/string.rs
Outdated
let backslash_count = match input[0..index] | ||
.iter() | ||
.rposition(|grapheme| grapheme.ne(&"\\")) | ||
{ | ||
// Break at the reverse-first non `\` | ||
Some(non_backslash_index) => index - non_backslash_index, | ||
// Only `\` to the right - break after even number to not break after escape `\` | ||
None => index % 2, | ||
}; | ||
break_at(index - backslash_count % 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit here because the variable names were originally throwing me off. What about something like this:
let index_offset = match input[0..index]
.iter()
.rposition(|grapheme| grapheme.ne(&"\\"))
{
// Break at the reverse-first non `\`
Some(non_backslash_index) => index - non_backslash_index,
// Only `\` to the left - break after even number to not break after escape `\`
None => index,
} % 2;
break_at(index - index_offset)
or with the binop modulo exprs within both arm bodies:
let index_offset = match input[0..index]
.iter()
.rposition(|grapheme| grapheme.ne(&"\\"))
{
// Break at the reverse-first non `\`
Some(non_backslash_index) => (index - non_backslash_index) % 2,
// Only `\` to the left - break after even number to not break after escape `\`
None => index % 2,
};
break_at(index - index_offset)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now see the problem with the names. Will change the code to the second suggestion, as it seems more clear to read.
src/formatting/string.rs
Outdated
{ | ||
// Break at the reverse-first non `\` | ||
Some(non_backslash_index) => index - non_backslash_index, | ||
// Only `\` to the right - break after even number to not break after escape `\` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be to the left
, not the right
no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, should be left
.... Will change the comment.
Submitted to modifications according to the comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple final nits, otherwise LGTM!
src/formatting/string.rs
Outdated
let index = if (input[index_in] as &str).ne("\\") { | ||
index_in | ||
} else { | ||
let index_offset = match input[0..index_in] | ||
.iter() | ||
.rposition(|grapheme| grapheme.ne(&"\\")) | ||
{ | ||
// There is a non-`\` to the left | ||
Some(non_backslash_index) => (index_in - non_backslash_index) % 2, | ||
// Only `\` to the left | ||
None => index_in % 2, | ||
}; | ||
// Make sure break is after even number (including zero) of `\` | ||
index_in - index_offset | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Admittedly another nit, but I think keeping index
as the closure param name and just shadowing it with the assignment on 231 would be fine, especially since it's a usize
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. submitted the change.
src/formatting/string.rs
Outdated
let break_at = |index_in /* grapheme at index is included */| { | ||
// Ensure break is not after an escape '\' as it will "escape" the `\` that is added | ||
// for concatenating the two parts of the broken line. | ||
let index = if (input[index_in] as &str).ne("\\") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let index = if (input[index_in] as &str).ne("\\") { | |
let index = if input[index_in] != "\\" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Submitted the change (and I have still much to learn about rust, since I was not even aware that !=
and ne
are not the same).
Awesome, thanks so much! |
…with escape backslash
…with escape backslash
Suggested resolution for issue #4471 - not wrapping string at escape backslash (
\
), including some additional test cases.The approach used is that if wrapping position is a
\
, then if it may be escape code wrap at the previous position. Whether the\
may be escape code is identified by counting the number of preceding backslashes and make sure to have even (or zero) number of backslashes before the wrapping position. This is to handle escaped backslashes ("\\
").The approach of just handling the
\
as non-punctuation character is probably less preferable, since strings like\u{1f}
may be wrapped at the{
which is not good from readability point of view. However, if this is acceptable, then handling the\
as non-punctuation character is easier approach.