fix: escape invalid UTF-8 bytes in debug output for Match #1203
src/regex/bytes.rs
Outdated
fmt.field("bytes", &s);

let bytes = self.as_bytes();
let formatted = bytes_to_string_with_invalid_utf8_escaped(bytes);
Can you use regex_automata::util::escape::DebugHaystack instead? It will basically do what you have here, but will only escape invalid UTF-8. What you've implemented here will escape not only invalid UTF-8, but all UTF-8 that isn't ASCII. (I think that would be a cure worse than the disease.)
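As a rough illustration of the behavior requested above (valid UTF-8 passes through, only invalid bytes are escaped), here is a minimal standard-library sketch. The helper name `escape_invalid_utf8` is hypothetical and this is not the `DebugHaystack` implementation; the exact output of `DebugHaystack` (for example, hex digit case or surrounding quotes) may differ.

```rust
use std::str;

/// Hypothetical helper (not part of regex or regex-automata):
/// copies valid UTF-8 through unchanged and escapes each invalid byte as \xHH.
fn escape_invalid_utf8(mut bytes: &[u8]) -> String {
    let mut out = String::new();
    while !bytes.is_empty() {
        match str::from_utf8(bytes) {
            Ok(valid) => {
                // The remainder is entirely valid UTF-8.
                out.push_str(valid);
                break;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                // `valid_up_to` guarantees this prefix decodes cleanly.
                out.push_str(str::from_utf8(valid).unwrap());
                // `error_len` is None when the input ends in the middle of a
                // sequence; in that case escape everything that is left.
                let bad_len = err.error_len().unwrap_or(rest.len());
                for byte in &rest[..bad_len] {
                    out.push_str(&format!("\\x{:02X}", byte));
                }
                bytes = &rest[bad_len..];
            }
        }
    }
    out
}
```

With this sketch, `escape_invalid_utf8(b"\xFFworld")` yields `\xFFworld`, while a string such as `héllo ☃` comes back unchanged.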
Modified to use DebugHaystack. I thought there would be such a feature but couldn't find it. Thanks for your suggestion. 88112b3
    debug_str,
    r#"Match { start: 7, end: 13, bytes: "\\xFFworld" }"#
);
}
Please add some tests with non-ASCII UTF-8.
Added along with other tests.
d18841e
src/regex/bytes.rs
Outdated
fn bytes_to_string_with_invalid_utf8_escaped(bytes: &[u8]) -> String {
    let mut result = String::new();
    for &byte in bytes {
        if byte.is_ascii() {
“outputs valid UTF-8 characters as is”
This is why what you said isn't accurate here. This only outputs ASCII characters as-is. Everything else, including valid UTF-8 that isn't ASCII, is emitted as escape byte sequences.
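To make the point concrete, here is a sketch of a byte-wise, ASCII-only escaper of the kind the comment describes. The diff excerpt is truncated after the `is_ascii()` check, so the loop body below is an assumption about how such a function would typically continue, and the name `ascii_only_escape` is hypothetical.

```rust
/// Hypothetical reconstruction of a byte-wise escaper: ASCII bytes are
/// copied through, every other byte is written as \xHH.
fn ascii_only_escape(bytes: &[u8]) -> String {
    let mut result = String::new();
    for &byte in bytes {
        if byte.is_ascii() {
            // ASCII bytes map directly to the same `char`.
            result.push(byte as char);
        } else {
            // Any byte >= 0x80 is escaped, even when it is part of a
            // perfectly valid multi-byte UTF-8 sequence.
            result.push_str(&format!("\\x{:02X}", byte));
        }
    }
    result
}
```

Because the check is per byte, the valid two-byte encoding of `é` (0xC3 0xA9) is emitted as `\xC3\xA9` rather than as `é`, which is the over-escaping the review objects to.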
Thanks!
This PR is on crates.io in |
Description
The Debug implementation for Match has been updated to use DebugHaystack. This provides a way to handle the formatting of &[u8] for debug output.
Invalid UTF-8 bytes are escaped as hexadecimal (\xHH).
Characters such as tab and newline (\t, \n) are properly escaped.
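For the tab and newline case, the standard library's `str::escape_debug` gives a feel for the kind of escaping described. This is only a sketch of the idea using a hypothetical helper, `escape_valid_text`; it does not claim to match `DebugHaystack` output character for character.

```rust
/// Hypothetical helper: escapes control characters such as tab and newline
/// in text that is already valid UTF-8, leaving printable characters alone.
fn escape_valid_text(text: &str) -> String {
    text.escape_debug().to_string()
}
```

Here a literal tab or newline in the input becomes the two-character sequence `\t` or `\n` in the output, while printable non-ASCII text such as `héllo` is left as it is.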