Regex engines only look for 'LF' line endings when using RegexOptions.Multiline mode as opposed to looking for 'CRLF' as well #66845

Closed
joperezr opened this issue Mar 18, 2022 · 3 comments

joperezr commented Mar 18, 2022

When using RegexOptions.Multiline, a $ in a pattern represents end-of-line. It should match at both LF and CRLF line endings, whichever the input uses, but today we only look for LF. This causes problems for patterns that are meant to match at the end of each line, because pattern authors will not always anticipate that the extra carriage return can interfere with the match depending on the input's line endings. I also checked other regex engines such as PCRE, and they do match $ against both types of line endings.

Quick Repro:
If you have a pattern that is trying to get the last word of each line, you might do something like:

var testString = "This is the first example\r\n   This is the second example\r\n";
var testString2 = "This is another example\n    This does work\n";
var regex = new Regex(@"[^\s]+$", RegexOptions.Multiline);
var result = regex.Matches(testString); // no matches: '\r' sits between the last word and '\n'
var result2 = regex.Matches(testString2); // this does work: matches "example" and "work"

This will not work today with any of our engines, since \r won't match [^\s] and it won't match $ either.
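
For readers who need a pattern that tolerates both line endings today, one possible approach is a lookahead that allows an optional \r before the end-of-line position. This is only a sketch, not something proposed in this thread; the variable names are illustrative, and the pattern assumes the intended character class is [^\s].

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var crlfInput = "This is the first example\r\n   This is the second example\r\n";
var lfInput = "This is another example\n    This does work\n";

// The lookahead (?=\r?$) lets an optional '\r' sit between the last word and
// the position before '\n', without making the '\r' part of the match.
var lastWord = new Regex(@"[^\s]+(?=\r?$)", RegexOptions.Multiline);

// Cast<Match>() keeps this working on runtimes where MatchCollection
// only implements the non-generic IEnumerable.
string[] crlfWords = lastWord.Matches(crlfInput).Cast<Match>().Select(m => m.Value).ToArray();
string[] lfWords = lastWord.Matches(lfInput).Cast<Match>().Select(m => m.Value).ToArray();

Console.WriteLine(string.Join(", ", crlfWords)); // example, example
Console.WriteLine(string.Join(", ", lfWords));   // example, work
```

Because the \r is only inspected by the lookahead, the matched values are identical for LF and CRLF input.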

ghost commented Mar 18, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Author: joperezr
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: 7.0.0

@danmoseley

This is closely related to #25598 (which is still up for grabs) although in that case the idea was that you'd opt into the laxer behavior.

I assume that what you see with $ you also see with \z, \Z, and ^ (in multiline mode). See my beautiful table in #25598 (comment).

danmoseley commented Jun 11, 2023

I think we should close this in favor of #25598. We aren't going to change the default behavior of $ as it would be super breaking. The best we can do is add RegexOptions.AnyNewLine which will tell us to make $ match \r\n or \n.

(By the way, we should probably not use the word "match" in this context. $ doesn't ever match anything; it is an assertion the match must satisfy. The docs don't help, e.g., this says "match":

By default, $ matches only the end of the input string. If you specify the RegexOptions.Multiline option, it matches either the newline character (\n) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression \r?$ instead of just $.

)
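
To make the assertion point concrete, here is a small sketch with an illustrative input string (not taken from the issue). It shows that $ on its own produces a zero-length match, and that the \r?$ workaround quoted from the docs consumes the \r, so the \r ends up in the match value unless the text of interest is captured in a group.

```csharp
using System;
using System.Text.RegularExpressions;

var input = "abc\r\ndef\n";

// '$' alone succeeds at the position just before the first '\n'
// and consumes no characters.
Match anchorOnly = Regex.Match(input, @"$", RegexOptions.Multiline);
Console.WriteLine($"{anchorOnly.Index}, {anchorOnly.Length}"); // 4, 0

// With \r?$ the optional '\r' is consumed, so it is part of Match.Value.
Match withCr = Regex.Match(input, @"\w+\r?$", RegexOptions.Multiline);
Console.WriteLine(withCr.Value.Length); // 4, i.e. "abc" plus '\r'

// Capturing the word separately keeps the '\r' out of the extracted text.
Match grouped = Regex.Match(input, @"(\w+)\r?$", RegexOptions.Multiline);
Console.WriteLine(grouped.Groups[1].Value); // abc
```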

@ghost ghost locked as resolved and limited conversation to collaborators Jul 11, 2023