Fix -w #10

Closed
petdance opened this issue Mar 26, 2017 · 12 comments

@petdance
Collaborator

Redo the -w flag to properly handle words.

Update documentation that this affects.

@petdance petdance added this to the 2.999_01 milestone Mar 26, 2017
@n1vux
Contributor

n1vux commented Mar 26, 2017

Xref the discussion under beyondgrep/ack2#445 (Metacharacters and Non-Word chars) and https://github.com/petdance/ack2/issues/565 (Unicode; ugh, but we should have tests to track what DOES work on new enough perls).

@petdance petdance self-assigned this Mar 27, 2017
@petdance
Collaborator Author

This needs #1 for proper testing.

@petdance
Collaborator Author

Right now this has been changed to this:

    if ( $opt->{w} ) {
        $str = "\\b(?:$str)\\b";
    }  

Not sure if this is what we want to live with.
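
For reference, a minimal sketch of how that wrapping behaves. The helper name `wrap_word` and the sample strings are made up for illustration and are not taken from ack's source.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the snippet above.
sub wrap_word {
    my ($str) = @_;
    return "\\b(?:$str)\\b";
}

my $re = wrap_word('foo|bar');    # the string \b(?:foo|bar)\b

print "matches 'a foo b'\n" if 'a foo b' =~ /$re/;
print "skips 'food'\n"      unless 'food' =~ /$re/;
print "skips 'rebar'\n"     unless 'rebar' =~ /$re/;

# Without the (?:...) group, \bfoo|bar\b parses as (\bfoo)|(bar\b),
# so 'food' would match through the first alternative.
print "ungrouped matches 'food'\n" if 'food' =~ /\bfoo|bar\b/;
```

The non-capturing group is what keeps both boundaries attached to every alternative of the user's pattern.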

@n1vux
Contributor

n1vux commented Mar 27, 2017

As discussed on the prior thread, that handles the commonsense case (where $str is in fact a sensible literal word, or a RE that will match word chars only), but it has nasty edge/corner cases when $str may match whitespace or non-word punctuation in first or last position.

The root issue is that the \b zero-width assertion treats non-word printables asymmetrically from word printables, in that (1) there is no \b within a sequence like space punc space punc, and (2) the beginning and end of string count as non-word for \b purposes, so \b(-foo-)\b will never match in first or last position.

We must either (a) document that -w supports only the commonsense case where word means word, and warn/error if the pattern $str starts or ends with non-word punctuation [including metachars; if you want -w $RE you should drop the -w and say which word-effect you want], or (b) define in documentation and tests exactly what -w means when $str matches a non-word "word", and then implement it.

We discussed some ways using \K etc. to make it work in edge cases, but it depends on what we want the edges to be.
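
The two \b edge cases above can be reproduced with a short sketch. The sample strings are illustrative only.

```perl
use strict;
use warnings;

my $edge_re = qr/\b(?:-foo-)\b/;

# At the start or end of a string, \b needs a word char next to it,
# and '-' is not one.
print "no match at string edges\n" unless '-foo-' =~ $edge_re;

# There is no \b between a space and '-', since both are non-word chars.
print "no match between spaces\n" unless 'a -foo- b' =~ $edge_re;

# It only matches when word chars touch the hyphens on both sides.
print "matches inside 'x-foo-y'\n" if 'x-foo-y' =~ $edge_re;
```

So with a pattern that begins and ends in punctuation, plain \b wrapping ends up matching where a word is glued to the pattern, which is roughly the opposite of what -w suggests.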

@petdance
Collaborator Author

I'll be getting at those edge cases today. That's why I didn't close this.

@petdance
Collaborator Author

petdance commented Mar 27, 2017

(?:^|\b|\s)\K(?:PATTERN)(?=\s|\b|$) is what we talked about most recently. We'll have to see what it does to performance.
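
A rough sketch of how that form behaves on the punctuation cases. The helper name `wrap_word_edges` and the sample strings are hypothetical, and this says nothing about performance.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a user pattern in the proposed boundaries.
# Single quotes keep \b, \s, \K and $ literal while the string is built.
sub wrap_word_edges {
    my ($pattern) = @_;
    my $str = '(?:^|\b|\s)\K(?:' . $pattern . ')(?=\s|\b|$)';
    return qr/$str/;
}

my $dash = wrap_word_edges('-foo-');
print "matches at string edges\n" if '-foo-' =~ $dash;
print "matches between spaces\n"  if 'a -foo- b' =~ $dash;

# \K drops the leading whitespace from the reported match.
if ( 'a -foo- b' =~ $dash ) {
    print "matched text: [$&]\n";
}

my $plain = wrap_word_edges('foo');
print "still skips 'food'\n" unless 'food' =~ $plain;
```

Note that \K needs Perl 5.10 or later, and that the trailing side uses a lookahead so the following whitespace is not consumed either.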

@petdance
Collaborator Author

We may also want an optimization: if $pattern =~ /^\w+$/, use the simple \b$pattern\b.
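
One way that fast path could be sketched, assuming the general case falls back to the edge-aware form from the previous comment. The function name `word_regex_source` is hypothetical and this is not ack's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper: pick the regex source for -w based on the pattern.
sub word_regex_source {
    my ($pattern) = @_;

    # Fast path: the pattern is nothing but word characters.
    return '\b' . $pattern . '\b' if $pattern =~ /^\w+$/;

    # General path: the edge-aware form discussed in this thread.
    return '(?:^|\b|\s)\K(?:' . $pattern . ')(?=\s|\b|$)';
}

print word_regex_source('foo'),   "\n";
print word_regex_source('-foo-'), "\n";
```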

@petdance petdance closed this as completed Apr 1, 2017
@patch

patch commented Jun 14, 2017

\b{wb} would be better than \b, at least for acking natural language, but it is relatively new to Perl and the semantics are less well-known than programmers’ longstanding expectations of “word boundaries.”
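
A small sketch of the difference on natural-language text, assuming a perl new enough to support \b{wb}. The sample word is illustrative.

```perl
use v5.22;      # \b{wb} is not available on older perls
use warnings;

my $text = "don't";

# Plain \b sees a boundary on each side of the apostrophe,
# so a lone "t" counts as a word.
say 'plain \b finds "t"' if $text =~ /\bt\b/;

# \b{wb} follows Unicode word-break rules, which keep "don't" together.
say '\b{wb} does not find "t"' unless $text =~ /\b{wb}t\b{wb}/;
say '\b{wb} finds the whole word' if $text =~ /\b{wb}don't\b{wb}/;
```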

@petdance
Collaborator Author

When did it come around? I'm not familiar with it.

@patch

patch commented Jun 14, 2017

Perl 5.22 with notable improvements in 5.24.

References:

@petdance
Collaborator Author

Interesting. Thanks for the pointers.

I would never put something in ack where the behavior changes depending on what version of Perl is running.

@patch

patch commented Jun 14, 2017

The regex syntax and semantics change in every major Perl release, including those of \b when new Unicode “word” characters are added annually, and therefore those of ack change as well, but I understand your desire for a relatively stable flag.
