Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regexp: (|a)* matches more text than (|a)+ does #46123

Closed
rsc opened this issue May 12, 2021 · 1 comment
Closed

regexp: (|a)* matches more text than (|a)+ does #46123

rsc opened this issue May 12, 2021 · 1 comment
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@rsc
Copy link
Contributor

rsc commented May 12, 2021

@BurntSushi noticed that in Rust regex (rust-lang/regex#779):

  • (|a)* matching aaa matches the entire text.
  • (|a)+ matching aaa matches only the empty string at the start of the text.

This is a bit of an unlikely corner case, but it's certainly a significant violation of the rules of regexps for * to match more than +. The correct answer is for (|a)* to match an empty string too, because each iteration prefers the empty string, so even an infinite number of those matches will never fall back to matching an a. (This behavior is what Perl and PCRE do.)

Go's regexp package and RE2 have the same bug, which ultimately derives from a mismatch between the e* picture in my first regexp article and the backtracking prioritization matching in the second article. The implementation of e* needs to be a little different to get the match prioritization correct in the case where e has a preferred empty match.

I found this mismatch between RE2 and Perl in the e* case when running Glenn Fowler's regexp test cases against RE2 long ago, but at the time I thought the mismatch was unavoidable in the proper NFA approach. @BurntSushi's observation that e+ handles the empty match correctly proves otherwise. The correct fix is to leave the compilation of e+ alone and then compile e* as (e+)?. (I will add a note to the article as well.)

This is a tiny bug fix but may of course result in different matches for certain programs, causing problems for programs that rely on the current buggy behavior. That is always the case for any bug fix, of course. Fixing the bug only in Go would also make Go diverge from RE2 and Rust.

@junyer, @BurntSushi, and I discussed both the potential for breakage and the goal of keeping Go, RE2, and Rust aligned. We decided the potential breakage is minimal and outweighed by the benefits of fixing the bug, to better match Perl and PCRE, and to reestablish the properties that e* never matches more than e+ and that e+ and ee* are always the same. We agreed to make the change in all of our implementations.

/cc @charlesmunger and @adonovan for RE2/J as well.

@rsc rsc added the NeedsFix The path to resolution is known, but the work has not been done. label May 12, 2021
@rsc rsc added this to the Go1.17 milestone May 12, 2021
@rsc rsc self-assigned this May 12, 2021
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/318750 mentions this issue: regexp: fix repeat of preferred empty match

sjamesr added a commit to sjamesr/re2j that referenced this issue May 27, 2021
Original go fix:

golang/go@2a61b3c

Change description from that commit:

In Perl mode, (|a)* should match an empty string at the start of the
input. Instead it matches as many a's as possible.
Because (|a)+ is handled correctly, matching only an empty string,
this leads to the paradox that e* can match more text than e+
(for e = (|a)) and that e+ is sometimes different from ee*.

The current code treats e* and e+ as the same structure, with
different entry points. In the case of e* the preference list ends up
not quite in the right order, in part because the "before e" and
"after e" states are the same state. Splitting them apart fixes the
preference list, and that can be done by compiling e* as if it were
(e+)?.

Fixes google#136.
sjamesr added a commit to sjamesr/re2j that referenced this issue May 27, 2021
Ports fix of golang/go#46123 to RE2/J.

Original go fix:

golang/go@2a61b3c

Change description from that commit:

In Perl mode, (|a)* should match an empty string at the start of the
input. Instead it matches as many a's as possible.
Because (|a)+ is handled correctly, matching only an empty string,
this leads to the paradox that e* can match more text than e+
(for e = (|a)) and that e+ is sometimes different from ee*.

The current code treats e* and e+ as the same structure, with
different entry points. In the case of e* the preference list ends up
not quite in the right order, in part because the "before e" and
"after e" states are the same state. Splitting them apart fixes the
preference list, and that can be done by compiling e* as if it were
(e+)?.

Fixes google#136.
sjamesr added a commit to sjamesr/re2j that referenced this issue May 27, 2021
Ports fix of golang/go#46123 to RE2/J.

Original go fix:

golang/go@2a61b3c

Change description from that commit:

In Perl mode, (|a)* should match an empty string at the start of the
input. Instead it matches as many a's as possible.
Because (|a)+ is handled correctly, matching only an empty string,
this leads to the paradox that e* can match more text than e+
(for e = (|a)) and that e+ is sometimes different from ee*.

The current code treats e* and e+ as the same structure, with
different entry points. In the case of e* the preference list ends up
not quite in the right order, in part because the "before e" and
"after e" states are the same state. Splitting them apart fixes the
preference list, and that can be done by compiling e* as if it were
(e+)?.

Fixes google#136.
sjamesr added a commit to sjamesr/re2j that referenced this issue May 27, 2021
Ports fix of golang/go#46123 to RE2/J.

Original go fix:

golang/go@2a61b3c

Change description from that commit:

In Perl mode, (|a)* should match an empty string at the start of the
input. Instead it matches as many a's as possible.
Because (|a)+ is handled correctly, matching only an empty string,
this leads to the paradox that e* can match more text than e+
(for e = (|a)) and that e+ is sometimes different from ee*.

The current code treats e* and e+ as the same structure, with
different entry points. In the case of e* the preference list ends up
not quite in the right order, in part because the "before e" and
"after e" states are the same state. Splitting them apart fixes the
preference list, and that can be done by compiling e* as if it were
(e+)?.

Fixes google#136.
sjamesr added a commit to google/re2j that referenced this issue May 28, 2021
Ports fix of golang/go#46123 to RE2/J.

Original go fix:

golang/go@2a61b3c

Change description from that commit:

In Perl mode, (|a)* should match an empty string at the start of the
input. Instead it matches as many a's as possible.
Because (|a)+ is handled correctly, matching only an empty string,
this leads to the paradox that e* can match more text than e+
(for e = (|a)) and that e+ is sometimes different from ee*.

The current code treats e* and e+ as the same structure, with
different entry points. In the case of e* the preference list ends up
not quite in the right order, in part because the "before e" and
"after e" states are the same state. Splitting them apart fixes the
preference list, and that can be done by compiling e* as if it were
(e+)?.

Fixes #136.
kvch pushed a commit to elastic/beats that referenced this issue Oct 11, 2021
* format of conditional build tags has changed
* matching of * in regexes was fixed, thus breaking some of our code: golang/go#46123
* iproute package was missing from the new Golang Docker image, thus, we had to add it for our tests
* go.mod file contains separate require directive for transitive dependencies
kvch pushed a commit to elastic/beats that referenced this issue Oct 11, 2021
* format of conditional build tags has changed
* matching of * in regexes was fixed, thus breaking some of our code: golang/go#46123
* iproute package was missing from the new Golang Docker image, thus, we had to add it for our tests
* go.mod file contains separate require directive for transitive dependencies

(cherry picked from commit e6839ab)
newly12 pushed a commit to newly12/beats that referenced this issue Oct 13, 2021
* format of conditional build tags has changed
* matching of * in regexes was fixed, thus breaking some of our code: golang/go#46123
* iproute package was missing from the new Golang Docker image, thus, we had to add it for our tests
* go.mod file contains separate require directive for transitive dependencies
kvch added a commit to elastic/beats that referenced this issue Oct 14, 2021
…28335)

* Update go release version 1.17.1 (#27543)

* format of conditional build tags has changed
* matching of * in regexes was fixed, thus breaking some of our code: golang/go#46123
* iproute package was missing from the new Golang Docker image, thus, we had to add it for our tests
* go.mod file contains separate require directive for transitive dependencies

(cherry picked from commit e6839ab)

* Fix build tags for Go 1.17 (#28338)

* additional fixes

* Disable generator tests temporarily (#28362)

* Fix docker import generator for 1.17.1 (#28374)

* Fix liblogparser

* Update go.mod

* update notice

Co-authored-by: apmmachine <[email protected]>
Co-authored-by: Aleksandr Maus <[email protected]>
Co-authored-by: Noémi Ványi <[email protected]>
Co-authored-by: Noémi Ványi <[email protected]>
Co-authored-by: Alex K <[email protected]>
Co-authored-by: Marc Guasch <[email protected]>
fearful-symmetry pushed a commit to elastic/beats that referenced this issue Oct 20, 2021
* singleton sysinfo host to avoid frequently collecting host info

* add Host object to Stats object

* update changelog

* set procStats.host to nil if any error calling sysinfo.Host()

* Update aws-lambda-go library version to 1.13.3 (#28236)

* [cloud][docker] use the private docker namespace (#28286)

* [7.x] [DOCS] Update api_key example on elasticsearch output (#28288)

* packetbeat/protos/dns: don't render missing A and AAAA addresses from truncated records (#28297)

* seccomp: allow clone3 syscall for x86 (#28117)

clone3 is a linux syscall that is now used by glibc starting version
2.34. It is used when pthread_create() gets called. Current seccomp
filters do not allow this syscall leading to crashes like
runtime/cgo: pthread_create failed: Operation not permitted

See elastic/apm-server#6238 for more details

* Osquerybeat: Improve handling of osquery.autoload file, allow customizations (#28289)

Previously the osquery.autoload file was overwritten every time on
osquerybeat start and stamped with our extension.
After the change we check the content of the file and do not overwrite it on
each osquerybeat start. This allows the user to deploy their own
extensions if their want and start osquery with that.

* Osquerybeat: Runner and Fetcher unit tests (#28290)

* Runner and Fetcher unit tests

* Fix header formatting

* Tweak test

* Update go release version 1.17.1 (#27543)

* format of conditional build tags has changed
* matching of * in regexes was fixed, thus breaking some of our code: golang/go#46123
* iproute package was missing from the new Golang Docker image, thus, we had to add it for our tests
* go.mod file contains separate require directive for transitive dependencies

* Move labels and annotations under kubernetes.namespace. (#27917)

* Move labels and annotations under kubernetes.namespace.

* Remove GCP support from Functionbeat (#28253)

* Fix build tags for Go 1.17 (#28338)

* [Elastic Agent] Add ability to communicate with Kibana through service token (#28096)

* Add ability to communicate with Kibana through service token. Add ability to pass service token to container subcommand.

* Add changelog entry.

* Fix go fmt.

* Add username to ASA Security negotiation log (#26975)

* Add username to ASA Security negotiation log

I added the username user.name field to ASA Security negotiation log line.

* adding support for both formats

* adding changelog entry

* updating geo fields in expected output files

* reverse formatting

* reverting to older version of file

* reverting formatting again

* regenrate golden files again

* remove formatting, ready for review

* fixing missing message due to no newline

* fix dissect pattern to fit correctly

Co-authored-by: Marius Iversen <[email protected]>

* x-pack/filebeat/module/cisco: loosen time parsing and add group and session type capture (#28325)

* Redis: remove deprecated fields (#28246)

* Redis: remove deprecated fields

* Disable generator tests temporarily (#28362)

* Windows/perfmon metricset -  remove deprecated perfmon.counters configuration (#28282)

* remove deprecated config

* changelog

* [Filebeat] - S3 Input - Add support for only iterating/accessing only… (#28252)

* [Filebeat] - S3 Input - Add support for only iterating/accessing only specific folders or datapaths

* Breaking change for 8.0, namespace_annotations replaced by namespace.annotations (#28230)

* Breaking change for 8.0, namespace_annotations replaced by namespace.annotations

* Take care of namespace being nil

* [Heartbeat] Setuid to regular user / lower capabilities when possible (#27878)

partial fix for #27648 , this PR:

Detects if the user is running as root then:
Checks to see if an environment variable BEAT_SETUID_AS (set in our Docker.tmpl) is present
Attempts to Setuid , Setgid and Setgroups to that user / groups
Invokes setcap to drop all privileges except NET_RAW+ep
This PR also fixes the broken syscall filtering in heartbeat, some non-syscall strings were breaking that.

With the changes here elastic-agent can still run as root, but the subprocesses can lower their privileges ASAP. This should also make it possible for heartbeat to safely run ICMP pings and synthetics. Synthetics must run as non-root, but ICMP requires NET_RAW. This lets us be consistent in our docs with the recommendation that elastic-agent run as root.

* mage fmt

Co-authored-by: kaiyan-sheng <[email protected]>
Co-authored-by: Victor Martinez <[email protected]>
Co-authored-by: Ugo Sangiorgi <[email protected]>
Co-authored-by: Dan Kortschak <[email protected]>
Co-authored-by: Arnaud Lefebvre <[email protected]>
Co-authored-by: Aleksandr Maus <[email protected]>
Co-authored-by: apmmachine <[email protected]>
Co-authored-by: Michael Katsoulis <[email protected]>
Co-authored-by: Noémi Ványi <[email protected]>
Co-authored-by: Blake Rouse <[email protected]>
Co-authored-by: LaZyDK <[email protected]>
Co-authored-by: Marius Iversen <[email protected]>
Co-authored-by: Andrea Spacca <[email protected]>
Co-authored-by: Mariana Dima <[email protected]>
Co-authored-by: Andrew Cholakian <[email protected]>
Icedroid pushed a commit to Icedroid/beats that referenced this issue Nov 1, 2021
* format of conditional build tags has changed
* matching of * in regexes was fixed, thus breaking some of our code: golang/go#46123
* iproute package was missing from the new Golang Docker image, thus, we had to add it for our tests
* go.mod file contains separate require directive for transitive dependencies
Icedroid pushed a commit to Icedroid/beats that referenced this issue Nov 1, 2021
* singleton sysinfo host to avoid frequently collecting host info

* add Host object to Stats object

* update changelog

* set procStats.host to nil if any error calling sysinfo.Host()

* Update aws-lambda-go library version to 1.13.3 (elastic#28236)

* [cloud][docker] use the private docker namespace (elastic#28286)

* [7.x] [DOCS] Update api_key example on elasticsearch output (elastic#28288)

* packetbeat/protos/dns: don't render missing A and AAAA addresses from truncated records (elastic#28297)

* seccomp: allow clone3 syscall for x86 (elastic#28117)

clone3 is a linux syscall that is now used by glibc starting version
2.34. It is used when pthread_create() gets called. Current seccomp
filters do not allow this syscall leading to crashes like
runtime/cgo: pthread_create failed: Operation not permitted

See elastic/apm-server#6238 for more details

* Osquerybeat: Improve handling of osquery.autoload file, allow customizations (elastic#28289)

Previously the osquery.autoload file was overwritten every time on
osquerybeat start and stamped with our extension.
After the change we check the content of the file and do not overwrite it on
each osquerybeat start. This allows the user to deploy their own
extensions if their want and start osquery with that.

* Osquerybeat: Runner and Fetcher unit tests (elastic#28290)

* Runner and Fetcher unit tests

* Fix header formatting

* Tweak test

* Update go release version 1.17.1 (elastic#27543)

* format of conditional build tags has changed
* matching of * in regexes was fixed, thus breaking some of our code: golang/go#46123
* iproute package was missing from the new Golang Docker image, thus, we had to add it for our tests
* go.mod file contains separate require directive for transitive dependencies

* Move labels and annotations under kubernetes.namespace. (elastic#27917)

* Move labels and annotations under kubernetes.namespace.

* Remove GCP support from Functionbeat (elastic#28253)

* Fix build tags for Go 1.17 (elastic#28338)

* [Elastic Agent] Add ability to communicate with Kibana through service token (elastic#28096)

* Add ability to communicate with Kibana through service token. Add ability to pass service token to container subcommand.

* Add changelog entry.

* Fix go fmt.

* Add username to ASA Security negotiation log (elastic#26975)

* Add username to ASA Security negotiation log

I added the username user.name field to ASA Security negotiation log line.

* adding support for both formats

* adding changelog entry

* updating geo fields in expected output files

* reverse formatting

* reverting to older version of file

* reverting formatting again

* regenrate golden files again

* remove formatting, ready for review

* fixing missing message due to no newline

* fix dissect pattern to fit correctly

Co-authored-by: Marius Iversen <[email protected]>

* x-pack/filebeat/module/cisco: loosen time parsing and add group and session type capture (elastic#28325)

* Redis: remove deprecated fields (elastic#28246)

* Redis: remove deprecated fields

* Disable generator tests temporarily (elastic#28362)

* Windows/perfmon metricset -  remove deprecated perfmon.counters configuration (elastic#28282)

* remove deprecated config

* changelog

* [Filebeat] - S3 Input - Add support for only iterating/accessing only… (elastic#28252)

* [Filebeat] - S3 Input - Add support for only iterating/accessing only specific folders or datapaths

* Breaking change for 8.0, namespace_annotations replaced by namespace.annotations (elastic#28230)

* Breaking change for 8.0, namespace_annotations replaced by namespace.annotations

* Take care of namespace being nil

* [Heartbeat] Setuid to regular user / lower capabilities when possible (elastic#27878)

partial fix for elastic#27648 , this PR:

Detects if the user is running as root then:
Checks to see if an environment variable BEAT_SETUID_AS (set in our Docker.tmpl) is present
Attempts to Setuid , Setgid and Setgroups to that user / groups
Invokes setcap to drop all privileges except NET_RAW+ep
This PR also fixes the broken syscall filtering in heartbeat, some non-syscall strings were breaking that.

With the changes here elastic-agent can still run as root, but the subprocesses can lower their privileges ASAP. This should also make it possible for heartbeat to safely run ICMP pings and synthetics. Synthetics must run as non-root, but ICMP requires NET_RAW. This lets us be consistent in our docs with the recommendation that elastic-agent run as root.

* mage fmt

Co-authored-by: kaiyan-sheng <[email protected]>
Co-authored-by: Victor Martinez <[email protected]>
Co-authored-by: Ugo Sangiorgi <[email protected]>
Co-authored-by: Dan Kortschak <[email protected]>
Co-authored-by: Arnaud Lefebvre <[email protected]>
Co-authored-by: Aleksandr Maus <[email protected]>
Co-authored-by: apmmachine <[email protected]>
Co-authored-by: Michael Katsoulis <[email protected]>
Co-authored-by: Noémi Ványi <[email protected]>
Co-authored-by: Blake Rouse <[email protected]>
Co-authored-by: LaZyDK <[email protected]>
Co-authored-by: Marius Iversen <[email protected]>
Co-authored-by: Andrea Spacca <[email protected]>
Co-authored-by: Mariana Dima <[email protected]>
Co-authored-by: Andrew Cholakian <[email protected]>
@golang golang locked and limited conversation to collaborators May 13, 2022
@rsc rsc removed their assignment Jun 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

2 participants