Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Space in re's {min,max} should raise an error, rather than fail silently #86635

Open
reuven mannequin opened this issue Nov 26, 2020 · 9 comments
Open

Space in re's {min,max} should raise an error, rather than fail silently #86635

reuven mannequin opened this issue Nov 26, 2020 · 9 comments
Labels
3.10 only security fixes topic-regex type-bug An unexpected behavior, bug, or error

Comments

@reuven
Copy link
Mannequin

reuven mannequin commented Nov 26, 2020

BPO 42469
Nosy @rhettinger, @terryjreedy, @ezio-melotti, @serhiy-storchaka, @reuven, @akulakov

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2020-11-26.07:09:54.395>
labels = ['expert-regex', 'type-bug', '3.10']
title = "Space in re's {min,max} should raise an error, rather than fail silently"
updated_at = <Date 2021-07-17.14:52:02.804>
user = 'https://github.com/reuven'

bugs.python.org fields:

activity = <Date 2021-07-17.14:52:02.804>
actor = 'andrei.avk'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Regular Expressions']
creation = <Date 2020-11-26.07:09:54.395>
creator = 'reuven'
dependencies = []
files = []
hgrepos = []
issue_num = 42469
keywords = []
message_count = 5.0
messages = ['381879', '381919', '381994', '382045', '397725']
nosy_count = 7.0
nosy_names = ['rhettinger', 'terry.reedy', 'ezio.melotti', 'mrabarnett', 'serhiy.storchaka', 'reuven', 'andrei.avk']
pr_nums = []
priority = 'low'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue42469'
versions = ['Python 3.10']

@reuven
Copy link
Mannequin Author

reuven mannequin commented Nov 26, 2020

I just discovered that having whitespace inside of {min,max} causes the regexp to report no matches, rather than an error:

>>> import re
>>> s = 'abbcccddddeeeee'
>>> re.findall('d{1, 4}', s)
[]
>>> re.findall('d{1,4}', s)
['dddd']

Ruby and JavaScript have the same behavior, so maybe this is standard in some way. But I find it hard to believe that it's desirable. (My post on Twitter about this confirmed that a whole lot of people were bitten by this bug in the past.)

BSD grep, GNU grep, and GNU Emacs all raise an error upon encountering this whitespace, which strikes me as less surprising and more useful behavior.

@reuven reuven mannequin added 3.9 only security fixes topic-regex labels Nov 26, 2020
@rhettinger
Copy link
Contributor

If this can be fixed without doing brain surgery on the code, it would be nice to have this flagged as an error. People should learn as early as possible that regex patterns tend to be unforgiving about extra spaces.

@rhettinger rhettinger added 3.10 only security fixes type-bug An unexpected behavior, bug, or error and removed 3.9 only security fixes labels Nov 26, 2020
@terryjreedy
Copy link
Member

Pending further information, I believe that expecting '{1, 4}' to be interpreted as an re quantifier involves two mental errors: a) thinking that the domain specific re language allows optional whitespace and b) thinking that '{' is a special character only used for quantifier patterns. I propose changes directed at both errors.

In Python code, a space ' ' is always special (indent, required separator, optional separator) except in comments and strings. In the latter, a space is ordinary except as seen otherwise by reader, including humans. Functions that read a string as Python code make them special again. AFAIK, re functions always see spaces as ordinary unless the re.VERBOSE compile flag is passed, and even then they are only sometimes ignored.

Suggestion 1. Because this is contrary to human habit, put a special disclaimer at the end of the 2-sentence paragraph beginning "Some characters, like '|' or '(', are special." Add something like "Space ' ' is always ordinary, like 'c' is. Do not put one where a 'c' could not go."

"The special characters are:" is misleading because the special bracketed quantifier patterns that follow are composed of ordinary characters. (See below.)

Suggestion 2. Add 'and patterns' after 'characters'. Or put the quantifier patterns after a separate header.

'[' is special in that it always begins a set pattern. ']' is always special when preceded by '['. It is an error if the set is not closed with ']'. In particular, compile('[') raises.

'{' and }' are different.  They are not special by themselves anymore than digits and ',' are, whereas within quantifier patterns, all characters, not just '{' and '}', have special interpretations. In particular, compile('{') matches instead of raising like '[' does.
>>> re.findall('{', 'a{b')
['{']

Only a complete quantifier pattern is special. As near as I can tell, re.compile act at least as if it tries to match '{\d*(,\d*)?}' (with re.ASCII flag to only allow ascii digits?) when it encounters '{'. If this fails, it continues with '{' treated as ordinary. So '{1, 4}', '{1,c4}', '{ 1,4}', '{1,4x}', and '{0x33}' are all compiled as sequences of 6 ordinary characters that match themselves.

Suggestion 3. Somewhere say that '{' and '}' are ordinary unless there is a complete quantifier match, with nothing but digits[,digits] between '{}', with nothing else, including ' ', added.
---

Turning '{' into a special character by making it an error when it does not begin a quantifier would break all intentional uses of '{' as ordinary. I don't see making '{' sort-of special by sometimes raising and sometime not by a new special rule to be much better. In particular, I don't like having only extra ' 's or other whitespace raise. AFAIK, re is unforgiving of any extra chars, not just spaces.

The re.VERBOSE causes whitespace to be sometimes be ignored. The list of situations when not ignored does not, to me, obviously include quantifiers. But it should.

>>> re.findall("""d{1,4}""", 'dddd', re.X)
['dddd']
>>> re.findall("""d{1, 4}""", 'dddd', re.X)
[]

Suggestion 4. Say so, by adding 'or quantifier' after 'character class'

@serhiy-storchaka
Copy link
Member

"{" has special meaning only when it is used in the special forms "\{[0-9]+\}", "\{,[0-9]+\}" or "\{[0-9]+,[0-9]+\}". In all other cases it means literal "{". I once tried to emit warnings when non-escaped "{" is used in literal meaning, but it did break a lot of code (for example in string.Template). I decided that it is too destructive change.

Now, we can emit a warning (and later error) if there are extra spaces. It is a breaking change, but I hope it will not break much user code. But which cases should be recognized as error and which silently accepted with literal meaning of "{"? Should we only handle spaces after comma, or before and after digits? Should we check spaces only or tabs and newlines and other whitespaces?

What about the verbose mode? Currently the third-party regex module ignores whitespace and comments in the verbose mode in more places than the re module. The re module consider some sequences atomic and does not allow whitespaces inside them even in the verbose mode, for simplicity and performance.

@akulakov
Copy link
Contributor

I think the biggest issue here is not how to explain it clearly in the docs, but that a lot of users will not confirm this in the docs based on the combination of:

  1. c{1,5} is similar to slicing, also similar to python list definition, and python sets, all of which allow spaces for readability.

  2. there is no error or warning

  3. c{1,5} having one meaning and c{1, 5} having completely different meaning with no warning is deeply weird unless you think thoroughly through all of the implications.

I would prefer a warning to be added. This leads to the question, where do you draw the line after which a warning is no longer added?

I think I would allow this form as the most extreme where warning is still shown:

c{ 1, 5 }

For any extra commas, newlines, etc, I would not show warnings because they look/feel different enough vs. {1,5}.

It seems to me that VERBOSE should have no effect inside of {1,5}.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@serhiy-storchaka
Copy link
Member

See also #91760. Currently Python is more lenient for group numbers. \g< 1 >, \g<01>, \g<1_0>, \g<+1>, \g<¹>, \g<१> are accepted. This is dues to different implementations for {n,m} and \g<n>. For {n,m} it scans while character is in the range [0-9]. For \g<n> it scans until character >, and therefore allows spaces, _, + and non-ASCII digits.

So should we recognize {01,02}, {1_0,2_0}, {+1,+2}, {¹,²}, {१,२} as the repetition quantifier with a warning or silently accept them as literals?

@ezio-melotti
Copy link
Member

"{" has special meaning only when it is used in the special forms "{[0-9]+}", "{,[0-9]+}" or "{[0-9]+,[0-9]+}". In all other cases it means literal "{".

It seems that \{[0-9]*,[0-9]*\} is interpreted as a quantifier (i.e. {m}, {m,n}, {,n}, {m,}, {,} are all valid), and other variations are interpreted as literals. I find this somewhat surprising, since I thought the { always had to be escaped unless it was within a character sets ([...]), and I would have expected an error from this too:

>>> re.findall(r'\w{x,y}', 'a{x,y}')  # x,y is not a valid range!
['a{x,y}']
>>> re.findall(r'\w{1,2}', 'a{1,2}')
['a', '1', '2']

I once tried to emit warnings when non-escaped "{" is used in literal meaning, but it did break a lot of code (for example in string.Template). I decided that it is too destructive change.

Do you remember what else broke? I would be curious to see a few examples if you still have the patch around to see better evaluate the trade-offs.

It seems that the cons of requiring { to be escaped are:

  • existing code will need to be updated or break
  • certain regex will become more complex (due to the additional escapes)
  • the behavior will diverge from other implementation

And the pros are:

  • some silent failures will be exposed
  • certain regex will become more explicit (due to the additional escapes)

A warning might be a good compromise, since it exposes the problems without breaking the code or diverging from other implementation.

As for the the extra spaces within {...}, I think it's reasonable to raise a warning too if we find something that looks like a repetition quantifier (i.e. \{\s*\d*\s*,\s*\d*\s*\}) but it's not. If doing this is too complicated and/or negatively impacts performance, then we might add a warning in the docs instead (this could be done regardless).

@serhiy-storchaka
Copy link
Member

No, seems I have not saved it. But I have recreated it, the following patch fixes most of test failures:

diff --git a/Lib/imaplib.py b/Lib/imaplib.py
index fa4c0f8f62..682da1d844 100644
--- a/Lib/imaplib.py
+++ b/Lib/imaplib.py
@@ -111,7 +111,7 @@
         br' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
         br'"')
 # Literal is no longer used; kept for backward compatibility.
-Literal = re.compile(br'.*{(?P<size>\d+)}$', re.ASCII)
+Literal = re.compile(br'.*\{(?P<size>\d+)\}$', re.ASCII)
 MapCRLF = re.compile(br'\r\n|\r|\n')
 # We no longer exclude the ']' character from the data portion of the response
 # code, even though it violates the RFC.  Popular IMAP servers such as Gmail
@@ -127,7 +127,7 @@
 Untagged_status = re.compile(
     br'\* (?P<data>\d+) (?P<type>[A-Z-]+)( (?P<data2>.*))?', re.ASCII)
 # We compile these in _mode_xxx.
-_Literal = br'.*{(?P<size>\d+)}$'
+_Literal = br'.*\{(?P<size>\d+)\}$'
 _Untagged_status = br'\* (?P<data>\d+) (?P<type>[A-Z-]+)( (?P<data2>.*))?'
 
 
diff --git a/Lib/logging/__init__.py b/Lib/logging/__init__.py
index d6315b0473..ca6588a6a2 100644
--- a/Lib/logging/__init__.py
+++ b/Lib/logging/__init__.py
@@ -455,7 +455,7 @@ class StrFormatStyle(PercentStyle):
     asctime_format = '{asctime}'
     asctime_search = '{asctime'
 
-    fmt_spec = re.compile(r'^(.?[<>=^])?[+ -]?#?0?(\d+|{\w+})?[,_]?(\.(\d+|{\w+}))?[bcdefgnosx%]?$', re.I)
+    fmt_spec = re.compile(r'^(.?[<>=^])?[+ -]?#?0?(\d+|\{\w+\})?[,_]?(\.(\d+|\{\w+\}))?[bcdefgnosx%]?$', re.I)
     field_spec = re.compile(r'^(\d+|\w+)(\.\w+|\[[^]]+\])*$')
 
     def _format(self, record):
diff --git a/Lib/string.py b/Lib/string.py
index 2eab6d4f59..5b3f0e1356 100644
--- a/Lib/string.py
+++ b/Lib/string.py
@@ -78,7 +78,7 @@ def __init_subclass__(cls):
             {delim}(?:
               (?P<escaped>{delim})  |   # Escape sequence of two delimiters
               (?P<named>{id})       |   # delimiter and a Python identifier
-              {{(?P<braced>{bid})}} |   # delimiter and a braced identifier
+              \{{(?P<braced>{bid})\}} | # delimiter and a braced identifier
               (?P<invalid>)             # Other ill-formed delimiter exprs
             )
             """
diff --git a/Lib/sysconfig.py b/Lib/sysconfig.py
index e21b7303fe..275c45c1b3 100644
--- a/Lib/sysconfig.py
+++ b/Lib/sysconfig.py
@@ -179,7 +179,7 @@ def joinuser(*args):
 # like old-style Setup files).
 _variable_rx = r"([a-zA-Z][a-zA-Z0-9_]+)\s*=\s*(.*)"
 _findvar1_rx = r"\$\(([A-Za-z][A-Za-z0-9_]*)\)"
-_findvar2_rx = r"\${([A-Za-z][A-Za-z0-9_]*)}"
+_findvar2_rx = r"\$\{([A-Za-z][A-Za-z0-9_]*)\}"
 
 
 def _safe_realpath(path):
diff --git a/Lib/test/test_argparse.py b/Lib/test/test_argparse.py
index 1f03b7fb24..383d424d5a 100644
--- a/Lib/test/test_argparse.py
+++ b/Lib/test/test_argparse.py
@@ -2175,7 +2175,7 @@ def test_required_subparsers_no_destination_error(self):
             parser.parse_args(())
         self.assertRegex(
             excinfo.exception.stderr,
-            'error: the following arguments are required: {foo,bar}\n$'
+            r'error: the following arguments are required: \{foo,bar\}\n$'
         )
 
     def test_wrong_argument_subparsers_no_destination_error(self):
@@ -2187,7 +2187,7 @@ def test_wrong_argument_subparsers_no_destination_error(self):
             parser.parse_args(('baz',))
         self.assertRegex(
             excinfo.exception.stderr,
-            r"error: argument {foo,bar}: invalid choice: 'baz' \(choose from 'foo', 'bar'\)\n$"
+            r"error: argument \{foo,bar\}: invalid choice: 'baz' \(choose from 'foo', 'bar'\)\n$"
         )
 
     def test_optional_subparsers(self):
diff --git a/Lib/test/test_string.py b/Lib/test/test_string.py
index 824b89ad51..e6e9f38c9a 100644
--- a/Lib/test/test_string.py
+++ b/Lib/test/test_string.py
@@ -335,7 +335,7 @@ class MyPattern(Template):
             pattern = r"""
             (?P<escaped>@{2})                   |
             @(?P<named>[_a-z][._a-z0-9]*)       |
-            @{(?P<braced>[_a-z][._a-z0-9]*)}    |
+            @\{(?P<braced>[_a-z][._a-z0-9]*)\}  |
             (?P<invalid>@)
             """
         m = Mapping()
@@ -351,7 +351,7 @@ class BadPattern(Template):
             (?P<badname>.*)                     |
             (?P<escaped>@{2})                   |
             @(?P<named>[_a-z][._a-z0-9]*)       |
-            @{(?P<braced>[_a-z][._a-z0-9]*)}    |
+            @\{(?P<braced>[_a-z][._a-z0-9]*)\}  |
             (?P<invalid>@)                      |
             """
         s = BadPattern('@bag.foo.who likes to eat a bag of @bag.what')
@@ -490,7 +490,7 @@ class BadPattern(Template):
             (?P<badname>.*)                  |
             (?P<escaped>@{2})                   |
             @(?P<named>[_a-z][._a-z0-9]*)       |
-            @{(?P<braced>[_a-z][._a-z0-9]*)}    |
+            @\{(?P<braced>[_a-z][._a-z0-9]*)\}  |
             (?P<invalid>@)                      |
             """
         s = BadPattern('@bag.foo.who likes to eat a bag of @bag.what')
@@ -520,7 +520,7 @@ class BadPattern(Template):
             (?P<badname>.*)                  |
             (?P<escaped>@{2})                   |
             @(?P<named>[_a-z][._a-z0-9]*)       |
-            @{(?P<braced>[_a-z][._a-z0-9]*)}    |
+            @\{(?P<braced>[_a-z][._a-z0-9]*)\}  |
             (?P<invalid>@)                      |
             """
         s = BadPattern('@bag.foo.who likes to eat a bag of @bag.what')
diff --git a/Tools/scripts/texi2html.py b/Tools/scripts/texi2html.py
index c06d812ab3..6af085b20c 100755
--- a/Tools/scripts/texi2html.py
+++ b/Tools/scripts/texi2html.py
@@ -1703,7 +1703,7 @@ class HTMLHelp:
     Favorites tabs.
     """
 
-    codeprog = re.compile('@code{(.*?)}')
+    codeprog = re.compile(r'@code\{(.*?)\}')
 
     def __init__(self,helpbase,dirname):
         self.helpbase    = helpbase

Affected modules are imaplib, logging, string, sysconfig and script texi2html. Also some code is failed in pip at attempt to compile b'(\x1b|~{)', and test_xxtestfuzz is crashed for some causes.

It is usually something like {(\w+)}. Looking at these examples I am sure that a lot of user code would be broken by this change.

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Apr 30, 2022
… {n,m}

* For numbers surrounded with spaces.
* For negative numbers.
@serhiy-storchaka
Copy link
Member

#92077 emits a warning if the the pattern looks like a {min,max} repetition with additional spaces or with negative numbers.

The latter can happen when we generate pattern: "x{%d,%d}" % (min, max).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.10 only security fixes topic-regex type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

5 participants