This linter searches for regular expressions that may, under certain inputs,
exhibit catastrophic backtracking in the Python re
module. For more information on catastrophic backtracking see:
- Runaway Regular Expressions: Catastrophic Backtracking
- Preventing Regular Expression Denial of Service (ReDoS)
- Regex Performance
- Regular Expression Denial of Service (ReDoS) and Catastrophic Backtracking
- Javascript Catastrophic Backtracking
There are many ways that ReDoS can occur. The following examples show ReDoS via nested quantifier and mutually inclusive alternation, respectively:
import re
subject = 'x' * 64
re.search(r'(x+x+)+y', subject) # Boom
import re
subject = 'a' * 64
re.search(r'(.|[abc])+z', subject) # Boom
The above examples can be corrected with the following changes:
import re
subject = 'x' * 64
re.search(r'xx+y', subject)
import re
subject = 'a' * 64
re.search(r'.+z', subject)
Rarely will there be one-size-fits-all solutions to catastrophic backtracking. Developing a deeper understanding of the issue, and regular expressions in general, is often the best solution. Whenenever you're working with regular expressions you must always ask yourself if this is a potentially catastrophic expression.
Catastrophic backtracking will often lead to denial-of-service. Catastrophic cases may take days, weeks, or years to complete which may leave your service degraded or unusable.
- Nested quantifiers with small maximums may be okay (e.g.
{1,3}
). However, sensible runtimes for your code are application-dependent. Even with small maximums the runtime will depend on many factors including subject length, machine hardware, and repetition size. Proceed with nested quantifiers at your own risk.
The Dlint ReDoS module (dlint.redos
) comes with a CLI interface for quick
debugging and testing. E.g.
$ python -m dlint.redos --pattern '(.|[abc])+z'
('(.|[abc])+z', True)
$ python -m dlint.redos --pattern '.+z'
('.+z', False)
You can dump the pattern in a more digestible format:
$ python -m dlint.redos --pattern '(.|[abc])+z' --dump
MAX_REPEAT 1 MAXREPEAT
SUBPATTERN 1 0 0
BRANCH
ANY None
OR
IN
LITERAL 97
LITERAL 98
LITERAL 99
LITERAL 122
Or dump the internal tree representation that Dlint uses:
$ python -m dlint.redos --pattern '(.|[abc])+z' --dump-tree
None: ()
MAX_REPEAT: (1, MAXREPEAT)
SUBPATTERN: (1, 0, 0)
BRANCH: ()
ANY: (None,)
IN: ((LITERAL, 97), (LITERAL, 98), (LITERAL, 99))
LITERAL: (122,)
You can also extend this functionality to analyze patterns from other regular expression modules or languages by piping a pattern to this interface:
$ echo -n '(.|[abc])+z' | python -m dlint.redos --pattern -
('(.|[abc])+z', True)