Tutorial on named capture regular expressions in R and Python

tdhock/regex-tutorial

In this 60-minute tutorial I will explain how to use named capture regular expressions to extract data from several different kinds of structured text data.

For additional reading see my R Journal articles about namedCapture and nc.

Motivation for using named capture regular expressions, 5 minutes

Why would you want to use named capture regular expressions? They are useful when you want to extract groups of substrings from text data which has some structure, but no consistent delimiter such as tabs or commas between groups. They make it easy to convert such loosely structured text data into regular CSV/TSV data.

  • The regular expression 5 foo bar matches any string that contains 5 foo bar as a substring.
  • The regular expression foo|bar matches any string that contains foo or bar. The vertical bar indicates alternation – if any one of the options is present, then there is a match.
  • Square brackets are used to indicate a character class. The regular expression [0-9] foo bar means match any digit, followed by a space, followed by foo bar.
  • A capturing regular expression includes parentheses for extracting data when there is a match. For example if we apply the regular expression ([0-9]) (foo|bar) to the string prefix 8 foo suffix, we put 8 in the first capture group and foo in the second.
  • A named capture regular expression includes group names. For example if we apply the regular expression (?<number>[0-9]) (?<string>foo|bar) to the string prefix 8 foo suffix, we put 8 in the capture group named number, and foo in the capture group named string.
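
The bullet points above can be tried out with Python's standard re module. This is only a minimal sketch: Python's re module spells named groups (?P<name>...) rather than (?<name>...), and every subject string other than prefix 8 foo suffix is made up for illustration.

```python
import re

subject = "prefix 8 foo suffix"

# Literal pattern: matches when the subject contains the exact substring.
assert re.search(r"5 foo bar", "x 5 foo bar y") is not None
assert re.search(r"5 foo bar", subject) is None

# Alternation: either option is enough for a match.
assert re.search(r"foo|bar", subject).group() == "foo"

# Character class: any single digit, then a space, then "foo bar".
assert re.search(r"[0-9] foo bar", "x 7 foo bar y").group() == "7 foo bar"

# Capture groups are numbered by position.
numbered = re.search(r"([0-9]) (foo|bar)", subject).groups()
print(numbered)  # ('8', 'foo')

# Named capture groups are looked up by name.
named = re.search(r"(?P<number>[0-9]) (?P<string>foo|bar)", subject).groupdict()
print(named)  # {'number': '8', 'string': 'foo'}
```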

Named capture regular expressions are better than simple capturing regular expressions, since you can refer to the extracted data by name rather than by an arbitrary index. That results in code that is a bit more verbose, but much easier to understand. For example in Python,

import re
subject = 'chr10:213,054,000-213,055,000'
# Without named capture: groups are numbered by position.
group_tuple = re.search(r"(chr.*?):(.*?)-([0-9,]*)", subject).groups()
print(group_tuple[1])
# With named capture: groups are looked up by name.
group_dict = re.search(r"""
(?P<chrom>chr.*?)
:
(?P<chromStart>.*?)
-
(?P<chromEnd>[0-9,]*)
""", subject, re.VERBOSE).groupdict()
print(group_dict["chromStart"])

Both print statements show the same thing, but the intent of the second is clearer for two reasons:

  • The group names in the regular expression serve to document their purpose. Regular expressions have a bad reputation as a write-only language, but named capture can make them more readable: “Hmmm… what was the second group .*? supposed to match? Oh yeah, the chromStart!”
  • We can extract the data by group name (chromStart) rather than an arbitrary index (1), clarifying the intent of the Python code.
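
A further practical consequence can be sketched with a hypothetical revision of the pattern: if a new group is later added in front of the others, every numeric index shifts, while lookups by name keep working. The optional genome prefix and the subject string below are assumptions made up for this illustration.

```python
import re

subject = "hg19:chr10:213,054,000-213,055,000"

# Hypothetical revision of the pattern: a new optional group comes first.
pattern = re.compile(r"""
(?:(?P<genome>hg[0-9]+):)?
(?P<chrom>chr.*?)
:
(?P<chromStart>.*?)
-
(?P<chromEnd>[0-9,]*)
""", re.VERBOSE)

match = pattern.search(subject)
# Index 1 used to be chromStart; now it is the chromosome.
print(match.groups()[1])
# The name still refers to the same piece of data.
print(match.group("chromStart"))
```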

History, 5 minutes

| Who | When | First |
|---|---|---|
| Kleene | 1956 | Regular expression on paper |
| Thompson | 1968 | Regular expression in a program |
| Thompson | 1974 | grep |
| Wall | 1994 | Perl 5 (? extensions |
| Hazel | 1997 | PCRE |
| Kuchling et al. | 1997 | Named capture in Python 1.5 |
| R core | 2002 | PCRE in R |
| Hazel | 2003 | Named capture in PCRE |
| Hocking | 2011 | Named capture in R |
| Hocking | 2016 | extractall in Python pandas |

Regular sets and regular expressions were introduced on paper by Stephen Cole Kleene in 1956 (including the “Kleene star” * for zero or more). Among the first to use a regular expression in a program was Ken Thompson (Bachelors 1965, Masters 1966, UC Berkeley), in his version of the QED text editor (1968) and in ed (1969), developed at Bell Labs for Unix. In ed, g/re/p means “Global Regular Expression Print,” which gave the name to the grep program, also written by Thompson (1974). I’m not sure about the origin of capture groups, but Friedl claimed that “The regular expressions supported by grep and other early tools were quite limited…grep’s capturing metacharacters were \(...\), with unescaped parentheses representing literal text.” Larry Wall wrote Perl version 1 in 1987 while working at Unisys Corporation, and it had capturing regular expressions. Perl version 5 in 1994 introduced many extensions using the (? notation. Sources: wikipedia:Regular_expression and “A Casual Stroll Across the Regex Landscape,” in Ch.3 of Friedl’s book Mastering Regular Expressions.

Philip Hazel started writing the Perl-Compatible Regular Expressions (PCRE) library for the exim mail program in 1997. Python used PCRE starting with version 1.5 in 1997. Source: Python-1.5/Misc/HISTORY.

From 1.5a3 to 1.5a4...
- A completely new re.py module is provided (thanks to Andrew
Kuchling, Tim Peters and Jeffrey Ollie) which uses Philip Hazel's
"pcre" re compiler and engine.

Python 1.5 introduced named capture groups and the (?P<name>subpattern) syntax. Source: Python-1.5/Doc/libre.tex.

\item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
the text matched by the group is accessible via the symbolic group
name \var{name}.

PCRE support for named capture was introduced in 2003. Source: PCRE changelog (my copy).

Version 4.0 17-Feb-03...
36. Added support for named subpatterns. The Python syntax (?P<name>...) is
used to name a group. Names consist of alphanumerics and underscores, and must
be unique. Back references use the syntax (?P=name) and recursive calls use
(?P>name) which is a PCRE extension to the Python extension. Groups still have
numbers.

PCRE added support for alternative named capture syntax in 2006:

Version 7.0 19-Dec-06
...
34. Added a number of extra features that are going to be in Perl 5.10. On the
    whole, these are just syntactic alternatives for features that PCRE had
    previously implemented using the Python syntax or my own invention. The
    other formats are all retained for compatibility.

    (a) Named groups can now be defined as (?<name>...) or (?'name'...) as well
        as (?P<name>...). The new forms, as well as being in Perl 5.10, are
        also .NET compatible.

R includes PCRE starting with version 1.6.0 in 2002. Source: R-src/NEWS.1.

CHANGES IN R VERSION 1.6.0...
    o	grep(), (g)sub() and regexpr() have a new argument `perl'
	which if TRUE uses Perl-style regexps from PCRE (if installed).

I wrote the code in https://svn.r-project.org/R/trunk/src/main/grep.c which implements named capture regular expression support for R. It was merged into base R in 2011 (https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=14518), and has been included with every copy of R since version 2.14.

I wrote the str.extractall method in pandas, first included with release version 0.18.0 (my Pull Request was merged in Feb 2016).

Current usage in R and Python, 10 minutes

When developing complex regular expressions, it is useful to have interactive visual feedback about which parts of your subject string match. This functionality is provided by M-x re-builder in Emacs (mastering emacs) and by web pages such as pythex.

For a subject s (R character vector or Python pandas Series) and a regular expression pattern p,

| R nc package | Python pandas | returns |
|---|---|---|
| capture_first_vec(s, p) | s.str.extract(p) | first match |
| capture_all_str(s, p) | s.str.extractall(p) | all matches |

Note that in R/nc, the pattern p should be defined as a list, with named elements used as capture groups (without literal parentheses); in Python pandas, p is a string with capture groups defined as (?P<name>pattern).

Named capture in R

Base R supports named capture regular expressions via C code that interfaces with the Perl-Compatible Regular Expressions (PCRE) library. The base functions regexpr(p, s) and gregexpr(p, s) use PCRE when given the perl=TRUE argument. The first argument p is a single regular expression pattern (character vector of length 1), and the second argument s is the character vector of subjects (strings to parse). However, their output is a set of integers (match positions and lengths) and group names, which is not very user-friendly.

Instead I recommend using the nc package, which provides the capture_first_vec(s, p) and capture_all_str(s, p) functions. They are a user-friendly interface to the base regexpr and gregexpr functions. They return data tables with column names as defined in the regular expression. To install the nc package, run the following command in R:

install.packages("nc")

Notes on related functions/packages: read my research paper about namedCapture.

Named capture in Python

The re module of the Python Standard Library implements named capture regular expression support via the m.groupdict() method for a match object m.
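
As a minimal sketch of that method, the code below uses re.finditer to visit every match in a subject, m.groupdict() to turn each match into a dictionary, and the csv module to write the result as regular CSV. The subject text with two genomic positions is made up for illustration.

```python
import csv
import io
import re

subject = "peaks at chr10:213,054,000-213,055,000 and chrX:1,000-2,000"

pattern = re.compile(r"""
(?P<chrom>chr[^:]+)
:
(?P<chromStart>[0-9,]+)
-
(?P<chromEnd>[0-9,]+)
""", re.VERBOSE)

# One dictionary per match, keyed by group name.
rows = [m.groupdict() for m in pattern.finditer(subject)]

# Write the rows as CSV text, one column per named group.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(pattern.groupindex))
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Values that contain commas, such as 213,054,000, are quoted by the csv module so the output stays parseable.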

For data analysis I recommend using the pandas library, which supports named capture regular expressions via the s.str.extract(p) and s.str.extractall(p) methods for a subject Series s and a regular expression pattern p. Both methods are an interface to the re module, and return a DataFrame with one column per capture group: extract gives one row per subject (the first match, or missing values if there is no match), and extractall gives one row per match. To install pandas, execute the shell command:

pip install pandas
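
Once pandas is installed, the two methods can be compared on a small Series. This is a minimal sketch; the three subject strings are made up for illustration, and the last one deliberately contains no match.

```python
import pandas as pd

s = pd.Series([
    "chr10:213,054,000-213,055,000",
    "chrM:111,000-222,000 chrY:333-444",
    "no position here",
])
p = r"(?P<chrom>chr[^:]+):(?P<chromStart>[0-9,]+)-(?P<chromEnd>[0-9,]+)"

# First match only: one row per subject, missing values when nothing matches.
first = s.str.extract(p)
print(first)

# All matches: one row per match, indexed by (subject, match number).
every = s.str.extractall(p)
print(every)
```

In both results the column names come from the group names in the pattern.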

Some exercises

| Exercises | data | code/solution | functions |
|---|---|---|---|
| | | chr.pos.R | capture_first_vec, capture_all_str, gsub |
| | | differences_from_R.py | re.search, re.compile |
| | | chr_pos.py | str.extract, str.findall, re.subn |
| qsub exercises | qsub data | qsub-out.R | capture_first_vec |
| trackDb exercises | trackDb.txt, trackDb2.txt | trackDb.R | capture_all_str |
| R-Forge exercises | R-Forge/ | | capture_first_vec, capture_all_str |
| SweeD exercises | SweeD/ | | capture_all_str |
| NEWS conversion | NEWS/ | old/ | capture_all_str |
| CRAN check logs | cran-check-logs/ | blog | capture_all_str |
| torchvision docs | torchvision-docs/ | TODO | capture_all_str |
| universalmutator | mutate/ | mutate/mutate.R | capture_first_vec, capture_all_str |
| Key names and hex codes | HTML | blog | capture_all_str |
| Variable name styles | naming-styles/expected.R | issue | grepl |
| Bibliography parsing | biblio | blog | capture_first_vec, capture_all_str |
| Collaboration not allowed | HTML | blog | capture_all_str, strsplit, etc. |

Questions from the audience, 10 minutes

Have you ever extracted data from text files? Show us how you extracted some data from a particular text file, and we will try to suggest improvements.

Polynomial time named capture

Russ Cox’s “Regular Expression Matching Can Be Simple And Fast” explains that, due to backreference support, several common regular expression engines can take exponential time in the worst case (including PCRE, which is used by R, and the backtracking engine behind Python’s re module). One way to achieve a speedup is to drop backreference support and use the re2 C++ library, which supports named capture. If you need a provably fast (polynomial time) regular expression engine in R, I recommend using the corresponding R package, re2. One example of when re2 is useful: validating email using a specific problematic pattern. See also the speed benchmark figure in my namedCapture paper.

| R function | library | named capture | complexity |
|---|---|---|---|
| regexpr(perl=FALSE) | TRE | no | polynomial |
| stringi::stri_match() | ICU | yes | exponential |
| regexpr(perl=TRUE) | PCRE | yes | exponential |
| re2::re2_match() | RE2 | yes | polynomial |

# Install the benchmark dependencies if they are missing.
if(!require(re2))install.packages("re2")
if(!require(stringi))install.packages("stringi")
if(!require(atime))install.packages("atime")
max.N <- 25
atime.list <- atime::atime(
  N=1:max.N,
  setup={
    # Subject is N copies of "a"; pattern is N copies of "a?" then N copies of "a".
    subject <- paste(rep("a", N), collapse="")
    pattern <- paste(rep(c("a?", "a"), each=N), collapse="")
  },
  seconds.limit=0.1,
  ICU=stringi::stri_match(subject, regex=pattern),
  PCRE=regexpr(pattern, subject, perl=TRUE),
  TRE=regexpr(pattern, subject, perl=FALSE),
  # re2_match takes the subject (string) first, then the pattern.
  RE2=re2::re2_match(subject, pattern))
plot(atime.list)

figure-complexity-log.png
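
The same worst-case behavior can be observed with Python's standard re module, which is also a backtracking engine. The sketch below is a rough Python analog of the benchmark above, not a replacement for it: the helper name time_match and the chosen sizes are assumptions for illustration, and the sizes are kept small so the script finishes quickly.

```python
import re
import time

def time_match(n):
    """Seconds for Python's re module to match n copies of a? then n copies of a."""
    subject = "a" * n
    pattern = "a?" * n + "a" * n
    start = time.perf_counter()
    match = re.search(pattern, subject)
    elapsed = time.perf_counter() - start
    # The only way to match is for every a? to match the empty string.
    assert match is not None and match.span() == (0, n)
    return elapsed

# Keep n small: with a backtracking engine the time grows very quickly with n.
timings = {n: time_match(n) for n in (5, 10, 15, 18)}
for n, seconds in timings.items():
    print(f"n={n:2d} seconds={seconds:.6f}")
```

Exact timings depend on the machine and Python version, so no expected output is shown.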

References

http://www.regular-expressions.info has basic reference material on how to write regular expressions in several languages. However, it discusses neither named capture in R nor pandas in Python.

The definitive reference on current regular expression implementations is the book “Mastering Regular Expressions,” by Jeffrey E.F. Friedl. It contains a discussion of Python and named capture, but discusses neither pandas nor R.
