Skip to content

lucleroy/php-regex

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FluentRegex

PHP library with fluent interface to build regular expressions.

Table of contents

Introduction

Here is a simple example that creates a regular expression to recognize a PHP hexadecimal number (example: 0x1ff).

$regex = Regex::create()
    ->literal('0')->chars('xX')->digit(16)->atLeastOne()
    ->getRegex();

This code is equivalent to:

$regex = '/0[xX][0-9a-fA-F]+/m';

Requirements

PHP 5.5 or more.

Installation (with Composer)

Add the following to the require section of your composer.json file

"lucleroy/php-regex": "*"

and run composer update.

Usage

Workflow

Create a Regex object with Regex::create:

use LucLeroy\Regex;

require 'vendor/autoload.php';

$regex = Regex::create();

Build the regular expression:

$regex->literal('0')->chars('xX')->digit(16)->atLeastOne();

Retrieve the PHP Regular Expression string:

echo $regex->getRegex();              // /0[xX][0-9a-fA-F]+/m
echo $regex->getUtf8Regex();          // /0[xX][0-9a-fA-F]+/mu
echo $regex->getOptimizedRegex();     // /0[xX][0-9a-fA-F]+/mS
echo $regex->getUtf8OptimizedRegex(); // /0[xX][0-9a-fA-F]+/muS

By default, the resulting string is surrounded with '/'. You can change this character:

echo $regex->getRegex('%');              // %0[xX][0-9a-fA-F]+%m
echo $regex->getUtf8Regex('%');          // %0[xX][0-9a-fA-F]+%mu
echo $regex->getOptimizedRegex('%');     // %0[xX][0-9a-fA-F]+%mS
echo $regex->getUtf8OptimizedRegex('%'); // %0[xX][0-9a-fA-F]+%muS

The choosen character is automatically escaped:

$regex = Regex::create()
    ->digit()->atLeastOne()->literal('%/')->digit()->atLeastOne()->literal('%');

echo $regex->getRegex();    // /\d+%\/\d+%/m
echo $regex->getRegex('%'); // %\d+\%/\d+\%%m

Note: when you convert a Regex instance to a string, you get the raw regular expression string. With the preceding example :

echo "$regex"; // \d+%/\d+%

Literal Characters

Use Regex::literal to match literal characters. Special characters are automatically escaped:

echo Regex::create()
    ->literal('1+1=2'); // 1\+1\=2

The expression created by Regex::literal is indivisible: when you put a quantifier next to it, it applies to the whole expression and not only to the last character:

echo Regex::create()
    ->literal('ab')->anyTimes(); // (?:ab)*

echo Regex::create()
    ->literal('a')->literal('b')->anyTimes(); // ab*

Character Sets

Use Regex::chars to match chars in a character set. Use two dots to specify a range of characters.

echo Regex::create()
    ->chars('0..9-A..Z'); // [0-9\-A-Z]

If you want to match characters that are not in a specified set, use Regex::notChars:

echo Regex::create()
    ->notChars('0..9'); // [^0-9]

If you need to add special characters to a character set, you can provide an instance of Charset to the methods Regex::chars and Regex::notChars. For example, the following code matches letters and tabulations:

echo Regex::create()
    ->chars(Charset::create()->chars('a..zA..Z')->tab()); // [a-zA-Z\t]

You can use the following methods to match non-printable characters:

Character ASCII Method
tab 0x09 tab
carriage return 0x0D cr
line feed 0x0A lf
bell 0x07 bell
escape 0x1B esc
form feed 0x0C ff
vertical tab 0x0B vtab
backspace 0x08 backspace

You can use shorthands for common character classes:

Character Class Method
digit digit
word character wordChar
whitespace character whitespace
not digit notDigit
not word character notWordChar
not whitespace character notWhitespace

In addition, you can pass a base (from 2 to 26) to Charset::digit and Charset::notDigit:

echo Regex::create()
    ->chars(Charset::create()->digit()); // [\d]

echo Regex::create()
    ->chars(Charset::create()->digit(2)); // [01]

echo Regex::create()
    ->chars(Charset::create()->digit(16)); // [0-9a-fA-F]

You can match control characters (ASCII codes from 1 to 26) with Charset::control:

echo Regex::create()
    ->chars(Charset::create()->control('C')); // [\cC]

You can match an ANSI character with Charset::ansi:

echo Regex::create()
    ->chars(Charset::create()->ansi(0x7f)); // [\x7F]

You can match a range of ANSI characters with Charset::ansiRange:

echo Regex::create()
    ->chars(Charset::create()->ansiRange(0x20, 0x7f)); // [\x20-\x7F]

Finally, Charset provides some methods to work with Unicode characters.

Use Charset::extendedUnicode to match a Unicode grapheme:

echo Regex::create()
    ->chars(Charset::create()->extendedUnicode()); // [\X]

Use Charset::unicodeChar to match a specific unicode point:

echo Regex::create()
    ->chars(Charset::create()->unicodeChar(0x2122)); // [\x{2122}]

Use Charset::unicodeCharRange to match a range of unicode points:

echo Regex::create()
    ->chars(Charset::create()->unicodeCharRange(0x80, 0xff)); // [\x{80}-\x{FF}]

Use Charset::unicode to match a a Unicode class or category. For your convenience, a Unicode class with Unicode properties is provided:

echo Regex::create()
    ->chars(Charset::create()->unicode(Unicode::Letter)); // [\pL]

Note : all the methods of Charset are available in Regex:

echo Regex::create()
    ->digit();        // \d

echo Regex::create()
    ->digit(8);       // [0-7]

Match any character

If you want to match any character, use Regex::anyChar:

echo Regex::create()
    ->anyChar(); // (?s:.)

Note that the regular expression generated by the previous method matches also newlines. If you don't want to match newlines, use the method Regex::notNewline:

echo Regex::create()
    ->notNewline();   // .

Anchors

To match at the start of the string or at the end of the string, use Regex:startOfString and Regex::endOfString.

echo Regex::create()
    ->startOfString()->literal('123')->endOfString(); // \A123\z

The preceding method matches only at the string ends. If you want to match at the start of a line or at the end of a line, use Regex:startOfLine and Regex::endOfLine.

echo Regex::create()
    ->startOfLine()->literal('123')->endOfLine(); // ^123$

You can match at a word boundary with Regex::wordLimit. To match a position which is not a word boundary, use Regex::notWordLimit:

echo Regex::create()
    ->wordLimit();    // \b

echo Regex::create()
    ->notWordLimit(); // \B

Alternation

Use Regex::alt to create an alternation. There are several ways to provide each choice.

Firstly, you can pass choices as arguments:

$choices = [
    Regex::create()->literal('b'),
    Regex::create()->literal('c')
];

echo Regex::create()
    ->literal('a')
    ->alt($choices);  // a(?:b|c)

Secondly, you can give to the method the number of choices, which are taken from the previous expressions:

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->literal('c')
    ->alt(2);       // a(?:b|c)

Finally, you can mark the position of the first choice with Regex::start and give no argument to the Regex::alt method:

echo Regex::create()
    ->literal('a')
    ->start()
    ->literal('b')
    ->literal('c')
    ->alt();       // a(?:b|c)

If you want to create an alternation with literals only, you can use Regex::literalAlt:

echo Regex::create()
    ->literalAlt(['one', 'two', 'three']);  // one|two|three

Quantifiers

Use Regex::optional to match an optional expression:

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->optional();     // ab?

Use Regex::anyTimes to match any number of consecutive occurences of the previous expression:

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->anyTimes();     // ab*

Use Regex::atLeastOne to match at least one occurences of the previous expression:

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->atLeastOne();   // ab+

Use Regex::atLeast to match a minimum number of occurences of the previous expression:

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->atLeast(2);     // ab{2,}

Use Regex::between to match a number of occurences of the previous expression between two numbers:

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->between(2,5);   // ab{2,5}

Use Regex::times to match a precise number of occurences of the previous expression:

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->times(2);   // ab{2}

Note: instead of add the quantifier to the previous expression, you can provide a Regex instance as last argument of each of these methods.

Greedy, Lazy, Possessive Quantifiers

In the previous examples, the quantifiers are greedy. This is the default behavior. More precisely, a quantifier can have 4 modes: GREEDY, LAZY, POSSESSIVE, and UNDEFINED. When the regular expression string is generated, a quantifier with the UNDEFINED mode is considered as GREEDY. UNDEFINED is the default mode but you can use Regex::greedy, Regex::lazy and Regex::possessive on an empty Regex (just after the creation) to modify the default behavior:

echo Regex::create()
    ->lazy()
    ->literal('a')
    ->anyTimes()
    ->literal('b')
    ->anyTimes();     // a*?b*?

The same methods can be used after a quantifier to change its behavior:

echo Regex::create()
    ->lazy()
    ->literal('a')
    ->anyTimes()
    ->greedy()
    ->literal('b')
    ->anyTimes();    // a*b*?

You can also change the behavior of all quantifiers of a group:

echo Regex::create()
    ->literal('a')->literal('b')->optional()->group(2)->anyTimes()
    ->literal('c')->anyTimes()
    ->alt(2)
    ->lazy();  // (?:ab?)*?|c*?

In the previous example, you can notice that the behavior does not apply to the optional quantifier. You can use Regex::greedyRecursive, Regex::lazyRecursive and Regex::possessiveRecursive to apply the behavior recursively:

echo Regex::create()
    ->literal('a')->literal('b')->optional()->group(2)->anyTimes()
    ->literal('c')->anyTimes()
    ->alt(2)
    ->lazyRecursive();  // (?:ab??)*?|c*?

When applied to a group, all these methods modify the behavior of a quantifier only if it has the UNDEFINED mode. In the example, if the optional quantifier is set to GREEDY, it retains its behavior:

echo Regex::create()
    ->literal('a')->literal('b')->optional()->greedy()->group(2)->anyTimes()
    ->literal('c')->anyTimes()
    ->alt(2)
    ->lazyRecursive();  // (?:ab?)*?|c*?

Grouping and Capturing

By default, when the library needs to create a group, it is not captured. To capture an expression, you must use Regex::capture:

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->literal('c')
    ->alt(2)->capture();  // a(b|c)

To create a named group, give an argument to Regex::capture:

echo Regex::create()
    ->literal('a')->capture('myname'); // (?P<myname>a)

You can group several expressions with Regex::group. As with Regex::alt, you can specify the expressions to group by using the Regex::start method or by giving the number of expressions to group or by giving directly the expression (a Regex instance):

echo Regex::create()
    ->literal('a')
    ->start()
    ->literal('b')
    ->literal('c')
    ->group()->capture();        // a(bc)

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->literal('c')
    ->group(2)->capture();       // a(bc)

$group = Regex::create()->literal('b')->literal('c');
echo Regex::create()
    ->literal('a')
    ->group($group)->capture();  // a(bc)

Backreferences

Use Regex::ref to make a backreference:

echo Regex::create()
    ->literal('a')->anyTimes()->capture()
    ->literal('-')
    ->ref(1);  // (a*)\-\g{1}

echo Regex::create()
    ->literal('a')->anyTimes()->capture('myname')
    ->literal('-')
    ->ref('myname');  // (?P<myname>a*)\-(?P=myname)

Atomic grouping

Use Regex::atomic to make an atomic group:

echo Regex::create()
    ->literal('a')->anyTimes()
    ->atomic();  // (?>a*)

Lookahead, Lookbehind

Use Regex::after, Regex::notAfter, Regex::before, Regex::notBefore:

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->after();        // a(?=b)

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->notAfter();     // a(?!b)

echo Regex::create()
    ->literal('a')
    ->before()
    ->literal('b');   // (?<=a)b

echo Regex::create()
    ->literal('a')
    ->notBefore()
    ->literal('b');   // (?<!a)b

Conditionals

Create a conditional with Regex::cond. This method must be preceded by a condition, an expression to match when the condition is true, and an optional expression to match when the condition is false.

Use Regex::match to check if a captured group matches:

echo Regex::create()
    ->literal('a')->capture()->optional()
    ->match(1)
    ->literal('b')
    ->literal('c')
    ->cond();       // (a)?(?(1)b|c)

echo Regex::create()
    ->literal('a')->capture('myname')->optional()
    ->match('myname')
    ->literal('b')
    ->literal('c')
    ->cond();       // (?P<myname>a)?(?(myname)b|c)

Regex::match can also be used outside of a conditional. In this case, the regular expression fails if captured group does not match:

echo Regex::create()
    ->literal('a')->capture()->optional()
    ->match(1);     // (a)?(?(1)|(?!))

The others allowed conditions are Regex::after, Regex::notAfter, Regex::before, Regex::notBefore:

echo Regex::create()
    ->literal('a')->before()
    ->literal('b')
    ->literal('c')
    ->cond();      // (?(?<=a)b|c)

If you want the 'else' expression to match nothing, you can remove the 'else' expression:

echo Regex::create()
    ->literal('a')->before()
    ->literal('b')
    ->cond();      // (?(?<=a)b|)

If you want the 'then' expression to match nothing, you can use Regex::notCond to inverse the condition:

echo Regex::create()
    ->literal('a')->before()
    ->literal('c')
    ->notCond();   // (?(?<=a)|c)

You can also use Regex::nothing:

echo Regex::create()
    ->literal('a')->before()
    ->nothing()
    ->literal('c')
    ->cond();      // (?(?<=a)|c)

Case sensitivity

By default, the regular expression is case sensitive. Use Regex::caseSensitive or Regex::caseInsensitive to change this behavior. Each of these methods accepts an optional boolean argument. If this argument is false, the behavior is inverted: $regex->caseSensitive(false) is equivalent to $regex->caseInsensitive().

These methods change the behavior of the last expression:

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->caseInsensitive()
    ->literal('c');   // a(?i)b(?-i)c

When used at the beginning of the Regex, the whole expression is affected:

echo Regex::create()
    ->caseInsensitive()
    ->literal('a')
    ->literal('b')
    ->literal('c');   // (?i)abc(?-i)

Recursion

Use Regex::matchRecursive to match recursively the whole pattern. This example matches balanced parentheses:

echo Regex::create()
    ->literal('(')
    ->start()
        ->notChars('()')->atLeastOne()->atomic()
        ->matchRecursive()->anyTimes()
    ->alt()
    ->literal(')');   // \((?:(?>[^\(\)]+)|(?:(?R))*)\)

Special Expressions

Regex::crlf matches a Carriage Return followed by a Line Feed (Windows line breaks):

echo Regex::create()
    ->crlf();   // \r\n

Regex:unsignedIntRange matches a nonnegative integer in a given range. The third parameters specify how leading zeros are handled:

echo Regex::create()
    ->unsignedIntRange(1, 12);       // 1[0-2]|0?[1-9]   leadings zeros are optional

echo Regex::create()
    ->unsignedIntRange(1, 12, true); // 1[0-2]|0[1-9]    leadings zeros are required
    
echo Regex::create()
    ->unsignedIntRange(1, 12, false); // 1[0-2]|[1-9]    leadings zeros are not accepted

Note that in any case, the number of digits cannot exceed the number of digits of the maximum value.