Using Regular Expressions

Regular expressions are a powerful tool, but they are also very expensive in terms of memory. Correct and useful functionality remains the priority, but the following tips minimize the impact without reducing capability.

  • Consider non-regular-expression options. strings.Contains(), strings.Replace(), and strings.ReplaceAll() are dramatically faster and less memory-intensive than regular expressions. If one of these works equally well, use it instead of a regular expression.
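As a rough illustration of this tip, the sketch below performs the same literal check and the same literal replacement with and without a regular expression. The names `tokenRE`, `hasTokenRegexp`, `hasTokenPlain`, and `maskToken`, and the `token=` literal, are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tokenRE is a hypothetical pattern that contains no regular expression
// syntax at all: it only matches the literal text "token=".
var tokenRE = regexp.MustCompile(`token=`)

// hasTokenRegexp reports whether s contains "token=" using the regexp engine.
func hasTokenRegexp(s string) bool {
	return tokenRE.MatchString(s)
}

// hasTokenPlain performs the same check with strings.Contains.
func hasTokenPlain(s string) bool {
	return strings.Contains(s, "token=")
}

// maskToken replaces every literal "token=" with "t=" without a regexp.
func maskToken(s string) string {
	return strings.ReplaceAll(s, "token=", "t=")
}

func main() {
	line := "GET /login?token=abc123"
	fmt.Println(hasTokenRegexp(line), hasTokenPlain(line))
	fmt.Println(maskToken(line))
}
```

Because the pattern is a plain literal, both checks agree on every input, so the `strings` version is the one to prefer here.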

  • Order character classes consistently. We use regular expression caching to reduce our memory footprint, and caching is more effective when character classes are consistently ordered. Since a character class is a set, order does not affect functionality. We have many equivalent regular expressions that differ only in character class order. We recommend the following order for consistency:

    1. Numeric range, i.e., digits (e.g., 0-9)

    2. Uppercase alphabetic range (e.g., A-Z, A-F)

    3. Lowercase alphabetic range (e.g., a-z, a-f)

    4. Underscore (_)

    5. Everything else (except dash, -) in ASCII order: \t\n\r !"#$%&()*+,./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^abcdefghijklmnopqrstuvwxyz{|}~

    6. Last, dash (-)

      For example, the following expressions are equivalent but differ in character class ordering:

      `[_a-zA-Z0-9-,.]` // wrong ordering
      `[0-9A-Za-z_,.-]` // correct
      `[;a-z0-9]` // wrong ordering
      `[0-9a-z;]` // correct
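To show why ordering matters for caching, here is a minimal sketch that assumes a cache keyed by the pattern's source text. The `patternCache` type and its methods are hypothetical and are not the caching implementation this guide refers to. Two patterns that match exactly the same strings still occupy two cache entries when their character classes are ordered differently.

```go
package main

import (
	"fmt"
	"regexp"
)

// patternCache is a hypothetical cache keyed by the pattern source text.
type patternCache struct {
	compiled map[string]*regexp.Regexp
}

func newPatternCache() *patternCache {
	return &patternCache{compiled: make(map[string]*regexp.Regexp)}
}

// get returns the compiled expression for pattern, compiling it on first use.
func (c *patternCache) get(pattern string) (*regexp.Regexp, error) {
	if re, ok := c.compiled[pattern]; ok {
		return re, nil
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	c.compiled[pattern] = re
	return re, nil
}

// size reports how many distinct patterns have been compiled and stored.
func (c *patternCache) size() int {
	return len(c.compiled)
}

func main() {
	cache := newPatternCache()
	// Same character set, different ordering inside the class.
	patterns := []string{`^[_a-zA-Z0-9-,.]+$`, `^[0-9A-Za-z_,.-]+$`}
	for _, p := range patterns {
		re, err := cache.get(p)
		if err != nil {
			fmt.Println("compile error:", err)
			return
		}
		fmt.Println(p, re.MatchString("a_1,b.c-d"))
	}
	// Each distinct spelling is stored separately, even though both
	// patterns accept the same input.
	fmt.Println("cached patterns:", cache.size())
}
```

Writing both call sites with the recommended ordering would make the second lookup a cache hit instead of a second compilation.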
  • Inside character classes, avoid unnecessary character escaping. Go accepts extra escapes without complaint, but they create additional spellings of the same expression and so reduce cache effectiveness. Inside a character class, most characters do not need to be escaped, because Go treats them as literal characters.

    • The following characters normally have special meaning in regular expressions but do not need to be escaped inside a character class: $, (, ), *, +, ., ?, ^, {, |, }. The one exception is ^ in the first position of the class, where it negates the class.

    • Dash (-), when it is last in the character class or otherwise unambiguously not part of a range, does not need to be escaped. If in doubt, place the dash last in the character class (e.g., [a-c-]) or escape the dash (e.g., \-).

    • Square brackets ([, ]) should always be escaped in a character class.

      For example, each pair of expressions below is equivalent, but the first of each pair includes unnecessary character escapes:

      `[\$\(\.\?\|]` // unnecessary escapes
      `[$(.?|]`      // correct
      `[a-z\-0-9_A-Z\.]` // unnecessary escapes, wrong order
      `[0-9A-Za-z_.-]`   // correct
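One way to check that removing escapes or reordering a class has not changed its meaning is to compare the two classes character by character. The sketch below does this for the ASCII range only. The helper `sameASCIIClass` is hypothetical and assumes each argument is a single bracketed character class.

```go
package main

import (
	"fmt"
	"regexp"
)

// sameASCIIClass reports whether two character-class patterns accept exactly
// the same set of single ASCII characters. It is a hypothetical helper that
// assumes each argument is one bracketed class such as `[0-9a-z]`.
func sameASCIIClass(a, b string) (bool, error) {
	ra, err := regexp.Compile("^" + a + "$")
	if err != nil {
		return false, err
	}
	rb, err := regexp.Compile("^" + b + "$")
	if err != nil {
		return false, err
	}
	for c := 0; c < 128; c++ {
		s := string(rune(c))
		if ra.MatchString(s) != rb.MatchString(s) {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	pairs := [][2]string{
		{`[\$\(\.\?\|]`, `[$(.?|]`},
		{`[a-z\-0-9_A-Z\.]`, `[0-9A-Za-z_.-]`},
	}
	for _, p := range pairs {
		same, err := sameASCIIClass(p[0], p[1])
		if err != nil {
			fmt.Println("compile error:", err)
			continue
		}
		fmt.Println(p[0], "vs", p[1], "equivalent:", same)
	}
}
```

A check like this could run in a unit test when normalizing existing expressions, so that a cleanup intended only to improve caching cannot silently change what is matched.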