Understanding regular expressions

Regular expressions - as any expressions - are just a bunch of operators with their operands. You learned to master arithmetic expressions from chilhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regular expressions with very wide possibilities. Let's go?

Look at a simple math expression: x + y * 2. There are two operators: + and *. The operands of * are y and 2. The operands of + are x and the result of y * 2. Easy?

Thinking about that expression deeper we can find that there is a definite order of evaluation, defined by operator's precedence. The * has precedence over +, so it is evaluated first. You can change the evaluation order by using parentheses: (x + y) * 2 will evaluate + first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their arity - this is just the number of operands required for the operator. In the example above, + and * are binary operators - they both take two operands. Most of arithmetic operators are binary, but the - has also the unary (single operand) form, like in this equation: y = -x. Note that the unary and binary minuses work differently.

Now any expression is just a lego game, where you set a sequence of operators with correct number of operands for each, taking heed of their arity and precedence, using parentheses when needed. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use another operands and operators, but they are governed by the same rules of precedence and arity.

Regular expressions

Regular expressions is a powerful mechanism for searching in strings using patterns. So their operands are characters and sets of characters, that are allowed in particular position. A is a regular expressions that matches a single character A. The operators in regular expressions define the way to combine individual characters into patterns:

  1. sequence (concatenation operator)
  2. alternation
  3. repetition (quantifier operator)

The concatenation is so simple operator, that it doesn't have any character for it at all - just write a sequence of some characters, and they'll be concatenated.

Alternative is written as vertical bar. Regex A|BC matches strings A and BC.

There are few forms of quantifiers - most commonly used are ? (repeat zero or one times), * (zero or more times) and + (one or more times). You can specify mininimum and maximum number of repetitions in curly braces: a{4}, a{1,4}, a{4,}.

The special characters that define operators should be escaped when used as operands, i.e. preceded by \. Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+, -, etc), but when constructing a pattern for matching you should be able to use any character as an operand.

Character sets allow you to specify several possible characters for one place. They can be defined in many different ways: by enumeration of characters in square brackets [abc123], by ranges in square brackets [a-z], by special sequences (\d means any digit, \W means anything except a letter, digit and underscore, [[:alpha:]] means any letter etc). One important type of operands is a simple assertions: they allow you to test some conditions - start of the string ^, end of the string $ or word border \b.

Precedence and order of evaluation

A quantifier has precedence over concatenation and concatenation has precedence over alternative. Let's look what it means:

  1. quantifiers over concatenation means that quantifiers are executed first and will repeat only a single character if used without parentheses:
    • many times* matches many time followed by zero or more s
    • (many times)* matches many times zero or more times: changing the previous regex by using parentheses allows us to define a string repetition
  2. concatenation over alternative means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
    • first|second|third matches first, second and third
    • (first |second |)part matches first part, second part and part - typical use of an empty alternative
  3. quantifier over alternative means that you should use parentheses to repeat an alternative set:
    • first|second* matches first or secon followed by zero or more d like secondddddd
    • (first|second)* matches first or second, repeated zero or more time in any order, like firstsecondfirstfirst. Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
      • (1|2){2} matches 11, 12, 21 and 22, not just 11 and 22
      • 1{2}|2{2} matches 11 and 22 only

The internal structure of a regular expression can be viewed well on the syntax tree. The operands/operators that executed first are placed lower on the tree (or to the right in horizontal mode). The operator/operand that is executed last is the root of the tree. You can compare syntax trees and explaining graphs for the examples above if this section doesn't seems clear to you.

Anchoring

Anchoring is used to set restrictions on the matching process by using simple assertions:

Note that simple assertions are concatenated with regex and concatenation has precedence over alternative, this makes it's usage slightly tricky: