Regular Expressions

Text Searching with Power

Regular Expressions allow us to efficiently apply a pattern to any amount of source text and:

Determine if the text matches the pattern.
Determine in how many places the text matches the pattern.
Retrieve the matched portion of the text.
Retrieve matched sub-patterns.
Replace the matched portions with new text.
Patterns

The pattern specificed by a Regular Expression - a.k.a. RegEx or RE - consists of literal text to search for as well as RE metacharacters representing pattern components and features. REs are by design compact, terse, and very precise.

You can find a number of online Regular Expression testers, and there are many command-line tools (grep, sed, awk, etc.) that process Regular Expressions.

Let's experiment with these using http://regex101.com/

Options

Global (g): By default, once a match is found the regex stops processing; the global option continues processing the RE over the rest of the input text.
Multiline (m): By default, ^ matches the very beginning of the input text, and $ the very end; if the input text is actually multiple lines (text with embedded newlines), the multiline option causes ^ to match the beginning and $ to match the end of each embedded line.
Case-insensitive (i): RE pattern matching is case-sensitive, unless you use the case-insensitive option.
Literals and Wild Cards

An ordinary letter or digit is interpreted literally (except when part of certain particular RE components... coming up)
. is a wildcard that matches any single character other than the newline \n.
A pair of square brackets containing a sequence of characters matches an occurence of any of those characters.
These are called character classes:
[aeiou] - match any vowel
[0123456789] - match any digit
[a-z] - match any lowercase letter.
In square brackets, a hyphen between to characters means a range; for a literal hyphen, include it next to either the opening or closing bracket
[a-z-] - match any lowercase letter or hyphen character;
A leading ^ inside the brackets inverts the meaning of the match:
[^0123456789] - match anything EXCEPT a digit
Special character-class shorthands include:
\w - match a "word" character (characters you'd use in a variable name): same as [a-zA-Z0-9_]
\W - match anything EXCEPT a word character.
\d - match a digit character: same as [0123456789] or [0-9]
\D - match anything EXCEPT a digit.
\s - match a white-space character: space, tab, carriage return, newline, etc
\S - match character EXCEPT white space
A character that would otherwise be an RE metacharacter can be escaped with a backslash:
\[ - match an actual opening square bracket
\. - match an actual dot character
\( - match an actual parenthesis
Assertions

Also called anchors or even zero-width assertions, these specify a position in the source text, not any text itself.

^ - match the beginning of the source text.
$ - match the end of the source text.
\b - match a boundary between word and non-word characters.
Quantifier

A quantifier specifies the number of occurances of the item that precedes it:

* - a sequence of zero or more if the item
^\d* - match text beginning with a sequence of digits, or with no digits at all. A quantifier specifies the number of occurances of the item that precedes it:
+ - a sequence of one or more if the item
^\d+ - match text beginning with a sequence at least one digit.
? - either zero or one of the item
foo\d? - match "foo", possibly followed by a digit.
{} - specify exact, minimum, or minimum and maximum occurances of the item
\w{4} - match a sequence of four word characters.
\w{4,} - match a sequence of at least four word characters.
\w{4,7} - match a sequence of at least four but no more than seven word characters.
Grouping and Alternatives

| - match one of a list of pipe-separated alternatives
this|that|the other - match "this" or "that" or "the other"
Parenthesis define groups, which allow:
Subexpressions: th(is|at|e other)
Capture: \((d\{3})\) (\d\d\d-\d\d\d\d) - match a phone number of the form (303) 555-1234, and remember the area code and phone number after the match. Captured groups are numbered left-to-right (starting with 1) as you count the opening parentheses.
Area code will be in $1.
Phone number will be in $2.