Regular Expressions

Regular Expression: A sequence of characters that specifies a pattern.

Tangent: “Regular” comes from the term used to describe grammars and formal languages (“regular languages”).

Remember: Regular expressions are just patterns; what they do is determined by how a utility uses them.
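
As a small illustration of that point, the sketch below applies one pattern in two different ways, roughly how a search utility and a substitution utility would. It uses Python's re module as a stand-in for the utilities covered in these notes; the variable names are made up for the example.

```python
import re

pattern = r"cks"      # the pattern by itself does nothing
line = "UNIX rocks"

# Use 1: ask whether the line contains the pattern (search-style use).
is_match = re.search(pattern, line) is not None

# Use 2: replace whatever the pattern matched (substitute-style use).
replaced = re.sub(pattern, "x", line)

print(is_match)
print(replaced)
```

The pattern is identical in both calls; only the operation applied to it changes.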

Two Types of Regular Expressions:

  1. “Basic”
  2. “Extended”

Regular Expression Metacharacters

Regular Expression   Class      Type            Meaning
.                    all        Character Set   A single character (except newline)
^                    all        Anchor          Beginning of line
$                    all        Anchor          End of line
[...]                all        Character Set   Range of characters
*                    all        Modifier        Zero or more duplicates
\<                   Basic      Anchor          Beginning of word
\>                   Basic      Anchor          End of word
\(..\)               Basic      Backreference   Remembers pattern
\1..\9               Basic      Reference       Recalls pattern
\{M,N\}              Basic      Modifier        M to N duplicates
+                    Extended   Modifier        One or more duplicates
?                    Extended   Modifier        Zero or one duplicate
(...|...)            Extended   Alternation     Matches either alternative

Note: This is just a summary; examples and detailed descriptions are further down.

Regular Expressions vs. Shell Metacharacters

Regular expressions and shell metacharacters are often confused with each other, but are very different.

Shell Metacharacters:

Remember: Wrap regular expressions in quotes to prevent the shell from expanding them!

Structure of a Regular Expression

Three Important Parts of a Regular Expression:

  1. Anchors: Specify position of pattern in relation to a line of text.
  2. Character Sets: Match one or more characters in a single position.
  3. Modifiers: Specify how many times the previous character set is repeated.

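Examples: Regular expressions (matches are listed below each line)

cks

UNIX rocks
  (matches “cks” in “rocks”)
UNIX sucks
  (matches “cks” in “sucks”)
UNIX is okay
  (no match)

apple

Scrapple from the apple
  (matches “apple” in “Scrapple” and the final “apple”)

^#*

UNIX rocks
UNIX sucks
UNIX is okay
  (every line matches: #* allows zero “#” characters, so the pattern matches the empty string at the start of each line)

b[eor]at

I tried to eat bratwurst on the boat, but was beaten to the punch by a seagull
  (matches “brat” in “bratwurst”, “boat”, and “beat” in “beaten”)

b[^eo]at

I tried to eat bratwurst on the boat, but was beaten to the punch by a seagull
  (matches only “brat” in “bratwurst”)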
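
The examples above can be checked with a short script. This is a minimal sketch that assumes Python's re module, which handles these particular patterns the same way the basic-regular-expression utilities do; the variable names are illustrative only.

```python
import re

lines = ["UNIX rocks", "UNIX sucks", "UNIX is okay"]
sentence = ("I tried to eat bratwurst on the boat, "
            "but was beaten to the punch by a seagull")

# Plain characters match themselves anywhere on the line.
cks_lines = [ln for ln in lines if re.search(r"cks", ln)]

# ^ anchors to the start of the line; #* allows zero "#" characters,
# so every line matches (with an empty match).
hash_lines = [ln for ln in lines if re.search(r"^#*", ln)]

# [eor] matches exactly one of e, o, or r in that position.
included = re.findall(r"b[eor]at", sentence)

# [^eo] matches any single character except e or o.
excluded = re.findall(r"b[^eo]at", sentence)

print(cks_lines)
print(hash_lines)
print(included)
print(excluded)
```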

Utilities using Regular Expressions

Utility   Regular Expression Type
vi        Basic
sed       Basic
grep      Basic
more      Basic
ed        Basic
expr      Basic
awk       Extended
nawk      Extended
egrep     Extended
emacs     Emacs regular expressions
perl      Perl regular expressions

Character Classes

Named Character Classes

Commonly-used character classes can be referred to by name.

Format: [:name:]

Example: Using named character classes.

Character Class   Named Character Class Equivalent
[a-zA-Z]          [[:alpha:]]
[a-zA-Z0-9]       [[:alnum:]]
[45a-z]           [45[:lower:]]

More on Character Classes

Examples: Some more examples

  1. Ranges can also be specified in character classes.

Examples: Using ranges

  2. You can also combine multiple ranges.

Example: Combining ranges

  3. You can mix explicit characters and character ranges.

Example: Mixing character ranges and an explicit character

Note: The characters ] and - do not have a special meaning if they directly follow a [.
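
The three points and the note can be sketched as follows. This assumes Python's re module, which treats these bracket expressions the same way; the sample string and variable names are invented for illustration.

```python
import re

sample = "a1]-Z45"

digits = re.findall(r"[0-9]", sample)           # a single range
alnum = re.findall(r"[a-zA-Z0-9]", sample)      # several ranges combined
mixed = re.findall(r"[45a-z]", sample)          # explicit characters plus a range
bracket_first = re.findall(r"[]0-9]", sample)   # ] directly after [ is literal
dash_first = re.findall(r"[-0-9]", sample)      # - directly after [ is literal

print(digits)
print(alnum)
print(mixed)
print(bracket_first)
print(dash_first)
```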

Exercises: Reading Regular Expressions

Instructions: Read the following regular expressions and determine what they do/match.

Questions:

  1. ^T[a-z][aeiou]
  2. [-0-9]
  3. [0-9\-b\]]
  4. U[nN][iI]X

Answers:

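  1. Matches a 3-character sequence at the beginning of a line: T, then any lowercase letter, then a lowercase vowel.
  2. Matches any single character that is a digit or the symbol -.
  3. Matches any single character that is a digit, the symbol -, the letter b, or the symbol ]. (This reading assumes the tool treats \ inside brackets as an escape; some POSIX tools treat it as a literal character there, and []0-9b-] is a spelling that avoids the backslashes.)
  4. Matches the strings UniX, UnIX, UNiX, or UNIX.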
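
One way to check these answers is to run the patterns against a few sample strings. The sketch below assumes Python's re module, where a backslash inside brackets is an escape (the reading answer 3 relies on); the sample strings are made up.

```python
import re

# 1. Anchored at the start of the line: T, lowercase letter, vowel.
q1_hit = re.search(r"^T[a-z][aeiou]", "Tea time").group(0)
q1_miss = re.search(r"^T[a-z][aeiou]", "a Tea")

# 2. A digit or a literal "-".
q2 = re.findall(r"[-0-9]", "x-42")

# 3. A digit, "-", "b", or "]".
q3 = re.findall(r"[0-9\-b\]]", "ab]-7")

# 4. Only the n and the i may be either case.
words = ["UniX", "UnIX", "UNiX", "UNIX", "unix", "Unix"]
q4 = [w for w in words if re.fullmatch(r"U[nN][iI]X", w)]

print(q1_hit, q1_miss, q2, q3, q4)
```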

More on Anchors

Anchors: Used to match at the beginning (^) or end ($) of a line (or both).

Pattern   Matches
^A        “A” at the beginning of a line
A$        “A” at the end of a line
A^        “A^” anywhere on a line
$A        “$A” anywhere on a line
^^        “^” at the beginning of a line
$$        “$” at the end of a line

Examples: Using anchors

^bingus

bingus bingus bingus
  (matches only the first “bingus”)

bingus$

bingus bingus bingus
  (matches only the last “bingus”)
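
A quick way to see which occurrence an anchored pattern picks is to print where each match starts. This is a sketch using Python's re module, which handles ^ at the start and $ at the end of a pattern the same way for a single line of text; the variable names are illustrative.

```python
import re

text = "bingus bingus bingus"

# Start offsets of every match; anchors leave only one candidate each.
at_start = [m.start() for m in re.finditer(r"^bingus", text)]
at_end = [m.start() for m in re.finditer(r"bingus$", text)]

# Using both anchors requires the whole line to be the pattern.
whole_line_long = re.search(r"^bingus$", text)
whole_line_short = re.search(r"^bingus$", "bingus")

print(at_start, at_end)
print(whole_line_long, whole_line_short is not None)
```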

More on Modifiers

Matching a Single Character (.)

. is a special character that, by itself, matches any character (except the end of line character).

Example: Using .

o.

For me to poop on.
  (matches “or”, “o ” (the “o” in “to” plus the space after it), “oo”, and “on”)

Repetition in Regular Expressions (*)

* is a special character used to specify zero or more occurrences of the single regular expression preceding it.

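Examples: Using *

ya*y

I got mail, yaaaaaay!
  (matches “yaaaaaay”)

oa*o

For me to poop on.
  (matches “oo” in “poop”: zero “a”s between the two “o”s)

a.*e

Scrapple from the apple.
  (matches “apple from the apple”: .* matches as much as it can)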
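
The behavior of . and * can be sketched in a few lines. This assumes Python's re module, where both characters have the same meaning as in basic regular expressions; the variable names are made up.

```python
import re

phrase = "For me to poop on."
mail = "I got mail, yaaaaaay!"

# . matches any one character, including a space.
any_char = re.findall(r"o.", phrase)

# a* matches zero or more "a" characters between the two "y"s.
many_a = re.findall(r"ya*y", mail)

# Zero occurrences also count, so "oo" matches oa*o.
zero_a = re.findall(r"oa*o", phrase)

# .* is greedy: it runs from the first "a" to the last "e".
greedy = re.findall(r"a.*e", "Scrapple from the apple.")

print(any_char)
print(many_a)
print(zero_a)
print(greedy)
```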

Repetition Ranges (\{ and \})

To specify a minimum and maximum number of repeats, put the minimum and maximum between \{ and \}.

Note: Unlike shell metacharacters, backslash turns on the special meaning for { and }.

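Notation (\{ \}): Ways to specify a range of repetitions for the preceding regular expression:

  \{N\}     exactly N repetitions
  \{N,\}    N or more repetitions
  \{M,N\}   at least M and at most N repetitions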

Examples: Using \{ \}

Regular Expression   Matches
.\{0,\}              Exactly the same as .*
a\{2,\}              Exactly the same as aaa*
*                    Any line with an asterisk
\*                   Any line with an asterisk
\\                   Any line with a backslash
^*                   Any line starting with an asterisk
^A*                  Any line
^A\*                 Any line starting with “A*”
^AA*                 Any line starting with one “A”
^AA*B                Any line starting with one or more “A”s followed by a “B”
^A\{4,8\}B           Any line starting with 4, 5, 6, 7 or 8 “A”s followed by a “B”
^A\{4,\}B            Any line starting with 4 or more “A”s followed by a “B”
^A\{4\}B             Any line starting with “AAAAB”
\{4,8\}              Any line with “{4,8}”
A{4,8}               Any line with “A{4,8}”
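
A few of the rows above can be checked mechanically. The sketch below uses Python's re module, which is an assumption worth flagging: Python writes repetition ranges as {M,N} without backslashes (the reverse of the basic syntax shown in the table), so the patterns are translated accordingly.

```python
import re

# Lines made of 3, 4, 8, and 9 "A"s followed by a "B".
samples = {n: "A" * n + "B" for n in (3, 4, 8, 9)}

def counts_matching(pattern):
    """Return the A-counts whose sample line matches the pattern."""
    return [n for n, s in samples.items() if re.search(pattern, s)]

four_to_eight = counts_matching(r"^A{4,8}B")   # basic syntax: ^A\{4,8\}B
four_or_more = counts_matching(r"^A{4,}B")     # basic syntax: ^A\{4,\}B
exactly_four = counts_matching(r"^A{4}B")      # basic syntax: ^A\{4\}B

# a{2,} and aaa* accept the same strings.
words = ["a", "aa", "baaad"]
with_range = [bool(re.search(r"a{2,}", w)) for w in words]
with_star = [bool(re.search(r"aaa*", w)) for w in words]

print(four_to_eight, four_or_more, exactly_four)
print(with_range == with_star)
```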

Repetitions of Subexpressions (\( and \))

\( \) groups parts of an expression into subexpressions.

Note: Unlike shell metacharacters, backslash turns on the special meaning for ( and ).

Examples: Grouping expressions with \(\)

abc*

\(abc\)*

\(abc\)\{2,3\}
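
To see what grouping changes, the three patterns can be compared on whole strings. This is a sketch in Python's re module, which writes groups as ( ) and ranges as { } without backslashes, so the patterns below are translations of the basic-syntax ones above; the helper name full is made up.

```python
import re

def full(pattern, s):
    """True if the pattern matches the entire string s."""
    return re.fullmatch(pattern, s) is not None

# Without grouping, * applies only to the "c".
star_on_c = [full(r"abc*", s) for s in ("ab", "abccc", "abcabc")]

# With grouping, * applies to the whole "abc".
star_on_group = [full(r"(abc)*", s) for s in ("", "abcabc", "abccc")]

# A repetition range on a group: "abc" two or three times.
counted_group = [full(r"(abc){2,3}", "abc" * n) for n in (1, 2, 3, 4)]

print(star_on_c)
print(star_on_group)
print(counted_group)
```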

Backreferences (\(,\) and \1)

Backreferences let you remember what you found and see if the same pattern occurred again:

  1. Mark part of a pattern with \( and \).
  2. Recall the remembered pattern with \ followed by a single digit.

Examples: Using backreferences

\([a-z]\)\1

\([a-z]\)\1\1

\([a-z]\)\([a-z]\)[a-z]\2\1
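
The three patterns match a doubled letter, a tripled letter, and a five-letter palindrome. A minimal sketch, assuming Python's re module (groups are written ( ) rather than \( \), while \1 and \2 work the same way); the sample words are invented.

```python
import re

# \1 must repeat exactly what the first group matched.
double = re.search(r"([a-z])\1", "letter").group(0)

# Two recalls of the same group: three identical letters in a row.
triple = re.search(r"([a-z])\1\1", "brrr").group(0)

# Two groups recalled in reverse order: a 5-letter palindrome.
palindrome_pattern = r"([a-z])([a-z])[a-z]\2\1"
palindrome = re.search(palindrome_pattern, "a radar blip").group(0)
not_palindrome = re.search(palindrome_pattern, "hello")

print(double, triple, palindrome, not_palindrome)
```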

Extended Regular Expression

egrep and awk use extended regular expressions.

Differences from Basic Regular Expressions:

Exercises: Reading Regular Expressions

Instructions: Read the following regular expressions and determine what they do/match.

Questions:

  1. [a-zA-Z_][a-zA-Z_0-9]*
  2. \$[0-9]+(\.[0-9][0-9])?
  3. (1[012]|[1-9]):[0-5][0-9] (am|pm)
  4. <[hH][1-4]>

Answers:

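  1. Matches variable names (identifiers) in C.
  2. Matches a dollar amount with optional cents (for example, $5 or $12.50).
  3. Matches a 12-hour time of day (for example, 9:05 am or 12:45 pm).
  4. Matches HTML heading tags <h1> through <h4> in either case (<h1>, <H1>, <h2>, …).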
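
These answers can be spot-checked with sample inputs. The sketch below assumes Python's re module, whose syntax for +, ?, and (...|...) agrees with extended regular expressions for these four patterns; the helper name full and the sample strings are made up.

```python
import re

def full(pattern, s):
    """True if the pattern matches the entire string s."""
    return re.fullmatch(pattern, s) is not None

identifier = r"[a-zA-Z_][a-zA-Z_0-9]*"
dollars = r"\$[0-9]+(\.[0-9][0-9])?"
clock = r"(1[012]|[1-9]):[0-5][0-9] (am|pm)"
heading = r"<[hH][1-4]>"

q1 = [full(identifier, s) for s in ("_tmp1", "count", "9lives")]
q2 = [full(dollars, s) for s in ("$5", "$12.50", "$12.5")]
q3 = [full(clock, s) for s in ("9:05 am", "12:45 pm", "13:00 pm", "9:65 am")]
q4 = [full(heading, s) for s in ("<h1>", "<H4>", "<h5>")]

print(q1, q2, q3, q4)
```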