Regular Expressions

Regular Expression: A set of characters that specifies a pattern.

Tangent: “Regular” comes from the term used to describe grammars and formal languages (“regular languages”).

Remember: Regular expressions are just patterns; what they do is determined by how a utility uses them.
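
As a minimal sketch of this point, the snippet below uses Python's re module as a stand-in for the utilities covered in these notes (an assumption made only to get a runnable example; Python's syntax is closest to extended regular expressions). The same pattern is used once to test for a match, as grep would, and once to substitute, as sed would. The sample text is made up.

```python
import re

pattern = r"b[eo]at"
text = "the boat was beaten"

# grep-like use: only ask whether the line contains a match
has_match = re.search(pattern, text) is not None

# sed-like use: replace every match with something else
replaced = re.sub(pattern, "X", text)

print(has_match)  # True
print(replaced)   # the X was Xen
```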

Two Types of Regular Expressions:

  1. “Basic”
  2. “Extended”

Regular Expression Metacharacters

Regular Expression  Class     Type           Meaning
.                   all       Character Set  A single character (except newline)
^                   all       Anchor         Beginning of line
$                   all       Anchor         End of line
[...]               all       Character Set  Range of characters
*                   all       Modifier       Zero or more duplicates
\<                  Basic     Anchor         Beginning of word
\>                  Basic     Anchor         End of word
\(..\)              Basic     Backreference  Remembers pattern
\1..\9              Basic     Reference      Recalls pattern
\{M,N\}             Basic     Modifier       M to N duplicates
+                   Extended  Modifier       One or more duplicates
?                   Extended  Modifier       Zero or one duplicate
(...|...)           Extended  Anchor         Shows alternation

Note: This is just a summary; examples and detailed descriptions are further down.

Regular Expressions vs. Shell Metacharacters

Regular expressions and shell metacharacters are often confused with each other, but are very different.

Shell Metacharacters:

Remember: Wrap regular expressions in quotes to prevent the shell from erroneously expanding them!

Structure of a Regular Expression

Three Important Parts of a Regular Expression:

  1. Anchors: Specify position of pattern in relation to a line of text.
  2. Character Sets: Match one or more characters in a single position.
  3. Modifiers: Specify how many times the previous character set is repeated.
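
The three parts can be seen together in one short pattern. This is a minimal sketch using Python's re module as a stand-in for the utilities in these notes (an assumption; the helper name leading_number and the sample lines are made up for illustration).

```python
import re

# ^      anchor: the match must start at the beginning of the line
# [0-9]  character set: any one digit in this position
# *      modifier: the preceding character set repeats zero or more times
pattern = re.compile(r"^[0-9][0-9]*")

def leading_number(line):
    """Return the digits at the start of the line, or None if there are none."""
    m = pattern.search(line)
    return m.group(0) if m else None

results = [leading_number(line) for line in ["42 apples", "apples 42", "7"]]
print(results)  # ['42', None, '7']
```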

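Examples: Regular expressions (matches are noted in parentheses)

cks

UNIX rocks      (matches “cks”)
UNIX sucks      (matches “cks”)
UNIX is okay    (no match)

apple

Scrapple from the apple    (matches “apple” twice: inside “Scrapple” and as the last word)

^#*

UNIX rocks      (matches the empty string at the start of the line)
UNIX sucks      (matches the empty string at the start of the line)
UNIX is okay    (matches the empty string at the start of the line)

b[eor]at

I tried to eat bratwurst on the boat, but was beaten to the punch by a seagull
(matches “brat”, “boat”, and “beat”)

b[^eo]at

I tried to eat bratwurst on the boat, but was beaten to the punch by a seagull
(matches “brat” only)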
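
The examples above can be checked mechanically. The sketch below assumes Python's re module as a stand-in for a utility such as grep; these particular patterns are written the same way in both.

```python
import re

sentence = ("I tried to eat bratwurst on the boat, "
            "but was beaten to the punch by a seagull")

# [eor] allows exactly one of e, o, r between "b" and "at"
any_of_three = re.findall(r"b[eor]at", sentence)

# [^eo] allows any one character except e and o
not_e_or_o = re.findall(r"b[^eo]at", sentence)

# "#*" may match zero characters, so "^#*" matches at the start of any line
empty_match = re.search(r"^#*", "UNIX rocks").group(0)

print(any_of_three)       # ['brat', 'boat', 'beat']
print(not_e_or_o)         # ['brat']
print(repr(empty_match))  # ''
```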

Utilities using Regular Expressions

Utility  Regular Expression Type
vi       Basic
sed      Basic
grep     Basic
more     Basic
ed       Basic
expr     Basic
awk      Extended
nawk     Extended
egrep    Extended
emacs    Emacs regular expressions
perl     Perl regular expressions

Character Classes

Named Character Classes

Commonly-used character classes can be referred to by name.

Format: [:name:]

Example: Using named character classes.

Character Class  Named Character Class Equivalent
[a-zA-Z]         [[:alpha:]]
[a-zA-Z0-9]      [[:alnum:]]
[45a-z]          [45[:lower:]]
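
Python's re module does not accept the [:name:] notation, so the equivalence in the table can only be sketched there by expanding the names by hand. The lookup table POSIX_CLASSES and the helper expand_named_classes below are hypothetical and cover only a few names with their ASCII ranges; in a real utility the named classes depend on the locale.

```python
import re

# Hypothetical lookup table: a few class names and their ASCII ranges.
POSIX_CLASSES = {
    "alpha": "a-zA-Z",
    "alnum": "a-zA-Z0-9",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
}

def expand_named_classes(pattern):
    """Replace each [:name:] with its explicit ASCII range."""
    return re.sub(r"\[:([a-z]+):\]",
                  lambda m: POSIX_CLASSES[m.group(1)],
                  pattern)

expanded_alpha = expand_named_classes("[[:alpha:]]")
expanded_mixed = expand_named_classes("[45[:lower:]]")

print(expanded_alpha)  # [a-zA-Z]
print(expanded_mixed)  # [45a-z]
```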

More on Character Classes

Examples: Some more examples

  1. Ranges can also be specified in character classes.

Examples: Using ranges

  2. You can also combine multiple ranges.

Example: Combining ranges

  3. You can mix explicit characters and character ranges.

Example: Mixing character ranges and an explicit character

Note: The characters ] and - do not have a special meaning if they directly follow a [
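
A minimal sketch of ranges, combined ranges, mixed sets, and the placement rule for ] and -, assuming Python's re module as a stand-in (it follows the same rule for these two characters). The sample text is made up.

```python
import re

text = "room-7b [ok]"

single_range = re.findall(r"[0-9]", text)     # one range
combined = re.findall(r"[a-c0-9]", text)      # two ranges in one set
mixed = re.findall(r"[k0-9]", text)           # explicit character plus a range
literal_dash = re.findall(r"[-0-9]", text)    # "-" directly after "[" is literal
literal_bracket = re.findall(r"[]0-9]", text) # "]" directly after "[" is literal

print(single_range)     # ['7']
print(combined)         # ['7', 'b']
print(mixed)            # ['7', 'k']
print(literal_dash)     # ['-', '7']
print(literal_bracket)  # ['7', ']']
```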

Exercises: Reading Regular Expressions

Instructions: Read the following regular expressions and determine what they do/match.

Questions:

  1. ^T[a-z][aeiou]
  2. [-0-9]
  3. [0-9\-b\]]
  4. U[nN][iI]X

Answers:

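  1. Matches a line that begins with T, followed by any lowercase letter, followed by a lowercase vowel (the match itself is 3 characters long).
  2. Matches any single character that is a digit or the symbol -.
  3. Matches any single character that is a digit, the symbol -, the letter b, or the symbol ]. (This assumes the utility treats backslash as an escape inside brackets, as Perl does; in a POSIX bracket expression the backslash is an ordinary character.)
  4. Matches the strings UniX, UnIX, UNiX, or UNIX.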

More on Anchors

Anchors: Used to match at the beginning (^) or end ($) of a line (or both).

Pattern  Matches
^A       “A” at the beginning of a line
A$       “A” at the end of a line
A^       “A^” anywhere on a line
$A       “$A” anywhere on a line
^^       “^” at the beginning of a line
$$       “$” at the end of a line

Examples: Using anchors

^bingus

bingus bingus bingus    (matches only the first “bingus”)

bingus$

bingus bingus bingus    (matches only the last “bingus”)
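
The positions of the two matches can be confirmed with a short sketch, assuming Python's re module as a stand-in. Note that Python treats ^ and $ as anchors wherever they appear, so the literal cases in the table above (such as A^) would need a backslash there.

```python
import re

line = "bingus bingus bingus"

at_start = re.search(r"^bingus", line)
at_end = re.search(r"bingus$", line)

# span() gives the (start, end) offsets of the match within the line
print(at_start.span())  # (0, 6)
print(at_end.span())    # (14, 20)
```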

More on Modifiers

Matching a Single Character (.)

. is a special character that, by itself, matches any character (except the end of line character).

Example: Using .

o.  (<- there is a space after the dot)

For me to loop on.    (matches “or ” and “op ”)
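
A minimal sketch of the dot, assuming Python's re module as a stand-in (by default its dot also excludes the newline). The sample strings are illustrative.

```python
import re

# "." stands for any one character, including a space or punctuation ...
any_char = re.findall(r"a.c", "abc a c a-c")

# ... but not the newline character
across_newline = re.findall(r"a.c", "a\nc")

# a pattern that ends with a space: "o", any character, then a space
with_space = re.findall(r"o. ", "For me to loop on.")

print(any_char)        # ['abc', 'a c', 'a-c']
print(across_newline)  # []
print(with_space)      # ['or ', 'op ']
```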

Repetition in Regular Expressions (*)

* is a special character used to define zero or more occurrences of the single regular expression preceding it.

Examples: Using *

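ya*y

I got mail, yaaaaaay!    (matches “yaaaaaay”)

oa*o

For me to poop on.    (matches the “oo” in “poop”)

a.*e

Scrapple from the apple.    (matches “apple from the apple”, the longest possible match)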
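
The three examples can be reproduced with a short sketch, assuming Python's re module as a stand-in. The last one shows that * takes as much text as it can while still allowing the rest of the pattern to match.

```python
import re

# "a*" absorbs every "a" between the two "y" characters
yay = re.findall(r"ya*y", "I got mail, yaaaaaay!")

# "a*" may match zero characters, so "oa*o" matches a plain "oo"
double_o = re.findall(r"oa*o", "For me to poop on.")

# ".*" runs from the first "a" to the last "e" on the line
longest = re.findall(r"a.*e", "Scrapple from the apple.")

print(yay)
print(double_o)  # ['oo']
print(longest)   # ['apple from the apple']
```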

Repetition Ranges (\{ and \})

To specify a minimum and maximum number of repeats, put the minimum and maximum between \{ and \}.

Note: Unlike shell metacharacters, backslash turns on the special meaning for { and }.

Notation (\{ \}): Ways to specify a range of repetitions for the preceding regular expression.

Examples: Using \{ \}

Regular Expression  Matches
.\{0,\}             Exactly the same as .*
a\{2,\}             Exactly the same as aaa*
*                   Any line with an asterisk
\*                  Any line with an asterisk
\\                  Any line with a backslash
^*                  Any line starting with an asterisk
^A*                 Any line
^A\*                Any line starting with “A*”
^AA*                Any line starting with at least one “A”
^AA*B               Any line starting with one or more “A”s followed by a “B”
^A\{4,8\}B          Any line starting with 4, 5, 6, 7, or 8 “A”s followed by a “B”
^A\{4,\}B           Any line starting with 4 or more “A”s followed by a “B”
^A\{4\}B            Any line starting with “AAAAB”
\{4,8\}             Any line with “{4,8}”
A{4,8}              Any line with “A{4,8}”
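
The repetition counts can be tried out with a short sketch. It assumes Python's re module as a stand-in, which writes the braces without backslashes (as extended-style syntaxes do), so basic ^A\{4,8\}B becomes ^A{4,8}B. The helper name counts_matching is made up for illustration.

```python
import re

# Test lines made of n "A"s followed by one "B"
counts = (2, 4, 6, 9)
lines = {n: "A" * n + "B" for n in counts}

def counts_matching(pattern):
    """Return the values of n whose test line matches the pattern."""
    return [n for n in counts if re.search(pattern, lines[n])]

four_to_eight = counts_matching(r"^A{4,8}B")  # basic: ^A\{4,8\}B
four_or_more = counts_matching(r"^A{4,}B")    # basic: ^A\{4,\}B
exactly_four = counts_matching(r"^A{4}B")     # basic: ^A\{4\}B

print(four_to_eight)  # [4, 6]
print(four_or_more)   # [4, 6, 9]
print(exactly_four)   # [4]
```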

Repetitions of Subexpressions (\( and \))

\( \) groups parts of an expression into subexpressions.

Note: Unlike shell metacharacters, backslash turns on the special meaning for ( and ).

Examples: Grouping expressions with \(\)

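abc*                (matches “ab” followed by zero or more “c”s: ab, abc, abcc, ...)

\(abc\)*            (matches zero or more repetitions of “abc”: the empty string, abc, abcabc, ...)

\(abc\)\{2,3\}      (matches “abc” repeated 2 or 3 times: abcabc or abcabcabc)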
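
The effect of grouping shows up when each pattern is required to match a whole string. This sketch assumes Python's re module as a stand-in, which writes groups and repetition counts without backslashes; the candidate strings and the helper name accepted are made up for illustration.

```python
import re

candidates = ["ab", "abccc", "abc", "abcabc", "abcabcabc", "abcabcabcabc"]

def accepted(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star_on_c = accepted(r"abc*")            # * applies to "c" only
star_on_group = accepted(r"(abc)*")      # basic: \(abc\)*
two_or_three = accepted(r"(abc){2,3}")   # basic: \(abc\)\{2,3\}

print(star_on_c)      # ['ab', 'abccc', 'abc']
print(star_on_group)  # ['abc', 'abcabc', 'abcabcabc', 'abcabcabcabc']
print(two_or_three)   # ['abcabc', 'abcabcabc']
```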

Backreferences (\(,\) and \1)

Backreferences let you remember what you found and see if the same pattern occurred again:

  1. Mark part of a pattern with \( and \).
  2. Recall the remembered pattern with \ followed by a single digit.

Examples: Using backreferences

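\([a-z]\)\1                     (matches a doubled lowercase letter, such as “oo”)

\([a-z]\)\1\1                   (matches a tripled lowercase letter, such as “rrr”)

\([a-z]\)\([a-z]\)[a-z]\2\1     (matches a 5-letter lowercase palindrome, such as “radar”)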
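
A minimal sketch of the three backreference patterns, assuming Python's re module as a stand-in. It writes the groups without backslashes but uses the same \1 and \2 to recall them. The word list is made up.

```python
import re

words = ["book", "radar", "level", "brrr", "unix"]

# \1 must repeat exactly what the first group matched
doubled = [w for w in words if re.search(r"([a-z])\1", w)]
tripled = [w for w in words if re.search(r"([a-z])\1\1", w)]

# two remembered letters, any middle letter, then the two in reverse order
palindromes = [w for w in words
               if re.search(r"([a-z])([a-z])[a-z]\2\1", w)]

print(doubled)      # ['book', 'brrr']
print(tripled)      # ['brrr']
print(palindromes)  # ['radar', 'level']
```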

Extended Regular Expression

egrep and awk use extended regular expressions.

Differences from Basic Regular Expressions:
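
The summary table near the top lists +, ?, and (...|...) as the extended-only metacharacters. The sketch below tries each of them, assuming Python's re module as a stand-in (it accepts these three in the same form). The sample strings are made up.

```python
import re

# + : one or more of the preceding expression
one_or_more = re.findall(r"a+", "caaat cat ct")

# ? : zero or one of the preceding expression
optional_u = re.findall(r"colou?r", "color colour colr")

# (...|...) : either alternative
either = [m.group(0) for m in re.finditer(r"(cat|dog)s", "cats dogs birds")]

print(one_or_more)  # ['aaa', 'a']
print(optional_u)   # ['color', 'colour']
print(either)       # ['cats', 'dogs']
```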

Exercises: Reading Regular Expressions

Instructions: Read the following regular expressions and determine what they do/match.

Questions:

  1. [a-zA-Z_][a-zA-Z_0-9]*
  2. \$[0-9]+(\.[0-9][0-9])?
  3. (1[012]|[1-9]):[0-5][0-9] (am|pm)
  4. <[hH][1-4]>

Answers:

  1. Matches variable names (identifiers) in C.
  2. Matches a dollar amount with optional cents.
  3. Matches a time of day on a 12-hour clock.
  4. Matches HTML header tags of levels 1 through 4 (<h1>, <H1>, <h2>, …).
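
The answers can be spot-checked with a short sketch, assuming Python's re module as a stand-in (all four patterns are valid there as written). The helper name whole_match and the test strings are made up; fullmatch is used so that each string must match from start to end.

```python
import re

def whole_match(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

identifier = r"[a-zA-Z_][a-zA-Z_0-9]*"
dollars = r"\$[0-9]+(\.[0-9][0-9])?"
time_of_day = r"(1[012]|[1-9]):[0-5][0-9] (am|pm)"
html_header = r"<[hH][1-4]>"

checks = {
    "_count1": whole_match(identifier, "_count1"),    # True
    "1count": whole_match(identifier, "1count"),      # False: starts with a digit
    "$12.50": whole_match(dollars, "$12.50"),         # True
    "$12.5": whole_match(dollars, "$12.5"),           # False: cents need two digits
    "9:05 am": whole_match(time_of_day, "9:05 am"),   # True
    "13:00 pm": whole_match(time_of_day, "13:00 pm"), # False: no hour 13
    "<H3>": whole_match(html_header, "<H3>"),         # True
    "<h5>": whole_match(html_header, "<h5>"),         # False: only levels 1 to 4
}

for text, ok in checks.items():
    print(text, ok)
```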