Regular Expression: Set of characters that specify a pattern.
Tangent: “Regular” comes from the term used to describe grammars and formal languages (“regular languages”).
Remember: Regular expressions are just patterns; what they do is determined by how a utility uses them.
Two Types of Regular Expressions: basic and extended. In the table below, “all” marks syntax shared by both types.
| Regular Expression | Class | Type | Meaning |
|---|---|---|---|
| `.` | all | Character Set | A single character (except newline) |
| `^` | all | Anchor | Beginning of line |
| `$` | all | Anchor | End of line |
| `[...]` | all | Character Set | Any one character from a set or range |
| `*` | all | Modifier | Zero or more duplicates |
| `\<` | Basic | Anchor | Beginning of word |
| `\>` | Basic | Anchor | End of word |
| `\(..\)` | Basic | Backreference | Remembers pattern |
| `\1..\9` | Basic | Reference | Recalls pattern |
| `\{M,N\}` | Basic | Modifier | M to N duplicates |
| `+` | Extended | Modifier | One or more duplicates |
| `?` | Extended | Modifier | Zero or one duplicate |
| `(...\|...)` | Extended | Alternation | Matches any one of the alternatives |
Note: This is just a summary; examples and detailed descriptions are further down.
Regular expressions and shell metacharacters are often confused with each other, but are very different.
Shell Metacharacters:
Remember: Wrap regular expressions in quotes to prevent them from being erroneously expanded by the shell!
Three Important Parts of a Regular Expression: anchors, character sets, and modifiers.
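Examples: Regular expressions (matches are listed under each sample)
- `cks`
  > UNIX rocks
  > UNIX sucks
  > UNIX is okay
  - Regex that matches the string “cks” (no anchor or modifier).
  - Matches: the “cks” in “rocks” and in “sucks”; “UNIX is okay” has no match.
- `apple`
  > Scrapple from the apple
  - Regex that matches the string “apple”.
  - Matches: the “apple” inside “Scrapple” and the final word “apple”.
- `^#*`
  > UNIX rocks
  > UNIX sucks
  > UNIX is okay
  - Regex with an anchor (`^`), character set (`#`), and modifier (`*`).
  - `^`: Indicates the beginning of the line.
  - `#`: Matches the pound symbol (#).
  - `*`: Specifies that the previous character set can appear any number of times, including zero.
  - Matches: every line, since zero pound symbols at the start of a line is enough.
- `b[eor]at`
  > I tried to eat bratwurst on the boat, but was beaten to the punch by a seagull
  - Regex that matches “b[eor]at”: a `b`, then one of `e`, `o`, or `r`, then `at`.
  - Matches: “brat” (in “bratwurst”), “boat”, and “beat” (in “beaten”).
  - Character sets (`[]`) behave similarly to, but aren’t the same as, the `[]` file substitution wildcard metacharacter.
- `b[^eo]at`
  > I tried to eat bratwurst on the boat, but was beaten to the punch by a seagull
  - Regex that matches “b[^eo]at”: a `b`, then any character other than `e` or `o`, then `at`.
  - Matches: only “brat” (in “bratwurst”).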
  - Character sets can be negated with the `[^]` syntax.

| Utility | Regular Expression Type |
|---|---|
| vi | Basic |
| sed | Basic |
| grep | Basic |
| more | Basic |
| ed | Basic |
| expr | Basic |
| awk | Extended |
| nawk | Extended |
| egrep | Extended |
| emacs | Emacs regular expressions |
| perl | Perl regular expressions |
Commonly-used character classes can be referred to by name.
- Names: alpha, lower, upper, alnum, digit, punct, cntrl
- Format: `[:name:]`

| Character Class | Named Character Class Equivalent |
|---|---|
| `[a-zA-Z]` | `[[:alpha:]]` |
| `[a-zA-Z0-9]` | `[[:alnum:]]` |
| `[45a-z]` | `[45[:lower:]]` |
Examples: Some more examples
- `[aeiou]` will match any of the characters `a`, `e`, `i`, `o`, or `u`
- `[kK]orn` will match `korn` or `Korn`

Examples: Using ranges
- `[1-9]` is the same as `[123456789]`
- `[abcde]` is equivalent to `[a-e]`
- The `-` character has a special meaning in a character class, but only if it is used within a range
  - `[-123]` would match the characters `-`, `1`, `2`, or `3`
- `[abcde123456789]` is equivalent to `[a-e1-9]`
- `[A-Za-z0-9_]`: This pattern will match a single character that is a letter, number, or underscore

Note: The characters `]` and `-` do not have a special meaning if they directly follow a `[`
- e.g., `[]0-9]` will match any digit or the symbol `]`
- Tip: In Perl-style engines you can also use the backslash (`\`) to escape `]` and `-`, like you do for shell metacharacters. In POSIX bracket expressions (grep, sed) a backslash inside `[]` is an ordinary character, so rely on position there: `]` first, `-` first or last.
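
The character-set rules above can be checked with a short script. This is only a sketch: it uses Python's `re` module as a stand-in for grep or sed (an assumption, since the notes target Unix utilities), and the variable names are illustrative. The bracket syntax shown here behaves the same way in both.

```python
import re

sentence = "I tried to eat bratwurst on the boat, but was beaten to the punch by a seagull"

# b[eor]at: "b", then one of e/o/r, then "at"
set_matches = re.findall(r"b[eor]at", sentence)

# b[^eo]at: "b", then any character except e or o, then "at"
negated_matches = re.findall(r"b[^eo]at", sentence)

# "]" directly after "[" is literal, so this set is "]" plus the digits
bracket_or_digit = re.findall(r"[]0-9]", "a]b7")

# "-" directly after "[" is literal, so this set is "-", "1", "2", "3"
dash_or_digit = re.findall(r"[-123]", "x-1y4")

print(set_matches)       # ['brat', 'boat', 'beat']
print(negated_matches)   # ['brat']
print(bracket_or_digit)  # [']', '7']
print(dash_or_digit)     # ['-', '1']
```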
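Instructions: Read the following regular expressions and determine what they do/match.
Questions:
- `^T[a-z][aeiou]`
- `[-0-9]`
- `[0-9\-b\]]`
  - Hint: The `\`’s are escaping some special characters (this assumes an engine that honors backslash escapes inside brackets).
- `U[nN][iI]X`

Answers:
- `^T[a-z][aeiou]`: Matches lines that begin with `T`, followed by any lowercase letter, followed by a vowel.
- `[-0-9]`: Matches any digit or the symbol `-`.
- `[0-9\-b\]]`: Matches any digit, the symbol `-`, the letter `b`, or the symbol `]`.
- `U[nN][iI]X`: Matches `UniX`, `UnIX`, `UNiX`, or `UNIX`.

Anchors: Used to match at the beginning (`^`) or end (`$`) of a line (or both).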
- To use `^`, place the `^` at the beginning of the expression for it to be interpreted as an anchor.
- To use `$`, place the `$` at the end of the expression for it to be interpreted as an anchor.

| Pattern | Matches |
|---|---|
| `^A` | “A” at the beginning of a line |
| `A$` | “A” at the end of a line |
| `A^` | “A^” anywhere on a line |
| `$A` | “$A” anywhere on a line |
| `^^` | “^” at the beginning of a line |
| `$$` | “$” at the end of a line |
Examples: Using anchors
- `^bingus`
  > bingus bingus bingus
  - Regex that matches the string “bingus” at the start of a line (only the first “bingus” here).
- `bingus$`
  > bingus bingus bingus
  - Regex that matches the string “bingus” at the end of a line (only the last “bingus” here).
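
To see which occurrence each anchor selects, here is a minimal sketch using Python's `re` module as a stand-in for a line-oriented tool (an assumption; the variable names are illustrative). Note that Python always treats `^` and `$` as anchors, so the literal cases such as `A^` and `$A` are not reproduced here.

```python
import re

line = "bingus bingus bingus"

# Without an anchor, every occurrence matches
unanchored = re.findall(r"bingus", line)

# ^ pins the match to the start of the line, $ to the end
at_start = re.search(r"^bingus", line)
at_end = re.search(r"bingus$", line)

print(len(unanchored))   # 3
print(at_start.span())   # (0, 6)
print(at_end.span())     # (14, 20)

# An anchored pattern fails when the text is not at that position
print(re.search(r"^bingus", "a bingus") is None)  # True
```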
Match any character (`.`): `.` is a special character that, by itself, matches any single character (except the end-of-line character).
- `o. ` (<- there is a space at the end)
  > For me to loop on.
  - Regex that matches a three-character string that starts with `o` and ends with a space.
  - Matches: “or ” (in “For ”) and “op ” (in “loop ”).

Repetition (`*`): `*` is a special character used to define zero or more occurrences of the single regular expression preceding it.
- e.g., `0*` matches zero or more `0`s; `[0-9]*` matches zero or more of any digit.
- Only acts as a modifier if it follows a character set.

Examples: Using `*`
- `ya*y`
  > I got mail, yaaaaaay!
  - Matches: “yaaaaaay”.
- `oa*o`
  > For me to poop on.
  - Matches: the “oo” in “poop”.
- `a.*e`
  > Scrapple from the apple.
  - Matches: “apple from the apple”.
  - A match will be the longest string that satisfies the regular expression.
  - Notice how “apple” and “apple from the” also satisfy the regex, but are shorter.
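
The `.` and `*` examples above can be reproduced with a short sketch. It assumes Python's `re` module as a stand-in for the Unix tools; Python picks the leftmost match and makes `*` greedy, which gives the same results as the longest-match rule for these inputs. Variable names are illustrative.

```python
import re

# "o", any one character, then a space
dot_matches = re.findall(r"o. ", "For me to loop on.")

# "y", zero or more "a", then "y"
text = "I got mail, yaaaaaay!"
ya_match = re.search(r"ya*y", text).group()

# "o", zero or more "a", then "o": zero "a"s is allowed, so "oo" matches
oo_matches = re.findall(r"oa*o", "For me to poop on.")

# .* grabs as much as it can, so the match runs to the last "e"
longest = re.search(r"a.*e", "Scrapple from the apple.").group()

print(dot_matches)  # ['or ', 'op ']
print(ya_match)
print(oo_matches)   # ['oo']
print(longest)      # apple from the apple
```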
Repetition ranges (`\{` and `\}`): To specify a minimum and maximum number of repeats, put the minimum and maximum between `\{` and `\}`.
- e.g., `\{1,3\}`
- Only acts as a modifier if it follows a character set.

Note: Unlike shell metacharacters, backslash turns on the special meaning for `{` and `}`.
Notation (`\{ \}`): Ways to specify a range of repetitions for the preceding regular expression.
- `\{n\}`: Specify exactly n occurrences
- `\{n,\}`: Specify at least n occurrences
- `\{n,m\}`: Specify at least n occurrences but no more than m occurrences

Examples: Using `\{ \}`

| Regular Expression | Matches |
|---|---|
| `.\{0,\}` | Exactly the same as `.*` |
| `a\{2,\}` | Exactly the same as `aaa*` |
| `*` | Any line with an asterisk |
| `\*` | Any line with an asterisk |
| `\\` | Any line with a backslash |
| `^*` | Any line starting with an asterisk |
| `^A*` | Any line |
| `^A\*` | Any line starting with an “A*” |
| `^AA*` | Any line starting with at least one “A” |
| `^AA*B` | Any line starting with one or more “A”’s followed by a “B” |
| `^A\{4,8\}B` | Any line starting with 4, 5, 6, 7 or 8 “A”’s followed by a “B” |
| `^A\{4,\}B` | Any line starting with 4 or more “A”’s followed by a “B” |
| `^A\{4\}B` | Any line starting with “AAAAB” |
| `\{4,8\}` | Any line with “{4,8}” (nothing precedes it, so it is not a modifier; some implementations reject this pattern instead) |
| `A{4,8}` | Any line with “A{4,8}” |
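
A few rows of the table can be checked with the sketch below. It assumes Python's `re` module, where the braces are written without backslashes (`A{4,8}` rather than `A\{4,8\}`), so it illustrates the counting behavior, not the basic-regex escaping. The helper name is hypothetical.

```python
import re

# Basic regex ^A\{4,8\}B, written in Python's syntax
four_to_eight = re.compile(r"^A{4,8}B")

def starts_with_4_to_8_a_then_b(line: str) -> bool:
    """True if the line starts with 4 to 8 "A"s followed by a "B"."""
    return four_to_eight.search(line) is not None

samples = ["A" * 3 + "B", "A" * 4 + "B", "A" * 8 + "B", "A" * 9 + "B", "x" + "A" * 4 + "B"]
results = [starts_with_4_to_8_a_then_b(s) for s in samples]
print(results)  # [False, True, True, False, False]

# a{2,} accepts the same strings as aaa*
words = ["a", "aa", "aaa"]
with_braces = [re.fullmatch(r"a{2,}", w) is not None for w in words]
with_star = [re.fullmatch(r"aaa*", w) is not None for w in words]
print(with_braces == with_star)  # True

# A backslash turns off the special meaning of *
literal_star = re.search(r"\*", "2 * 3") is not None
print(literal_star)  # True
```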
Grouping (`\(` and `\)`): `\( \)` groups parts of an expression into subexpressions.
- Lets you apply `*` and `\{\}` to more than just the previous character.

Note: Unlike shell metacharacters, backslash turns on the special meaning for `(` and `)`.
Examples: Grouping expressions with `\(\)`
- `abc*`
  - Matches `ab`, `abc`, `abcc`, `abccc`, …
- `\(abc\)*`
  - Matches `abc`, `abcabc`, `abcabcabc`, … (and, with zero repetitions, the empty string)
- `\(abc\)\{2,3\}`
  - Matches `abcabc` or `abcabcabc`
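
The difference between `abc*` and `\(abc\)*` can be checked with the sketch below. It assumes Python's `re` module, where grouping parentheses and braces are written without backslashes, and it uses `fullmatch` so that a pattern has to account for the whole string. The helper name is hypothetical.

```python
import re

def fully_matches(pattern: str, text: str) -> bool:
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# Without grouping, * applies only to the "c"
print(fully_matches(r"abc*", "abccc"))   # True
print(fully_matches(r"abc*", "abcabc"))  # False

# With grouping, * applies to the whole "abc"
print(fully_matches(r"(abc)*", "abcabc"))  # True
print(fully_matches(r"(abc)*", ""))        # True (zero repetitions)

# Two or three copies of the group
print(fully_matches(r"(abc){2,3}", "abc"))      # False
print(fully_matches(r"(abc){2,3}", "abc" * 3))  # True
print(fully_matches(r"(abc){2,3}", "abc" * 4))  # False
```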
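Backreferences (`\(`, `\)` and `\1`): Backreferences let you remember what you found and see if the same text occurred again.
- Mark the pattern to remember with `\(` and `\)`.
- Each `\(` starts a new pattern.
- Recall a remembered pattern with `\` followed by a single digit (`\1` to `\9`), numbered in the order the `\(` appear.
- A backreference matches the exact text the group matched, not just anything that fits the group’s pattern.

Examples: Using backreferences
- `\([a-z]\)\1`
  - Match for two consecutive identical letters.
- `\([a-z]\)\1\1`
  - Match for three consecutive identical letters.
- `\([a-z]\)\([a-z]\)[a-z]\2\1`
  - Match for a 5-letter palindrome (e.g., `radar`).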
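
The backreference examples translate directly into the sketch below. It assumes Python's `re` module, where groups are written `( )` without backslashes while `\1` and `\2` keep the same meaning. The variable names are illustrative.

```python
import re

double_letter = re.compile(r"([a-z])\1")
triple_letter = re.compile(r"([a-z])\1\1")
palindrome5 = re.compile(r"([a-z])([a-z])[a-z]\2\1")

# First pair of identical consecutive letters
first_double = double_letter.search("bookkeeper").group()
print(first_double)  # oo

# "ab" is two lowercase letters, but \1 must repeat the same letter
print(double_letter.search("ab") is None)  # True

print(triple_letter.search("bookkeeper") is None)  # True
print(triple_letter.search("brrr").group())        # rrr

# \2 and \1 mirror the first two letters around the middle one
print(palindrome5.fullmatch("radar") is not None)  # True
print(palindrome5.fullmatch("level") is not None)  # True
print(palindrome5.fullmatch("hello") is not None)  # False
```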
egrep and awk use extended regular expressions.
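Differences from Basic Regular Expressions:
- The basic-only syntax “`\{`”, “`\}`”, “`\<`”, “`\>`”, “`\(`”, “`\)`”, and “`\`digit” is not part of the extended syntax summarized above.
- `?` matches 0 or 1 instances of the character set before it.
  - Equivalent to `\{0,1\}`
- `+` matches 1 or more copies of the character set before it.
  - Equivalent to `\{1,\}`
- Use `(`, `|` and `)` to make a choice of patterns.
  - `egrep '^(From|Subject): ' "/usr/spool/mail/${USER}"` prints all From: and Subject: lines from incoming mail.

Instructions: Read the following regular expressions and determine what they do/match.
Questions:
- `[a-zA-Z_][a-zA-Z_0-9]*`
- `\$[0-9]+(\.[0-9][0-9])?`
- `(1[012]|[1-9]):[0-5][0-9] (am|pm)`
- `<[hH][1-4]>`

Answers:
- `[a-zA-Z_][a-zA-Z_0-9]*`: Matches an identifier: a letter or underscore followed by any number of letters, digits, or underscores.
- `\$[0-9]+(\.[0-9][0-9])?`: Matches a dollar amount with optional two-digit cents (`$5`, `$5.99`).
- `(1[012]|[1-9]):[0-5][0-9] (am|pm)`: Matches a 12-hour clock time (`9:05 am`, `12:30 pm`).
- `<[hH][1-4]>`: Matches HTML heading tags of levels 1 to 4 (`<h1>`, `<H1>`, `<h2>`, …)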
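
The extended-syntax exercises can be checked with the sketch below. It assumes Python's `re` module, whose syntax agrees with extended regular expressions for these particular patterns. The sample mail lines and the helper name are made up for illustration.

```python
import re

identifier = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")
dollars = re.compile(r"\$[0-9]+(\.[0-9][0-9])?")
clock = re.compile(r"(1[012]|[1-9]):[0-5][0-9] (am|pm)")
heading = re.compile(r"<[hH][1-4]>")

def whole(pattern: re.Pattern, text: str) -> bool:
    """True if the compiled pattern matches the entire text."""
    return pattern.fullmatch(text) is not None

print(whole(identifier, "_tmp1"), whole(identifier, "9lives"))  # True False
print(whole(dollars, "$5.99"), whole(dollars, "$5.9"))          # True False
print(whole(clock, "12:30 pm"), whole(clock, "13:00 pm"))       # True False
print(whole(heading, "<H4>"), whole(heading, "<h5>"))           # True False

# ? behaves like {0,1}: the "u" is optional in both patterns
optional_u = [whole(re.compile(p), "color") and whole(re.compile(p), "colour")
              for p in (r"colou?r", r"colou{0,1}r")]
print(optional_u)  # [True, True]

# Alternation, as in the egrep mail example (hypothetical sample lines)
mail_header = re.compile(r"^(From|Subject): ")
lines = ["From: alice", "To: bob", "Subject: hi", "Re: From: x"]
selected = [l for l in lines if mail_header.search(l)]
print(selected)  # ['From: alice', 'Subject: hi']
```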