Regular Expression: A set of characters that specifies a pattern.
Tangent: “Regular” comes from the term used to describe grammars and formal languages (“regular languages”).
Remember: Regular expressions are just patterns; what they do is determined by how a utility uses them.
Two Types of Regular Expressions: basic and extended (see the Class column).
| Regular Expression | Class | Type | Meaning |
|---|---|---|---|
| `.` | all | Character Set | A single character (except newline) |
| `^` | all | Anchor | Beginning of line |
| `$` | all | Anchor | End of line |
| `[...]` | all | Character Set | Set or range of characters |
| `*` | all | Modifier | Zero or more duplicates |
| `\<` | Basic | Anchor | Beginning of word |
| `\>` | Basic | Anchor | End of word |
| `\(..\)` | Basic | Backreference | Remembers pattern |
| `\1..\9` | Basic | Reference | Recalls pattern |
| `\{M,N\}` | Basic | Modifier | M to N duplicates |
| `+` | Extended | Modifier | One or more duplicates |
| `?` | Extended | Modifier | Zero or one duplicate |
| `(...\|...)` | Extended | Alternation | Chooses between patterns |
Note: This is just a summary; examples and detailed descriptions are further down.
Regular expressions and shell metacharacters are often confused with each other, but are very different.
Shell Metacharacters:
Remember: Wrap regular expressions in quotes to prevent them from being erroneously expanded by the shell!
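The effect of quoting can be seen with a small experiment. This is a minimal sketch, assuming a POSIX-style shell with `mktemp -d` available; the scratch directory and the file names `boat` and `beat` are made up for illustration.

```shell
# Hypothetical scratch directory holding two files whose names fit b[eo]at.
tmpdir=$(mktemp -d)
touch "$tmpdir/boat" "$tmpdir/beat"

# Unquoted: the shell treats b[eo]at as a file name wildcard and replaces it
# with the matching file names before the command ever sees it.
unquoted=$(cd "$tmpdir" && echo b[eo]at)

# Quoted: the pattern reaches the command unchanged.
quoted=$(cd "$tmpdir" && echo 'b[eo]at')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"

rm -r "$tmpdir"
```

A utility such as `grep` given the unquoted form would receive `beat` as its pattern and `boat` as a file name, which is rarely what was intended.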
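Three Important Parts of a Regular Expression:
- Anchors: specify where on a line the pattern must match.
- Character sets: match one character at a single position.
- Modifiers: specify how many times the previous character set repeats.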
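Examples: Regular expressions (matches are described under each example)

`cks`
> UNIX rocks
> UNIX sucks
> UNIX is okay
- Matches the string “cks” in “rocks” and “sucks” (no anchor or modifier).

`apple`
> Scrapple from the apple
- Matches the string “apple”, both inside “Scrapple” and as the final word.

`^#*`
> UNIX rocks
> UNIX sucks
> UNIX is okay
- Uses an anchor (`^`), a character set (`#`), and a modifier (`*`).
  - `^`: Indicates the beginning of the line.
  - `#`: Matches the pound symbol (#).
  - `*`: Specifies that the previous character set can appear any number of times, including zero.
- Because zero `#` characters are allowed, the pattern matches at the start of every line here.

`b[eor]at`
> I tried to eat bratwurst on the boat, but was beaten to the punch by a seagull
- Matches “brat”, “boat”, and “beat”: a `b`, then one of `e`, `o`, or `r`, then `at`.
- Character sets (`[]`) behave similarly to, but aren’t the same as, the `[]` file substitution wildcard metacharacter.

`b[^eo]at`
> I tried to eat bratwurst on the boat, but was beaten to the punch by a seagull
- Matches only “brat”: a `b`, then any one character other than `e` or `o`, then `at`.
- A `^` directly after the `[` negates the set (the `[^]` syntax).

| Utility | Regular Expression Type |
|---|---|
| vi | Basic |
| sed | Basic |
| grep | Basic |
| more | Basic |
| ed | Basic |
| expr | Basic |
| awk | Extended |
| nawk | Extended |
| egrep | Extended |
| emacs | Emacs regular expressions |
| perl | Perl regular expressions |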
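The example patterns shown earlier can be tried with `grep`, which the table lists as a basic-regular-expression utility. This is a minimal sketch, assuming a `grep` that supports the widely available `-o` option (print only the matched text); the variable names are illustrative.

```shell
text='I tried to eat bratwurst on the boat, but was beaten to the punch by a seagull'
unix_lines='UNIX rocks
UNIX sucks
UNIX is okay'

# -o prints each match on its own line; paste joins them with spaces.
set_matches=$(printf '%s\n' "$text" | grep -o 'b[eor]at' | paste -s -d ' ' -)
negated_matches=$(printf '%s\n' "$text" | grep -o 'b[^eo]at' | paste -s -d ' ' -)

# -c counts matching lines. "cks" occurs in two of the three lines.
cks_lines=$(printf '%s\n' "$unix_lines" | grep -c 'cks')
# ^#* matches every line, because * also allows zero "#" characters.
hash_lines=$(printf '%s\n' "$unix_lines" | grep -c '^#*')

echo "b[eor]at  matched: $set_matches"
echo "b[^eo]at  matched: $negated_matches"
echo "cks       lines:   $cks_lines"
echo "^#*       lines:   $hash_lines"
```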
Commonly-used character classes can be referred to by name.
- Names: `alpha`, `lower`, `upper`, `alnum`, `digit`, `punct`, `cntrl`
- Format: `[:name:]`

Example: Using named character classes.

| Character Class | Named Character Class Equivalent |
|---|---|
| `[a-zA-Z]` | `[[:alpha:]]` |
| `[a-zA-Z0-9]` | `[[:alnum:]]` |
| `[45a-z]` | `[45[:lower:]]` |

Examples: Some more examples
- `[aeiou]` will match any of the characters `a`, `e`, `i`, `o`, or `u`
- `[kK]orn` will match `korn` or `Korn`

Examples: Using ranges
- `[1-9]` is the same as `[123456789]`
- `[abcde]` is equivalent to `[a-e]`
- The `-` character has a special meaning in a character class, but only if it is used within a range
  - `[-123]` would match the characters `-`, `1`, `2`, or `3`
- `[abcde123456789]` is equivalent to `[a-e1-9]`
- `[A-Za-z0-9_]`: This pattern will match a single character that is a letter, number, or underscore

Note: The characters `]` and `-` do not have a special meaning if they directly follow a `[`
- e.g., `[]0-9]` will match any digit or the symbol `]`
- Tip: In tools with Perl-style syntax, you can also use the backslash (`\`) to escape `]` and `-`, like you do for shell metacharacters. POSIX tools such as `grep` and `sed` treat a backslash inside `[]` as a literal character, so with them rely on position instead.
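The character-class rules above can be checked with `grep`. This is a minimal sketch, assuming `grep -o` is available; the sample strings are made up for illustration.

```shell
# [kK]orn accepts either case for the first letter only.
korn=$(printf '%s\n' 'Korn korn corn' | grep -o '[kK]orn' | paste -s -d ' ' -)

sample='a1-B2_]'

# A range and its named-class equivalent pick out the same characters.
by_range=$(printf '%s\n' "$sample" | grep -o '[0-9]' | paste -s -d ' ' -)
by_name=$(printf '%s\n' "$sample" | grep -o '[[:digit:]]' | paste -s -d ' ' -)

# "]" placed first and "-" placed last in the set are taken literally.
literal=$(printf '%s\n' "$sample" | grep -o '[]0-9-]' | paste -s -d ' ' -)

echo "[kK]orn     matched: $korn"
echo "[0-9]       matched: $by_range"
echo "[[:digit:]] matched: $by_name"
echo "[]0-9-]     matched: $literal"
```

Placing `]` and `-` by position, as in the last pattern, avoids depending on backslash escapes inside the brackets.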
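Instructions: Read the following regular expressions and determine what they do/match.

Questions:
1. `^T[a-z][aeiou]`
2. `[-0-9]`
3. `[0-9\-b\]]` (Hint: the `\`’s are escaping some special characters, which assumes a tool that honors backslash escapes inside a character class.)
4. `U[nN][iI]X`

Answers:
1. Matches a line that begins with `T`, followed by any lowercase letter, and then a vowel.
2. Matches any digit or `-`.
3. Matches any digit, `-`, the letter `b`, or the symbol `]`.
4. Matches `UniX`, `UnIX`, `UNiX`, or `UNIX`.

Anchors: Used to match at the beginning (`^`) or end (`$`) of a line (or both).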
- To use `^`, place the `^` at the beginning of the expression for it to be interpreted as an anchor.
- To use `$`, place the `$` at the end of the expression for it to be interpreted as an anchor.

| Pattern | Matches |
|---|---|
| `^A` | “A” at the beginning of a line |
| `A$` | “A” at the end of a line |
| `A^` | “A^” anywhere on a line |
| `$A` | “$A” anywhere on a line |
| `^^` | “^” at the beginning of a line |
| `$$` | “$” at the end of a line |
Examples: Using anchors

`^bingus`
> bingus bingus bingus
- Regex that matches the string “bingus” at the start of a line (only the first “bingus” here)

`bingus$`
> bingus bingus bingus
- Regex that matches the string “bingus” at the end of a line (only the last “bingus” here)
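The anchor behaviour above can be confirmed by counting matches with `grep`. This is a minimal sketch, assuming `grep -o` is available; the test lines `xA^y`, `x$Ay`, and `Ay` are made up for illustration.

```shell
line='bingus bingus bingus'

# Count how many times each pattern matches within the line.
start_count=$(printf '%s\n' "$line" | grep -o '^bingus' | wc -l | tr -d ' ')
end_count=$(printf '%s\n' "$line" | grep -o 'bingus$' | wc -l | tr -d ' ')
any_count=$(printf '%s\n' "$line" | grep -o 'bingus' | wc -l | tr -d ' ')

# Away from the start or end of the pattern, ^ and $ are ordinary characters,
# so only the line that literally contains them is counted.
caret_literal=$(printf '%s\n' 'xA^y' 'Ay' | grep -c 'A^')
dollar_literal=$(printf '%s\n' 'x$Ay' 'Ay' | grep -c '$A')

echo "^bingus matches: $start_count"
echo "bingus\$ matches: $end_count"
echo "bingus  matches: $any_count"
echo "lines containing A^: $caret_literal"
echo "lines containing \$A: $dollar_literal"
```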
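Any Character (`.`): `.` is a special character that, by itself, matches any character (except the end-of-line character).

Example: Using `.`

`o. ` (<- there is a space here)
> For me to loop on.
- Matches “or ” and “op ”: any three-character string that starts with `o` and ends with a space.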
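The `.` example can be reproduced with `grep`. This is a minimal sketch, assuming `grep -o` is available and that the sample sentence reads “For me to loop on.”; spaces in each match are shown as `_` so the trailing space is visible.

```shell
line='For me to loop on.'

# Pattern: "o", then any one character, then a space.
dot_matches=$(printf '%s\n' "$line" | grep -o 'o. ' | tr ' ' '_' | paste -s -d ',' -)

echo "matches: $dot_matches"
```

The `o` in “to” is not matched because the character two positions after it is `l`, not a space.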
Repetition (`*`): `*` is a special character used to define zero or more occurrences of the single regular expression preceding it.
- `0*` matches zero or more `0`s; `[0-9]*` matches zero or more of any digit.
- Only acts as a modifier if it follows a character set.

Examples: Using `*`

`ya*y`
> I got mail, yaaaaaay!
- Matches “yaaaaaay”.

`oa*o`
> For me to poop on.
- Matches the “oo” in “poop” (zero `a`s between the two `o`s).

`a.*e`
> Scrapple from the apple.
- Matches “apple from the apple”.
- A match will be the longest string that satisfies the regular expression.
- Notice how `apple` and `apple from the` also matched the regex, but were shorter.
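The longest-match rule can be observed directly by printing only the matched text. This is a minimal sketch, assuming `grep -o` is available; the variable names are illustrative.

```shell
# ya*y: a "y", any number of "a"s, then another "y".
yay=$(printf '%s\n' 'I got mail, yaaaaaay!' | grep -o 'ya*y')

# oa*o: zero "a"s is allowed, so two adjacent "o"s match.
oo=$(printf '%s\n' 'For me to poop on.' | grep -o 'oa*o')

# a.*e: starts at the first "a" and extends to the last possible "e".
longest=$(printf '%s\n' 'Scrapple from the apple.' | grep -o 'a.*e')

echo "ya*y  matched: $yay"
echo "oa*o  matched: $oo"
echo "a.*e  matched: $longest"
```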
Repetition Ranges (`\{` and `\}`): To specify a minimum and maximum number of repeats, put the minimum and maximum between `\{` and `\}`.
- `\{1,3\}`: between one and three repeats
- Only acts as a modifier if it follows a character set.

Note: Unlike shell metacharacters, backslash turns on the special meaning for `{` and `}`.

Notation (`\{ \}`): Ways to specify a range of repetitions for the preceding regular expression.
- `\{n\}`: Specify exactly n occurrences
- `\{n,\}`: Specify at least n occurrences
- `\{n,m\}`: Specify at least n occurrences but no more than m occurrences

Examples: Using `\{ \}`

| Regular Expression | Matches |
|---|---|
| `.\{0,\}` | Exactly the same as `.*` |
| `a\{2,\}` | Exactly the same as `aaa*` |
| `*` | Any line with an asterisk |
| `\*` | Any line with an asterisk |
| `\\` | Any line with a backslash |
| `^*` | Any line starting with an asterisk |
| `^A*` | Any line |
| `^A\*` | Any line starting with “A*” |
| `^AA*` | Any line starting with at least one “A” |
| `^AA*B` | Any line starting with one or more “A”s followed by a “B” |
| `^A\{4,8\}B` | Any line starting with 4, 5, 6, 7, or 8 “A”s followed by a “B” |
| `^A\{4,\}B` | Any line starting with 4 or more “A”s followed by a “B” |
| `^A\{4\}B` | Any line starting with “AAAAB” |
| `\{4,8\}` | Any line with “{4,8}” (some implementations reject this pattern instead, because nothing precedes the `\{`) |
| `A{4,8}` | Any line with “A{4,8}” |
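A few rows of the table can be verified with `grep`. This is a minimal sketch; `check` is a hypothetical helper defined here, and the test lines are made up for illustration.

```shell
# Hypothetical helper: prints "yes" if the line ($2) matches the pattern ($1).
check() {
  if printf '%s\n' "$2" | grep -q "$1"; then echo yes; else echo no; fi
}

# ^A\{4,8\}B needs between 4 and 8 "A"s at the start of the line, then a "B".
three_a=$(check '^A\{4,8\}B' 'AAAB')
four_a=$(check '^A\{4,8\}B' 'AAAAB')
eight_a=$(check '^A\{4,8\}B' 'AAAAAAAAB')
nine_a=$(check '^A\{4,8\}B' 'AAAAAAAAAB')

# a\{2,\} and aaa* select the same lines.
interval_count=$(printf '%s\n' 'a' 'aa' 'aaa' | grep -c 'a\{2,\}')
star_count=$(printf '%s\n' 'a' 'aa' 'aaa' | grep -c 'aaa*')

# ^A* matches any line, including an empty one, because zero "A"s is allowed.
any_line=$(printf '%s\n' 'B' '' 'AAA' | grep -c '^A*')

echo "3 As: $three_a, 4 As: $four_a, 8 As: $eight_a, 9 As: $nine_a"
echo "a\\{2,\\} lines: $interval_count, aaa* lines: $star_count"
echo "^A* lines: $any_line"
```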
Grouping (`\(` and `\)`): `\( \)` groups parts of an expression into subexpressions.
- Lets you apply `*` and `\{\}` to more than just the previous character.

Note: Unlike shell metacharacters, backslash turns on the special meaning for `(` and `)`.

Examples: Grouping expressions with `\(\)`

`abc*`
- Matches `ab`, `abc`, `abcc`, `abccc`, …

`\(abc\)*`
- Matches `abc`, `abcabc`, `abcabcabc`, … (and, with zero repetitions, the empty string)

`\(abc\)\{2,3\}`
- Matches `abcabc` or `abcabcabc`
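The difference between repeating one character and repeating a group shows up when each pattern must match a whole line. This is a minimal sketch using the POSIX `grep -x` option (match the entire line); the candidate lines are made up for illustration.

```shell
candidates='ab
abc
abcc
abcabc
abcabcabc
abcabcabcabc'

# Without grouping, * repeats only the "c".
star_char=$(printf '%s\n' "$candidates" | grep -x 'abc*' | paste -s -d ' ' -)

# With grouping, * repeats the whole "abc".
star_group=$(printf '%s\n' "$candidates" | grep -x '\(abc\)*' | paste -s -d ' ' -)

# A group repeated two or three times.
interval_group=$(printf '%s\n' "$candidates" | grep -x '\(abc\)\{2,3\}' | paste -s -d ' ' -)

echo "abc*            : $star_char"
echo "\\(abc\\)*        : $star_group"
echo "\\(abc\\)\\{2,3\\}  : $interval_group"
```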
Backreferences (`\(`, `\)` and `\1`): Backreferences let you remember what you found and see if the same pattern occurred again:
- Mark the pattern to remember with `\(` and `\)`.
- Each `\(` starts a new pattern.
- Recall a remembered pattern with `\` followed by a single digit.

Examples: Using backreferences

`\([a-z]\)\1`
- Matches two consecutive identical letters.

`\([a-z]\)\1\1`
- Matches three consecutive identical letters.

`\([a-z]\)\([a-z]\)[a-z]\2\1`
- Matches a 5-letter palindrome (e.g., `radar`).
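The three backreference patterns can be run against a short word list. This is a minimal sketch using plain `grep`; the word list is made up for illustration.

```shell
words='radar
level
hello
aaa
unix'

# \1 must repeat exactly what the first group matched.
doubles=$(printf '%s\n' "$words" | grep '\([a-z]\)\1' | paste -s -d ' ' -)
triples=$(printf '%s\n' "$words" | grep '\([a-z]\)\1\1' | paste -s -d ' ' -)

# Two remembered letters, a middle letter, then the two letters in reverse order.
palindromes=$(printf '%s\n' "$words" | grep '\([a-z]\)\([a-z]\)[a-z]\2\1' | paste -s -d ' ' -)

echo "double letters: $doubles"
echo "triple letters: $triples"
echo "palindromes:    $palindromes"
```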
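`egrep` and `awk` use extended regular expressions.

Differences from Basic Regular Expressions:
- The backslash forms “`\{`”, “`\}`”, “`\<`”, “`\>`”, “`\(`”, “`\)`”, and “`\`digit” no longer have their special meaning.
- `?` matches 0 or 1 instances of the character set before.
  - Equivalent to `\{0,1\}`
- `+` matches 1 or more copies of the character set.
  - Equivalent to `\{1,\}`
- Use `(`, `|` and `)` to make a choice of patterns.
  - `egrep '^(From|Subject): ' "/usr/spool/mail/${USER}"` prints all `From:` and `Subject:` lines from incoming mail.

Instructions: Read the following regular expressions and determine what they do/match.

Questions:
1. `[a-zA-Z_][a-zA-Z_0-9]*`
2. `\$[0-9]+(\.[0-9][0-9])?`
3. `(1[012]|[1-9]):[0-5][0-9] (am|pm)`
4. `<[hH][1-4]>`

Answers:
1. Matches an identifier such as a variable name: a letter or underscore followed by any number of letters, digits, or underscores.
2. Matches a dollar amount with optional cents (e.g., `$5` or `$19.99`).
3. Matches a 12-hour clock time followed by `am` or `pm` (e.g., `9:05 am` or `12:45 pm`).
4. Matches HTML heading tags of levels 1 to 4 (`<h1>`, `<H1>`, `<h2>`, …)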
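The extended patterns above can be tried with `grep -E`, the POSIX spelling of `egrep`. This is a minimal sketch; `check` is a hypothetical helper defined here, the sample inputs are made up, and `-x` requires the whole line to match.

```shell
# Hypothetical helper: prints "yes" if the whole line ($2) matches the ERE ($1).
check() {
  if printf '%s\n' "$2" | grep -E -q -x "$1"; then echo yes; else echo no; fi
}

ident_ok=$(check '[a-zA-Z_][a-zA-Z_0-9]*' 'my_var2')
ident_bad=$(check '[a-zA-Z_][a-zA-Z_0-9]*' '2fast')      # starts with a digit
price_ok=$(check '\$[0-9]+(\.[0-9][0-9])?' '$19.99')
time_ok=$(check '(1[012]|[1-9]):[0-5][0-9] (am|pm)' '12:45 pm')
time_bad=$(check '(1[012]|[1-9]):[0-5][0-9] (am|pm)' '13:45 pm')  # no hour 13
heading_ok=$(check '<[hH][1-4]>' '<H2>')

# Alternation: count the header lines that start with From: or Subject:
mail_lines=$(printf '%s\n' 'From: a@example.com' 'To: b@example.com' 'Subject: hi' |
  grep -E -c '^(From|Subject): ')

echo "identifier: $ident_ok / $ident_bad"
echo "price: $price_ok, time: $time_ok / $time_bad, heading: $heading_ok"
echo "From/Subject lines: $mail_lines"
```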