Regular Expression: A set of characters that specifies a pattern.
Tangent: “Regular” comes from the term used to describe grammars and formal languages (“regular languages”).
Remember: Regular expressions are just patterns; what they do is determined by how a utility uses them.
Two Types of Regular Expressions:
Regular Expression | Class | Type | Meaning |
---|---|---|---|
. | all | Character Set | A single character (except newline) |
^ | all | Anchor | Beginning of line |
$ | all | Anchor | End of line |
[...] | all | Character Set | Range of characters |
* | all | Modifier | zero or more duplicates |
\< | Basic | Anchor | Beginning of word |
\> | Basic | Anchor | End of word |
\(..\) | Basic | Backreference | Remembers pattern |
\1..\9 | Basic | Reference | Recalls pattern |
\{M,N\} | Basic | Modifier | M to N Duplicates |
+ | Extended | Modifier | One or more duplicates |
? | Extended | Modifier | Zero or one duplicate |
(...|...) | Extended | Grouping | Shows alternation (a choice between patterns) |
Note: This is just a summary; examples and detailed descriptions are further down.
Regular expressions and shell metacharacters are often confused with each other, but are very different.
Shell Metacharacters:
Remember: Wrap regular expressions in quotes to prevent the shell from erroneously expanding them!
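A minimal sketch of why the quotes matter. The scratch directory and the file names `ab` and `abb` are made up for illustration; `echo` stands in for any utility so that the arguments it would receive are visible.

```shell
# Create a scratch directory holding two files whose names match the glob ab*.
tmpdir=$(mktemp -d)
touch "$tmpdir/ab" "$tmpdir/abb"

# Unquoted: the shell expands ab* to file names before the command runs.
unquoted=$(cd "$tmpdir" && echo ab*)
# Quoted: the pattern reaches the command unchanged.
quoted=$(cd "$tmpdir" && echo 'ab*')

printf 'unquoted: %s\n' "$unquoted"   # unquoted: ab abb
printf 'quoted:   %s\n' "$quoted"     # quoted:   ab*

rm -rf "$tmpdir"
```

With the unquoted form, a utility such as grep would receive `ab` as its pattern and `abb` as a file name, which is rarely what was intended.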
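Three Important Parts of a Regular Expression:
- Anchors: Specify where in a line the pattern must match.
- Character sets: Specify which characters match at a single position.
- Modifiers: Specify how many times the previous character set is repeated.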
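Examples: Regular expressions (the matched text is noted after each sample)

`cks`
> UNIX rocks
> UNIX sucks
> UNIX is okay
- Regex that matches for the string “cks” (no anchor or modifier). It matches inside “rocks” and “sucks”; “UNIX is okay” has no match.

`apple`
> Scrapple from the apple
- Regex that matches for the string “apple”. It matches inside “Scrapple” as well as the word “apple”.

`^#*`
> UNIX rocks
> UNIX sucks
> UNIX is okay
- Regex has an anchor (`^`), a character set (`#`), and a modifier (`*`).
  - `^`: Indicates the beginning of the line.
  - `#`: Matches the pound symbol (`#`).
  - `*`: Specifies that the previous character set can appear any number of times, including zero.
- Because zero occurrences are allowed, every line matches (with an empty match at its start).

`b[eor]at`
> I tried to eat bratwurst on the boat, but was beaten to the punch by a seagull
- Regex that matches for the string “b[eor]at”: it matches “brat”, “boat”, and “beat”.
- The regex brackets (`[]`) behave similarly to, but aren’t the same as, the `[]` file substitution wildcard metacharacter.

`b[^eo]at`
> I tried to eat bratwurst on the boat, but was beaten to the punch by a seagull
- Regex that matches for the string “b[^eo]at”: it matches only “brat”.
- A `^` directly after the opening bracket excludes the listed characters; this is the `[^]` syntax.

Utility | Regular Expression Type |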
---|---|
vi | Basic |
sed | Basic |
grep | Basic |
more | Basic |
ed | Basic |
expr | Basic |
awk | Extended |
nawk | Extended |
egrep | Extended |
emacs | Emacs regular expressions |
perl | Perl regular expressions |
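
To make the Basic/Extended split concrete, here is a small sketch using `grep` (basic) and `grep -E` (the modern spelling of `egrep`). The two input lines are invented for illustration.

```shell
# In a basic regular expression, + is an ordinary character,
# so only the line that literally contains "a+" matches.
basic=$(printf 'aaa\na+\n' | grep 'a+')

# In an extended regular expression, + means "one or more of the
# preceding character set", so both lines match.
extended=$(printf 'aaa\na+\n' | grep -E 'a+')

printf 'basic:\n%s\n' "$basic"
printf 'extended:\n%s\n' "$extended"
```

The same pattern text therefore selects different lines depending on which type of regular expression the utility implements.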
Commonly-used character classes can be referred to by name.
`alpha`, `lower`, `upper`, `alnum`, `digit`, `punct`, `cntrl`
Format: [:name:]
Example: Using named character classes.
Character Class | Named Character Class Equivalent |
---|---|
[a-zA-Z] | [[:alpha:]] |
[a-zA-Z0-9] | [[:alnum:]] |
[45a-z] | [45[:lower:]] |
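
The named classes above can be tried out with `grep`. This is only a sketch: the sample lines are invented, and `-x` is used so that a line is printed only when the whole line matches.

```shell
# Five sample lines, one per line of input.
input=$(printf 'abc\n123\nab12\na-b\na45\n')

# Lines made up only of letters.
alpha_only=$(printf '%s\n' "$input" | grep -x '[[:alpha:]]*')
# Lines made up only of letters and digits.
alnum_only=$(printf '%s\n' "$input" | grep -x '[[:alnum:]]*')
# Lines made up only of 4, 5, and lowercase letters.
lower45=$(printf '%s\n' "$input" | grep -x '[45[:lower:]]*')

printf 'alpha:\n%s\n' "$alpha_only"
printf 'alnum:\n%s\n' "$alnum_only"
printf '45+lower:\n%s\n' "$lower45"
```

Note the double brackets: the outer pair delimits the character set and the inner `[:name:]` names the class.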
Examples: Some more examples
- `[aeiou]` will match any of the characters `a`, `e`, `i`, `o`, or `u`.
- `[kK]orn` will match `korn` or `Korn`.
Examples: Using ranges
- `[1-9]` is the same as `[123456789]`.
- `[abcde]` is equivalent to `[a-e]`.
- The `-` character has a special meaning in a character class, but only if it is used within a range. `[-123]` would match the characters `-`, `1`, `2`, or `3`.
Example: Combining ranges
- `[abcde123456789]` is equivalent to `[a-e1-9]`.
Example: Mixing character ranges and an explicit character
- `[A-Za-z0-9_]`: This pattern will match a single character that is a letter, number, or underscore.
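Note: The characters `]` and `-` do not have a special meaning if they directly follow a `[`.
- e.g., `[]0-9]` will match for any digit or the symbol `]`.
- Tip: In some tools (e.g., Perl) you can also use the backslash (`\`) to escape `]` and `-`, like you do for shell metacharacters. POSIX grep and sed treat a backslash inside brackets as a literal character, so placing `]` or `-` directly after the `[` is the portable choice.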
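The character set rules above can be checked with `grep -o`, which prints each match on its own line. This is a sketch only: `-o` is a common extension (GNU and BSD grep) rather than part of POSIX, and the short inputs `a]b7` and `x-2y4` are invented.

```shell
line='I tried to eat bratwurst on the boat, but was beaten to the punch by a seagull'

# b, then one of e/o/r, then at: finds brat, boat, beat.
set_matches=$(printf '%s\n' "$line" | grep -o 'b[eor]at')
# b, then anything except e or o, then at: finds only brat.
neg_matches=$(printf '%s\n' "$line" | grep -o 'b[^eo]at')
# A ] placed first in the set is literal: finds ] and 7.
bracket=$(printf 'a]b7\n' | grep -o '[]0-9]')
# A - placed first in the set is literal: finds - and 2.
dash=$(printf 'x-2y4\n' | grep -o '[-123]')

printf '%s\n' "$set_matches" "$neg_matches" "$bracket" "$dash"
```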
Instructions: Read the following regular expressions and determine what they do/match.
Questions:
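- `^T[a-z][aeiou]`
- `[-0-9]`
- `[0-9\-b\]]`
  - Hint: The `\`’s are escaping some special characters (in regex flavors that allow backslash escapes inside brackets).
- `U[nN][iI]X`
Answers:
- Three characters at the beginning of a line that start with `T`, are followed by any lowercase letter, and end with a vowel.
- Any digit or the symbol `-`.
- Any digit, the symbol `-`, the letter `b`, or the symbol `]`.
- `UniX`, `UnIX`, `UNiX`, or `UNIX`.

Anchors: Used to match at the beginning (`^`) or end (`$`) of a line (or both).
- To use `^`, place the `^` at the beginning of the expression for it to be interpreted as an anchor.
- To use `$`, place the `$` at the end of the expression for it to be interpreted as an anchor.

Pattern | Matches |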
---|---|
^A | “A” at the beginning of a line |
A$ | “A” at the end of a line |
A^ | “A^” anywhere on a line |
$A | “$A” anywhere on a line |
^^ | “^” at the beginning of a line |
$$ | “$” at the end of a line |
Examples: Using anchors
`^bingus`
> bingus bingus bingus
- Regex that matches for the string “bingus” at the start of a line (here, only the first “bingus”).

`bingus$`
> bingus bingus bingus
- Regex that matches for the string “bingus” at the end of a line (here, only the last “bingus”).
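
A short sketch of the anchor rules, counting matching lines with `grep -c`. The four sample lines and the `xA^y` input are invented for illustration.

```shell
text=$(printf 'bingus bingus bingus\nmy bingus\nbingus mine\nbingus was here\n')

# Lines that start with bingus (lines 1, 3, and 4).
at_start=$(printf '%s\n' "$text" | grep -c '^bingus')
# Lines that end with bingus (lines 1 and 2).
at_end=$(printf '%s\n' "$text" | grep -c 'bingus$')
# A ^ that is not at the beginning of the expression is an ordinary character.
literal_caret=$(printf 'xA^y\n' | grep -c 'A^')

printf 'start=%s end=%s literal=%s\n' "$at_start" "$at_end" "$literal_caret"
# start=3 end=2 literal=1
```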
Match Any Character (`.`): `.` is a special character that, by itself, matches any single character (except the end-of-line character).
Example: Using `.`

`o. ` (<- there is a space at the end)
> For me to poop on.
- Regex that matches for any 3-character string that starts with an `o` and ends with a space. Here it matches “or ” (in “For ”) and “op ” (in “poop ”).
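
The example above can be reproduced with `grep -o` (a GNU/BSD extension that prints each match on its own line). This is a sketch; `sed` is used only to mark the end of each match so the trailing space is visible.

```shell
# o, then any single character, then a space.
dot_matches=$(printf 'For me to poop on.\n' | grep -o 'o. ')

# Append | to each match to show the trailing space.
printf '%s\n' "$dot_matches" | sed 's/$/|/'
# or |
# op |
```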
Zero or More (`*`): `*` is a special character used to define zero or more occurrences of the single regular expression preceding it.
- `0*` matches zero or more `0`s; `[0-9]*` matches zero or more of any digit.
- Only acts as a modifier if it follows a character set.

Examples: Using `*`
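`ya*y`
> I got mail, yaaaaaay!
- Matches “yaaaaaay”.

`oa*o`
> For me to poop on.
- Matches “oo” (an `o`, zero occurrences of `a`, and another `o`).

`a.*e`
> Scrapple from the apple.
- Matches “apple from the apple”.
- A match will be the longest string that satisfies the regular expression.
- Notice how “apple” and “apple from the” also matched the regex, but were shorter.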
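
The three `*` examples can be checked with `grep -o` (a GNU/BSD extension), which prints only the matched text. A minimal sketch:

```shell
# y, zero or more a's, y.
star_y=$(printf 'I got mail, yaaaaaay!\n' | grep -o 'ya*y')
# o, zero or more a's, o: zero a's still counts.
star_o=$(printf 'For me to poop on.\n' | grep -o 'oa*o')
# a, any run of characters, e: the longest possible match is reported.
star_any=$(printf 'Scrapple from the apple.\n' | grep -o 'a.*e')

printf '%s\n' "$star_y" "$star_o" "$star_any"
# yaaaaaay
# oo
# apple from the apple
```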
Repetition Ranges (`\{` and `\}`): To specify a minimum and maximum number of repeats, put the minimum and maximum between `\{` and `\}`.
- e.g., `\{1,3\}`
- Only acts as a modifier if it follows a character set.

Note: Unlike shell metacharacters, backslash turns on the special meaning for `{` and `}`.

Notation (`\{ \}`): Ways to specify a range of repetitions for the preceding regular expression.
- `\{n\}`: Specify exactly n occurrences
- `\{n,\}`: Specify at least n occurrences
- `\{n,m\}`: Specify at least n occurrences but no more than m occurrences

Examples: Using `\{ \}`
Regular Expression | Matches |
---|---|
.\{0,\} | Exactly the same as .* |
a\{2,\} | Exactly the same as aaa* |
* | Any line with an asterisk |
\* | Any line with an asterisk |
\\ | Any line with a backslash |
^* | Any line starting with an asterisk |
^A* | Any line |
^A\* | Any line starting with an “A*” |
^AA* | Any line if it starts with one “A” |
^AA*B | Any line starting with one or more “A”’s followed by a “B” |
^A\{4,8\}B | Any line starting with 4, 5, 6, 7 or 8 “A”’s followed by a “B” |
^A\{4,\}B | Any line starting with 4 or more “A”’s followed by a “B” |
^A\{4\}B | Any line starting with “AAAAB” |
\{4,8\} | Any line with “{4,8}” |
A{4,8} | Any line with “A{4,8}” |
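
A few rows of the table can be verified with `grep`. This is a sketch with invented sample lines: `AAB` (two A's), `AAAAB` (four), `AAAAAAAAAB` (nine), and the literal text `A{4,8}`.

```shell
input=$(printf 'AAB\nAAAAB\nAAAAAAAAAB\nA{4,8}\n')

# 4 to 8 A's at the start of the line, then a B.
four_to_eight=$(printf '%s\n' "$input" | grep '^A\{4,8\}B')
# 4 or more A's at the start of the line, then a B.
four_plus=$(printf '%s\n' "$input" | grep '^A\{4,\}B')
# Exactly 4 A's at the start of the line, then a B.
exactly_four=$(printf '%s\n' "$input" | grep '^A\{4\}B')
# Without backslashes the braces are ordinary characters in a basic regex.
literal=$(printf '%s\n' "$input" | grep 'A{4,8}')

printf '%s\n' "$four_to_eight" "$four_plus" "$exactly_four" "$literal"
```

The line with nine A's is rejected by `^A\{4,8\}B` because the anchor pins the run of A's to the start of the line, leaving a ninth A where the B is required.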
Grouping (`\(` and `\)`): `\( \)` groups parts of an expression into subexpressions.
- Lets you apply `*` and `\{\}` to more than just the previous character.

Note: Unlike shell metacharacters, backslash turns on the special meaning for `(` and `)`.
Examples: Grouping expressions with `\(\)`
- `abc*` matches `ab`, `abc`, `abcc`, `abccc`, …
- `\(abc\)*` matches `abc`, `abcabc`, `abcabcabc`, … (and, with zero repetitions, the empty string)
- `\(abc\)\{2,3\}` matches `abcabc` or `abcabcabc`
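
The difference grouping makes can be seen with `grep -x`, which prints a line only when the whole line matches. A minimal sketch with invented sample lines:

```shell
input=$(printf 'ab\nabcc\nabc\nabcabc\nabcabcabc\nabcabcabcabc\n')

# * applies only to the c.
no_group=$(printf '%s\n' "$input" | grep -x 'abc*')
# * applies to the whole group abc.
group_star=$(printf '%s\n' "$input" | grep -x '\(abc\)*')
# The group must repeat two or three times.
group_range=$(printf '%s\n' "$input" | grep -x '\(abc\)\{2,3\}')

printf 'abc*:\n%s\n' "$no_group"
printf '\\(abc\\)*:\n%s\n' "$group_star"
printf '\\(abc\\)\\{2,3\\}:\n%s\n' "$group_range"
```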
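Backreferences (`\(`, `\)` and `\1`): Backreferences let you remember what you found and see if the same pattern occurred again:
- Mark the pattern to remember with `\(` and `\)`.
- Each `\(` starts a new pattern.
- Recall a remembered pattern with `\` followed by a single digit.

Examples: Using backreferences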
- `\([a-z]\)\1`: Match for two consecutive identical letters.
- `\([a-z]\)\1\1`: Match for three consecutive identical letters.
- `\([a-z]\)\([a-z]\)[a-z]\2\1`: Match for a 5-letter palindrome (e.g., `radar`).
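
The backreference examples can be tried against a few words. This is a sketch; the word list is invented for illustration.

```shell
words=$(printf 'hello\nradar\nworld\nzzz\n')

# Two consecutive identical letters: the "ll" in hello, the "zz" in zzz.
double=$(printf '%s\n' "$words" | grep '\([a-z]\)\1')
# Three consecutive identical letters: only zzz.
triple=$(printf '%s\n' "$words" | grep '\([a-z]\)\1\1')
# 5-letter palindrome: \2 and \1 recall the second and first letters.
palindrome=$(printf '%s\n' "$words" | grep '\([a-z]\)\([a-z]\)[a-z]\2\1')

printf '%s\n' "$double" "$triple" "$palindrome"
```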
`egrep` and `awk` use extended regular expressions.
Differences from Basic Regular Expressions:
- Special characters preceded by a backslash no longer have a special meaning: “\{”, “\}”, “\<”, “\>”, “\(”, “\)”, “\digit”.
- `?` matches 0 or 1 instances of the character set before. It is the equivalent of `\{0,1\}` in a basic regular expression.
- `+` matches 1 or more copies of the character set. It is the equivalent of `\{1,\}` in a basic regular expression.
- Use `(`, `|` and `)` to make a choice of patterns.
  - e.g., `egrep '^(From|Subject): ' "/usr/spool/mail/${USER}"` prints all `From:` and `Subject:` lines from incoming mail.

Instructions: Read the following regular expressions and determine what they do/match.
Questions:
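- `[a-zA-Z_][a-zA-Z_0-9]*`
- `\$[0-9]+(\.[0-9][0-9])?`
- `(1[012]|[1-9]):[0-5][0-9] (am|pm)`
- `<[hH][1-4]>`
Answers:
- A letter or underscore followed by zero or more letters, underscores, or digits (the shape of an identifier in many programming languages).
- A dollar amount: a `$`, one or more digits, and optionally a decimal point followed by exactly two digits.
- A 12-hour clock time: an hour from 1 to 12, a colon, minutes from 00 to 59, a space, and `am` or `pm`.
- An HTML heading tag of level 1 through 4 in either case (`<h1>`, `<H1>`, `<h2>`, …)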
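
The four exercise patterns can be checked with `grep -E` (the modern spelling of `egrep`). This is a sketch: the sample inputs are invented, and `-x` is used so that only whole-line matches are printed.

```shell
# Identifier-like names: must not start with a digit.
ident=$(printf 'count_1\n9lives\n_tmp\n' | grep -E -x '[a-zA-Z_][a-zA-Z_0-9]*')
# Dollar amounts: the cents part is optional but needs two digits.
money=$(printf '$5\n$12.50\n$3.5\n12.50\n' | grep -E -x '\$[0-9]+(\.[0-9][0-9])?')
# 12-hour clock times.
clock=$(printf '9:05 am\n12:59 pm\n13:00 pm\n10:60 am\n' |
        grep -E -x '(1[012]|[1-9]):[0-5][0-9] (am|pm)')
# Heading tags h1 to h4, either case.
heading=$(printf '<h1>\n<H4>\n<h5>\n' | grep -E -x '<[hH][1-4]>')

printf '%s\n' "$ident" "$money" "$clock" "$heading"
```

Dropping `-x` would let the patterns match anywhere inside a line, so `9lives` would then be reported because `lives` on its own fits the first pattern.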