awk

“It seemed like a good idea at the time.”
— Brian Kernighan

awk: General-purpose programmable filter that handles text as easily as numbers.

awk vs. sed:

Running awk:

  1. awk 'program' inputfile(s), or
  2. awk 'program', or
  3. awk -f program_file inputfile(s)

Examples: Running awk

# Files
$ awk 'program' input-file1 input-file2 ...
$ awk -f program-file input-file1 input-file2 ...
# Redirection and pipes
$ ls | awk 'program' > foo
# Stdin
$ awk 'program'

Etymology: Named after its inventors (Aho, Weinberger, Kernighan).

Variants:

Remember: awk is a filter; it doesn’t alter input files by itself.
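
A minimal sketch of what "filter" means in practice, assuming a POSIX awk and mktemp are available. The scratch file and the variable names (input, before, after) are made up for this illustration.

```shell
# Hypothetical scratch file; any text file would do.
input=$(mktemp)
printf 'one\ntwo\n' > "$input"

# awk writes the transformed text to stdout; the input file is not modified.
awk '{ print toupper($0) }' "$input" > "$input.new"
before=$(cat "$input")      # still the original contents

# To change the file, replace the original with the new copy yourself.
mv "$input.new" "$input"
after=$(cat "$input")

printf 'before:\n%s\nafter:\n%s\n' "$before" "$after"
rm -f "$input"
```

The redirect-then-rename step is the usual workaround when the result should end up back in the original file.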

Structure of an awk Program

General Structure of an awk Program:

BEGIN {action}
pattern {action}
...
pattern {action}
END {action}

An awk program consists of:

  1. Optional BEGIN segment
  2. pattern-action pairs
  3. Optional END segment
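
The three parts listed above can be sketched in one small program. This is only an illustration, assuming a POSIX awk on the PATH; the input numbers and the variable name total are made up.

```shell
result=$(printf '3\n4\n5\n' | awk '
BEGIN   { print "start" }          # runs once, before any input is read
$1 > 3  { total += $1 }            # runs for each record whose first field exceeds 3
END     { print "total:", total }  # runs once, after the last record
')
echo "$result"
```

Only the records 4 and 5 match the pattern, so the END segment reports a total of 9.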

Patterns and Actions

On Patterns and Actions:

Pattern-Action Structure:

Default Pattern and Action Behavior:

Patterns

Pattern: Selector that determines whether an action should be executed.

Note: ! negates the pattern.
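
A short sketch of pattern negation, assuming a POSIX awk. The file names in the sample listing are made up; real ls output would behave the same way.

```shell
# Sample listing is made up for the example.
listing='index.html
notes.txt
about.html'

# No action is given, so the default action (print the line) applies.
html=$(printf '%s\n' "$listing" | awk '/\.html$/')

# ! inverts the match: keep every line that is NOT an .html file.
other=$(printf '%s\n' "$listing" | awk '!/\.html$/')

echo "$html"
echo "$other"
```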

Actions

Action: Performed on every line that matches its respective pattern.

Example: An example awk command that filters for HTML files from ls.
ls | awk '
/\.html$/ { print }
'

Variables

Two kinds of awk variables:

  1. Built-In (predefined):
  2. User Defined:
Example: Creating a user-defined variable (prints number of lines in input)
BEGIN { sum = 0 }
{ sum ++ }
END { print sum }

Operators

Important: All numbers in awk are floating-point numbers, so expressions like 5/3 won’t get truncated into integers!
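
A quick check of the floating-point behavior, assuming a POSIX awk with the default output format left unchanged:

```shell
result=$(awk 'BEGIN {
  print 5 / 3              # floating-point division, shown with the default output format
  print int(5 / 3)         # int() truncates toward zero when an integer is wanted
  printf "%.2f\n", 5 / 3   # printf controls the number of decimals explicitly
}')
echo "$result"
```

With the default settings this prints 1.66667, 1, and 1.67.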

Arithmetic Operators

(highest precedence to lowest)

Concatenation Operator

Concatenation: Combines strings.

Example: String concatenation
$ awk 'BEGIN {
  x = "HELLO"
  print (x " WORLD")
}'
HELLO WORLD

Note on Undefined Behavior: The order of evaluation of expressions used for concatenation is undefined in the awk language, for example—

BEGIN {
x = "don’t"
print (x (x = " panic"))
}

—It’s not defined whether the expression (x = " panic") is supposed to be evaluated before or after the value of x is retrieved to produce the concatenated value.
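
One way to sidestep the ambiguity is to do the assignment in its own statement, so the concatenation has no side effects. A minimal sketch, assuming a POSIX awk; the variable name first is made up for the example.

```shell
result=$(awk 'BEGIN {
  x = "do not"
  first = x        # capture the old value before reassigning x
  x = " panic"
  print first x    # plain concatenation, no assignment inside the expression
}')
echo "$result"
```

Because each step is a separate statement, the output is "do not panic" regardless of how an implementation orders evaluation inside an expression.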

Assignment Operator

Assignment: Expression that stores a value in a variable.

Examples: Using the assignment operator

$ awk '
  BEGIN {
  thing = "food"
  predicate = "good"
  message = "this " thing " is " predicate
  print message
  foo = 1
  foo = foo + 5
  print foo
  foo = "bar"
  print foo
}
'
this food is good
6
bar

Increment and Decrement Operators

++ and --: Increment and Decrement

Examples: Using increment and decrement operators

x = 3
x++
print x
x = 4
x--
print x

Pre-Increment/Decrement (++x/--x) vs. Post-Increment/Decrement (x++/x--)

Whether you use pre- or post-increment/decrement doesn’t matter unless you’re doing wacky stuff like using the return values of the increment and decrement operators (e.g., print ++x versus x++; print x)

Examples: Using the increment operator

“Doctor, it hurts when I do this!
Then don’t do that!”

— Groucho Marx

x = 5
print ++x
x = 5
print x++
x = 6
print x += x++
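
The well-defined cases can be checked directly. A small sketch, assuming a POSIX awk; it leaves out the last expression, which modifies x twice in one statement.

```shell
result=$(awk 'BEGIN {
  x = 5
  print ++x    # pre-increment: x becomes 6, and 6 is the value printed
  x = 5
  print x++    # post-increment: 5 is printed, then x becomes 6
  print x      # confirms that x is now 6
}')
echo "$result"
```

The three lines printed are 6, 5, and 6.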

Built-In Variables

Diagram demonstrating record and field separation

Records (RS, NR)

RS: Stores the record separator.

NR: Stores the number of the current record.

Examples: Using NR (number of records)

$ awk '{ if (NR > 100) { print NR, $0; } }'

$ awk '{ if (NR % 2 == 0) { print NR, $0; } }'
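
The NR examples above use the default record separator (a newline). As a hedged sketch of what changing RS does, assuming a POSIX awk, here is an input (made up for the example) whose records are separated by semicolons:

```shell
result=$(printf 'a;b;c' | awk '
BEGIN { RS = ";" }          # records now end at each semicolon
      { print NR ": " $0 }  # NR still counts records, whatever the separator
')
echo "$result"
```

Each semicolon-delimited chunk becomes its own record, numbered 1 through 3.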

Fields (FS, NF, Positional Variables, OFS, ORS)

FS: Stores the field separator. Can be multiple characters.

NF: Stores the number of fields in the current record.

$digit: Positional variable that lets you access fields.

Note: A positional variable isn’t an ordinary variable: $ is an operator applied to an expression, so forms like $NF and $(NF-1) also work.
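
A small sketch of NF combined with the $ operator, assuming a POSIX awk; the input words are made up for the example.

```shell
result=$(echo 'alpha beta gamma' | awk '{
  # NF is the field count; $NF is the last field; $(NF-1) is the one before it.
  print NF, $NF, $(NF-1)
}')
echo "$result"
```

For this three-field line the output is "3 gamma beta".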

OFS: Stores the output field separator.

ORS: Stores the output record separator.

Example: Changing the field separator (a new FS takes effect starting with the next record)
$ cat file.txt
ONE 1 I
TWO 2 II
#Colons
THREE:3:III
FOUR:4:IV
FIVE:5:V
#Spaces
SIX 6 VI
SEVEN 7 VII
$ awk '
{
  if ($1 == "#Colons") {
    FS=":";
  } else if ($1 == "#Spaces") {
    FS=" ";
  } else {
    print $3
  }
}' file.txt
I
II
III
IV
V
VI
VII
Example: Printing fields and using OFS

Recall: The default pattern matches every line, and the default action is to print the line to stdout, so the following two programs are equivalent.

{ print }
{ print $0 }
$ awk '
BEGIN {
  { print "Hello","World" }
}'
Hello World
$ awk '
BEGIN {
  OFS=", "
  { print "Hello","World" }
}'
Hello, World
Example: Using output record separator
BEGIN {
  ORS="\r\n"
}
{
  print
} 
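
One detail the OFS examples do not show: OFS is used between comma-separated print arguments, and also whenever awk rebuilds $0 after a field assignment. A minimal sketch, assuming a POSIX awk; the input line is made up.

```shell
result=$(echo 'a b c' | awk 'BEGIN { OFS = "-" } {
  print          # $0 is unchanged, so the original spacing is printed
  $1 = $1        # assigning to any field rebuilds $0 using OFS
  print
}')
echo "$result"
```

The first print shows "a b c" and the second shows "a-b-c".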

Misc

FILENAME: Stores the name of the file being read.

Example: Using FILENAME
$ awk '
BEGIN {
  f = "";
}
{
  if (f != FILENAME) {
    f = FILENAME
    print "Now reading:", f
  }
}
' file.txt file2.txt file3.txt
Now reading: file.txt 
Now reading: file2.txt 
Now reading: file3.txt 

printf

Format: printf(format)
Format: printf(format,argument...)

awk uses the printf function to do formatted output like C.

Examples: Using printf

$ awk '
{
  printf("%s\n", $0)
}
' file.txt
ONE 1 I 
TWO 2 II 
#Colons 
THREE:3:III 
FOUR:4:IV 
FIVE:5:V 
#Spaces 
SIX 6 VI 
SEVEN 7 VII
$ awk '
{
  printf("%s (hello!) \n", $1)
}
' file.txt
ONE (hello!) 
TWO (hello!) 
#Colons (hello!) 
THREE:3:III (hello!) 
FOUR:4:IV (hello!) 
FIVE:5:V (hello!) 
#Spaces (hello!) 
SIX (hello!) 
SEVEN (hello!) 

Format Specifiers

Specifier  Meaning
%c         ASCII character
%d         Decimal integer
%e         Floating-point number (scientific/exponent notation)
%f         Floating-point number (fixed-point format)
%g         The shorter of e or f, with trailing zeros removed
%o         Octal
%s         String
%x         Hexadecimal
%%         Literal %

Sequence  Description
\a        ASCII bell (NAWK/GAWK only)
\b        Backspace
\f        Formfeed
\n        Newline
\r        Carriage return
\t        Horizontal tab
\v        Vertical tab (NAWK only)
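
A short sketch combining several of the specifiers above with width and precision modifiers, assuming a POSIX awk. The values are arbitrary.

```shell
result=$(awk 'BEGIN {
  # %-5s  : string, left-justified in 5 columns
  # %5.2f : fixed-point, 5 columns wide, 2 decimals
  # %x    : hexadecimal
  # %d%%  : integer followed by a literal percent sign
  printf "%-5s|%5.2f|%x|%d%%\n", "ab", 3.14159, 255, 50
}')
echo "$result"
```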

Pattern Selection

awk patterns are good for selecting specific lines from the input for further processing.

Examples:

$2 >= 5 { print }
$2 * $3 > 50 { printf("%6.2f for %s\n", $2 * $3, $1) }
$1 == "NYU"
$2 ~ /NYU/
$2 >= 4 || $3 >= 20
NR >= 10 && NR <= 20
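
To see two of these patterns in action, here is a sketch run against a made-up three-line table (name, then two numbers), assuming a POSIX awk:

```shell
# Sample data is invented for the example.
data='NYU 5 20
MIT 3 10
CMU 4 25'

# Pattern with no action: print every line where either condition holds.
either=$(printf '%s\n' "$data" | awk '$2 >= 4 || $3 >= 20')

# Pattern with an action: report the product only when it exceeds 50.
products=$(printf '%s\n' "$data" | awk '$2 * $3 > 50 { printf("%6.2f for %s\n", $2 * $3, $1) }')

echo "$either"
echo "$products"
```

The MIT line fails both tests, so it appears in neither result.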

User-Defined Variables

awk variables:

Example:
{
  HOURS_WORKED = $3
  # Count this employee only if they worked more than 15 hours.
  if (HOURS_WORKED > 15) x = x + 1
}
END { print x + 0, "employees worked more than 15 hours." }

{
  HOURLY_WAGE = $2
  HOURS_WORKED = $3
  pay += HOURLY_WAGE * HOURS_WORKED
}
END {
  print "Employee Statistics:"
  print "- Total pay is:", pay
  print "- Average pay is:", pay/NR
}

Control Structures

Overview:

Control Statement   Description
If Statement        Conditionally execute some awk statements.
While Statement     Loop until some condition is satisfied.
Do Statement        Do specified action while looping until some condition is satisfied.
For Statement       Another looping statement that provides initialization and increment clauses.
Switch Statement    Switch/case evaluation for conditional execution of statements based on a value.
Break Statement     Immediately exit the innermost enclosing loop (for, while, or do while).
Continue Statement  Skip to the end of the innermost enclosing loop.
Next Statement      Stop processing the current input record.
Nextfile Statement  Stop processing the current file.
Exit Statement      Stop execution of awk.

More on if statement: (Syntax)

The else keyword needs to either be on its own line—

x=64
if (x % 2 == 0)
  print "x is even"
else
  print "x is odd"

—Or the contents need to be surrounded by braces—

x=64
if (x % 2 == 0) {
  print "x is even"
} else {
  print "x is odd"
}

—Or a semi-colon must be used to separate the body of the then statement from the else statement.

x=64
if (x % 2 == 0)
  print "x is even"; else
  print "x is odd"

More on while and do-while: (Examples)

BEGIN {
  i = 1
  while (i <= 3) {
    printf("%s ", i)
    i++
  }
}

BEGIN {
  i = 1
  do {
    printf("%s ", i)
    i++
  } while (i <= 10)
}

More on for: (Examples)

BEGIN {
  for (i = 1; i <= 3; i++)
      printf("%s ", i)
}
BEGIN {
  for (i = 1; i <= 100; i *= 2)
      print i
}
for (i in username) {
  print username[i], i;
}
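
The for (i in array) form visits every index of an array, in an unspecified order. A minimal sketch, assuming a POSIX awk and sort; the array name seen and the input names are made up. The output is piped through sort so it does not depend on traversal order.

```shell
result=$(printf 'alice\nbob\nalice\n' | awk '
{ seen[$1]++ }                 # count how often each name appears
END {
  for (name in seen)           # visit each distinct index once
    print name, seen[name]
}' | sort)
echo "$result"
```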

More on switch: Example

Note: Control flow in switch statements works like it does in C. The switch statement is a gawk extension.

NR > 1 {
  printf "The %s is classified as: ",$1
      switch ($1) {
          case "apple":
              print "a fruit, pome"
              break
          case "banana":
          case "grape":
          case "kiwi":
              print "a fruit, berry"
              break
          case "raspberry":
              print "a computer, pi"
              break
          case "plum":
              print "a fruit, drupe"
              break
          case "pineapple":
              print "a fruit, fused berries (syncarp)"
              break
          case "potato":
              print "a vegetable, tuber"
              break
          default:
              print "[unclassified]"
      }
}

More on break: (Example)

num = $1
for (divisor = 2; divisor * divisor <= num; divisor++) {
  if (num % divisor == 0)
    break
}
if (num % divisor == 0)
  printf "Smallest divisor of %d is %d\n", num, divisor
else
  printf "%d is prime\n", num

More on continue: (Example)

BEGIN {
  for (x = 0; x <= 20; x++) {
      if (x == 5)
          continue
      printf "%d ", x
  }
  print ""
}

More on next: Example

The next statement forces awk to immediately stop processing the current record and go on to the next one.

NF != 4 {
  printf("%s:%d: skipped: NF != 4\n", FILENAME, FNR) > "/dev/stderr"
  next
}
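
A stripped-down sketch of the same idea, assuming a POSIX awk: records without exactly four fields are skipped by next, so the second rule never sees them. The input lines are made up, and the warning message is left out to keep the output simple.

```shell
result=$(printf 'a b c d\nx y\n1 2 3 4\n' | awk '
NF != 4 { next }      # skip this record; no later rule runs for it
        { print $1 }  # reached only for records with exactly four fields
')
echo "$result"
```

Only the first and third lines reach the print rule.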

More on exit: Example

BEGIN {
  if (("date" | getline date_now) <= 0) {
      print "Can't get system date" > "/dev/stderr"
      exit 1
  }
  print "current date is", date_now
  close("date")
}

Built-In Functions

Name   Function     Variant
cos    Cosine       GAWK,AWK,NAWK
exp    Exponent     GAWK,AWK,NAWK
int    Integer      GAWK,AWK,NAWK
log    Logarithm    GAWK,AWK,NAWK
sin    Sine         GAWK,AWK,NAWK
sqrt   Square Root  GAWK,AWK,NAWK
atan2  Arctangent   GAWK,NAWK
rand   Random       GAWK,NAWK
srand  Seed Random  GAWK,NAWK

Function                       Variant
index(string,search)           GAWK,AWK,NAWK
length(string)                 GAWK,AWK,NAWK
split(string,array,separator)  GAWK,AWK,NAWK
substr(string,position)        GAWK,AWK,NAWK
substr(string,position,max)    GAWK,AWK,NAWK
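
A brief sketch exercising several of the functions listed above, assuming a POSIX awk. The strings and the array name parts are made up for the example.

```shell
result=$(awk 'BEGIN {
  s = "hello world"
  # length: character count; index: 1-based position of the substring;
  # substr: from a position, optionally limited to a maximum length.
  print length(s), index(s, "world"), substr(s, 1, 5), substr(s, 7)

  # split fills the array and returns the number of pieces.
  n = split("a:b:c", parts, ":")
  print n, parts[2]

  # int truncates toward zero; sqrt is the square root.
  print int(7 / 2), sqrt(16)
}')
echo "$result"
```

Positions in index and substr start at 1, not 0, which is why "world" is found at position 7.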