awk

“It seemed like a good idea at the time.”
— Brian Kernighan

awk: General-purpose programmable filter that handles text as easily as numbers.

awk vs. sed:

Running awk:

  1. awk 'program' inputfile(s), or
  2. awk 'program', or
  3. awk -f program_file inputfile(s)

Examples: Running awk

# Files
$ awk 'program' input-file1 input-file2 ...
$ awk -f program-file input-file1 input-file2 ...
# Redirection and pipes
$ ls | awk 'program' > foo
# Stdin
$ awk 'program'

Etymology: Named after its inventors (Aho, Weinberger, Kernighan).

Variants:

Remember: awk is a filter. It reads input and writes results to stdout; it doesn’t alter input files by itself.
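
Example (a minimal sketch of the filter behavior, assuming a scratch file with the hypothetical name notes.txt): awk writes the transformed text to stdout, the shell redirects it to a second file, and the input file keeps its original contents.

```shell
# Work in a throwaway directory so nothing real is touched.
tmpdir=$(mktemp -d)
printf 'alpha\nbeta\n' > "$tmpdir/notes.txt"     # hypothetical input file

# awk only writes to stdout; the shell redirection creates the new file.
awk '{ print NR ": " $0 }' "$tmpdir/notes.txt" > "$tmpdir/numbered.txt"

original=$(cat "$tmpdir/notes.txt")      # unchanged input
numbered=$(cat "$tmpdir/numbered.txt")   # filtered copy
printf '%s\n' "$numbered"
rm -r "$tmpdir"
```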

Structure of an awk Program

General Structure of an awk Program:

BEGIN {action}
pattern {action}
...
pattern {action}
END {action}

An awk program consists of:

  1. Optional BEGIN segment
  2. pattern-action pairs
  3. Optional END segment
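
Example (a small sketch, with made-up input lines): all three parts in one program. BEGIN runs once before any input, the pattern-action pair runs for each matching line, and END runs once after the last line.

```shell
out=$(printf 'apple\nbanana\navocado\n' | awk '
BEGIN { print "start" }                        # runs before the first line
/^a/  { count++ }                              # runs for lines starting with "a"
END   { print count, "lines start with a" }    # runs after the last line
')
printf '%s\n' "$out"
```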

Patterns and Actions

On Patterns and Actions:

Pattern-Action Structure:

Default Pattern and Action Behavior:

Patterns

Pattern: Selector that determines whether an action should be executed.

Note: ! negates the pattern.
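
Example (a sketch with hypothetical file names): negating a regular-expression pattern selects the lines that do not match it.

```shell
files='index.html
notes.txt
about.html'

# Keep only the names that do NOT end in .html.
out=$(printf '%s\n' "$files" | awk '!/\.html$/ { print }')
printf '%s\n' "$out"
```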

Actions

Action: Performed on every line that matches its respective pattern.

Example: An example awk command that filters for HTML files from ls.

ls | awk '
/\.html$/ { print }
'

Variables

Two kinds of awk variables:

  1. Built-In (predefined):
  2. User Defined:

Example: Creating a user-defined variable (prints number of lines in input)

BEGIN { sum = 0 }
{ sum ++ }
END { print sum }

Operators

Important: All numbers in awk are floating-point, so expressions like 5/3 aren’t truncated to integers.
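
Example (a short sketch): division keeps the fractional part, and int() can be used when truncation is actually wanted. The printed precision assumes the default output format for numbers.

```shell
out=$(awk 'BEGIN {
  print 5 / 3          # fractional part is kept
  print int(5 / 3)     # int() truncates toward zero
  print 7 / 2
}')
printf '%s\n' "$out"
```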

Arithmetic Operators

(highest precedence to lowest)

Concatenation Operator

Concatenation: Combines strings.

Example: String concatenation

$ awk 'BEGIN {
  x = "HELLO"
  print (x " WORLD")
}'
HELLO WORLD

Note on Undefined Behavior: The order of evaluation of expressions used for concatenation is undefined in the awk language, for example—

BEGIN {
x = "don’t"
print (x (x = " panic"))
}

—It’s not defined whether the expression (x = " panic") is supposed to be evaluated before or after the value of x is retrieved to produce the concatenated value.
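
Example (one possible way to sidestep the ambiguity, not the only one): read the old value into a separate variable and do the assignment in its own statement, so the result no longer depends on evaluation order. The variable name old is made up for this sketch.

```shell
out=$(awk 'BEGIN {
  x = "do not"
  old = x              # read the current value first
  x = " panic"         # then assign in a separate statement
  print (old x)
}')
printf '%s\n' "$out"
```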

Assignment Operator

Assignment: Expression that stores a value in a variable.

Examples: Using the assignment operator

$ awk '
  BEGIN {
  thing = "food"
  predicate = "good"
  message = "this " thing " is " predicate
  print message
  foo = 1
  foo = foo + 5
  print foo
  foo = "bar"
  print foo
}
'
this food is good
6
bar

Increment and Decrement Operators

++ and --: Increment and Decrement

Examples: Using increment and decrement operators

x = 3
x++
print x
x = 4
x--
print x

Pre-Increment/Decrement (++x/--x) vs. Post-Increment/Decrement (x++/x--)

Whether you use pre- or post-increment/decrement doesn’t matter unless you’re doing wacky stuff like using the value of the expression itself (e.g., print ++x versus print x++): ++x yields the new value, x++ yields the old one.

Examples: Using the increment operator

“Doctor, it hurts when I do this!
Then don’t do that!”

— Groucho Marx

x = 5
print ++x
x = 5
print x++
x = 6
print x += x++
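
Example (a runnable sketch of the two well-defined cases only; the x += x++ form is left out because its result depends on evaluation order):

```shell
out=$(awk 'BEGIN {
  x = 5; print ++x     # pre-increment: x becomes 6, then 6 is printed
  x = 5; print x++     # post-increment: 5 is printed, then x becomes 6
  print x              # x is 6 after the post-increment
}')
printf '%s\n' "$out"
```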

Environmental Variables

Diagram demonstrating record and field separation

Records (RS, NR)

RS: Stores the record separator.

NR: Stores the number of the current record.

Examples: Using NR (number of records)

$ awk '{ if (NR > 100) { print NR, $0 } }'
$ awk '{ if (NR % 2 == 0) { print NR, $0 } }'
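
Example (a sketch with made-up input): setting RS to a semicolon makes awk treat each semicolon-delimited chunk as a record, and NR counts those records rather than lines. A single-character RS is used here because it behaves the same across awk variants.

```shell
out=$(printf 'red;green;blue' | awk '
BEGIN { RS = ";" }          # records now end at semicolons instead of newlines
{ print NR ": " $0 }
')
printf '%s\n' "$out"
```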

Fields (FS, NF, Positional Variables, OFS, ORS)

FS: Stores the field separator. Can be multiple characters.

NF: Stores the number of fields.

$digit: Positional variable that lets you access fields.

Note: A positional variable isn’t a special variable; $ is an operator that takes the field number from the expression after it, so forms like $NF and $(NF-1) also work.
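
Example (a short sketch with a made-up four-field line): because the dollar sign works on any expression, the field number can come from NF, from arithmetic, or from a user-defined variable.

```shell
out=$(printf 'a b c d\n' | awk '{
  print $NF            # last field: NF is 4 here, so this is $4
  print $(NF - 1)      # an arithmetic expression selects the third field
  i = 2
  print $i             # the field number can come from a variable
}')
printf '%s\n' "$out"
```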

OFS: Stores the output field separator.

ORS: Stores the output record separator.

Example: Changing the field separator (a new FS applies from the next record onward)

$ cat file.txt
ONE 1 I
TWO 2 II
#Colons
THREE:3:III
FOUR:4:IV
FIVE:5:V
#Spaces
SIX 6 VI
SEVEN 7 VII
$ awk '
{
  if ($1 == "#Colons") {
    FS=":";
  } else if ($1 == "#Spaces") {
    FS=" ";
  } else {
    print $3
  }
}' file.txt
I
II
III
IV
V
VI
VII
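
Example (a sketch for the simpler case where the separator is known before any input is read): FS can be set in a BEGIN block or with the -F command-line option, and both give the same result. The two sample lines are taken from file.txt above.

```shell
data='THREE:3:III
FOUR:4:IV'

with_begin=$(printf '%s\n' "$data" | awk 'BEGIN { FS = ":" } { print $3 }')
with_flag=$(printf '%s\n' "$data" | awk -F: '{ print $3 }')
printf '%s\n' "$with_begin"
```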

Example: Printing fields and using OFS

Recall: An omitted pattern matches every line, and an omitted action prints the line to stdout, so these two rules are equivalent:

{ print }
{ print $0 }
$ awk '
BEGIN {
  { print "Hello","World" }
}'
Hello World
$ awk '
BEGIN {
  OFS=", "
  { print "Hello","World" }
}'
Hello, World

Example: Using output record separator

BEGIN {
  ORS="\r\n"
}
{
  print
} 
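
Example (a sketch using a visible separator, since the effect of "\r\n" is hard to see on a terminal): with ORS set to a semicolon, every print ends with a semicolon instead of a newline.

```shell
out=$(printf 'a\nb\nc\n' | awk 'BEGIN { ORS = ";" } { print }')
printf '%s\n' "$out"
```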

Misc

FILENAME: Stores the name of the file being read.

Example: Using FILENAME

$ awk '
BEGIN {
  f = "";
}
{
  if (f != FILENAME) {
    f = FILENAME
    print "Now reading:", f
  }
}
' file.txt file2.txt file3.txt
Now reading: file.txt 
Now reading: file2.txt 
Now reading: file3.txt 

printf

Format: printf(format)
Format: printf(format,argument...)

awk uses the printf function to do formatted output like C.

Examples: Using printf

$ awk '
{
  printf("%s\n", $0)
}
' file.txt
ONE 1 I 
TWO 2 II 
#Colons 
THREE:3:III 
FOUR:4:IV 
FIVE:5:V 
#Spaces 
SIX 6 VI 
SEVEN 7 VII
$ awk '
{
  printf("%s (hello!) \n", $1)
}
' file.txt
ONE (hello!) 
TWO (hello!) 
#Colons (hello!) 
THREE:3:III (hello!) 
FOUR:4:IV (hello!) 
FIVE:5:V (hello!) 
#Spaces (hello!) 
SIX (hello!) 
SEVEN (hello!) 

Format Specifiers

Specifier   Meaning
%c          ASCII character
%d          Decimal integer
%e          Floating-point number (scientific notation)
%f          Floating-point number (fixed-point format)
%g          The shorter of %e or %f, with trailing zeros removed
%o          Octal
%s          String
%x          Hexadecimal
%%          Literal %

Sequence    Description
\a          ASCII bell (NAWK/GAWK only)
\b          Backspace
\f          Formfeed
\n          Newline
\r          Carriage return
\t          Horizontal tab
\v          Vertical tab (NAWK only)
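
Example (a sketch exercising several of the specifiers above, with arbitrary values; the exact exponent digits of %e assume a typical C library):

```shell
out=$(awk 'BEGIN {
  printf("%d|%5d|%-5d|\n", 42, 42, 42)      # plain, right-aligned, left-aligned
  printf("%.2f|%e\n", 3.14159, 1234.5)      # fixed point and exponent notation
  printf("%o|%x|%c|%s|%%\n", 8, 255, "A", "str")
}')
printf '%s\n' "$out"
```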

Pattern Selection

awk patterns are good for selecting specific lines from the input for further processing.

Examples:

$2 >= 5 { print }
$2 * $3 > 50 { printf("%6.2f for %s\n", $2 * $3, $1) }
$1 == "NYU"
$2 ~ /NYU/
$2 >= 4 || $3 >= 20
NR >= 10 && NR <= 20
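
Example (a runnable sketch of a few of the patterns above, assuming hypothetical employee records with the columns name, hourly rate, and hours worked; the NR range is narrowed to fit the small sample):

```shell
data='Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18'

high_rate=$(printf '%s\n' "$data" | awk '$2 >= 5 { print $1 }')
big_pay=$(printf '%s\n' "$data" | awk '$2 * $3 > 50 { printf("%.2f for %s\n", $2 * $3, $1) }')
by_record=$(printf '%s\n' "$data" | awk 'NR >= 2 && NR <= 3 { print $1 }')
printf '%s\n' "$high_rate" "$big_pay" "$by_record"
```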

User-Defined Variables

awk variables:

Example:

# Count employees who worked more than 15 hours
{
  HOURS_WORKED = $3
  if (HOURS_WORKED > 15) { x = x + 1 }
}
END { print x + 0, "employees worked more than 15 hours." }   # x + 0 prints 0 if nobody qualified

# Total and average pay
{
  HOURLY_WAGE = $2
  HOURS_WORKED = $3
  pay += HOURLY_WAGE * HOURS_WORKED
}
END {
  print "Employee Statistics:"
  print "- Total pay is:", pay
  print "- Average pay is:", pay / NR
}

Control Structures

Overview:

Control Statement     Description
If Statement          Conditionally execute some awk statements.
While Statement       Loop until some condition is satisfied.
Do Statement          Do specified action while looping until some condition is satisfied.
For Statement         Another looping statement that provides initialization and increment clauses.
Switch Statement      Switch/case evaluation for conditional execution of statements based on a value.
Break Statement       Immediately exit the innermost enclosing loop (for, while, or do while).
Continue Statement    Skip to the end of the innermost enclosing loop.
Next Statement        Stop processing the current input record.
Nextfile Statement    Stop processing the current file.
Exit Statement        Stop execution of awk.

More on if statement: (Syntax)

A single-statement if body without braces must be terminated before else. Put the else keyword on its own line:

x=64
if (x % 2 == 0)
  print "x is even"
else
  print "x is odd"

Or surround the bodies with braces:

x=64
if (x % 2 == 0) {
  print "x is even"
} else {
  print "x is odd"
}

Or use a semicolon to separate the body of the if branch from the else keyword:

x=64
if (x % 2 == 0)
  print "x is even"; else
  print "x is odd"

More on while and do-while: (Examples)

BEGIN {
  i=1
  while (i <= 3) {
      printf("%s ", i)
      i++
  }
}
BEGIN {
  i=1
  do {
      printf("%s ", i)
      i++
  } while (i <= 10)
}

More on for: (Examples)

BEGIN {
  for (i = 1; i <= 3; i++)
      printf("%s ", i)
}
BEGIN {
  for (i = 1; i <= 100; i *= 2)
      print i
}
for (i in username) {
  print username[i], i;
}
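
Example (a self-contained sketch of the for-in loop; the username array and its two entries are made up here). The order in which for-in visits the indices is not specified, so the output is piped through sort to make it predictable.

```shell
out=$(awk 'BEGIN {
  username["1001"] = "alice"     # hypothetical index -> name entries
  username["1002"] = "bob"
  for (i in username) {
    print username[i], i
  }
}' | sort)
printf '%s\n' "$out"
```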

More on switch: Example

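Note: Control flow in switch statements works as it does in C: without a break, execution falls through to the next case. The switch statement is a gawk extension.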

NR > 1 {
  printf "The %s is classified as: ",$1
      switch ($1) {
          case "apple":
              print "a fruit, pome"
              break
          case "banana":
          case "grape":
          case "kiwi":
              print "a fruit, berry"
              break
          case "raspberry":
              print "a computer, pi"
              break
          case "plum":
              print "a fruit, drupe"
              break
          case "pineapple":
              print "a fruit, fused berries (syncarp)"
              break
          case "potato":
              print "a vegetable, tuber"
              break
          default:
              print "[unclassified]"
      }
}
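
Example (a sketch of one way to get the same effect where switch is unavailable, since it is a gawk extension): an if/else chain, with the stacked cases folded into one condition using ||. Only a few of the cases are reproduced, and the shell function name classify is made up for this sketch.

```shell
classify() {
  awk '{
    if ($1 == "apple")
      kind = "a fruit, pome"
    else if ($1 == "banana" || $1 == "grape" || $1 == "kiwi")
      kind = "a fruit, berry"
    else
      kind = "[unclassified]"
    print $1 ": " kind
  }'
}

out=$(printf 'apple\nkiwi\nrock\n' | classify)
printf '%s\n' "$out"
```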

More on break: (Example)

num = $1
for (divisor = 2; divisor * divisor <= num; divisor++) {
  if (num % divisor == 0)
    break
}
if (num % divisor == 0)
  printf "Smallest divisor of %d is %d\n", num, divisor
else
  printf "%d is prime\n", num

More on continue: (Example)

BEGIN {
  for (x = 0; x <= 20; x++) {
      if (x == 5)
          continue
      printf "%d ", x
  }
  print ""
}

More on next: Example

The next statement forces awk to immediately stop processing the current record and go on to the next one.

NF != 4 {
  printf("%s:%d: skipped: NF != 4\n", FILENAME, FNR) > "/dev/stderr"
  next
}
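
Example (a runnable sketch with made-up input): once next runs for a record, the rules that follow never see that record.

```shell
out=$(printf 'a b c d\nshort line\ne f g h\n' | awk '
NF != 4 { next }            # skip records that do not have exactly 4 fields
{ print $1, $4 }            # only reached for well-formed records
')
printf '%s\n' "$out"
```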

More on exit: Example

BEGIN {
  if (("date" | getline date_now) <= 0) {
      print "Can't get system date" > "/dev/stderr"
      exit 1
  }
  print "current date is", date_now
  close("date")
}
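
Example (a minimal sketch): the value given to exit becomes the exit status of awk, which the calling shell can inspect. The status value 3 is arbitrary.

```shell
status=0
awk 'BEGIN { exit 3 }' || status=$?      # capture the non-zero exit status
printf 'awk exited with status %s\n' "$status"
```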

Built-In Functions

Name     Function       Variant
cos      Cosine         GAWK, AWK, NAWK
exp      Exponent       GAWK, AWK, NAWK
int      Integer        GAWK, AWK, NAWK
log      Logarithm      GAWK, AWK, NAWK
sin      Sine           GAWK, AWK, NAWK
sqrt     Square root    GAWK, AWK, NAWK
atan2    Arctangent     GAWK, NAWK
rand     Random         GAWK, NAWK
srand    Seed random    GAWK, NAWK

Function                         Variant
index(string,search)             GAWK, AWK, NAWK
length(string)                   GAWK, AWK, NAWK
split(string,array,separator)    GAWK, AWK, NAWK
substr(string,position)          GAWK, AWK, NAWK
substr(string,position,max)      GAWK, AWK, NAWK
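
Example (a short sketch calling several of the functions listed above on arbitrary values; the variable names s, n, and parts are made up):

```shell
out=$(awk 'BEGIN {
  s = "hello world"
  print length(s)                  # number of characters
  print index(s, "world")          # 1-based position of the match, 0 if absent
  print substr(s, 7)               # from position 7 to the end
  print substr(s, 1, 5)            # at most 5 characters from position 1
  n = split("a:b:c", parts, ":")   # fills parts[1..n] and returns n
  print n, parts[1], parts[3]
  print int(7.9), sqrt(16), exp(0)
}')
printf '%s\n' "$out"
```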