- Program Structure
- Fields
- Pattern Matching
- Variables And Expressions
- Strings
- System Variables
- Processing Multiline Records
- Special Processing: Based On Row Number And Using next And exit
- Formatting Output: printf
- Passing Parameters Into a Script
- Conditional Statements
- Looping
- Arrays
- Standard Functions
- User Functions
- getline - Read the Data From Files and Pipes
- Output to Files and Pipes
- system() - Execute System Commands
- Reference Sources
BEGIN {} # executed once, optional
/pattern/ { } # executed for each input line matching pattern
END {} # executed once, optional
Where pattern can be:
- /regular expression/
- relational expression ($2 > $1)
- pattern-matching expression ($2 ~ /test/, $2 !~ /test/)
Patterns can be complex:
- pattern && pattern - logical AND
- pattern || pattern - logical OR
- ! pattern - logical NOT
- pattern ? pattern : pattern - conditional operator, like in C
- pattern1, pattern2 - a range pattern, matches all input records starting with a record that matches pattern1, and continuing until a record that matches pattern2, inclusive.
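The range pattern is the least obvious of these forms, so here is a minimal sketch of it. The markers START and END and the helper name range_demo are made up for this illustration.

```shell
# range_demo is a hypothetical helper name used only for this illustration.
range_demo() {
  # Print every record from the first line matching /START/
  # through the next line matching /END/, both included.
  awk '/START/, /END/ { print NR ": " $0 }'
}

printf 'skip\nSTART\nkeep\nEND\nskip again\n' | range_demo
# 2: START
# 3: keep
# 4: END
```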
Each input record is split into fields that are accessible via $N variables:
- $0 - the whole record (line)
- $1, $2, ... - fields
The default separator is whitespace (any run of spaces or tabs).
Example:
echo a b c d | awk '{ print $1 $2 }'
ab
Any expression that evaluates to an integer can be used as a field number. The following outputs c:
echo a b c d | awk 'BEGIN { one = 1; two = 2 }
{ print $(one + two) }'
c
The default field separator can be changed with the -F command line flag: awk -F: uses ':' as the separator, and awk -F"\t" … uses a tab.
It can also be changed inside the script by setting the FS variable:
echo a,b,c,d | awk 'BEGIN { FS="," }
{ print $2 }'
b
The field separator can be a regular expression:
echo a_b:c d | awk 'BEGIN { FS="[_: ]" }
{ print $1 "-" $2 "-" $3 "-" $4}'
a-b-c-d
echo '1

test' | awk '
/[0-9]+/ { print "That is an integer" }
/[A-Za-z]+/ { print "This is a string" }
/^$/ { print "This is a blank line" }
{ print }
'
That is an integer
1
This is a blank line

This is a string
test
We can match a specific field (by default each record is split into fields on whitespace):
echo '1 test description
2 script description' | awk '
$2 ~ /script/ { print $1 ", " $3 }'
2, description
Reverse the meaning of the rule by using bang-tilde (!~), as in $2 !~ /script/:
echo '1 test description
2 script description' | awk '
$2 !~ /script/ { print $1 ", " $3 }'
1, description
It is possible to use comparison operators too. For example, NF == 6 { print $1, $6 } makes sure that we have 6 fields before printing them:
echo '1 2 3 4 5 6
1 2 3
1 2 3 4 5' | awk '
NF == 6 { print $1, $6 }'
1 6
More complex expressions can be used as well, for example NR > 1 && (NF >= 2 || $1 ~ /\t/).
There are two types of constants: string or numeric ("red" or 1).
Variables:
- assignment: name = value
- names are case sensitive
- an uninitialized variable has the value zero (or the empty string when used as a string)
- each variable has both a string and a numeric value
- strings that are not numbers evaluate to zero
- There are the usual arithmetic operators: +, -, *, /, % and ^.
- There are assignment operators: =, +=, -=, ++ (both prefix and postfix), --.
echo '1

2' | awk '
# Count blank lines.
/^$/ {
++x # Default value is 0, so we don't initialize x, just start incrementing
}
END {
print x
}'
1
Average calculation:
echo 'john 85 92 78 94 88
andrea 89 90 75 90 86
jasper 84 88 80 92 84' | awk '
# average five grades
{ total = $2 + $3 + $4 + $5 + $6
avg = total / 5
print $1, avg }'
john 87.4
andrea 86
jasper 85.6
We can use an expression as the pattern that selects which records to process, for example:
echo 'john 10 15
andrea 5 3
jasper 2 20' | awk '
# print only lines where $2 + $3 > 20
$2 + $3 > 20 { print $1 " " $2+$3}
'
john 25
jasper 22
A string must be quoted in an expression.
Strings are concatenated by writing them next to each other, separated by a space:
# Assigns "HelloWorld" to the variable z.
z = "Hello" "World"
Strings can make use of the escape sequences:
- \a Alert character, usually the ASCII BEL character
- \b Backspace
- \f Formfeed
- \n Newline
- \r Carriage return
- \t Horizontal tab
- \v Vertical tab
- \ddd Character represented as a 1 to 3 digit octal value
- \xhex Character represented as a hexadecimal value
- \c Any literal character c (e.g., \" for ")
echo a_b:c d | awk 'BEGIN { FS="[_: ]" }
{ print $1 "\v" $2 "\t" $3 "\"" $4}'
a
b c"d
- FS - input field separator (space by default).
  - Note: FS is usually assigned in the BEGIN block, but it can be changed anywhere. The new FS value takes effect on the next line (not on the current line).
- OFS - output field separator (space by default).
- NF - number of fields (so { print $NF } outputs the last field).
  - Note: NF is mutable and can be changed (as well as $0 or the fields).
- RS - record separator, newline by default.
- ORS - output record separator, newline by default.
- NR - current record number.
- FILENAME - current file name.
- FNR - current record number in the current file (useful when there are many files).
- CONVFMT - printf-style number-to-string conversion format, "%.6g" by default.
  - Used when we do str = (5.5 + 3.2) " is a nice value"
- OFMT - printf-style number-to-string conversion format used when a number is printed.
  - Used when we do print 5.5
- ARGC - the number of command line arguments (does not include options to awk).
- ARGIND - the index in ARGV of the current file being processed (gawk only).
- ARGV - array of command line arguments indexed from 0 to ARGC - 1.
  - Dynamically changing the contents of ARGV can control the files used for data.
- ENVIRON - array of environment variables.
See more in man awk.
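To see a few of these variables working together, here is a small sketch. It relies on the fact that assigning to any field makes awk rebuild $0 using OFS. The helper name show_vars is made up for this illustration.

```shell
# show_vars is a hypothetical helper name used only for this illustration.
show_vars() {
  awk 'BEGIN { OFS = "-" }
       {
         $1 = $1                  # touching a field rebuilds $0 with OFS
         print NR, NF, $NF, $0    # record number, field count, last field, record
       }'
}

printf 'a b c\nd e\n' | show_vars
# 1-3-c-a-b-c
# 2-2-e-d-e
```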
In gawk, the SYMTAB variable is an array whose indices are the names of all currently defined global variables and arrays in the program. The array may be used for indirect access to read or write the value of a variable:
foo = 5
SYMTAB["foo"] = 4
print foo # prints 4
The isarray() function may be used to test if an element in SYMTAB is an array. You may not use the delete statement with the SYMTAB array.
Example - average calculation with auto-numbering:
echo 'john 85 92 78 94 88
andrea 89 90 75 90 86
jasper 84 88 80 92 84' | awk '
# We will have tabs as output fields separator.
BEGIN { OFS = "\t" }
# average five grades
{
total = $2 + $3 + $4 + $5 + $6
avg = total / 5
print NR ".", $1, avg
}
END {
print ""
print NR, "records processed."
}'
1. john 87.4
2. andrea 86
3. jasper 85.6
3 records processed.
echo 'John Robinson
Boston MA 01760

Phyllis Chapman
Amesbury MA 01881' | awk '
# set field separator to a newline and record separator to the empty string
BEGIN { FS = "\n"; RS = "" }
{ print $1, $NF}'
John Robinson Boston MA 01760
Phyllis Chapman Amesbury MA 01881
Also split the output to multiple lines:
echo 'John Robinson
Boston MA 01760

Phyllis Chapman
Amesbury MA 01881' | awk '
# set field separator to a newline and record separator to the empty string
BEGIN { FS = "\n"; RS = ""; OFS = "\n"; ORS = "\n\n" }
{ print $1, $NF}'
John Robinson
Boston MA 01760

Phyllis Chapman
Amesbury MA 01881
We can use an expression like NR == 1 to apply a special rule to the first record.
Inside that rule we can use next to skip the following rules:
echo '1000
125 Market -125.45
126 Hardware Store -34.95156' | awk '
BEGIN { FS="\t" }
# First line is the initial balance.
NR == 1 {
balance=$1;
print "Initial balance: ", balance;
next # get the next record and start over (do not proceed with next rule)
}
# Update balance.
{ balance += $3 }
# Show the result.
END { print "Final balance: ", balance }'
Initial balance: 1000
Final balance: 839.598
The next statement causes the next line to be read and resumes execution from the top of the script.
The nextfile statement stops processing the current file and moves to the next file.
The exit statement exits the main loop and passes control to the END section (it stops execution if used in END or if there is no END section).
The exit statement takes an optional expression as an argument. It is used as the script exit status code; by default the exit status is 0.
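A minimal sketch of how next and exit can work together is shown below. The "ERROR" keyword, the '#' comment convention and the helper name first_error are assumptions made up for this illustration.

```shell
# first_error is a hypothetical helper name used only for this illustration.
first_error() {
  awk '
    /^#/ { next }                # skip comment lines; later rules never see them
    $1 == "ERROR" {
      print "line " NR ": " $0
      found = 1
      exit                       # leave the main loop; END still runs
    }
    END { exit (found ? 1 : 0) } # the script exit status reports the result
  '
}

printf '# note\nok 1\nERROR disk\nERROR net\n' | first_error || echo "exit status: $?"
# line 3: ERROR disk
# exit status: 1
```

Only the first matching record is printed, because exit stops reading input as soon as it is reached.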
A similar example with an interesting trick to remove the header and footer (source: https://stackoverflow.com/a/7148801/4612064).
Here we extract a list of file names from the 7z l output, which looks like this:
7-Zip [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,4 CPUs)
Listing archive: output/folder/7z_1.7z
--
Path = output/folder/7z_1.7z
Type = 7z
Solid = -
Blocks = 0
Physical Size = 141
Headers Size = 141
Date Time Attr Size Compressed Name
------------------- ----- ------------ ------------ ------------------------
2017-11-10 17:33:18 ....A 0 0 (E).txt
2017-11-10 17:33:18 ....A 0 0 (J) [!].txt
2017-11-10 17:33:18 ....A 0 0 (J).txt
2017-11-10 17:33:18 ....A 0 0 (U) [!].txt
2017-11-10 17:33:18 ....A 0 0 (U).txt
------------------- ----- ------------ ------------ ------------------------
0 0 5 files, 0 folders
And the awk script to get only file names:
/----/ {p = (p + 1) % 2; next}
$NF == "Name" {pos = index($0,"Name")}
p {print substr($0,pos)}
Initially p is zero, so the last rule with print does nothing.
The second rule calculates the position where the file name starts (by checking the position of "Name" in the header).
Once we meet the first "----", the p value becomes 1 (1 % 2 = 1) and we start processing file names.
When we get to the next "----", the p value becomes 0 (2 % 2 = 0) and we stop the processing.
Syntax:
printf ( format-expression [, arguments] )
The parentheses are optional.
Format specifiers:
- c ASCII character
- d Decimal integer
- i Decimal integer. (Added in POSIX)
- e Floating-point format ([-]d.precisione[+-]dd)
- E Floating-point format ([-]d.precisionE[+-]dd)
- f Floating-point format ([-]ddd.precision)
- g e or f conversion, whichever is shortest, with trailing zeros removed
- G E or f conversion, whichever is shortest, with trailing zeros removed
- o Unsigned octal value
- s String
- u Unsigned decimal value
- x Unsigned hexadecimal number. Uses a-f for 10 to 15
- X Unsigned hexadecimal number. Uses A-F for 10 to 15
- % Literal %
A format expression can take three optional modifiers following "%" and preceding the format specifier:
%-width.precision format-specifier
- width - numeric value; the contents will be right-justified, use '-' to get left-justification.
  - echo '5' | awk '{ printf("%20s", $1) }' → "                   5"
  - echo '5' | awk '{ printf("%-20s", $1) }' → "5                   "
- precision:
  - for decimal or floating-point values - the number of digits to the right of the decimal point (for g and G, the number of significant digits);
  - for string values - the maximum number of characters that will be printed.
echo '3.1415' | awk '{ printf("%.3g", $1) }'
3.14
Default format: %.6g.
Width and precision can be specified dynamically:
echo '3.1415' | awk '{ printf("%*.*g", 5, 3, $1) }'
3.14
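As a small sketch of how the justification flag, width and precision combine in one format string, the helper below prints a three-column row. The helper name format_row and the sample data are made up for this illustration.

```shell
# format_row is a hypothetical helper name used only for this illustration.
format_row() {
  # %-10s : string, left-justified in a 10-character column
  # %6.2f : number, right-justified in 6 characters with 2 decimals
  # %.3s  : string, truncated to at most 3 characters
  awk '{ printf("%-10s|%6.2f|%.3s\n", $1, $2, $3) }'
}

echo 'pi 3.14159 abcdef' | format_row
# pi        |  3.14|abc
```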
Variables can be passed using var=value parameters:
awk 'script' var=value inputfile
For example:
$ awk -f scriptfile high=100 low=60 datafile
# Use an env variable as the value:
$ awk '{ ... }' directory=$cwd file1 ...
# Use `pwd` output as the value:
$ awk '{ ... }' directory=`pwd` file1 ...
It is possible to use command-line parameters to define system variables:
$ awk '{ print NR, $0 }' OFS='. ' names
Note: command-line parameters are not available in the BEGIN procedure, because BEGIN is evaluated before the input is read.
awk 'BEGIN {
# Here `n` is not set.
print "Begin: " n
}
{
# Will print "Reading the first file" for each line in `test` file.
if (n == 1) print "Reading the first file"
# Will print "Reading the second file" for each line in `test2` file.
if (n == 2) print "Reading the second file"
}' n=1 test n=2 test2
The -v option allows us to specify parameters that are evaluated early and are available in BEGIN:
# The -v option must be specified before the script itself.
awk -v n=1 'BEGIN {
# prints "Begin: 1"
print "Begin: " n
}'
The -v option can be used for system variables too (here we set RS): awk -F"\n" -v RS="" '{ print }' …
echo 'test
test

test2
test2' | awk -F"\n" -v RS="" -v n=1 '{
# We use a newline as the field separator and
# an empty line as the record separator
print n, $1, "-", $2
}'
1 test - test
1 test2 - test2
Awk also provides the system variables ARGC and ARGV, similar to C.
if ( expression ) action1 [else action2 ]
if ( expression ) action1 ; [else action2 ]
if (avg >= 90) grade = "A"
else if (avg >= 80) grade = "B"
else if (avg >= 70) grade = "C"
else if (avg >= 60) grade = "D"
else grade = "F"
Conditional operator:
expr ? action1 : action2
grade = (avg >= 65) ? "Pass" : "Fail"
# While loop
while ( condition ) action
i = 1
while ( i <= 4 ) {
    print $i
    ++i
}
# Do loop
do action while ( condition )
do {
    ++x
    print x
} while ( x <= 4 )
# For loop
for ( set_counter ; test_counter ; increment_counter ) action
for ( i = 1; i <= NF; i++ )
    print $i
Prompt the user for a number and calculate factorial:
awk '# factorial: return factorial of user-supplied number
BEGIN {
# prompt user; use printf, not print, to avoid the newline
printf("Enter number: ")
}
# check that user enters a number
$1 ~ /^[0-9]+$/ {
# assign value of $1 to number & fact
number = $1
if (number == 0)
fact = 1
else
fact = number
# loop to multiply fact*x until x = 1
for (x = number - 1; x > 1; x--)
fact *= x
printf("The factorial of %d is %g\n", number, fact)
# exit -- saves user from typing CTRL-D.
exit
}
# if not a number, prompt again.
{ printf(" \nInvalid entry. Enter a number: ")
}' -
Loops support break (exit the loop) and continue (start the next iteration).
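A minimal sketch of both statements in one loop is shown below. The "end" sentinel and the helper name sum_until_end are made up for this illustration.

```shell
# sum_until_end is a hypothetical helper name used only for this illustration.
sum_until_end() {
  awk '{
    sum = 0
    for (i = 1; i <= NF; i++) {
      if ($i == "end") break     # stop at the sentinel, ignore the rest
      if ($i < 0) continue       # skip negative values
      sum += $i
    }
    print sum
  }'
}

echo '1 -2 3 end 100' | sum_until_end
# 4
```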
array [ subscript ] = value
student_avg[NR] = avg
...
END {
    for ( x = 1; x <= NR; x++ )
        class_avg_total += student_avg[x]
    class_average = class_avg_total / NR
}
All arrays are associative - the index can either be a string or a number.
# grade = "A", "B", "C", "D"
++class_grade[grade]
...
# To iterate the array we can use `for (item in array)` loop.
for (letter_grade in class_grade)
    # We also pipe output to "sort".
    print letter_grade ":", class_grade[letter_grade] | "sort"
To iterate over the array we can use a for (item in array) loop, and to test for membership we can use if (item in array).
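The membership test is worth a short sketch, because the in operator checks for a key without creating it, while merely referencing array[key] would create an empty element. The helper name count_seen and the sample keys are made up for this illustration.

```shell
# count_seen is a hypothetical helper name used only for this illustration.
count_seen() {
  awk '
    { seen[$1]++ }                       # count occurrences of the first field
    END {
      if ("b" in seen) print "b seen " seen["b"] " times"
      if (!("z" in seen)) print "z not seen"
      n = 0
      for (key in seen) n++              # "z" was tested but never created
      print n " distinct keys"
    }'
}

printf 'a\nb\nb\n' | count_seen
# b seen 2 times
# z not seen
# 2 distinct keys
```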
In gawk, multidimensional arrays (arrays of arrays) do not have to be rectangular as in C and C++:
a[1] = 5
a[2][1] = 6
a[2][2] = 7
The portable form uses a comma-separated list of subscripts:
file_array[NR, i] = $i
file_array[2, 4]
Note: with the comma syntax, multidimensional arrays are simulated. All indices are concatenated together, separated by the value of the system variable SUBSEP (by default "\034", an unprintable character):
$ awk 'BEGIN { x[1][2] = 2; print x[1][2]; }'
2
$ awk 'BEGIN { x[1,2] = 2; print x[1,2]; }'
2
$ awk 'BEGIN { x[1,2] = 2; print x["1" "\034" "2"]; }'
2
The multidimensional array syntax is also supported in testing for array membership: if ((i, j) in array).
Looping over a multidimensional array is the same as with one-dimensional arrays: for (item in array). The split() function can then be used to access the individual subscript components: split(item, subscr, SUBSEP).
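The loop-and-split approach described above can be sketched as follows. The helper name pair_counts and the two-column input are made up for this illustration; the output is piped through sort because the order of for (key in array) is not defined.

```shell
# pair_counts is a hypothetical helper name used only for this illustration.
pair_counts() {
  awk '
    { count[$1, $2]++ }                  # key is $1 SUBSEP $2
    END {
      for (key in count) {
        split(key, part, SUBSEP)         # recover the two subscripts
        print part[1], part[2], count[key]
      }
    }' | sort
}

printf 'x 1\nx 2\nx 1\n' | pair_counts
# x 1 2
# x 2 1
```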
The split function can be used to create arrays:
n = split(string, array, separator)
where:
- n - number of items in the array
- string - the string to split
- array - the array (function output)
- separator - delimiter to use when splitting the string
z = split($1, array, " ")
for (i = 1; i <= z; ++i)
    print i, array[i]
Remove an item from the array:
delete array [subscript]
An array of command-line parameters:
BEGIN {
    for (x = 0; x < ARGC; ++x)
        print ARGV[x]
    print ARGC
}
Math:
- cos(x) - cosine of x (x is in radians).
- exp(x) - e to the power x.
- int(x) - truncated value of x.
- log(x) - natural logarithm (base-e) of x.
- sin(x) - sine of x (x is in radians).
- sqrt(x) - square root of x.
- atan2(y,x) - arctangent of y/x in the range -π to π.
- rand() - pseudo-random number r, where 0 <= r < 1.
- srand(x) - establishes a new seed for rand(). If no seed is specified, uses the time of day. Returns the old seed.
Strings:
- length(s) - length of string s, or length of $0 if no string is supplied.
- index(s,t) - position of substring t in string s, or zero if not present.
  - pos = index("Mississippi", "is")
- split(s,a,sep) - parses string s into elements of array a using field separator sep; returns the number of elements. If sep is not supplied, FS is used. Array splitting works the same way as field splitting.
- substr(s,p,n) - returns the substring of string s at beginning position p up to a maximum length of n. If n is not supplied, the rest of the string from p is used.
  - awk 'BEGIN { print substr("707-555-1111", 5) }' → 555-1111
  - awk 'BEGIN { print substr("707-555-1111", 1, 3) }' → 707
- tolower(s) - translates all uppercase characters in string s to lowercase and returns the new string.
- toupper(s) - translates all lowercase characters in string s to uppercase and returns the new string.
- sprintf("fmt",expr) - uses the printf format specification for expr.
- match(s,r) - either the position in s where the regular expression r begins, or 0 if no occurrences are found. Sets the values of RSTART and RLENGTH.
- gsub(r,s,t) - globally substitutes s for each match of the regular expression r in the string t. Returns the number of substitutions.
  - If t is not supplied, it defaults to $0, so by default it works on the current input line.
- sub(r,s,t) - substitutes s for the first match of the regular expression r in the string t. Returns 1 if successful; 0 otherwise. If t is not supplied, it defaults to $0.
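As a quick sketch of the difference between sub and gsub, and of the gsub return value, the helper below edits the current line in place. The helper name replace_demo and the sample text are made up for this illustration.

```shell
# replace_demo is a hypothetical helper name used only for this illustration.
replace_demo() {
  awk '{
    n = gsub(/o/, "0")     # replace every "o" in $0, n is the number of replacements
    sub(/^h/, "H")         # replace only the first match, here a leading "h"
    print n, $0
  }'
}

echo 'hello foo' | replace_demo
# 3 Hell0 f00
```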
An example of match usage:
echo 'test
match' | awk '
# match -- print string that matches line
# for lines match pattern
match($0, pattern) {
# extract string matching pattern using
# starting position and length of string in $0
# print string
print substr($0, RSTART, RLENGTH)
}' pattern="ma"
ma
The match() function returns 0 if the pattern is not found, and a non-zero value (RSTART) if it is found. This allows the return value to be used as a condition, as in the match($0, pattern) rule above.
In gawk there are additional functions:
- gensub(r, s, h, t) - if h is a string starting with g or G, globally substitutes s for r in t. Otherwise, h is a number: substitutes for the h'th occurrence. Returns the new value, t is unchanged. If t is not supplied, defaults to $0.
  - It improves on gsub / sub: it is possible to replace the Nth occurrence, and the source string is not changed - the result is returned instead.
  - The pattern can have subpatterns delimited by parentheses. For example, it can have /(part) (one|two|three)/. Within the replacement string, a backslash followed by a digit represents the text that matched the Nth sub-pattern:
    echo part two | gawk '{ print gensub(/(part) (one|two|three)/, "\\2", "g") }' → two
- systime() - returns the current time of day in seconds since the Epoch (00:00:00 UTC, January 1, 1970).
- strftime(format, timestamp) - formats timestamp (of the same form returned by systime()) according to format. Default format - similar to the date command, default timestamp - current time.
echo 'TeSt' | awk '
# lower - change upper case to lower case
# note: we could use `tolower` to convert the case.
#
# initialize strings
BEGIN {
upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lower = "abcdefghijklmnopqrstuvwxyz"
}
# for each input line
{
# see if there is a match for all caps
while (match($0, /[A-Z]+/))
# get each cap letter
for (x = RSTART; x < RSTART+RLENGTH; ++x) {
CAP = substr($0, x, 1)
CHAR = index(upper, CAP)
# substitute lowercase for upper, we don't provide third
# parameter to `gsub`, so it acts on the input ($0).
gsub(CAP, substr(lower, CHAR, 1))
}
# print record
print $0
}'
test
function name (parameter-list) {
    statements
    return expression
}
function insert(STRING, POS, INS) {
    before_tmp = substr(STRING, 1, POS)
    after_tmp = substr(STRING, POS + 1)
    return before_tmp INS after_tmp
}
# "Hello" -> "HellXXo"
print insert($1, 4, "XX")
Note: variables assigned inside a function are global (available outside the function). To make them local, we need to declare them as extra parameters (and not pass values for these parameters when calling the function):
function insert(STRING, POS, INS, before_tmp, after_tmp) { ... }
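The extra-parameter trick can be checked with a small sketch: a global i is set before the call and is still intact afterwards, because the function's own i is a local parameter. The function sum_to and the wrapper name local_demo are made up for this illustration.

```shell
# local_demo is a hypothetical helper name used only for this illustration.
local_demo() {
  awk '
    # i and total are extra parameters, so they are local to sum_to.
    # The wide gap before them is only a readability convention.
    function sum_to(n,    i, total) {
      for (i = 1; i <= n; i++)
        total += i
      return total
    }
    BEGIN {
      i = 99                 # global i
      print sum_to(4), i     # the call does not touch the global i
    }'
}

local_demo
# 10 99
```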
Note: there are some pre-defined "external" functions, under /usr/share/awk on my system:
$ ls /usr/share/awk
assert.awk      ftrans.awk   inplace.awk   ord.awk           readable.awk  shellquote.awk
bits2str.awk    getopt.awk   join.awk      passwd.awk        readfile.awk  strtonum.awk
cliff_rand.awk  gettime.awk  libintl.awk   processarray.awk  rewind.awk    walkarray.awk
ctime.awk       group.awk    noassign.awk  quicksort.awk     round.awk     zerofile.awk
To use external functions, pass the path to the source using the -f flag:
awk -f myscript.awk -f /usr/share/awk/ctime.awk input.txt
The getline function is used to read another line of input.
It is similar to next, but it does not pass control back to the top of the script.
It reads the line and returns:
- 1 - if it was able to read a line.
- 0 - if it encounters the end-of-file.
- -1 - if it encounters an error.
echo 'first
test
second' | awk '
/test/ {
getline # get next line
print $1 # print $1 of new line.
}'
second
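The return value matters when the input may run out. The sketch below joins lines in pairs and uses the result of getline to detect a trailing unpaired line; it reads into a variable, so $0 is left unchanged. The helper name pair_lines is made up for this illustration.

```shell
# pair_lines is a hypothetical helper name used only for this illustration.
pair_lines() {
  awk '{
    first = $0
    if ((getline second) > 0)      # 1: a line was read into "second"
      print first " + " second
    else                           # 0: end of input, nothing was read
      print first " (no pair)"
  }'
}

printf 'a\nb\nc\n' | pair_lines
# a + b
# c (no pair)
```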
getline can also be used to read data from a file or a pipe:
# Read lines from the file "data" and print them.
while ( (getline < "data") > 0 )
    print
# Read from standard input (prompt the user to enter the name):
BEGIN {
    printf "Enter your name: "
    getline < "-"
    print
}
# We can also assign the data we read to a variable:
BEGIN {
    printf "Enter your name: "
    # Here we assign the input to the `name` variable
    getline name < "-"
    print name
}
It is possible to pipe the output of a command to getline:
awk '# getname - print users fullname from /etc/passwd
BEGIN {
# `who am i` outputs single string, user name is the first word
"who am i" | getline
name = $1
FS = ":"
}
$1 == name { print $5 }
' /etc/passwd
# subdate.awk -- replace @date with today's date
/@date/ {
"date +'%a., %h %d, %Y'" | getline today
gsub(/@date/, today)
}
{ print }
The close() function closes open files and pipes. It takes a single argument - the same expression that was used to create the pipe:
close("who")
Using close we free the resources and can run the same command more than once. If we are using an output pipe (like some processing of $0 | "sort > tmpfile"), we need to call close("sort > tmpfile") before using the tmpfile (for example in getline < "tmpfile"):
{ some processing of $0 | "sort > tmpfile" }
END {
    close("sort > tmpfile")
    while ((getline < "tmpfile") > 0) {
        do more work
    }
}
It is possible to redirect output to a file:
print "a =", a, "b =", b, "max =", (a > b ? a : b) > "data.out"
Similarly, the output can be redirected to a pipe:
print | command
awk 'BEGIN { print "test example" | "wc -w" }'
2
echo "test example" | awk '{ print | "wc -w" }'
2
The system( ) function executes a command supplied as an expression. It does not make the output of the command available within the program for processing. It returns the exit status of the command that was executed.
BEGIN { if (system("mkdir test") != 0) print "Command Failed" }
The command output goes to the script output:
echo 'test' | awk '
{
# print the line using `echo`
system("echo " $0)
}'
test
We can check the command result:
# test returns 1 if file does not exist (and 0 if exists).
if (system("test -r " file)) {
    print file " not found"
}
- man awk