Skip to content

Latest commit

 

History

History
2703 lines (2164 loc) · 75.6 KB

gnu_awk.md

File metadata and controls

2703 lines (2164 loc) · 75.6 KB

GNU awk

Table of Contents


$ awk --version | head -n1
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)

$ man awk
GAWK(1)                        Utility Commands                        GAWK(1)

NAME
       gawk - pattern scanning and processing language

SYNOPSIS
       gawk [ POSIX or GNU style options ] -f program-file [ -- ] file ...
       gawk [ POSIX or GNU style options ] [ -- ] program-text file ...

DESCRIPTION
       Gawk  is  the  GNU Project's implementation of the AWK programming lan‐
       guage.  It conforms to the definition of  the  language  in  the  POSIX
       1003.1  Standard.   This version in turn is based on the description in
       The AWK Programming Language, by Aho, Kernighan, and Weinberger.   Gawk
       provides  the additional features found in the current version of Brian
       Kernighan's awk and a number of GNU-specific extensions.
...

Prerequisites and notes

  • familiarity with programming concepts like variables, printing, control structures, arrays, etc
  • familiarity with regular expressions
  • this tutorial is primarily focussed on short programs that are easily usable from command line, similar to using grep, sed, etc
  • see Gawk: Effective AWK Programming manual for complete reference, has information on other awk versions as well as notes on POSIX standard

Field processing


Default field separation

  • $0 contains the entire input record
    • default input record separator is newline character
  • $1 contains the first field text
    • default input field separator is one or more of continuous space, tab or newline characters
  • $2 contains the second field text and so on
  • $(2+3) result of expressions can be used, this one evaluates to $5 and hence gives fifth field
  • NF is a built-in variable which contains number of fields in the current record
    • so, $NF will give last field
    • $(NF-1) will give second last field and so on
$ cat fruits.txt
fruit   qty
apple   42
banana  31
fig     90
guava   6

$ # print only first field
$ awk '{print $1}' fruits.txt
fruit
apple
banana
fig
guava

$ # print only second field
$ awk '{print $2}' fruits.txt
qty
42
31
90
6

Specifying different input field separator

  • by using -F command line option
  • by setting FS variable
  • See FPAT and FIELDWIDTHS section for other ways of defining input fields
$ # second field where input field separator is :
$ echo 'foo:123:bar:789' | awk -F: '{print $2}'
123

$ # last field
$ echo 'foo:123:bar:789' | awk -F: '{print $NF}'
789

$ # first and last field
$ # note the use of , and space between output fields
$ echo 'foo:123:bar:789' | awk -F: '{print $1, $NF}'
foo 789

$ # second last field
$ echo 'foo:123:bar:789' | awk -F: '{print $(NF-1)}'
bar

$ # use quotes to avoid clashes with shell special characters
$ echo 'one;two;three;four' | awk -F';' '{print $3}'
three
  • Regular expressions based input field separator
$ echo 'Sample123string54with908numbers' | awk -F'[0-9]+' '{print $2}'
string

$ # first field will be empty as there is nothing before '{'
$ echo '{foo}   bar=baz' | awk -F'[{}= ]+' '{print $1}'

$ echo '{foo}   bar=baz' | awk -F'[{}= ]+' '{print $2}'
foo
$ echo '{foo}   bar=baz' | awk -F'[{}= ]+' '{print $3}'
bar
  • default input field separator is one or more of continuous space, tab or newline characters (will be termed as whitespace here on)
    • exact same behavior if FS is assigned single space character
  • in addition, leading and trailing whitespaces won't be considered when splitting the input record
$ printf ' a    ate b\tc   \n'
 a    ate b     c
$ printf ' a    ate b\tc   \n' | awk '{print $1}'
a
$ printf ' a    ate b\tc   \n' | awk '{print NF}'
4
$ # same behavior if FS is assigned to single space character
$ printf ' a    ate b\tc   \n' | awk -F' ' '{print $1}'
a
$ printf ' a    ate b\tc   \n' | awk -F' ' '{print NF}'
4

$ # for anything else, leading/trailing whitespaces will be considered
$ printf ' a    ate b\tc   \n' | awk -F'[ \t]+' '{print $2}'
a
$ printf ' a    ate b\tc   \n' | awk -F'[ \t]+' '{print NF}'
6
  • assigning empty string to FS will split the input record character wise
  • note the use of command line option -v to set FS
$ echo 'apple' | awk -v FS= '{print $1}'
a
$ echo 'apple' | awk -v FS= '{print $2}'
p
$ echo 'apple' | awk -v FS= '{print $NF}'
e

$ # detecting multibyte characters depends on locale
$ printf 'hi👍 how are you?' | awk -v FS= '{print $3}'
👍

Further Reading


Specifying different output field separator

  • by setting OFS variable
  • also gets added between every argument to print statement
  • default is single space
$ # statements inside BEGIN are executed before processing any input text
$ echo 'foo:123:bar:789' | awk 'BEGIN{FS=OFS=":"} {print $1, $NF}'
foo:789
$ # can also be set using command line option -v
$ echo 'foo:123:bar:789' | awk -F: -v OFS=':' '{print $1, $NF}'
foo:789

$ # changing a field will re-build contents of $0
$ echo ' a      ate b   ' | awk '{$2 = "foo"; print $0}' | cat -A
a foo b$

$ # $1=$1 is an idiomatic way to re-build when there is nothing else to change
$ echo 'foo:123:bar:789' | awk -F: -v OFS='-' '{print $0}'
foo:123:bar:789
$ echo 'foo:123:bar:789' | awk -F: -v OFS='-' '{$1=$1; print $0}'
foo-123-bar-789

$ # OFS is used to separate different arguments given to print
$ echo 'foo:123:bar:789' | awk -F: -v OFS='\t' '{print $1, $3}'
foo     bar

$ echo 'Sample123string54with908numbers' | awk -F'[0-9]+' '{$1=$1; print $0}'
Sample string with numbers

Filtering


Idiomatic print usage

  • print statement with no arguments will print contents of $0
  • if condition is specified without corresponding statements, contents of $0 is printed if condition evaluates to true
  • 1 is typically used to represent always true condition and thus print contents of $0
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ # displaying contents of input file(s) similar to 'cat' command
$ # equivalent to using awk '{print $0}' and awk '1'
$ awk '{print}' poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

Field comparison

  • Each block of statements within {} can be prefixed by an optional condition so that those statements will execute only if condition evaluates to true
  • Condition specified without corresponding statements will lead to printing contents of $0 if condition evaluates to true
$ # if first field exactly matches the string 'apple'
$ awk '$1=="apple"{print $2}' fruits.txt
42

$ # print first field if second field > 35
$ # NR>1 to avoid the header line
$ # NR built-in variable contains record number
$ awk 'NR>1 && $2>35{print $1}' fruits.txt
apple
fig

$ # print header and lines with qty < 35
$ awk 'NR==1 || $2<35' fruits.txt
fruit   qty
banana  31
guava   6
  • If the above examples are too confusing, think of it as syntactical sugar
  • Statements are grouped within {}
    • inside {}, we have a if control structure
    • Like C language, braces not needed for single statements within if, but consider that {} is used for clarity
    • From this explicit syntax, remove the outer {}, if and () used for if
  • As we'll see later, this allows to mash up few lines of program compactly on command line itself
    • Of course, for medium to large programs, it is better to put the code in separate file. See awk scripts section
$ # awk '$1=="apple"{print $2}' fruits.txt
$ awk '{
         if($1 == "apple"){
            print $2
         }
       }' fruits.txt
42

$ # awk 'NR==1 || $2<35' fruits.txt
$ awk '{
         if(NR==1 || $2<35){
            print $0
         }
       }' fruits.txt
fruit   qty
banana  31
guava   6

Further Reading


Regular expressions based filtering

$ # all lines containing the string 'are'
$ # same as: grep 'are' poem.txt
$ awk '/are/' poem.txt
Roses are red,
Violets are blue,
And so are you.

$ # negating REGEXP, same as: grep -v 'are' poem.txt
$ awk '!/are/' poem.txt
Sugar is sweet,

$ # same as: grep 'are' poem.txt | grep -v 'so'
$ awk '/are/ && !/so/' poem.txt
Roses are red,
Violets are blue,

$ # lines starting with 'a' or 'b'
$ awk '/^[ab]/' fruits.txt
apple   42
banana  31

$ # print last field of all lines containing 'are'
$ awk '/are/{print $NF}' poem.txt
red,
blue,
you.
  • strings can be used as well, which will be interpreted as REGEXP if necessary
  • Allows using shell variables instead of hardcoded REGEXP
    • that section also notes difference between using // and string
$ awk '$0 !~ "are"' poem.txt
Sugar is sweet,

$ awk '$0 ~ "^[ab]"' fruits.txt
apple   42
banana  31

$ # also helpful if search strings have the / delimiter character
$ cat paths.txt
/foo/a/report.log
/foo/y/power.log
$ awk '/\/foo\/a\//' paths.txt
/foo/a/report.log
$ awk '$0 ~ "/foo/a/"' paths.txt
/foo/a/report.log
  • REGEXP matching against specific field
$ # if first field contains 'a'
$ awk '$1 ~ /a/' fruits.txt
apple   42
banana  31
guava   6

$ # if first field contains 'a' and qty > 20
$ awk '$1 ~ /a/ && $2 > 20' fruits.txt
apple   42
banana  31

$ # if first field does NOT contain 'a'
$ awk '$1 !~ /a/' fruits.txt
fruit   qty
fig     90

Fixed string matching

  • to search a string literally, index function can be used instead of REGEXP
    • similar to grep -F
  • the function returns the starting position and 0 if no match found
$ cat eqns.txt
a=b,a-b=c,c*d
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

$ # no output since '+' is meta character, would need '/a\+b/'
$ awk '/a+b/' eqns.txt
$ # same as: grep -F 'a+b' eqns.txt
$ awk 'index($0,"a+b")' eqns.txt
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

$ # much easier than '/i\*\(t\+9-g\)/'
$ awk 'index($0,"i*(t+9-g)")' eqns.txt
i*(t+9-g)/8,4-a+b

$ # check only last field
$ awk -F, 'index($NF,"a+b")' eqns.txt
i*(t+9-g)/8,4-a+b
$ # index not needed if entire field/line is being compared
$ awk -F, '$1=="a+b"' eqns.txt
a+b,pi=3.14,5e12
  • return value is useful to match at specific position
  • for ex: at start/end of line
$ # start of line
$ awk 'index($0,"a+b")==1' eqns.txt
a+b,pi=3.14,5e12

$ # end of line
$ # length function returns number of characters, by default acts on $0
$ awk 'index($0,"a+b")==length()-length("a+b")+1' eqns.txt
i*(t+9-g)/8,4-a+b
$ # to avoid repetitions, save the search string in variable
$ awk -v s="a+b" 'index($0,s)==length()-length(s)+1' eqns.txt
i*(t+9-g)/8,4-a+b

Line number based filtering

  • Built-in variable NR contains total records read so far
  • Use FNR if you need line numbers separately for multiple file processing
$ # same as: head -n2 poem.txt | tail -n1
$ awk 'NR==2' poem.txt
Violets are blue,

$ # print 2nd and 4th line
$ awk 'NR==2 || NR==4' poem.txt
Violets are blue,
And so are you.

$ # same as: tail -n1 poem.txt
$ # statements inside END are executed after processing all input text
$ awk 'END{print}' poem.txt
And so are you.

$ awk 'NR==4{print $2}' fruits.txt
90
  • for large input, use exit to avoid unnecessary record processing
$ seq 14323 14563435 | awk 'NR==234{print; exit}'
14556

$ # sample time comparison
$ time seq 14323 14563435 | awk 'NR==234{print; exit}'
14556

real    0m0.004s
user    0m0.004s
sys     0m0.000s
$ time seq 14323 14563435 | awk 'NR==234{print}'
14556

real    0m2.167s
user    0m2.280s
sys     0m0.092s

Case Insensitive filtering

$ # same as: grep -i 'rose' poem.txt
$ awk -v IGNORECASE=1 '/rose/' poem.txt
Roses are red,

$ # for small enough set, can also use REGEXP character class
$ awk '/[rR]ose/' poem.txt
Roses are red,

$ # another way is to use built-in string function 'tolower'
$ awk 'tolower($0) ~ /rose/' poem.txt
Roses are red,

Changing record separators

  • RS to change input record separator
  • default is newline character
$ s='this is a sample string'

$ # space as input record separator, printing all records
$ printf "$s" | awk -v RS=' ' '{print NR, $0}'
1 this
2 is
3 a
4 sample
5 string

$ # print all records containing 'a'
$ printf "$s" | awk -v RS=' ' '/a/'
a
sample
  • ORS to change output record separator
  • gets added to every print statement
  • default is newline character
$ seq 3 | awk '{print $0}'
1
2
3
$ # note that there is empty line after last record
$ seq 3 | awk -v ORS='\n\n' '{print $0}'
1

2

3

$ # dynamically changing ORS
$ # ?: ternary operator to select between two expressions based on a condition
$ # can also use: seq 6 | awk '{ORS = NR%2 ? " " : RS} 1'
$ seq 6 | awk '{ORS = NR%2 ? " " : "\n"} 1'
1 2
3 4
5 6
$ seq 6 | awk '{ORS = NR%3 ? "-" : "\n"} 1'
1-2-3
4-5-6

Paragraph mode

  • When RS is set to empty string, one or more consecutive empty lines is used as input record separator
  • Can also use regular expression RS=\n\n+ but there are subtle differences, see gawk manual - multiline records. Important points from that link quoted below

However, there is an important difference between ‘RS = ""’ and ‘RS = "\n\n+"’. In the first case, leading newlines in the input data file are ignored, and if a file ends without extra blank lines after the last record, the final newline is removed from the record. In the second case, this special processing is not done

Now that the input is separated into records, the second step is to separate the fields in the records. One way to do this is to divide each of the lines into fields in the normal manner. This happens by default as the result of a special feature. When RS is set to the empty string and FS is set to a single character, the newline character always acts as a field separator. This is in addition to whatever field separations result from FS

When FS is the null string ("") or a regexp, this special feature of RS does not apply. It does apply to the default field separator of a single space: ‘FS = " "’

Consider the below sample file

$ cat sample.txt
Hello World

Good day
How are you

Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he
  • Filtering paragraphs
$ # print all paragraphs containing 'it'
$ # if extra newline at end is undesirable, can use
$ # awk -v RS= '/it/{print c++ ? "\n" $0 : $0}' sample.txt
$ awk -v RS= -v ORS='\n\n' '/it/' sample.txt
Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

$ # based on number of lines in each paragraph
$ awk -F'\n' -v RS= -v ORS='\n\n' 'NF==1' sample.txt
Hello World

$ awk -F'\n' -v RS= -v ORS='\n\n' 'NF==2 && /do/' sample.txt
Just do-it
Believe it

Much ado about nothing
He he he
  • Re-structuring paragraphs
$ # default FS is one or more of continuous space, tab or newline characters
$ # default OFS is single space
$ # so, $1=$1 will change it uniformly to single space between fields
$ awk -v RS= '{$1=$1} 1' sample.txt
Hello World
Good day How are you
Just do-it Believe it
Today is sunny Not a bit funny No doubt you like it too
Much ado about nothing He he he

$ # a better usecase
$ awk 'BEGIN{FS="\n"; OFS=". "; RS=""; ORS="\n\n"} {$1=$1} 1' sample.txt
Hello World

Good day. How are you

Just do-it. Believe it

Today is sunny. Not a bit funny. No doubt you like it too

Much ado about nothing. He he he

Further Reading


Multicharacter RS

  • Some marker like Error or Warning etc
$ cat report.log
blah blah
Error: something went wrong
more blah
whatever
Error: something surely went wrong
some text
some more text
blah blah blah

$ awk -v RS='Error:' 'END{print NR-1}' report.log
2
$ awk -v RS='Error:' 'NR==1' report.log
blah blah

$ # filter 'Error:' block matching particular string
$ # to preserve formatting, use: '/whatever/{print RS $0}'
$ awk -v RS='Error:' '/whatever/' report.log
 something went wrong
more blah
whatever

$ # blocks with more than 3 lines
$ # splitting string with 3 newlines will yield 4 fields
$ awk -F'\n' -v RS='Error:' 'NF>4{print RS $0}' report.log
Error: something surely went wrong
some text
some more text
blah blah blah
  • Regular expression based RS
    • the RT variable will contain string matched by RS
  • Note that entire input is treated as single string, so ^ and $ anchors will apply only once - not every line
$ s='Sample123string54with908numbers'
$ printf "$s" | awk -v RS='[0-9]+' 'NR==1'
Sample

$ # note the relationship between record and separators
$ printf "$s" | awk -v RS='[0-9]+' '{print NR " : " $0 " - " RT}'
1 : Sample - 123
2 : string - 54
3 : with - 908
4 : numbers - 

$ # need to be careful of empty records
$ printf '123string54with908' | awk -v RS='[0-9]+' '{print NR " : " $0}'
1 : 
2 : string
3 : with
$ # and newline at end of input
$ printf '123string54with908\n' | awk -v RS='[0-9]+' '{print NR " : " $0}'
1 : 
2 : string
3 : with
4 : 
  • Joining lines based on specific end of line condition
$ cat msg.txt
Hello there.
It will rain to-
day. Have a safe
and pleasant jou-
rney.

$ # join lines ending with - to next line
$ # by manipulating RS and ORS
$ awk -v RS='-\n' -v ORS= '1' msg.txt
Hello there.
It will rain today. Have a safe
and pleasant journey.

$ # by manipulating ORS alone, sub function covered in later sections
$ awk '{ORS = sub(/-$/,"") ? "" : "\n"} 1' msg.txt
Hello there.
It will rain today. Have a safe
and pleasant journey.
$ # easier: perl -pe 's/-\n//' msg.txt as newline is still part of input line
  • processing null terminated input
$ printf 'foo\0bar\0' | cat -A
foo^@bar^@$
$ printf 'foo\0bar\0' | awk -v RS='\0' '{print}'
foo
bar

Further Reading


Substitute functions

  • Use sub string function for replacing first occurrence
  • Use gsub for replacing all occurrences
  • By default, $0 which contains input record is modified, can specify any other field or variable as needed
$ # replacing first occurrence
$ echo '1-2-3-4-5' | awk '{sub("-", ":")} 1'
1:2-3-4-5

$ # replacing all occurrences
$ echo '1-2-3-4-5' | awk '{gsub("-", ":")} 1'
1:2:3:4:5

$ # return value for sub/gsub is number of replacements made
$ echo '1-2-3-4-5' | awk '{n=gsub("-", ":"); print n} 1'
4
1:2:3:4:5

$ # // format is better suited to specify search REGEXP
$ echo '1-2-3-4-5' | awk '{gsub(/[^-]+/, "abc")} 1'
abc-abc-abc-abc-abc

$ # replacing all occurrences only for third field
$ echo 'one;two;three;four' | awk -F';' '{gsub("e", "E", $3)} 1'
one two thrEE four
  • Use gensub to return the modified string unlike sub or gsub which modifies inplace
  • it also supports back-references and ability to modify specific match
  • acts upon $0 if target is not specified
$ # replace second occurrence
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(":", "-", 2)} 1'
foo:123-bar:baz
$ # use REGEXP as needed
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "XYZ", 2)} 1'
foo:XYZ:bar:baz

$ # or print the returned string directly
$ echo 'foo:123:bar:baz' | awk '{print gensub(":", "-", 2)}'
foo:123-bar:baz

$ # replace third occurrence
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "XYZ", 3)} 1'
foo:123:XYZ:baz

$ # replace all occurrences, similar to gsub
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "XYZ", "g")} 1'
XYZ:XYZ:XYZ:XYZ

$ # target other than $0
$ echo 'foo:123:bar:baz' | awk -F: -v OFS=: '{$1=gensub(/o/, "b", 2, $1)} 1'
fob:123:bar:baz
  • back-reference examples
  • use \" within double-quotes to represent " character in replacement string
  • use \\1 to represent \1 - the first captured group and so on
  • & or \0 will back-reference entire matched string
$ # replacing last occurrence without knowing how many occurrences are there
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/(.*):/, "\\1-", 1)} 1'
foo:123:bar-baz
$ echo 'foo and bar and baz land good' | awk '{$0=gensub(/(.*)and/, "\\1XYZ", 1)} 1'
foo and bar and baz lXYZ good

$ # use word boundaries as necessary
$ echo 'foo and bar and baz land good' | awk '{$0=gensub(/(.*)\<and\>/, "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good

$ # replacing last but one
$ echo '456:foo:123:bar:789:baz' | awk '{$0=gensub(/(.*):(.*:)/, "\\1-\\2", 1)} 1'
456:foo:123:bar-789:baz

$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "\"&\"", "g")} 1'
"foo":"123":"bar":"baz"
  • saving quotes in variables - to avoid escaping double quotes or having to use octal code for single quotes
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "\047&\047", "g")} 1'
'foo':'123':'bar':'baz'
$ echo 'foo:123:bar:baz' | awk -v sq="'" '{$0=gensub(/[^:]+/, sq"&"sq, "g")} 1'
'foo':'123':'bar':'baz'

$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "\"&\"", "g")} 1'
"foo":"123":"bar":"baz"
$ echo 'foo:123:bar:baz' | awk -v dq='"' '{$0=gensub(/[^:]+/, dq"&"dq, "g")} 1'
"foo":"123":"bar":"baz"

Further Reading


Inplace file editing

  • Use this option with caution, preferably after testing that the awk code is working as intended
$ cat greeting.txt
Hi there
Have a nice day

$ awk -i inplace '{gsub("e", "E")} 1' greeting.txt
$ cat greeting.txt
Hi thErE
HavE a nicE day
  • Multiple input files are treated individually and changes are written back to respective files
$ cat f1
I ate 3 apples
$ cat f2
I bought two bananas and 3 mangoes

$ awk -i inplace '{gsub("3", "three")} 1' f1 f2
$ cat f1
I ate three apples
$ cat f2
I bought two bananas and three mangoes
  • to create backups of original file, set INPLACE_SUFFIX variable
$ awk -i inplace -v INPLACE_SUFFIX='.bkp' '{gsub("three", "3")} 1' f1
$ cat f1
I ate 3 apples
$ cat f1.bkp
I ate three apples

Using shell variables

  • when awk code is part of shell program and shell variable needs to be passed as input to awk code
  • for example:
    • command line argument passed to shell script, which is in turn passed on to awk
    • control structures in shell script calling awk with different search strings
  • See also stackoverflow - How do I use shell variables in an awk script?
$ # examples tested with bash shell

$ f='apple'
$ awk -v word="$f" '$1==word' fruits.txt
apple   42
$ f='fig'
$ awk -v word="$f" '$1==word' fruits.txt
fig     90

$ q='20'
$ awk -v threshold="$q" 'NR==1 || $2>threshold' fruits.txt
fruit   qty
apple   42
banana  31
fig     90
  • accessing shell environment variables
$ # existing environment variable
$ awk 'BEGIN{print ENVIRON["PWD"]}'
/home/learnbyexample
$ awk 'BEGIN{print ENVIRON["SHELL"]}'
/bin/bash

$ # defined along with awk code
$ word='hello world' awk 'BEGIN{print ENVIRON["word"]}'
hello world

$ # using ENVIRON also prevents awk's interpretation of escape sequences
$ s='a\n=c'
$ foo="$s" awk 'BEGIN{print ENVIRON["foo"]}'
a\n=c
$ awk -v foo="$s" 'BEGIN{print foo}'
a
=c
$ s='are'
$ # for: awk '!/are/' poem.txt
$ awk -v s="$s" '$0 !~ s' poem.txt
Sugar is sweet,
$ # for: awk '/are/ && !/so/' poem.txt
$ awk -v s="$s" '$0 ~ s && !/so/' poem.txt
Roses are red,
Violets are blue,

$ r='[^-]+'
$ echo '1-2-3-4-5' | awk -v r="$r" '{gsub(r, "abc")} 1'
abc-abc-abc-abc-abc

$ # escape sequence has to be doubled when string is interpreted as REGEXP
$ s='foo and bar and baz land good'
$ echo "$s" | awk '{$0=gensub("(.*)\\<and\\>", "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good
$ # hence passing as variable should be
$ r='(.*)\\<and\\>'
$ echo "$s" | awk -v r="$r" '{$0=gensub(r, "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good

$ # or use ENVIRON
$ r='(.*)\<and\>'
$ echo "$s" | r="$r" awk '{$0=gensub(ENVIRON["r"], "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good

Multiple file input

  • Example to show difference between NR and FNR
$ # NR for overall record number
$ awk 'NR==1' poem.txt greeting.txt
Roses are red,

$ # FNR for individual file's record number
$ # same as: head -q -n1 poem.txt greeting.txt
$ awk 'FNR==1' poem.txt greeting.txt
Roses are red,
Hi thErE
  • Constructs to do some processing before starting each file as well as at the end
  • BEGINFILE - to add code to be executed before start of each input file
  • ENDFILE - to add code to be executed after processing each input file
  • FILENAME - file name of current input file being processed
$ # similar to: tail -n1 poem.txt greeting.txt
$ awk 'BEGINFILE{print "file: "FILENAME}
       ENDFILE{print $0"\n------"}' poem.txt greeting.txt
file: poem.txt
And so are you.
------
file: greeting.txt
HavE a nicE day
------
  • And of course, there can be usual awk code
$ awk 'BEGINFILE{print "file: "FILENAME}
       FNR==1;
       ENDFILE{print "------"}' poem.txt greeting.txt
file: poem.txt
Roses are red,
------
file: greeting.txt
Hi thErE
------

$ awk 'BEGINFILE{c++; print "file: "FILENAME}
       FNR==2;
       END{print "\nTotal input files: "c}' poem.txt greeting.txt
file: poem.txt
Violets are blue,
file: greeting.txt
HavE a nicE day

Total input files: 2

Further Reading


Control Structures

  • Syntax is similar to C language and single statements inside control structures don't require to be grouped within {}
  • See gawk manual - Control Statements for details

Remember that by default there is a loop that goes over all input records and constructs like BEGIN and END fall outside that loop

$ cat nums.txt
42
-2
10101
-3.14
-75
$ awk '{sum += $1} END{print sum}' nums.txt
10062.9

$ # uninitialized variables will have empty string
$ printf '' | awk '{sum += $1} END{print sum}'

$ # so either add '0' or use unary '+' operator to convert to number
$ printf '' | awk '{sum += $1} END{print +sum}'
0
$ awk '{sum += $1} END{print sum+0}' /dev/null
0

if-else and loops

$ # same as: sed -n '/are/ s/so/SO/p' poem.txt
$ # remember that sub/gsub returns number of substitutions made
$ awk '/are/{if(sub("so", "SO")) print}' poem.txt
And SO are you.
$ # of course, can also use
$ awk '/are/ && sub("so", "SO")' poem.txt
And SO are you.

$ # if-else example
$ awk 'NR>1{if($2>40) $0="+"$0; else $0="-"$0} 1' fruits.txt
fruit   qty
+apple   42
-banana  31
+fig     90
-guava   6
$ cat nums.txt
42
-2
10101
-3.14
-75

$ # changing -ve to +ve and vice versa
$ # same as: awk '{if($0 ~ /^-/) sub(/^-/,""); else sub(/^/,"-")} 1' nums.txt
$ awk '{$0 ~ /^-/ ? sub(/^-/,"") : sub(/^/,"-")} 1' nums.txt
-42
2
-10101
3.14
75
$ # can also use: awk '!sub(/^-/,""){sub(/^/,"-")} 1' nums.txt
$ awk 'BEGIN{for(i=2; i<11; i+=2) print i}'
2
4
6
8
10

$ # looping each field
$ s='scat:cat:no cat:abdicate:cater'
$ echo "$s" | awk -F: -v OFS=: '{for(i=1;i<=NF;i++) if($i=="cat") $i="CAT"} 1'
scat:CAT:no cat:abdicate:cater
$ # can also use sub function
$ echo "$s" | awk -F: -v OFS=: '{for(i=1;i<=NF;i++) sub(/^cat$/,"CAT",$i)} 1'
scat:CAT:no cat:abdicate:cater
  • while loop
  • do-while is also available
$ awk 'BEGIN{i=2; while(i<11){print i; i+=2}}'
2
4
6
8
10

$ # recursive substitution
$ # here again return value of sub/gsub is useful
$ echo 'titillate' | awk '{while( gsub(/til/, "") ) print}'
tilate
ate

next and nextfile

  • next will skip rest of statements and start processing next line of current file being processed
    • there is a loop by default which goes over all input records, next is applicable for that
    • it is similar to continue statement within loops
  • it is often used in Two file processing
$ # here 'next' is used to skip processing header line
$ awk 'NR==1{print; next} /a.*a/{$0="*"$0} /[eiou]/{$0="-"$0} 1' fruits.txt
fruit   qty
-apple   42
*banana  31
-fig     90
-*guava   6
  • nextfile is useful to skip remaining lines from current file being processed and move on to next file
$ # same as: head -q -n1 poem.txt greeting.txt fruits.txt
$ awk 'FNR>1{nextfile} 1' poem.txt greeting.txt fruits.txt
Roses are red,
Hi thErE
fruit   qty

$ # specific field
$ awk 'FNR>2{nextfile} {print $1}' poem.txt greeting.txt fruits.txt
Roses
Violets
Hi
HavE
fruit
apple

$ # similar to 'grep -il'
$ awk -v IGNORECASE=1 '/red/{print FILENAME; nextfile}' *
colors_1.txt
colors_2.txt
poem.txt
$ awk -v IGNORECASE=1 '$1 ~ /red/{print FILENAME; nextfile}' *
colors_1.txt
colors_2.txt

Multiline processing

  • Processing consecutive lines
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ # match two consecutive lines
$ awk 'p~/are/ && /is/{print p ORS $0} {p=$0}' poem.txt
Violets are blue,
Sugar is sweet,
$ # if only the second line is needed
$ awk 'p~/are/ && /is/; {p=$0}' poem.txt
Sugar is sweet,

$ # match three consecutive lines
$ awk 'p2~/red/ && p1~/blue/ && /is/{print p2} {p2=p1; p1=$0}' poem.txt
Roses are red,

$ # common mistake
$ sed -n '/are/{N;/is/p}' poem.txt
$ # would need something like this and not practical to extend for other cases
$ sed '$!N; /are.*\n.*is/p; D' poem.txt
Violets are blue,
Sugar is sweet,

Consider this sample input file

$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz
  • extracting lines around matching line
  • See also stackoverflow - lines around matching regexp
  • how n && n-- works:
    • need to note that right hand side of && is processed only if left hand side is true
    • so for example, if initially n=2, then we get
      • 2 && 2; n=1 - evaluates to true
      • 1 && 1; n=0 - evaluates to true
      • 0 && - evaluates to false ... no decrementing n and hence will be false until n is re-assigned non-zero value
$ # similar to: grep --no-group-separator -A1 'BEGIN' range.txt
$ awk '/BEGIN/{n=2} n && n--' range.txt
BEGIN
1234
BEGIN
a

$ # only print the line after matching line
$ # can also use: awk '/BEGIN/{n=1; next} n && n--' range.txt
$ awk 'n && n--; /BEGIN/{n=1}' range.txt
1234
a
$ # generic case: print nth line after match
$ awk 'n && !--n; /BEGIN/{n=3}' range.txt
END
c

$ # print second line prior to matched line
$ awk '/END/{print p2} {p2=p1; p1=$0}' range.txt
1234
b
$ # save all lines in an array for generic case
$ awk '/END/{print a[NR-3]} {a[NR]=$0}' range.txt
BEGIN
a
$ # or use the reversing trick
$ tac range.txt | awk 'n && !--n; /END/{n=3}' | tac
BEGIN
a
  • Checking if multiple strings are present at least once in entire input file
  • If there are lots of strings to check, use arrays
$ # can also use BEGINFILE instead of FNR==1
$ awk 'FNR==1{s1=s2=0} /is/{s1=1} /are/{s2=1} s1&&s2{print FILENAME; nextfile}' *
poem.txt
sample.txt

$ awk 'FNR==1{s1=s2=0} /foo/{s1=1} /report/{s2=1} s1&&s2{print FILENAME; nextfile}' *
paths.txt

Further Reading


Two file processing

  • We'll use awk's associative arrays (key-value pairs) here
  • Unlike comm the input files need not be sorted and comparison can be done based on certain field(s) as well

Comparing whole lines

Consider the following test files

$ cat colors_1.txt
Blue
Brown
Purple
Red
Teal
Yellow

$ cat colors_2.txt
Black
Blue
Green
Red
White
  • common lines and lines unique to one of the files
  • For two files as input, NR==FNR will be true only when first file is being processed
  • Using next will skip rest of code when first file is processed
  • a[$0] will create unique keys (here entire line content is used as key) in array a
    • just referencing a key will create it if it doesn't already exist, with value as empty string (will also act as zero in numeric context)
  • $0 in a will be true if key already exists in array a
$ # common lines
$ # same as: grep -Fxf colors_1.txt colors_2.txt
$ awk 'NR==FNR{a[$0]; next} $0 in a' colors_1.txt colors_2.txt
Blue
Red

$ # lines from colors_2.txt not present in colors_1.txt
$ # same as: grep -vFxf colors_1.txt colors_2.txt
$ awk 'NR==FNR{a[$0]; next} !($0 in a)' colors_1.txt colors_2.txt
Black
Green
White

$ # reversing the order of input files gives
$ # lines from colors_1.txt not present in colors_2.txt
$ awk 'NR==FNR{a[$0]; next} !($0 in a)' colors_2.txt colors_1.txt
Brown
Purple
Teal
Yellow

Comparing specific fields

Consider the sample input file

$ cat marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67
  • single field
  • For ex: only first field comparison by using $1 instead of $0 as key
$ cat list1
ECE
CSE

$ # extract only lines matching first field specified in list1
$ awk 'NR==FNR{a[$1]; next} $1 in a' list1 marks.txt
ECE     Raj     53
ECE     Joel    72
CSE     Surya   81
ECE     Om      92
CSE     Amy     67

$ # if header is needed as well
$ awk 'NR==FNR{a[$1]; next} FNR==1 || $1 in a' list1 marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
CSE     Surya   81
ECE     Om      92
CSE     Amy     67
  • multiple fields
  • create a string by adding some character between the fields to act as key
    • for ex: to avoid matching two field values abc and 123 to match with two other field values ab and c123
    • by adding character, say _, the key would be abc_123 for first case and ab_c123 for second case
    • this can still lead to false match if input data has _
    • there is also a built-in way to do this using gawk manual - Multidimensional Arrays
$ cat list2
EEE Moi
CSE Amy
ECE Raj

$ # extract only lines matching both fields specified in list2
$ awk 'NR==FNR{a[$1"_"$2]; next} $1"_"$2 in a' list2 marks.txt
ECE     Raj     53
EEE     Moi     68
CSE     Amy     67

$ # uses SUBSEP as separator, whose default value is non-printing character \034
$ awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' list2 marks.txt
ECE     Raj     53
EEE     Moi     68
CSE     Amy     67
  • field and value comparison
$ cat list3
ECE 70
EEE 65
CSE 80

$ # extract line matching Dept and minimum marks specified in list3
$ awk 'NR==FNR{d[$1]; m[$1]=$2; next} $1 in d && $3 >= m[$1]' list3 marks.txt
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
ECE     Om      92

getline

  • getline is an alternative way to read from a file and could be faster than NR==FNR method for some cases
  • But use it with caution
  • getline return value: 1 if record is found, 0 if end of file, -1 for errors such as file not found (use ERRNO variable to get details)
$ # replace mth line in poem.txt with nth line from nums.txt
$ # return value handling is not shown here, but should be done ideally
$ awk -v m=3 -v n=2 'BEGIN{while(n-- > 0) getline s < "nums.txt"}
                     FNR==m{$0=s} 1' poem.txt
Roses are red,
Violets are blue,
-2
And so are you.

$ # without getline, but slower due to NR==FNR check for every line processed
$ awk -v m=3 -v n=2 'NR==FNR{if(FNR==n){s=$0; nextfile} next}
                     FNR==m{$0=s} 1' nums.txt poem.txt
Roses are red,
Violets are blue,
-2
And so are you.

$ # Note that if nums.txt has less than n lines:
$ # getline version will use last line of nums.txt if any
$ # NR==FNR version will give empty string as 's' would be uninitialized
  • Another use case is if two files are to be processed simultaneously
$ # print line from fruits.txt if corresponding line from nums.txt is +ve number
$ # the return value check ensures corresponding line number comparison
$ awk -v file='nums.txt' '(getline num < file)==1 && num>0' fruits.txt
fruit   qty
banana  31

$ # without getline, but has to save entire file in array
$ awk 'NR==FNR{n[FNR]=$0; next} n[FNR]>0' nums.txt fruits.txt
fruit   qty
banana  31
  • error handling
$ awk 'NR==FNR{n[FNR]=$0; next} n[FNR]>0' xyz.txt fruits.txt
awk: fatal: cannot open file 'xyz.txt' for reading (No such file or directory)

$ awk -v file='xyz.txt' '{ e=(getline num < file);
                           if(e<0){print file ": " ERRNO; exit} }
                         e==1 && num>0' fruits.txt
xyz.txt: No such file or directory

Further Reading


Creating new fields

  • Number of fields in input record can be changed by simply manipulating NF
$ # reducing fields
$ echo 'foo,bar,123,baz' | awk -F, -v OFS=, '{NF=2} 1'
foo,bar

$ # creating new empty field(s)
$ echo 'foo,bar,123,baz' | awk -F, -v OFS=, '{NF=5} 1'
foo,bar,123,baz,

$ # assigning to field greater than NF will create empty fields as needed
$ echo 'foo,bar,123,baz' | awk -F, -v OFS=, '{$7=42} 1'
foo,bar,123,baz,,,42
  • adding a field based on existing fields
$ # adding a new 'Grade' field
$ awk 'BEGIN{OFS="\t"; g[9]="S"; g[8]="A"; g[7]="B"; g[6]="C"; g[5]="D"}
      {NF++; $NF = NR==1 ? "Grade" : g[int($(NF-1)/10)]} 1' marks.txt
Dept    Name    Marks   Grade
ECE     Raj     53      D
ECE     Joel    72      B
EEE     Moi     68      C
CSE     Surya   81      A
EEE     Tia     59      D
ECE     Om      92      S
CSE     Amy     67      C

$ # can also use split (covered in a later section)
$ # array assignment: split("DCBAS",g,//)
$ # index adjustment: g[int($(NF-1)/10)-4]
  • two file example
$ cat list4
Raj class_rep
Amy sports_rep
Tia placement_rep

$ awk -v OFS='\t' 'NR==FNR{r[$1]=$2; next}
         {NF++; $NF = NR==1 ? "Role" : $NF=r[$2]} 1' list4 marks.txt
Dept    Name    Marks   Role
ECE     Raj     53      class_rep
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59      placement_rep
ECE     Om      92
CSE     Amy     67      sports_rep

Dealing with duplicates

  • default value of uninitialized variable is 0 in numeric context and empty string in text context
    • and evaluates to false when used conditionally

Illustration to show default numeric value and array in action

$ printf 'mad\n42\n42\ndam\n42\n'
mad
42
42
dam
42

$ printf 'mad\n42\n42\ndam\n42\n' | awk '{print $0 "\t" int(a[$0]); a[$0]++}'
mad     0
42      0
42      1
dam     0
42      2
$ # only those entries with second column value zero will be retained
$ printf 'mad\n42\n42\ndam\n42\n' | awk '!a[$0]++'
mad
42
dam
$ cat duplicates.txt
abc  7   4
food toy ****
abc  7   4
test toy 123
good toy ****

$ # whole line
$ awk '!seen[$0]++' duplicates.txt
abc  7   4
food toy ****
test toy 123
good toy ****

$ # particular column
$ awk '!seen[$2]++' duplicates.txt
abc  7   4
food toy ****

$ # total count
$ awk '!seen[$2]++{c++} END{print +c}' duplicates.txt
2
$ # avoid unnecessary counting altogether
$ awk '!($2 in seen); {seen[$2]}' duplicates.txt
abc  7   4
food toy ****

$ # use arbitrary-precision integers, limited only by available memory
$ awk -M '!($2 in seen){c++} {seen[$2]} END{print +c}' duplicates.txt
2
  • For multiple fields, separate them using , or form a string with some character in between
    • choose a character unlikely to appear in input data, else there can be false matches
$ awk '!seen[$2"_"$3]++' duplicates.txt
abc  7   4
food toy ****
test toy 123

$ # can also use simulated multidimensional array
$ # SUBSEP, whose default is \034 non-printing character, is used as separator
$ awk '!seen[$2,$3]++' duplicates.txt
abc  7   4
food toy ****
test toy 123
  • retaining specific numbered copy
$ # second occurrence of duplicate
$ awk '++seen[$2]==2' duplicates.txt
abc  7   4
test toy 123

$ # third occurrence of duplicate
$ awk '++seen[$2]==3' duplicates.txt
good toy ****
  • retaining only last copy of duplicate
$ # reverse the input line-wise, retain first copy and then reverse again
$ tac duplicates.txt | awk '!seen[$2]++' | tac
abc  7   4
good toy ****
$ # all duplicates based on 1st column
$ awk 'NR==FNR{a[$1]++; next} a[$1]>1' duplicates.txt duplicates.txt
abc  7   4
abc  7   4
$ # all duplicates based on 3rd column
$ awk 'NR==FNR{a[$3]++; next} a[$3]>1' duplicates.txt duplicates.txt
abc  7   4
food toy ****
abc  7   4
good toy ****

$ # more than 2 duplicates based on 2nd column
$ awk 'NR==FNR{a[$2]++; next} a[$2]>2' duplicates.txt duplicates.txt
food toy ****
test toy 123
good toy ****

$ # only unique lines based on 3rd column
$ awk 'NR==FNR{a[$3]++; next} a[$3]==1' duplicates.txt duplicates.txt
test toy 123

Lines between two REGEXPs

  • This section deals with filtering lines bound by two REGEXPs (referred to as blocks)
  • For simplicity the two REGEXPs usually used in below examples are the strings BEGIN and END

All unbroken blocks

Consider the below sample input file, which doesn't have any unbroken blocks (i.e BEGIN and END are always present in pairs)

$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz
  • Extracting lines between starting and ending REGEXP
$ # include both starting/ending REGEXP
$ # can also use: awk '/BEGIN/,/END/' range.txt
$ # which is similar to sed -n '/BEGIN/,/END/p'
$ # but not suitable to extend for other cases
$ awk '/BEGIN/{f=1} f; /END/{f=0}' range.txt
BEGIN
1234
6789
END
BEGIN
a
b
c
END

$ # exclude both starting/ending REGEXP
$ # can also use: awk '/BEGIN/{f=1; next} /END/{f=0} f' range.txt
$ awk '/END/{f=0} f; /BEGIN/{f=1}' range.txt
1234
6789
a
b
c
  • Include only start or end REGEXP
$ # include only starting REGEXP
$ awk '/BEGIN/{f=1} /END/{f=0} f' range.txt
BEGIN
1234
6789
BEGIN
a
b
c

$ # include only ending REGEXP
$ awk 'f; /END/{f=0} /BEGIN/{f=1}' range.txt
1234
6789
END
a
b
c
END
  • Extracting lines other than lines between the two REGEXPs
$ awk '/BEGIN/{f=1} !f; /END/{f=0}' range.txt
foo
bar
baz

$ # the other three cases would be
$ awk '/END/{f=0} !f; /BEGIN/{f=1}' range.txt
$ awk '!f; /BEGIN/{f=1} /END/{f=0}' range.txt
$ awk '/BEGIN/{f=1} /END/{f=0} !f' range.txt

Specific blocks

  • Getting first block
$ awk '/BEGIN/{f=1} f; /END/{exit}' range.txt
BEGIN
1234
6789
END

$ # use other tricks discussed in previous section as needed
$ awk '/END/{exit} f; /BEGIN/{f=1}' range.txt
1234
6789
  • Getting last block
$ # reverse input linewise, change the order of REGEXPs, finally reverse again
$ tac range.txt | awk '/END/{f=1} f; /BEGIN/{exit}' | tac
BEGIN
a
b
c
END

$ # or, save the blocks in a buffer and print the last one alone
$ # ORS contains output record separator, which is newline by default
$ seq 30 | awk '/4/{f=1; b=$0; next} f{b=b ORS $0} /6/{f=0} END{print b}'
24
25
26
  • Getting blocks based on a counter
$ # all blocks
$ seq 30 | sed -n '/4/,/6/p'
4
5
6
14
15
16
24
25
26

$ # get only 2nd block
$ # can also use: seq 30 | awk -v b=2 '/4/{c++} c==b{print; if(/6/) exit}'
$ seq 30 | awk -v b=2 '/4/{c++} c==b; /6/ && c==b{exit}'
14
15
16

$ # to get all blocks greater than 'b' blocks
$ seq 30 | awk -v b=1 '/4/{f=1; c++} f && c>b; /6/{f=0}'
14
15
16
24
25
26
  • excluding a particular block
$ # excludes 2nd block
$ seq 30 | awk -v b=2 '/4/{f=1; c++} f && c!=b; /6/{f=0}'
4
5
6
24
25
26

Broken blocks

  • If there are blocks with ending REGEXP but without corresponding start, awk '/BEGIN/{f=1} f; /END/{f=0}' will suffice
  • Consider the modified input file where starting REGEXP doesn't have corresponding ending
$ cat broken_range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
baz

$ # the file reversing trick comes in handy here as well
$ tac broken_range.txt | awk '/END/{f=1} f; /BEGIN/{f=0}' | tac
BEGIN
1234
6789
END
  • But if both kinds of broken blocks are present, accumulate the records and print accordingly
$ cat multiple_broken.txt
qqqqqqq
BEGIN
foo
BEGIN
1234
6789
END
bar
END
0-42-1
BEGIN
a
BEGIN
b
END
xyzabc

$ awk '/BEGIN/{f=1; buf=$0; next}
       f{buf=buf ORS $0}
       /END/{f=0; if(buf) print buf; buf=""}' multiple_broken.txt
BEGIN
1234
6789
END
BEGIN
b
END

Further Reading


Arrays

We've already seen examples using arrays, some more examples discussed in this section

  • array looping
$ # average marks for each department
$ awk 'NR>1{d[$1]+=$3; c[$1]++} END{for(i in d)print i, d[i]/c[i]}' marks.txt
ECE 72.3333
EEE 63.5
CSE 74
$ # by default, keys are traversed in random order
$ awk 'BEGIN{a["z"]=1; a["x"]=12; a["b"]=42; for(i in a)print i, a[i]}'
x 12
z 1
b 42

$ # index sorted ascending order as strings
$ awk 'BEGIN{PROCINFO["sorted_in"] = "@ind_str_asc";
       a["z"]=1; a["x"]=12; a["b"]=42; for(i in a)print i, a[i]}'
b 42
x 12
z 1

$ # value sorted ascending order as numbers
$ awk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_asc";
       a["z"]=1; a["x"]=12; a["b"]=42; for(i in a)print i, a[i]}'
z 1
x 12
b 42
  • deleting array elements
$ cat list5
CSE     Surya   75
EEE     Jai     69
ECE     Kal     83

$ # update entry if a match is found
$ # else append the new entries
$ awk '{ky=$1"_"$2} NR==FNR{upd[ky]=$0; next}
        ky in upd{$0=upd[ky]; delete upd[ky]} 1;
        END{for(i in upd)print upd[i]}' list5 marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   75
EEE     Tia     59
ECE     Om      92
CSE     Amy     67
ECE     Kal     83
EEE     Jai     69
$ awk 'NR>1{d[$1][$2]=$3} END{for(i in d["ECE"])print i}' marks.txt
Joel
Raj
Om

$ awk -v f='CSE' 'NR>1{d[$1][$2]=$3} END{for(i in d[f])print i, d[f][i]}' marks.txt
Surya 81
Amy 67

Further Reading


awk scripts

  • For larger programs, save the code in a file and use -f command line option
  • ; is not needed to terminate a statement
  • See also gawk manual - Command-Line Options for other related options
$ cat buf.awk
/BEGIN/{
    f=1
    buf=$0
    next
}

f{
    buf=buf ORS $0
}

/END/{
    f=0
    if(buf)
        print buf
    buf=""
}

$ awk -f buf.awk multiple_broken.txt
BEGIN
1234
6789
END
BEGIN
b
END
  • Another advantage is that single quotes can be freely used
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "\047&\047", "g")} 1'
'foo':'123':'bar':'baz'

$ cat quotes.awk
{
    $0 = gensub(/[^:]+/, "'&'", "g")
}

1

$ echo 'foo:123:bar:baz' | awk -f quotes.awk
'foo':'123':'bar':'baz'
  • If the code has been first tried out on command line, add -o option to get a pretty printed version
$ awk -o -v OFS='\t' 'NR==FNR{r[$1]=$2; next}
         {NF++; if(FNR==1)$NF="Role"; else $NF=r[$2]} 1' list4 marks.txt
Dept    Name    Marks   Role
ECE     Raj     53      class_rep
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59      placement_rep
ECE     Om      92
CSE     Amy     67      sports_rep

File name can be passed along -o option, otherwise by default awkprof.out will be used

$ cat awkprof.out
        # gawk profile, created Tue Oct 24 15:10:02 2017

        # Rule(s)

        NR == FNR {
                r[$1] = $2
                next
        }

        {
                NF++
                if (FNR == 1) {
                        $NF = "Role"
                } else {
                        $NF = r[$2]
                }
        }

        1 {
                print $0
        }

$ # note that other command line options have to be provided as usual
$ # for ex: awk -v OFS='\t' -f awkprof.out list4 marks.txt

Miscellaneous


FPAT and FIELDWIDTHS

$ s='Sample123string54with908numbers'
$ # define fields to be one or more consecutive digits
$ echo "$s" | awk -v FPAT='[0-9]+' '{print $1, $2, $3}'
123 54 908
$ # define fields to be one or more consecutive alphabets
$ echo "$s" | awk -v FPAT='[a-zA-Z]+' '{print $1, $2, $3, $4}'
Sample string with numbers
  • For simpler csv input having quoted strings if fields themselves have , in them, using FPAT is reasonable approach
  • Use a proper parser if input can have other cases like newlines in fields
$ s='foo,"bar,123",baz,abc'
$ echo "$s" | awk -F, '{print $2}'
"bar
$ echo "$s" | awk -v FPAT='"[^"]*"|[^,]*' '{print $2}'
"bar,123"
  • if input has well defined fields based on number of characters, FIELDWIDTHS can be used to specify width of each field
$ awk -v FIELDWIDTHS='8 3' -v OFS= '/fig/{$2=35} 1' fruits.txt
fruit   qty
apple   42
banana  31
fig     35
guava   6

$ # without FIELDWIDTHS
$ awk '/fig/{$2=35} 1' fruits.txt
fruit   qty
apple   42
banana  31
fig 35
guava   6

Further Reading


String functions

  • length function - returns length of string, by default acts on $0
$ seq 8 13 | awk 'length()==1'
8
9

$ awk 'NR==1 || length($1)>4' fruits.txt
fruit   qty
apple   42
banana  31
guava   6

$ # character count and not byte count is calculated, similar to 'wc -m'
$ printf 'hi👍' | awk '{print length()}'
3

$ # use -b option if number of bytes are needed
$ printf 'hi👍' | awk -b '{print length()}'
6
$ # 1st argument is string to be split
$ # 2nd argument is array to save results, indexed from 1
$ # 3rd argument is separator, default is FS
$ s='foo,1996-10-25,hello,good'
$ echo "$s" | awk -F, '{split($2,d,"-"); print "Month is: " d[2]}'
Month is: 10

$ # using regular expression to define separator
$ # return value is number of fields after splitting
$ s='Sample123string54with908numbers'
$ echo "$s" | awk '{n=split($0,s,/[0-9]+/); for(i=1;i<=n;i++)print s[i]}'
Sample
string
with
numbers
$ # use 4th argument if separators are needed as well
$ echo "$s" | awk '{n=split($0,s,/[0-9]+/,seps); for(i=1;i<n;i++)print seps[i]}'
123
54
908

$ # single row to multiple rows based on splitting last field
$ s='foo,baz,12:42:3'
$ echo "$s" | awk -F, '{n=split($NF,a,":"); NF--; for(i=1;i<=n;i++) print $0,a[i]}'
foo baz 12
foo baz 42
foo baz 3
  • substr function allows to extract specified number of characters from given string
    • indexing starts with 1
  • See gawk manual - substr function for corner cases and details
$ # 1st argument is string to be worked on
$ # 2nd argument is starting position
$ # 3rd argument is number of characters to be extracted
$ echo 'abcdefghij' | awk '{print substr($0,1,5)}'
abcde
$ echo 'abcdefghij' | awk '{print substr($0,4,3)}'
def
$ # if 3rd argument is not given, string is extracted until end
$ echo 'abcdefghij' | awk '{print substr($0,6)}'
fghij

$ echo 'abcdefghij' | awk -v OFS=':' '{print substr($0,2,3), substr($0,6,3)}'
bcd:fgh

$ # if only few characters are needed from input line, can use empty FS
$ echo 'abcdefghij' | awk -v FS= '{print $3}'
c
$ echo 'abcdefghij' | awk -v FS= '{print $3, $5}'
c e

Executing external commands

  • External commands can be issued using system function
  • Output would be as usual on stdout unless redirected while calling the command
  • Return value of system depends on exit status of executed command, see gawk manual - Input/Output Functions for details
$ awk 'BEGIN{system("echo Hello World")}'
Hello World

$ wc poem.txt
 4 13 65 poem.txt
$ awk 'BEGIN{system("wc poem.txt")}'
 4 13 65 poem.txt

$ awk 'BEGIN{system("seq 10 | paste -sd, > out.txt")}'
$ cat out.txt
1,2,3,4,5,6,7,8,9,10

$ ls xyz.txt
ls: cannot access 'xyz.txt': No such file or directory
$ echo $?
2
$ awk 'BEGIN{s=system("ls xyz.txt"); print "Status: " s}'
ls: cannot access 'xyz.txt': No such file or directory
Status: 2

$ cat f2
I bought two bananas and three mangoes
$ echo 'f1,f2,odd.txt' | awk -F, '{system("cat " $2)}'
I bought two bananas and three mangoes

printf formatting

  • Similar to printf function in C and shell built-in command
  • use sprintf function to save result in variable instead of printing
  • See also gawk manual - printf
$ awk '{sum += $1} END{print sum}' nums.txt
10062.9

$ # note that ORS is not appended and has to be added manually
$ awk '{sum += $1} END{printf "%.2f\n", sum}' nums.txt
10062.86

$ awk '{sum += $1} END{printf "%10.2f\n", sum}' nums.txt
  10062.86

$ awk '{sum += $1} END{printf "%010.2f\n", sum}' nums.txt
0010062.86

$ awk '{sum += $1} END{printf "%d\n", sum}' nums.txt
10062

$ awk '{sum += $1} END{printf "%+d\n", sum}' nums.txt
+10062

$ awk '{sum += $1} END{printf "%e\n", sum}' nums.txt
1.006286e+04
  • to refer argument by positional number (starts with 1), use <num>$
$ # can also use: awk 'BEGIN{printf "hex=%x\noct=%o\ndec=%d\n", 15, 15, 15}'
$ awk 'BEGIN{printf "hex=%1$x\noct=%1$o\ndec=%1$d\n", 15}'
hex=f
oct=17
dec=15

$ # adding prefix to hex/oct numbers
$ awk 'BEGIN{printf "hex=%1$#x\noct=%1$#o\ndec=%1$d\n", 15}'
hex=0xf
oct=017
dec=15
  • strings
$ # prefix remaining width with spaces
$ awk 'BEGIN{printf "%6s:%5s\n", "foo", "bar"}'
   foo:  bar

$ # suffix remaining width with spaces
$ awk 'BEGIN{printf "%-6s:%-5s\n", "foo", "bar"}'
foo   :bar  

$ # truncate
$ awk 'BEGIN{printf "%.2s\n", "foobar"}'
fo
  • avoid using printf without format specifier
$ awk 'BEGIN{s="solve: 5 % x = 1"; printf s}'
awk: cmd. line:1: fatal: not enough arguments to satisfy format string
    `solve: 5 % x = 1'
               ^ ran out for this one

$ awk 'BEGIN{s="solve: 5 % x = 1"; printf "%s\n", s}'
solve: 5 % x = 1

Redirecting print output

  • redirecting to file instead of stdout using >
  • similar to behavior in shell, if file already exists it is overwritten
    • use >> to append to an existing file without deleting content
  • however, unlike shell, subsequent redirections to same file will append to it
  • See also gawk manual - Closing Input and Output Redirections if you have too many redirections
$ seq 6 | awk 'NR%2{print > "odd.txt"; next} {print > "even.txt"}'
$ cat odd.txt
1
3
5
$ cat even.txt
2
4
6

$ awk 'NR==1{col1=$1".txt"; col2=$2".txt"; next}
       {print $1 > col1; print $2 > col2}' fruits.txt
$ cat fruit.txt
apple
banana
fig
guava
$ cat qty.txt
42
31
90
6
  • redirecting to shell command
  • this is useful if you have different things to redirect to different commands, otherwise it can be done as usual in shell acting on awk's output
  • all redirections to same command gets combined as single input to that command
$ # same as: echo 'foo good 123' | awk '{print $2}' | wc -c
$ echo 'foo good 123' | awk '{print $2 | "wc -c"}'
5
$ # to avoid newline character being added to print
$ echo 'foo good 123' | awk -v ORS= '{print $2 | "wc -c"}'
4
$ # assuming no format specifiers in input
$ echo 'foo good 123' | awk '{printf $2 | "wc -c"}'
4

$ # same as: echo 'foo good 123' | awk '{printf $2 $3 | "wc -c"}'
$ echo 'foo good 123' | awk '{printf $2 | "wc -c"; printf $3 | "wc -c"}'
7

Further Reading


Gotchas and Tips

$ # wrong
$ awk -v word="apple" '$1==$word' fruits.txt

$ # right
$ awk -v word="apple" '$1==word' fruits.txt
apple   42
$ # no issue with unix style line ending
$ printf 'foo bar\n123 789\n' | awk '{print $2, $1}'
bar foo
789 123

$ # dos style line ending causes trouble
$ printf 'foo bar\r\n123 789\r\n' | awk '{print $2, $1}'
 foo
 123

$ # easy to deal by simply setting appropriate RS
$ # note that ORS would still be newline character only
$ printf 'foo bar\r\n123 789\r\n' | awk -v RS='\r\n' '{print $2, $1}'
bar foo
789 123
  • relying on default initial value
$ # step 1 - works for single file
$ awk '{sum += $1} END{print sum}' nums.txt
10062.9

$ # step 2 - change to work for multiple file
$ awk '{sum += $1} ENDFILE{print FILENAME, sum}' nums.txt
nums.txt 10062.9

$ # step 3 - check with multiple file input
$ # oops, default numerical value '0' for sum works only once
$ awk '{sum += $1} ENDFILE{print FILENAME, sum}' nums.txt <(seq 3)
nums.txt 10062.9
/dev/fd/63 10068.9

$ # step 4 - correctly initialize variables
$ awk 'BEGINFILE{sum=0} {sum += $1} ENDFILE{print FILENAME, sum}' nums.txt <(seq 3)
nums.txt 10062.9
/dev/fd/63 6
  • use unary operator + to force numeric conversion
$ awk '{sum += $1} END{print FILENAME, sum}' nums.txt
nums.txt 10062.9

$ awk '{sum += $1} END{print FILENAME, sum}' /dev/null
/dev/null 

$ awk '{sum += $1} END{print FILENAME, +sum}' /dev/null
/dev/null 0
  • concatenate empty string to force string comparison
$ echo '5 5.0' | awk '{print $1==$2 ? "same" : "different", "string"}'
same string

$ echo '5 5.0' | awk '{print $1""==$2 ? "same" : "different", "string"}'
different string
  • beware of expressions going -ve for field calculations
$ cat misc.txt
foo
good bad ugly
123 xyz
a b c d

$ # trying to delete last two fields
$ awk '{NF -= 2} 1' misc.txt
awk: cmd. line:1: (FILENAME=misc.txt FNR=1) fatal: NF set to negative value
$ # dynamically change it depending on number of fields
$ awk '{NF = (NF<=2) ? 0 : NF-2} 1' misc.txt

good

a b

$ # similarly, trying to access 3rd field from end
$ awk '{print $(NF-2)}' misc.txt
awk: cmd. line:1: (FILENAME=misc.txt FNR=1) fatal: attempt to access field -1
$ awk 'NF>2{print $(NF-2)}' misc.txt
good
b
$ # all words containing exactly 3 lowercase a
$ time awk -F'a' 'NF==4{cnt++} END{print +cnt}' /usr/share/dict/words
1019

real    0m0.075s

$ time LC_ALL=C awk -F'a' 'NF==4{cnt++} END{print +cnt}' /usr/share/dict/words
1019

real    0m0.045s

Further Reading