This repository has been archived by the owner on Jun 5, 2024. It is now read-only.

Sorting stuff


sort

$ sort --version | head -n1
sort (GNU coreutils) 8.25

$ man sort
SORT(1)                          User Commands                         SORT(1)

NAME
       sort - sort lines of text files

SYNOPSIS
       sort [OPTION]... [FILE]...
       sort [OPTION]... --files0-from=F

DESCRIPTION
       Write sorted concatenation of all FILE(s) to standard output.

       With no FILE, or when FILE is -, read standard input.
...

Note: All examples shown here assume ASCII-encoded input files


Default sort

$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ sort poem.txt
And so are you.
Roses are red,
Sugar is sweet,
Violets are blue,
  • Well, that was easy. The lines were sorted alphabetically (ascending order is the default), and it so happened that the first letter alone was enough to decide the order
  • For the next example, let's extract all the words and sort them
    • this also showcases sort accepting input from stdin
    • See the GNU grep chapter if the grep command used below looks alien
$ # output might differ depending on locale settings
$ # note that case is ignored when ordering the output
$ grep -oi '[a-z]*' poem.txt | sort
And
are
are
are
blue
is
red
Roses
so
Sugar
sweet
Violets
you
$ info sort | tail

   (1) If you use a non-POSIX locale (e.g., by setting ‘LC_ALL’ to
‘en_US’), then ‘sort’ may produce output that is sorted differently than
you’re accustomed to.  In that case, set the ‘LC_ALL’ environment
variable to ‘C’.  Note that setting only ‘LC_COLLATE’ has two problems.
First, it is ineffective if ‘LC_ALL’ is also set.  Second, it has
undefined behavior if ‘LC_CTYPE’ (or ‘LANG’, if ‘LC_CTYPE’ is unset) is
set to an incompatible value.  For example, you get undefined behavior
if ‘LC_CTYPE’ is ‘ja_JP.PCK’ but ‘LC_COLLATE’ is ‘en_US.UTF-8’.
  • An example showing the effect of the locale setting
$ # note how uppercase is sorted before lowercase
$ grep -oi '[a-z]*' poem.txt | LC_ALL=C sort
And
Roses
Sugar
Violets
are
are
are
blue
is
red
so
sweet
you
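
As a small sketch of the same effect on made-up input (the three words below are illustrative and not part of poem.txt): forcing the C locale gives plain byte-order comparison, while adding -f folds case explicitly, so the result no longer depends on whichever locale happens to be active.

```shell
# Illustrative input, not one of the sample files used in this chapter
words=$(printf '%s\n' apple Banana cherry)

# Byte order: every uppercase ASCII letter sorts before every lowercase one
byte_order=$(printf '%s\n' "$words" | LC_ALL=C sort)

# -f folds lowercase to uppercase for comparison, even in the C locale
folded=$(printf '%s\n' "$words" | LC_ALL=C sort -f)

printf '%s\n' "$byte_order"   # Banana, apple, cherry
printf '%s\n' "$folded"       # apple, Banana, cherry
```

Setting LC_ALL only for the one command, as above, avoids changing the locale for the rest of the shell session.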

Reverse sort

  • The -r option simply reverses the default ascending order, giving descending order
$ sort -r poem.txt
Violets are blue,
Sugar is sweet,
Roses are red,
And so are you.

Various number sorting

$ cat numbers.txt
20
53
3
101

$ sort numbers.txt
101
20
3
53
  • Whoops, what happened there? sort won't know to treat them as numbers unless specified
  • Depending on format of numbers, different options have to be used
  • First up is -n option, which sorts based on numerical value
$ sort -n numbers.txt
3
20
53
101

$ sort -nr numbers.txt
101
53
20
3
  • The -n option can handle negative numbers
  • It also handles thousands separators and decimal points (which characters count as such depends on the locale)
  • The <() syntax is Process Substitution
    • put simply, it lets the output of a command be passed as an input file to another command, without manually creating a temporary file
$ # multiple files are merged as single input by default
$ sort -n numbers.txt <(echo '-4')
-4
3
20
53
101

$ sort -n numbers.txt <(echo '1,234')
3
20
53
101
1,234

$ sort -n numbers.txt <(echo '31.24')
3
20
31.24
53
101
  • The -g option sorts general numeric formats, such as numbers with a leading + or in E (scientific) notation
$ cat generic_numbers.txt
+120
-1.53
3.14e+4
42.1e-2

$ sort -g generic_numbers.txt
-1.53
42.1e-2
+120
3.14e+4
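
To make the difference between -n and -g concrete, here is a minimal sketch on illustrative input (the three values are assumptions, not from generic_numbers.txt). With -n, parsing stops at the first character that cannot be part of a plain number, so the exponent is ignored.

```shell
# Illustrative input: 1e3 is 1000 in scientific notation
nums=$(printf '%s\n' 1e3 5 20)

# -n stops parsing at 'e', so 1e3 is compared as 1
with_n=$(printf '%s\n' "$nums" | LC_ALL=C sort -n)

# -g understands the exponent, so 1e3 is compared as 1000
with_g=$(printf '%s\n' "$nums" | LC_ALL=C sort -g)

printf '%s\n' "$with_n"   # 1e3, 5, 20
printf '%s\n' "$with_g"   # 5, 20, 1e3
```

The GNU documentation suggests using -g only when it is needed, since it is slower than -n.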
  • Commands like du have options to display numbers in human readable formats
  • sort supports sorting such numbers using the -h option
$ du -sh *
104K    power.log
746M    projects
316K    report.log
20K     sample.txt
$ du -sh * | sort -h
20K     sample.txt
104K    power.log
316K    report.log
746M    projects

$ # --si uses powers of 1000 instead of 1024
$ du -s --si *
107k    power.log
782M    projects
324k    report.log
21k     sample.txt
$ du -s --si * | sort -h
21k     sample.txt
107k    power.log
324k    report.log
782M    projects
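
One caveat worth sketching: -h assumes the numbers were scaled consistently, as du does. It compares the unit suffix before the numeric value, so hand-written sizes with inconsistent units can come out in a surprising order. The values below are illustrative, not taken from the du output above.

```shell
# Consistently scaled sizes sort as expected
consistent=$(printf '%s\n' 2M 10K 1G 512 | LC_ALL=C sort -h)

# 2000K is larger than 1M, yet -h compares the suffix before the number
mixed=$(printf '%s\n' 1M 2000K | LC_ALL=C sort -h)

printf '%s\n' "$consistent"   # 512, 10K, 2M, 1G
printf '%s\n' "$mixed"        # 2000K, 1M
```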
  • Version sort (the -V option) deals with numbers mixed with other characters
  • If this sorting is needed only for displaying directory contents, use ls -v instead of piping to sort -V
$ cat versions.txt
foo_v1.2
bar_v2.1.3
foobar_v2
foo_v1.2.1
foo_v1.3

$ sort -V versions.txt
bar_v2.1.3
foobar_v2
foo_v1.2
foo_v1.2.1
foo_v1.3
  • Another common use case is when there are multiple filenames differentiated by numbers
$ cat files.txt
file0
file10
file3
file4

$ sort -V files.txt
file0
file3
file4
file10
  • Version sort can also be used for durations reported by the time command
$ # different solving durations
$ cat rubik_time.txt
5m35.363s
3m20.058s
4m5.099s
4m1.130s
3m42.833s
4m33.083s

$ # assuming consistent min/sec format
$ sort -V rubik_time.txt
3m20.058s
3m42.833s
4m1.130s
4m5.099s
4m33.083s
5m35.363s
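
The sketch below contrasts -V with -n on dotted version strings (illustrative values, not from versions.txt). With -n, each string is read as a decimal number, so 1.10 is treated as 1.1; with -V, each run of digits is compared as a whole number.

```shell
# Illustrative version strings
versions=$(printf '%s\n' 1.10 1.9 1.2)

# Numeric sort reads 1.10 as the decimal 1.1, which is less than 1.2
as_numbers=$(printf '%s\n' "$versions" | LC_ALL=C sort -n)

# Version sort compares 10, 9 and 2 as integers after the dot
as_versions=$(printf '%s\n' "$versions" | LC_ALL=C sort -V)

printf '%s\n' "$as_numbers"    # 1.10, 1.2, 1.9
printf '%s\n' "$as_versions"   # 1.2, 1.9, 1.10
```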

Random sort

$ cat nums.txt
1
10
10
12
23
563

$ # the two 10s will always be next to each other
$ sort -R nums.txt
563
12
1
10
10
23

$ # duplicates can end up anywhere
$ shuf nums.txt
10
23
1
10
563
12
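
Since the output of both commands changes from run to run, a quick way to convince yourself of the two comments above is to check properties that always hold. This is a sketch that recreates the nums.txt values in a temporary file; the variable names are illustrative.

```shell
# Same values as nums.txt above, recreated in a temporary file
tmp=$(mktemp)
printf '%s\n' 1 10 10 12 23 563 > "$tmp"

# sort -R orders by a hash of each line, so identical lines stay adjacent;
# collapsing adjacent duplicates therefore always leaves 5 distinct lines
distinct_after_R=$(sort -R "$tmp" | uniq | wc -l)

# shuf only permutes lines; sorting its output gives back the sorted input
same_lines=no
if [ "$(shuf "$tmp" | sort -n)" = "$(sort -n "$tmp")" ]; then
    same_lines=yes
fi

echo "$distinct_after_R $same_lines"
rm -f "$tmp"
```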

Specifying output file

  • The -o option can be used to specify the output file
  • This is useful for in-place editing, since the output file can be the same as the input file
$ sort -R nums.txt -o rand_nums.txt
$ cat rand_nums.txt
23
1
10
10
563
12

$ sort -R nums.txt -o nums.txt
$ cat nums.txt
563
23
10
10
1
12
  • Use a shell loop if there are multiple files to be sorted in place
  • The snippet below is for the bash shell
$ for f in *.txt; do echo sort -V "$f" -o "$f"; done
sort -V files.txt -o files.txt
sort -V rubik_time.txt -o rubik_time.txt
sort -V versions.txt -o versions.txt

$ # remove echo once commands look fine
$ for f in *.txt; do sort -V "$f" -o "$f"; done
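
Here is a minimal sketch of why -o is the right tool for in-place sorting, using a temporary file with illustrative contents. sort reads all of its input before it opens the -o file, whereas a shell redirection to the same file would truncate it before sort gets to read anything.

```shell
tmp=$(mktemp)
printf '%s\n' 20 53 3 101 > "$tmp"

# Safe: sort reads the whole input before writing to the -o file.
# By contrast, 'sort -n "$tmp" > "$tmp"' lets the shell empty the file first.
sort -n -o "$tmp" "$tmp"

# -c only checks whether the input is already sorted; exit status 0 means yes
if sort -n -c "$tmp"; then status=sorted; else status=unsorted; fi

result=$(cat "$tmp")
printf '%s\n' "$status" "$result"
rm -f "$tmp"
```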

Unique sort

  • The -u option keeps only the first copy of lines that are deemed equal according to the sort options used
$ cat duplicates.txt
foo
12 carrots
foo
12 apples
5 guavas

$ # only one copy of foo in output
$ sort -u duplicates.txt
12 apples
12 carrots
5 guavas
foo
  • The definition of a duplicate varies with the options used
  • For example, when -n is used, lines with matching numbers are deemed the same even if the rest of the line differs
    • Pipe the output to uniq if this is not desirable
$ # note how first copy of line starting with 12 is retained
$ sort -nu duplicates.txt
foo
5 guavas
12 carrots

$ # use uniq when entire line should be compared to find duplicates
$ sort -n duplicates.txt | uniq
foo
5 guavas
12 apples
12 carrots
  • Use the -f option to ignore the case of letters while determining duplicates
$ cat words.txt
CAR
are
car
Are
foot
are

$ # only the two 'are' were considered duplicates
$ sort -u words.txt
are
Are
car
CAR
foot

$ # note again that first copy of duplicate is retained
$ sort -fu words.txt
are
CAR
foot

Column based sorting

From info sort

‘-k POS1[,POS2]’
‘--key=POS1[,POS2]’
     Specify a sort field that consists of the part of the line between
     POS1 and POS2 (or the end of the line, if POS2 is omitted),
     _inclusive_.

     Each POS has the form ‘F[.C][OPTS]’, where F is the number of the
     field to use, and C is the number of the first character from the
     beginning of the field.  Fields and character positions are
     numbered starting with 1; a character position of zero in POS2
     indicates the field’s last character.  If ‘.C’ is omitted from
     POS1, it defaults to 1 (the beginning of the field); if omitted
     from POS2, it defaults to 0 (the end of the field).  OPTS are
     ordering options, allowing individual keys to be sorted according
     to different rules; see below for details.  Keys can span multiple
     fields.
  • By default, blank characters (space and tab) serve as field separators
$ cat fruits.txt
apple   42
guava   6
fig     90
banana  31

$ sort fruits.txt
apple   42
banana  31
fig     90
guava   6

$ # sort based on 2nd column numbers
$ sort -k2,2n fruits.txt
guava   6
banana  31
apple   42
fig     90
  • Using a different field separator
  • Consider the following sample input file having fields separated by :
$ # name:pet_name:no_of_pets
$ cat pets.txt
foo:dog:2
xyz:cat:1
baz:parrot:5
abcd:cat:3
joe:dog:1
bar:fox:1
temp_var:squirrel:4
boss:dog:10
  • Sorting can be based on a particular column, or on everything from a column to the end of the line
  • When multiple lines have the same key, sort by default compares the entire line as a last resort to resolve the tie
$ # only 2nd column
$ # -k2,4 would mean 2nd column to 4th column
$ sort -t: -k2,2 pets.txt
abcd:cat:3
xyz:cat:1
boss:dog:10
foo:dog:2
joe:dog:1
bar:fox:1
baz:parrot:5
temp_var:squirrel:4

$ # from 2nd column to end of line
$ sort -t: -k2 pets.txt
xyz:cat:1
abcd:cat:3
joe:dog:1
boss:dog:10
foo:dog:2
bar:fox:1
baz:parrot:5
temp_var:squirrel:4
  • Multiple keys can be specified to resolve ties
  • Note that if lines still tie on all the specified keys, the entire line is compared as a last resort
$ # default sort for 2nd column, numeric sort on 3rd column to resolve ties
$ sort -t: -k2,2 -k3,3n pets.txt
xyz:cat:1
abcd:cat:3
joe:dog:1
foo:dog:2
boss:dog:10
bar:fox:1
baz:parrot:5
temp_var:squirrel:4

$ # numeric sort on 3rd column, default sort for 2nd column to resolve ties
$ sort -t: -k3,3n -k2,2 pets.txt
xyz:cat:1
joe:dog:1
bar:fox:1
foo:dog:2
abcd:cat:3
temp_var:squirrel:4
baz:parrot:5
boss:dog:10
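
Ordering options can also differ per key. As a sketch, the snippet below recreates pets.txt in a temporary file and sorts the 2nd column in the default ascending order while sorting the 3rd column numerically in descending order (the r is attached to that key only).

```shell
tmp=$(mktemp)
printf '%s\n' foo:dog:2 xyz:cat:1 baz:parrot:5 abcd:cat:3 \
    joe:dog:1 bar:fox:1 temp_var:squirrel:4 boss:dog:10 > "$tmp"

# 2nd field ascending (default), 3rd field numeric and reversed
by_pet_then_count=$(LC_ALL=C sort -t: -k2,2 -k3,3nr "$tmp")

printf '%s\n' "$by_pet_then_count"
# abcd:cat:3
# xyz:cat:1
# boss:dog:10
# foo:dog:2
# joe:dog:1
# bar:fox:1
# baz:parrot:5
# temp_var:squirrel:4
rm -f "$tmp"
```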
  • Use the -s option to retain the original order of lines in case of a tie
$ sort -s -t: -k2,2 pets.txt
xyz:cat:1
abcd:cat:3
foo:dog:2
joe:dog:1
boss:dog:10
bar:fox:1
baz:parrot:5
temp_var:squirrel:4
  • The -u option, as seen earlier, retains only the first line among those with matching keys
$ sort -u -t: -k2,2 pets.txt
xyz:cat:1
foo:dog:2
bar:fox:1
baz:parrot:5
temp_var:squirrel:4

$ sort -u -t: -k3,3n pets.txt
xyz:cat:1
foo:dog:2
abcd:cat:3
temp_var:squirrel:4
baz:parrot:5
boss:dog:10
  • Sometimes the input has to be sorted one way first, with -u then applied on a different key in a second sort
$ # sort by number in 3rd column
$ sort -t: -k3,3n pets.txt
bar:fox:1
joe:dog:1
xyz:cat:1
foo:dog:2
abcd:cat:3
temp_var:squirrel:4
baz:parrot:5
boss:dog:10

$ # then get unique entry based on 2nd column
$ sort -t: -k3,3n pets.txt | sort -t: -u -k2,2
xyz:cat:1
joe:dog:1
bar:fox:1
baz:parrot:5
temp_var:squirrel:4
  • Particular character positions within fields can be specified
  • If the character position is not specified, it defaults to 1 for the starting position and 0 (last character) for the ending position
$ cat marks.txt
fork,ap_12,54
flat,up_342,1.2
fold,tn_48,211
more,ap_93,7
rest,up_5,63

$ # for 2nd column, sort numerically only from 4th character to end
$ sort -t, -k2.4,2n marks.txt
rest,up_5,63
fork,ap_12,54
fold,tn_48,211
more,ap_93,7
flat,up_342,1.2

$ # sort uniquely based on first two characters of line
$ sort -u -k1.1,1.2 marks.txt
flat,up_342,1.2
fork,ap_12,54
more,ap_93,7
rest,up_5,63
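
As another sketch of character positions, consider hypothetical log lines that start with an ISO date (this data is made up for illustration). Characters 6 to 7 of the first field hold the month, so the lines can be ordered by month regardless of year.

```shell
# Hypothetical log lines: ISO date, a space, then an event name
log=$(printf '%s\n' '2021-03-15 deploy' '2020-11-02 backup' '2022-01-30 audit')

# Characters 6 to 7 of field 1 hold the month; sort numerically on just those
by_month=$(printf '%s\n' "$log" | LC_ALL=C sort -k1.6,1.7n)

printf '%s\n' "$by_month"
# 2022-01-30 audit
# 2021-03-15 deploy
# 2020-11-02 backup
```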
  • If the input has a header line, keep it out of the sort and add it back to the output
$ cat header.txt
fruit   qty
apple   42
guava   6
fig     90
banana  31

$ # separate and combine header and content to be sorted
$ cat <(head -n1 header.txt) <(tail -n +2 header.txt | sort -k2nr)
fruit   qty
fig     90
apple   42
banana  31
guava   6
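
Process substitution is not available in every shell. A sketch of the same idea using a brace group, which should work in any POSIX shell, is shown below; the temporary file stands in for header.txt.

```shell
tmp=$(mktemp)
printf '%s\n' 'fruit qty' 'apple 42' 'guava 6' 'fig 90' 'banana 31' > "$tmp"

# The braces group both commands so that their outputs form a single stream:
# first the header line untouched, then the remaining lines sorted
sorted=$( { head -n1 "$tmp"; tail -n +2 "$tmp" | sort -k2,2nr; } )

printf '%s\n' "$sorted"
# fruit qty
# fig 90
# apple 42
# banana 31
# guava 6
rm -f "$tmp"
```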

Further reading for sort


uniq

$ uniq --version | head -n1
uniq (GNU coreutils) 8.25

$ man uniq
UNIQ(1)                          User Commands                         UNIQ(1)

NAME
       uniq - report or omit repeated lines

SYNOPSIS
       uniq [OPTION]... [INPUT [OUTPUT]]

DESCRIPTION
       Filter  adjacent matching lines from INPUT (or standard input), writing
       to OUTPUT (or standard output).

       With no options, matching lines are merged to the first occurrence.
...

Default uniq

$ cat word_list.txt
are
are
to
good
bad
bad
bad
good
are
bad

$ # adjacent duplicate lines are removed, leaving one copy
$ uniq word_list.txt
are
to
good
bad
good
are
bad

$ # To remove duplicates from entire file, input has to be sorted first
$ # also showcases that uniq accepts stdin as input
$ sort word_list.txt | uniq
are
bad
good
to

Only duplicates

$ # duplicates adjacent to each other
$ uniq -d word_list.txt
are
bad

$ # duplicates in entire file
$ sort word_list.txt | uniq -d
are
bad
good
  • Use -D to get only the duplicated lines, showing every copy of them
$ uniq -D word_list.txt
are
are
bad
bad
bad

$ sort word_list.txt | uniq -D
are
are
are
bad
bad
bad
bad
good
good
  • Use --all-repeated=separate to separate the different groups with an empty line
$ # using --all-repeated=prepend will add a newline before the first group as well
$ sort word_list.txt | uniq --all-repeated=separate
are
are
are

bad
bad
bad
bad

good
good

Only unique

$ # lines with no adjacent duplicates
$ uniq -u word_list.txt
to
good
good
are
bad

$ # unique lines in entire file
$ sort word_list.txt | uniq -u
to

Prefix count

$ # adjacent lines
$ uniq -c word_list.txt
      2 are
      1 to
      1 good
      3 bad
      1 good
      1 are
      1 bad

$ # entire file
$ sort word_list.txt | uniq -c
      3 are
      4 bad
      2 good
      1 to

$ # entire file, only duplicates
$ sort word_list.txt | uniq -cd
      3 are
      4 bad
      2 good
  • Sorting by count
$ # sort by count
$ sort word_list.txt | uniq -c | sort -n
      1 to
      2 good
      3 are
      4 bad

$ # reverse the order, highest count first
$ sort word_list.txt | uniq -c | sort -nr
      4 bad
      3 are
      2 good
      1 to
  • To get only the entries with the minimum or maximum count, a bit of awk helps
$ # consider this result
$ sort colors.txt | uniq -c | sort -nr
      3 Red
      3 Blue
      2 Yellow
      1 Green
      1 Black

$ # to get all max count
$ # save 1st line 1st column value to c and then print if 1st column equals c
$ sort colors.txt | uniq -c | sort -nr | awk 'NR==1{c=$1} $1==c'
      3 Red
      3 Blue
$ # to get all min count
$ sort colors.txt | uniq -c | sort -n | awk 'NR==1{c=$1} $1==c'
      1 Black
      1 Green
  • Get a rough count of the most used commands from the history file
$ # awk '{print $1}' will get the 1st column alone
$ awk '{print $1}' "$HISTFILE" | sort | uniq -c | sort -nr | head
   1465 echo
   1180 grep
    552 cd
    531 awk
    451 sed
    423 vi
    418 cat
    392 perl
    325 printf
    320 sort

$ # extract command name from start of line or preceded by 'spaces|spaces'
$ # won't catch commands in other places like command substitution though
$ grep -oP '(^| +\| +)\K[^ ]+' "$HISTFILE" | sort | uniq -c | sort -nr | head
   2006 grep
   1469 echo
    933 sed
    698 awk
    552 cd
    513 perl
    510 cat
    453 sort
    423 vi
    327 printf
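
The same sort | uniq -c | sort -nr pattern works for any frequency table. Below is a minimal sketch that counts words in a made-up sentence; the second key on the word itself is an assumption added here so that entries with equal counts come out in a predictable order.

```shell
# Illustrative text
text='the cat and the dog and the bird'

# One word per line, group identical words, count them,
# then order by count (descending) and by word to break ties
freq=$(printf '%s\n' "$text" | tr -s ' ' '\n' | LC_ALL=C sort | uniq -c |
    LC_ALL=C sort -k1,1nr -k2,2)

# uniq -c left-pads the counts; awk re-splits the fields to drop the padding
table=$(printf '%s\n' "$freq" | awk '{print $2, $1}')

printf '%s\n' "$table"
# the 3
# and 2
# bird 1
# cat 1
# dog 1
```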

Ignoring case

$ cat another_list.txt
food
Food
good
are
bad
Are

$ # note how first copy is retained
$ uniq -i another_list.txt
food
good
are
bad
Are

$ uniq -iD another_list.txt
food
Food

Combining multiple files

$ sort -f word_list.txt another_list.txt | uniq -i
are
bad
food
good
to

$ sort -f word_list.txt another_list.txt | uniq -c
      4 are
      1 Are
      5 bad
      1 food
      1 Food
      3 good
      1 to

$ sort -f word_list.txt another_list.txt | uniq -ic
      5 are
      5 bad
      2 food
      3 good
      1 to
  • If only adjacent lines should be compared (input not sorted), the files have to be concatenated using another command, as uniq accepts only one input file
$ uniq -id word_list.txt
are
bad

$ uniq -id another_list.txt
food

$ cat word_list.txt another_list.txt | uniq -id
are
bad
food

Column options

  • uniq has a few options for working with columns. They are not as extensive as sort -k, but handy for some cases
  • First up, skipping fields
    • There is no option to specify a different delimiter
    • From info uniq: Fields are sequences of non-space non-tab characters that are separated from each other by at least one space or tab
    • The number of spaces/tabs between fields should be the same across lines
$ cat shopping.txt
lemon 5
mango 5
banana 8
bread 1
orange 5

$ # skips first field
$ uniq -f1 shopping.txt
lemon 5
banana 8
bread 1
orange 5

$ # use -f3 to skip first three fields and so on
  • Skipping characters
$ cat text
glue
blue
black
stack
stuck

$ # don't consider first 2 characters
$ uniq -s2 text
glue
black
stuck

$ # to visualize the above example
$ # assume there are two fields and uniq is applied on 2nd column
$ sed 's/^../& /' text
gl ue
bl ue
bl ack
st ack
st uck
  • Comparing only up to a specified number of characters
$ # consider only first 2 characters
$ uniq -w2 text
glue
blue
stack

$ # to visualize the above example
$ # assume there are two fields and uniq is applied on 1st column
$ sed 's/^../& /' text
gl ue
bl ue
bl ack
st ack
st uck
  • The -s and -w options can be combined
  • Both can be combined with -f as well
$ # skip first 3 characters and then use next 2 characters
$ uniq -s3 -w2 text
glue
black
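
Since uniq only compares adjacent lines, the field options are often paired with a sort on the same field. The sketch below recreates shopping.txt in a temporary file, sorts on the 2nd field so that equal quantities become adjacent, and then counts each run while ignoring the 1st field.

```shell
tmp=$(mktemp)
printf '%s\n' 'lemon 5' 'mango 5' 'banana 8' 'bread 1' 'orange 5' > "$tmp"

# Sort numerically on field 2, then count runs of lines whose
# content after field 1 is identical; awk drops the count padding
runs=$(LC_ALL=C sort -k2,2n "$tmp" | uniq -f1 -c | awk '{print $1, $2, $3}')

printf '%s\n' "$runs"
# 1 bread 1
# 3 lemon 5
# 1 banana 8
rm -f "$tmp"
```

The line shown for each run is simply the first one in sorted order, so only the count and the 2nd field are meaningful here.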

Further reading for uniq


comm

$ comm --version | head -n1
comm (GNU coreutils) 8.25

$ man comm
COMM(1)                          User Commands                         COMM(1)

NAME
       comm - compare two sorted files line by line

SYNOPSIS
       comm [OPTION]... FILE1 FILE2

DESCRIPTION
       Compare sorted files FILE1 and FILE2 line by line.

       When FILE1 or FILE2 (not both) is -, read standard input.

       With  no  options,  produce  three-column  output.  Column one contains
       lines unique to FILE1, column two contains lines unique to  FILE2,  and
       column three contains lines common to both files.
...

Default three column output

Consider below sample input files

$ # sorted input files viewed side by side
$ paste colors_1.txt colors_2.txt
Blue    Black
Brown   Blue
Purple  Green
Red     Red
Teal    White
Yellow
  • Without any option, comm gives 3 column output
    • lines unique to first file
    • lines unique to second file
    • lines common to both files
$ comm colors_1.txt colors_2.txt
        Black
                Blue
Brown
        Green
Purple
                Red
Teal
        White
Yellow

Suppressing columns

  • -1 suppresses lines unique to the first file
  • -2 suppresses lines unique to the second file
  • -3 suppresses lines common to both files
$ # suppressing column 3
$ comm -3 colors_1.txt colors_2.txt
        Black
Brown
        Green
Purple
Teal
        White
Yellow
  • Combining options gives three distinct and useful constructs
  • First, getting only common lines to both files
$ comm -12 colors_1.txt colors_2.txt
Blue
Red
  • Second, lines unique to first file
$ comm -23 colors_1.txt colors_2.txt
Brown
Purple
Teal
Yellow
  • And the third, lines unique to second file
$ comm -13 colors_1.txt colors_2.txt
Black
Green
White
  • See also how the above three cases can be done using grep alone
    • Note input files do not need to be sorted for grep solution
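
When the inputs are not already sorted, they can be sorted into temporary files first. The sketch below uses shuffled copies of the two color lists and forces the C locale for both sort and comm, so that comm sees the ordering it expects; the variable names are illustrative.

```shell
a=$(mktemp)
b=$(mktemp)

# Unsorted copies of the two color lists, sorted under the C locale
printf '%s\n' Red Blue Teal Brown Yellow Purple | LC_ALL=C sort > "$a"
printf '%s\n' White Red Green Black Blue | LC_ALL=C sort > "$b"

common=$(LC_ALL=C comm -12 "$a" "$b")        # in both files
only_first=$(LC_ALL=C comm -23 "$a" "$b")    # only in the first file
only_second=$(LC_ALL=C comm -13 "$a" "$b")   # only in the second file

printf '%s\n' "$common"        # Blue, Red
printf '%s\n' "$only_first"    # Brown, Purple, Teal, Yellow
printf '%s\n' "$only_second"   # Black, Green, White
rm -f "$a" "$b"
```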

If a sort order other than the default is required, use --nocheck-order to suppress the error message

$ comm -23 <(sort -n numbers.txt) <(sort -n nums.txt)
3
comm: file 1 is not in sorted order
20
53
101

$ comm --nocheck-order -23 <(sort -n numbers.txt) <(sort -n nums.txt)
3
20
53
101
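
Note that --nocheck-order only turns off the check. comm still walks both files assuming its own ordering, so it can miss matches when the inputs are ordered differently. A more cautious sketch is to sort both inputs with plain sort under the same locale as comm, and restore numeric order only for display. The temporary files below stand in for numbers.txt and nums.txt.

```shell
a=$(mktemp)
b=$(mktemp)
printf '%s\n' 20 53 3 101 | LC_ALL=C sort > "$a"
printf '%s\n' 1 10 10 12 23 563 | LC_ALL=C sort > "$b"

# Lines unique to the first file, in the byte order comm works with
unique_bytes=$(LC_ALL=C comm -23 "$a" "$b")

# Re-sort numerically afterwards if numeric order matters for display
unique_numeric=$(printf '%s\n' "$unique_bytes" | sort -n)

printf '%s\n' "$unique_numeric"   # 3, 20, 53, 101
rm -f "$a" "$b"
```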

Files with duplicates

  • When a line occurs in both files, as many copies as can be paired between the two files are considered common
  • The remaining copies are unique to their respective files
  • This is useful for cases like finding lines present in the first file but not in the second, taking the count of duplicates into consideration as well
    • This solution won't be possible with grep
$ paste list1 list2
a       a
a       b
a       c
b       c
b       d
c

$ comm list1 list2
                a
a
a
                b
b
                c
        c
        d

$ comm -23 list1 list2
a
a
b

Further reading for comm


shuf

$ shuf --version | head -n1
shuf (GNU coreutils) 8.25

$ man shuf
SHUF(1)                          User Commands                         SHUF(1)

NAME
       shuf - generate random permutations

SYNOPSIS
       shuf [OPTION]... [FILE]
       shuf -e [OPTION]... [ARG]...
       shuf -i LO-HI [OPTION]...

DESCRIPTION
       Write a random permutation of the input lines to standard output.

       With no FILE, or when FILE is -, read standard input.
...

Random lines

  • Without repeating input lines
$ cat nums.txt
1
10
10
12
23
563

$ # duplicates can end up anywhere
$ # all lines are part of output
$ shuf nums.txt
10
23
1
10
563
12

$ # limit max number of output lines
$ shuf -n2 nums.txt
563
23
  • Use the -o option to write to an output file instead of displaying on stdout
  • This is helpful for in-place editing
$ shuf nums.txt -o nums.txt
$ cat nums.txt
10
12
23
10
563
1
  • With repeated input lines
$ # -n3 for max 3 lines, -r allows input lines to be repeated
$ shuf -n3 -r nums.txt
1
1
563

$ seq 3 | shuf -n5 -r
2
1
2
1
2

$ # with -r, if a limit is not specified using -n, shuf will output lines indefinitely
  • Use the -e option to specify the input lines on the command line itself
$ shuf -e red blue green
green
blue
red

$ shuf -e 'hi there' 'hello world' foo bar
bar
hi there
foo
hello world

$ shuf -n2 -e 'hi there' 'hello world' foo bar
foo
hi there

$ shuf -r -n4 -e foo bar
foo
foo
bar
foo
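
For tests and demos it can help to get the same "random" order on every run. GNU shuf accepts a --random-source=FILE option for this. The sketch below uses a file of fixed content as the source (an arbitrary choice made here for illustration; it is not suitable where real randomness matters).

```shell
seed=$(mktemp)

# Any file with enough bytes can act as the random source;
# fixed content makes the resulting order repeatable
seq 1000 > "$seed"

first=$(shuf --random-source="$seed" -e red blue green yellow)
second=$(shuf --random-source="$seed" -e red blue green yellow)

# Both runs consume the same bytes, so they produce the same order
if [ "$first" = "$second" ]; then echo "repeatable"; fi
rm -f "$seed"
```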

Random integer numbers

  • The -i option accepts integer range as input to be shuffled
$ shuf -i 3-8
3
7
6
4
8
5
  • Combine with other options as needed
$ shuf -n3 -i 3-8
5
4
7

$ shuf -r -n4 -i 3-8
5
5
7
8

$ shuf -r -n5 -i 0-1
1
0
0
1
1
  • Use seq to generate the input if negative numbers, floating-point numbers, etc. are needed
$ seq 2 -1 -2 | shuf
2
-1
-2
0
1

$ seq 0.3 0.1 0.7 | shuf -n3
0.4
0.5
0.7
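
Two small use cases, sketched with illustrative ranges: rolling a die, and drawing several distinct numbers. Without -r, shuf never repeats an input line, so the draw is guaranteed to have no duplicates.

```shell
# One roll of a six-sided die
roll=$(shuf -n1 -i 1-6)

# Five distinct numbers from 1 to 50 (no -r, so no repeats)
draw=$(shuf -n5 -i 1-50)

# Count the distinct values in the draw; this is always 5
distinct=$(printf '%s\n' "$draw" | sort -n | uniq | wc -l)

echo "roll: $roll"
printf '%s\n' "$draw"
```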

Further reading for shuf