verb /ˈrɛdʒɪmɛnt/ organize according to a strict system or pattern.
A command-line tool for sorting standardized separated files (in the sense of RFC 4180).
sort a standardized separated file.
# specify:
# column to sort on (mandatory)
# number-of-columns in file (mandatory)
# field-separator (mandatory)
#
# default output is to stdout
# default newline is LF
# default format of input-file is delimited
# default upper bound for memory is (1024 * 1024) bytes
regiment sort --key 5 --number-columns 15 --field-separator ',' input-file
regiment sort -k 5 -c 15 -f ',' input-file
# specify multiple columns to sort on
regiment sort --key 1 --key 5 --number-columns 15 --field-separator ',' input-file
regiment sort -k 1 -k 5 -c 15 -f ',' input-file
# explicitly specify that format of input-file is standardized
regiment sort --key 1 --number-columns 15 --standardized --field-separator ',' input-file
regiment sort -k 1 -c 15 -f ',' --standardized input-file
# explicitly specify memory upper bound of 2 GB (default unit is MB - use G to specify GB)
regiment sort --mem-max 2G --key 1 --number-columns 15 --field-separator ',' input-file
regiment sort -m 2G -k 1 -c 15 -f ',' input-file
regiment sort -m 2000 -k 1 -c 15 -f ',' input-file
# explicitly specify newline -- one of LF, CR or CRLF (defaults to LF)
regiment sort --key 5 --number-columns 15 --crlf --field-separator ',' input-file
regiment sort -k 5 -c 15 -f ',' --crlf input-file
# explicitly specify path to output file -- defaults to stdout
regiment sort --key 5 --number-columns 15 --field-separator ',' --output "path/to/output-file" input-file
regiment sort -k 5 -c 15 -f ',' -o "path/to/output-file" input-file
# all the things
regiment sort -f ',' -k 1 -k 4 -k 5 -c 26 -m 10G --crlf --standardized -o "path/to/output-file" input-file
split a standardized separated file into a set of temporary files, each of which is sorted and is in regiment's binary format.
# specify:
# same options as for sort (except for --output)
# a directory within which to write the sorted splits (mandatory)
# NOTE: this directory must not already exist; it will be created for you
regiment split <same opts as sort> --dir "path/to/output-dir" input-file
regiment split <same opts as sort> -d "path/to/output-dir" input-file
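Because the split directory must not exist beforehand, a script may want to check for it before calling split. The following is a minimal sketch, not part of regiment: the helper name `split_dir_is_free` is hypothetical, the path is the placeholder used above, and the sort options are taken from the examples in this document.

```shell
# Hypothetical helper: succeeds only when nothing exists at the given path.
split_dir_is_free() {
  [ ! -e "$1" ]
}

out_dir="path/to/output-dir"   # placeholder path, as in the examples above

if split_dir_is_free "$out_dir"; then
  # Only invoke regiment when it is actually installed.
  if command -v regiment >/dev/null 2>&1; then
    regiment split -k 1 -c 5 -f ',' -d "$out_dir" input-file
  fi
else
  echo "refusing to split: $out_dir already exists" >&2
fi
```

Checking first keeps the failure message under the script's control instead of relying on whatever error the tool reports.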
Given the format of an input standardized separated file, merge a set of sorted temporary files (in regiment's binary format) into an output-file (that has the same format as the input standardized separated file).
# specify:
# directories containing sorted splits that require merging (typically outputs of running split)
# output file (optional) -- defaults to stdout
regiment merge-tmps dir1 dir2 ... dirn
# explicitly specify path to output file -- defaults to stdout
regiment merge-tmps --output "path/to/output-file" dir1 dir2 ... dirn
regiment merge-tmps -o "path/to/output-file" dir1 dir2 ... dirn
Relationship between sort, split and merge-tmps
regiment sort -k 1 -c 5 -f ',' --standardized input-file
generates the same output as
regiment split -k 1 -c 5 -f ',' --standardized -d "/foo/bar/baz" input-file
regiment merge-tmps "/foo/bar/baz"
Note: regiment requires local storage roughly equivalent to the size of the inputs, and follows unix TMPDIR conventions for that storage.
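In practice this means temporary storage can be redirected by setting TMPDIR before running a command. The sketch below assumes a hypothetical scratch directory named `regiment-scratch`; substitute a location on a volume with at least as much free space as the input. The regiment options are the ones shown earlier in this document.

```shell
# Hypothetical scratch location; replace with a directory on a large enough volume.
scratch="${TMPDIR:-/tmp}/regiment-scratch"
mkdir -p "$scratch"

# Show the free space available there before starting a large sort.
df -k "$scratch"

# Point TMPDIR at the scratch directory for this one command only.
if command -v regiment >/dev/null 2>&1; then
  TMPDIR="$scratch" regiment sort -k 1 -c 15 -f ',' -o "path/to/output-file" input-file
fi
```

Setting the variable on the command line, rather than exporting it, leaves the rest of the shell session unaffected.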