Currently, aggregators are a work in progress. The new ones live in `cpp/bin`; they are built automatically during `setup_pash.sh`, and the unit tests in `cpp/tests` are run during `run_tests.sh`. The interface is as follows:

```
aggregator inputFile1 inputFile2 args
```

where `args` are the arguments that were passed to the command that produced the input files. The aggregator writes its output to stdout.
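As a concrete illustration of this calling convention, here is a minimal stand-in aggregator in Python (the real aggregators are C++ binaries in `cpp/bin`; the summing logic below is a toy for a `wc`-style command and is not taken from the repository):

```python
import sys

def aggregate(part1, part2, args):
    """Toy wc-style merge: sum the numeric columns of two partial outputs.
    `args` would be the flags the original command was invoked with."""
    a = [int(x) for x in part1.split()]
    b = [int(x) for x in part2.split()]
    return " ".join(str(x + y) for x, y in zip(a, b))

if __name__ == "__main__" and len(sys.argv) >= 3:
    # Invoked as: aggregator inputFile1 inputFile2 args...
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        print(aggregate(f1.read(), f2.read(), sys.argv[3:]))
```

The point is only the shape of the contract: two partial-result files as positional arguments, the original command's flags after them, and the merged result on stdout.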
Let's assume that the aggregator being implemented is for a command called `cmd`.

1. Create a folder named `cmd` inside `cpp/aggregators`.
2. For each OS supported by PaSh:
   - 2.1. Create a file named `OS-agg.h` inside that folder.
   - 2.2. Implement the aggregator inside that file using the instructions provided in `cpp/common/main.h`, or use a different aggregator as an example. Remember the include guard.
   - 2.3. You may create additional files in the aggregator directory; these can be used to share code between aggregator implementations for different OSes. When `#include`-ing, assume that the aggregator directory is in the include path.
3. Add unit tests for the created aggregator in `cpp/tests/test-OS.sh` for each OS. Consult the instructions in that file. Remember to test all options and flags of the aggregator.

Note: after completing these steps the aggregator will automatically be built by the Makefile, with no changes to it required.
- Command-specific aggregators for POSIX commands:
  - `/agg-synthesis`:
    - Aggregators: `grep`, `wc`, `sort`, `uniq`
    - `/tail_head`: `tail`, `head` (under development)
    - `/grep-n`: under development -- not used by any current benchmark scripts
    - Util functions: read, write, settings (locale and padding length)
  - `/Benchmarks`: test correctness and identify implemented/not-implemented aggregators
    - covid-mts
    - nlp
    - oneliners
    - unix50
  - `/agg-mult-input`:
    - Aggregators: draft for combining results when a single command takes in multiple inputs
- Development Journal
- Aggregates parallel results when commands are applied to a single file input (e.g. `wc hi.txt`).
- How to run:

  ```
  ./s_wc.py -c [parallel output result 1] [parallel output result 2] ...
  ```
Script | Additional info. needed | Description | Notes |
---|---|---|---|
`./s_wc.py` | No | `-l`, `-c`, `-w`, `-m` | |
`./s_grep.py` | No | `grep` results: `-c`, flags that don't change the concatenation nature (`-i`, `-e`, ...) | |
`./s_uniq.py` | No | `uniq`: merges identical lines at the end/beginning of adjacent files; `-c` | |
`./s_sort.py` | No | `sort` results: `-n`, `-k`, `-r`, `-u`, `-f` | |
`./s_head.py` | No | `head` results by always returning the former split document when given multiple split documents | Under development |
`./s_tail.py` | No | `tail` results by always returning the later split document when given multiple split documents | Under development |
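As an example of the merging these scripts perform, the merge that `s_sort.py` needs for plain `sort` output can be sketched with a standard k-way merge (an illustration only, not the actual script; flag handling such as `-n` or `-r` would change the comparison key):

```python
import heapq

def merge_sorted_partials(partials):
    """Merge already-sorted partial outputs, as a `sort` aggregator must:
    each partial is sorted, so a k-way merge yields the global order."""
    return list(heapq.merge(*partials))

print(merge_sorted_partials([["apple", "pear"], ["banana", "zebra"]]))
# -> ['apple', 'banana', 'pear', 'zebra']
```

Because each input is already sorted, this runs in linear time in the total output size rather than re-sorting everything.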
- `./inputs.sh`: retrieves all required inputs
- `./run.sh`: runs benchmark scripts with `bash` and `agg`
- `./verify.sh --generate`: generates hashes for all outputs to verify correctness
- `./cleanup.sh`: removes all output + intermediate files generated by the current run
Directory | Description | Notes |
---|---|---|
unix50 | Collection of one-line scripts to run on input txt files | Use the `--reg` flag for the currently available inputs retrieved by the input script |
oneliners | Collection of one-line scripts to run on input txt files | Some scripts involving `mkfifo` cannot currently be tested due to the parser's simplicity |
covid-mts | Script to process COVID mass-transit data | |
nlp | Collection of one-line scripts to run on input txt files | |
- `./agg_run.sh [script] [input]`: applies the available `agg` to individual commands parsed out with `|` as the delimiter
  - parses the script into `CMDLIST`, running the steps below for each cmd:
    - if the current cmd has an implemented `agg`, split the file into `SIZE=2` parts and apply `./test-par-driver.sh` to run each split file with the cmd and apply the `agg`:

      ```
      Parallel:
      cat file-0 | $CMD > file-0-par
      cat file-1 | $CMD > file-1-par
      agg file-0-par file-1-par > file-par.txt
      ```
    - if the current cmd doesn't have an implemented `agg`, run it sequentially with `./test-seq-driver.sh`:

      ```
      Sequential:
      cat file | $CMD > file-seq.txt
      ```
    - the output becomes the new input to the next iteration (the next command in `CMDLIST`)
  - records the script + input that ran, and whether each cmd has an `agg`, to `log.txt`
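The split/run/aggregate loop above can be simulated in a few lines of Python (a sketch only -- the real driver shells out to the command and to the compiled aggregator; here `cmd` and `agg` are stand-in Python functions for a `wc -l`-style stage):

```python
def run_pipeline(commands, lines):
    """Run each (cmd, agg) stage; split the input in two and aggregate
    when an agg exists, otherwise run the stage sequentially."""
    data = lines
    for cmd, agg in commands:
        if agg is not None:
            mid = len(data) // 2                 # SIZE=2: split input in half
            part0, part1 = cmd(data[:mid]), cmd(data[mid:])
            data = agg(part0, part1)             # parallel path + aggregator
        else:
            data = cmd(data)                     # sequential fallback
    return data                                  # becomes input to next stage

# Stand-ins for a `wc -l`-style stage: count lines, sum partial counts.
count = lambda ls: [str(len(ls))]
sum_counts = lambda a, b: [str(int(a[0]) + int(b[0]))]

print(run_pipeline([(count, sum_counts)], ["a", "b", "c", "d", "e"]))  # -> ['5']
```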
- `./find-missing.sh [log.txt]`: outputs the commands that don't have an `agg` implemented; `log.txt` is produced with each run of an entire benchmark suite
- `run-all.sh`: runs all current benchmark suites through one script (check the script for flags)
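A scan like `find-missing.sh` performs could be sketched as follows (the `log.txt` line format shown here is purely a guess for illustration -- consult the actual script for the real format):

```python
def find_missing(log_lines):
    """Collect commands whose log entry marks no implemented aggregator.
    Assumes (hypothetically) lines of the form: '<cmd> <agg|no-agg>'."""
    missing = set()
    for line in log_lines:
        cmd, status = line.split()
        if status == "no-agg":
            missing.add(cmd)
    return sorted(missing)

print(find_missing(["wc agg", "awk no-agg", "sort agg", "awk no-agg"]))
# -> ['awk']
```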
- Linux distributions: Ubuntu, Debian
- BSD utils: macOS
- Commands, when run on a single file input vs. multiple file inputs, often produce different results, as the file name often gets appended to the result.
- Multiple inputs to a command look like `wc hi.txt bye.txt` and would produce output that looks like:

  ```
  559 4281 25733 inputs/hi.txt
  354 2387 14041 inputs/bye.txt
  913 6668 39774 total
  ```
- Directly takes input arguments from the system arguments; for example, enter in your terminal:

  ```
  python m_wc.py [parallel output file 1] [parallel output file 2]
  ```

File To Run | Additional info. needed | Description | Notes |
---|---|---|---|
`m_wc.py` | N/A | `-l`, `-c`, `-w`, `-m` | Discrepancy when combining byte sizes (might be due to manually splitting the file to create parallel input in testing) |
`m_grep.py` | After the parallel output args: `full [path to original file 1] [path to original file 2] <more if needed>` | `grep` results; sorts output based on source file | |
`m_grep_c.py` | N/A | `grep -c`; appends the source file name as a prefix, includes the total count | |
`m_grep_n.py` | Yes | `grep -n`; corrects line numbers according to the file | Still needs to be refactored |

Note: all multiple-argument combiners require a `[file_list]` argument that is a list of all the full files used in the call.
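The multi-file merge that a combiner like `m_wc.py` needs can be sketched as follows (an illustration only, not the repository's code: keep the per-file rows, drop each partial's own `total` row, and recompute a single grand total):

```python
def merge_multi_wc(partials):
    """Merge `wc` outputs over multiple files: keep per-file rows, skip
    per-partial `total` rows, and append one recomputed grand total."""
    rows, total = [], [0, 0, 0]
    for text in partials:
        for line in text.strip().splitlines():
            fields = line.split()
            if fields[-1] == "total":      # discard per-partial totals
                continue
            rows.append(line.strip())
            for i in range(3):             # lines, words, bytes
                total[i] += int(fields[i])
    rows.append("%d %d %d total" % tuple(total))
    return "\n".join(rows)

out = merge_multi_wc([
    "559 4281 25733 inputs/hi.txt",
    "354 2387 14041 inputs/bye.txt",
])
print(out.splitlines()[-1])  # -> 913 6668 39774 total
```

(Real `wc` right-aligns its columns; the padding-length setting mentioned among the util functions above would handle that, and is omitted here.)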
- Testing scripts produce all relevant files, directed to `/outputs`, when given files in `/inputs` on which to produce sequential/parallel results.
- Run `./test-mult.sh` in the `test-old` directory:
  - manually split files (multiple) into 2 -- put them in `/input`
  - apply the command to the entire file for the sequential (expected) output
  - apply the command to each half:

    ```
    apply command to file-1 > output/output-1
    apply command to file-2 > output/output-2
    ```
  - apply aggregators to combine `output/output-1` and `output/output-2` for the parallel outputs (requires the paths of the full files for functions such as line correction in `grep -n`)
  - eye-check that the parallel outputs = sequential output

NOTE: use `m_combine` from the `[cmd].py` file as the aggregator.
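The line correction mentioned for `grep -n` can be sketched as follows (an illustration, not `m_grep_n.py` itself; it assumes each partial is the `grep -n` output for one consecutive split, and that the full files are available to count each split's lines):

```python
def merge_grep_n(partials, split_line_counts):
    """Merge `grep -n` outputs from consecutive splits, shifting line numbers
    in each partial by the total line count of the preceding splits."""
    merged, offset = [], 0
    for matches, n_lines in zip(partials, split_line_counts):
        for line in matches:
            num, text = line.split(":", 1)
            merged.append("%d:%s" % (int(num) + offset, text))
        offset += n_lines
    return merged

# First split has 100 lines; a match at line 3 of the second split
# is really line 103 of the original file.
print(merge_grep_n([["7:foo"], ["3:foo"]], [100, 80]))  # -> ['7:foo', '103:foo']
```

This is why the full-file paths are required: the line numbers in each partial are relative to its split, so the aggregator needs the preceding splits' sizes to restore absolute numbering.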