- Ensure the `benchmarks` directory is the current working directory. If not, `cd` into it from the repository root:
cd benchmarks
- Main run script:
./simple_infra/infra_run.py
- Run
./simple_infra/infra_run.py -h
for flags and options.
- The scripts described below test and measure aggregator performance.
- An explanation of the files generated by a run is given further below.
First, download all inputs.
./run-all.sh --inputs # Download all input files.
The scripts accept the following configuration flags:
--small : use small input
--inf : input inflation between stages
--all : use both lean and python aggregators
--lean : use lean aggregators (default is python aggregators)
For example, to run all benchmarks with both python and lean aggregators, with input inflation, on the smaller input:
./run-all.sh --small --all --inf
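The `--inf` flag turns on input inflation between stages. As an illustrative assumption of what inflation might mean (the actual logic lives in the infrastructure scripts and may differ), the sketch below grows a stage's intermediate output by repeating its lines a fixed number of times:

```python
# Hypothetical sketch of input inflation: repeat an intermediate
# file's lines to multiply its size before the next stage runs.
# The real implementation in this repository may differ.
def inflate_lines(lines, factor=3):
    """Return the input lines repeated `factor` times."""
    return lines * factor

stage_output = ["apple\n", "banana\n"]
inflated = inflate_lines(stage_output, factor=3)
assert len(inflated) == 3 * len(stage_output)
```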
Clean up all intermediate files:
./run-all.sh --clean
Below, we show how to run only the oneliners benchmark. First, cd into the benchmark set and download its input:
cd oneliners
./inputs.sh # Download input files.
Example configuration for running the suite:
./run.sh --small # Run with default python on 1M input without input inflation.
./cleanup.sh # Remove all intermediate files.
Other configurations are listed below. Be sure to save your results and run the cleanup script before trying a new configuration.
./run.sh --small --all # Run with both lean and python aggregators on 1M input without input inflation.
./run.sh --small --all --inf # Run with both lean and python aggregators on 1M input with input inflation.
Check for and report any incorrect aggregators:
./run.sh --check
Running from a single directory keeps all intermediate files organized. Here, we create a run directory and run from it.
# Use python aggregator without input inflation.
mkdir run
cd run
../simple_infra/infra_run.py -n 2 -i ../oneliners/inputs/1M.txt -s ../oneliners/scripts/sort.sh -id 1 -agg python -o out.txt
# Use lean aggregator with input inflation.
mkdir run
cd run
../simple_infra/infra_run.py -n 2 -i ../oneliners/inputs/1M.txt -s ../oneliners/scripts/sort.sh -inflate -id 1 -agg lean -o out.txt
# Use a user-specified aggregator script without input inflation.
mkdir run
cd run
../simple_infra/infra_run.py -n 2 -i ../oneliners/inputs/1M.txt -s ../oneliners/scripts/sort.sh -id 1 -agg ../../py-2/s_sort.py -o out.txt
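For reference, a custom aggregator passed via `-agg` can be very small. The sketch below is hypothetical (it is not the actual `py-2/s_sort.py`): it merges already-sorted partial outputs, which is the correct aggregation step for a `sort` stage.

```python
import heapq

def aggregate_sorted(partials):
    """Merge already-sorted lists of lines into one sorted list.

    Each element of `partials` is the sorted output of one parallel
    instance of the command; merging preserves global sorted order.
    """
    return list(heapq.merge(*partials))

# Two sorted partial outputs from parallel `sort` instances:
left = ["ant\n", "cat\n"]
right = ["bee\n", "dog\n"]
merged = aggregate_sorted([left, right])
```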
- infra_metrics.csv: CSV file with main metric results; Header is as follows: script,input,input size,adj input size,cmd,agg,agg time,agg correct,cmd seq time
- infra_debug.log: more detailed execution log
- inputs-s-[ID]: `org` holds the split input files; `cmd` holds the files after applying the current command instance (the parallel partials)
- outputs-temp: agg-[ID] holds the parallel output files per command instance; seq-check-[ID] holds the sequential output files per command instance (used to check aggregator correctness)
- <output.txt>: the output file produced after running the entire script with this infrastructure (provided as the last argument to ../infra_run.py)
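Given the infra_metrics.csv header listed above, the metrics can be loaded with Python's standard csv module. A minimal sketch (the column names come from the header above; the sample row itself is made up for illustration):

```python
import csv
import io

HEADER = "script,input,input size,adj input size,cmd,agg,agg time,agg correct,cmd seq time"

# A made-up example row, only to show the parsing shape.
sample = HEADER + "\nsort.sh,1M.txt,1048576,1048576,sort,python,0.42,True,1.10\n"

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    print(row["script"], row["agg"], row["agg correct"])
```

In a real run, replace `io.StringIO(sample)` with `open("infra_metrics.csv")`.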
Pipeline for Generating Benchmark Inputs
- Given the total bytes desired and the minimum and maximum bytes per line, generate a file with random words
- Given the total bytes, total lines, percentage of non-distinct line lengths, and percentage of non-distinct words, generate a file with random words that best adheres to the given percentages (depending on randomization and the total lines given, there may be more repeats than the desired percentage)
- Given the total bytes, a regex or word, and the percentage of bytes allocated to words matching the pattern, generate a file with random words
- Currently, probability and size settings are changed in the main function of the corresponding Python file
- Example Run:
python3 generation_with_regex.py
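A minimal sketch of the first generator described above (total bytes, min/max bytes per line); the function name and parameters here are illustrative, and the repository's actual generation scripts may differ:

```python
import random
import string

def generate_random_words(total_bytes, min_line, max_line, seed=0):
    """Generate text of at least `total_bytes` bytes, where each line
    is a random lowercase word of min_line..max_line bytes, plus a
    trailing newline."""
    rng = random.Random(seed)
    out = []
    size = 0
    while size < total_bytes:
        n = rng.randint(min_line, max_line)
        word = "".join(rng.choice(string.ascii_lowercase) for _ in range(n))
        out.append(word + "\n")
        size += n + 1  # word bytes plus the newline
    return "".join(out)

text = generate_random_words(100, 3, 8)
```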