Home
This repository contains the scripts used to benchmark ExeTera, Pandas, Dask, and PostgreSQL relative to each other. It contains scripts for generating artificial data and for running benchmarks on both artificial and real data.
These scripts are used to import data from CSV to HDF5. The data used in the official benchmarks is the Covid Symptom Study dataset, a copy of which (with the exception of a few fields containing identifiable information) can be obtained through the Health Data Gateway by searching for 'Covid Symptom Study'.
Importing is benchmarked on the patients, assessments, and tests tables.
ExeTera import is performed using ExeTera's import command, details of which can be found on the ExeTera wiki.
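For reference, an ExeTera import invocation typically looks something like the one below; the schema file and CSV/HDF5 paths are placeholders, and the ExeTera wiki remains the authoritative reference for the available options:
exetera import -s covid_schema.json -i "patients:patients.csv, assessments:assessments.csv, tests:tests.csv" -o benchmark_dataset.hdf5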
Pandas import can be carried out using the import_patients_pandas.py script. Despite its name, it is used to import all three of the tables described above, as follows:
python import_patients_pandas.py <csv_file_name> <hdf5_file_name>
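The conversion this performs is essentially CSV-to-HDF5 via pandas. The following is a minimal sketch of that pattern, not the actual script; the key name is arbitrary and any explicit dtype handling the real import performs is omitted:
# illustrative pandas csv-to-hdf5 conversion (not the actual benchmark script)
import sys
import pandas as pd

csv_file, hdf5_file = sys.argv[1], sys.argv[2]

# read the whole csv into memory; dtypes are inferred here
df = pd.read_csv(csv_file)

# write the table to hdf5 (requires PyTables); 'table' is an arbitrary key
df.to_hdf(hdf5_file, key='table', mode='w')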
Dask import is carried out using the import_patients_dask.py script. It is used as follows:
python import_patients_dask.py <csv_file_name> <hdf5_file_name>
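As a rough sketch, a Dask CSV-to-HDF5 conversion follows the same shape but reads lazily in partitions and writes one HDF5 node per partition; this is illustrative only and the key pattern is an assumption:
# illustrative dask csv-to-hdf5 conversion (not the actual benchmark script)
import sys
import dask.dataframe as dd

csv_file, hdf5_file = sys.argv[1], sys.argv[2]

# read the csv lazily, in partitions
ddf = dd.read_csv(csv_file)

# write each partition to its own node in the hdf5 file
ddf.to_hdf(hdf5_file, key='/data-*')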
The script generate_sql_import_scripts.py is used to generate SQL files that can be run via psql. This script is run as follows:
python generate_sql_import_scripts.py <schema_filename> <table_name> <import_filename>
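The generated files boil down to a table definition plus a \copy statement that psql can run client-side. A hypothetical sketch of such a generator, assuming the schema file already contains a CREATE TABLE statement and with an output file name chosen purely for illustration, is:
# hypothetical sketch of generating a psql import script; the real generator's output may differ
import sys

schema_filename, table_name, import_filename = sys.argv[1], sys.argv[2], sys.argv[3]

# assumption: the schema file holds a ready-made CREATE TABLE statement
with open(schema_filename) as f:
    create_stmt = f.read()

# write a .sql file that psql can execute with \i or -f
with open(f'import_{table_name}.sql', 'w') as out:
    out.write(create_stmt.rstrip() + '\n')
    # \copy runs client-side, so the csv only needs to be readable by the psql client
    out.write(f"\\copy {table_name} from '{import_filename}' with (format csv, header);\n")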
The reading scripts are used to measure the performance of reading specific columns from the imported Covid Symptom Study datasets.
ExeTera reading is measured using read_patients_exetera.py. It is called as follows:
python read_patients_exetera.py <imported hdf5 file> <column_count>
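The pattern being timed is opening the dataset with an ExeTera Session and materialising fields into memory. The sketch below assumes the ExeTera Session API and a dataframe named patients; the field names are examples only, not necessarily what the real script reads:
# rough sketch of the exetera read pattern (illustrative; field names are examples)
import sys
import time
from exetera.core.session import Session

hdf5_file, column_count = sys.argv[1], int(sys.argv[2])

# example fields from the patients table; the real script selects column_count fields
fields = ['year_of_birth', 'height_cm', 'weight_kg'][:column_count]

with Session() as s:
    ds = s.open_dataset(hdf5_file, 'r', 'ds')
    df = ds['patients']
    start = time.perf_counter()
    for name in fields:
        _ = df[name].data[:]  # materialise the field into memory
    print(time.perf_counter() - start)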
Pandas reading is measured using read_patients_pandas.py. It is called as follows:
python read_patients_pandas.py <imported hdf5 file> <column_count>
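The pandas equivalent is a read plus column selection from the imported HDF5 file; a minimal sketch of that pattern (assuming the file holds a single key, and not necessarily how the real script is written) is:
# rough sketch of the pandas read pattern (illustrative)
import sys
import time
import pandas as pd

hdf5_file, column_count = sys.argv[1], int(sys.argv[2])

start = time.perf_counter()
df = pd.read_hdf(hdf5_file)               # assumes a single key in the file
subset = df[df.columns[:column_count]]    # take the first column_count columns
print(time.perf_counter() - start)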
Note, no Dask read function was written in the end, as Dask was unable to import either the patients or assessments tables.
PostgreSQL reading is performed by running a standard SELECT statement in psql with \timing on.
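For example (using the patients_filt table and columns that appear in the join statements below):
\timing on
select year_of_birth, height_cm, weight_kg from patients_filt;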
ExeTera joins are carried out by the following scripts:
- exetera_join_p_a.py: left join of assessments on the left and patients on the right
- exetera_join_a_p.py: left join of patients on the left and assessments on the right
- exetera_join_p_t.py: left join of tests on the left and patients on the right
- exetera_join_t_p.py: left join of patients on the left and tests on the right
These scripts don't take parameters and must be called separately.
Pandas joins are carried out by the following scripts:
- execute_hdf_pandas_p_to_a_join_scenario.py: left join of assessments on the left and patients on the right
- execute_hdf_pandas_a_to_p_join_scenario.py: left join of patients on the left and assessments on the right
- execute_hdf_pandas_p_to_t_join_scenario.py: left join of tests on the left and patients on the right
- execute_hdf_pandas_t_to_p_join_scenario.py: left join of patients on the left and tests on the right
Note, the way these scripts are currently organised, they are
These scripts take two arguments as parameters and are called as follows:
# substitute in the appropriate script name
python execute_hdf_pandas_*_to_*_join_scenario.py <left hdf5 file> <right hdf5 file>
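The shape of these joins mirrors the SQL statements further down. As an illustration only (placeholder file names, column names taken from those SQL statements), the p_to_a scenario in pandas looks roughly like:
# illustrative pandas left join for the p_to_a scenario (not the actual script)
import pandas as pd

patients = pd.read_hdf('patients.hdf5')        # placeholder file names
assessments = pd.read_hdf('assessments.hdf5')

# assessments on the left, patients on the right
p_to_a = assessments[['patient_id']].merge(
    patients[['id', 'year_of_birth', 'height_cm', 'weight_kg']],
    left_on='patient_id', right_on='id', how='left')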
There are no Dask join scripts, as Dask was unable to import the data to HDF5.
PostgreSQL joins are performed in psql with \timing on, as follows:
create table p_to_a as select patients_filt.year_of_birth, patients_filt.height_cm, patients_filt.weight_kg, assessments.patient_id from assessments left join patients_filt on patients_filt.id = assessments.patient_id;
create table a_to_p as select assessments.health_status, assessments.tested_covid_positive, assessments.temperature, patients_filt.id from patients_filt left join assessments on assessments.patient_id = patients_filt.id;
create table p_to_t as select patients_filt.year_of_birth, patients_filt.height_cm, patients_filt.weight_kg, tests.patient_id from tests left join patients_filt on patients_filt.id = tests.patient_id;
create table t_to_p as select tests.country_code, tests.date_taken_specific, tests.result, patients_filt.id from patients_filt left join tests on tests.patient_id = patients_filt.id;
Artificial joins test performance and scaling by performing the same join with different numbers of rows. For these scenarios, the left table contains a primary key that is a foreign key in the right table. As the tests are run, the right table is always given 10x the number of rows of the left table, and each primary key occurs a mean of 10 times in the foreign key column.
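As a generic illustration of these constraints (not any of the creation scripts below; column names follow the SQL join statement at the end of this page), such a table pair could be built like this:
# generic sketch of building the artificial left/right tables (illustrative only)
import numpy as np
import pandas as pd

left_rows = 100_000
right_rows = left_rows * 10   # the right table always has 10x the rows

left = pd.DataFrame({'pk': np.arange(left_rows),
                     'l0': np.random.rand(left_rows),
                     'l1': np.random.rand(left_rows)})

# drawing fk uniformly from the primary keys gives a mean of 10 instances per key
right = pd.DataFrame({'fk': np.random.randint(0, left_rows, size=right_rows),
                      'r0': np.random.rand(right_rows),
                      'r1': np.random.rand(right_rows)})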
ExeTera tables are created as follows:
python create_exetera_join_scenario <left_row_count> <right_row_count>
This creates a file of the form ds_x_y.hdf5, where x and y are the left and right row counts.
Tables are joined as follows:
python execute_exetera_join_scenario.py <file_name>
Pandas tables are created as follows:
python create_pandas_join_scenario <left_row_count> <right_row_count>
This creates two files, l_df_x.hdf5 and r_df_y.hdf5, where x and y are the left and right row counts.
Tables are joined as follows:
python execute_pandas_join_scenario.py <left_file_name> <right_file_name>
Dask tables are created as follows:
python create_dask_join_scenario <left_row_count> <right_row_count> <partition_size>
Partition size is in rows. The row count is divided by the partition size to determine how many partitions a table has.
This creates two files, d_l_df_x.hdf5 and d_r_df_y.hdf5, where x and y are the left and right row counts.
Tables are joined as follows:
python execute_dask_join_scenario.py <left_file_name> <right_file_name>
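The Dask scenario performs the same join, but lazily and per partition; the sketch below is illustrative only and the HDF5 key pattern is an assumption:
# rough sketch of the dask join scenario (illustrative; the hdf5 key is an assumption)
import sys
import time
import dask.dataframe as dd

l_df = dd.read_hdf(sys.argv[1], key='/data-*')
r_df = dd.read_hdf(sys.argv[2], key='/data-*')

start = time.perf_counter()
# the merge is lazy; compute() forces the join to actually run
joined = r_df.merge(l_df, left_on='fk', right_on='pk', how='left').compute()
print(time.perf_counter() - start)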
PostgreSQL tables are created as follows:
python create_sql_join_scenario <left_row_count> <right_row_count>
This uses psycopg2 to create and populate the tables.
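A rough sketch of that pattern with psycopg2 (illustrative; the connection string, table name, and toy rows are placeholders) is:
# illustrative psycopg2 table creation and population (not the actual script)
import psycopg2

conn = psycopg2.connect("dbname=benchmarks")   # placeholder connection string
cur = conn.cursor()

# column names follow the join statement below
cur.execute("create table l_table (pk bigint, l0 double precision, l1 double precision);")
cur.executemany("insert into l_table values (%s, %s, %s);",
                [(i, 0.1, 0.2) for i in range(1000)])   # toy rows for illustration

conn.commit()
cur.close()
conn.close()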
Tables are joined through psql with \timing on, as follows:
create table l_to_r as select l.pk, l.l0, l.l1, r.fk, r.r0, r.r1 from <y> as r left join <x> as l on l.pk = r.fk;
Here x and y are the left and right table names, respectively.