Note: this repository is undergoing active development. Check back for updates.
We recommend using our Docker image to run this tool.
docker pull us-docker.pkg.dev/general-theiagen/theiagen/theiavalidate:0.1.0
usage: python3 theiavalidate.py table1 table2 [options]
This tool compares two tab-delimited files and outputs a report of the differences between the two files.
positional arguments:
table1 the first table to compare
table2 the second table to compare
optional arguments:
-h, --help
show this help message and exit
-v, --version
show program's version number and exit
-c, --columns_to_compare
a comma-separated list of columns to compare
required for a successful run
-m, --validation_criteria
a tab-delimited file containing the validation criteria to check
-l, --column_translation
a tab-delimited file that links column names between the two tables
-o, --output_prefix
the output file name prefix
do not include any spaces
-n, --na_values
the values that should be considered NA
default values = ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', '', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', 'None']
--verbose
increase stdout verbosity
--debug
increase stdout verbosity to debug; overwrites --verbose
See also the examples folder for example inputs.
These are the two TSV files that will be examined. The order of the tables does not matter.
CAUTION: Both tables must contain exactly the same samples, identified by matching sample names (the values in the first column). If either table contains a sample that the other lacks, the script will fail.
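Because mismatched samples cause a failure, it can help to check the two tables before running the tool. The following is a minimal sketch using only the standard library; the function names and the example data are hypothetical and are not part of theiavalidate.

```python
import csv
import io


def first_column_values(tsv_text):
    """Return the first-column values (sample names) of a TSV, skipping the header."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader, None)  # skip the header row
    return [row[0] for row in reader if row]


def sample_mismatches(table1_text, table2_text):
    """Return (only_in_table1, only_in_table2) as sorted lists of sample names."""
    samples1 = set(first_column_values(table1_text))
    samples2 = set(first_column_values(table2_text))
    return sorted(samples1 - samples2), sorted(samples2 - samples1)


# Hypothetical example data, not taken from the examples folder.
table1 = "sample\tassembly_length\nsampleA\t100\nsampleB\t200\n"
table2 = "sample\tassembly_length\nsampleA\t100\nsampleC\t300\n"

only1, only2 = sample_mismatches(table1, table2)
print("only in table1:", only1)
print("only in table2:", only2)
```

If both lists are empty, the tables contain the same sample names.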
The columns_to_compare variable determines which columns will be examined. This is a comma-separated list, such as "assembly_length,est_coverage,gambit_predicted_taxon". The order of the columns does not matter. Columns that are not listed are ignored.
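As an illustration of how such a list restricts a table, here is a minimal sketch. The helper names are hypothetical, and stripping whitespace around names is an assumption of this sketch, not documented behavior of the tool.

```python
def parse_columns(columns_to_compare):
    """Split a comma-separated string into a list of column names.

    Assumption: surrounding whitespace and empty entries are dropped.
    """
    return [name.strip() for name in columns_to_compare.split(",") if name.strip()]


def select_columns(rows, columns):
    """Keep only the requested columns from each row (a dict keyed by column name)."""
    return [{column: row[column] for column in columns} for row in rows]


columns = parse_columns("assembly_length,est_coverage,gambit_predicted_taxon")

# Hypothetical row; "notes" is not listed, so it is ignored.
rows = [
    {
        "assembly_length": "5000000",
        "est_coverage": "80.5",
        "gambit_predicted_taxon": "Escherichia coli",
        "notes": "not compared",
    }
]
selected = select_columns(rows, columns)
print(selected)
```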
An example validation_criteria.tsv file is shown below. The first column is the column name in the two tables. The second column is the validation criteria to use for that column. This file expects a header and is tab-delimited.
CAUTION: Any column named in this file must also be listed in columns_to_compare for its validation criteria to be applied.
column_name validation_criteria
column1 EXACT
column2 SET
column3 0.01
Currently implemented validation criteria include:
validation_criteria | explanation |
---|---|
EXACT | The values in the two columns must be exactly the same; in this case [foo,bar] != [bar,foo] . When applied to columns referencing files, file contents will be compared to check if they are identical. |
SET | The values in the two columns must be the same set of values; in this case [foo,bar] == [bar,foo] . When applied to columns referencing files, the lines within the files will be sorted alphabetically before comparing. |
<FLOAT> | The values in the two columns must be within <FLOAT>*100 percent of each other; e.g., 0.3 -> 30% difference allowed. |
IGNORE | The values in the two columns are assumed to match; in this case foo == bar . |
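The criteria in the table can be sketched for single cell values as follows. This is an illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, SET is assumed to split values on commas, and the percent difference is assumed to be measured relative to the larger of the two magnitudes. File-based comparisons are not covered.

```python
def split_items(value):
    """Split a comma-separated cell into a list of trimmed items (assumed format)."""
    return [item.strip() for item in value.split(",")]


def values_match(value1, value2, criterion):
    """Return True if two cell values satisfy the given validation criterion."""
    criterion = criterion.strip()
    if criterion == "IGNORE":
        return True  # values are assumed to match
    if criterion == "EXACT":
        return value1 == value2  # order matters: "foo,bar" != "bar,foo"
    if criterion == "SET":
        return set(split_items(value1)) == set(split_items(value2))
    # Anything else is treated as a <FLOAT> tolerance, e.g. 0.3 allows 30%.
    tolerance = float(criterion)
    a, b = float(value1), float(value2)
    if a == b:
        return True
    # Assumption: difference relative to the larger magnitude (nonzero here).
    return abs(a - b) / max(abs(a), abs(b)) <= tolerance


print(values_match("foo,bar", "bar,foo", "EXACT"))
print(values_match("foo,bar", "bar,foo", "SET"))
print(values_match("100", "105", "0.1"))
```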
An example column_translation.tsv file is shown below. The first column is the column name in one table, and the second column is the corresponding column name in the other table. All columns with the name in the first column will be renamed to match the corresponding column name in the second column. This file has no header and is tab-delimited.
column_name1_table1 column_name1_table2
column_name2_table1 column_name2_table2
original_column_name new_column_name
For example, if table1 has a column named column_name1_table1, it will be renamed to column_name1_table2 in all outputs and comparisons.
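The renaming described above can be sketched with the standard library. The function names below are hypothetical; the sketch only assumes what the text states, namely a headerless, tab-delimited file mapping the first column name to the second.

```python
import csv
import io


def read_translation(tsv_text):
    """Parse a headerless two-column TSV into an {old_name: new_name} mapping."""
    mapping = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) >= 2:
            mapping[row[0]] = row[1]
    return mapping


def rename_columns(header, mapping):
    """Rename every column found in the mapping; leave other names unchanged."""
    return [mapping.get(name, name) for name in header]


translation = read_translation(
    "column_name1_table1\tcolumn_name1_table2\n"
    "column_name2_table1\tcolumn_name2_table2\n"
)
renamed = rename_columns(["sample", "column_name1_table1"], translation)
print(renamed)
```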
The output prefix variable is a string that will prefix all output files. Do not include any whitespace. The default is theiavalidate.
The na_values variable is a list of values that should be considered NA by pandas. The default list differs from the default na_values list used by pandas because some outputs are legitimately "NA" and should not be treated as missing data. All and only the values in this list will be replaced with pandas.NA or numpy.nan in the output files and comparisons.
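The effect of the default list can be illustrated without pandas. In this minimal sketch, None stands in for pandas.NA or numpy.nan, and the helper name is hypothetical. Note that the literal string "NA" is not in the default list, so it is kept as data.

```python
# Default NA values as listed in the usage text above.
DEFAULT_NA_VALUES = {
    "-1.#IND", "1.#QNAN", "1.#IND", "-1.#QNAN", "#N/A N/A", "#N/A", "N/A",
    "n/a", "", "#NA", "NULL", "null", "NaN", "-NaN", "nan", "-nan", "None",
}


def to_missing(value, na_values=DEFAULT_NA_VALUES):
    """Return None (a stand-in for a missing value) if value is listed as NA."""
    return None if value in na_values else value


cells = ["NA", "N/A", "", "42"]
converted = [to_missing(cell) for cell in cells]
print(converted)
```

With pandas itself, a comparable effect is obtained by passing keep_default_na=False together with an explicit na_values list to pandas.read_csv.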
These two options increase the verbosity of the logging system to INFO and DEBUG, respectively. DEBUG produces far more output than INFO and may be excessive for non-debugging purposes. If both --debug and --verbose are present, --debug takes precedence. If no verbosity is specified, the logging level is set to ERROR.
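The precedence rule can be expressed with Python's standard logging module. This is a minimal sketch; the function name is hypothetical and the tool's actual argument handling may differ.

```python
import logging


def resolve_log_level(verbose=False, debug=False):
    """Map the --verbose and --debug flags to a logging level."""
    if debug:
        return logging.DEBUG  # --debug takes precedence over --verbose
    if verbose:
        return logging.INFO
    return logging.ERROR  # default when no verbosity flag is given


level = resolve_log_level(verbose=True, debug=True)
logging.basicConfig(level=level)
print(logging.getLevelName(level))
```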
See also the examples folder for example outputs.
Alternatively, you can copy and paste the following command into the Docker image to generate the example outputs.
theiavalidate.py \
theiavalidate/examples/example-table1.tsv \
theiavalidate/examples/example-table2.tsv \
-c "assembly_length,gambit_predicted_taxon,amrfinderplus_amr_core_genes,extra_column" \
-l theiavalidate/examples/example-column_translation.tsv \
-m theiavalidate/examples/example-validation_criteria.tsv \
-o example-output
These files are the original input files restricted to the columns specified in columns_to_compare, with all columns renamed as specified in the column_translation.tsv file. They are provided so that the user can see which columns are being compared and can manually inspect the original data.
This file is a tab-delimited file containing all rows and columns specified in columns_to_compare. The only values in this file are the values that are not exactly the same between the two tables.
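The idea of a table that keeps every row and column but is populated only where values differ can be sketched as follows. The data layout, the function name, and the choice to show differing values as a (table1, table2) pair are assumptions of this sketch; the tool's actual output format may differ.

```python
def exact_differences(table1, table2, columns):
    """Build a table with a (value1, value2) pair where cells differ, else None.

    table1 and table2 map sample name -> {column name -> value}.
    """
    differences = {}
    for sample in table1:
        differences[sample] = {}
        for column in columns:
            value1 = table1[sample][column]
            value2 = table2[sample][column]
            differences[sample][column] = None if value1 == value2 else (value1, value2)
    return differences


# Hypothetical example data.
t1 = {"sampleA": {"assembly_length": "100", "taxon": "E. coli"}}
t2 = {"sampleA": {"assembly_length": "105", "taxon": "E. coli"}}

diff = exact_differences(t1, t2, ["assembly_length", "taxon"])
print(diff)
```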
NOTE: This file is only provided if a validation_criteria.tsv file is provided. This file is a tab-delimited file containing all rows and columns specified in columns_to_compare. The only values in this file are the values that do not meet the validation criteria specified in the validation_criteria.tsv file.
This file (available as an HTML and PDF) is a summary of the differences between the two tables. It contains the following information:
- the date theiavalidate.py was run
- as rows, the columns specified in columns_to_compare
- as columns:
  - the number of rows in table1 that have values
  - the number of rows in table2 that have values
  - the number of differences (exact match)
  - the corresponding validation criteria (if provided)
  - the number of samples failing the validation criteria
If a validation_criteria.tsv file was provided, definitions of the (currently implemented) validation criteria are provided at the bottom of the table.
Shows the differing lines within mismatching files for a given sample and column. Each pair of mismatching files generates a separate file.