This repository contains the tools used or developed by Australian BioCommons to help automate routine tasks with Gen3.
The library converts a spreadsheet into the complete set of YAML files required to build a Gen3 Data Dictionary.
An implementation of the library for generating the CAD dictionary is demonstrated by `sheet2yaml.py`.
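As a rough illustration of the spreadsheet-to-schema conversion described above, the sketch below groups flat spreadsheet rows into one schema dictionary per object. The column names (`object`, `property`, `type`, `description`, `required`) and the function `rows_to_schemas` are hypothetical and are not taken from this library; the real sheet layout is defined by the template sheet.

```python
from collections import defaultdict


def rows_to_schemas(rows):
    """Group flat spreadsheet rows into one schema dict per object.

    Each row is a dict with hypothetical keys: object, property, type,
    description (optional) and required (optional, truthy when required).
    """
    schemas = defaultdict(
        lambda: {"type": "object", "properties": {}, "required": []}
    )
    for row in rows:
        schema = schemas[row["object"]]
        schema["id"] = row["object"]
        prop = {"type": row["type"]}
        if row.get("description"):
            prop["description"] = row["description"]
        schema["properties"][row["property"]] = prop
        if row.get("required"):
            schema["required"].append(row["property"])
    return dict(schemas)


# Example rows, as they might be read from one tab of a sheet (made-up data).
rows = [
    {"object": "subject", "property": "age", "type": "integer",
     "description": "Age in years", "required": True},
    {"object": "subject", "property": "notes", "type": "string"},
]
schemas = rows_to_schemas(rows)
```

Each resulting dictionary could then be written to its own YAML file with a YAML library (for example PyYAML's `yaml.safe_dump`).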
1. Make schema edits to the Harmonised Variables - v1 google sheet.
   - All schema objects in the google sheet need to have a template schema in the `schema` folder (to be phased out when all info can be generated).
2. Run `schema2yaml.py`, which automatically reads the google sheets and parses the required information, writing the parsed schemas to the folder `schema_out`.
3. Copy `schema_out/*.yaml` to `path/to/umccr-dictionary/dictionary/cad/gdcdictionary/schema`, then compile, test, and validate. If test or validate fails, go back to step 1.
4. Simulate data with the new schema using `make simulate dd=cad`, adjusting the number of samples and name of project as required.
   - Replace random simulated values with plausible ones using the `plausible-data-gen` script.
5. Switch the old gen3 dictionary with the new one:
   - Upload the compiled json schema to the configured s3 bucket at `DICTIONARY_URL`.
   - ?? Delete psql volume (?) (only in development phase, so it doesn't matter if data is lost).
   - Disable indexing services (kibana, guppy, tube).
   - Restart services and re-configure auth.
   - Upload the simulated data against the new dictionary.
   - Re-enable and restart indexing services and re-run the etl index (`guppy_setup.sh`).
`sheet2yaml-CLI.py` is a similar script where inputs are specified as command line arguments rather than hard-coded into the script.
To use this script, provide identifiers for the google sheet and for each tab of the google sheet that needs to be read.
Each google sheet must follow the expected format specified in the template sheet.
Run `python sheet2yaml-CLI.py -h` to see the required arguments.
A fairly simple Python script that takes a path to a set of JSON files and a CSV file describing plausible values, and replaces the random numbers generated by the Gen3 software with values drawn from a defined distribution or range.
Clone this repo
Example usage:
cd gen3schemadev
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python3 plausible_data_gen.py --path <PATH_TO_SIM_DATA> [--values <PATH_TO_CSV> | --gurl <PATH_TO_GOOGLE_SHEET>] --generate-files --file-types aligned_reads
The commands above generate plausible data for the simulated JSON files.
The program writes the modified JSON files to a directory called `edited_jsons`.
If the `--dummy_sequencing_files` and/or `--dummy_lipid_files` flags are specified, files are placed into a directory called `dummy_files`.
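The rewrite step can be pictured with the minimal sketch below. It is not the code in `plausible_data_gen.py`: the function `rewrite_records` and the `generators` mapping are hypothetical, and the sketch assumes each simulated file is named after its schema node and holds a JSON list of records.

```python
import json
import tempfile
from pathlib import Path


def rewrite_records(src_dir, out_dir, generators):
    """Copy every <object>.json from src_dir to out_dir, regenerating values.

    generators maps (object, property) to a zero-argument callable that
    returns the replacement value. Assumes each file holds a list of records.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for json_path in sorted(src_dir.glob("*.json")):
        records = json.loads(json_path.read_text())
        for record in records:
            for (obj, prop), make_value in generators.items():
                # Only touch properties that belong to this file's object.
                if obj == json_path.stem and prop in record:
                    record[prop] = make_value()
        out_path = out_dir / json_path.name
        out_path.write_text(json.dumps(records, indent=2))
        written.append(out_path)
    return written


# Small demonstration on a temporary directory with made-up data.
with tempfile.TemporaryDirectory() as tmp:
    sim = Path(tmp) / "sim"
    sim.mkdir()
    (sim / "subject.json").write_text(
        json.dumps([{"submitter_id": "s1", "age": 1234}])
    )
    out = rewrite_records(
        sim, Path(tmp) / "edited_jsons", {("subject", "age"): lambda: 42}
    )
    edited = json.loads(out[0].read_text())
```

Writing to a separate output directory keeps the original simulated files untouched, so the step can be re-run with different plausible values.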
The CSV needs to have the following columns (see the plausible values tab for an example).
Column | Definition | Allowed values |
---|---|---|
object | The name of the schema node, i.e. .json | |
property | The name of the property within that schema/object | |
data_type | The type of data that needs to be generated. (enums and strings not currently supported) | range; mean; number; median; integer |
schema_type | The type of data in the schema (enums and strings not currently supported) | datetime; integer; number; string; enum |
mean | Required if 'data_type' is 'mean', generates a random number from normal distribution centred on this number | number |
sd | Required if 'data_type' is 'mean', generates a random number from normal distribution with this as sd | number |
median | Required if 'data_type' is 'median', generates a random number from normal distribution centred on this number | number |
first_quart | Required if 'data_type' is 'median', generates a random number from normal distribution using IQR to estimate sd | number |
third_quart | Required if 'data_type' is 'median', generates a random number from normal distribution using IQR to estimate sd | number |
proportion | [NOT CURRENTLY USED] TODO: use this to select an appropriate proportion from enums | 0<x<1 |
range_start | Required if 'data_type' is 'range', generates a random number between this number and 'range_end' | number |
range_end | Required if 'data_type' is 'range', generates a random number between 'range_start' and this number | number |
source | Reference where the information was found | free text |
enum | [NOT CURRENTLY USED] | |
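To make the `range`, `mean` and `median` rows of the table concrete, here is a minimal sketch of how one CSV row could be turned into a value. It is an illustration, not the script's actual implementation: the function `plausible_value` is hypothetical, and the conversion `sd ≈ IQR / 1.349` is an assumed way of estimating the standard deviation from the quartiles (it holds for a normal distribution). The table does not define how the other `data_type` values behave, so the sketch rejects them.

```python
import random

# For a normal distribution the IQR spans about 1.349 standard deviations.
IQR_TO_SD = 1.349


def plausible_value(row, rng=random):
    """Draw one value for a CSV row (a dict keyed by the column names above)."""
    data_type = row["data_type"]
    if data_type == "range":
        value = rng.uniform(float(row["range_start"]), float(row["range_end"]))
    elif data_type == "mean":
        value = rng.gauss(float(row["mean"]), float(row["sd"]))
    elif data_type == "median":
        iqr = float(row["third_quart"]) - float(row["first_quart"])
        value = rng.gauss(float(row["median"]), iqr / IQR_TO_SD)
    else:
        # Behaviour for other data_type values is not covered in this sketch.
        raise ValueError(f"unsupported data_type: {data_type!r}")
    # Round to a whole number when the schema expects an integer.
    return round(value) if row.get("schema_type") == "integer" else value


rng = random.Random(0)  # seeded so repeated runs give the same values
age = plausible_value(
    {"data_type": "range", "schema_type": "integer",
     "range_start": "18", "range_end": "90"},
    rng,
)
bmi = plausible_value(
    {"data_type": "median", "schema_type": "number",
     "median": "27", "first_quart": "24", "third_quart": "31"},
    rng,
)
```

Passing a seeded `random.Random` instance makes the generated data reproducible, which helps when comparing uploads across dictionary versions.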