Automated processes for Gen3 Data Dictionary, data simulation and submission automation

This repository contains all tools used and/or developed by Australian BioCommons to help automate routine tasks with gen3.

`gen3schemadev` is an object relational mapper library for gen3 schemas.

The library can be used in order to convert a spreadsheet format into a complete set of the yaml files required to build a Gen3 Data Dictionary.

An implementation of the library for generating the CAD dictionary is demonstrated by sheet2yaml.py.

Current Workflow for editing CAD Project Dictionary

Local deployment (compose-services) hard-coded google sheet

Make schema edits to the Harmonised Variables - v1 google sheet
All schema objects in the google sheet need to have a template schema in the schema folder (to be phased out when all info can be generated)
Run schema2yaml.py which automatically reads the google sheets and parses the required information, writing the parsed schemas to the folder schema_out
Copy schema_out/*.yaml to path/to/umccr-dictionary/dictionary/cad/gdcdictionary/schema, compile, test, validate. If test or validate fails, go back to 1 above.
Simulate data with the new schema, make simulate dd=cad, adjust the number of samples and name of project as required.
1. Replace random simulated values with plausible ones using plausible-data-gen script
Switch the old gen3 dictionary with the new one
1. upload the compiled json schema to the configured s3 bucket@DICTIONARY_URL
2. ?? delete psql volume (?) (only in development phase so doesn't matter if data is lost)
3. disable indexing services (kibana, guppy, tube)
4. restart services, re-configure auth
5. upload the simulated data against the new dictionary
6. re-enable and restart indexing services and re-run etl index (guppy_setup.sh)

sheet2yaml-CLI.py

sheet2yaml-CLI.py is a similar script where inputs are specified as command line arguments rather than hard-coded into the script.

To use this script, one needs to provide identifiers for the google sheet as well as to each tab of the google sheet that needs to be read.

Each google sheet must follow the expected format as specified in the template sheet.

run python sheet2yaml-CLI.py -h to see required arguments.

Plausible Data Generator

A fairly simple python script that takes as input a path to a set of json files and a csv file describing plausible values and replaces the random numbers generated out of gen3 software with ones from a defined distribution or range.

Usage

Clone this repo

Example usage:

cd gen3schemadev
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python3 plausible_data_gen.py --path <PATH_TO_SIM_DATA> [--values <PATH_TO_CSV> | --gurl <PATH_TO_GOOGLE_SHEET>] --generate-files --file-types aligned_reads

The code snippet above would generate plausible data

The program will write the modified json files to a directory called edited_jsons.

If the --dummy_sequencing_files and/or --dummy_lipid_files flags are specified, files will be placed into a directory called dummy_files

Format of CSV

The CSV needs to have the following columns (see the plausible values tab for an example.)

Column	Definition	Allowed values
object	The name of the schema node, i.e. .json
property	The name of the property within that schema/object
data_type	The type of data that needs to be generated. (enums and strings not currently supported)	range; mean; number; median; integer
schema_type	The type of data in the schema (enums and strings not currently supported)	datetime; integer; number; string; enum
mean	Required if 'data_type' is 'mean', generates a random number from normal distribution centred on this number	number
sd	Required if 'data_type' is 'mean', generates a random number from normal distribution with this as sd	number
median	Required if 'data_type' is 'median', generates a random number from normal distribution centred on this number	number
first_quart	Required if 'data_type' is 'median', generates a random number from normal distribution using IQR to estimate sd	number
third_quart	Required if 'data_type' is 'median', generates a random number from normal distribution using IQR to estimate sd	number
proportion	[NOT CURRENTLY USED] TODO: use this to select an appropriate proportion from enums	0<x<1
range_start	Required if 'data_type' is 'range', generates a random number between this number and 'range_end'	number
range_end	Required if 'data_type' is 'range', generates a random number between 'range_end' and this number	number
source	Reference where the information was found	free text
enum	[NOT CURRENTLY USED]

Name		Name	Last commit message	Last commit date
Latest commit History 252 Commits
configs		configs
docs		docs
file_type_templates		file_type_templates
gen3schemadev		gen3schemadev
jupyter		jupyter
schema		schema
scripts		scripts
users/__pycache__		users/__pycache__
.gitignore		.gitignore
LICENSE		LICENSE
Readme.md		Readme.md
datas_submittor.py		datas_submittor.py
delete_files.py		delete_files.py
delete_projects.py		delete_projects.py
fix_links.py		fix_links.py
metadata_validator.ipynb		metadata_validator.ipynb
plausible_data_gen.py		plausible_data_gen.py
populate_metadata_service.py		populate_metadata_service.py
requirements.txt		requirements.txt
setup.py		setup.py
sheet2yaml-CLI.py		sheet2yaml-CLI.py
transform_ausdiab.py		transform_ausdiab.py
users.yaml		users.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Automated processes for Gen3 Data Dictionary, data simulation and submission automation

`gen3schemadev` is an object relational mapper library for gen3 schemas.

Current Workflow for editing CAD Project Dictionary

Local deployment (compose-services) hard-coded google sheet

sheet2yaml-CLI.py

Plausible Data Generator

Usage

Format of CSV

Metadata Validator

Gen3 Data Submitter

About

Releases

Packages

Contributors 3

Languages

License

AustralianBioCommons/gen3schemadev

Folders and files

Latest commit

History

Repository files navigation

Automated processes for Gen3 Data Dictionary, data simulation and submission automation

gen3schemadev is an object relational mapper library for gen3 schemas.

Current Workflow for editing CAD Project Dictionary

Local deployment (compose-services) hard-coded google sheet

sheet2yaml-CLI.py

Plausible Data Generator

Usage

Format of CSV

Metadata Validator

Gen3 Data Submitter

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

`gen3schemadev` is an object relational mapper library for gen3 schemas.

Packages