Add documentation for software utilities
rhdolin committed Sep 21, 2023
1 parent bf7fb87 commit 2fd2a9c
Showing 3 changed files with 6 additions and 4 deletions.
4 changes: 2 additions & 2 deletions utilities/README.md
@@ -7,8 +7,8 @@ This code converts a chromosome-level variant, as derived from a VCF, into a con
## bed2json
Converts a BED file into a format suitable for loading into MongoDB. Chromosome numbering must include 'chr': 'chr1', 'chrX', 'chrY', 'chrM'. BED file must be sorted by chromosome, by position (bedtools sort default).

-## vcf2json
-Uses [vcf2fhir](https://github.com/elimuinformatics/vcf2fhir) logic to translate VCF records into a format suitable for loading into MongoDB.
+## run_vcf2json
+Batch process that calls vcf2json for a set of VCF files, yielding three output files ('variantsData.json', 'phaseData.json', 'molecularConsequences.json') for loading into respective MongoDB collections. Does not update Patients or Tests collections. VCFs to be processed are listed in vcfData.csv, which must include columns _vcf_filename_, _ref_build_ (populated with 'GRCh37' or 'GRCh38'), _patient_id_, _test_date_ (yyyy-mm-dd), _test_id_, _specimen_id_, _genomic_source_class_ (populate with 'germline', 'somatic', or 'mixed'), _ratio_ad_dp_ (used for mitochondrial DNA processing, generally set it to 0.99), _sample_position_ (zero-based, useful for multi-sample VCFs). vcf2json translation logic is based on [vcf2fhir](https://github.com/elimuinformatics/vcf2fhir).

## vcfPrepper
Implements the molecular consequence pipeline described on the [Getting Started](https://github.com/FHIR/genomics-operations/wiki/2.-Getting-Started#molecular-consequences) page.
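The bed2json input requirements in the README hunk above (a 'chr' prefix on every chromosome name, rows sorted by chromosome and then by position) can be checked before conversion. The following is a minimal sketch, not part of this commit: `check_bed` is a hypothetical helper, and it assumes chromosome names compare in plain lexicographic order and that each chromosome forms one contiguous block.

```python
def check_bed(lines):
    """Return a list of problems found in BED lines (an empty list means OK)."""
    problems = []
    prev_chrom, prev_start = None, -1
    seen = set()  # chromosomes whose block has already started
    for n, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            problems.append(f"line {n}: fewer than 3 columns")
            continue
        chrom, start = fields[0], int(fields[1])
        if not chrom.startswith("chr"):
            problems.append(f"line {n}: chromosome '{chrom}' lacks 'chr' prefix")
        if chrom != prev_chrom:
            if chrom in seen:
                problems.append(f"line {n}: '{chrom}' rows are not contiguous")
            elif prev_chrom is not None and chrom < prev_chrom:
                # Assumption: lexicographic chromosome order.
                problems.append(f"line {n}: '{chrom}' sorts before '{prev_chrom}'")
            seen.add(chrom)
            prev_chrom, prev_start = chrom, start
        else:
            if start < prev_start:
                problems.append(f"line {n}: start {start} is out of order")
            prev_start = start
    return problems
```

Calling `check_bed(open("regions.bed"))` on a hypothetical file returns the problems as strings, so a caller can decide whether to abort or only warn.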
4 changes: 2 additions & 2 deletions utilities/run.py → utilities/run_vcf2json.py
@@ -13,7 +13,7 @@ def extractData(csv_file_path, variants_data, molecular_output, phase_output):
     common.query_genes(transcript_map)

     df = pd.read_csv(csv_file_path)
-    for _, row in tqdm(df.iterrows(), total=len(df), desc="Processing Data"):
+    for _, row in tqdm(df.iterrows(), total=len(df)-1, desc="Processing Data"):
         phased_rec_map = {}
         vcf2json.vcf2json(row['vcf_filename'],
                           row['ref_build'],
@@ -57,4 +57,4 @@ def run_vcf2json():
     print("Data Generated.")


-run_vcf2json()
+run_vcf2json()
2 changes: 2 additions & 0 deletions utilities/vcfData.csv
@@ -0,0 +1,2 @@
+vcf_filename,ref_build,patient_id,test_date,test_id,specimen_id,genomic_source_class,ratio_ad_dp,sample_position
+TCGA_DD_A1EH.vcf,GRCh37,TCGA-DD-A1EH,2022-01-20,TCGA-DD-A1EH-T1,TCGA-DD-A1EH-Sp1,somatic,0.99,0
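
A manifest like the one above can be checked against the column rules stated in the README before a batch run. This is a rough sketch using only the standard library; `validate_manifest` is a hypothetical helper, not part of this commit, and the checks assume the README's description is complete (required columns, 'GRCh37'/'GRCh38', yyyy-mm-dd dates, the three source classes, a numeric ratio, a zero-based integer sample position).

```python
import csv
import io
from datetime import datetime

REQUIRED_COLUMNS = [
    "vcf_filename", "ref_build", "patient_id", "test_date", "test_id",
    "specimen_id", "genomic_source_class", "ratio_ad_dp", "sample_position",
]
REF_BUILDS = {"GRCh37", "GRCh38"}
SOURCE_CLASSES = {"germline", "somatic", "mixed"}


def validate_manifest(handle):
    """Return a list of problems found in a vcfData.csv-style file."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    for n, row in enumerate(reader, start=2):  # the header is line 1
        if row["ref_build"] not in REF_BUILDS:
            problems.append(f"line {n}: ref_build must be GRCh37 or GRCh38")
        if row["genomic_source_class"] not in SOURCE_CLASSES:
            problems.append(f"line {n}: unknown genomic_source_class")
        try:
            datetime.strptime(row["test_date"], "%Y-%m-%d")
        except (ValueError, TypeError):
            problems.append(f"line {n}: test_date is not yyyy-mm-dd")
        try:
            float(row["ratio_ad_dp"])
        except (ValueError, TypeError):
            problems.append(f"line {n}: ratio_ad_dp is not a number")
        if not (row["sample_position"] or "").isdigit():
            problems.append(f"line {n}: sample_position is not a zero-based integer")
    return problems


sample = io.StringIO(
    "vcf_filename,ref_build,patient_id,test_date,test_id,specimen_id,"
    "genomic_source_class,ratio_ad_dp,sample_position\n"
    "TCGA_DD_A1EH.vcf,GRCh37,TCGA-DD-A1EH,2022-01-20,TCGA-DD-A1EH-T1,"
    "TCGA-DD-A1EH-Sp1,somatic,0.99,0\n"
)
print(validate_manifest(sample))  # prints []
```

Returning a list of messages rather than raising on the first error lets every bad row in a large manifest be reported in one pass.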
