Add documentation for software utilities

FHIR · Sep 21, 2023 · 2fd2a9c
1 parent bf7fb87
commit 2fd2a9c
Showing 3 changed files with 6 additions and 4 deletions.
diff --git a/utilities/README.md b/utilities/README.md
@@ -7,8 +7,8 @@ This code converts a chromosome-level variant, as derived from a VCF, into a con
 ## bed2json
 Converts a BED file into a format suitable for loading into MongoDB. Chromosome numbering must include 'chr': 'chr1', 'chrX', 'chrY', 'chrM'. BED file must be sorted by chromosome, by position (bedtools sort default).
 
-## vcf2json
-Uses  [vcf2fhir](https://github.com/elimuinformatics/vcf2fhir)  logic to translate VCF records into a format suitable for loading into MongoDB.
+## run_vcf2json
+Batch process that calls vcf2json for a set of VCF files, yielding three output files ('variantsData.json', 'phaseData.json', 'molecularConsequences.json') for loading into their respective MongoDB collections. Does not update the Patients or Tests collections. VCFs to be processed are listed in vcfData.csv, which must include the columns _vcf_filename_, _ref_build_ ('GRCh37' or 'GRCh38'), _patient_id_, _test_date_ (yyyy-mm-dd), _test_id_, _specimen_id_, _genomic_source_class_ ('germline', 'somatic', or 'mixed'), _ratio_ad_dp_ (used for mitochondrial DNA processing; generally set to 0.99), and _sample_position_ (zero-based; useful for multi-sample VCFs). The vcf2json translation logic is based on [vcf2fhir](https://github.com/elimuinformatics/vcf2fhir).
 
 ## vcfPrepper
 Implements the molecular consequence pipeline described on the [Getting Started](https://github.com/FHIR/genomics-operations/wiki/2.-Getting-Started#molecular-consequences) page.
diff --git a/utilities/run.py → utilities/run_vcf2json.py b/utilities/run.py → utilities/run_vcf2json.py
@@ -13,7 +13,7 @@ def extractData(csv_file_path, variants_data, molecular_output, phase_output):
         common.query_genes(transcript_map)
 
         df = pd.read_csv(csv_file_path)
-        for _, row in tqdm(df.iterrows(), total=len(df), desc="Processing Data"):
+        for _, row in tqdm(df.iterrows(), total=len(df)-1, desc="Processing Data"):
             phased_rec_map = {}
             vcf2json.vcf2json(row['vcf_filename'],
                               row['ref_build'],
@@ -57,4 +57,4 @@ def run_vcf2json():
     print("Data Generated.")
 
 
-run_vcf2json()
+run_vcf2json()
diff --git a/utilities/vcfData.csv b/utilities/vcfData.csv
@@ -0,0 +1,2 @@
+vcf_filename,ref_build,patient_id,test_date,test_id,specimen_id,genomic_source_class,ratio_ad_dp,sample_position
+TCGA_DD_A1EH.vcf,GRCh37,TCGA-DD-A1EH,2022-01-20,TCGA-DD-A1EH-T1,TCGA-DD-A1EH-Sp1,somatic,0.99,0
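
Note on vcfData.csv: the README change above lists the columns and allowed values that the batch script expects. A minimal sketch of a pre-flight check for that file follows. It is not part of this commit; the function name `validate_vcf_data` and the error wording are hypothetical, and the rules are taken only from the README text (column names, 'GRCh37'/'GRCh38', yyyy-mm-dd dates, the three genomic source classes, a numeric _ratio_ad_dp_, a zero-based _sample_position_).

```python
import csv
import io
from datetime import datetime

# Column names and allowed values as listed in the README text of this commit.
REQUIRED_COLUMNS = [
    "vcf_filename", "ref_build", "patient_id", "test_date", "test_id",
    "specimen_id", "genomic_source_class", "ratio_ad_dp", "sample_position",
]
ALLOWED_BUILDS = {"GRCh37", "GRCh38"}
ALLOWED_SOURCE_CLASSES = {"germline", "somatic", "mixed"}


def validate_vcf_data(csv_text):
    """Hypothetical helper: return a list of problems found in vcfData.csv text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    problems = []
    for row_number, row in enumerate(reader, start=1):
        if row["ref_build"] not in ALLOWED_BUILDS:
            problems.append(f"row {row_number}: ref_build must be GRCh37 or GRCh38")
        try:
            # Checks only that the value parses in year-month-day order.
            datetime.strptime(row["test_date"], "%Y-%m-%d")
        except (TypeError, ValueError):
            problems.append(f"row {row_number}: test_date must be yyyy-mm-dd")
        if row["genomic_source_class"] not in ALLOWED_SOURCE_CLASSES:
            problems.append(
                f"row {row_number}: genomic_source_class must be germline, somatic, or mixed"
            )
        try:
            float(row["ratio_ad_dp"])
        except (TypeError, ValueError):
            problems.append(f"row {row_number}: ratio_ad_dp must be a number")
        try:
            position_ok = int(row["sample_position"]) >= 0
        except (TypeError, ValueError):
            position_ok = False
        if not position_ok:
            problems.append(f"row {row_number}: sample_position must be a zero-based integer")
    return problems


# The header and data row added to utilities/vcfData.csv in this commit.
sample = (
    "vcf_filename,ref_build,patient_id,test_date,test_id,specimen_id,"
    "genomic_source_class,ratio_ad_dp,sample_position\n"
    "TCGA_DD_A1EH.vcf,GRCh37,TCGA-DD-A1EH,2022-01-20,TCGA-DD-A1EH-T1,"
    "TCGA-DD-A1EH-Sp1,somatic,0.99,0\n"
)
print(validate_vcf_data(sample))  # prints []
```

Running a check like this before the batch job would surface a mistyped build name or date up front rather than partway through processing. The sketch uses only the standard library so it can run without the script's pandas dependency.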