diff --git a/README.md b/README.md index 78a55fc..808daac 100644 --- a/README.md +++ b/README.md @@ -2,34 +2,42 @@ Gauchian is a targeted variant caller for the GBA gene based on a whole-genome sequencing (WGS) BAM file. Gauchian uses a novel method to solve the problems caused by the high sequence similarity with the pseudogene paralog GBAP1 and is able to detect variants accurately in the Exons 9-11 homology region, such as large deletions or duplications between GBA and GBAP1, and GBAP1-like variants in GBA, including p.A495P, p.L483P, p.D448H, c.1263del, RecNciI, RecTL and c.1263del+RecTL. In addition to these challenging variants, Gauchian also calls known pathogenic or likely pathogenic GBA variants classified in ClinVar. Gauchian has been tested on Illumina WGS data with standard sequencing depth (>=30X). Gauchian does not work on targeted sequencing data. Please refer to our [preprint](https://www.medrxiv.org/content/10.1101/2021.11.12.21266253v1) for more details about the method. -## Running the program +## Installation + +```bash +git clone https://github.com/Illumina/Gauchian +cd Gauchian +python3 setup.py install +``` -This Python3 program can be run as follows: +## Running the program ```bash -python -m gauchian --manifest MANIFEST_FILE \ - --genome [19/37/38] \ - --prefix OUTPUT_FILE_PREFIX \ - --outDir OUTPUT_DIRECTORY \ - --threads NUMBER_THREADS +gauchian --manifest MANIFEST_FILE \ + --genome [19/37/38] \ + --prefix OUTPUT_FILE_PREFIX \ + --outDir OUTPUT_DIRECTORY \ + --threads NUMBER_THREADS ``` -The manifest is a text file in which each line should list the absolute path to an input WGS BAM/CRAM file. Full WGS BAM/CRAM files are recommended. If you would like to use a subsetted bamlet, please subset using region files in gauchian/data/GBA_region_*.bed. For CRAM input, it’s suggested to provide the path to the reference fasta file with `--reference` in the command. +The manifest is a text file in which each line should list the absolute path to an input WGS BAM/CRAM file. Full WGS BAM/CRAM files are recommended. If you would like to use a subsetted bamlet, please subset using region files in gauchian/data/GBA_region_*.bed. + +For CRAM input, it’s suggested to provide the path to the reference fasta file with `--reference` in the command. ## Interpreting the output The program produces a .tsv file in the directory specified by --outDir. The fields are explained below: -| Fields in tsv | Explanation | -|:----------------------------------------|:-------------------------------------------------------------------------------| -| Sample | Sample name | -| is_biallelic_GBAP1-like_variant_exon9-11| Whether the sample is called as biallelic for GBAP1-like variants in exon9-11 | -| is_carrier_GBAP1-like_variant_exon9-11 | Whether the sample is called as a carrier for GBAP1-like variants in exon9-11 | -| total_CN | Total copy number of GBA+GBAP1 | -| deletion_breakpoint_in_GBA_gene | Whether the deletion breakpoint is in GBA gene if a deletion exists | -| GBAP1-like_variant_exon9-11 | GBAP1-like variants called in exon9-11, two alleles separated by / | -| other_variants | Other variants called (non-GBAP1-like variants or variants outside of exon9-11)| +| Fields in tsv | Explanation | +|:-----------------------------------------|:-------------------------------------------------------------------------------| +| Sample | Sample name | +| is_biallelic(GBAP1-like_variant_exon9-11)| Whether the sample is called as biallelic for GBAP1-like variants in exon9-11 | +| is_carrier(GBAP1-like_variant_exon9-11) | Whether the sample is called as a carrier for GBAP1-like variants in exon9-11 | +| CN(GBA+GBAP1) | Total copy number of GBA+GBAP1 | +| deletion_breakpoint_in_GBA | Whether the deletion breakpoint is in GBA gene if a deletion exists | +| GBAP1-like_variant_exon9-11 | GBAP1-like variants called in exon9-11, two alleles separated by / | +| other_unphased_variants | Other variants called (non-GBAP1-like variants or variants outside of exon9-11)| A .json file is also produced that contains more information about each sample. diff --git a/docs/demo.rst b/docs/demo.rst new file mode 100644 index 0000000..b2dc3b2 --- /dev/null +++ b/docs/demo.rst @@ -0,0 +1,24 @@ +============ +Program Demo +============ + +1. Download a WGS bam file:: + + $ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/1000G_2504_high_coverage/data/ERR3239883/NA20815.final.cram + $ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/1000G_2504_high_coverage/data/ERR3239883/NA20815.final.cram.crai + +2. Make a manifest file:: + + $ echo "$PWD/NA20815.final.cram" > manifest.txt + +3. Run Gauchian, which takes about one to two minutes with single thread:: + + $ gauchian -m manifest.txt -g 38 -o out -p testgba + +4. Check the output file out/testgba.tsv + +============= ========================================= ======================================= ============= ========================== =========================== ======================= +Sample is_biallelic(GBAP1-like_variant_exon9-11) is_carrier(GBAP1-like_variant_exon9-11) CN(GBA+GBAP1) deletion_breakpoint_in_GBA GBAP1-like_variant_exon9-11 other_unphased_variants +============= ========================================= ======================================= ============= ========================== =========================== ======================= +NA20815.final False True 4 N/A L483P/ None +============= ========================================= ======================================= ============= ========================== =========================== ======================= diff --git a/gauchian/gauchian.py b/gauchian/gauchian.py index 6244956..b7858a0 100644 --- a/gauchian/gauchian.py +++ b/gauchian/gauchian.py @@ -159,12 +159,12 @@ def write_to_tsv(final_output, out_tsv): """Prepare tsv output""" header = [ "Sample", - "is_biallelic_GBAP1-like_variant_exon9-11", - "is_carrier_GBAP1-like_variant_exon9-11", - "total_CN", - "deletion_breakpoint_in_GBA_gene", + "is_biallelic(GBAP1-like_variant_exon9-11)", + "is_carrier(GBAP1-like_variant_exon9-11)", + "CN(GBA+GBAP1)", + "deletion_breakpoint_in_GBA", "GBAP1-like_variant_exon9-11", - "other_variants" + "other_unphased_variants" ] with open(out_tsv, "w") as tsv_output: tsv_output.write("\t".join(header) + "\n") @@ -186,6 +186,8 @@ def write_to_tsv(final_output, out_tsv): p1like_variants, other_variants ] + if output_per_sample[4] is None: + output_per_sample[4] = "N/A" tsv_output.write("\t".join([str(a) for a in output_per_sample]) + "\n") else: tsv_output.write("\t".join([str(a) for a in [sample_id]+[None]*6]) + "\n") diff --git a/setup.py b/setup.py index 97ae10b..3b4f9b7 100644 --- a/setup.py +++ b/setup.py @@ -17,6 +17,7 @@ def readme(): author_email='xchen2@illumina.com', license='GPLv3', packages=['gauchian', 'gauchian.caller', 'gauchian.depth_calling'], + package_data={'gauchian': ['data/*']}, install_requires=['pysam', 'numpy', 'scipy', 'statsmodels'], setup_requires=['pytest-runner'], tests_require=['pytest'],