Merge pull request #89 from zstephens/genreads_overhaul

Genreads overhaul
zstephens · Feb 6, 2021 · 375e192 · 375e192
2 parents cd99b14 + 9490049
commit 375e192
Show file tree

Hide file tree

Showing 60 changed files with 5,802 additions and 5,982 deletions.
diff --git a/.gitattributes b/.gitattributes
diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,5 @@
+# Ignore filetypes
+*.pyc
+/python2env/
+/.ipynb_checkpoints/
+/.vscode/
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
diff --git a/ChangeLog.md b/ChangeLog.md
@@ -0,0 +1,24 @@
+# NEAT v3.0
+- NEAT gen_reads now runs in Python 3 exclusively. The previous, Python 2 version is stored in the repo as v2.0, but will not be undergoing active development.
+- Converted sequence objects to Biopython Sequence objects to take advantage of the Biopython library
+- Converted cigar strings to lists. Now there are simply a functions that convert a generic cigar string to a list and vice versa.
+- Tried to take better advantage of some Biopython libraries, such as for parsing fastas.
+- For now, we've eliminated the "Jobs" option and the merge jobs function. We plan to implement multi-threading instead as a way to speed up NEAT's simulation process.
+- Added a basic bacterial wrapper that will simulate multiple generations of bacteria based on an input fasta, mutate them and then produce the fastqs, bams, and vcfs for the resultant bacterial population.
+
+
+## TODOs for v3.1
+NEAT is still undergoing active development with many exciting upgrades planned. We also plan to bring the code up to full production scale and will continue to improve the following features (if you would like to [Contribute](CONTRIBUTING.md))
+- Using Python's multithreading libraries, speed up NEAT's gen_reads tool significantly.
+- Take advantage of pandas library for reading in bed files and other files.
+- Code optimization for all gen_reads files (in source folder)
+- Further cleanup to PEP8 standards
+- Refactor the code to integrate NEAT's utilities into the package
+- Improvements and standardization for the utilities, across the board
+- VCF compare has some nice features and output, but is very slow to use. Can we improve this utility?
+
+For improvements, we have a lot of great ideas for general improvements aimed at better simulating bacteria, but we believe this same improvements will have applications in other species as well. 
+- Multiploidy - all right this has nothing to do with bacteria specifically, but it is a feature we would like to implement into gen_reads.
+- Structural Variants - model large scale structural variants with an eye toward intergenic SVs.
+- Transposable Elements - model transposons within the sequence
+- Repeat regions - will bring a variety of interesting applications
diff --git a/LICENSE.md b/LICENSE.md
diff --git a/README.md b/README.md
@@ -1,4 +1,8 @@
-# neat-genreads
+# The NEAT Project v3.0
+Welcome to the NEAT project, the NExt-generation sequencing Analysis Toolkit, version 3.0. Neat has now been updated with Python 3, and is moving toward PEP8 standards. There is still lots of work to be done. See the [ChangeLog](ChangeLog.md) for notes.
+
+Stay tuned over the coming weeks for exciting updates to NEAT, and learn how to [contribute](CONTRIBUTING.md) yourself. If you'd like to use some of our code, no problem! Just review the [license](LICENSE.md), first.
+
 NEAT-genReads is a fine-grained read simulator. GenReads simulates real-looking data using models learned from specific datasets. There are several supporting utilities for generating models used for simulation.
 
 This is an in-progress v2.0 of the software. For a previous stable release please see: [genReads1](https://github.com/zstephens/genReads1)
@@ -37,14 +41,20 @@ Table of Contents
 
 ## Requirements
 
-* Python 2.7
-* Numpy 1.9.1+
+* Python >= 3.6
+* biopython >= 1.78
+* matplotlib >= 3.3.4 (optional, for plotting utilities)
+* matplotlib_venn >= 0.11.6 (optional, for plotting utilities)
+* pandas >= 1.2.1
+* numpy >= 1.19.5
+* pysam >= 0.16.0.1
+
 
 ## Usage
 Here's the simplest invocation of genReads using default parameters. This command produces a single ended fastq file with reads of length 101, ploidy 2, coverage 10X, using the default sequencing substitution, GC% bias, and mutation rate models.
 
 ```
-python genReads.py -r ref.fa -R 101 -o simulated_data
+python gen_reads.py -r ref.fa -R 101 -o simulated_data
 ``` 
 
 The most commonly added options are --pe, --bam, --vcf, and -c. 
@@ -59,30 +69,34 @@ Option           |  Description
 -c <float>       |  Average coverage across the entire dataset. Default: 10
 -e <str>         |  Sequencing error model pickle file
 -E <float>       |  Average sequencing error rate. The sequencing error rate model is rescaled to make this the average value. 
--p <int>         |  ploidy [2]
--t <str>         |  bed file containing targeted regions; default coverage for targeted regions is 98% of -c option; default coverage outside targeted regions is 2% of -c option
+-p <int>         |  Sample Ploidy, default 2
+-tr <str>        |  Bed file containing targeted regions; default coverage for targeted regions is 98% of -c option; default coverage outside targeted regions is 2% of -c option
+-dr <str>	     |  Bed file with sample regions to discard.
 -to <float>      |  off-target coverage scalar [0.02]
 -m <str>         |  mutation model pickle file
 -M <float>       |  Average mutation rate. The mutation rate model is rescaled to make this the average value. Must be between 0 and 0.3. These random mutations are inserted in addition to the once specified in the -v option.
--s <str>         |  input sample model
+-Mb <str>	 |  Bed file containing positional mutation rates
+-N <int>	 |  Below this quality score, base-call's will be replaced with N's
 -v <str>         |  Input VCF file. Variants from this VCF will be inserted into the simulated sequence with 100% certainty.
 --pe <int> <int> |  Paired-end fragment length mean and standard deviation. To produce paired end data, one of --pe or --pe-model must be specified.
 --pe-model <str> |  Empirical fragment length distribution. Can be generated using [computeFraglen.py](#computefraglenpy). To produce paired end data, one of --pe or --pe-model must be specified.
 --gc-model <str> |  Empirical GC coverage bias distribution.  Can be generated using [computeGC.py](#computegcpy)
---job <int> <int>|  Jobs IDs for generating reads in parallel
---nnr            |  save non-N ref regions (for parallel jobs)
 --bam            |  Output golden BAM file
 --vcf            |  Output golden VCF file
+--fa		 |  Output FASTA instead of FASTQ
 --rng <int>      |  rng seed value; identical RNG value should produce identical runs of the program, so things like read locations, variant positions, error positions, etc, should all be the same.
 --gz             |  Gzip output FQ and VCF
 --no-fastq       |  Bypass generation of FASTQ read files
+--discard-offtarget |  Discard reads outside of targeted regions
+--rescale-qual   |  Rescale Quality scores to match -E input
+-d  |   Turn on debugging mode (useful for development)
 
 
 ## Functionality
 
 ![Diagram describing the way that genReads simulates datasets](docs/flow_new.png "Diagram describing the way that genReads simulates datasets")
 
-NEAT genReads produces simulated sequencing datasets. It creates FASTQ files with reads sampled from a provided reference genome, using sequencing error rates and mutation rates learned from real sequencing data. The strength of genReads lies in the ability for the user to customize many sequencing parameters, produce 'golden', true positive datasets, and produce types of data that other simulators cannot (e.g. tumour/normal data).
+NEAT gen_reads produces simulated sequencing datasets. It creates FASTQ files with reads sampled from a provided reference genome, using sequencing error rates and mutation rates learned from real sequencing data. The strength of genReads lies in the ability for the user to customize many sequencing parameters, produce 'golden', true positive datasets, and produce types of data that other simulators cannot (e.g. tumour/normal data).
 
 Features:
 
@@ -111,7 +125,7 @@ The following commands are examples for common types of data to be generated. Th
 Simulate whole genome dataset with random variants inserted according to the default model. 
 
 ```
-python genReads.py                  \
+python gen_reads.py                  \
         -r hg19.fa                  \
         -R 126                      \
         -o /home/me/simulated_reads \
@@ -124,7 +138,7 @@ python genReads.py                  \
 Simulate a targeted region of a genome, e.g. exome, with -t.
 
 ```
-python genReads.py                  \
+python gen_reads.py                  \
         -r hg19.fa                  \
         -R 126                      \
         -o /home/me/simulated_reads \
@@ -138,7 +152,7 @@ python genReads.py                  \
 Simulate a whole genome dataset with only the variants in the provided VCF file using -v and -M.
 
 ```
-python genReads.py                  \
+python gen_reads.py                  \
         -r hg19.fa                  \
         -R 126                      \
         -o /home/me/simulated_reads \
@@ -153,7 +167,7 @@ python genReads.py                  \
 Simulate single-end reads by omitting the --pe option.
 
 ```
-python genReads.py                  \
+python gen_reads.py                  \
         -r hg19.fa                  \
         -R 126                      \
         -o /home/me/simulated_reads \
@@ -165,7 +179,7 @@ python genReads.py                  \
 Simulate PacBio-like reads by providing an error model.
 
 ```
-python genReads.py                         \
+python gen_reads.py                         \
 	-r hg19.fa                         \
 	-R 5000                            \
 	-e models/errorModel_pacbio_toy.p  \
@@ -177,10 +191,10 @@ python genReads.py                         \
 When possible, simulation can be done in parallel via multiple executions with different --job options. The resultant files will then need to be merged using utilities/mergeJobs.py. The following example shows splitting a simulation into 4 separate jobs (which can be run independently):
 
 ```
-python genReads.py  -r hg19.fa -R 126 -o /home/me/simulated_reads --bam --vcf --job 1 4
-python genReads.py  -r hg19.fa -R 126 -o /home/me/simulated_reads --bam --vcf --job 2 4
-python genReads.py  -r hg19.fa -R 126 -o /home/me/simulated_reads --bam --vcf --job 3 4
-python genReads.py  -r hg19.fa -R 126 -o /home/me/simulated_reads --bam --vcf --job 4 4
+python gen_reads.py  -r hg19.fa -R 126 -o /home/me/simulated_reads --bam --vcf --job 1 4
+python gen_reads.py  -r hg19.fa -R 126 -o /home/me/simulated_reads --bam --vcf --job 2 4
+python gen_reads.py  -r hg19.fa -R 126 -o /home/me/simulated_reads --bam --vcf --job 3 4
+python gen_reads.py  -r hg19.fa -R 126 -o /home/me/simulated_reads --bam --vcf --job 4 4
 
 python mergeJobs.py -i /home/me/simulated_reads -o /home/me/simulated_reads_merged -s /path/to/samtools
 ```
@@ -190,9 +204,9 @@ In future revisions the dependence on SAMtools will be removed.
 To simulate human WGS 50X, try 50 chunks or less.
 
 # Utilities	
-Several scripts are distributed with genReads that are used to generate the models used for simulation.
+Several scripts are distributed with gen_reads that are used to generate the models used for simulation.
 
-## computeGC.py
+## compute_gc.py
 
 Computes GC% coverage bias distribution from sample (bedrolls genomecov) data.
 Takes .genomecov files produced by BEDtools genomeCov (with -d option).
@@ -212,7 +226,7 @@ python computeGC.py                 \
         -o /path/to/model.p
 ```
 
-## computeFraglen.py
+## compute_fraglen.py
 
 Computes empirical fragment length distribution from sample data.
 Takes SAM file via stdin:
@@ -221,7 +235,7 @@ Takes SAM file via stdin:
 
 and creates fraglen.p model in working directory.
 
-## genMutModel.py
+## gen_mut_model.py
 
 Takes references genome and TSV file to generate mutation models:
 
@@ -234,9 +248,20 @@ python genMutModel.py               \
 
 Trinucleotides are identified in the reference genome and the variant file. Frequencies of each trinucleotide transition are calculated and output as a pickle (.p) file.
 
+Option           |  Description
+------           |:----------
+-r <str>         |  Reference file for organism in FASTA format. Required
+-m <str>         |  Mutation file for organism in VCF format. Required
+-o <str>         |  Path to output file and prefix. Required. 
+-b <str>         |  BED file of regions to include
+--save-trinuc    |  Save trinucleotide counts for reference
+--human-sample   |  Use to skip unnumbered scaffolds in human references
+--skip-common    |  Do not save common snps or high mutation areas
+
+
 ## genSeqErrorModel.py
 
-Generates sequence error model for genReads.py -e option.
+Generates sequence error model for gen_reads.py -e option.
 This script needs revision, to improve the quality-score model eventually, and to include code to learn sequencing errors from pileup data.
 
 ```
@@ -251,6 +276,7 @@ python genSeqErrorModel.py                            \
         -s number of simulation iterations [1000000]  \
         --plot perform some optional plotting
 ```
+
 ## plotMutModel.py
 
 Performs plotting and comparison of mutation models generated from genMutModel.py.
@@ -268,8 +294,6 @@ Tool for comparing VCF files.
 
 ```
 python vcf_compare_OLD.py
-        --version          show program's version number and exit      \
-        -h, --help         show this help message and exit             \
         -r <ref.fa>        * Reference Fasta                           \
         -g <golden.vcf>    * Golden VCF                                \
         -w <workflow.vcf>  * Workflow VCF                              \