FastqSimulator

A program for taking reference sequences and creating fake FASTQ reads

Process

Using the parser modules created for the reference sequence generator project, this program takes input files in GenBank (representing a genomic region, NG_ accession) or LRG format. It extracts coordinates and sequences from these files and stores them in a dictionary.
For each transcript found in the input file, a version of the original dictionary is modified using the sequence_modifier class. This class uses the full genomic sequence and the exon coordinates to create artificial variants using the following algorithm:
- Find the start and end coordinates, then generate a random number between the two positions to use as the coordinate to change
- Use a random selector to pick one of A, C, G, or T (reselect if the new base is the same as the original)
- Substitute the base in the genomic sequence with the new one
- Use the old base, the new base, the coordinate position relative to the start of the coding sequence, and the length of the protein sequence to predict the HGVS nomenclature for the variant annotation
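
The substitution steps above can be sketched roughly as follows. This is a minimal illustration, not the code in sequence_modifier.py: the function names, the 0-based end-exclusive coordinates, and the optional `rng` argument are all assumptions. The naming helper only covers the simplest case, a position at or after the coding start with no intron in between, and does not attempt the protein-level prediction.

```python
import random

BASES = "ACGT"


def substitute_random_base(sequence, start, end, rng=random):
    """Swap one base at a random position in [start, end) (0-based, hypothetical API)."""
    position = rng.randrange(start, end)
    old_base = sequence[position].upper()
    new_base = rng.choice(BASES)
    # Reselect until the new base differs from the original one.
    while new_base == old_base:
        new_base = rng.choice(BASES)
    modified = sequence[:position] + new_base + sequence[position + 1:]
    return modified, position, old_base, new_base


def coding_hgvs(position, cds_start, old_base, new_base):
    """Simplified c. name; assumes no intron between cds_start and position."""
    c_position = position - cds_start + 1  # HGVS coding positions are 1-based
    return "c.{}{}>{}".format(c_position, old_base, new_base)
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which may help when comparing simulated variants against caller output.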
Once the sequence has been modified, the exon sequence (along with a region of padding on either side to allow for decent read coverage) is extracted, together with details about the exon and the variant. These are placed in a new dictionary and returned.
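
As a rough sketch of the padded extraction, assuming 0-based end-exclusive exon coordinates and a hypothetical default padding of 100 bases (the actual padding size and dictionary keys used by the program are not stated here):

```python
def extract_padded_exon(sequence, exon_start, exon_end, padding=100):
    """Return the exon plus flanking padding, clamped to the sequence bounds.

    Coordinates are assumed to be 0-based and end-exclusive; the key names
    in the returned dictionary are illustrative only.
    """
    region_start = max(0, exon_start - padding)
    region_end = min(len(sequence), exon_end + padding)
    return {
        "sequence": sequence[region_start:region_end],
        "region_start": region_start,
        "region_end": region_end,
    }
```

Clamping to the sequence bounds keeps the slice valid when an exon sits closer to either end of the genomic region than the padding width.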

ANNOVAR requires a separate database for Ensembl gene IDs. This database (ensGene) is not currently used in the code.
For non-coding exons, the variants may be mismatched by 1 base. This may be due to differences between LRG and GenBank coordinates, or may be a different issue. Hard-coding a change to the offset fixes some cases and breaks others; this needs further investigation.
