This directory contains scripts that can be used to eliminate mapping bias from mapped allele-specific reads. First, reads are mapped normally using a mapper chosen by the user (must output BAM or SAM format). Then mapped reads that overlap single nucleotide polymorphisms (SNPs) are identified. For each read that overlaps a SNP, its genotype is swapped with that of the other allele and the read is re-mapped. Re-mapped reads that fail to map to exactly the same location in the genome are discarded.
We now provide a Snakemake workflow that can be used to run the entire mappability filtering pipeline. For more information see the Snakemake README
Create input files containing genetic variants (SNPs). The WASP pipeline can take SNP input files in HDF5 format or in a text-based format. The HDF5 format is preferred as it contains phase information that improves the performance of the mappability filtering pipeline. (If you do not have phasing information, then the text-based format should be used.)
Convert SNP data to HDF5 format using the program snp2h5. HDF5 files are an efficient binary data format used by WASP scripts. The snp2h5 program can take VCF or IMPUTE2 files as input.
If VCF files are used, the sample names are read directly from the header lines of the VCF file. If IMPUTE2 files are used, then the names of the samples should be provided in a separate text file.
# using an IMPUTE input file:
./snp2h5/snp2h5 --chrom example_data/chromInfo.hg19.txt \
--format impute \
--geno_prob example_data/geno_probs.h5 \
--snp_index example_data/snp_index.h5 \
--snp_tab example_data/snp_tab.h5 \
--haplotype example_data/haps.h5 \
--samples samples_names.txt
example_data/genotypes/chr*.hg19.impute2.gz \
example_data/genotypes/chr*.hg19.impute2_haps.gz
# using VCF files:
./snp2h5/snp2h5 --chrom data/ucsc/hg19/chromInfo.txt.gz \
--format vcf \
--haplotype haplotypes.h5 \
--snp_index snp_index.h5 \
--snp_tab snp_tab.h5 \
data/1000G/ALL.chr*.vcf.gz
# read separate VCF files that contain genotype likelihoods (GL)
# or genotype probabilities (GP) if these were not in the main
# VCF file
./snp2h5/snp2h5 --chrom data/ucsc/hg19/chromInfo.txt.gz \
--format vcf \
--geno_prob geno_probs.h5 \
1000G/supporting/genotype_likelihoods/shapeit2/ALL.chr*.gl.vcf.gz
The text-based input files have three space-delimited columns (position, ref_allele, alt_allele), and one input file per chromosome. The filenames must contain the name of the chromosome (e.g. chr2).
We provide example scripts that can be used to create these files from IMPUTE or VCF files:
# get SNPs from IMPUTE files:
./mapping/extract_impute_snps.sh example_data/genotypes/ output_snp_dir/
# get SNPs from VCF files:
./mapping/extract_vcf_snps.sh example_data/genotypes/ output_snp_dir/
Map the fastq files using your favorite mapper/options and filter for quality using a cutoff of your choice
bowtie2 -x bowtie2_index/hg37 -1 ${SAMPLE_NAME}_1.fq.gz \
-2 ${SAMPLE_NAME}_2.fq.gz \
| samtools view -b -q 10 - > map1/${SAMPLE_NAME}.bam
samtools sort -o map1/${SAMPLE_NAME}.sort.bam map1/${SAMPLE_NAME}.bam
samtools index map1/${SAMPLE_NAME}.sort.bam
Use find_intersecting_snps.py to identify reads that may have mapping biases
python mapping/find_intersecting_snps.py \
--is_paired_end \
--is_sorted \
--output_dir find_intersecting_snps \
--snp_tab snp_tab.h5 \
--snp_index snp_index.h5 \
--haplotype haplotype.h5 \
--samples my_samples.txt \
map1/${SAMPLE_NAME}.sort.bam
positional arguments:
bam_filename Coordinate-sorted input BAM file containing
mapped reads.
optional arguments:
-h, --help show this help message and exit
--is_paired_end, -p Indicates that reads are paired-end (default
is single).
--is_sorted, -s Indicates that the input BAM file is
coordinate-sorted (default is False).
--max_seqs MAX_SEQS The maximum number of sequences with
different allelic combinations to consider
remapping (default=64). Read pairs wi
more allelic combinations than MAX_SEQs are
discarded
--max_snps MAX_SNPS The maximum number of SNPs allowed to
overlap a read before discarding the read.
Allowing higher numbers will decrease speed
and increase memory usage (default=6).
--output_dir OUT_DIR Directory to write output files to. If not
specified, output files are written to the
same directory as the input BAM file.
--snp_dir SNP_DIR Directory containing SNP text files This
directory should contain one file per
chromosome named like chr<#>.snps.txt
Each file should contain 3 columns: position
RefAllele AltAllele. This option should
only be used if --snp_tab, --snp_index,
and --haplotype arguments are not used.
If this argument is provided, all possible
allelic combinations are used (rather
than set of observed haplotypes)
--snp_tab SNP_TABLE_H5_FILE
Path to HDF5 file to read SNP info from. Each
row of SNP table contains SNP name, position,
allele1, allele2.
--snp_index SNP_INDEX_H5_FILE
Path to HDF5 file containing SNP index.
The SNP index is used to convert the
genomic position of a SNP to
its corresponding row in the haplotype and
snp_tab HDF5 files.
--haplotype HAPLOTYPE_H5_FILE
Path to HDF5 file to read phased haplotypes
from. When generating alternative reads use
known haplotypes from this file rather than
all possible allelic combinations.
--samples SAMPLES Use only haplotypes and SNPs that are
polymorphic in these samples. SAMPLES can
either be a comma-delimited string of sample
names or a path to a file with one
sample name per line (file is assumed to be
whitespace-delimited and first column is
assumed to be sample name). Sample names
should match those present in the
--haplotype file. Samples are ignored if no
haplotype file is provided.
PREFIX.keep.bam - bamfile with reads that did not intersect SNPs
or indels that can be kept without remapping
PREFIX.to.remap.bam - bamfile with original reads that overlapped SNPs
that need to be remapped
PREFIX.remap.fq.gz - fastq file containing the reads with flipped
alleles to remap. If paired-end option is used
two files ending with .fq1.gz and .fq2.gz are output.
(PREFIX is the name of the input file, excluding the trailing .bam)
Note: Reads that overlap indels are currently excluded and
will not be present in any of the 'remap' files or the
input.keep.bam file. For this reason the total number of reads
will not add up to the number of reads provided in the
input.sort.bam file.
Map the PREFIX.remap.fq.gz using the same mapping arguments used in Step 1. Note that the arguments should be exactly the same as those in Step 1 EXCEPT for arguments that directly modify the reads that are used by the aligner. For example the read trimming arguments to bowtie (-3 and -5 arguments) should be used in Step 1 ONLY, because they modify the reads that are output by bowtie.
bowtie2 -x bowtie2_index/hg37 \
-1 find_intersecting_snps/${SAMPLE_NAME}_1.remap.fq.gz \
-2 find_intersecting_snps/${SAMPLE_NAME}_2.remap.fq.gz \
| samtools view -b -q 10 - > map2/${SAMPLE_NAME}.bam
samtools sort -o map2/${SAMPLE_NAME}.sort.bam map2/${SAMPLE_NAME}.bam
samtools index map2/${SAMPLE_NAME}.sort.bam
Use filter_remapped_reads.py to filter out reads where one or more of the allelic versions of the reads fail to map back to the same location as the original read.
filter_remapped_reads.py [-h] to_remap_bam remap_bam keep_bam
positional arguments:
to_remap_bam input BAM file containing original set of reads that
needed to be remapped after having their alleles
flipped. This file is output by the
find_intersecting_snps.py script.
remap_bam input BAM file containing remapped reads (with flipped
alleles)
keep_bam output BAM file to write filtered set of reads to
python mapping/filter_remapped_reads.py \
find_intersection_snps/${SAMPLE_NAME}.to.remap.bam \
map2/${SAMPLE_NAME}.sort.bam \
filter_remapped_reads/${SAMPLE_NAME}.keep.bam
Merge ${SAMPLE_NAME}.keep.bam and ${SAMPLE_NAME}.remap.keep.bam can be merged for a complete set of mappability-filtered aligned reads. The merged file should then be sorted and indexed:
samtools merge merge/${SAMPLE_NAME}.keep.merge.bam \
filter_remapped_reads/${SAMPLE_NAME}.keep.bam \
find_intersecting_snps/${SAMPLE_NAME}.keep.bam
samtools sort -o merge/${SAMPLE_NAME}.keep.merge.sort.bam \
merge/${SAMPLE_NAME}.keep.merge.bam
samtools index ${SAMPLE_NAME}.keep.merged.sort.bam
Filter duplicate reads. Programs such as samtools rmdup introduce bias when they filter duplicate reads because they retain the read with the highest score (which usually matches the reference). We provide a script rmdup.py which performs unbiased removal of duplicate reads. The script discards duplicate reads at random (independent of their score). The input BAM or SAM file must be sorted.
# for single end reads:
python rmdup.py <sorted.input.bam> <output.bam>
# for paired-end reads:
python rmdup_pe.py <sorted.input.bam> <output.bam>
To run the tests, execute py.test
from within the mapping directory.
The tests currently require bowtie2 and samtools to be in the PATH.