Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tar -f option explanation correction. "GenomeAnalysisTK-latest" is later now. #2

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion doc/tutorials-2011/bwa_tutorial.txt
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ First we are going to grab the source files for bwa from sourceforge, using curl

%% curl -O -L http://sourceforge.net/projects/bio-bwa/files/bwa-0.5.9.tar.bz2

Now we want to uncompress the tarball file using "tar". x extracts, v is verbose (telling you what it is doing), f skips prompting for each individual file, and j tells it to unzip .bz2 files.::
Now we want to uncompress the tarball file using "tar". x extracts, v is verbose (telling you what it is doing), f specifies the archive file (otherwise tar will assume it's the default Tape ARchive device node), and j tells it to unzip .bz2 files.::

%% tar xvfj bwa-0.5.9.tar.bz2
%% cd bwa-0.5.9
Expand Down
12 changes: 6 additions & 6 deletions doc/tutorials-2011/snp_tutorial.txt
Original file line number Diff line number Diff line change
Expand Up @@ -83,16 +83,16 @@ Now let's run GATK's Unified Genotyper. The GATK people recommend a few quality

Read 1 looks fine, but read 2 is probably mis-aligned; rather than counting the C as an insertion, which is probably better, the aligner has placed it after the indel, making it look like there is a SNP there. In GATK we can account for this problem by doing local re-alignment around potential indel sites, in which we incorporate information about all the reads in that region simultaneously, rather than mapping each read individually. First, GATK needs to figure out which regions are in need of re-alignment::

%% java -Xmx1g -jar GenomeAnalysisTK-1.0.5974/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /data/drosophila/dmel-all-chromosome-r5.37.fasta -I ../RAL357_full_bwa.sorted.bam -o RAL357.realign.intervals -L 2L:100000-150000
%% java -Xmx1g -jar GenomeAnalysisTK-1.1-23-g8072bd9/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /data/drosophila/dmel-all-chromosome-r5.37.fasta -I ../RAL357_full_bwa.sorted.bam -o RAL357.realign.intervals -L 2L:100000-150000

And then it needs to actually do the re-alignment (this step is slow, taking ~10 min, even for our small region)::

%% java -Xmx4g -jar GenomeAnalysisTK-1.0.5974/GenomeAnalysisTK.jar -I ../RAL357_full_bwa.sorted.bam -R /data/drosophila/dmel-all-chromosome-r5.37.fasta -T IndelRealigner -targetIntervals RAL357.realign.intervals -o ../RAL357_full_bwa.realigned.bam
%% java -Xmx4g -jar GenomeAnalysisTK-1.1-23-g8072bd9/GenomeAnalysisTK.jar -I ../RAL357_full_bwa.sorted.bam -R /data/drosophila/dmel-all-chromosome-r5.37.fasta -T IndelRealigner -targetIntervals RAL357.realign.intervals -o ../RAL357_full_bwa.realigned.bam

Now we need to do this re-alignment again with the other fly strain that we genotyped (if you were running a large number of samples, this is where a shell script would come in handy). You might want to run this in the background or in a new session while you're re-aligning the first one::

%% java -Xmx1g -jar GenomeAnalysisTK-1.0.5974/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /data/drosophila/dmel-all-chromosome-r5.37.fasta -I ../RAL391_full_bwa.sorted.bam -o RAL391.realign.intervals -L 2L:100000-150000
%% java -Xmx4g -jar GenomeAnalysisTK-1.0.5974/GenomeAnalysisTK.jar -I ../RAL391_full_bwa.sorted.bam -R /data/drosophila/dmel-all-chromosome-r5.37.fasta -T IndelRealigner -targetIntervals RAL391.realign.intervals -o ../RAL391_full_bwa.realigned.bam
%% java -Xmx1g -jar GenomeAnalysisTK-1.1-23-g8072bd9/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /data/drosophila/dmel-all-chromosome-r5.37.fasta -I ../RAL391_full_bwa.sorted.bam -o RAL391.realign.intervals -L 2L:100000-150000
%% java -Xmx4g -jar GenomeAnalysisTK-1.1-23-g8072bd9/GenomeAnalysisTK.jar -I ../RAL391_full_bwa.sorted.bam -R /data/drosophila/dmel-all-chromosome-r5.37.fasta -T IndelRealigner -targetIntervals RAL391.realign.intervals -o ../RAL391_full_bwa.realigned.bam

Now we would like to run the SNP caller, but we first need to use a package called Picard to fix a minor formatting problem in the re-aligned BAM files before GATK's Unified Genotyper will accept them (these steps take a few minutes each as well)::

Expand All @@ -105,7 +105,7 @@ Now we would like to run the SNP caller, but we first need to use a package call
And finally we can run GATK's SNP caller::

%% cd /data/snp_calling/GATK_snps
%% java -jar GenomeAnalysisTK-1.0.5974/GenomeAnalysisTK.jar -R /data/drosophila/dmel-all-chromosome-r5.37.fasta -T UnifiedGenotyper -I ../RAL357_full_bwa.realigned.fixed.bam -I ../RAL391_full_bwa.realigned.fixed.bam -o RAL_GATK.vcf -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 500 -L 2L:100000-150000
%% java -jar GenomeAnalysisTK-1.1-23-g8072bd9/GenomeAnalysisTK.jar -R /data/drosophila/dmel-all-chromosome-r5.37.fasta -T UnifiedGenotyper -I ../RAL357_full_bwa.realigned.fixed.bam -I ../RAL391_full_bwa.realigned.fixed.bam -o RAL_GATK.vcf -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 500 -L 2L:100000-150000

These parameters are similar, but not identical, to those in Samtools. -stand_emit_conf 10.0 means that it won't report any potential SNPs with a quality below 10.0; but unless they meet the quality threshold set by -stand_call_conf (50.0, in this case), they will be listed as failing the quality filter. -dcov 500 means that any site that has more than 500x coverage, the genotype caller will only use 500 randomly selected reads (for computational efficiency).

Expand Down Expand Up @@ -137,4 +137,4 @@ There is a package called vcftools that has all sorts of utilities for working w

Another exercise:

In the snp_calling directory, you will also find BAM files generated by aligning the same set of reads to the same reference genome for one of the two fly lines (RAL357) using bowtie rather than bwa. Use Samtools to call SNPs and generate a VCF file on the bowtie alignment and compare it to the VCF file you got from the bwa alignment.
In the snp_calling directory, you will also find BAM files generated by aligning the same set of reads to the same reference genome for one of the two fly lines (RAL357) using bowtie rather than bwa. Use Samtools to call SNPs and generate a VCF file on the bowtie alignment and compare it to the VCF file you got from the bwa alignment.