From e03580f8ed457d047f6d6feaac274b5d0b236364 Mon Sep 17 00:00:00 2001 From: bdklahn Date: Tue, 9 Aug 2011 10:36:13 -0400 Subject: [PATCH 1/2] correct tar -f option explanation acronym emphasis --- doc/tutorials-2011/bwa_tutorial.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/tutorials-2011/bwa_tutorial.txt b/doc/tutorials-2011/bwa_tutorial.txt index 7d29871..e3a0ecb 100644 --- a/doc/tutorials-2011/bwa_tutorial.txt +++ b/doc/tutorials-2011/bwa_tutorial.txt @@ -25,7 +25,7 @@ First we are going to grab the source files for bwa from sourceforge, using curl %% curl -O -L http://sourceforge.net/projects/bio-bwa/files/bwa-0.5.9.tar.bz2 -Now we want to uncompress the tarball file using "tar". x extracts, v is verbose (telling you what it is doing), f skips prompting for each individual file, and j tells it to unzip .bz2 files.:: +Now we want to uncompress the tarball file using "tar". x extracts, v is verbose (telling you what it is doing), f specifies the archive file (otherwise tar will assume it's the default Tape ARchive device node), and j tells it to unzip .bz2 files.:: %% tar xvfj bwa-0.5.9.tar.bz2 %% cd bwa-0.5.9 From a4666931bb2afa532d4f1cafb4472d59b85a690e Mon Sep 17 00:00:00 2001 From: bdklahn Date: Tue, 9 Aug 2011 12:50:54 -0400 Subject: [PATCH 2/2] new version of files inside GenomeAnalysisTK-latest.tar.bz2. Updated java commands file name arguments for this. --- doc/tutorials-2011/snp_tutorial.txt | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/doc/tutorials-2011/snp_tutorial.txt b/doc/tutorials-2011/snp_tutorial.txt index bea7ed5..b8f4e69 100644 --- a/doc/tutorials-2011/snp_tutorial.txt +++ b/doc/tutorials-2011/snp_tutorial.txt @@ -83,16 +83,16 @@ Now let's run GATK's Unified Genotyper. The GATK people recommend a few quality Read 1 looks fine, but read 2 is probably mis-aligned; rather than counting the C as an insertion, which is probably better, the aligner has placed it after the indel, making it look like there is a SNP there. In GATK we can account for this problem by doing local re-alignment around potential indel sites, in which we incorporate information about all the reads in that region simultaneously, rather than mapping each read individually. First, GATK needs to figure out which regions are in need of re-alignment:: - %% java -Xmx1g -jar GenomeAnalysisTK-1.0.5974/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /data/drosophila/dmel-all-chromosome-r5.37.fasta -I ../RAL357_full_bwa.sorted.bam -o RAL357.realign.intervals -L 2L:100000-150000 + %% java -Xmx1g -jar GenomeAnalysisTK-1.1-23-g8072bd9/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /data/drosophila/dmel-all-chromosome-r5.37.fasta -I ../RAL357_full_bwa.sorted.bam -o RAL357.realign.intervals -L 2L:100000-150000 And then it needs to actually do the re-alignment (this step is slow, taking ~10 min, even for our small region):: - %% java -Xmx4g -jar GenomeAnalysisTK-1.0.5974/GenomeAnalysisTK.jar -I ../RAL357_full_bwa.sorted.bam -R /data/drosophila/dmel-all-chromosome-r5.37.fasta -T IndelRealigner -targetIntervals RAL357.realign.intervals -o ../RAL357_full_bwa.realigned.bam + %% java -Xmx4g -jar GenomeAnalysisTK-1.1-23-g8072bd9/GenomeAnalysisTK.jar -I ../RAL357_full_bwa.sorted.bam -R /data/drosophila/dmel-all-chromosome-r5.37.fasta -T IndelRealigner -targetIntervals RAL357.realign.intervals -o ../RAL357_full_bwa.realigned.bam Now we need to do this re-alignment again with the other fly strain that we genotyped (if you were running a large number of samples, this is where a shell script would come in handy). You might want to run this in the background or in a new session while you're re-aligning the first one:: - %% java -Xmx1g -jar GenomeAnalysisTK-1.0.5974/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /data/drosophila/dmel-all-chromosome-r5.37.fasta -I ../RAL391_full_bwa.sorted.bam -o RAL391.realign.intervals -L 2L:100000-150000 - %% java -Xmx4g -jar GenomeAnalysisTK-1.0.5974/GenomeAnalysisTK.jar -I ../RAL391_full_bwa.sorted.bam -R /data/drosophila/dmel-all-chromosome-r5.37.fasta -T IndelRealigner -targetIntervals RAL391.realign.intervals -o ../RAL391_full_bwa.realigned.bam + %% java -Xmx1g -jar GenomeAnalysisTK-1.1-23-g8072bd9/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /data/drosophila/dmel-all-chromosome-r5.37.fasta -I ../RAL391_full_bwa.sorted.bam -o RAL391.realign.intervals -L 2L:100000-150000 + %% java -Xmx4g -jar GenomeAnalysisTK-1.1-23-g8072bd9/GenomeAnalysisTK.jar -I ../RAL391_full_bwa.sorted.bam -R /data/drosophila/dmel-all-chromosome-r5.37.fasta -T IndelRealigner -targetIntervals RAL391.realign.intervals -o ../RAL391_full_bwa.realigned.bam Now we would like to run the SNP caller, but we first need to use a package called Picard to fix a minor formatting problem in the re-aligned BAM files before GATK's Unified Genotyper will accept them (these steps take a few minutes each as well):: @@ -105,7 +105,7 @@ Now we would like to run the SNP caller, but we first need to use a package call And finally we can run GATK's SNP caller:: %% cd /data/snp_calling/GATK_snps - %% java -jar GenomeAnalysisTK-1.0.5974/GenomeAnalysisTK.jar -R /data/drosophila/dmel-all-chromosome-r5.37.fasta -T UnifiedGenotyper -I ../RAL357_full_bwa.realigned.fixed.bam -I ../RAL391_full_bwa.realigned.fixed.bam -o RAL_GATK.vcf -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 500 -L 2L:100000-150000 + %% java -jar GenomeAnalysisTK-1.1-23-g8072bd9/GenomeAnalysisTK.jar -R /data/drosophila/dmel-all-chromosome-r5.37.fasta -T UnifiedGenotyper -I ../RAL357_full_bwa.realigned.fixed.bam -I ../RAL391_full_bwa.realigned.fixed.bam -o RAL_GATK.vcf -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 500 -L 2L:100000-150000 These parameters are similar, but not identical, to those in Samtools. -stand_emit_conf 10.0 means that it won't report any potential SNPs with a quality below 10.0; but unless they meet the quality threshold set by -stand_call_conf (50.0, in this case), they will be listed as failing the quality filter. -dcov 500 means that any site that has more than 500x coverage, the genotype caller will only use 500 randomly selected reads (for computational efficiency). @@ -137,4 +137,4 @@ There is a package called vcftools that has all sorts of utilities for working w Another exercise: -In the snp_calling directory, you will also find BAM files generated by aligning the same set of reads to the same reference genome for one of the two fly lines (RAL357) using bowtie rather than bwa. Use Samtools to call SNPs and generate a VCF file on the bowtie alignment and compare it to the VCF file you got from the bwa alignment. \ No newline at end of file +In the snp_calling directory, you will also find BAM files generated by aligning the same set of reads to the same reference genome for one of the two fly lines (RAL357) using bowtie rather than bwa. Use Samtools to call SNPs and generate a VCF file on the bowtie alignment and compare it to the VCF file you got from the bwa alignment.