-
Notifications
You must be signed in to change notification settings - Fork 242
HOWTOs
NOTE: This page is obsolete and out of date. The documentation has moved to http://samtools.github.io/bcftools/howtos/index.html.
- Installing samtools and bcftools
- Variant calling via mpileup
- Variant filtering
- Consensus calling
- Obtaining the list (or number) of samples from a VCF
- Querying a VCF
- Add contig or missing INFO/FORMAT lines to a header
- Add INFO tags from a TAB/BED file
Installing samtools and bcftools
git clone --branch=develop git://github.com/samtools/htslib.git
git clone --branch=develop git://github.com/samtools/bcftools.git
git clone --branch=develop git://github.com/samtools/samtools.git
cd bcftools; make
cd ../samtools; make
Variant calling via mpileup
samtools mpileup -uf ref.fa aln.bam | bcftools call -mv -Oz > calls.vcf.gz
Note that the calling can be made more sensitive/restrictive by using different prior,
the default is bcftools call -P1.1e-3
.
Variant filtering
Variant filtering is an area of active development and is not easy. The annotations produced by variant callers provide only indirect hints about quality of the calls, an approach which worked for one dataset may not work for another. Nowadays most powerful seem machine learning approaches such as SVM (not implemented in bcftools), see an example of SVM filtering pipeline here.
The least one can do is to apply fixed-threshold filtering. One can get a feel for typical distributions of annotations from a set of known good calls vs all calls. If you don’t already have a pipeline for this, use bcftools query to extract the annotations and plot manually.
A good measure of the callset quality is often ts/tv, the ratio of transitions and transversions. Try the following command with different QUAL thresholds
bcftools filter -i'%QUAL>20' calls.vcf.gz | bcftools stats | grep TSTV
Another useful metrics is sequencing depth (DP bigger than twice the average depth indicates problematic regions and is often enriched for artefacts); the minimum number of high-quality non-reference reads (DV); proximity to indels (-g), etc. To give a concrete example, the following filter seemed to work quite well for one particular dataset (human data, exomes):
bcftools filter -sLowQual -g3 -G10 \
-e'%QUAL<10 || (RPB<0.1 && %QUAL<15) || (AC<2 && %QUAL<15) || %MAX(DV)<=3 || %MAX(DV)/%MAX(DP)<=0.3' \
calls.vcf.gz
- via vcf2fq:
samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq
- via bcftools consensus:
samtools mpileup -uf ref.fa aln.bam | bcftools call -mv -Oz -o calls.vcf.gz
tabix calls.vcf.gz
cat ref.fa | bcftools consensus calls.vcf.gz > cns.fa
Obtaining the list (or number) of samples from a VCF
To get the list of samples:
bcftools query -l file.vcf.gz
To get the number of samples:
bcftools query -l file.vcf.gz | wc -l
Querying a VCF
Retrieve the CHROM, POS and FORMAT/GT field from a VCF, include only lines where ID column is annotated
bcftools query -i'%ID!="."' -f'%CHROM %POS[ %GT]\n' file.vcf.gz
Add contig or missing INFO/FORMAT lines to a header
Some bcftools commands complain if there are missing contig, INFO or FORMAT lines in the header of your VCF. One way to add them is using bcftools annotate
. Make a file with the header lines to be appended, then
bcftools annotate -h header_lines.txt file.vcf.gz
This will append the lines to your header. It will not replace or edit IDs that already exist.
Add INFO tags from a TAB/BED file
Suppose we want to create new INFO tags FOO and BAR, say FOO are strings
and BAR are integers. Create a BED file annots.bed which contains:
1 752566 752567 FooValue1 12345
1 752721 752722 FooValue2 67890
Note that the same chromosome naming convention must be used in both BED and VCF file!
Alternatively, a TAB-delimited file can used instead. In such a case, create annots.tab
with 1-based and inclusive coordinates:
1 752567 752567 FooValue1 12345
1 752722 752722 FooValue2 67890
Now we need to tell bcftools how is FOO and BAR defined. Create a header file
annots.hdr which contains:
##INFO=<ID=FOO,Number=1,Type=String,Description="Some description">
##INFO=<ID=BAR,Number=1,Type=Integer,Description="Some description">
Finally index the BED file, verify that the index works as expected and let bcftools do its job:
bgzip annots.bed
tabix -p bed annots.bed.gz
tabix annots.bed.gz 1:752567-752567 # works as expected?
bcftools -a annots.bed.gz -h annots.hdr -c CHROM,FROM,TO,FOO,BAR in.vcf.gz
or, in the case of a TAB delimited file:
bgzip annots.tab
tabix -s1 -b2 -e3 annots.tab.gz
tabix annots.bed.gz 1:752567-752567 # works as expected?
bcftools -a annots.tab.gz -h annots.hdr -c CHROM,FROM,TO,FOO,BAR in.vcf.gz