SAMTOOLS Output Formats
Since SAM and BAM were not originally designed for local alignments, especially of protein sequences, this document describes Lambda's implementation of the standard.
Please see the official specification if some of the terms used here are not clear to you.
column | use in Lambda |
---|---|
QNAME | name of the query sequence, truncated at first whitespace |
FLAG | flag bits 16 (reverse strand) and 256 (secondary alignment) are implemented in a standard-conforming way |
RNAME | name of the subject sequence, truncated at first whitespace |
POS | begin position of alignment on subject sequence; begin position on original untranslated DNA sequence for TBlastN, TBlastX, end position if negative strand; begin position on protein sequence for BlastP, BlastX |
MAPQ | 255 |
CIGAR | query DNA cigar (untranslated DNA sequence for BlastX, TBlastX); * for BlastP, TBlastN; reversed if negative strand/frame |
RNEXT | * |
PNEXT | 0 |
TLEN | 0 |
SEQ | query DNA sequence (untranslated DNA sequence for BlastX, TBlastX); * for BlastP, TBlastN; reverse-complemented if negative strand/frame; see below for clipping |
QUAL | * |
OPT | see below |
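
As a rough illustration of the column layout above, the following sketch splits one tab-separated SAM line into its eleven mandatory columns plus the optional fields and decodes the two flag bits Lambda uses. The names `SamRecord` and `parse_sam_line` and the sample line are invented for this example; they are not part of Lambda.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    """The eleven mandatory SAM columns plus the optional fields."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    opt: List[str]


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line (not a header line) into its columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    return SamRecord(
        cols[0], int(cols[1]), cols[2], int(cols[3]), int(cols[4]),
        cols[5], cols[6], int(cols[7]), int(cols[8]), cols[9], cols[10],
        cols[11:],
    )


def is_reverse(rec: SamRecord) -> bool:
    """Flag bit 16 (0x10): alignment is on the reverse strand."""
    return bool(rec.flag & 16)


def is_secondary(rec: SamRecord) -> bool:
    """Flag bit 256 (0x100): secondary alignment."""
    return bool(rec.flag & 256)


# Invented sample line, only to show the shape of a record.
sample = "q1\t272\ts1\t10\t255\t*\t*\t0\t0\t*\t*\tAS:i:50"
record = parse_sam_line(sample)
print(record.rname, is_reverse(record), is_secondary(record))
```

The fields that Lambda always sets to a constant (MAPQ, RNEXT, PNEXT, TLEN, QUAL) are still parsed here so that the record stays a complete SAM line.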
Following the recommendations of the specification, the SEQ field is only written if it differs from the previous line's SEQ field. This can be changed via Lambda's command line parameter `--sam-bam-seq`, which can be set to `always` or `never` (the latter saves more space). This behaviour also applies to the `qs` tag defined below.
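
If a downstream tool needs SEQ on every line, the omitted values can be restored in a single pass. The sketch below assumes that an omitted SEQ appears as `*` in column 10 and that it always repeats the SEQ of the line directly before it; both are assumptions made for this example, and `fill_omitted_seq` is a hypothetical helper, not a Lambda feature.

```python
from typing import Iterable, Iterator


def fill_omitted_seq(lines: Iterable[str]) -> Iterator[str]:
    """Copy the last written SEQ into records whose SEQ is '*'.

    Assumption: '*' in column 10 means "same as the previous line".
    Records stay '*' if no SEQ has been written before them.
    """
    last_seq = "*"
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("@"):  # header lines pass through unchanged
            yield line
            continue
        cols = line.split("\t")
        if cols[9] == "*":
            cols[9] = last_seq
        else:
            last_seq = cols[9]
        yield "\t".join(cols)
```

Note that for BlastP and TBlastN, where SEQ is always `*`, this pass leaves the records unchanged.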
Via the `--sam-bam-clip` parameter you can choose between `hard`-clipping and `soft`-clipping. Soft-clipping results in full sequences in the SEQ and `qs` fields, while hard-clipping only shows the locally matching part. Accordingly, the CIGAR strings will contain `H` or `S` characters. Hard-clipping is the default because it takes up less space.
Please be aware that if the query sequence is translated, those DNA positions that are lost because of frame shifts or incomplete frames (at the end of a sequence) are always hard-clipped. These positions are also not represented in the protein cigar (see the `qs` tag below).
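
To see what the clipping mode means for a record, it can help to count the clipped bases in a CIGAR string and to compute how long the SEQ field should be. The sketch below relies only on the CIGAR rules of the SAM specification (`M`, `I`, `S`, `=` and `X` consume query bases that appear in SEQ, `H` does not); the function names and example strings are invented.

```python
import re
from typing import Dict, List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Turn e.g. '5H10M' into [(5, 'H'), (10, 'M')]; '*' gives []."""
    if cigar == "*":
        return []
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings with characters the pattern silently skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def clipped_bases(cigar: str) -> Dict[str, int]:
    """Number of hard-clipped and soft-clipped query bases."""
    ops = parse_cigar(cigar)
    return {
        "hard": sum(n for n, op in ops if op == "H"),
        "soft": sum(n for n, op in ops if op == "S"),
    }


def seq_length(cigar: str) -> int:
    """Length SEQ must have: hard-clipped bases are not part of SEQ."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


print(clipped_bases("5H10M2D3M4H"), seq_length("5H10M2D3M4H"))
print(clipped_bases("5S10M2D3M4S"), seq_length("5S10M2D3M4S"))
```

For the same alignment, the soft-clipped form yields a longer SEQ than the hard-clipped form, which is the space trade-off described above.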
tag | description |
---|---|
official | |
AS | bit score |
OC | query protein cigar (* for BLASTN) |
NM | edit distance (in protein space unless BLASTN) |
IH | number of matches this query has |
regarding the alignment | |
ae | expect value |
ar | raw score |
ai | % identity (in protein space unless BLASTN) |
ap | % positive (in protein space unless BLASTN) |
regarding the query sequence | |
qf | query frame |
qs | query protein sequence (* for BLASTN) |
regarding the subject sequence | |
sf | subject frame |
st | subject taxonomy ID(s) separated by ; (see Taxonomic Workflows) |
regarding all matches of this query | |
ls | lowest common ancestor scientific name (see Taxonomic Workflows) |
lt | lowest common ancestor taxonomy id (see Taxonomic Workflows) |
These tags can be specified with the command line argument `--sam-bam-tags`. If you would like to see any other tags supported, please don't hesitate to contact us.
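
Optional fields in SAM follow the generic `TAG:TYPE:VALUE` layout, so the tags listed above can be read into a dictionary with a few lines. This is a minimal sketch: the type letters used for the individual tags in the example (`i`, `f`, `Z`) and all values are assumptions for illustration and may differ from what Lambda actually writes.

```python
from typing import Dict, List, Union

TagValue = Union[int, float, str]


def parse_tags(fields: List[str]) -> Dict[str, TagValue]:
    """Parse SAM optional fields of the form TAG:TYPE:VALUE."""
    tags: Dict[str, TagValue] = {}
    for field in fields:
        # Split at most twice, because Z values may contain ':' themselves.
        tag, typ, value = field.split(":", 2)
        if typ == "i":
            tags[tag] = int(value)
        elif typ == "f":
            tags[tag] = float(value)
        else:  # A, Z, H and B are kept as plain strings in this sketch
            tags[tag] = value
    return tags


# Invented optional fields; the types are assumed, not taken from Lambda.
example = ["AS:i:120", "ae:f:1e-30", "qf:i:-2", "st:Z:9606;10090"]
tags = parse_tags(example)
taxonomy_ids = tags["st"].split(";")  # st holds IDs separated by ';'
print(tags["AS"], tags["ae"], taxonomy_ids)
```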
BAM files require all subject names to be written to the header. For SAM this is not required, so Lambda does not do it automatically, to save space (especially for protein databases this is a lot!). If you still want them with SAM, e.g. for better BAM compatibility, use the `--sam-with-refheader` option.
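
Before converting a SAM file to BAM, one way to check whether the reference header is complete is to compare the subject names used in the RNAME column with those declared in `@SQ` header lines (field `SN:`, as defined by the SAM specification). The helper below is a hypothetical sketch for that check, not part of Lambda.

```python
from typing import Iterable, Set


def missing_ref_headers(lines: Iterable[str]) -> Set[str]:
    """Return RNAME values that have no matching @SQ SN: header entry."""
    declared: Set[str] = set()
    used: Set[str] = set()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if cols[0] == "@SQ":
            for field in cols[1:]:
                if field.startswith("SN:"):
                    declared.add(field[3:])
        elif not line.startswith("@") and len(cols) >= 3 and cols[2] != "*":
            used.add(cols[2])
    return used - declared
```

An empty result means every subject referenced by an alignment is declared in the header; a non-empty result lists the names a BAM conversion would be missing.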
If anything is unclear, don't hesitate to contact me.