ABySS File Formats
Sequence overlap graph in ABySS adj (adjacency) format.
See the ABySS dot file format for a description of a sequence overlap graph.
Example:
23 44 198 ; 3193- 56- [d=-23] ; 3681-
25 30 1045 ; 3983- 1794- [d=-28] 2808+ [d=-28] 3136- [d=-28] ; 2699+ 4758+
27 54 175 ; 1255+ 4657- ;
28 51 3854 ; 875+ 3725- ; 1314- [d=-21]
29 73 1151 ; 3015+ ; 2199+
30 34 4896 ; 229- 4236+ [d=-26] 4060+ [d=-24] ; 2091+ 4267+
31 58 2483 ; 1454+ [d=-28] ; 3453+ [d=-28]
32 33 530 ; 2566- ; 3453+ [d=-28]
The .adj files generated by ABySS describe the sequence overlap graph at each stage of an assembly. In the sequence overlap graph, each vertex represents a sequence (e.g. a contig) and each edge represents a perfect overlap between the ends of two sequences. In most cases, the length of the sequence overlap is k - 1 bases.
An .adj file consists of 3 fields per line, separated by semicolons (';').
The first field (e.g. "28 51 3854") provides information about the subject sequence and consists of 3 parts: <SEQ_ID> <SEQ_LEN> <KMERS>, where SEQ_ID is a unique identifier for the sequence assigned by ABySS, SEQ_LEN is the length of the sequence in bases, and KMERS is the number of k-mers that mapped to the sequence during assembly (i.e. the sum of the k-mer multiplicities for each k-mer in the sequence).
The second and third fields (e.g. "3193- 56- [d=-23]", "3681-") list the SEQ_IDs of sequences that overlap the subject sequence. Each field consists of a whitespace-separated list of SEQ_IDs, each of which has a + or - suffix to indicate the sense of the sequence that produces the overlap. The +/- sense of a given sequence is determined by the form it takes in the FASTA file corresponding to the .adj file. Using the naming conventions of the ABySS output files, this correspondence should usually be clear (e.g. "myassembly-1.adj" corresponds to "myassembly-1.fa"). The sense of the sequence listed in the FASTA file is considered to be the + sense. By default, the length of the overlap between two sequences is assumed to be k - 1 bases. If this is not the case, an additional distance specifier (e.g. "[d=-23]") must be inserted following the SEQ_ID to indicate that the overlap is of a different length. A negative distance value indicates the overlap size in bases, whereas a positive distance indicates a gap size in bases.
It is important to note that the order of the second and third fields is exactly the opposite of what one would expect; the second field lists sequences that overlap the subject sequence on the right side and the third field lists sequences that overlap the subject sequence on the left side.
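The field layout described above can be sketched as a small Python parser. This is an illustrative helper, not part of ABySS: the function names are hypothetical, the value of k must be supplied by the caller (the k=32 used in the usage note is an assumption, not taken from the example file), and the sketch assumes that a distance specifier always follows the SEQ_ID it modifies.

```python
def parse_neighbours(field, default_distance):
    """Parse a whitespace-separated neighbour list such as '3193- 56- [d=-23]'."""
    neighbours = []
    for token in field.split():
        if token.startswith("[d=") and token.endswith("]"):
            # A distance specifier overrides the distance of the preceding SEQ_ID.
            seq_id, sense, _ = neighbours[-1]
            neighbours[-1] = (seq_id, sense, int(token[3:-1]))
        else:
            # The last character is the sense (+ or -); the rest is the SEQ_ID.
            neighbours.append((token[:-1], token[-1], default_distance))
    return neighbours


def parse_adj_line(line, k):
    """Parse one .adj line into a dict (hypothetical helper)."""
    head, second, third = (field.strip() for field in line.split(";"))
    seq_id, seq_len, kmers = head.split()
    default = -(k - 1)  # an overlap of k - 1 bases when no [d=...] is given
    return {
        "id": seq_id,
        "length": int(seq_len),
        "kmers": int(kmers),
        # Note the order: the second field lists right-side neighbours.
        "right": parse_neighbours(second, default),
        # The third field lists left-side neighbours.
        "left": parse_neighbours(third, default),
    }
```

For example, `parse_adj_line("23 44 198 ; 3193- 56- [d=-23] ; 3681-", k=32)` yields two right-side neighbours and one left-side neighbour, with the default distance filled in wherever no specifier is present.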
The .adj format is an ABySS-specific format for describing graphs and should preferably be replaced by the more standard .dot format.
NCBI GenBank Accessioned Golden Path
Sequence overlap graph in SGA ASQG format
ABySS can read and write ASQG files.
Tabular data in Comma-separated values format
Distance estimates in ABySS dist format (similar to adj format)
Distance estimate graph in Graphviz dot format
This file format is an extension of the ABySS dot file format for sequence overlap graphs.
Edge property | Description |
---|---|
d | Distance (bp) |
e | Estimated error (bp) |
n | Number of supporting fragments |
The estimated distance from A+ to B+ is 100 bp with an estimated error of 5 bp and is supported by 10 read pairs:
"A+" -> "B+" [d=100 e=5 n=10]
Sequence overlap graph in Graphviz dot format. The extension .gv is preferred over .dot.
As a sequence overlap graph is primarily a graph, ABySS uses an existing graph file format, the Graphviz DOT file format. The Graphviz DOT syntax is well defined and implemented by a number of existing graph tools. The following describes how ABySS represents a sequence overlap graph using a subset of the Graphviz DOT language.
- Sequences are represented by vertices and overlaps by directed edges.
- A sequence is represented by two vertices: one vertex for the sequence itself and one for its reverse complement, which are named "A+" and "A-" respectively for a sequence named "A".
- A directed edge represents an overlap of two sequences. "A+" -> "B+" indicates that a suffix of A matches a prefix of B.
Consider the following simple graph of four sequences (vertices) and three overlaps (edges).
A --> B
  \
    \
C --> D
The DOT graph is:
digraph g {
"A+" -> "B+"
"A+" -> "D+"
"C+" -> "D+"
}
When the contig A overlaps the reverse complement of B, it's shown like so:
"A+" -> "B-"
Just as every vertex has a twin, every edge has a twin. The existence of an edge (u, v) implies the existence of an edge (v-, u-), where x- denotes the reverse complement of x.
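The twin rule can be written as a two-line transformation. The sketch below uses hypothetical helper names (`flip`, `twin`) and assumes that every vertex name ends in a + or - sense character.

```python
def flip(vertex):
    """Return the reverse-complement vertex, e.g. 'A+' -> 'A-'."""
    name, sense = vertex[:-1], vertex[-1]
    if sense not in ("+", "-"):
        raise ValueError(f"vertex {vertex!r} has no +/- suffix")
    return name + ("-" if sense == "+" else "+")


def twin(edge):
    """Return the twin (v-, u-) of the edge (u, v)."""
    u, v = edge
    return (flip(v), flip(u))
```

Applying `twin` twice returns the original edge, so storing only one edge of each pair loses no information.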
The following pairs of edges are equivalent:
"A+" -> "B+" "B-" -> "A-"
"C+" -> "D-" "D+" -> "C-"
"E-" -> "F+" "F-" -> "E+"
Vertices and edges may be augmented with properties. For example, the length of the contig A is 100 bp:
"A+" [l=100]
The sequences "A+" and "B+" overlap by 100 bp.
"A+" -> "B+" [d=-100]
ABySS represents an overlap with a negative distance and a gap with a positive distance. In this manner, the length of a path through the graph is the sum of the vertex lengths and edge distances.
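As a worked illustration of this sum, the sketch below computes the length of a path from vertex lengths (the l property) and edge distances (the d property). The function name and the dictionary layout are assumptions made for the example.

```python
def path_length(path, lengths, distances):
    """Sum the vertex lengths and edge distances along a path.

    path      -- list of vertex names, e.g. ["A+", "B+", "C+"]
    lengths   -- dict mapping vertex name to its length in bp
    distances -- dict mapping (u, v) to d: negative for an overlap,
                 positive for a gap
    """
    total = sum(lengths[v] for v in path)
    total += sum(distances[(u, v)] for u, v in zip(path, path[1:]))
    return total


lengths = {"A+": 100, "B+": 80, "C+": 50}
distances = {("A+", "B+"): -31, ("B+", "C+"): 10}
print(path_length(["A+", "B+", "C+"], lengths, distances))  # 100 + 80 + 50 - 31 + 10 = 209
```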
The type (integer, float or string) of a property is evident from its format.
- [l=100] is an integer
- [c=3.5] is a float
- [s="ABC"] is a string
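A reader of the format might infer the type along these lines. This is a minimal sketch with a hypothetical function name; it assumes that string values are always double-quoted and that anything else is numeric.

```python
def parse_property_value(text):
    """Infer the type of a property value from its format."""
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        return text[1:-1]   # quoted: a string
    try:
        return int(text)    # digits with an optional sign: an integer
    except ValueError:
        return float(text)  # otherwise a float; raises ValueError if not numeric
```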
For a De Bruijn graph where all overlaps are exactly k-1 bp, a default edge property may be used. For example, both overlaps below are 31 bp:
digraph g {
edge [d=-31]
"A+" -> "B+"
"B+" -> "C+"
}
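A graph in this form could be generated with a few lines of Python. The helper below is a hypothetical sketch, not an ABySS tool; it assumes k=32 to reproduce the 31 bp overlaps shown above.

```python
def debruijn_dot(edges, k):
    """Emit a DOT graph whose default edge distance is -(k - 1)."""
    lines = ["digraph g {", f"edge [d=-{k - 1}]"]
    lines += [f'"{u}" -> "{v}"' for u, v in edges]
    lines.append("}")
    return "\n".join(lines)


print(debruijn_dot([("A+", "B+"), ("B+", "C+")], k=32))
```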
The contig sequences are stored in a separate FASTA file, which is indexed by a FASTA index (.fai) file.
Using a standard file format has the advantage of giving immediate access to existing tools. For example...
- Render the graph for visualization
  dot -Tpng in.dot -o out.png
- Count the number of vertices and edges
  gc g.dot
- Count the number of connected components
  gc -c g.dot
- Separate the connected components into separate files
  ccomps -x in.dot -o out.dot
- Remove any contigs with coverage less than 5 (where c is the coverage property)
  gvpr -i 'N[c>=5]' in.dot >out.dot
- Calculate the distance from a source vertex to all other vertices
  dijkstra "A+" in.dot >out.dot
ABySS includes a tool, abyss-todot, to convert between sequence overlap graph formats, including SGA ASQG, SAM and ABySS ADJ.
Contig and scaffold sequences in FASTA format
An index of sequences in FASTA format
Column | Description |
---|---|
1 | Name of the sequence |
2 | Length of the sequence |
3 | Offset of the first base in the file |
4 | Number of bases in each line |
5 | Number of bytes in each line |
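The five columns are enough to locate any base in the FASTA file without scanning it. The sketch below assumes tab-separated columns and 0-based base positions; the function names and the example entry are hypothetical.

```python
def parse_fai_line(line):
    """Split one index line into (name, length, offset, line_bases, line_bytes)."""
    name, length, offset, line_bases, line_bytes = line.rstrip("\n").split("\t")
    return name, int(length), int(offset), int(line_bases), int(line_bytes)


def base_offset(offset, line_bases, line_bytes, pos):
    """Byte offset in the FASTA file of the 0-based base at position pos."""
    full_lines = pos // line_bases  # complete sequence lines before pos
    return offset + full_lines * line_bytes + pos % line_bases


# Hypothetical entry: 150 bases starting at byte 6, 60 bases (61 bytes) per line.
name, length, offset, line_bases, line_bytes = parse_fai_line("seq1\t150\t6\t60\t61")
print(base_offset(offset, line_bases, line_bytes, 60))  # first base of the second line
```

The difference between columns 4 and 5 is the size of the line terminator, which is why both are needed.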
Sequence overlap graph in Graphical Fragment Assembly (GFA) format.
Sequence overlap graph in Graphviz dot format. See Dot.
A histogram of the fragment size distribution in tab-separated values format, without a header
Column | Description |
---|---|
1 | Fragment size |
2 | Count |
- Positive fragment sizes are oriented forward-reverse (FR).
- Negative fragment sizes are oriented reverse-forward (RF).
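A histogram in this layout can be summarised as follows. The function name is hypothetical, and skipping a fragment size of exactly 0 is an assumption of this sketch, since the format description only covers positive and negative sizes.

```python
def summarise_histogram(text):
    """Return (FR count, RF count, mean FR fragment size) for a headerless TSV histogram."""
    fr_count = rf_count = fr_total = 0
    for line in text.splitlines():
        if not line.strip():
            continue
        size, count = (int(value) for value in line.split("\t"))
        if size > 0:    # forward-reverse (FR) pairs
            fr_count += count
            fr_total += size * count
        elif size < 0:  # reverse-forward (RF) pairs
            rf_count += count
    mean_fr = fr_total / fr_count if fr_count else None
    return fr_count, rf_count, mean_fr
```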
Reports in Markdown format
An ABySS PATH file describes how sequences should be joined to form new sequences.
Example:
118
217 3- 140+ 178-
218 10- 148+
219 22+ 107-
220 43- 6+ 73+
221 51- 158+
222 52+ 119-
223 57- 166+ 175-
224 62- 209-
225 67- 176-
226 84- 156-
227 87- 134-
228 97- 216-
229 100+ 129+
230 102- 21- 188+
231 174- 27+ 192+
The format consists of two columns separated by a TAB character:
- ID of joined sequence
- list of sequence IDs to be joined (separated by spaces)
For each contig, the '+' orientation is the exact sequence that appears in the FASTA file, while the '-' orientation is the reverse complement of that sequence. If the line is composed of a single identifier, the specified contig is removed from the assembly.
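A line of a PATH file can be parsed as sketched below. The function name is hypothetical, and returning None for a removed contig is a convention chosen for this example.

```python
def parse_path_line(line):
    """Parse one PATH line into (joined_id, members).

    members is a list of (contig_id, orientation) tuples, or None when the
    line holds a single identifier, meaning that contig is removed.
    """
    fields = line.rstrip("\n").split("\t")
    joined_id = fields[0]
    if len(fields) == 1 or not fields[1].strip():
        return joined_id, None
    # The last character of each token is the orientation (+ or -).
    members = [(token[:-1], token[-1]) for token in fields[1].split()]
    return joined_id, members
```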
Reads aligned to contig/scaffold sequences in Sequence Alignment/Map format
Example:
@SQ SN:5105 LN:122
@SQ SN:5106 LN:92
@SQ SN:5107 LN:186
* 161 4 1 32 6S32M63S 77 1 77 * *
* 161 4 1 32 40S32M29S 215 1 134 * *
* 129 4 1 32 69S32M 358 1 50 * *
* 161 4 1 31 31M70S 1390 1 34 * *
* 161 4 1 32 6S32M63S 1390 1 46 * *
* 161 4 1 32 6S32M63S 77 1 59 * *
* 161 4 1 32 6S32M63S 77 1 55 * *
* 177 4 1 32 58S32M11S 147 1 13 * *
* 161 4 1 30 13S30M58S 77 1 60 * *
The SAM format is used by ABySS to describe alignments of reads to assembled sequences at different stages of the assembly. As of ABySS version 1.3.8, the reads are aligned to the assembled sequences twice: once during the construction of the contigs, and once during the construction of the scaffolds.
By default, ABySS omits field 1 (QNAME, the ID of the aligned sequence), field 10 (SEQ, the aligned sequence), and field 11 (QUAL, the quality string of the aligned sequence) from any generated SAM data, placing a * in these fields instead. This is done because the fields are not needed by ABySS to calculate distance estimates, and omitting the fields greatly reduces the overall size of the SAM data. (To generate SAM data that contains all of the fields, ABySS may be compiled using the --enable-samseqqual option for configure.)
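Code that consumes this SAM data therefore has to tolerate the * placeholders. The sketch below splits a record into the eleven mandatory SAM fields and decodes the CIGAR string; the function name is hypothetical, and the sketch assumes tab-separated fields as in the SAM specification.

```python
import re

SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")


def parse_sam_record(line):
    """Parse the mandatory fields of one SAM alignment line into a dict."""
    values = line.rstrip("\n").split("\t")[:11]
    record = dict(zip(SAM_FIELDS, values))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    # Decode e.g. "6S32M63S" into [(6, "S"), (32, "M"), (63, "S")].
    record["cigar_ops"] = [(int(n), op) for n, op in
                           re.findall(r"(\d+)([MIDNSHP=X])", record["CIGAR"])]
    return record


record = parse_sam_record("*\t161\t4\t1\t32\t6S32M63S\t77\t1\t77\t*\t*")
matched = sum(n for n, op in record["cigar_ops"] if op == "M")
print(record["QNAME"], matched)  # the omitted QNAME and the number of M bases
```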
The ABySS assembly pipeline does not generate any output SAM files by default, because the files tend to be very large. Instead, ABySS streams the SAM data through a Unix command pipeline to generate the distance estimates that are used to link contigs and scaffolds. Only the distance estimates are saved to disk (.dist and .dist.dot files). However, an ABySS user may force generation of output SAM files by specifying the "pe-sam" and/or "mp-sam" targets on the "abyss-pe" command line, e.g.
abyss-pe name=myassembly k=40 in='read1.fastq read2.fastq' pe-sam mp-sam scaffolds
The "pe-sam" and "mp-sam" targets will generate the files myassembly-3.sam.gz and myassembly-6.sam.gz, respectively. There are also "pe-bam" and "mp-bam" targets if the user wishes to generate equivalent BAM (compressed SAM) files. See the abyss-pe man page for more info.
Statistics of contig/scaffold contiguity in TSV, CSV and Markdown formats.
n | n:500 | L50 | LG50 | NG50 | min | N80 | N50 | N20 | E-size | max | sum | name |
---|---|---|---|---|---|---|---|---|---|---|---|---|
64 | 35 | 10 | 9 | 9644 | 720 | 4516 | 9404 | 11669 | 8519 | 15015 | 211449 | HS0674-unitigs.fa |
13 | 6 | 2 | 2 | 54743 | 8044 | 54373 | 54743 | 67480 | 51662 | 67480 | 212330 | HS0674-contigs.fa |
13 | 6 | 2 | 2 | 54743 | 8044 | 54373 | 54743 | 67480 | 51662 | 67480 | 212330 | HS0674-scaffolds.fa |
10 | 10 | 3 | 3 | 32704 | 2380 | 16561 | 32704 | 43251 | 30224 | 43251 | 212330 | HS0674-scaftigs.fa |
Each row represents the FASTA file produced at a major stage of the assembly.
Row | Description |
---|---|
Unitigs | Sequences assembled without using paired-end information |
Contigs | Sequences assembled with paired information, scaffolding over sequencing coverage gaps, but not repeats |
Scaffolds | Sequences assembled with paired information, scaffolding over sequencing coverage gaps and repeats |
Scaftigs | Scaffolds broken at any N or n character |
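The scaftig definition in the last row amounts to splitting each scaffold on runs of N or n. A minimal sketch, with a hypothetical function name:

```python
import re


def scaftigs(scaffold):
    """Break a scaffold sequence at every run of N or n characters."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]
```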
All the stats other than n are computed on sequences of the threshold size or larger. The threshold size is given in the column header n:#, and the size of the smallest sequence at least this threshold size is given in the column min. For example, if the column header is n:500, all the stats except n are computed on sequences 500 bp or larger. The sizes of the sequences count only the ACGTacgt characters and do not count the IUPAC ambiguity codes, particularly N.
Column | Description |
---|---|
n | Total number of sequences |
n:500 | Number of sequences at least 500 bp |
L50 | Number of sequences at least the N50 size |
LG50 | Number of sequences at least the NG50 size |
NG50 | At least half the genome is in sequences of the NG50 size or larger |
min | The size of the smallest sequence |
N80 | At least 80% of the assembly is in sequences of the N80 size or larger |
N50 | At least half the assembly is in sequences of the N50 size or larger |
N20 | At least 20% of the assembly is in sequences of the N20 size or larger |
E-size | The sum of the squares of the sequence sizes divided by the assembly size |
max | The size of the largest sequence |
sum | The sum of the sequence sizes |
name | The file name of the assembly |
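The column definitions above can be turned into a short calculation. The sketch below is an illustration of those definitions, not the ABySS implementation: the function names are hypothetical, the genome-size-dependent columns (NG50, LG50) are omitted, and the rounding and tie-breaking used by ABySS may differ.

```python
def acgt_length(sequence):
    """Size of a sequence, counting only the ACGTacgt characters."""
    return sum(sequence.count(base) for base in "ACGTacgt")


def nxx(sizes, fraction):
    """Largest size S such that sequences of size >= S hold at least
    the given fraction of the total."""
    target = fraction * sum(sizes)
    running = 0
    for size in sorted(sizes, reverse=True):
        running += size
        if running >= target:
            return size
    raise ValueError("no sequences")


def contiguity(sizes, threshold=500):
    """Compute contiguity statistics from a list of sequence sizes."""
    kept = [size for size in sizes if size >= threshold]
    total = sum(kept)
    n50 = nxx(kept, 0.5)
    return {
        "n": len(sizes),               # counts all sequences
        f"n:{threshold}": len(kept),   # every other stat uses kept only
        "L50": sum(1 for size in kept if size >= n50),
        "min": min(kept),
        "N80": nxx(kept, 0.8),
        "N50": n50,
        "N20": nxx(kept, 0.2),
        "E-size": sum(size * size for size in kept) / total,
        "max": max(kept),
        "sum": total,
    }


stats = contiguity([100, 600, 800, 1000, 1600])
print(stats["N50"], stats["L50"], stats["E-size"])
```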
Tabular data in tab-separated values format, including a one-line header