ABySS Users FAQ
- I am getting an error that says Kmer::setLength(unsigned int): Assertion `length <= 64' failed
- My ABySS assembly jobs hang when I run them with high k values! (e.g. k=250)
- My ABySS MPI job with a large number of processors (over 1000) is using much more memory than expected. What's up?
- My ABySS assembly fails and I get an error that says abyss-fixmate: error: All reads are mateless. This can happen when first and second read IDs do not match.
- Why do I count more contigs than abyss-fac that are larger than 500 bp?
- How much memory does ABySS use?
- What are the lower case characters in my assembly?
1. I am getting an error that says Kmer::setLength(unsigned int): Assertion `length <= 64' failed
ABySS has a compile-time parameter that sets the maximum value of k. As of ABySS 2.0.0, the maximum k value is 128 by default. To run assemblies with higher k values, you must compile ABySS from source and pass the --enable-maxk option during the configure step, i.e.
$ ./configure --enable-maxk=192
$ make
$ make install
The value of --enable-maxk should be a multiple of 32. ABySS needs to know the maximum value of k so that it can minimize the amount of memory it uses to represent the de Bruijn graph. If memory usage is not a concern, you may set --enable-maxk as high as you like.
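For example, the smallest valid maxk for a desired k can be computed by rounding up to the next multiple of 32. A small sketch (the k value below is just an illustration):

```shell
# Round a desired k up to the next multiple of 32 to pick the
# smallest valid --enable-maxk (k=150 is an example; use your own).
k=150
maxk=$(( (k + 31) / 32 * 32 ))
echo "configure with: ./configure --enable-maxk=$maxk"   # --enable-maxk=160
```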
2. My ABySS assembly jobs hang when I run them with high k values! (e.g. k=250)
The way that Open MPI handles messages changes when the message size exceeds a certain threshold called the eager send limit. In ABySS, message size depends directly on k, and when the eager send limit is exceeded, assembly jobs will deadlock.
The best workaround for this problem is to set the eager send limit explicitly. This can be done by setting an environment variable called mpirun in your cluster job script.
Example:
#!/bin/sh
PATH=/home/joe/abyss-1.3.7/maxk_96/bin:$PATH
export mpirun='mpirun --mca btl_sm_eager_limit 16000 --mca btl_openib_eager_limit 16000'
abyss-pe k=96 name=assembly in='read1.fastq read2.fastq'
The values of btl_sm_eager_limit and btl_openib_eager_limit are in bytes, and it is usually fine to set them both to the same value. The formula for determining an appropriate value is:
eager_limit >= (max_k/4 + 32) * 100
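As a sketch, this formula can be evaluated in the job script itself so that the limit tracks the k value you compiled for (maxk=96 mirrors the example above; any value at or above the formula's result should work):

```shell
# Compute an eager send limit from (max_k/4 + 32) * 100 (bytes)
# and pass it to both the sm and openib BTLs.
maxk=96
eager_limit=$(( (maxk / 4 + 32) * 100 ))
echo "eager send limit: $eager_limit bytes"   # 5600 for maxk=96
export mpirun="mpirun --mca btl_sm_eager_limit $eager_limit --mca btl_openib_eager_limit $eager_limit"
```

Note that the 16000 used in the example script above also satisfies the inequality for maxk=96.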
3. My ABySS MPI job with a large number of processors (over 1000) is using much more memory than expected. What's up?
The default parameters of Open MPI allocate a large amount of memory to communication buffers. The following options will reduce the amount of memory allocated to buffers.
mpirun --mca btl_openib_receive_queues X,128,256,192,128:X,4096,256,128,32:X,12288,256,128,32:X,65536,256,128,3
4. My ABySS assembly fails and I get an error that says abyss-fixmate: error: All reads are mateless. This can happen when first and second read IDs do not match.
During the contig and scaffold stages of an assembly, ABySS aligns the paired end reads to the sequences that have been assembled so far (e.g. unitigs), so that it can link them into larger sequences (e.g. contigs). In order to be able to do this, ABySS needs to be able to correctly match up reads that belong to the same pair. If you are seeing this error, please check that either
- Both reads from a pair have identical FASTQ IDs (the first word of the line beginning with @), OR
- Both reads from a pair have identical FASTQ IDs followed by /1 and /2, respectively.
It is not actually required for the sequences in the read 1 and read 2 files to be sorted in the same order, but it is strongly recommended because it reduces the memory usage of abyss-fixmate. (In the majority of cases, the sequences in the read 1 and read 2 files will already be sorted in the same order anyway.)
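A quick pre-flight check along these lines is easy to script. This sketch writes two tiny example records inline (substitute your real read files) and compares the first read IDs after stripping any /1 or /2 suffix:

```shell
# Example records; replace these two files with your real read files.
printf '@read1/1\nACGT\n+\nIIII\n' > reads_1.fastq
printf '@read1/2\nTGCA\n+\nIIII\n' > reads_2.fastq

# Take the first word of the @ line and strip a trailing /1 or /2.
id1=$(head -n1 reads_1.fastq | cut -d' ' -f1 | sed 's,/[12]$,,')
id2=$(head -n1 reads_2.fastq | cut -d' ' -f1 | sed 's,/[12]$,,')

if [ "$id1" = "$id2" ]; then
    echo "first read pair IDs match: $id1"
else
    echo "ID mismatch: $id1 vs $id2" >&2
fi
```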
5. Why do I count more contigs than abyss-fac that are larger than 500 bp?
abyss-fac does not count Ns toward the 500 bp, whereas samtools faidx counts all symbols. See the ABySS stats file format.
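To see the difference concretely, consider a hypothetical 510 bp contig that contains 20 Ns: counting all symbols gives 510 bp (over the cutoff), but counting only non-N bases gives 490 bp (under it):

```shell
# Build a toy 510 bp sequence: 490 A's followed by 20 N's.
seq_str=$(printf 'A%.0s' $(seq 1 490); printf 'N%.0s' $(seq 1 20))
total=${#seq_str}                                  # length with Ns
non_n=$(printf '%s' "$seq_str" | tr -d 'N' | wc -c) # length without Ns
echo "total: $total bp, non-N: $((non_n)) bp"       # total: 510 bp, non-N: 490 bp
```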
My ABySS MPI job crashes with a segmentation fault. What can I do?
With Open MPI 3.x, you may see a segmentation fault similar to this one:
[hpce705:162958] *** Process received signal ***
[hpce705:162958] Signal: Segmentation fault (11)
[hpce705:162958] Signal code: (128)
[hpce705:162958] Failing at address: (nil)
[hpce705:162958] [ 0] /gsc/btl/linuxbrew/lib/libc.so.6(+0x33070)[0x7f7b4c627070]
[hpce705:162958] [ 1] /gsc/btl/linuxbrew/Cellar/open-mpi/3.1.0/lib/openmpi/mca_btl_vader.so(+0x4bde)[0x7f7b40b8fbde]
[hpce705:162958] [ 2] /gsc/btl/linuxbrew/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f7b4be5f60c]
[hpce705:162958] [ 3] /gsc/btl/linuxbrew/lib/libmpi.so.40(PMPI_Request_get_status+0x74)[0x7f7b4d2abb54]
[hpce705:162958] [ 4] ABYSS-P[0x40dcec]
Try using Open MPI 2.1.3. If that crashes as well, try using the shared-memory (sm) BTL rather than the default vader BTL by adding mpirun='/path/to/openmpi-2.1.3/mpirun --mca btl self,sm' to your abyss-pe command line.
6. How much memory does ABySS use?
The most memory-intensive step of ABySS is the initial de Bruijn graph assembly. Bloom filter ABySS (abyss-bloom-dbg) uses the amount of memory specified by the -b, --bloom-size option, plus some overhead.
Hash table ABySS (ABYSS and ABYSS-P) uses (8 + maxk/4) · n bytes of RAM, plus some overhead, where n is the number of distinct k-mers. You may use ntCard to count the number of distinct k-mers in the data set, which is reported as F0. For example, if ntCard reports F0 3000000000, then:
(8 + maxk/4) · n = (8 + 128/4) · 3e9 = 120 GB of RAM
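This arithmetic is easy to script so the estimate follows your compiled maxk and the F0 you measured (the values below mirror the example above; note this relies on 64-bit shell arithmetic):

```shell
# Estimate hash-table ABySS memory: (8 + maxk/4) * n bytes,
# where n is the distinct k-mer count (F0 from ntCard).
maxk=128
f0=3000000000
bytes=$(( (8 + maxk / 4) * f0 ))
echo "estimated RAM: $(( bytes / 1000000000 )) GB"   # estimated RAM: 120 GB
```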
7. What are the lower case characters in my assembly?
Lower case characters represent positions where ABySS is unsure of the precise sequence at that location. The uncertainty could be due to heterozygous sequence or a collapsed repeat. Polishing your assembly using a tool such as Pilon, Racon, ntEdit or Unicycler-polish will refine the sequence in these uncertain loci.