Impatient? See our Quickstart Guide
A brief overview of the theory can be found here.
The riboSeed manuscript can be found here.
Nicholas R Waters, Florence Abram, Fiona Brennan, Ashleigh Holmes, Leighton Pritchard;
riboSeed: leveraging prokaryotic genomic architecture to assemble across ribosomal regions,
Nucleic Acids Research, gky212, https://doi.org/10.1093/nar/gky212
Interested in the figures/tables/analyses in the manuscript? See the README in the scripts
dir.
riboSeed
requires an appropriate reference genome for the de fere novo assembly. We recommend using PlentyOfBugs., which simplifies this process by comparing a preliminary assembly of your isolate to existing reference genomes.
Please back up any and all data used, and work within a virtualenv.
Genome assembly gobbles RAM. If you, like me, are working on a 4gb RAM lappy, don't run riboSeed in parallel and instead run in series by using the --serialize
option. That should prevent you from running out of RAM during the final SPAdes calls.
riboSeed is an supplemental assembly refinement method to try to address the issue of multiple ribosomal regions in a genome, as these create repeats unresolvable by short read sequencing. It takes advantage of the fact that while each region is identical, the regions flanking are unique, and therefore can potentially be used to seed an assembly in such a way that rDNA regions are bridged.
For a description of each submodule, follow the links below to the readthedocs manual page.
Preprocessing
De fere novo assembly
Visualizations/assessment
snag
| extract and visualize rDNA regionsstack
| calculate coverage at rDNAs in final assemblysketch
| plot the relative rDNA regions in a handful of genomesswap
| switch questionable contigsscore
| automated scoring for rDNA assembliesspec
| speculate the nunber of rDNA operons based on assembly graph
riboSeed can be run as a docker container, as follows
docker run -it --rm -v $PWD:/data/ nickp60/riboseed:latest run -r /data/<path to reference file> -S1 /data/<path to reads> -o /data/<path to desired output>
# for instance,
docker run -it --rm -v $PWD:/data/ nickp60/riboseed:0.4.90 run -r /data/tests/references/contigs.fasta -S1 /data/tests/references/toy_reads1.fq -o /data/dockertest/
-it
helps deal with managing messages between the container and the host, and --rm
deletes the container as it exits. -v
sets volumes to allow a bridge between the container and the host.
Conda is a cross-platform, cross-language package management system. If you haven't already installed conda, follow these instructions here, and install the python3 version. Once you have that done, install riboSeed and all of its dependencies with one command:
conda create --name ribo_env riboseed
source activate ribo_env
(Note the lowercase "s")
You can also clone this repository, and run python setup.py install
.
Python requirements can be found in the requirements.txt
file.
External requirements can be found in the environment.yml
, and can be used to create a conda environment: (conda env create -f environment.yml
)
NOTE: barrnap has certain Perl requirements that may not be included on your machine. Ensure barrnap runs fine before trying ribo snag
. Or try python barrnap.
The ribo run
command orchestrates the most commonly used sequence of calls to scan
, select
, seed
, sketch
, score
, and so on.
usage: ribo run [-r reference.fasta] -c config_file [-o /output/dir/]
[-n experiment_name] [-K {bac,euk,arc,mito}] [-S 16S:23S:5S]
[--clusters str] [-C str] [-F reads_F.fq] [-R reads_R.fq]
[-S1 reads_S.fq] [-s int]
[--ref_as_contig {ignore,infer,trusted,untrusted}] [--linear]
[-j] [--score] [-l int] [-k 21,33,55,77,99,127]
[--force_kmers] [-p 21,33,55,77,99] [-d int] [--clean_temps]
[-i int] [-v {1,2,3,4,5}] [--cores int] [--memory int]
[--damn_the_torpedos] [-t {1,2,4}] [-z] [-h] [--version]
Run the riboSeed pipeline of scan, select, seed, sketch, and score. Uses a
config file to wrangle all the args not available via these commandline args.
This can either be run by providing (as minimum) a reference, some reads, and
an output directory; or, if you have a completed config file, you can run it
with just that.
optional arguments:
-r reference.fasta, --reference_fasta reference.fasta
path to a (multi)fasta or a directory containing one
or more chromosomal sequences in fasta format.
Required, unless using a config file
-c config_file, --config config_file
config file; if none given, create one; default:
/home/nicholas/GitHub/riboSeed
-o /output/dir/, --output /output/dir/
output directory; default: /home/nicholas/GitHub/riboS
eed/2018-06-14T1353_riboSeed_pipeline_results/
-n experiment_name, --experiment_name experiment_name
prefix for results files; default: inferred
-K {bac,euk,arc,mito}, --Kingdom {bac,euk,arc,mito}
whether to look for eukaryotic, archaeal, or bacterial
rDNA; default: bac
-S 16S:23S:5S, --specific_features 16S:23S:5S
colon:separated -- specific features; default:
16S:23S:5S
--clusters str number of rDNA clusters;if submitting multiple
records, must be a colon:separated list whose length
matches number of genbank records. Default is inferred
from specific feature with fewest hits
-C str, --cluster_file str
clustered_loci file output from riboSelect;this is
created by default from run_riboSeed, but if you don't
agree with the operon structure predicted by
riboSelect, you can use your alternate clustered_loci
file. default: None
-F reads_F.fq, --fastq1 reads_F.fq
path to forward fastq file, can be compressed
-R reads_R.fq, --fastq2 reads_R.fq
path to reverse fastq file, can be compressed
-S1 reads_S.fq, --fastq_single1 reads_S.fq
path to single fastq file
-s int, --score_min int
If using smalt, this sets the '-m' param; default with
smalt is inferred from read length. If using BWA,
reads mapping with ASscore lower than this will be
rejected; default with BWA is half of read length
--ref_as_contig {ignore,infer,trusted,untrusted}
ignore: reference will not be used in subassembly.
trusted: SPAdes will use the seed sequences as a
--trusted-contig; untrusted: SPAdes will treat as
--untrusted-contig. infer: if mapping percentage over
80%, 'trusted'; else 'untrusted'. See SPAdes docs for
details. default: infer
--linear if genome is known to not be circular and a region of
interest (including flanking bits) extends past
chromosome end, this extends the seqence past
chromosome origin forward by --padding; default: False
--subassembler {spades,skesa}
assembler to use for subassembly scheme. SPAdes is
used by default, but Skesa is a new addition that
seems to work for subassembly and is faster
-j, --just_seed Don't do an assembly, just generate the long read
'seeds'; default: False
--score run riboScore too! default: False
-l int, --flanking_length int
length of flanking regions, in bp; default: 1000
-k 21,33,55,77,99,127, --kmers 21,33,55,77,99,127
kmers used for final assembly, separated by commas
such as21,33,55,77,99,127. Can be set to 'auto', where
SPAdes chooses. We ensure kmers are not too big or too
close to read length; default: 21,33,55,77,99,127
--force_kmers skip checking to see if kmerchoice is appropriate to
read length. Sometimes kmers longer than reads can
help in the final assembly, as the long reads
generated by riboSeed contain kmers longer than the
read length
-p 21,33,55,77,99, --pre_kmers 21,33,55,77,99
kmers used during seeding assemblies, separated bt
commas; default: 21,33,55,77,99
-d int, --min_flank_depth int
a subassembly won't be performed if this minimum depth
is not achieved on both the 3' and5' end of the
pseudocontig. default: 0
--clean_temps if --clean_temps, mapping files will be removed once
they are no no longer needed during the mapping
iterations to save space; default: False
-i int, --iterations int
if iterations>1, multiple seedings will occur after
subassembly of seed regions; if setting --target_len,
seedings will continue until --iterations are
completed or --target_len is matched or exceeded;
default: 3
-v {1,2,3,4,5}, --verbosity {1,2,3,4,5}
Logger writes debug to file in output dir; this sets
verbosity level sent to stderr. 1 = debug(), 2 =
info(), 3 = warning(), 4 = error() and 5 = critical();
default: 2
--cores int cores used; default: None
--memory int cores for multiprocessing; default: 8
--damn_the_torpedos Ignore certain errors, full speed ahead!
-t {1,2,4}, --threads {1,2,4}
if your cores are hyperthreaded, set number threads to
the number of threads per processer.If unsure, see
'cat /proc/cpuinfo' under 'cpu cores', or 'lscpu'
under 'Thread(s) per core'.: 1
-z, --serialize if --serialize, runs seeding and assembly without
multiprocessing. We recommend this for machines with
less than 8GB RAM: False
-h, --help Displays this help message
--version show program's version number and exit
Pull requests are more than welcome!
You may run into issues where you get an error about "Unable to connect to X server: None" or localhost:N. Sorry about that; any tips would be useful; a quick glance at the commit history will show I have spent much time trying to resolve it, without any luck. If you do run into this, try the following:
- connect to the machine with an X session (
ssh -X hostname
) - avoid using
gnu screen
if possible, but if you do need to use it, start thescreen
session after ensuring you have a$DISPLAY
availible through starting the host session with-X
If you are on MacOS, you may run into an issue with Pysam.
ImportError: dlopen(/Users/nicholas/miniconda3/envs/ribo/lib/python3.5/site-packages/pysam/libchtslib.cpython-35m-darwin.so, 2): Library not loaded: @rpath/liblzma.5.dylib
Referenced from: /Users/nicholas/miniconda3/envs/ribo/lib/libhts.2.dylib
Reason: Incompatible library version: libhts.2.dylib requires version 8.0.0 or later, but liblzma.5.dylib provides version 6.0.0
The simplest solution is to pip instal pysam, forcing the original to be overwritten:
pip install pysam -U
In cases where this does not work, try installing by first making a conda env with the environment.yaml
file, and then installing riboSeed from pip.
conda env create -y environment.yaml
source activate ribo
pip install riboSeed
If you run into malloc issues similar to ablab/spades#9, we recommend running in a VM.
Submitting --smalt_scoring
with vastly different scoring schemes usually causes an error.
The tests for the module can be found under the tests
directory. I run them with the unittests module. The tests assume the installation of all the recommended tools.