Nextflow workflow for automatic repeat detection, classification and masking.
This pipeline implements the MAKER recommendation for 'advanced repeat library construction'.
The only requirement is Nextflow, which can be installed on any POSIX system (Linux, Solaris, OS X) with Java 7 or 8:
curl -s https://get.nextflow.io | bash
The pipeline itself requires a great many different pieces of software. There are two ways to install all of the software packages and scripts required for this pipeline - natively, or by using docker (recommended).
Using docker is certainly the easier (and more reproducible) method. I've already built a docker image that includes almost all of the software necessary to run the pipeline. I'm unable to include the RepBase repeat database due to licensing restrictions, but you can register yourself for free (academic use only) and bundle it into a new docker image yourself:
# We don't need the tar.gz once the image is built, so we'll use a temporary directory.
cd "$(mktemp -d)"
# Download the RepBase repeat library (set RB_USERNAME and RB_PASSWORD to your RepBase username and password first)
wget --user "$RB_USERNAME" \
--password "$RB_PASSWORD" \
-O repeatmaskerlibraries.tar.gz \
http://www.girinst.org/server/RepBase/protected/repeatmaskerlibraries/RepBaseRepeatMaskerEdition-20170127.tar.gz
# Make an (almost) empty Dockerfile.
# All of the important instructions are in the nf-repeatmasking-onbuild image. You can either grab the pre-built copy from Dockerhub,
# or you can build it from the Dockerfiles/nf-repeatmasking-onbuild directory of this repository.
echo "FROM robsyme/nf-repeatmasking-onbuild" > Dockerfile
# Build a new docker image called 'robsyme/nf-repeatmasking'.
# When building, this image looks for a file called 'repeatmaskerlibraries.tar.gz' which it pulls into the image.
docker build -t robsyme/nf-repeatmasking .
If you install natively instead, you'll need the following pieces of software:
- Bioperl
- Hmmer
- MITE Hunter
- Genometools
- RepeatMasker
- Blast+ (v2.4.0)
- RepeatModeler
- R
  - ggplot2
  - dplyr
  - tidyr
  - magrittr
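For a native install, it can help to confirm that the tools are visible on your PATH before launching the pipeline. The following is a minimal sketch, not part of the pipeline itself. The function name `missing_tools` is hypothetical, and the executable names passed to it (`hmmsearch` for Hmmer, `gt` for Genometools, `blastn` for Blast+, and so on) are assumptions about a typical install, so adjust them to match your system. MITE Hunter and the R and Perl libraries are not checked here.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
# (missing_tools is a hypothetical helper, not something the pipeline provides.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed executable names for a typical install; edit this list as needed.
missing=$(missing_tools perl hmmsearch gt RepeatMasker blastn RepeatModeler Rscript)

if [ -n "$missing" ]; then
    echo "Not found on PATH:"
    echo "$missing"
else
    echo "All checked tools were found."
fi
```

The script only reports what it cannot find; it does not check versions, so you would still need to confirm that Blast+ is v2.4.0 yourself.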