diff --git a/README.md b/README.md index acf6aff..c612247 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ -# phyloFlash v3.4 +# phyloFlash [![GitHub (pre-)release](https://img.shields.io/github/release/HRGV/phyloflash/all.svg?label=Latest%20Version)]() [![Bioconda](https://img.shields.io/conda/vn/Bioconda/phyloFlash.svg)](https://bioconda.github.io/recipes/phyloflash/README.html) @@ -12,18 +12,14 @@ by Harald Gruber-Vodicka, Elmar A. Pruesse, and Brandon Seah. phylogenetic composition of an Illumina (meta)genomic or transcriptomic dataset. **[Manual](https://hrgv.github.io/phyloFlash)** -***NOTE*** Version 3 changed some input options and also how mapping-based taxa -(NTUs) are handled. Please download the last release of v2.0 ([tar.gz -archive](https://github.com/HRGV/phyloFlash/archive/v2.0-beta6.tar.gz)) for the -old implementation. No changes have been made to the database setup, so -databases prepared for v2.0 can still be used for v3.0. - Read [our paper](https://doi.org/10.1128/mSystems.00920-20) on phyloFlash. + ## Quick-start ### Download via Conda +We recommend installing phyloFlash and its dependencies using Conda or Mamba. [Conda](https://conda.io/docs/) is a package manager that will also install dependencies that are required if you don't have them already. @@ -31,15 +27,16 @@ phyloFlash is distributed through the [Bioconda](http://bioconda.github.io/) channel on Conda. According to the [Conda documentation](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html), -it is recommended to install all packages at the same time to avoid dependency -conflicts, and to create new environments instead of installing to the base -environment. +avoid installing new packages to your base environment but create new +environments for them as required. Also, specify all desired packages at the +same time when creating a new environment, instead of adding them sequentially, +to avoid dependency conflicts. 
-We also recommend using [Mamba](https://mamba.readthedocs.io/en/latest/) as a +We also suggest using [Mamba](https://mamba.readthedocs.io/en/latest/) as a drop-in substitute for Conda. It implements a more effective dependency solver -and is also the default Conda frontend for the pipeline managers Snakemake. -Conda sometimes fails to solve the environment, and in these cases Mamba -usually works. +and is also the default Conda frontend for the pipeline manager Snakemake. +Simply replace `conda` with `mamba` in the commands below. Note that the +`defaults` channel should be enabled. ```bash # If you haven't set up Bioconda already @@ -49,39 +46,58 @@ conda config --add channels conda-forge # Try the following step if "solving environment" does not terminate conda config --set channel_priority strict # Create new environment named "pf" with phyloflash -# Sortmerna is an optional dependency -conda create -n pf phyloflash sortmerna=2.1b -# If Conda is unable to solve the environment; requires mamba in base env -mamba create -n pf phyloflash sortmerna=2.1b +conda create -n pf phyloflash +# Activate environment +conda activate pf +# Check that dependencies all installed properly +phyloFlash.pl -check_env ``` -### Download from GitHub -If you prefer not to use Conda, or are interested in a specific version that is -not distributed there, you can download releases from the -[releases](https://github.com/HRGV/phyloFlash/releases) page on GitHub. +### Download pre-formatted database -If you clone the repository directly off GitHub you might end up with a version -that is still under development. 
+Pre-formatted databases derived from SILVA releases 138 onwards are available +from the following Zenodo archives: -```bash -# Download latest release -wget https://github.com/HRGV/phyloFlash/archive/pf3.4.tar.gz -tar -xzf pf3.4.tar.gz + * [SILVA 138.1](https://doi.org/10.5281/zenodo.7892521) (latest) + * [SILVA 138](https://doi.org/10.5281/zenodo.7890453) -# Check for dependencies and install them if necessary -cd phyloFlash-pf3.4 -./phyloFlash.pl -check_env +Download, checksum, and unpack: + +```bash +wget https://zenodo.org/record/7892522/files/138.1.tar.gz # 5.5 GB download +tar -xzf 138.1.tar.gz # unpacks folder 138.1/ in the current location ``` -### Set up database and run +Specify the path to the database folder with the option `-dbhome` when running +phyloFlash (see below). -This assumes that the phyloFlash scripts are already in your path. +Older versions of the SILVA database have a more restrictive license, so we are +unable to distribute pre-formatted versions. You will have to download the +original SILVA files and run the `phyloFlash_makedb.pl` script yourself (see +Manual). + + +### Test phyloFlash with test dataset + +Test data are included with phyloFlash. The following assumes that you +installed phyloFlash to a Conda environment called `pf`, and that the database +files have been unpacked to a folder `/path/to/138.1`. By default, phyloFlash +will look for the database folder in the folder where it is installed. If it is +located somewhere else, pass its path to the `-dbhome` option. 
```bash -# Install reference database (takes some time) -phyloFlash_makedb.pl --remote +conda activate pf # If Conda environment not already activated +phyloFlash.pl -dbhome /path/to/138.1 -lib TEST -CPUs 16 \ + -read1 ${CONDA_PREFIX}/lib/phyloFlash/test_files/test_F.fq.gz \ + -read2 ${CONDA_PREFIX}/lib/phyloFlash/test_files/test_R.fq.gz \ + -almosteverything +``` + +### Example phyloFlash commands + +```bash # Run with test data and 16 processors (default is to use all processors available) phyloFlash.pl -lib TEST -CPUs 16 -read1 test_files/test_F.fq.gz -read2 test_files/test_R.fq.gz @@ -116,6 +132,7 @@ read sets. Use the `-zip` switch to compress output files into tar.gz archive, and `-log` to save run messages to a log file + ## Output phyloFlash screens metagenomic or metatranscriptomic reads for SSU rRNA @@ -133,6 +150,7 @@ Plain text and HTML-formatted reports are produced, reporting summary statistics from each run. The HTML report includes an interactive graphical summary. + ## Going further The phyloFlash suite also includes other tools for SSU rRNA-centric metagenome @@ -149,9 +167,13 @@ analyses. Run the commands without arguments to see help messages. and extract contigs connected to them. Optionally compare to phyloFlash results from the same library. + ## Manual -For further information **please refer to the [Manual](https://hrgv.github.io/phyloFlash)**. +For further information please refer to the +[Manual](https://hrgv.github.io/phyloFlash) as well as the command-line help +page `phyloFlash.pl -man`. + ## Versions and changes @@ -199,6 +221,7 @@ For further information **please refer to the [Manual](https://hrgv.github.io/ph * No change to heatmap script for comparing multiple samples * v2.0 complete rewrite + ## Contact Please report any problems to the [phyloFlash Google @@ -210,12 +233,14 @@ issue tracker. We also welcome any feedback on the software and its documentation, especially suggestions for improvement! 
+ ## Acknowledgements We thank colleagues and phyloFlash users who have contributed to phyloFlash development by testing the software, reporting bugs, and suggesting new features. + ## Citation If you use phyloFlash for a publication, please cite our paper in _mSystems_: diff --git a/docs/index.md b/docs/index.md index 0bb997a..b990bab 100644 --- a/docs/index.md +++ b/docs/index.md @@ -8,12 +8,6 @@ layout: home phylogenetic composition of an Illumina (meta)genomic or transcriptomic dataset. -***NOTE*** Version 3 changes some input options and also how mapping-based taxa -(NTUs) are handled. Please download the last release of v2.0 ([tar.gz -archive](https://github.com/HRGV/phyloFlash/archive/v2.0-beta6.tar.gz)) for the -old implementation. No changes have been made to the database setup, so -databases prepared for v2.0 can still be used for v3.0. - This manual explains how to install and use phyloFlash. Navigate from the menu bar above or the table of contents below. @@ -33,22 +27,24 @@ You may read more about the pipeline design and application in our ### Download via Conda +We recommend installing phyloFlash and its dependencies using Conda or Mamba. [Conda](https://conda.io/docs/) is a package manager that will also install -dependencies that are required if you don't have them already. phyloFlash is -distributed through the [Bioconda](http://bioconda.github.io/) channel on -Conda. +dependencies that are required if you don't have them already. + +phyloFlash is distributed through the [Bioconda](http://bioconda.github.io/) +channel on Conda. -According to the [Conda -documentation](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html), -it is recommended to install all packages at the same time to avoid dependency -conflicts, and to create new environments instead of installing to the base -environment. 
+According to the [Conda documentation](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html), +avoid installing new packages to your base environment but create new +environments for them as required. Also, specify all desired packages at the +same time when creating a new environment, instead of adding them sequentially, +to avoid dependency conflicts. -We also recommend using [Mamba](https://mamba.readthedocs.io/en/latest/) as a +We also suggest using [Mamba](https://mamba.readthedocs.io/en/latest/) as a drop-in substitute for Conda. It implements a more effective dependency solver -and is also the default Conda frontend for the pipeline managers Snakemake. -Conda sometimes fails to solve the environment, and in these cases Mamba -usually works. +and is also the default Conda frontend for the pipeline manager Snakemake. +Simply replace `conda` with `mamba` in the commands below. Note that the +`defaults` channel should be enabled. ```bash # If you haven't set up Bioconda already @@ -58,39 +54,58 @@ conda config --add channels conda-forge # Try the following step if "solving environment" does not terminate conda config --set channel_priority strict # Create new environment named "pf" with phyloflash -# sortmerna is an optional dependency -conda create -n pf phyloflash sortmerna=2.1b -# If Conda is unable to solve the environment; requires mamba in base env -mamba create -n pf phyloflash sortmerna=2.1b +conda create -n pf phyloflash +# Activate environment +conda activate pf +# Check that dependencies all installed properly +phyloFlash.pl -check_env ``` -### Download from GitHub -If you prefer not to use Conda, or are interested in a specific version that is -not distributed there, you can download releases from the -[releases](https://github.com/HRGV/phyloFlash/releases) page on GitHub. 
+### Download pre-formatted database -If you clone the repository directly off GitHub you might end up with a version -that is still under development. +Pre-formatted databases derived from SILVA releases 138 onwards are available +from the following Zenodo archives: -```bash -# Download latest release -wget https://github.com/HRGV/phyloFlash/archive/pf3.4.tar.gz -tar -xzf pf3.4.tar.gz + * [SILVA 138.1](https://doi.org/10.5281/zenodo.7892521) (latest) + * [SILVA 138](https://doi.org/10.5281/zenodo.7890453) -# Check for dependencies and install them if necessary -cd phyloFlash-pf3.4 -./phyloFlash.pl -check_env +Download, checksum, and unpack: + +```bash +wget https://zenodo.org/record/7892522/files/138.1.tar.gz # 5.5 GB download +tar -xzf 138.1.tar.gz # unpacks folder 138.1/ in the current location ``` -### Set up database and run +Specify the path to the database folder with the option `-dbhome` when running +phyloFlash (see below). + +Older versions of the SILVA database have a more restrictive license, so we are +unable to distribute pre-formatted versions. You will have to download the +original SILVA files and run the `phyloFlash_makedb.pl` script yourself (see +Manual). -This assumes that the phyloFlash scripts are already in your path. + +### Test phyloFlash with test dataset + +Test data are included with phyloFlash. The following assumes that you +installed phyloFlash to a Conda environment called `pf`, and that the database +files have been unpacked to a folder `/path/to/138.1`. By default, phyloFlash +will look for the database folder in the folder where it is installed. If it is +located somewhere else, pass its path to the `-dbhome` option. 
```bash -# Install reference database (takes some time) -phyloFlash_makedb.pl --remote +conda activate pf # If Conda environment not already activated +phyloFlash.pl -dbhome /path/to/138.1 -lib TEST -CPUs 16 \ + -read1 ${CONDA_PREFIX}/lib/phyloFlash/test_files/test_F.fq.gz \ + -read2 ${CONDA_PREFIX}/lib/phyloFlash/test_files/test_R.fq.gz \ + -almosteverything +``` + +### Example phyloFlash commands + +```bash # Run with test data and 16 processors (default is to use all processors available) phyloFlash.pl -lib TEST -CPUs 16 -read1 test_files/test_F.fq.gz -read2 test_files/test_R.fq.gz diff --git a/docs/install.md b/docs/install.md index 4dc7e92..b6886e5 100644 --- a/docs/install.md +++ b/docs/install.md @@ -4,43 +4,22 @@ title: Installation order: 1 --- -## Quick-start - -```bash -# Install via Conda -conda install sortmerna=2.1b # Only if you want to use Sortmerna (optional dependency) -conda install phyloflash -# Check for dependencies -phyloFlash.pl -check_env -# Download and set up database in current folder (takes some time) -phyloFlash_makedb.pl --remote -``` - ## 1. System requirements To use **phyloFlash** you will need a GNU/Linux system with Perl, R and Python installed. (OS X is for the brave, we have not tested this!) + ## 2. Download package ### 2.1 Download via Conda +We recommend installing phyloFlash and its dependencies using Conda or Mamba. [Conda](https://conda.io/docs/) is a package manager that will also install -dependencies that are required if you don't have them already. phyloFlash is -distributed through the [Bioconda](http://bioconda.github.io/) channel on -Conda. - -According to the [Conda -documentation](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html), -it is recommended to install all packages at the same time to avoid dependency -conflicts, and to create new environments instead of installing to the base -environment. 
- -We also recommend using [Mamba](https://mamba.readthedocs.io/en/latest/) as a -drop-in substitute for Conda. It implements a more effective dependency solver -and is also the default Conda frontend for the pipeline managers Snakemake. -Conda sometimes fails to solve the environment, and in these cases Mamba -usually works. +dependencies that are required if you don't have them already. + +phyloFlash is distributed through the [Bioconda](http://bioconda.github.io/) +channel on Conda. ```bash # If you haven't set up Bioconda already @@ -50,43 +29,45 @@ conda config --add channels conda-forge # Try the following step if "solving environment" does not terminate conda config --set channel_priority strict # Create new environment named "pf" with phyloflash -# sortmerna is an optional dependency -conda create -n pf phyloflash sortmerna=2.1b -# If Conda is unable to solve the environment; requires mamba in base env -mamba create -n pf phyloflash sortmerna=2.1b +conda create -n pf phyloflash +# Activate environment +conda activate pf +# Check that dependencies all installed properly +phyloFlash.pl -check_env ``` -In some cases, `conda install` can hang on the "Solving environment" step. This -appears to be because of ambiguities in dependency specifications in packages -on different channels (see this -[issue](https://github.com/conda/conda/issues/8197) on GitHub). Setting the -`channel_priority` to `strict` asks Conda to always pick the higher-priority -channel first when installing packages. This requires conda version to be 4.6 -and above. + * Avoid installing new packages to your base environment. Instead, create new + environments with required packages as you need them. + * Install packages to a new environment simultaneously, instead of adding them + sequentially. This will prevent dependency conflicts. + * In some cases, `conda install` can hang on the "Solving environment" step. 
+ This appears to be because of ambiguities in dependency specifications in + packages on different channels (see this + [issue](https://github.com/conda/conda/issues/8197) on GitHub). Setting the + `channel_priority` to `strict` asks Conda to always pick the higher-priority + channel first when installing packages. This requires Conda version 4.6 or + above. + * We also suggest using [Mamba](https://mamba.readthedocs.io/en/latest/) as a + drop-in substitute for Conda. It implements a more effective dependency + solver and is also the default Conda frontend for the pipeline manager + Snakemake. Simply replace `conda` with `mamba` in the commands. Note that + the `defaults` channel should be enabled. + * If you wish to use Sortmerna (optional) for extracting rRNA reads, specify + version 2.1b: `conda create -n pf_sortmerna phyloflash sortmerna=2.1b` -### 2.2 Download from GitHub - -If you prefer not to use Conda, or are interested in a specific version that is -not distributed there, you can download releases from the -[releases](https://github.com/HRGV/phyloFlash/releases) page on GitHub. -If you clone the repository directly off GitHub you might end up with a version -that is still under development. - -```bash -# Download latest release -wget https://github.com/HRGV/phyloFlash/archive/pf3.4.tar.gz -tar -xzf pf3.4.tar.gz -``` +### 2.2 Download from GitHub -Alternatively clone the latest development version with Git: +If you wish to modify the source code, you can clone the repository from GitHub: ```bash git clone https://github.com/HRGV/phyloFlash.git -ls phyloFlash +cd phyloFlash +git status ``` -## 3. Check and install prerequisites + +## 3. 
Check and install dependencies Check that dependencies are available: @@ -102,7 +83,7 @@ phyloFlash relies on the following software: - [Perl >= 5.13.2](http://www.perl.org/get.html) - [EMIRGE](https://github.com/csmiller/EMIRGE) and its dependencies - [BBmap](http://sourceforge.net/projects/bbmap/) - - [Vsearch](https://github.com/torognes/vsearch) + - [Vsearch >=2.5.0](https://github.com/torognes/vsearch) - [SPAdes](http://bioinf.spbau.ru/spades) - [Bedtools](https://github.com/arq5x/bedtools2) - [Mafft](http://mafft.cbrc.jp/alignment/software/) @@ -128,81 +109,84 @@ Within R, run the command install.packages(c("ggdendro","gtable","reshape2","ggplot2","optparse")) ``` -## 4. Setting up the reference database +## 4. Set up the reference database phyloFlash uses modified versions of the SILVA SSU database of small-subunit ribosomal RNA sequences that is maintained by the [ARB SILVA project](www.arb-silva.de). -*NOTE: The [SILVA -license](http://www.arb-silva.de/fileadmin/silva_databases/current/LICENSE.txt) -prohibits usage of the SILVA databases or parts of them within a -non-academic/commercial environment beyond a 48h test period. If you want to -use the SILVA databases with phyloFlash in a non-academic/commercial -environment please contact them at contact(at)arb-silva.de.* - -The database has to be reformatted for use by phyloFlash. This is done with the -script `phyloFlash_makedb.pl`. Known contamination sequences from cloning -vectors are removed, repeat regions which can have an adverse effect on -sequence reconstruction are masked, the database is clustered at 99% and 96% -identity to speed up mapping/searching, and finally indexed for the read -mapper. - -*NOTE: A .udb indexed database will be created with Vsearch if version v2.5.0+ -is detected. However, the file will only be readable by the user running the -database setup script. 
If you wish to make it available for other users, please -change the file permissions for the .udb file accordingly.* - -The final disk space required for the default SILVA SSU database is about 5 Gb. -An additional 5 Gb is required for the `.udb` indexed database for Vsearch -v2.5.0+. An additional 2.5 Gb is required for the SortMeRNA indexed database if -requested. - -If you wish to use SortMeRNA in addition to or instead of BBmap for filtering -rRNA reads, pass the option `--sortmerena` to `phyloFlash_makedb.pl`. This -requires `sortmerna` and `indexdb_rna` to be in your path. At the moment only -SortMeRNA v2.1b is supported. -A full description of options for the database setup can be seen with +### 4.1. Download pre-formatted database -```bash -phyloFlash_makedb.pl --help -``` +Pre-formatted databases derived from SILVA releases 138 onwards are available +from the following Zenodo archives: -### 4.1. Downloading database automatically + * [SILVA 138.1](https://doi.org/10.5281/zenodo.7892521) (latest) + * [SILVA 138](https://doi.org/10.5281/zenodo.7890453) -To create a suitable database, just run +NOTE: Prebuilt databases are not provided for SILVA versions before 138, +because these are released under different license(s) that prohibit usage of +the SILVA databases or parts of them within a non-academic/commercial +environment beyond a 48 h test period. SILVA version 138 onwards is released +under a more permissive Creative Commons Attribution 4.0 license. + +Download, checksum, and unpack (example for release 138.1): ```bash -phyloFlash_makedb.pl --remote +wget https://zenodo.org/record/7892522/files/138.1.tar.gz # 5.5 GB download +tar -xzf 138.1.tar.gz # unpacks folder 138.1/ in the current location ``` -in the directory where you unpacked phyloFlash. The script will download the -most current source databases and prepare the files required by -`phyloFlash.pl`. 
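Before unpacking, the downloaded archive can be checked against a published checksum. The following is a minimal sketch, assuming GNU `md5sum` is available and that the expected MD5 digest is copied by hand from the file listing of the Zenodo record. `verify_md5` and `EXPECTED_MD5` are hypothetical names used for illustration and are not part of phyloFlash.

```bash
# Hypothetical helper (not part of phyloFlash): succeeds only if the MD5
# digest of the file matches the expected value.
verify_md5() {
    local file="$1" expected="$2" actual
    # md5sum prints "<digest>  <filename>"; keep only the digest
    actual=$(md5sum "$file" | cut -d ' ' -f 1)
    [ "$actual" = "$expected" ]
}

# Usage (EXPECTED_MD5 is a placeholder for the digest shown on the record page):
# verify_md5 138.1.tar.gz "$EXPECTED_MD5" && tar -xzf 138.1.tar.gz
```

Chaining the check with `&&` means the archive is only unpacked when the digest matches, so a truncated or corrupted download is caught before any files are written.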
+Specify the path to the database folder with the option `-dbhome` when running +phyloFlash (see below). -*NOTE: This currently only works if you are not behind a proxy* -If you are behind a proxy and cannot download the database via the script, you -can download the current version of the SILVA database from [the SILVA -website](https://www.arb-silva.de/no_cache/download/archive/current/Exports/). -The filename should be `SILVA_XXX_SSURef_Nr99_tax_silva_trunc.fasta.gz` where -`XXX` is the current version number. You should also download the UniVec -database [from NCBI](https://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/). -Then proceed with the instructions in section 4.2 below. +### 4.2. Format database locally -### 4.2. Set up database from local copy of SILVA SSU NR99 +If you wish to use earlier versions of the SILVA database, or a custom database +file, you will have to format and index them. This is done with the script +`phyloFlash_makedb.pl`. Known contamination sequences from cloning vectors are +removed, repeat regions which can have an adverse effect on sequence +reconstruction are masked, the database is clustered at 99% and 96% identity to +speed up mapping/searching, and finally indexed for the read mapper. -If you already have a local copy of the SILVA SSU NR99 database (in Fasta -format), and the NCBI Univec database, you can supply the paths: +A full description of options for the database setup can be seen with + +```bash +phyloFlash_makedb.pl --help +``` + +Download the desired version of the SILVA SSURef NR99 database (in Fasta format) from [the SILVA +website](https://www.arb-silva.de/download/archive/), under the `Exports` subfolder of the respective release. The filename should be `SILVA_XXX_SSURef_Nr99_tax_silva_trunc.fasta.gz` where +`XXX` is the version number. 
Links to the last five releases: + * [138.1](https://www.arb-silva.de/fileadmin/silva_databases/release_138.1/Exports/SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz) + * [138](https://www.arb-silva.de/fileadmin/silva_databases/release_138/Exports/SILVA_138_SSURef_NR99_tax_silva_trunc.fasta.gz) + * [132](https://www.arb-silva.de/fileadmin/silva_databases/release_132/Exports/SILVA_132_SSURef_Nr99_tax_silva_trunc.fasta.gz) + * [128](https://www.arb-silva.de/fileadmin/silva_databases/release_128/Exports/SILVA_128_SSURef_Nr99_tax_silva_trunc.fasta.gz) + * [123.1](https://www.arb-silva.de/fileadmin/silva_databases/release_123.1/Exports/SILVA_123.1_SSURef_Nr99_tax_silva_trunc.fasta.gz) + +Also download the UniVec database [from NCBI](https://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/). + +Specify the paths to the SILVA and UniVec files with the `--silva_file` and `--univec_file` options respectively to build the database locally, as in the example below. ```bash phyloFlash_makedb.pl --univec_file /path/to/Univec --silva_file /path/to/SILVA_128_SSURef_Nr99_tax_silva_trunc.fasta.gz +# Creates a new folder ./128 ``` -By default, `phyloFlash.pl` will look in the folder where it is installed for -the subfolder with the highest SILVA version number. You can change this by -passing the `-dbhome