diff --git a/README.md b/README.md index acf6aff..c612247 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ phyloFlash logo -# phyloFlash v3.4 +# phyloFlash [![GitHub (pre-)release](https://img.shields.io/github/release/HRGV/phyloflash/all.svg?label=Latest%20Version)]() [![Bioconda](https://img.shields.io/conda/vn/Bioconda/phyloFlash.svg)](https://bioconda.github.io/recipes/phyloflash/README.html) @@ -12,18 +12,14 @@ by Harald Gruber-Vodicka, Elmar A. Pruesse, and Brandon Seah. phylogenetic composition of an Illumina (meta)genomic or transcriptomic dataset. **[Manual](https://hrgv.github.io/phyloFlash)** -***NOTE*** Version 3 changed some input options and also how mapping-based taxa -(NTUs) are handled. Please download the last release of v2.0 ([tar.gz -archive](https://github.com/HRGV/phyloFlash/archive/v2.0-beta6.tar.gz)) for the -old implementation. No changes have been made to the database setup, so -databases prepared for v2.0 can still be used for v3.0. - Read [our paper](https://doi.org/10.1128/mSystems.00920-20) on phyloFlash. + ## Quick-start ### Download via Conda +We recommend installing phyloFlash and its dependencies using Conda or Mamba. [Conda](https://conda.io/docs/) is a package manager that will also install dependencies that are required if you don't have them already. @@ -31,15 +27,16 @@ phyloFlash is distributed through the [Bioconda](http://bioconda.github.io/) channel on Conda. According to the [Conda documentation](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html), -it is recommended to install all packages at the same time to avoid dependency -conflicts, and to create new environments instead of installing to the base -environment. +avoid installing new packages to your base environment but create new +environments for them as required. Also, specify all desired packages at the +same time when creating a new environment, instead of adding them sequentially, +to avoid dependency conflicts. 
-We also recommend using [Mamba](https://mamba.readthedocs.io/en/latest/) as a +We also suggest using [Mamba](https://mamba.readthedocs.io/en/latest/) as a drop-in substitute for Conda. It implements a more effective dependency solver -and is also the default Conda frontend for the pipeline managers Snakemake. -Conda sometimes fails to solve the environment, and in these cases Mamba -usually works. +and is also the default Conda frontend for the pipeline manager Snakemake. +Simply replace `conda` with `mamba` in the commands below. Note that the +`defaults` channel should be enabled. ```bash # If you haven't set up Bioconda already @@ -49,39 +46,58 @@ conda config --add channels conda-forge # Try the following step if "solving environment" does not terminate conda config --set channel_priority strict # Create new environment named "pf" with phyloflash -# Sortmerna is an optional dependency -conda create -n pf phyloflash sortmerna=2.1b -# If Conda is unable to solve the environment; requires mamba in base env -mamba create -n pf phyloflash sortmerna=2.1b +conda create -n pf phyloflash +# Activate environment +conda activate pf +# Check that dependencies all installed properly +phyloFlash.pl -check_env ``` -### Download from GitHub -If you prefer not to use Conda, or are interested in a specific version that is -not distributed there, you can download releases from the -[releases](https://github.com/HRGV/phyloFlash/releases) page on GitHub. +### Download pre-formatted database -If you clone the repository directly off GitHub you might end up with a version -that is still under development. 
+Pre-formatted databases derived from SILVA releases 138 onwards are available +from the following Zenodo archives: -```bash -# Download latest release -wget https://github.com/HRGV/phyloFlash/archive/pf3.4.tar.gz -tar -xzf pf3.4.tar.gz + * [SILVA 138.1](https://doi.org/10.5281/zenodo.7892521) (latest) + * [SILVA 138](https://doi.org/10.5281/zenodo.7890453) -# Check for dependencies and install them if necessary -cd phyloFlash-pf3.4 -./phyloFlash.pl -check_env +Download, checksum, and unpack: + +```bash +wget https://zenodo.org/record/7892522/files/138.1.tar.gz # 5.5 GB download +tar -xzf 138.1.tar.gz # unpacks folder 138.1/ in the current location ``` -### Set up database and run +Specify path to the database folder with the option `-dbhome` when running +phyloFlash (see below). -This assumes that the phyloFlash scripts are already in your path. +Older versions of the SILVA database have a more restrictive license, so we are +unable to distribute pre-formatted versions. You will have to download the +original SILVA files and run the `phyloFlash_makedb.pl` script yourself (see +Manual). + + +### Test phyloFlash with test dataset + +Test data are included with phyloFlash. The following assumes that you +installed phyloFlash to a Conda environment called `pf`, and that the database +files have been unpacked to a folder `/path/to/138.1`. By default, phyloFlash +will look for the database folder in the folder where it is installed. If it is +located somewhere else, specify this to the `-dbhome` option. 
```bash -# Install reference database (takes some time) -phyloFlash_makedb.pl --remote +conda activate pf # If Conda environment not already activated +phyloFlash.pl -dbhome /path/to/138.1 -lib TEST -CPUs 16 \ + -read1 ${CONDA_PREFIX}/lib/phyloFlash/test_files/test_F.fq.gz \ + -read2 ${CONDA_PREFIX}/lib/phyloFlash/test_files/test_R.fq.gz \ + -almosteverything +``` + +### Example phyloFlash commands + +```bash # Run with test data and 16 processors (default is to use all processors available) phyloFlash.pl -lib TEST -CPUs 16 -read1 test_files/test_F.fq.gz -read2 test_files/test_R.fq.gz @@ -116,6 +132,7 @@ read sets. Use the `-zip` switch to compress output files into tar.gz archive, and `-log` to save run messages to a log file + ## Output phyloFlash screens metagenomic or metatranscriptomic reads for SSU rRNA @@ -133,6 +150,7 @@ Plain text and HTML-formatted reports are produced, reporting summary statistics from each run. The HTML report includes an interactive graphical summary. + ## Going further The phyloFlash suite also includes other tools for SSU rRNA-centric metagenome @@ -149,9 +167,13 @@ analyses. Run the commands without arguments to see help messages. and extract contigs connected to them. Optionally compare to phyloFlash results from the same library. + ## Manual -For further information **please refer to the [Manual](https://hrgv.github.io/phyloFlash)**. +For further information please refer to the +[Manual](https://hrgv.github.io/phyloFlash) as well as the command-line help +page `phyloFlash.pl -man`. + ## Versions and changes @@ -199,6 +221,7 @@ For further information **please refer to the [Manual](https://hrgv.github.io/ph * No change to heatmap script for comparing multiple samples * v2.0 complete rewrite + ## Contact Please report any problems to the [phyloFlash Google @@ -210,12 +233,14 @@ issue tracker. We also welcome any feedback on the software and its documentation, especially suggestions for improvement! 
+ ## Acknowledgements We thank colleagues and phyloFlash users who have contributed to phyloFlash development by testing the software, reporting bugs, and suggesting new features. + ## Citation If you use phyloFlash for a publication, please cite our paper in _mSystems_: diff --git a/docs/index.md b/docs/index.md index 0bb997a..b990bab 100644 --- a/docs/index.md +++ b/docs/index.md @@ -8,12 +8,6 @@ layout: home phylogenetic composition of an Illumina (meta)genomic or transcriptomic dataset. -***NOTE*** Version 3 changes some input options and also how mapping-based taxa -(NTUs) are handled. Please download the last release of v2.0 ([tar.gz -archive](https://github.com/HRGV/phyloFlash/archive/v2.0-beta6.tar.gz)) for the -old implementation. No changes have been made to the database setup, so -databases prepared for v2.0 can still be used for v3.0. - This manual explains how to install and use phyloFlash. Navigate from the menu bar above or the table of contents below. @@ -33,22 +27,24 @@ You may read more about the pipeline design and application in our ### Download via Conda +We recommend installing phyloFlash and its dependencies using Conda or Mamba. [Conda](https://conda.io/docs/) is a package manager that will also install -dependencies that are required if you don't have them already. phyloFlash is -distributed through the [Bioconda](http://bioconda.github.io/) channel on -Conda. +dependencies that are required if you don't have them already. + +phyloFlash is distributed through the [Bioconda](http://bioconda.github.io/) +channel on Conda. -According to the [Conda -documentation](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html), -it is recommended to install all packages at the same time to avoid dependency -conflicts, and to create new environments instead of installing to the base -environment. 
+According to the [Conda documentation](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html), +avoid installing new packages to your base environment but create new +environments for them as required. Also, specify all desired packages at the +same time when creating a new environment, instead of adding them sequentially, +to avoid dependency conflicts. -We also recommend using [Mamba](https://mamba.readthedocs.io/en/latest/) as a +We also suggest using [Mamba](https://mamba.readthedocs.io/en/latest/) as a drop-in substitute for Conda. It implements a more effective dependency solver -and is also the default Conda frontend for the pipeline managers Snakemake. -Conda sometimes fails to solve the environment, and in these cases Mamba -usually works. +and is also the default Conda frontend for the pipeline manager Snakemake. +Simply replace `conda` with `mamba` in the commands below. Note that the +`defaults` channel should be enabled. ```bash # If you haven't set up Bioconda already @@ -58,39 +54,58 @@ conda config --add channels conda-forge # Try the following step if "solving environment" does not terminate conda config --set channel_priority strict # Create new environment named "pf" with phyloflash -# sortmerna is an optional dependency -conda create -n pf phyloflash sortmerna=2.1b -# If Conda is unable to solve the environment; requires mamba in base env -mamba create -n pf phyloflash sortmerna=2.1b +conda create -n pf phyloflash +# Activate environment +conda activate pf +# Check that dependencies all installed properly +phyloFlash.pl -check_env ``` -### Download from GitHub -If you prefer not to use Conda, or are interested in a specific version that is -not distributed there, you can download releases from the -[releases](https://github.com/HRGV/phyloFlash/releases) page on GitHub. 
+### Download pre-formatted database -If you clone the repository directly off GitHub you might end up with a version -that is still under development. +Pre-formatted databases derived from SILVA releases 138 onwards are available +from the following Zenodo archives: -```bash -# Download latest release -wget https://github.com/HRGV/phyloFlash/archive/pf3.4.tar.gz -tar -xzf pf3.4.tar.gz + * [SILVA 138.1](https://doi.org/10.5281/zenodo.7892521) (latest) + * [SILVA 138](https://doi.org/10.5281/zenodo.7890453) -# Check for dependencies and install them if necessary -cd phyloFlash-pf3.4 -./phyloFlash.pl -check_env +Download, checksum, and unpack: + +```bash +wget https://zenodo.org/record/7892522/files/138.1.tar.gz # 5.5 GB download +tar -xzf 138.1.tar.gz # unpacks folder 138.1/ in the current location ``` -### Set up database and run +Specify path to the database folder with the option `-dbhome` when running +phyloFlash (see below). + +Older versions of the SILVA database have a more restrictive license, so we are +unable to distribute pre-formatted versions. You will have to download the +original SILVA files and run the `phyloFlash_makedb.pl` script yourself (see +Manual). -This assumes that the phyloFlash scripts are already in your path. + +### Test phyloFlash with test dataset + +Test data are included with phyloFlash. The following assumes that you +installed phyloFlash to a Conda environment called `pf`, and that the database +files have been unpacked to a folder `/path/to/138.1`. By default, phyloFlash +will look for the database folder in the folder where it is installed. If it is +located somewhere else, specify this to the `-dbhome` option. 
```bash -# Install reference database (takes some time) -phyloFlash_makedb.pl --remote +conda activate pf # If Conda environment not already activated +phyloFlash.pl -dbhome /path/to/138.1 -lib TEST -CPUs 16 \ + -read1 ${CONDA_PREFIX}/lib/phyloFlash/test_files/test_F.fq.gz \ + -read2 ${CONDA_PREFIX}/lib/phyloFlash/test_files/test_R.fq.gz \ + -almosteverything +``` + +### Example phyloFlash commands + +```bash # Run with test data and 16 processors (default is to use all processors available) phyloFlash.pl -lib TEST -CPUs 16 -read1 test_files/test_F.fq.gz -read2 test_files/test_R.fq.gz diff --git a/docs/install.md b/docs/install.md index 4dc7e92..b6886e5 100644 --- a/docs/install.md +++ b/docs/install.md @@ -4,43 +4,22 @@ title: Installation order: 1 --- -## Quick-start - -```bash -# Install via Conda -conda install sortmerna=2.1b # Only if you want to use Sortmerna (optional dependency) -conda install phyloflash -# Check for dependencies -phyloFlash.pl -check_env -# Download and set up database in current folder (takes some time) -phyloFlash_makedb.pl --remote -``` - ## 1. System requirements To use **phyloFlash** you will need a GNU/Linux system with Perl, R and Python installed. (OS X is for the brave, we have not tested this!) + ## 2. Download package ### 2.1 Download via Conda +We recommend installing phyloFlash and its dependencies using Conda or Mamba. [Conda](https://conda.io/docs/) is a package manager that will also install -dependencies that are required if you don't have them already. phyloFlash is -distributed through the [Bioconda](http://bioconda.github.io/) channel on -Conda. - -According to the [Conda -documentation](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html), -it is recommended to install all packages at the same time to avoid dependency -conflicts, and to create new environments instead of installing to the base -environment. 
- -We also recommend using [Mamba](https://mamba.readthedocs.io/en/latest/) as a -drop-in substitute for Conda. It implements a more effective dependency solver -and is also the default Conda frontend for the pipeline managers Snakemake. -Conda sometimes fails to solve the environment, and in these cases Mamba -usually works. +dependencies that are required if you don't have them already. + +phyloFlash is distributed through the [Bioconda](http://bioconda.github.io/) +channel on Conda. ```bash # If you haven't set up Bioconda already @@ -50,43 +29,45 @@ conda config --add channels conda-forge # Try the following step if "solving environment" does not terminate conda config --set channel_priority strict # Create new environment named "pf" with phyloflash -# sortmerna is an optional dependency -conda create -n pf phyloflash sortmerna=2.1b -# If Conda is unable to solve the environment; requires mamba in base env -mamba create -n pf phyloflash sortmerna=2.1b +conda create -n pf phyloflash +# Activate environment +conda activate pf +# Check that dependencies all installed properly +phyloFlash.pl -check_env ``` -In some cases, `conda install` can hang on the "Solving environment" step. This -appears to be because of ambiguities in dependency specifications in packages -on different channels (see this -[issue](https://github.com/conda/conda/issues/8197) on GitHub). Setting the -`channel_priority` to `strict` asks Conda to always pick the higher-priority -channel first when installing packages. This requires conda version to be 4.6 -and above. + * Avoid installing new packages to your base environment. Instead, create new + environments with required packages as you need them. + * Install packages to a new environment simultaneously, instead of adding them + sequentially. This will prevent dependency conflicts. + * In some cases, `conda install` can hang on the "Solving environment" step. 
+ This appears to be because of ambiguities in dependency specifications in + packages on different channels (see this + [issue](https://github.com/conda/conda/issues/8197) on GitHub). Setting the + `channel_priority` to `strict` asks Conda to always pick the higher-priority + channel first when installing packages. This requires Conda version + 4.6 or above. + * We also suggest using [Mamba](https://mamba.readthedocs.io/en/latest/) as a + drop-in substitute for Conda. It implements a more effective dependency + solver and is also the default Conda frontend for the pipeline manager + Snakemake. Simply replace `conda` with `mamba` in the commands. Note that + the `defaults` channel should be enabled. + * If you wish to use Sortmerna (optional) for extracting rRNA reads, specify + version 2.1b: `conda create -n pf_sortmerna phyloflash sortmerna=2.1b` -### 2.2 Download from GitHub - -If you prefer not to use Conda, or are interested in a specific version that is -not distributed there, you can download releases from the -[releases](https://github.com/HRGV/phyloFlash/releases) page on GitHub. -If you clone the repository directly off GitHub you might end up with a version -that is still under development. - -```bash -# Download latest release -wget https://github.com/HRGV/phyloFlash/archive/pf3.4.tar.gz -tar -xzf pf3.4.tar.gz -``` +### 2.2 Download from GitHub -Alternatively clone the latest development version with Git: +If you wish to modify the source code, you can clone the repository from GitHub: ```bash git clone https://github.com/HRGV/phyloFlash.git -ls phyloFlash +cd phyloFlash +git status ``` -## 3. Check and install prerequisites + +## 3. 
Check and install dependencies Check that dependencies are available: @@ -102,7 +83,7 @@ phyloFlash relies on the following software: - [Perl >= 5.13.2](http://www.perl.org/get.html) - [EMIRGE](https://github.com/csmiller/EMIRGE) and its dependencies - [BBmap](http://sourceforge.net/projects/bbmap/) - - [Vsearch](https://github.com/torognes/vsearch) + - [Vsearch >=2.5.0](https://github.com/torognes/vsearch) - [SPAdes](http://bioinf.spbau.ru/spades) - [Bedtools](https://github.com/arq5x/bedtools2) - [Mafft](http://mafft.cbrc.jp/alignment/software/) @@ -128,81 +109,84 @@ Within R, run the command install.packages(c("ggdendro","gtable","reshape2","ggplot2","optparse")) ``` -## 4. Setting up the reference database +## 4. Set up the reference database phyloFlash uses modified versions of the SILVA SSU database of small-subunit ribosomal RNA sequences that is maintained by the [ARB SILVA project](www.arb-silva.de). -*NOTE: The [SILVA -license](http://www.arb-silva.de/fileadmin/silva_databases/current/LICENSE.txt) -prohibits usage of the SILVA databases or parts of them within a -non-academic/commercial environment beyond a 48h test period. If you want to -use the SILVA databases with phyloFlash in a non-academic/commercial -environment please contact them at contact(at)arb-silva.de.* - -The database has to be reformatted for use by phyloFlash. This is done with the -script `phyloFlash_makedb.pl`. Known contamination sequences from cloning -vectors are removed, repeat regions which can have an adverse effect on -sequence reconstruction are masked, the database is clustered at 99% and 96% -identity to speed up mapping/searching, and finally indexed for the read -mapper. - -*NOTE: A .udb indexed database will be created with Vsearch if version v2.5.0+ -is detected. However, the file will only be readable by the user running the -database setup script. 
If you wish to make it available for other users, please -change the file permissions for the .udb file accordingly.* - -The final disk space required for the default SILVA SSU database is about 5 Gb. -An additional 5 Gb is required for the `.udb` indexed database for Vsearch -v2.5.0+. An additional 2.5 Gb is required for the SortMeRNA indexed database if -requested. - -If you wish to use SortMeRNA in addition to or instead of BBmap for filtering -rRNA reads, pass the option `--sortmerena` to `phyloFlash_makedb.pl`. This -requires `sortmerna` and `indexdb_rna` to be in your path. At the moment only -SortMeRNA v2.1b is supported. -A full description of options for the database setup can be seen with +### 4.1. Download pre-formatted database -```bash -phyloFlash_makedb.pl --help -``` +Pre-formatted databases derived from SILVA releases 138 onwards are available +from the following Zenodo archives: -### 4.1. Downloading database automatically + * [SILVA 138.1](https://doi.org/10.5281/zenodo.7892521) (latest) + * [SILVA 138](https://doi.org/10.5281/zenodo.7890453) -To create a suitable database, just run +NOTE: Prebuilt databases are not provided for SILVA versions before 138, +because these are released under different license(s) that prohibit usage of +the SILVA databases or parts of them within a non-academic/commercial +environment beyond a 48 h test period. SILVA version 138 onwards is released +under a more permissive Creative Commons Attribution 4.0 license. + +Download, checksum, and unpack (example for release 138.1): ```bash -phyloFlash_makedb.pl --remote +wget https://zenodo.org/record/7892522/files/138.1.tar.gz # 5.5 GB download +tar -xzf 138.1.tar.gz # unpacks folder 138.1/ in the current location ``` -in the directory where you unpacked phyloFlash. The script will download the -most current source databases and prepare the files required by -`phyloFlash.pl`. 
+Specify path to the database folder with the option `-dbhome` when running +phyloFlash (see below). -*NOTE: This currently only works if you are not behind a proxy* -If you are behind a proxy and cannot download the database via the script, you -can download the current version of the SILVA database from [the SILVA -website](https://www.arb-silva.de/no_cache/download/archive/current/Exports/). -The filename should be `SILVA_XXX_SSURef_Nr99_tax_silva_trunc.fasta.gz` where -`XXX` is the current version number. You should also download the UniVec -database [from NCBI](https://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/). -Then proceed with the instructions in section 4.2 below. +### 4.2. Format database locally -### 4.2. Set up database from local copy of SILVA SSU NR99 +If you wish to use earlier versions of the SILVA database, or a custom database +file, you will have to format and index them. This is done with the script +`phyloFlash_makedb.pl`. Known contamination sequences from cloning vectors are +removed, repeat regions which can have an adverse effect on sequence +reconstruction are masked, the database is clustered at 99% and 96% identity to +speed up mapping/searching, and finally indexed for the read mapper. -If you already have a local copy of the SILVA SSU NR99 database (in Fasta -format), and the NCBI Univec database, you can supply the paths: +A full description of options for the database setup can be seen with + +```bash +phyloFlash_makedb.pl --help +``` + +Download the desired version of the SILVA SSURef NR99 database from [the SILVA +website](https://www.arb-silva.de/download/archive/) (in Fasta format) under the `Exports` subfolder of the respective release. The filename should be `SILVA_XXX_SSURef_Nr99_tax_silva_trunc.fasta.gz` where +`XXX` is the version number. 
Links to the last five releases: + * [138.1](https://www.arb-silva.de/fileadmin/silva_databases/release_138.1/Exports/SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz) + * [138](https://www.arb-silva.de/fileadmin/silva_databases/release_138/Exports/SILVA_138_SSURef_NR99_tax_silva_trunc.fasta.gz) + * [132](https://www.arb-silva.de/fileadmin/silva_databases/release_132/Exports/SILVA_132_SSURef_Nr99_tax_silva_trunc.fasta.gz) + * [128](https://www.arb-silva.de/fileadmin/silva_databases/release_128/Exports/SILVA_128_SSURef_Nr99_tax_silva_trunc.fasta.gz) + * [123.1](https://www.arb-silva.de/fileadmin/silva_databases/release_123.1/Exports/SILVA_123.1_SSURef_Nr99_tax_silva_trunc.fasta.gz) + +Also download the UniVec database [from NCBI](https://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/). + +Specify the paths to the SILVA and UniVec files with the `--silva_file` and `--univec_file` options respectively to build the database locally, as in the example below. ```bash phyloFlash_makedb.pl --univec_file /path/to/Univec --silva_file /path/to/SILVA_128_SSURef_Nr99_tax_silva_trunc.fasta.gz +# Creates a new folder ./128 ``` -By default, `phyloFlash.pl` will look in the folder where it is installed for -the subfolder with the highest SILVA version number. You can change this by -passing the `-dbhome ` switch to phyloFlash.pl or by modifying the -`DBHOME` variable in `phyloFlash.pl`. + + * A new folder containing the database files will be created. The folder name + will correspond to the SILVA release number and is parsed from the input + file name (which should follow the SILVA file naming convention exactly). + * The `--remote` option is no longer supported. + * If you wish to use SortMeRNA in addition to or instead of BBmap for + filtering rRNA reads, pass the option `--sortmerna` to + `phyloFlash_makedb.pl`. This requires `sortmerna` and `indexdb_rna` to be in + your path. At the moment only SortMeRNA v2.1b is supported. 
+ * When you run the main `phyloFlash.pl` script, it will by default look in the + folder where it is installed for the subfolder with the highest SILVA + version number. You can change this by specifying the path with the + `-dbhome` option in `phyloFlash.pl`. + ### 4.3. Set up a custom database with your own sequences diff --git a/docs/usage.md b/docs/usage.md index ab29f13..b5cd688 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -14,6 +14,7 @@ phyloFlash.pl -man # Manual page in pager Running `phyloFlash.pl` without arguments will show the basic help message. + ## 1. Basic usage To screen paired-end 100 bp read files named `reads_F.fq.gz` and @@ -57,6 +58,7 @@ data input. Use to test setup. `-outfiles` Show detailed list of output and temporary files and exit. + ### 2.1. Standard input arguments `-lib LIBNAME` Library name to use as a filename prefix for the output files @@ -75,6 +77,7 @@ omitted, phyloFlash will run in *experimental* single-end mode. `-readlength N` Set expected readlength (between 50 and 500). Always use if your read length differs from 100. Default: 100. + ### 2.2. Performance-related `-CPUs N` Number of threads to use. Defaults to all available CPU cores. @@ -87,6 +90,7 @@ reads, and use values below 1000000. Default: unlimited. emirge_amplicon.py. This feature is not reliable as emirge_amplicon.py has been problematic to run (use values >100000). Default: 500000. + ### 2.3. Customizing the run `-skip_spades` Do not use SPAdes to assemble full-length sequences from @@ -130,6 +134,7 @@ SILVA version number. rRNA sequences. The SSU sequences will be extracted with Barrnap, and the input read files will be screened against these extracted "trusted" SSU sequences + ### 2.4. 
Localization and compatibility options `-crlf` Use CRLF as the line terminator in CSV output, to be RFC4180 compliant @@ -138,6 +143,7 @@ read files will be screened against these extracted "trusted" SSU sequences `-decimalcomma` Use decimal comma instead of decimal point to fix locale problems for some European systems (Default: Off) + ### 2.5. Configuring output `-html` Produce an HTML-formatted version of the report file. This helps @@ -153,7 +159,8 @@ although it is free to use. (Default: Off) `-log` Write status messages printed to STDERR also to a log file (Default: Off) -`-zip` Compress output into a tar.gz archive file (Default: Off) +`-zip` Compress output into a tar.gz archive file. Overridden by +`-almosteverything` and `-everything` (Default: Off) `-keeptmp` Keep temporary/intermediate files (Default: Off) @@ -163,6 +170,7 @@ without defaults and any local settings must still be specified. Equivalent to `-almosteverything` Like `-everything` except without `-emirge` + ## 3. Testing phyloFlash You will find test data in the `test_files` folder. The test data provided @@ -174,6 +182,7 @@ these files: phyloFlash.pl -lib TEST -read1 test_files/test_F.fq.gz -read2 test_files/test_R.fq.gz ``` + ## 4. Expected performance 10 million 100 bp paired-end-reads of a metagenomic library are processed in diff --git a/phyloFlash.pl b/phyloFlash.pl index fa5d558..2fa355d 100755 --- a/phyloFlash.pl +++ b/phyloFlash.pl @@ -249,7 +249,8 @@ =head2 OUTPUT OPTIONS =item -zip -Compress output into a tar.gz archive file +Compress output into a tar.gz archive file. Overridden by I<-almosteverything> +or I<-everything>. 
Default: Off ("-nozip") @@ -346,7 +347,7 @@ =head1 COPYRIGHT AND LICENSE # (0 will be turned into "\n" in parsecmdline) # default database names for EMIRGE and Vsearch my $emirge_db = "SILVA_SSU.noLSU.masked.trimmed.NR96.fixed"; -my $vsearch_db = "SILVA_SSU.noLSU.masked.trimmed"; +my $vsearch_db = "SILVA_SSU.noLSU.masked.trimmed.udb"; my $sortmerna_db = $emirge_db; my $ins_used = "SE mode!"; # Report insert size used by EMIRGE @@ -420,7 +421,7 @@ sub check_dbhome { my $dbhome = shift; my @required_list = ('ref/genome/1/summary.txt', $emirge_db.".fasta", - $vsearch_db.".fasta"); + $vsearch_db); push @required_list, ("$sortmerna_db.bursttrie_0.dat","$sortmerna_db.acc2taxstring.hashimage") if ($use_sortmerna == 1); foreach (@required_list) { return "${dbhome}/$_" unless -r "${dbhome}/$_" @@ -2239,8 +2240,6 @@ sub vsearch_best_match { $outfiles{"all_final_fasta"}{"made"}++; if (-s $outfiles{"all_final_fasta"}{"filename"}) { - # Check whether UDB file can be used - my $vsearch_ver_check = check_vsearch_version(); my @vsearch_args = ("-usearch_global", $outfiles{"all_final_fasta"}{"filename"}, "-id 0.7", "-userout", $outfiles{"vsearch_csv"}{"filename"}, @@ -2249,12 +2248,8 @@ sub vsearch_best_match { "--strand plus --notrunclabels", "-notmatched", $outfiles{"notmatched_fasta"}{"filename"}, "-dbmatched", $outfiles{"dbhits_all_fasta"}{"filename"}, + "-db ${DBHOME}/${vsearch_db}", ); - if (defined $vsearch_ver_check) { - push @vsearch_args, "-db ${DBHOME}/${vsearch_db}.udb"; - } else { - push @vsearch_args, "-db ${DBHOME}/${vsearch_db}.fasta"; - } # Run Vsearch run_prog("vsearch", join(" ", @vsearch_args), diff --git a/phyloFlash_makedb.pl b/phyloFlash_makedb.pl index 294d390..fb84aff 100755 --- a/phyloFlash_makedb.pl +++ b/phyloFlash_makedb.pl @@ -299,7 +299,7 @@ =head1 COPYRIGHT AND LICENSE @lsu_in_ssh, $overwrite); unlink "$dbdir/SILVA_SSU.fasta" unless ($keep==1); - unlink glob "$dbdir/tmp.barrnap_hits.*" unless ($keep==1); + unlink glob "tmp.barrnap_hits.*" unless 
($keep==1); } else { msg ("LSU-filtered file found, not overwriting"); } @@ -316,6 +316,7 @@ =head1 COPYRIGHT AND LICENSE $ref_minlength); unlink "$dbdir/SILVA_SSU.noLSU.masked.fasta" unless ($keep==1); + # Index database into UDB file, if Vsearch v2.5.0+ # Speeds up run time in search phase of phyloFlash as db can be directly read to mem my $vsearch_ver_check = check_vsearch_version(); @@ -367,6 +368,20 @@ =head1 COPYRIGHT AND LICENSE $overwrite); } +# Clean up log files at end - if breaks in middle they will be available for debug +unless ($keep==1) { + unlink "tmp.bbmask_mask_repeats.log"; # from mask_repeats + unlink "tmp.bbduk_remove_univec.log"; # from univec_trim + if (defined $vsearch_ver_check) { + unlink "tmp.vsearch_make_udb.log"; # from make_vsearch_udb + } + unlink "tmp.bbmap_index.log"; # from bbmap_db + unlink "tmp.bowtiebuild.log"; # from bowtie_index + if ($sortmerna == 1) { + unlink "tmp.indexdb_rna.log"; # from sortmerna_index + } +} + finish(); write_logfile($makedb_log) if defined $makedb_log; diff --git a/test_files/test_html.SSU.collection.fasta.tree~ b/test_files/test_html.SSU.collection.fasta.tree~ deleted file mode 100644 index b925062..0000000 --- a/test_files/test_html.SSU.collection.fasta.tree~ +++ /dev/null @@ -1,19 +0,0 @@ -((((( -1_DQ213024_1_1496_Bacteria_Proteobacteria_Gammaproteobacteria_Xanthomonadales_Xanthomonadaceae_uncultured_Xanthomonas_sp__B05-08_04_0214 -:0.00613, -6_NODE_2_length_2253_cov_5_73254_ID_3_495-2036_-_ -:0.00613):0.00114, -9_test_html_PFemirge_106_0_010582 -:0.00727):0.15860,( -3_EU741098_1_1506_Bacteria_Proteobacteria_Gammaproteobacteria_Pseudomonadales_Pseudomonadaceae_Pseudomonas_Pseudomonas_sp__13650B -:0.02731, -7_NODE_3_length_1851_cov_5_35795_ID_5_40-1535___ -:0.02731):0.13856):0.12247,( -4_AJ491806_1_1490_Bacteria_Actinobacteria_Actinobacteria_Micrococcales_Microbacteriaceae_Microbacterium_Microbacterium_paraoxydans -:0.00778, -5_NODE_1_length_2522_cov_12_6724_ID_1_491-2014___ 
-:0.00778):0.28056):0.39266,( -2_X03680_934_2693_Eukaryota_Opisthokonta_Holozoa_Metazoa_Animalia_Nematoda_Chromadorea_Rhabditidae_Caenorhabditis_elegans -:0.01772, -8_test_html_PFemirge_0_0_989418 -:0.01772):0.66328)