From bd5888545f4290a303fd072af55ab31b25678ff3 Mon Sep 17 00:00:00 2001
From: RAPT-release
Date: Thu, 13 Jan 2022 23:30:17 -0500
Subject: [PATCH] RAPT v0.2.0 - teamcity CI

---
 GCP RAPT.md            | 46 +++++++++++--------------
 README.md              | 12 +++----
 Standalone RAPT.md     | 78 ++++++++++++++++++++----------------------
 dist/CHANGELOG.md      | 24 -------------
 dist/README.txt        |  2 +-
 dist/release-notes.txt |  8 ++---
 dist/run_rapt.py       |  4 +--
 dist/run_rapt_gcp.sh   |  4 +--
 8 files changed, 71 insertions(+), 107 deletions(-)

diff --git a/GCP RAPT.md b/GCP RAPT.md
index 721588f..f5a85d6 100644
--- a/GCP RAPT.md
+++ b/GCP RAPT.md
@@ -11,7 +11,7 @@ Please see our [wiki page](https://github.com/ncbi/rapt/wiki) for References, Li
 - Cloud Life Sciences API enabled for your project - for help see [Quick start using a Cloud Shell](https://github.com/ncbi/rapt/wiki/GCP%20RAPT%20In-depth%20Documentation%20and%20Examples.md)
 - Access to a Google storage bucket for your data - for help see [Quick start using a Cloud Shell](https://github.com/ncbi/rapt/wiki/GCP%20RAPT%20In-depth%20Documentation%20and%20Examples.md)
 
-*GCP RAPT* will bring up and shut down Google instances as needed.
+*GCP RAPT* will bring up and shut down Google instances as needed. ## Quick start Here are instructions to execute RAPT once your system is set up. Additional instructions are available on our [wiki page](wiki/GCP%20RAPT%20In-depth%20Documentation%20and%20Examples.md). @@ -19,27 +19,25 @@ Here are instructions to execute RAPT once your system is set up. Additional ins 2. Invoke a Cloud Shell 3. Download the latest release by executing the following commands: - ``` - ~$ curl -sSLo rapt.tar.gz https://github.com/ncbi/rapt/releases/download/v2.2.7/rapt-v2.2.7.tar.gz - ~$ tar -xzf rapt.tar.gz && rm -f rapt.tar.gz - ``` +``` +~$ curl -sSLo rapt.tar.gz https://github.com/ncbi/rapt/releases/download/v0.2.0/rapt-v0.2.0.tar.gz +~$ tar -xzf rapt.tar.gz && rm -f rapt.tar.gz +``` 4. Run `run_rapt_gcp.sh help` to see the *GCP RAPT* usage information. ### Try an example -To run RAPT, you need Illumina-sequenced reads for the genome you wish to assemble and annotate. These can be in a fasta file in a Google storage bucket, or they can be in a run in SRA (an accession).
+To run RAPT, you need Illumina-sequenced reads for the genome you wish to assemble and annotate. These can be in a fasta file in a Google storage bucket, or they can be in a run in SRA (an accession). Important: Only reads sequenced on **Illumina machines** can be used by RAPT. -#### Starting from an SRA run
-To demonstrate how to run RAPT, we are going to use SRR3496277, a set of reads available in SRA for *Mycoplasma pirum*.
+#### Starting from an SRA run +To demonstrate how to run RAPT, we are going to use SRR3496277, a set of reads available in SRA for *Mycoplasma pirum*. This example takes about 1 hour. Run the following command, where [gs://your_results_bucket](https://cloud.google.com/storage/docs/creating-buckets) is the Google storage bucket where the outputs and logs will be copied when the job finishes. ```bash -~$ ./run_rapt_gcp.sh submitacc SRR3496277 --bucket gs://your_results_bucket
+~$ ./run_rapt_gcp.sh submitacc SRR3496277 --bucket gs://your_results_bucket ``` - - If the job is successfully created, the script will print out execution information similar to the following: ``` RAPT job has been created successfully. @@ -64,7 +62,6 @@ For technical details of this job, run: ~$ ``` - Check the status of the jobs executed under this project, run: ```bash ~$ ./run_rapt_gcp.sh joblist @@ -76,10 +73,9 @@ JOB_ID USER LABEL SRR STATUS START_TIME END_TIME ~$ ``` - The results for the job will be available in the bucket you specified after the job is marked 'Done'. Please note that some runs may take up to 24 hours. -#### Starting from fastq or fasta file
+#### Starting from fastq or fasta file You can use a fastq or a fasta file produced by Illumina sequencers as input to RAPT. This file can contain paired-end reads, with the two reads of a pair adjacent to each other in the file or single-end reads. Note that the quality scores are not necessary. The file needs to be copied to the Google storage bucket before you run `run_rapt_gcp.sh`. The genus species of the sequenced organism needs to be provided on the command line. The strain is optional. @@ -89,7 +85,6 @@ Here is an example command using a file available in the bucket named your_input ~$ ./run_rapt_gcp.sh submitfastq gs://your_input_bucket/M_pirum_25960.fastq -b gs://your_results_bucket --label M_pirum_25960 --organism "Mycoplasma pirum" --strain "ATCC 25960" ``` - If the job is successfully created, the script will print out execution information similar to the following: ``` @@ -116,7 +111,6 @@ For technical details of this job, run: ~$ ``` - To get more execution details and examples in our [wiki page](https://github.com/ncbi/rapt/wiki/GCP%20RAPT%20In-depth%20Documentation%20and%20Examples.md). - Setting up GCP with step by step guide - Using fastq files as input @@ -125,16 +119,16 @@ If you have other questions, please visit our [FAQs page](https://github.com/ncb ### Review the output *GCP RAPT* generates a tarball named `output.tar.gz` in your designated bucket, under a "directory" named after the 10-character job-id assigned at the start of the execution (i.e. "2894b72f9f"). The tarball contains the following files: -1. concise.log is file with the log of major stages and status of your RAPT run
-2. verbose.log is a detailed log file of all the actions and console outputs that RAPT performed for your run
-3. skesa.out.fa: multifasta files of the assembled contigs produced by SKESA
-4. ani-tax-report.txt and ani-tax-report.xml: Taxonomy verification results in text or XML format
-5. PGAP annotation results in multiple formats:
- * annot.gbk: annotated genome in GenBank flat file format
- * annot.gff: annotated genome in GFF3 format
- * annot.sqn: annotated genome in ASN format
- * annot.faa: multifasta file of the proteins annotated on the genome
- * annot.fna: multifasta file of the trancripts annotated on the genome
+1. concise.log is a file with the log of major stages and status of your RAPT run +2. verbose.log is a detailed log file of all the actions and console outputs that RAPT performed for your run +3. skesa.out.fa: multifasta files of the assembled contigs produced by SKESA +4. ani-tax-report.txt and ani-tax-report.xml: Taxonomy verification results in text or XML format +5. PGAP annotation results in multiple formats: + * annot.gbk: annotated genome in GenBank flat file format + * annot.gff: annotated genome in GFF3 format + * annot.sqn: annotated genome in ASN format + * annot.faa: multifasta file of the proteins annotated on the genome + * annot.fna: multifasta file of the transcripts annotated on the genome * calls.tab: tab-delimited file of the coordinates of detected foreign sequence. Empty if no foreign contaminant was found. Along with the tarball there is also a `run.log` file generated automatically by the Google Life Sciences Pipeline where RAPT is invoked. This file catches all output to stdout and stderr by anything, and may be helpful to identify the problem should any happens. diff --git a/README.md b/README.md index ccace41..d2a90cd 100644 --- a/README.md +++ b/README.md @@ -1,18 +1,16 @@ # Read Assembly and Annotation Pipeline Tool (RAPT) -RAPT is a NCBI pipeline designed for assembling and annotating short genomic sequencing reads obtained from bacterial or archaeal isolates. RAPT consists of two major components, [SKESA](https://github.com/ncbi/SKESA) and [PGAP](https://github.com/ncbi/pgap). SKESA is a *de novo* assembler for microbial genomes based on DeBruijn graphs. PGAP is a prokaryotic genome annotation pipeline that combines *ab initio* gene prediction algorithms with homology-based methods. RAPT takes an SRA run or a fasta or fastq file of Illumina reads as input and produces an assembled and annotated genome. 
+RAPT is an NCBI pipeline designed for assembling and annotating Illumina genome sequencing reads obtained from bacterial or archaeal isolates. RAPT consists of two major NCBI components, SKESA and PGAP. SKESA is a de novo assembler for microbial genomes based on DeBruijn graphs. PGAP is a prokaryotic genome annotation pipeline that combines ab initio gene prediction algorithms with homology-based methods. RAPT takes an Illumina SRA run or a fasta file as input and produces an assembled and annotated genome. If you are new to RAPT, please visit our [wiki page](https://github.com/ncbi/rapt/wiki) for detailed information. ![RAPT](RAPT_context2.png) -To use the latest version, download the RAPT command-line interface with the following commands: +To download the latest RAPT, run the following commands at your Linux prompt: ``` -~$ curl -sSLo rapt.tar.gz https://github.com/ncbi/rapt/releases/download/v2.2.7/rapt-v2.2.7.tar.gz +~$ curl -sSLo rapt.tar.gz https://github.com/ncbi/rapt/releases/download/v0.2.0/rapt-v0.2.0.tar.gz ~$ tar -xzf rapt.tar.gz && rm -f rapt.tar.gz ``` +There should be two scripts in your directory now, `run_rapt_gcp.sh` and `run_rapt.py`, corresponding to the two variations of RAPT: Google Cloud Platform (GCP) RAPT and Standalone RAPT. [GCP RAPT](GCP%20RAPT.md) is designed to run on GCP and is for users with GCP accounts (please note this is different from a Gmail account), and [Standalone RAPT](Standalone%20RAPT.md) can run on any computing environment meeting a few prerequisites. - -There should be two scripts in your directory now, `run_rapt_gcp.sh` and `run_rapt.py`, corresponding to two variations of RAPT: Google Cloud Platform (GCP) RAPT and Standalone RAPT. [GCP RAPT](GCP%20RAPT.md) is designed to run on GCP and is for users with GCP accounts (please note this is different from a gmail account), while [Stand-alone RAPT](Standalone%20RAPT.md) can run on any computing environments meeting a few pre-requisites. 
- -For instructions on running RAPT, please go to their respective documentation pages: [GCP RAPT](GCP%20RAPT.md) or [Stand-alone RAPT](Standalone%20RAPT.md). +For instructions on running RAPT, please go to their respective documentation pages: [GCP RAPT](GCP%20RAPT.md) or [Standalone RAPT](Standalone%20RAPT.md). diff --git a/Standalone RAPT.md b/Standalone RAPT.md index 944add1..0bbdc6a 100644 --- a/Standalone RAPT.md +++ b/Standalone RAPT.md @@ -9,39 +9,40 @@ Please see our [wiki page](https://github.com/ncbi/rapt/wiki) for References, Li The machine must satisfy the following prerequisites: -* At least 4GB memory per CPU core
-* At least 8 CPU cores and 32 GB memory
-* Linux OS preferred, Windows 10 (pro or enterprise version) will also work but extra configuration is required
-* Internet connection
-* Container runner installed (currently supports Docker/Podman/Singularity), Docker is recommended
-* Python installed
-* 100GB free storage space on disk
+• At least 4GB memory per CPU core +• At least 8 CPU cores and 32 GB memory +• Linux OS preferred, Windows 10 (pro or enterprise version) will also work but extra configuration is required +• Internet connection +• Container runner installed (currently supports Docker/Podman/Singularity), Docker is recommended +• Python installed +• 100GB free storage space on disk ### Additional tips if using Windows 10 (pro/enterprise version) -1. Right now it seems to only work on a real physical machine (L0, metal) with CPUs support virtualization (Like INTEL VT-x technology); Make sure this feature is enabled in BIOS
-2. Windows 10 only, must be at least Professional or Enterprise version (hypervisor capability)
-3. Install python and Docker Desktop
-4. Start Docker service with hyper-V enabled
-5. Make sure Docker has switched to 'Linux containers'. It should do so by default if hyper-V is up and running.
+1. Right now it seems to only work on a real physical machine (L0, metal) with CPUs support virtualization (Like INTEL VT-x technology); Make sure this feature is enabled in BIOS +2. Windows 10 only, must be at least Professional or Enterprise version (hypervisor capability) +3. Install python and Docker Desktop +4. Start Docker service with hyper-V enabled +5. Make sure Docker has switched to 'Linux containers'. It should do so by default if hyper-V is up and running. ## Quick start Here are instructions to execute RAPT once your system is set up. Additional instructions are available on our [wiki page](https://github.com/ncbi/rapt/wiki/Standalone%20RAPT%20In-depth%20Documentation%20and%20Recommendations.md). -1. Go to your machine or instance command line
-2. Download the latest release by executing the following commands:
+1. Go to your machine or instance command line +2. Download the latest release by executing the following commands: + +``` +~$ curl -sSLo rapt.tar.gz https://github.com/ncbi/rapt/releases/download/v0.2.0/rapt-v0.2.0.tar.gz +~$ tar -xzf rapt.tar.gz && rm -f rapt.tar.gz +``` +3. Run `./run_rapt.py -h` to see the *Stand-alone RAPT* usage information - ``` - ~$ curl -sSLo rapt.tar.gz https://github.com/ncbi/rapt/releases/download/v2.2.7/rapt-v2.2.7.tar.gz - ~$ tar -xzf rapt.tar.gz && rm -f rapt.tar.gz - ``` -3. Run `./run_rapt.py -h` to see the *Stand-alone RAPT* usage information
### Try an example -To run RAPT, you need Illumina-sequenced reads for the genome you wish to assemble and annotate. These can be in a fasta file on the machine where you wish to run RAPT, or they can be in a run in the NCBI Sequence Read Archive (SRA).
+To run RAPT, you need Illumina-sequenced reads for the genome you wish to assemble and annotate. These can be in a fasta file on the machine where you wish to run RAPT, or they can be in a run in the NCBI Sequence Read Archive (SRA). Important: Only reads sequenced on **Illumina machines** can be used by RAPT. -#### Starting from an SRA run
-To demonstrate how to run RAPT, we are going to use SRR3496277, a set of reads available in SRA for *Mycoplasma pirum*.
+#### Starting from an SRA run +To demonstrate how to run RAPT, we are going to use SRR3496277, a set of reads available in SRA for *Mycoplasma pirum*. This example takes about 1 hour to complete (time may vary depends on the configuration of the computer). Run the following command, the outputs and logs will be located in the current directory when the job finishes. @@ -52,10 +53,9 @@ RAPT is now running, it may take a long time to finish. To see the progress, tra ~$ ``` +All output files and logs will be located in a subdirectory named `raptout_xxxxxxxxxx` under current directory. `xxxxxxxxxx` is the RUNID generated by `run_rapt.py`, unique to each time it is launched. Please note that some runs may take up to 24 hours. -All output files and logs will be located in a subdirectory named `raptout_xxxxxxxxxx` under current directory. `xxxxxxxxxx` is the RUNID generated by `run_rapt.py`, unique to each time it is launched. Please note that some runs may take up to 24 hours.
- -#### Starting from fastq or fasta file
+#### Starting from fastq or fasta file You can use a fastq or a fasta file produced by Illumina sequencers as input to RAPT. This file can contain paired-end reads, with the two reads of a pair adjacent to each other in the file or single-end reads. Note that the quality scores are not necessary. The file needs to be on the local file system. The genus species of the sequenced organism needs to be provided on the command line. The strain is optional. Here is an example command using a file already on your computer: @@ -65,16 +65,15 @@ Here is an example command using a file already on your computer: RAPT is now running, it may take a long time to finish. To see the progress, track the verbose log file /home/username/raptout_d3e7956148/verbose.log. ~$ ``` - - + To get more execution details and examples, see our [wiki page](https://github.com/ncbi/rapt/wiki/Standalone%20RAPT%20In-depth%20Documentation%20and%20Recommendations.md). -- Help Documentation
-- Reference data location
+- Help Documentation +- Reference data location - Advanced Options If you have other questions, please visit our [FAQs page](https://github.com/ncbi/rapt/wiki/FAQ.md). -### Review the output
+### Review the output RAPT generates 11 output files if completes normally without error. The default location of result output is in the current directory. Each run of RAPT will create a subdirectory bearing the name raptout_ where is a random 10-character string. The --tag JOBID switch can be used to specify a human-readable job id which will be appended after the random RUNID for easy recognition. @@ -83,19 +82,18 @@ To store the output in location other than the current directory, use the -o or ~$ ./run_rapt.py -q path/to/srr3496277.fastq --organism "Mycoplasma pirum" --strain "ATCC 25960" --output-dir path/to/output-dir ``` - All messages from RAPT execution are logged, with time stamps, in a file named `verbose.log` in the output directory. A simpler version log file, `concise.log`, is also created with only entries mark the main stages and status. Below is the list of expected output files: -1. concise.log
+1. concise.log 2. verbose.log -3. skesa.out.fa: multifasta files of the assembled contigs produced by SKESA
-4. ani-tax-report.txt and ani-tax-report.xml: Taxonomy verification results in text or XML format
-5. PGAP annotation results in multiple formats:
- * annot.gbk: annotated genome in GenBank flat file format
- * annot.gff: annotated genome in GFF3 format
- * annot.sqn: annotated genome in ASN format
- * annot.faa: multifasta file of the proteins annotated on the genome
- * annot.fna: multifasta file of the trancripts annotated on the genome
+3. skesa.out.fa: multifasta files of the assembled contigs produced by SKESA +4. ani-tax-report.txt and ani-tax-report.xml: Taxonomy verification results in text or XML format +5. PGAP annotation results in multiple formats: + * annot.gbk: annotated genome in GenBank flat file format + * annot.gff: annotated genome in GFF3 format + * annot.sqn: annotated genome in ASN format + * annot.faa: multifasta file of the proteins annotated on the genome + * annot.fna: multifasta file of the transcripts annotated on the genome * calls.tab: tab-delimited file of the coordinates of detected foreign sequence. Empty if no foreign contaminant was found. See a [detailed description of the annotation output files](https://github.com/ncbi/pgap/wiki/Output-Files) for more information. \ No newline at end of file diff --git a/dist/CHANGELOG.md b/dist/CHANGELOG.md index ee938cb..c9d1564 100644 --- a/dist/CHANGELOG.md +++ b/dist/CHANGELOG.md @@ -1,27 +1,3 @@ -### Release v2.2.7 - - GCP-RAPT: added `--project` option to specify custom project. - - GCP-RAPT: log file names are fixed to concise.log and verbose.log - - GCP-RAPT: log files are included in the output archive - - GCP-RAPT: added *metadata.events* to `jobdetails` command output - - GCP-RAPT: `joblist` command displays job status as *Done* instead of *Finished* and *Failed* instead of *Aborted* to reflect the actual job status - - Standalone RAPT: suppress stderr log stream by default and add option to enable it - - PINGER ncbi_app name changed from _rapt_ to _raptdocker_. - - Fix verbose log capture bug - - Includes RAPT build id at the beginning of log files - - Added variation analysis to annotation by new version of PGAPX. 
- - Simplified PINGER usage report data - -### Release v0.2.2 - - Code refactoring, remove duplicated codes - - All codes are subject to lint with NCBI rules - - Add message to show data-download retrying - - Stand-alone RAPT: Default in silence mode, but print error messages if container returns non-zero status - - Fix NCBI PINGER ```ncbi_app``` values for different flavors (GCP-RAPT, Stand-alone RAPT and web-rapt) - - Remove duplicated sequence file ```annot.fna``` from output - - Added input sequence assemble statistics - - Added retry logic to ```srapath``` to address sporadic failures - - Fix final status error - ### Release v0.2.0 - GCP-RAPT: added `--project` option to specify custom project. - GCP-RAPT: log file names are fixed to concise.log and verbose.log diff --git a/dist/README.txt b/dist/README.txt index 5b1e388..0c1d050 100644 --- a/dist/README.txt +++ b/dist/README.txt @@ -1,4 +1,4 @@ -Read Assembly and Annotation Pipeline Tool (RAPT) v2.2.7 +Read Assembly and Annotation Pipeline Tool (RAPT) v0.2.0 RAPT is a NCBI pipeline designed for assembling and annotating Illumina genome sequencing reads obtained from bacterial or archaeal isolates. RAPT consists of two major NCBI components, SKESA and PGAP. SKESA is a de-novo assembler for microbial genomes based on DeBruijn graphs. PGAP is a prokaryotic genome annotation pipeline that combines ab initio gene prediction algorithms with homology based methods. RAPT takes an Illumina SRA run or a fasta file as input and produces an assembled and annotated genome. 
diff --git a/dist/release-notes.txt b/dist/release-notes.txt index a047129..6d154d2 100644 --- a/dist/release-notes.txt +++ b/dist/release-notes.txt @@ -1,11 +1,9 @@ -RELEASE: v2.2.7 -DATE: 12-29-2020 -BUILD: rapt-30372431 +RELEASE: v0.2.0 +DATE: 10-27-2020 +BUILD: rapt-29571188 SKESA: 2.4.0 PGAPX: 2020-09-24.build4894 -DESCRIPTION: - DESCRIPTION: GCP-RAPT now displays job status in "joblist" command as "Done" and "Failed" instead of "Finished" and "Aborted" for more clarity; and more information has been added to the output of "jobdetails" command for easier problem identification. Log files are now included in the result archive "output.tar.gz" so that it is the only file to download under one jobid. diff --git a/dist/run_rapt.py b/dist/run_rapt.py index c880002..7d86426 100755 --- a/dist/run_rapt.py +++ b/dist/run_rapt.py @@ -11,9 +11,9 @@ ##to be compatible with python2 from abc import ABCMeta, abstractmethod -IMAGE_URI="ncbi/rapt:v2.2.7" +IMAGE_URI="ncbi/rapt:v0.2.0" -RAPT_VERSION="rapt-30372431" +RAPT_VERSION="rapt-29571188" ACT_FUNC_TEST = 'functest' ACT_VERSION = 'version' diff --git a/dist/run_rapt_gcp.sh b/dist/run_rapt_gcp.sh index 5f166fe..57fa72e 100755 --- a/dist/run_rapt_gcp.sh +++ b/dist/run_rapt_gcp.sh @@ -1,8 +1,8 @@ #!/usr/bin/env bash ###############################* Global Constants *################################## -IMAGE_URI="ncbi/rapt:v2.2.7" -RAPT_VERSION="rapt-30372431" +IMAGE_URI="ncbi/rapt:v0.2.0" +RAPT_VERSION="rapt-29571188" GCP_LOGS_VIEWER="https://console.cloud.google.com/logs/viewer"