RAPT v0.2.0 - teamcity CI
RAPT-release authored and RAPT-release committed Jan 14, 2022
1 parent 92f756e commit bd58885
Showing 8 changed files with 71 additions and 107 deletions.
46 changes: 20 additions & 26 deletions GCP RAPT.md
@@ -11,35 +11,33 @@ Please see our [wiki page](https://github.com/ncbi/rapt/wiki) for References, Li
- Cloud Life Sciences API enabled for your project - for help see [Quick start using a Cloud Shell](https://github.com/ncbi/rapt/wiki/GCP%20RAPT%20In-depth%20Documentation%20and%20Examples.md)
- Access to a Google storage bucket for your data - for help see [Quick start using a Cloud Shell](https://github.com/ncbi/rapt/wiki/GCP%20RAPT%20In-depth%20Documentation%20and%20Examples.md)

*GCP RAPT* will bring up and shut down Google instances as needed.<br>
*GCP RAPT* will bring up and shut down Google instances as needed.

## Quick start
Here are instructions to execute RAPT once your system is set up. Additional instructions are available on our [wiki page](wiki/GCP%20RAPT%20In-depth%20Documentation%20and%20Examples.md).
1. In a browser, sign into [GCP](https://console.cloud.google.com/)
2. Invoke a Cloud Shell
3. Download the latest release by executing the following commands:

```
~$ curl -sSLo rapt.tar.gz https://github.com/ncbi/rapt/releases/download/v2.2.7/rapt-v2.2.7.tar.gz
~$ tar -xzf rapt.tar.gz && rm -f rapt.tar.gz
```
```
~$ curl -sSLo rapt.tar.gz https://github.com/ncbi/rapt/releases/download/v0.2.0/rapt-v0.2.0.tar.gz
~$ tar -xzf rapt.tar.gz && rm -f rapt.tar.gz
```
4. Run `run_rapt_gcp.sh help` to see the *GCP RAPT* usage information.

### Try an example
To run RAPT, you need Illumina-sequenced reads for the genome you wish to assemble and annotate. These can be in a fasta file in a Google storage bucket, or they can be in a run in SRA (an accession).<br>
To run RAPT, you need Illumina-sequenced reads for the genome you wish to assemble and annotate. These can be in a fasta file in a Google storage bucket, or they can be in a run in SRA (an accession).
Important: Only reads sequenced on **Illumina machines** can be used by RAPT.

#### Starting from an SRA run<br>
To demonstrate how to run RAPT, we are going to use SRR3496277, a set of reads available in SRA for *Mycoplasma pirum*.<br>
#### Starting from an SRA run
To demonstrate how to run RAPT, we are going to use SRR3496277, a set of reads available in SRA for *Mycoplasma pirum*.
This example takes about 1 hour.

Run the following command, where [gs://your_results_bucket](https://cloud.google.com/storage/docs/creating-buckets) is the Google storage bucket where the outputs and logs will be copied when the job finishes.

```bash
~$ ./run_rapt_gcp.sh submitacc SRR3496277 --bucket gs://your_results_bucket<br>
~$ ./run_rapt_gcp.sh submitacc SRR3496277 --bucket gs://your_results_bucket
```


If the job is successfully created, the script will print out execution information similar to the following:
```
RAPT job has been created successfully.
@@ -64,7 +62,6 @@ For technical details of this job, run:
~$
```


Check the status of the jobs executed under this project, run:
```bash
~$ ./run_rapt_gcp.sh joblist
@@ -76,10 +73,9 @@ JOB_ID USER LABEL SRR STATUS START_TIME END_TIME
~$
```


The results for the job will be available in the bucket you specified after the job is marked 'Done'. Please note that some runs may take up to 24 hours.

#### Starting from fastq or fasta file<br>
#### Starting from fastq or fasta file
You can use a fastq or a fasta file produced by Illumina sequencers as input to RAPT. This file can contain paired-end reads, with the two reads of a pair adjacent to each other in the file or single-end reads. Note that the quality scores are not necessary. The file needs to be copied to the Google storage bucket before you run `run_rapt_gcp.sh`.
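If your paired-end reads arrive as two separate files, they need to be merged so that the two reads of each pair sit next to each other. The following is a minimal sketch of that interleaving, not part of RAPT itself. It assumes plain 4-line FASTQ records and that both files list the reads in the same order; the function names are hypothetical.

```python
from itertools import zip_longest


def read_fastq_records(lines):
    """Yield FASTQ records as 4-line lists from an iterable of lines."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield record
            record = []
    if record:
        raise ValueError("truncated FASTQ record at end of input")


def interleave(forward_lines, reverse_lines):
    """Yield the lines of an interleaved FASTQ: read 1 of a pair, then read 2."""
    pairs = zip_longest(read_fastq_records(forward_lines),
                        read_fastq_records(reverse_lines))
    for forward, reverse in pairs:
        # zip_longest pads the shorter input with None, which exposes a mismatch.
        if forward is None or reverse is None:
            raise ValueError("forward and reverse inputs have different read counts")
        yield from forward
        yield from reverse
```

To use it, open both input files, pass the file objects to `interleave`, and write each yielded line followed by a newline to the combined file before copying that file to your bucket.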

The genus species of the sequenced organism needs to be provided on the command line. The strain is optional.
@@ -89,7 +85,6 @@ Here is an example command using a file available in the bucket named your_input
~$ ./run_rapt_gcp.sh submitfastq gs://your_input_bucket/M_pirum_25960.fastq -b gs://your_results_bucket --label M_pirum_25960 --organism "Mycoplasma pirum" --strain "ATCC 25960"
```


If the job is successfully created, the script will print out execution information similar to the following:

```
@@ -116,7 +111,6 @@ For technical details of this job, run:
~$
```


For more execution details and examples, see our [wiki page](https://github.com/ncbi/rapt/wiki/GCP%20RAPT%20In-depth%20Documentation%20and%20Examples.md).
- Setting up GCP with step by step guide
- Using fastq files as input
@@ -125,16 +119,16 @@ If you have other questions, please visit our [FAQs page](https://github.com/ncb

### Review the output
*GCP RAPT* generates a tarball named `output.tar.gz` in your designated bucket, under a "directory" named after the 10-character job-id assigned at the start of the execution (e.g. "2894b72f9f"). The tarball contains the following files:
1. concise.log is a file with the log of major stages and status of your RAPT run<br>
2. verbose.log is a detailed log file of all the actions and console outputs that RAPT performed for your run<br>
3. skesa.out.fa: multifasta files of the assembled contigs produced by SKESA<br>
4. ani-tax-report.txt and ani-tax-report.xml: Taxonomy verification results in text or XML format<br>
5. PGAP annotation results in multiple formats:<br>
* annot.gbk: annotated genome in GenBank flat file format<br>
* annot.gff: annotated genome in GFF3 format<br>
* annot.sqn: annotated genome in ASN format<br>
* annot.faa: multifasta file of the proteins annotated on the genome<br>
  * annot.fna: multifasta file of the transcripts annotated on the genome<br>
1. concise.log is a file with the log of major stages and status of your RAPT run
2. verbose.log is a detailed log file of all the actions and console outputs that RAPT performed for your run
3. skesa.out.fa: multifasta files of the assembled contigs produced by SKESA
4. ani-tax-report.txt and ani-tax-report.xml: Taxonomy verification results in text or XML format
5. PGAP annotation results in multiple formats:
* annot.gbk: annotated genome in GenBank flat file format
* annot.gff: annotated genome in GFF3 format
* annot.sqn: annotated genome in ASN format
* annot.faa: multifasta file of the proteins annotated on the genome
  * annot.fna: multifasta file of the transcripts annotated on the genome
* calls.tab: tab-delimited file of the coordinates of detected foreign sequence. Empty if no foreign contaminant was found.

Along with the tarball, there is also a `run.log` file generated automatically by the Google Life Sciences Pipeline where RAPT is invoked. This file captures all output written to stdout and stderr, and may help identify the problem should one occur.
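
After downloading `output.tar.gz` from the bucket, you may want to confirm that it holds everything listed above. The sketch below is one way to do that, not an official RAPT tool. The file names come from the list above; the internal directory layout of the tarball is an assumption, so only base names are compared, and the function names are hypothetical. The exact file set may differ between releases.

```python
import tarfile

# File names taken from the output list above.
EXPECTED_FILES = {
    "concise.log", "verbose.log", "skesa.out.fa",
    "ani-tax-report.txt", "ani-tax-report.xml",
    "annot.gbk", "annot.gff", "annot.sqn", "annot.faa", "annot.fna",
    "calls.tab",
}


def missing_outputs(member_names):
    """Return the expected files that are absent from the given member names."""
    # Compare base names only, since the archive's directory layout is not assumed.
    present = {name.rsplit("/", 1)[-1] for name in member_names}
    return sorted(EXPECTED_FILES - present)


def check_tarball(path):
    """Open a gzip-compressed tarball and report which expected files are missing."""
    with tarfile.open(path, "r:gz") as archive:
        return missing_outputs(archive.getnames())
```

Calling `check_tarball("output.tar.gz")` returns an empty list when every expected file is present.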
12 changes: 5 additions & 7 deletions README.md
@@ -1,18 +1,16 @@
# Read Assembly and Annotation Pipeline Tool (RAPT)

RAPT is an NCBI pipeline designed for assembling and annotating short genomic sequencing reads obtained from bacterial or archaeal isolates. RAPT consists of two major components, [SKESA](https://github.com/ncbi/SKESA) and [PGAP](https://github.com/ncbi/pgap). SKESA is a *de novo* assembler for microbial genomes based on DeBruijn graphs. PGAP is a prokaryotic genome annotation pipeline that combines *ab initio* gene prediction algorithms with homology-based methods. RAPT takes an SRA run or a fasta or fastq file of Illumina reads as input and produces an assembled and annotated genome.
RAPT is an NCBI pipeline designed for assembling and annotating Illumina genome sequencing reads obtained from bacterial or archaeal isolates. RAPT consists of two major NCBI components, SKESA and PGAP. SKESA is a de-novo assembler for microbial genomes based on DeBruijn graphs. PGAP is a prokaryotic genome annotation pipeline that combines ab initio gene prediction algorithms with homology based methods. RAPT takes an Illumina SRA run or a fasta file as input and produces an assembled and annotated genome.

If you are new to RAPT, please visit our [wiki page](https://github.com/ncbi/rapt/wiki) for detailed information.

![RAPT](RAPT_context2.png)

To use the latest version, download the RAPT command-line interface with the following commands:
To download the latest RAPT, run the following command lines at your linux prompt:
```
~$ curl -sSLo rapt.tar.gz https://github.com/ncbi/rapt/releases/download/v2.2.7/rapt-v2.2.7.tar.gz
~$ curl -sSLo rapt.tar.gz https://github.com/ncbi/rapt/releases/download/v0.2.0/rapt-v0.2.0.tar.gz
~$ tar -xzf rapt.tar.gz && rm -f rapt.tar.gz
```
There should be two scripts in your directory now, `run_rapt_gcp.sh` and `run_rapt.py`, corresponding to the two variations of RAPT: Google Cloud Platform (GCP) RAPT and Standalone RAPT. [GCP RAPT](GCP%20RAPT.md) is designed to run on GCP and is for users with GCP accounts (please note this is different from a gmail account), and [Standalone RAPT](Standalone%20RAPT.md) can run on any computing environments meeting a few pre-requisites.


There should be two scripts in your directory now, `run_rapt_gcp.sh` and `run_rapt.py`, corresponding to two variations of RAPT: Google Cloud Platform (GCP) RAPT and Standalone RAPT. [GCP RAPT](GCP%20RAPT.md) is designed to run on GCP and is for users with GCP accounts (please note this is different from a gmail account), while [Stand-alone RAPT](Standalone%20RAPT.md) can run on any computing environments meeting a few pre-requisites.

For instructions on running RAPT, please go to their respective documentation pages: [GCP RAPT](GCP%20RAPT.md) or [Stand-alone RAPT](Standalone%20RAPT.md).
For instructions on running RAPT, please go to their respective documentation pages: [GCP RAPT](GCP%20RAPT.md) or [Standalone RAPT](Standalone%20RAPT.md).
78 changes: 38 additions & 40 deletions Standalone RAPT.md
@@ -9,39 +9,40 @@ Please see our [wiki page](https://github.com/ncbi/rapt/wiki) for References, Li

The machine must satisfy the following prerequisites:

* At least 4GB memory per CPU core<br>
* At least 8 CPU cores and 32 GB memory<br>
* Linux OS preferred, Windows 10 (pro or enterprise version) will also work but extra configuration is required<br>
* Internet connection<br>
* Container runner installed (currently supports Docker/Podman/Singularity), Docker is recommended<br>
* Python installed<br>
* 100GB free storage space on disk<br>
At least 4GB memory per CPU core
At least 8 CPU cores and 32 GB memory
Linux OS preferred, Windows 10 (pro or enterprise version) will also work but extra configuration is required
Internet connection
Container runner installed (currently supports Docker/Podman/Singularity), Docker is recommended
Python installed
100GB free storage space on disk
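
Before a first run, it can help to check a machine against the list above. The following is a rough sketch under stated assumptions, not part of RAPT. The thresholds are copied from the prerequisites; the memory probe uses `os.sysconf` and so works on Linux only; the executable names `docker`, `podman`, and `singularity` are assumed. Note that a machine sold as "32 GB" may report slightly less usable memory.

```python
import os
import shutil

# Thresholds taken from the prerequisites list above.
MIN_CORES = 8
MIN_MEM_GB = 32
MIN_MEM_PER_CORE_GB = 4
MIN_DISK_GB = 100
RUNNERS = ("docker", "podman", "singularity")  # assumed executable names


def check_prerequisites(cores, mem_gb, free_disk_gb, runners_found):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"only {cores} CPU cores, need at least {MIN_CORES}")
    if mem_gb < MIN_MEM_GB:
        problems.append(f"only {mem_gb:.1f} GB memory, need at least {MIN_MEM_GB}")
    if cores and mem_gb / cores < MIN_MEM_PER_CORE_GB:
        problems.append(f"less than {MIN_MEM_PER_CORE_GB} GB memory per core")
    if free_disk_gb < MIN_DISK_GB:
        problems.append(f"only {free_disk_gb:.1f} GB free disk, need {MIN_DISK_GB}")
    if not runners_found:
        problems.append("no container runner found on PATH")
    return problems


def probe_machine(path="."):
    """Collect the inputs for check_prerequisites (the memory probe is Linux-only)."""
    cores = os.cpu_count() or 0
    mem_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    free_bytes = shutil.disk_usage(path).free
    runners = [name for name in RUNNERS if shutil.which(name)]
    return cores, mem_bytes / 1024**3, free_bytes / 1024**3, runners
```

Running `check_prerequisites(*probe_machine())` on the target machine lists whatever falls short. The check does not cover the Python and Internet connection requirements.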


### Additional tips if using Windows 10 (pro/enterprise version)
1. Right now it seems to work only on a real physical machine (L0, metal) with CPUs that support virtualization (such as Intel VT-x technology). Make sure this feature is enabled in the BIOS<br>
2. Windows 10 only, must be at least Professional or Enterprise version (hypervisor capability)<br>
3. Install python and Docker Desktop<br>
4. Start Docker service with hyper-V enabled<br>
5. Make sure Docker has switched to 'Linux containers'. It should do so by default if hyper-V is up and running.<br>
1. Right now it seems to work only on a real physical machine (L0, metal) with CPUs that support virtualization (such as Intel VT-x technology). Make sure this feature is enabled in the BIOS
2. Windows 10 only, must be at least Professional or Enterprise version (hypervisor capability)
3. Install python and Docker Desktop
4. Start Docker service with hyper-V enabled
5. Make sure Docker has switched to 'Linux containers'. It should do so by default if hyper-V is up and running.

## Quick start
Here are instructions to execute RAPT once your system is set up. Additional instructions are available on our [wiki page](https://github.com/ncbi/rapt/wiki/Standalone%20RAPT%20In-depth%20Documentation%20and%20Recommendations.md).
1. Go to your machine or instance command line<br>
2. Download the latest release by executing the following commands:<br>
1. Go to your machine or instance command line
2. Download the latest release by executing the following commands:

```
~$ curl -sSLo rapt.tar.gz https://github.com/ncbi/rapt/releases/download/v0.2.0/rapt-v0.2.0.tar.gz
~$ tar -xzf rapt.tar.gz && rm -f rapt.tar.gz
```
3. Run `./run_rapt.py -h` to see the *Stand-alone RAPT* usage information

```
~$ curl -sSLo rapt.tar.gz https://github.com/ncbi/rapt/releases/download/v2.2.7/rapt-v2.2.7.tar.gz
~$ tar -xzf rapt.tar.gz && rm -f rapt.tar.gz
```
3. Run `./run_rapt.py -h` to see the *Stand-alone RAPT* usage information<br>

### Try an example
To run RAPT, you need Illumina-sequenced reads for the genome you wish to assemble and annotate. These can be in a fasta file on the machine where you wish to run RAPT, or they can be in a run in the NCBI Sequence Read Archive (SRA).<br>
To run RAPT, you need Illumina-sequenced reads for the genome you wish to assemble and annotate. These can be in a fasta file on the machine where you wish to run RAPT, or they can be in a run in the NCBI Sequence Read Archive (SRA).
Important: Only reads sequenced on **Illumina machines** can be used by RAPT.

#### Starting from an SRA run<br>
To demonstrate how to run RAPT, we are going to use SRR3496277, a set of reads available in SRA for *Mycoplasma pirum*.<br>
#### Starting from an SRA run
To demonstrate how to run RAPT, we are going to use SRR3496277, a set of reads available in SRA for *Mycoplasma pirum*.
This example takes about 1 hour to complete (time may vary depending on the configuration of the computer).

Run the following command. The outputs and logs will be located in the current directory when the job finishes.
@@ -52,10 +53,9 @@ RAPT is now running, it may take a long time to finish. To see the progress, tra
~$
```

All output files and logs will be located in a subdirectory named `raptout_xxxxxxxxxx` under current directory. `xxxxxxxxxx` is the RUNID generated by `run_rapt.py`, unique to each time it is launched. Please note that some runs may take up to 24 hours.

All output files and logs will be located in a subdirectory named `raptout_xxxxxxxxxx` under current directory. `xxxxxxxxxx` is the RUNID generated by `run_rapt.py`, unique to each time it is launched. Please note that some runs may take up to 24 hours.<br>
#### Starting from fastq or fasta file<br>
#### Starting from fastq or fasta file
You can use a fastq or a fasta file produced by Illumina sequencers as input to RAPT. This file can contain paired-end reads, with the two reads of a pair adjacent to each other in the file or single-end reads. Note that the quality scores are not necessary. The file needs to be on the local file system.
The genus species of the sequenced organism needs to be provided on the command line. The strain is optional.
Here is an example command using a file already on your computer:
@@ -65,16 +65,15 @@ Here is an example command using a file already on your computer:
RAPT is now running, it may take a long time to finish. To see the progress, track the verbose log file /home/username/raptout_d3e7956148/verbose.log.
~$
```



To get more execution details and examples, see our [wiki page](https://github.com/ncbi/rapt/wiki/Standalone%20RAPT%20In-depth%20Documentation%20and%20Recommendations.md).
- Help Documentation<br>
- Reference data location<br>
- Help Documentation
- Reference data location
- Advanced Options

If you have other questions, please visit our [FAQs page](https://github.com/ncbi/rapt/wiki/FAQ.md).

### Review the output<br>
### Review the output

RAPT generates 11 output files if it completes normally without error. By default, results are written to the current directory. Each run of RAPT creates a subdirectory named raptout_<RUNID>, where <RUNID> is a random 10-character string. The --tag JOBID switch can be used to specify a human-readable job id, which is appended after the random RUNID for easy recognition.
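
Because each run gets its own randomly named subdirectory, finding the most recent one by hand can be tedious. The sketch below shows one way to locate it and read the end of its `verbose.log`. It is not part of RAPT. It assumes only the `raptout_` prefix described above and picks the latest run by modification time; the function names are hypothetical.

```python
from pathlib import Path


def latest_run_dir(base="."):
    """Return the most recently modified raptout_* directory under base, or None."""
    runs = [path for path in Path(base).glob("raptout_*") if path.is_dir()]
    return max(runs, key=lambda path: path.stat().st_mtime, default=None)


def tail(path, count=20):
    """Return the last `count` lines of a text file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.readlines()[-count:]
```

For example, `tail(latest_run_dir() / "verbose.log")` returns the last 20 lines of the newest run's log, provided at least one run directory exists.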

@@ -83,19 +82,18 @@ To store the output in location other than the current directory, use the -o or
~$ ./run_rapt.py -q path/to/srr3496277.fastq --organism "Mycoplasma pirum" --strain "ATCC 25960" --output-dir path/to/output-dir
```


All messages from RAPT execution are logged, with time stamps, in a file named `verbose.log` in the output directory. A simpler log file, `concise.log`, is also created, containing only entries that mark the main stages and status. Below is the list of expected output files:

1. concise.log<br>
1. concise.log
2. verbose.log
3. skesa.out.fa: multifasta files of the assembled contigs produced by SKESA<br>
4. ani-tax-report.txt and ani-tax-report.xml: Taxonomy verification results in text or XML format<br>
5. PGAP annotation results in multiple formats:<br>
* annot.gbk: annotated genome in GenBank flat file format<br>
* annot.gff: annotated genome in GFF3 format<br>
* annot.sqn: annotated genome in ASN format<br>
* annot.faa: multifasta file of the proteins annotated on the genome<br>
  * annot.fna: multifasta file of the transcripts annotated on the genome<br>
3. skesa.out.fa: multifasta files of the assembled contigs produced by SKESA
4. ani-tax-report.txt and ani-tax-report.xml: Taxonomy verification results in text or XML format
5. PGAP annotation results in multiple formats:
* annot.gbk: annotated genome in GenBank flat file format
* annot.gff: annotated genome in GFF3 format
* annot.sqn: annotated genome in ASN format
* annot.faa: multifasta file of the proteins annotated on the genome
  * annot.fna: multifasta file of the transcripts annotated on the genome
* calls.tab: tab-delimited file of the coordinates of detected foreign sequence. Empty if no foreign contaminant was found.

See a [detailed description of the annotation output files](https://github.com/ncbi/pgap/wiki/Output-Files) for more information.
24 changes: 0 additions & 24 deletions dist/CHANGELOG.md
@@ -1,27 +1,3 @@
### Release v2.2.7
- GCP-RAPT: added `--project` option to specify custom project.
- GCP-RAPT: log file names are fixed to concise.log and verbose.log
- GCP-RAPT: log files are included in the output archive
- GCP-RAPT: added *metadata.events* to `jobdetails` command output
- GCP-RAPT: `joblist` command displays job status as *Done* instead of *Finished* and *Failed* instead of *Aborted* to reflect the actual job status
- Standalone RAPT: suppress stderr log stream by default and add option to enable it
- PINGER ncbi_app name changed from _rapt_ to _raptdocker_.
- Fix verbose log capture bug
- Includes RAPT build id at the beginning of log files
- Added variation analysis to annotation by new version of PGAPX.
- Simplified PINGER usage report data

### Release v0.2.2
- Code refactoring, remove duplicated codes
- All codes are subject to lint with NCBI rules
- Add message to show data-download retrying
- Stand-alone RAPT: Default in silence mode, but print error messages if container returns non-zero status
- Fix NCBI PINGER ```ncbi_app``` values for different flavors (GCP-RAPT, Stand-alone RAPT and web-rapt)
- Remove duplicated sequence file ```annot.fna``` from output
- Added input sequence assemble statistics
- Added retry logic to ```srapath``` to address sporadic failures
- Fix final status error

### Release v0.2.0
- GCP-RAPT: added `--project` option to specify custom project.
- GCP-RAPT: log file names are fixed to concise.log and verbose.log
2 changes: 1 addition & 1 deletion dist/README.txt
@@ -1,4 +1,4 @@
Read Assembly and Annotation Pipeline Tool (RAPT) v2.2.7
Read Assembly and Annotation Pipeline Tool (RAPT) v0.2.0

RAPT is a NCBI pipeline designed for assembling and annotating Illumina genome sequencing reads obtained from bacterial or archaeal isolates. RAPT consists of two major NCBI components, SKESA and PGAP. SKESA is a de-novo assembler for microbial genomes based on DeBruijn graphs. PGAP is a prokaryotic genome annotation pipeline that combines ab initio gene prediction algorithms with homology based methods. RAPT takes an Illumina SRA run or a fasta file as input and produces an assembled and annotated genome.
