Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement ffq #100

Draft
wants to merge 13 commits into
base: ffq
Choose a base branch
from
15 changes: 14 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,11 +3,24 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unpublished Version / DEV]
## [[1.7](https://github.com/nf-core/fetchngs/releases/tag/1.7)] - 2022-07-01

### :warning: Major enhancements

Support for GEO ids has been dropped in this release due to breaking changes introduced in the NCBI API. For more detailed information please see [this PR](https://github.com/nf-core/fetchngs/pull/102).

As a workaround, if you have a GEO accession you can directly download a text file containing the appropriate SRA ids to pass to the pipeline:

- Search for your GEO accession on [GEO](https://www.ncbi.nlm.nih.gov/geo)
- Click `SRA Run Selector` at the bottom of the GEO accession page
- Select the desired samples in the `SRA Run Selector` and then download the `Accession List`

This downloads a text file called `SRR_Acc_List.txt` that can be directly provided to the pipeline e.g. `--input SRR_Acc_List.txt`.

### Enhancements & fixes

- [#97](https://github.com/nf-core/fetchngs/pull/97) - Add support for generating nf-core/taxprofiler compatible samplesheets.
- [#99](https://github.com/nf-core/fetchngs/issues/99) - SRA_IDS_TO_RUNINFO fails due to bad request
- Add `enum` field for `--nf_core_pipeline` to parameter schema so only accept supported pipelines are accepted

## [[1.6](https://github.com/nf-core/fetchngs/releases/tag/1.6)] - 2022-05-17
Expand Down
4 changes: 4 additions & 0 deletions CITATIONS.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,10 @@

## Pipeline tools

- [ffq](https://www.biorxiv.org/content/10.1101/2022.05.18.492548v2)

> Gálvez-Merchán A, Min, KHJ, Pachter L, Booeshaghi SA. Metadata retrieval from sequence databases with ffq. bioRxiv, 19 May 2022. doi: 10.1101/2022.05.18.492548.

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
Expand Down
16 changes: 14 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@

## Introduction

**nf-core/fetchngs** is a bioinformatics pipeline to fetch metadata and raw FastQ files from both public and private databases. At present, the pipeline supports SRA / ENA / DDBJ / GEO / Synapse ids (see [usage docs](https://nf-co.re/fetchngs/usage#introduction)).
**nf-core/fetchngs** is a bioinformatics pipeline to fetch metadata and raw FastQ files from both public and private databases. At present, the pipeline supports SRA / ENA / DDBJ / Synapse ids (see [usage docs](https://nf-co.re/fetchngs/usage#introduction)).

The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.

Expand All @@ -27,7 +27,7 @@ On release, automated continuous integration tests run the pipeline on a full-si

Via a single file of ids, provided one-per-line (see [example input file](https://raw.githubusercontent.com/nf-core/test-datasets/fetchngs/sra_ids_test.txt)) the pipeline performs the following steps:

### SRA / ENA / DDBJ / GEO ids
### SRA / ENA / DDBJ ids

1. Resolve database ids back to appropriate experiment-level ids and to be compatible with the [ENA API](https://ena-docs.readthedocs.io/en/latest/retrieval/programmatic-access.html)
2. Fetch extensive id metadata via ENA API
Expand All @@ -36,6 +36,18 @@ Via a single file of ids, provided one-per-line (see [example input file](https:
- Otherwise use [`sra-tools`](https://github.com/ncbi/sra-tools) to download `.sra` files and convert them to FastQ
4. Collate id metadata and paths to FastQ files in a single samplesheet

### GEO ids

Support for GEO ids was dropped in [[v1.7](https://github.com/nf-core/fetchngs/releases/tag/1.7)] due to breaking changes introduced in the NCBI API. For more detailed information please see [this PR](https://github.com/nf-core/fetchngs/pull/102).

As a workaround, if you have a GEO accession you can directly download a text file containing the appropriate SRA ids to pass to the pipeline instead:

- Search for your GEO accession on [GEO](https://www.ncbi.nlm.nih.gov/geo)
- Click `SRA Run Selector` at the bottom of the GEO accession page
- Select the desired samples in the `SRA Run Selector` and then download the `Accession List`

This downloads a text file called `SRR_Acc_List.txt` that can be directly provided to the pipeline e.g. `--input SRR_Acc_List.txt`.

### Synapse ids

1. Resolve Synapse directory ids to their corresponding FastQ files ids via the `synapse list` command.
Expand Down
4 changes: 2 additions & 2 deletions assets/schema_input.json
Original file line number Diff line number Diff line change
Expand Up @@ -8,8 +8,8 @@
"type": "array",
"items": {
"type": "string",
"pattern": "^(((SR|ER|DR)[APRSX])|(SAM(N|EA|D))|(PRJ(NA|EB|DB))|(GS[EM])|(syn))(\\d+)$",
"errorMessage": "Please provide a valid SRA, ENA, DDBJ or GEO identifier"
"pattern": "^(((SR|ER|DR)[APRSX])|(SAM(N|EA|D))|(PRJ(NA|EB|DB))|(syn))(\\d+)$",
"errorMessage": "Please provide a valid SRA, ENA, DDBJ identifier"
}
}
}
25 changes: 17 additions & 8 deletions bin/sra_ids_to_runinfo.py
Original file line number Diff line number Diff line change
Expand Up @@ -193,9 +193,11 @@ def is_valid(cls, identifier):
class DatabaseResolver:
"""Define a service class for resolving various identifiers to experiments."""

_GEO_PREFIXES = {"GSE"}
_GEO_PREFIXES = {
"GSE",
"GSM"
}
_SRA_PREFIXES = {
"GSM",
"PRJNA",
"SAMN",
"SRR",
Expand All @@ -207,7 +209,9 @@ class DatabaseResolver:
"PRJDB",
"SAMD",
}
_ENA_PREFIXES = {"ERR"}
_ENA_PREFIXES = {
"ERR"
}

@classmethod
def expand_identifier(cls, identifier):
Expand Down Expand Up @@ -246,13 +250,13 @@ def _content_check(cls, response, identifier):
def _id_to_srx(cls, identifier):
"""Resolve the identifier to SRA experiments."""
params = {
"save": "efetch",
"id": identifier,
"db": "sra",
"rettype": "runinfo",
"term": identifier,
"retmode": "text"
}
response = fetch_url(
f"https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?{urlencode(params)}"
f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?{urlencode(params)}"
)
cls._content_check(response, identifier)
return [row["Experiment"] for row in open_table(response, delimiter=",")]
Expand All @@ -261,9 +265,14 @@ def _id_to_srx(cls, identifier):
def _gse_to_srx(cls, identifier):
"""Resolve the identifier to SRA experiments."""
ids = []
params = {"acc": identifier, "targ": "gsm", "view": "data", "form": "text"}
params = {
"id": identifier,
"db": "gds",
"rettype": "runinfo",
"retmode": "text"
}
response = fetch_url(
f"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?{urlencode(params)}"
f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?{urlencode(params)}"
)
cls._content_check(response, identifier)
gsm_ids = [
Expand Down
8 changes: 8 additions & 0 deletions conf/modules.config
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,14 @@ if (params.input_type == 'sra') {

process {

withName: FFQ {
publishDir = [
path: { "${params.outdir}/metadata" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: SRA_IDS_TO_RUNINFO {
publishDir = [
path: { "${params.outdir}/metadata" },
Expand Down
6 changes: 3 additions & 3 deletions docs/output.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,19 +9,19 @@ This document describes the output produced by the pipeline. The directories lis
The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data depending on the type of ids provided:

- Download FastQ files and create samplesheet from:
1. [SRA / ENA / DDBJ / GEO ids](#sra--ena--ddbj--geo-ids)
1. [SRA / ENA / DDBJ ids](#sra--ena--ddbj-ids)
2. [Synapse ids](#synapse-ids)
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution

Please see the [usage documentation](https://nf-co.re/fetchngs/usage#introduction) for a list of supported public repository identifiers and how to provide them to the pipeline.

### SRA / ENA / DDBJ / GEO ids
### SRA / ENA / DDBJ ids

<details markdown="1">
<summary>Output files</summary>

- `fastq/`
- `*.fastq.gz`: Paired-end/single-end reads downloaded from the SRA / ENA / DDBJ / GEO.
- `*.fastq.gz`: Paired-end/single-end reads downloaded from the SRA / ENA / DDBJ.
- `fastq/md5/`
- `*.md5`: Files containing `md5` sum for FastQ files downloaded from the ENA.
- `samplesheet/`
Expand Down
18 changes: 9 additions & 9 deletions docs/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,15 +8,15 @@

The pipeline has been set-up to automatically download and process the raw FastQ files from both public and private repositories. Identifiers can be provided in a file, one-per-line via the `--input` parameter. Currently, the following types of example identifiers are supported:

| `SRA` | `ENA` | `DDBJ` | `GEO` | `Synapse` |
| ------------ | ------------ | ------------ | ---------- | ----------- |
| SRR11605097 | ERR4007730 | DRR171822 | GSM4432381 | syn26240435 |
| SRX8171613 | ERX4009132 | DRX162434 | GSE147507 | |
| SRS6531847 | ERS4399630 | DRS090921 | | |
| SAMN14689442 | SAMEA6638373 | SAMD00114846 | | |
| SRP256957 | ERP120836 | DRP004793 | | |
| SRA1068758 | ERA2420837 | DRA008156 | | |
| PRJNA625551 | PRJEB37513 | PRJDB4176 | | |
| `SRA` | `ENA` | `DDBJ` | `Synapse` |
| ------------ | ------------ | ------------ | ----------- |
| SRR11605097 | ERR4007730 | DRR171822 | syn26240435 |
| SRX8171613 | ERX4009132 | DRX162434 | |
| SRS6531847 | ERS4399630 | DRS090921 | |
| SAMN14689442 | SAMEA6638373 | SAMD00114846 | |
| SRP256957 | ERP120836 | DRP004793 | |
| SRA1068758 | ERA2420837 | DRA008156 | |
| PRJNA625551 | PRJEB37513 | PRJDB4176 | |

### SRR / ERR / DRR ids

Expand Down
4 changes: 2 additions & 2 deletions lib/WorkflowMain.groovy
Original file line number Diff line number Diff line change
Expand Up @@ -104,7 +104,7 @@ class WorkflowMain {
if (num_match == total_ids) {
is_sra = true
} else {
log.error "Mixture of ids provided via --input: ${no_match_ids.join(', ')}\nPlease provide either SRA / ENA / DDBJ / GEO or Synapse ids!"
log.error "Mixture of ids provided via --input: ${no_match_ids.join(', ')}\nPlease provide either SRA / ENA / DDBJ or Synapse ids!"
System.exit(1)
}
}
Expand All @@ -129,7 +129,7 @@ class WorkflowMain {
if (num_match == total_ids) {
is_synapse = true
} else {
log.error "Mixture of ids provided via --input: ${no_match_ids.join(', ')}\nPlease provide either SRA / ENA / DDBJ / GEO or Synapse ids!"
log.error "Mixture of ids provided via --input: ${no_match_ids.join(', ')}\nPlease provide either SRA / ENA / DDBJ or Synapse ids!"
System.exit(1)
}
}
Expand Down
17 changes: 17 additions & 0 deletions lib/WorkflowSra.groovy
Original file line number Diff line number Diff line change
Expand Up @@ -29,4 +29,21 @@ class WorkflowSra {
" running nf-core/other pipelines.\n" +
"==================================================================================="
}

// Fail pipeline if input ids are from the GEO
public static void isGeoFail(ids, log) {
def pattern = /^(GS[EM])(\d+)$/
for (id in ids) {
if (id =~ pattern) {
log.error "===================================================================================\n" +
" GEO id detected: ${id}\n" +
" Support for GEO ids was dropped in v1.7 due to breaking changes in the NCBI API.\n" +
" Please remove any GEO ids from the input samplesheet.\n\n" +
" Please see:\n" +
" https://github.com/nf-core/fetchngs/pull/102\n" +
"==================================================================================="
System.exit(1)
}
}
}
}
4 changes: 2 additions & 2 deletions main.nf
Original file line number Diff line number Diff line change
Expand Up @@ -44,7 +44,7 @@ if (WorkflowMain.isSraId(ch_input, log)) {
} else if (WorkflowMain.isSynapseId(ch_input, log)) {
input_type = 'synapse'
} else {
exit 1, 'Ids provided via --input not recognised please make sure they are either SRA / ENA / DDBJ / GEO or Synapse ids!'
exit 1, 'Ids provided via --input not recognised please make sure they are either SRA / ENA / DDBJ or Synapse ids!'
}

if (params.input_type == input_type) {
Expand All @@ -63,7 +63,7 @@ if (params.input_type == input_type) {
workflow NFCORE_FETCHNGS {

//
// WORKFLOW: Download FastQ files for SRA / ENA / DDBJ / GEO ids
// WORKFLOW: Download FastQ files for SRA / ENA / DDBJ ids
//
if (params.input_type == 'sra') {
SRA ( ch_ids )
Expand Down
3 changes: 3 additions & 0 deletions modules.json
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,9 @@
"custom/sratoolsncbisettings": {
"git_sha": "b2dbaa99309a2057efc32ef9d029ed91140068df"
},
"ffq": {
"git_sha": "b96066565a52fdd42901f62e03c4ff5551980b43"
},
"sratools/fasterqdump": {
"git_sha": "0cdf7767a79faf424645beeff83ecfa5528b6a7c"
},
Expand Down
36 changes: 36 additions & 0 deletions modules/nf-core/modules/ffq/main.nf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

36 changes: 36 additions & 0 deletions modules/nf-core/modules/ffq/meta.yml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion nextflow.config
Original file line number Diff line number Diff line change
Expand Up @@ -158,7 +158,7 @@ manifest {
description = 'Pipeline to fetch metadata and raw FastQ files from public databases'
mainScript = 'main.nf'
nextflowVersion = '!>=21.10.3'
version = '1.7dev'
version = '1.7'
}

// Load modules.config for DSL2 module specific options
Expand Down
2 changes: 1 addition & 1 deletion nextflow_schema.json
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@
"pattern": "^\\S+\\.txt$",
"schema": "assets/schema_input.json",
"fa_icon": "fas fa-file-excel",
"description": "File containing SRA/ENA/DDBJ/GEO identifiers one per line to download their associated metadata and FastQ files."
"description": "File containing SRA/ENA/DDBJ identifiers one per line to download their associated metadata and FastQ files."
},
"input_type": {
"type": "string",
Expand Down
16 changes: 16 additions & 0 deletions workflows/sra.nf
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@ include { SRAFASTQ } from '../subworkflows/nf-core/srafastq/main'
========================================================================================
*/

include { FFQ } from '../modules/nf-core/modules/ffq/main'
include { CUSTOM_DUMPSOFTWAREVERSIONS } from '../modules/nf-core/modules/custom/dumpsoftwareversions/main'

/*
Expand All @@ -50,6 +51,21 @@ workflow SRA {
main:
ch_versions = Channel.empty()

// //
// // MODULE: Get id metadata from ffq
// //
// FFQ (
// ids.map { [it] }
// )
// ch_versions = ch_versions.mix(FFQ.out.versions.first())

//
// Fail the pipeline if GEO ids detected
//
ids
.collect()
.map { WorkflowSra.isGeoFail(it, log) }

//
// MODULE: Get SRA run information for public database ids
//
Expand Down