Skip to content

Commit

Permalink
Merge pull request #23 from insect-biome-atlas/main
Browse files Browse the repository at this point in the history
Update README and default config
  • Loading branch information
johnne authored May 9, 2024
2 parents 9f9f177 + 40c183f commit fbe2dd9
Show file tree
Hide file tree
Showing 2 changed files with 221 additions and 33 deletions.
248 changes: 218 additions & 30 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,41 +1,160 @@
# COI reference sequences from BOLD DB
# COI DB

## Installation and usage
## Overview
This tool downloads sequences + metadata from [GBIF](https://hosted-datasets.gbif.org/)
and formats sequences of interest for use with downstream metabarcoding analyses.

Please see the [Wiki](https://github.com/biodiversitydata-se/coidb/wiki) for
detailed install and usage information.
## Installation options

## General information
**Option 1**: Install with conda:

Author: SBDI molecular data team
Contact e-mail: [email protected]
DOI: 10.17044/scilifelab.20514192
License: CC BY 4.0
Categories: Bioinformatics and computational biology not elsewhere classified, Computational ecology and phylogenetics, Ecology not elsewhere classified
Item type: Dataset
Keywords: COI sequence analysis, Ampliseq
Funding: Swedish Research Council (VR), grant number 2019-00242. National Bioinformatic Infrastructure Sweden
```bash
conda install -c bioconda coidb
```

This README file was last updated: 2022-08-23
**Option 2**: Download a release from the '**Releases**' section, unpack it and then create and activate the conda environment. Finally, install the software:

Please cite as: Swedish Biodiversity Data Infrastructure (SBDI; 2022). COI reference sequences from BOLD DB
```bash
conda env create -f environment.yml
conda activate coidb
python -m pip install .
```

## Dataset description
## Quick start
To see the steps that will be run, without actually running them, do:

This repository contains COI (mitochondrial cytochrome oxidase subunit I) sequences
collected from the [BOLD database](https://boldsystems.org/). The fasta file
bold_clustered_cleaned.fasta.gz has record ids that can be queried in the [Public
Data Portal](https://boldsystems.org/index.php/Public_BINSearch?searchtype=records)
and each fasta header contains the taxonomic ranks + the BIN ID assigned to the
record. The taxonomic information for each record is also given in the tab-separated
file bold_info_filtered.tsv.gz.
```bash
coidb -n
```

The dataset was last created on February 18, 2022.
Remove the `-n` flag to actually run the steps.

### Methods
The code used to generate this dataset consists of a snakemake workflow wrapped
into a python package that can be installed with [conda](https://docs.conda.io/en/latest/miniconda.html)
(`conda install -c bioconda coidb`).
## Output

The primary outputs of the tool are:

1. bold_clustered_cleaned.fasta: A fasta file with sequences clustered at whatever threshold set in the config file (default is 1.0 which means 100% identity). The header of each sequence in this file has the format

```
>GMGMN070-14 Animalia;Arthropoda;Insecta;Lepidoptera;Pieridae;Gonepteryx;Gonepteryx rhamni;BOLD:AAA9222
```

In this example `GMGMN070-14` is the representative id for the sequence and can be viewed in the BOLD database at https://www.boldsystems.org/index.php/Public_RecordView?processid=GMGMN070-14.

2. bold_clustered.sintax.fasta: This fasta file is compatible with the SINTAX classification tool implemented in [vsearch](https://github.com/torognes/vsearch) and has headers with the format:

```
>GMGMN070-14;tax=d:Animalia,k:Arthropoda,p:Insecta,c:Lepidoptera,o:Pieridae,f:Gonepteryx,g:Gonepteryx rhamni,s:BOLD:AAA9222
```

> [!NOTE]
> In the SINTAX formatted headers, the taxonomic ranks are shifted to allow classification down to BOLD_bin. Since SINTAX only allows for ranks prefixed with 'd' (for domain) 'k' (kingdom), 'p' (phylum), 'c' (class), 'o' (order), 'f' (family), 'g' (genus), or 's' (species) we shift the taxonomy so that kingdom becomes domain, etc., and prefix the BOLD bin id with 's'.
3. bold_clustered.assigntaxonomy.fasta and bold_clustered.addSpecies.fasta: These fasta files are compatible with the assignTaxonomy and addSpecies functions implemented in [DADA2](https://github.com/benjjneb/dada2/). For the assignTaxonomy file the headers have the format:

```
>Animalia;Arthropoda;Insecta;Lepidoptera;Pieridae;Gonepteryx;Gonepteryx rhamni;
```

and for the addSpecies file the headers have the format:

```
>GMGMN070-14 Gonepteryx rhamni
```

## Configuration
There are a few configurable parameters that modifies how sequences are filtered
and clustered. You can modify these parameters using a config file in `yaml`
format. The default setup looks like this:

```yaml
database:
# url to download info and sequence files from
url: "https://hosted-datasets.gbif.org/ibol/ibol.zip"
# gene of interest (will be used to filter sequences)
gene:
- "COI-5P"
# phyla of interest (omit this in order to include all phyla)
phyla: []
# Percent identity to cluster seqs in the database by
pid: 1.0
```
### Gene types
By default, only sequences named 'COI-5P' are included in the
final output. To modify this behaviour you can supply a config file in `yaml`
format via `-c <path-to-configfile.yaml>`. For example, to also include
'COI-3P' sequences you can create a config file, _e.g._ named `config.yaml` with
these contents:

```yaml
database:
gene:
- 'COI-5P'
- 'COI-3P'
```

Then run `coidb` as:

```bash
coidb -c config.yaml
```

### Phyla

The default is to include sequences from all taxa. However, you can filter the
resulting sequences to only those from one or more phyla. For instance, to only
include sequences from the phyla 'Arthropoda' and 'Chordata' you supply a
config file with these contents:

```yaml
database:
phyla:
- 'Arthropoda'
- 'Chordata'
```

### Clustering

After sequences have been filtered to the genes and phyla of interest they are
clustered on a per-species (or BOLD `BIN` id where applicable) basis using
`vsearch`. By default this clustering is performed at 100% identity. To change
this behaviour, to _e.g._ 95% identity make sure your config file contains:

```yaml
database:
pid: 0.95
```

## Command line options

The `coidb` tool is a wrapper for a small snakemake workflow that handles
all the downloading, filtering and clustering.

```
usage: coidb [-h] [-n] [-j CORES] [-f] [-u] [-c [CONFIG_FILE ...]] [--cluster-config CLUSTER_CONFIG] [--workdir WORKDIR] [-p] [-t]
[targets ...]

positional arguments:
targets File(s) to create or steps to run. If omitted, the full pipeline is run.

options:
-h, --help show this help message and exit
-n, --dryrun Only print what to do, don't do anything [False]
-j CORES, --cores CORES
Number of cores to run with [4]
-f, --force Force workflow run
-u, --unlock Unlock working directory
-c [CONFIG_FILE ...], --config-file [CONFIG_FILE ...]
Path to configuration file
--cluster-config CLUSTER_CONFIG
Path to cluster config (for running on SLURM)
--workdir WORKDIR Working directory. Defaults to current dir
-p, --printshellcmds Print shell commands
-t, --touch Touch output files (mark them up to date without really changing them) instead of running their commands.
```
## How it works
Firstly sequence and taxonomic information for records in the BOLD database is
downloaded from the [GBIF Hosted Datasets](https://hosted-datasets.gbif.org/ibol/).
Expand Down Expand Up @@ -102,6 +221,75 @@ Sequences are then clustered at 100% identity using [vsearch](https://github.com
(Rognes _et al._ 2016). This clustering is done separately for sequences assigned
to each BIN ID.
## References
### Step-by-step
You can also run the `coidb` tool in steps, _e.g._ if you are only interested
in some of the files or if you want to inspect the results before proceeding
to the next step. This is done using the positional argument `targets`.
Valid targets are `download`, `filter` and `cluster`.
#### Step 1: Download
For example, to only
download files from GBIF you can run:
```bash
coidb download
```

This should produce two files `bold_info.tsv` and `bold_seqs.txt` containing
metadata and nucleotide sequences, respectively.

#### Step 2: Filter

To also filter the `bold_info.tsv` and `bold_seqs.txt` files (according to the
default 'COI-5P' gene or any other genes/phyla you've defined in the optional
config file) you can run:

```bash
coidb filter
```

This filters sequences in `bold_seqs.txt` and entries in `bold_info.tsv` to
potential genes and phyla of interest, respectively. Entries are then merged
so that only sequences with relevant information are kept. Output files from
this step are `bold_filtered.fasta` and `bold_info_filtered.tsv`.


#### Step 3: Clustering

The final step clusters sequences in `bold_filtered.fasta` on a per-species
basis. This means that for each species, the sequences are gathered,
clustered with `vsearch` and only the representative sequences are kept. In this
step sequences can either have a species name or a BOLD `BIN` ID
(_e.g._ `BOLD:AAY5017`) and are treated as being equivalent.

To run the clustering step, do:

```bash
coidb cluster
```

The end result is a file `bold_clustered.fasta`.

#### Step 4: Clean headers

The `clean` step removes extra information from sequence headers generated as part of clustering. To run this step, do:

```bash
coidb clean
```

#### Step 5: Generate SINTAX/DADA2 formatted fasta

To also get the SINTAX and/or DADA2 formatted fasta file, do:

```bash
coidb format_sintax
```

or

Rognes T, Flouri T, Nichols B, Quince C, Mahé F. (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584. doi: 10.7717/peerj.2584
```bash
coidb format_dada2
```
6 changes: 3 additions & 3 deletions src/coidb/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -7,10 +7,10 @@ database:
# Percent identity to cluster seqs in the database by
pid: 1.0
# url to download info and sequence files from
bold.zip: "https://hosted-datasets.gbif.org/ibol/ibol.zip"
bold.zip: "https://hosted-datasets.gbif.org/ibol/ibol_2024_01_02.zip"
# url to download zip file with 'taxon.txt' file
bold_bins.zip: "https://hosted-datasets.gbif.org/ibol/ibol_bins_2022_02_08.zip"
backbone.zip: "https://hosted-datasets.gbif.org/datasets/backbone/2022-11-23/backbone.zip"
bold_bins.zip: "https://hosted-datasets.gbif.org/ibol/ibol_bins_2024_01_03.zip"
backbone.zip: "https://hosted-datasets.gbif.org/datasets/backbone/2023-08-28/backbone.zip"
# gene of interest (will be used to filter sequences)
gene:
- COI-5P
Expand Down

0 comments on commit fbe2dd9

Please sign in to comment.