Skip to content

Commit

Permalink
Merge pull request #84 from qiyunzhu/2.0b3
Browse files Browse the repository at this point in the history
2.0b3
  • Loading branch information
qiyunzhu authored Nov 25, 2021
2 parents 1101fd1 + bdb606c commit 31e28a2
Show file tree
Hide file tree
Showing 13 changed files with 458 additions and 269 deletions.
13 changes: 12 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,16 +1,27 @@
Change Log
==========

## Version 2.0b3 (12/27/2020)
## Version 2.0b3 (2021, ongoing)

### Added
- Predicted HGT list now includes potential donors. Users can optionally specify a taxonomic rank at which they will be reported.
- A quick-start guide added to the homepage.
- Let the database construction task resume from an interrupted task and use already downloaded files, while detecting and skipping corrupt files.
- Added an option `--above` to sample genomes at all ranks from the designated one to phylum during database construction.
- Added an option `--typemater` to include NCBI-defined type materials in genome sampling during database construction.
- Added an option `--manual` to export URLs of sampled genomes during database construction, and let the user download them manually.

### Changed
- Updated pre-built default database to 2021-11-21 (after NCBI RefSeq 209)
- Repository transferred from [DittmarLab](https://github.com/DittmarLab) to [qiyunlab](https://github.com/qiyunlab).
- Updated recommended dependency versions, however the program should continue to be compatible with previous versions.
- Minor tweaks with no visible impact on program behavior.

### Fixed
- Fixed an issue with the NCBI FTP server connection during database construction. NCBI now recommends rsync over ftp. Therefore the protocol has been updated accordingly.
- Fixed compatibility with latest scikit-learn (1.0.1).
- Fixed compatibility with latest DIAMOND (2.0.13).


## Version 2.0b2 (5/3/2020)

Expand Down
10 changes: 6 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ References
Set up a Conda environment and install dependencies:

```bash
conda create -n hgtector python=3 pyyaml pandas matplotlib scikit-learn bioconda::diamond
conda create -n hgtector -c conda-forge python=3 pyyaml pandas matplotlib scikit-learn bioconda::diamond
conda activate hgtector
```

Expand All @@ -40,13 +40,15 @@ Install HGTector2:
pip install git+https://github.com/qiyunlab/HGTector.git
```

Build a reference database using the default protocol:
Then you will be able to type `hgtector` to run the program. Here are more details of [installation](doc/install.md).

Build a reference [database](doc/database.md) using the default protocol:

```bash
hgtector database -o db_dir --default
```

This will retrieve the latest genomic data from NCBI. If this does not work (e.g., due to network issues), or you need some customization, please read the [database](doc/database.md) page.
Or [download](https://www.dropbox.com/s/tszxy9etp52id3u/hgtdb_20211121.tar.xz?dl=0) a pre-built database as of 2021-11-21, and [compile](doc/database.md#Manual-compiling) it.

Prepare input file(s). They should be multi-Fasta files of amino acid sequences (faa). Each file represents the whole protein set of a complete or partial genome.

Expand All @@ -69,7 +71,7 @@ It is recommended that you read the [first run](doc/1strun.md), [second run](doc

## License

Copyright (c) 2013-2020, [Qiyun Zhu](mailto:[email protected]) and [Katharina Dittmar](mailto:[email protected]). Licensed under [BSD 3-clause](http://opensource.org/licenses/BSD-3-Clause). See full license [statement](LICENSE).
Copyright (c) 2013-2021, [Qiyun Zhu](mailto:[email protected]) and [Katharina Dittmar](mailto:[email protected]). Licensed under [BSD 3-clause](http://opensource.org/licenses/BSD-3-Clause). See full license [statement](LICENSE).


## Citation
Expand Down
2 changes: 1 addition & 1 deletion doc/1strun.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ A small example is provided in the subdirectory [example](../example). The input
Let's analyze this small example using HGTector.

**Note**: It has been increasingly infeasible as of 2020 to run remote search through the NCBI BLAST server. If you experience an very slow run when going through this tutorial, please skip and move on to [second run](2ndrun.md).
**Note**: Automatic remote BLAST search using URL API is inefficient as of 2021 and has been [deprecated](https://ncbi.github.io/blast-cloud/dev/api.html) by NCBI. Therefore this tutorial is for reference only. Unless you want to wait for long hours, please skip this tutorial and move on to the [second run](2ndrun.md).


## Search
Expand Down
2 changes: 1 addition & 1 deletion doc/2ndrun.md
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,7 @@ Here we will analyze a single genome of [_Escherichia coli_ O55:H7](https://www.
Download the whole protein set of _E. coli_ O55:H7 (5,278 protein sequences in total) from the NCBI server:

```bash
wget -O o55h7.faa.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/025/165/GCF_000025165.1_ASM2516v1/GCF_000025165.1_ASM2516v1_protein.faa.gz
wget -O o55h7.faa.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/025/165/GCF_000025165.1_ASM2516v1/GCF_000025165.1_ASM2516v1_protein.faa.gz
```

You don't need to unzip the file, as HGTector can automatically parse compressed files in the format of gzip, bzip2 and lzma.
Expand Down
77 changes: 53 additions & 24 deletions doc/database.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,9 @@
Database
========

## Index
- [Overview](#overview) | [Default protocol](#default-protocol) | [Pre-built database](#pre-built-database) | [Database files](#database-files) | [Considerations](#considerations) | [Command-line reference](#command-line-reference)

## Overview

The `database` command is an automated workflow for sampling reference genomes, downloading non-redundant protein sequences, and building local databases for sequence homology search. It provides various options for flexible customization of the database, to address specific research goals including HGT prediction or other general purposes.
Expand All @@ -9,33 +12,36 @@ The `database` command is an automated workflow for sampling reference genomes,
hgtector database -o <output_dir> <parameters...>
```

### Default protocol
The workflow consists of the following steps:

HGTector provides a default protocol for database building.
1. Download NCBI taxonomy database (taxdump).
2. Download NCBI RefSeq assembly summary.
3. Sample genomes based on various properties and taxonomic information.
4. Download protein sequences associated with sampled genomes.
5. Compile local databases using DIAMOND and/or BLAST.

```bash
hgtector database -o <output_dir> --default
```

This will download all protein sequences of NCBI RefSeq genomes of bacteria, archaea, fungi and protozoa, keep one genome per species, plus all NCBI-defined reference and representative genomes. Finally it will attempt to compile the database using DIAMOND, if available. The command is equivalent to:
## Default protocol
Database files
This will download all protein sequences of NCBI RefSeq genomes of **bacteria**, **archaea**, **fungi** and **protozoa**, keep _one genome per species_ that has a Latinate name, plus one genome per taxonomic group at higher ranks, regardless whether that genome has a Latinate species name, plus all NCBI-defined **reference**, **representative** and **type material** genomes (prioritized during taxonomy-based sampling, and added afterwards if not sampled). Finally it will attempt to compile the database using DIAMOND, if available. The command is equivalent to:

```bash
hgtector database --output <output_dir> --cats microbe --sample 1 --rank species --reference --representative --compile diamond
hgtector database --output <output_dir> --cats microbe --sample 1 --rank species_latin --above --reference --represent --typemater --compile diamond
```

A pre-built default database as of 2019-10-21 is available for [download](https://www.dropbox.com/s/qdnfgzdcjadlm4i/hgtdb_20191021.tar.xz?dl=0). It needs to be [compiled](#Manual-compiling) using choice of aligner.

### Procedures
## Pre-built database

The workflow consists of the following steps:
A database built using the default protocol on 2021-11-21 is available for [download](https://www.dropbox.com/s/tszxy9etp52id3u/hgtdb_20211121.tar.xz?dl=0) \([MD5](https://www.dropbox.com/s/kdopz946pk088wr/hgtdb_20211121.tar.xz.md5?dl=0)\). It needs to be [compiled](#Manual-compiling) using choice of aligner.

1. Download NCBI taxonomy database (taxdump).
2. Download NCBI RefSeq assembly summary.
3. Sample genomes based on various properties and taxonomic information.
4. Download protein sequences associated with sampled genomes.
5. Compile local databases using DIAMOND and/or BLAST.
This database, sampled from NCBI RefSeq after release, 209 contains 68,977,351 unique protein sequences from 21,754 microbial genomes, representing 3 domains, 74 phyla, 145 classes, 337 orders, 783 families, 3,753 genera and 15,932 species.

### Database files
Building this database used a maximum of 63 GB memory. Searching this database using DIAMOND v2.0.13 requires ~7 GB memory.

A previous version of the database built on 2019-10-21 is available [here](https://www.dropbox.com/s/qdnfgzdcjadlm4i/hgtdb_20191021.tar.xz?dl=0).


## Database files

File or directory | Description
--- | ---
Expand All @@ -60,6 +66,9 @@ The protein-to-TaxID map is already integrated into the compiled databases, so o

Feel free to delete (e.g., `download/`) or compress the intermediate files (e.g., `db.faa`) to save disk space.


## Considerations

### More examples

```bash
Expand All @@ -80,12 +89,31 @@ hgtector database -g gids.txt -o .

This will only download genomes specified in the file `gids.txt`. Useful for controlled tests.

### Clean up

After the database is successfully built, you may consider compressing `db.faa` and deleting `download/` (or just `download/faa/`) to save disk space. HGTector won't do this automatically.

### Break and resume

Should any of the download steps be interrupted by e.g., a network failure, one can resume the downloading process by re-executing the same command. The program will skip the already downloaded files in this new run. In some instances, one may need to manually remove the last file from the failed run (because that file may be corrupt), before re-running the program.

If one wants to overwrite downloaded files (e.g., upgrading), add `--overwrite` to the command.

### Manual downloading

One may want to download genomes manually in a more controled manner, instead of letting HGTector running for hours to days to retrieve them one after another before moving to the next step. In this case, add `--manual` to the command, and the program will generate `urls.txt`, a list of URLs of the sampled genomes, and quit.

Then one can choose the most appropriate method to download them. For example, one may use the [rsync protocol](https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#protocols), as recommended by NCBI:

```bash
while read url
do
rsync -Ltq rsync${url#https}/${url##*/}_protein.faa.gz download/faa/
done < urls.txt
```

After all genomes (protein sequences) are downloaded to `download/faa/`, one may restart the program without `--manual`, and the program will take the downloaded files and move to the next step.

### Manual compiling

The sampling & downloading steps (1-4) require extensive network traffics (usually several hours, if the bottleneck is not on the recipient side) but little local computing load; whereas the compling step (5) requires extensive local computing power, but no networking.
Expand All @@ -95,12 +123,11 @@ Therefore, it is a reasonable plan to only download database files without compi
For DIAMOND:

```bash
echo $'accession\taccession.version\ttaxid' > prot.accession2taxid
zcat taxon.map.gz | awk -v OFS='\t' '{split($1, a, "."); print a[1], $1, $2}' >> prot.accession2taxid
echo $'accession.version\ttaxid' | cat - <(zcat taxon.map.gz) > prot.accession2taxid.FULL

diamond makedb --threads 16 --in db.faa --taxonmap prot.accession2taxid --taxonnodes taxdump/nodes.dmp --taxonnames taxdump/names.dmp --db diamond/db
diamond makedb --threads 16 --in db.faa --taxonmap prot.accession2taxid.FULL --taxonnodes taxdump/nodes.dmp --taxonnames taxdump/names.dmp --db diamond/db

rm prot.accession2taxid
rm prot.accession2taxid.FULL
```

For BLAST:
Expand Down Expand Up @@ -174,17 +201,19 @@ Option | Default | Description

Option | Default | Description
--- | --- | ---
`-s`, `--sample` | 0 | Sample up to this number of genomes per taxonomic group at the given rank. "0" is for all (disable sampling).
`-r`, `--rank` | species | Taxonomic rank at which subsampling will be performed.
`-s`, `--sample` | 0 | Sample up to this number of genomes per taxonomic group at the given rank. "0" is for all (disable sampling). Prior to sampling, genomes will be sorted by NCBI genome category: reference > representative > type material, then by assembly level: complete genome or chromosome > scaffolds > contigs. Sampling will start from the top of the list.
`-r`, `--rank` | species | Taxonomic rank at which subsampling will be performed. Can be any taxonomic rank defined in the NCBI taxonomy database. A special case is "species_latin", which will sample from species that have Latinate names.
`--above` | - | Sampling will also be performed on ranks from the one given by `-r` to phylum (low to high). They will not overlap the already sampled ones. For example, if two _E. coli_ genomes are already sampled, no more genome will be added when sampling in genus _Escherichia_. This flag is useful in the case of `-r species_latin`, because some ranks above species may be undersampled.

### Genome sampling

Option | Default | Description
--- | --- | ---
`--genbank` | - | By default the program only downloads RefSeq genomes (`GCF`). This flag will let the program also download GenBank genomes (`GCA`). But RefSeq has higher priority than GenBank if the same genome is hosted by both catalogs.
`--complete` | - | Only include complete genomes, i.e., `assembly_level` is `Complete Genome` or `Chromosome`.
`--reference` | - | Add NCBI-defined reference genomes to selection (after taxon sampling).
`--representative` | - | Add NCBI-defined representative genomes to selection (after taxon sampling).
`--reference` | - | Include NCBI-defined reference genomes.
`--represent` | - | Include NCBI-defined representative genomes.
`--typemater` | - | Include NCBI-defined type material genomes.

### Taxonomic filter

Expand Down
8 changes: 4 additions & 4 deletions doc/install.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ HGTector is written in Python 3. One needs at least Python 3.6 to run the progra
### Option 1: Through Conda (recommended)

```bash
conda create -n hgtector python=3 pyyaml pandas matplotlib scikit-learn
conda create -n hgtector -c conda-forge python=3 pyyaml pandas matplotlib scikit-learn
conda activate hgtector
pip install git+https://github.com/qiyunlab/HGTector.git
```
Expand Down Expand Up @@ -49,7 +49,7 @@ conda install -c bioconda diamond blast

HGTector has a command `database` for automated database construction. It defaults to the **NCBI** RefSeq microbial genomes and taxonomy. Meanwhile, we also provide instructions for using **GTDB** and custom databases. See [details](database.md).

A standard database built using the default protocol on 2019-10-21 is available for [download](https://www.dropbox.com/s/qdnfgzdcjadlm4i/hgtdb_20191021.tar.xz?dl=0), together with [instruction](database.md#Manual-compiling) for compiling.
A standard database built using the default protocol on 2021-11-21 is available for [download](https://www.dropbox.com/s/tszxy9etp52id3u/hgtdb_20211121.tar.xz?dl=0) \([MD5](https://www.dropbox.com/s/kdopz946pk088wr/hgtdb_20211121.tar.xz.md5?dl=0)\), together with [instruction](database.md#Manual-compiling) for compiling.

A small, pre-compiled test database is also available for [download](https://www.dropbox.com/s/46v3uc708rvc5rc/ref107.tar.xz?dl=0).

Expand Down Expand Up @@ -80,8 +80,8 @@ conda env remove -n hgtector --all

## Compatibility

If in the future some dependencies have changes that are not compatible with the current release of HGTector, the following "safe" command can be used to install the current versions of dependencies (note: DIAMOND version is too tricky to specify).
If in the future some dependencies have changes that are not compatible with the current release of HGTector, the following "safe" command can be used to install the current versions of dependencies.

```bash
conda create -n hgtector -c bioconda python=3.7.7 pyyaml=5.3.1 pandas=1.0.3 matplotlib=3.1.3 scikit-learn=0.22.1 diamond blast=2.9.0
conda create -n hgtector-dev -c conda-forge python=3.10.0 pyyaml=6.0 pandas=1.3.4 matplotlib=3.5.0 scikit-learn=1.0.1 bioconda::diamond=2.0.13 bioconda::blast=2.12.0
```
16 changes: 8 additions & 8 deletions hgtector/analyze.py
Original file line number Diff line number Diff line change
Expand Up @@ -887,9 +887,8 @@ def perform_kde(self, data):
https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/
"""
bw = self.bandwidth
data_ = data[:, np.newaxis]
scaler = StandardScaler()
data_ = scaler.fit_transform(data[:, np.newaxis])
data_ = scaler.fit_transform(data.reshape(-1, 1))
estimator = KernelDensity(kernel='gaussian')

# grid search optimization
Expand All @@ -914,8 +913,8 @@ def perform_kde(self, data):

# get density function
x, y = self.density_func(data_, kde)
x = scaler.inverse_transform(x)
y = scaler.inverse_transform(y)
x = scaler.inverse_transform(x.reshape(-1, 1)).reshape(-1)
y = scaler.inverse_transform(y.reshape(-1, 1)).reshape(-1)
return x, y, bw

@staticmethod
Expand Down Expand Up @@ -1127,7 +1126,7 @@ def smart_kde(self, group):
"""
data = self.df[group].values
scaler = StandardScaler()
data_ = scaler.fit_transform(data[:, np.newaxis])
data_ = scaler.fit_transform(data.reshape(-1, 1))
estimator = KernelDensity(kernel='gaussian')
bwspace = np.logspace(0, -1, self.bw_steps)

Expand All @@ -1137,8 +1136,8 @@ def smart_kde(self, group):
setattr(estimator, 'bandwidth', bw)
kde = estimator.fit(data_)
x, y = self.density_func(data_, kde)
x = scaler.inverse_transform(x)
y = scaler.inverse_transform(y)
x = scaler.inverse_transform(x.reshape(-1, 1)).reshape(-1)
y = scaler.inverse_transform(y.reshape(-1, 1)).reshape(-1)
try:
peak, valley = self.first_hill(x, y)
except ValueError:
Expand Down Expand Up @@ -1177,7 +1176,8 @@ def calc_cluster_props(self):
# calculate centroid
clf = NearestCentroid()
clf.fit(data_, self.df['hgt'])
cent = scaler.inverse_transform(clf.centroids_[1])
cent = scaler.inverse_transform(clf.centroids_[1].reshape(
-1, data_.shape[1])).reshape(-1)
return cent

def refine_cluster(self, cent):
Expand Down
8 changes: 4 additions & 4 deletions hgtector/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -101,8 +101,8 @@ blast:
remote:

# remote search database
# options: nr, refseq_protein, swissprot, pdb, etc.
db: nr
# options: nr, refseq_select_prot, refseq_protein, swissprot, pdb, etc.
db: refseq_select_prot

# remote search algorithm
# options: blastp psiBlast deltaBlast kmerBlastp phiBlast, etc.
Expand All @@ -114,13 +114,13 @@ remote:
# typically cannot exceed 8,000 characters)
retries: 5 # maximum number of retries per search
delay: 60 # seconds between two search requests
timeout: 900 # seconds before program gives up waiting
timeout: 1800 # seconds before program gives up waiting

# entrez query text
entrez: all [filter] NOT(environmental samples[filter] OR metagenomes[orgn]) txid131567[orgn]

# extra URL arguments for remote search
extrargs: "&FILTER=m%20S"
extrargs: "&WORD_SIZE=6&FILTER=m%20S"


## Fetch information from remote server
Expand Down
Loading

0 comments on commit 31e28a2

Please sign in to comment.