Skip to content

Commit

Permalink
Merge pull request #66 from qiyunlab/2.0b3
Browse files Browse the repository at this point in the history
added donor reporting
  • Loading branch information
qiyunzhu authored Dec 28, 2020
2 parents 438c2a8 + 32a4646 commit 45dfb1d
Show file tree
Hide file tree
Showing 14 changed files with 211 additions and 80 deletions.
10 changes: 10 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,16 @@
Change Log
==========

## Version 2.0b3 (12/27/2020)

### Added
- Predicted HGT list now includes potential donors. Users can optionally specify a taxonomic rank at which they will be reported.

### Changed
- Repository transferred from [DittmarLab](https://github.com/DittmarLab) to [qiyunlab](https://github.com/qiyunlab).
- Minor tweaks with no visible impact on program behavior.


## Version 2.0b2 (5/3/2020)

Note: The command-line interface and the format and content of input, output and database files are identical to last version (2.0b1). You may safely combine analysis results of this and last versions without the need for re-analysis.
Expand Down
2 changes: 2 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
HGTector2
=========

The development of HGTector is now at [qiyunlab](https://qiyunlab.github.io/). Versions starting from 2.0b3 will be released from this repo. Please access HGTector using the new URL: https://github.com/qiyunlab/HGTector.

**HGTector2** is a completely re-engineered software tool, featuring a fully automated analytical pipeline with smart determination of parameters which requires minimum human involvement, a re-designed command-line interface which facilitates standardized scientific computing, and a high-quality Python 3 codebase.

**HGTector** is a computational pipeline for genome-wide detection of putative horizontal gene transfer (HGT) events based on sequence homology search hit distribution statistics.
Expand Down
4 changes: 2 additions & 2 deletions doc/1strun.md
Original file line number Diff line number Diff line change
Expand Up @@ -125,13 +125,13 @@ Refining cluster... done.

Starting from this candidate list, the program performs a refinement in the 2D space. This process involves calculating the "**silhouette scores**", a metric which measures how confident a data point is assigned to the current cluster in the whole dataset. The higher the better. The program drops low-score genes from the candidate list, and the refined list will be the final result.

The 13 predicted HGT-derived genes, together with their silhouette scores, are listed in [hgts/gsul.txt](../example/output/hgts/gsul.txt). Meanwhile, a scatter plot showing "distal" score vs. "close" score is generated. The yellow dots represent the predicted genes. As you can see, they are all located at the right boundary of the plot, meaning they have no to few "close" hits, but a decent amount of "distal" hits.
The 13 predicted HGT-derived genes, together with their silhouette scores and potential donors (a rough range), are listed in [hgts/gsul.txt](../example/output/hgts/gsul.txt). Meanwhile, a scatter plot showing "distal" score vs. "close" score is generated. The yellow dots represent the predicted genes. As you can see, they are all located at the right boundary of the plot, meaning they have no to few "close" hits, but a decent amount of "distal" hits.

![gsul.scatter](../example/output/scatter.png "Distal vs. close scatter plot")

## Aftermath

Let's do a sanity check on the result. The original paper, [Schönknecht _et al_. (2013)](https://science.sciencemag.org/content/339/6124/1207.long) reported that two genes encoding for ArsB (arsenite efflux protein) (`EME29520` and `EME26816`) were horizontally acquired (see Figure 3 of the paper). You should find that the two genes are among the 13 predicted genes in this small and quick analysis.
Let's do a sanity check on the result. The original paper, [Schönknecht _et al_. (2013)](https://science.sciencemag.org/content/339/6124/1207.long) reported that two genes encoding for ArsB (arsenite efflux protein) (`EME29520` and `EME26816`) were horizontally acquired from Bacteria (see Figure 3 of the paper). You should find that the two genes are among the 13 predicted genes in this small and quick analysis.

Next, one may examine these predicted genes and their hit tables. This will provide clues for further inferring the putative donors of the genes. For example, in the hit table of `EME29520`, there are multiple hits from beta- and gammaproteobacteria...

Expand Down
58 changes: 24 additions & 34 deletions doc/2ndrun.md
Original file line number Diff line number Diff line change
Expand Up @@ -324,16 +324,36 @@ Genes that are located in the "grey zone" (i.e., not-so-low "close" score, not-s

In this case no gene is dropped.

The refined list of putatively HGT-derived genes is printed to `hgts/o55h7.txt`. This file also reports the silhouette score of each candidate gene. I added this function because users may want to report some statistical measurements of individual prediction results.

But it is important not to over-interpret the silhouette scores. They are NOT likelihoods of genes being horizontally derived. They are measurements of how well particular candidate genes are clustered with other candidate genes.
## Final output

### Predicted HGTs

The refined list of putatively HGT-derived genes is printed to `hgts/o55h7.txt`. This file also reports the silhouette score and the potential donor of each candidate gene. It looks like (taking the grid search result for example):

Proten | Score | Donor
| --- | --- | --- |
WP_001285914.1 | 0.691785 | 1224
WP_000173226.1 | 0.82044 | 1239
WP_000084086.1 | 0.764173 | 2
WP_000890958.1 | 0.826649 | 214092
WP_000026143.1 | 0.748203 | 286
WP_000064228.1 | 0.819306 | 393305
WP_001296814.1 | 0.770322 | 629

The **silhouette score** is reported because users may want to report some statistical measurements of individual prediction results. But it is important not to over-interpret the silhouette scores. They are NOT likelihoods of genes being horizontally derived. They are measurements of how well particular candidate genes are clustered with other candidate genes.

Therefore, if multiple genes got horizontally transferred in a bulk, i.e., a "[genomic island](https://en.wikipedia.org/wiki/Genomic_island)", there is higher chance for them to cluster tightly, and high silhouette scores are expected. In contrast, individual HGT-derived genes may be located far from the cluster core (hence moderate silhouette scores), but in the right direction (low close, high distal), and that is still a strong implication of HGT.

The **potential donor** is defined as the lowest common ancestor (LCA) of the top hits in the "distal" group. By default, hits with a bit score within 10% less than the best hit are considered as "top hits". This threshold is consistent with to DIAMOND's taxonomic classification function, and one can customize it using the `--distal-top` parameter.

## Final output
The resulting TaxID (or "0" if not found) is appended to the score table under column "match", as well as to the predicted HGT list as the last column (in which case it is considered as the potential donor in this HGT event).

### Scatter plot
One can let the potential donors be reported by their taxon names instead of TaxIDs suing the `--donor-name` flag.

One can force the potential donors to be reported at a certain rank using the `--donor-rank` parameter (e.g., "genus"). Donors below this rank will be raised to this rank (e.g., "_E. coli_" becomes "_Escherichia_"), however donors above this rank will be discarded. Since it is not uncommon that the true donor cannot be accurately determined using the taxonomy of extant organisms, we recommend not using this parameter, or setting it to a high rank (e.g., "phylum").

### Distribution plot

In this demo, under the default settings, 33 genes are predicted to be horizontally acquired. A scatter plot is automatically generated, showing the predicted genes (yellow) in the background of the whole genome (purple).

Expand Down Expand Up @@ -396,34 +416,4 @@ fig.savefig('scatter.enhance.png')
![o55h7.auto.scatter.x](img/o55h7.auto.scatter.x.png "Distal vs. close scatter plot auto enhanced")


## Potential donors

You may be interested in knowing which organism(s) did those predicted genes come from.

HGTector reports potential donors by summarizing the top several hits from the "distal" group of each gene. Specifically, it finds the lowest common ancestor (LCA) of hits whose bit scores are only lower than the top hit within a certain range (default: 10%, which is consistent with DIAMOND, controlled by parameter `--distal-top`). The resulting TaxID (or "0" if not found) is appended to the score table, as the last column "match".

You can label HGT candidates with potential donor TaxIDs by:

```bash
join -t$'\t' -j1 <(sort hgts/o55h7.txt) <(tail -n+2 scores.tsv | grep ^o55h7$'\t' | cut -f2,8 | sort) > o55h7.donor.txt
```

You can further append taxon names to the table by:

```bash
join -t$'\t' -13 -21 -o 1.1,1.2,1.3,2.2 <(sort -k3,3 o55h7.donor.txt) <(grep 'scientific name' <taxdump_dir>/names.dmp | sed 's/\t|\t/\t/g' | cut -f1,2 | sort -k1,1) > o55h7.donor.name.txt
```

The output table will be like (taking the grid search result for example):

Proten | Score | Donor TaxID | Donor name
| --- | --- | --- | --- |
WP_001285914.1 | 0.691785 | 1224 | Proteobacteria
WP_000173226.1 | 0.82044 | 1239 | Firmicutes
WP_000084086.1 | 0.764173 | 2 | Bacteria
WP_000890958.1 | 0.826649 | 214092 | Yersinia pestis CO92
WP_000026143.1 | 0.748203 | 286 | Pseudomonas
WP_000064228.1 | 0.819306 | 393305 | Yersinia enterocolitica subsp. enterocolitica 8081
WP_001296814.1 | 0.770322 | 629 | Yersinia

This tutorial should have covered major elements of the HGTector workflow. Yet the small database limits the accuracy of the analysis. Next we will see a [real run](realrun.md), using life-sized database and datasets.
18 changes: 16 additions & 2 deletions doc/analyze.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,14 @@ Field | Description
`self`, `close`, `distal` | Score (sum of normalized bit scores) of each group
`match` | Best match in "distal" group which implicates potential donor for predicted HGTs

Format of `hgts/<sample>.txt`:

Field | Description
--- | ---
1 | Protein ID
2 | Silhouette score
3 | Potential donor


## Command-line reference

Expand Down Expand Up @@ -71,8 +79,6 @@ Option | Default | Description
`--close-tax` | - | TaxIDs of "close" group (a comma-delimited string, or a file of one TaxID per line). Will auto-infer if omitted.
`--self-rank` | - | For auto-inference: "self" group must be at or above this rank (e.g., species, genus, family...).
`--close-size` | 10 | For auto-inference: "close" group must have at least this number of taxa.
`--distal-top` | 10 | Find a match in "distal" group which is LCA of hits with bit score at most this percentage lower than the best hit. The behavior is consistent with DIAMOND's `--top` parameter. This match implicates the potential donor of an HGT-derived gene.


### Scoring

Expand All @@ -99,6 +105,14 @@ Option | Default | Description
`--silhouette` | 0.5 | Silhouette score threshold for cluster refinement.
`--self-low` | no | HGT has low "self" score (an optional criterion).

### Donor reporting

Option | Default | Description
--- | --- | ---
`--distal-top` | 10 | Find a match in "distal" group which is LCA of hits with bit score at most this percentage lower than the best hit. The behavior is consistent with DIAMOND's `--top` parameter. This match implicates the potential donor of an HGT-derived gene.
`--donor-name` | - | Report taxon name instead of TaxID of donor.
`--donor-rank` | - | Report donor at this rank (e.g., genus, family...). A donor below this rank will be raise to this rank; a donor above this rank will be discarded. If unspecified, the donor will be the LCA of top hits from the distal group.

### Program behavior

Option | Default | Description
Expand Down
2 changes: 1 addition & 1 deletion doc/config.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ Therefore one may prepare analysis-specific configuration files, if the same set
Note: If you installed HGTector using Conda, this file may be located at:

```
<conda_dir>/envs/hgtector/lib/python3.7/site-packages/hgtector/config.yml
<conda_dir>/envs/hgtector/lib/python3.<x>/site-packages/hgtector/config.yml
```

Note for HGTector1 users: the configuration file used to be a must and a headache. But for HGTector2, you can completely ignore it! It is only relevant when you want to save typing commands.
6 changes: 3 additions & 3 deletions doc/install.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,12 +20,12 @@ HGTector is written in Python 3. One needs at least Python 3.6 to run the progra
```bash
conda create -n hgtector python=3 pyyaml pandas matplotlib scikit-learn
conda activate hgtector
pip install git+https://github.com/DittmarLab/HGTector.git
pip install git+https://github.com/qiyunlab/HGTector.git
```

### Option 2: Native installation

Download this [repository](https://github.com/DittmarLab/HGTector/archive/master.zip). Unzip. Then execute:
Download this [repository](https://github.com/qiyunlab/HGTector/archive/master.zip). Unzip. Then execute:

```bash
python setup.py install
Expand Down Expand Up @@ -59,7 +59,7 @@ A small, pre-compiled test database is also available for [download](https://www
Just add `--upgrade` or `-U` to the pip command:

```bash
pip install -U git+https://github.com/DittmarLab/HGTector.git
pip install -U git+https://github.com/qiyunlab/HGTector.git
```

Note: You can only upgrade from HGTector 2.0b1 or above. You cannot upgrade from older versions, which were written in Perl.
Expand Down
2 changes: 1 addition & 1 deletion example/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,6 @@ hgtector search -i gsul.txt -o .
hgtector analyze -i gsul.tsv -o .
```

The sample output files are provided in `output`. They were generated on 2019-10-16, based on the NCBI nr database by time.
The sample output files are provided in `output`. They were generated on 2019-10-16, based on the NCBI nr database by time. The file `hgts/gsul.txt` was manually modified to include potential donors (a feature of more recent versions of HGTector).

Detailed instruction for running this example and interpreting outputs is provided in [first run](../doc/1strun.md).
26 changes: 13 additions & 13 deletions example/output/hgts/gsul.txt
Original file line number Diff line number Diff line change
@@ -1,13 +1,13 @@
EME26816 0.695565
EME29520 0.676924
EME29768 0.544733
EME29968 0.800982
EME30135 0.804195
EME30198 0.793129
EME31510 0.798976
EME31723 0.792499
EME32182 0.658787
EME32633 0.796032
EME28073 0.802311
EME28324 0.795161
EME28399 0.572625
EME26816 0.695565 2
EME29520 0.676924 2
EME29768 0.544733 2759
EME29968 0.800982 131567
EME30135 0.804195 131567
EME30198 0.793129 2759
EME31510 0.798976 2759
EME31723 0.792499 2759
EME32182 0.658787 2759
EME32633 0.796032 131567
EME28073 0.802311 2759
EME28324 0.795161 131567
EME28399 0.572625 2759
8 changes: 4 additions & 4 deletions hgtector/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,11 +9,11 @@
# ----------------------------------------------------------------------------

__name__ = 'hgtector'
__version__ = '2.0b2'
__version__ = '2.0b3'
__description__ = (
'Genome-wide detection of putative horizontal gene transfer (HGT) events '
'based on sequence homology search hit distribution statistics')
__license__ = 'BSD-3-Clause'
__author__ = 'Qiyun Zhu',
__email__ = '[email protected]',
__url__ = 'https://github.com/DittmarLab/HGTector'
__author__ = 'Qiyun Zhu'
__email__ = '[email protected]'
__url__ = 'https://github.com/qiyunlab/HGTector'
57 changes: 44 additions & 13 deletions hgtector/analyze.py
Original file line number Diff line number Diff line change
Expand Up @@ -70,9 +70,6 @@
'this rank'],
['--close-size', 'for auto-inference: "close" group must have at least '
'this number of taxa', {'type': int}],
['--distal-top', 'find match in "distal" group which is LCA of hits with '
'bit score at most this percentage lower than best hit',
{'type': int}],

'scoring',
['--weighted', 'score is sum of weighted bit scores; otherwise simple '
Expand Down Expand Up @@ -101,6 +98,14 @@
['--self-low', 'HGT has low "self" score (an optional criterion)',
{'choices': ['yes', 'no']}],

'donor reporting',
['--distal-top', 'find match in "distal" group which is LCA of hits with '
'bit score at most this percentage lower than best hit',
{'type': int}],
['--donor-name', 'report taxon name instead of taxId of donor.',
{'action': 'store_true'}],
['--donor-rank', 'report donor at this rank (genus, phylum, etc.).'],

'program behavior',
['--from-scores', 'if score table already exists, use it and skip search '
'result parsing and taxonomy inference; otherwise '
Expand Down Expand Up @@ -202,14 +207,17 @@ def set_parameters(self, args):
get_config(self, 'evalue', 'analyze.evalue', float)
for key in ('maxhits', 'identity', 'coverage'):
get_config(self, key, f'analyze.{key}')
for key in ('input_cov', 'self_rank', 'close_size', 'distal_top'):
for key in ('input_cov', 'self_rank', 'close_size'):
get_config(self, key, f'grouping.{key.replace("_", "")}')
for key in ('weighted', 'outliers', 'orphans', 'bandwidth', 'bw_steps',
'low_part', 'noise', 'fixed', 'silhouette', 'self_low'):
get_config(self, key, f'predict.{key.replace("_", "")}')
get_config(self, 'distal_top', 'donor.distaltop')
for key in ('name', 'rank'):
get_config(self, f'donor_{key}', f'donor.{key}')

# convert boolean values
for key in ('weighted', 'orphans', 'self_low'):
for key in ('weighted', 'orphans', 'self_low', 'donor_name'):
setattr(self, key, arg2bool(getattr(self, key, None)))

# convert fractions to percentages
Expand Down Expand Up @@ -572,17 +580,14 @@ def calc_scores(self):

def find_match(self, df):
"""Find a taxId that best describes top hits.
Parameters
----------
df : pd.DataFrame
hit table
Returns
-------
str
taxId of match, or '0' if not found
Notes
-----
The best match taxId is the LCA of top hits. The "top hits" are
Expand Down Expand Up @@ -786,17 +791,43 @@ def predict_hgt(self):
print('Predicted HGTs by sample:')
makedirs(join(self.output, 'hgts'), exist_ok=True)
for sample in self.df['sample'].unique():
df_ = self.df[self.df['hgt'] & (self.df['sample'] == sample)]
print(f' {sample}: {df_.shape[0]}.')
df_[['protein', 'silh']].to_csv(
join(self.output, 'hgts', f'{sample}.txt'),
sep='\t', index=False, header=False, float_format='%g')
self.write_hgt_list(sample)
print('Prediction results saved to hgts/.')

# plot prediction results
self.plot_hgts()
return self.df[self.df['hgt']].shape[0]

def write_hgt_list(self, sample):
"""Write a list of predicted HGTs to an output file.
Parameters
----------
sample : str
sample Id
Notes
-----
The output file has three columns: protein Id, silhouette score,
potential donor.
The donor will be the LCA of top hits as determined by `find_match`.
However, if `donor_rank` is specified, a donor below this rank will be
raise to this rank; a donor above this rank will be discarded.
"""
taxdump, name, rank = self.taxdump, self.donor_name, self.donor_rank
df_ = self.df[self.df['hgt'] & (self.df['sample'] == sample)]
print(f' {sample}: {df_.shape[0]}.')
with open(join(self.output, 'hgts', f'{sample}.txt'), 'w') as f:
for row in df_[['protein', 'silh', 'match']].itertuples():
# format donor taxon
match = row.match
if rank and match != '0':
match = taxid_at_rank(match, rank, self.taxdump) or '0'
if name:
match = taxdump[match]['name'] if match != '0' else 'N/A'
f.write(f'{row.protein}\t{row.silh:g}\t{match}\n')

def cluster_kde(self, group):
"""Cluster data by KDE.
Expand Down
20 changes: 15 additions & 5 deletions hgtector/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -212,11 +212,6 @@ grouping:
# statistically informative, but consider biological question)
closesize: 10

# find a taxId that best describes the potential donor of a gene if it is
# HGT-derived; it is the LCA of distal hits with bit score at most this
# percentage lower than the best distal hit
distaltop: 10


## HGT prediction statistics
predict:
Expand Down Expand Up @@ -269,4 +264,19 @@ predict:
# low
selflow: no


## Potential donor reporting
donor:

# find a taxId that best describes the potential donor of a gene if it is
# HGT-derived; it is the LCA of distal hits with bit score at most this
# percentage lower than the best distal hit
distaltop: 10

# report taxon name instead of taxId
name: no

# report donor at this rank
rank:

...
Loading

0 comments on commit 45dfb1d

Please sign in to comment.