Merge pull request #66 from qiyunlab/2.0b3

added donor reporting
qiyunlab · Dec 28, 2020 · 45dfb1d · 45dfb1d
2 parents 438c2a8 + 32a4646
commit 45dfb1d
Show file tree

Hide file tree

Showing 14 changed files with 211 additions and 80 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,6 +1,16 @@
 Change Log
 ==========
 
+## Version 2.0b3 (12/27/2020)
+
+### Added
+- Predicted HGT list now includes potential donors. Users can optionally specify a taxonomic rank at which they will be reported.
+
+### Changed
+- Repository transferred from [DittmarLab](https://github.com/DittmarLab) to [qiyunlab](https://github.com/qiyunlab).
+- Minor tweaks with no visible impact on program behavior.
+
+
 ## Version 2.0b2 (5/3/2020)
 
 Note: The command-line interface and the format and content of input, output and database files are identical to last version (2.0b1). You may safely combine analysis results of this and last versions without the need for re-analysis.

diff --git a/README.md b/README.md
@@ -1,6 +1,8 @@
 HGTector2
 =========
 
+The development of HGTector is now at [qiyunlab](https://qiyunlab.github.io/). Versions starting from 2.0b3 will be released from this repo. Please access HGTector using the new URL: https://github.com/qiyunlab/HGTector.
+
 **HGTector2** is a completely re-engineered software tool, featuring a fully automated analytical pipeline with smart determination of parameters which requires minimum human involvement, a re-designed command-line interface which facilitates standardized scientific computing, and a high-quality Python 3 codebase.
 
 **HGTector** is a computational pipeline for genome-wide detection of putative horizontal gene transfer (HGT) events based on sequence homology search hit distribution statistics.

diff --git a/doc/1strun.md b/doc/1strun.md
@@ -125,13 +125,13 @@ Refining cluster... done.
 
 Starting from this candidate list, the program performs a refinement in the 2D space. This process involves calculating the "**silhouette scores**", a metric which measures how confident a data point is assigned to the current cluster in the whole dataset. The higher the better. The program drops low-score genes from the candidate list, and the refined list will be the final result.
 
-The 13 predicted HGT-derived genes, together with their silhouette scores, are listed in [hgts/gsul.txt](../example/output/hgts/gsul.txt). Meanwhile, a scatter plot showing "distal" score vs. "close" score is generated. The yellow dots represent the predicted genes. As you can see, they are all located at the right boundary of the plot, meaning they have no to few "close" hits, but a decent amount of "distal" hits.
+The 13 predicted HGT-derived genes, together with their silhouette scores and potential donors (a rough range), are listed in [hgts/gsul.txt](../example/output/hgts/gsul.txt). Meanwhile, a scatter plot showing "distal" score vs. "close" score is generated. The yellow dots represent the predicted genes. As you can see, they are all located at the right boundary of the plot, meaning they have no to few "close" hits, but a decent amount of "distal" hits.
 
 ![gsul.scatter](../example/output/scatter.png "Distal vs. close scatter plot")
 
 ## Aftermath
 
-Let's do a sanity check on the result. The original paper, [Schönknecht _et al_. (2013)](https://science.sciencemag.org/content/339/6124/1207.long) reported that two genes encoding for ArsB (arsenite efflux protein) (`EME29520` and `EME26816`) were horizontally acquired (see Figure 3 of the paper). You should find that the two genes are among the 13 predicted genes in this small and quick analysis.
+Let's do a sanity check on the result. The original paper, [Schönknecht _et al_. (2013)](https://science.sciencemag.org/content/339/6124/1207.long) reported that two genes encoding for ArsB (arsenite efflux protein) (`EME29520` and `EME26816`) were horizontally acquired from Bacteria (see Figure 3 of the paper). You should find that the two genes are among the 13 predicted genes in this small and quick analysis.
 
 Next, one may examine these predicted genes and their hit tables. This will provide clues for further inferring the putative donors of the genes. For example, in the hit table of `EME29520`, there are multiple hits from beta- and gammaproteobacteria...
 

diff --git a/doc/2ndrun.md b/doc/2ndrun.md
@@ -324,16 +324,36 @@ Genes that are located in the "grey zone" (i.e., not-so-low "close" score, not-s
 
 In this case no gene is dropped.
 
-The refined list of putatively HGT-derived genes is printed to `hgts/o55h7.txt`. This file also reports the silhouette score of each candidate gene. I added this function because users may want to report some statistical measurements of individual prediction results.
 
-But it is important not to over-interpret the silhouette scores. They are NOT likelihoods of genes being horizontally derived. They are measurements of how well particular candidate genes are clustered with other candidate genes.
+## Final output
+
+### Predicted HGTs
+
+The refined list of putatively HGT-derived genes is printed to `hgts/o55h7.txt`. This file also reports the silhouette score and the potential donor of each candidate gene. It looks like (taking the grid search result for example):
+
+Proten | Score | Donor
+| --- | --- | --- |
+WP_001285914.1 | 0.691785 | 1224
+WP_000173226.1 | 0.82044 | 1239
+WP_000084086.1 | 0.764173 | 2
+WP_000890958.1 | 0.826649 | 214092
+WP_000026143.1 | 0.748203 | 286
+WP_000064228.1 | 0.819306 | 393305
+WP_001296814.1 | 0.770322 | 629
+
+The **silhouette score** is reported because users may want to report some statistical measurements of individual prediction results. But it is important not to over-interpret the silhouette scores. They are NOT likelihoods of genes being horizontally derived. They are measurements of how well particular candidate genes are clustered with other candidate genes.
 
 Therefore, if multiple genes got horizontally transferred in a bulk, i.e., a "[genomic island](https://en.wikipedia.org/wiki/Genomic_island)", there is higher chance for them to cluster tightly, and high silhouette scores are expected. In contrast, individual HGT-derived genes may be located far from the cluster core (hence moderate silhouette scores), but in the right direction (low close, high distal), and that is still a strong implication of HGT.
 
+The **potential donor** is defined as the lowest common ancestor (LCA) of the top hits in the "distal" group. By default, hits with a bit score within 10% less than the best hit are considered as "top hits". This threshold is consistent with to DIAMOND's taxonomic classification function, and one can customize it using the `--distal-top` parameter.
 
-## Final output
+The resulting TaxID (or "0" if not found) is appended to the score table under column "match", as well as to the predicted HGT list as the last column (in which case it is considered as the potential donor in this HGT event).
 
-### Scatter plot
+One can let the potential donors be reported by their taxon names instead of TaxIDs suing the `--donor-name` flag.
+
+One can force the potential donors to be reported at a certain rank using the `--donor-rank` parameter (e.g., "genus"). Donors below this rank will be raised to this rank (e.g., "_E. coli_" becomes "_Escherichia_"), however donors above this rank will be discarded. Since it is not uncommon that the true donor cannot be accurately determined using the taxonomy of extant organisms, we recommend not using this parameter, or setting it to a high rank (e.g., "phylum").
+
+### Distribution plot
 
 In this demo, under the default settings, 33 genes are predicted to be horizontally acquired. A scatter plot is automatically generated, showing the predicted genes (yellow) in the background of the whole genome (purple).
 
@@ -396,34 +416,4 @@ fig.savefig('scatter.enhance.png')
 ![o55h7.auto.scatter.x](img/o55h7.auto.scatter.x.png "Distal vs. close scatter plot auto enhanced")
 
 
-## Potential donors
-
-You may be interested in knowing which organism(s) did those predicted genes come from.
-
-HGTector reports potential donors by summarizing the top several hits from the "distal" group of each gene. Specifically, it finds the lowest common ancestor (LCA) of hits whose bit scores are only lower than the top hit within a certain range (default: 10%, which is consistent with DIAMOND, controlled by parameter `--distal-top`). The resulting TaxID (or "0" if not found) is appended to the score table, as the last column "match".
-
-You can label HGT candidates with potential donor TaxIDs by:
-
-```bash
-join -t$'\t' -j1 <(sort hgts/o55h7.txt) <(tail -n+2 scores.tsv | grep ^o55h7$'\t' | cut -f2,8 | sort) > o55h7.donor.txt
-```
-
-You can further append taxon names to the table by:
-
-```bash
-join -t$'\t' -13 -21 -o 1.1,1.2,1.3,2.2 <(sort -k3,3 o55h7.donor.txt) <(grep 'scientific name' <taxdump_dir>/names.dmp | sed 's/\t|\t/\t/g' | cut -f1,2 | sort -k1,1) > o55h7.donor.name.txt
-```
-
-The output table will be like (taking the grid search result for example):
-
-Proten | Score | Donor TaxID | Donor name
-| --- | --- | --- | --- |
-WP_001285914.1 | 0.691785 | 1224 | Proteobacteria
-WP_000173226.1 | 0.82044 | 1239 | Firmicutes
-WP_000084086.1 | 0.764173 | 2 | Bacteria
-WP_000890958.1 | 0.826649 | 214092 | Yersinia pestis CO92
-WP_000026143.1 | 0.748203 | 286 | Pseudomonas
-WP_000064228.1 | 0.819306 | 393305 | Yersinia enterocolitica subsp. enterocolitica 8081
-WP_001296814.1 | 0.770322 | 629 | Yersinia
-
 This tutorial should have covered major elements of the HGTector workflow. Yet the small database limits the accuracy of the analysis. Next we will see a [real run](realrun.md), using life-sized database and datasets.
diff --git a/doc/analyze.md b/doc/analyze.md
@@ -36,6 +36,14 @@ Field | Description
 `self`, `close`, `distal` | Score (sum of normalized bit scores) of each group
 `match` | Best match in "distal" group which implicates potential donor for predicted HGTs
 
+Format of `hgts/<sample>.txt`:
+
+Field | Description
+--- | ---
+1 | Protein ID
+2 | Silhouette score
+3 | Potential donor
+
 
 ## Command-line reference
 
@@ -71,8 +79,6 @@ Option | Default | Description
 `--close-tax` | - | TaxIDs of "close" group (a comma-delimited string, or a file of one TaxID per line). Will auto-infer if omitted.
 `--self-rank` | - | For auto-inference: "self" group must be at or above this rank (e.g., species, genus, family...).
 `--close-size` | 10 | For auto-inference: "close" group must have at least this number of taxa.
-`--distal-top` | 10 | Find a match in "distal" group which is LCA of hits with bit score at most this percentage lower than the best hit. The behavior is consistent with DIAMOND's `--top` parameter. This match implicates the potential donor of an HGT-derived gene.
-
 
 ### Scoring
 
@@ -99,6 +105,14 @@ Option | Default | Description
 `--silhouette` | 0.5 | Silhouette score threshold for cluster refinement.
 `--self-low` | no | HGT has low "self" score (an optional criterion).
 
+### Donor reporting
+
+Option | Default | Description
+--- | --- | ---
+`--distal-top` | 10 | Find a match in "distal" group which is LCA of hits with bit score at most this percentage lower than the best hit. The behavior is consistent with DIAMOND's `--top` parameter. This match implicates the potential donor of an HGT-derived gene.
+`--donor-name` | - | Report taxon name instead of TaxID of donor.
+`--donor-rank` | - | Report donor at this rank (e.g., genus, family...). A donor below this rank will be raise to this rank; a donor above this rank will be discarded. If unspecified, the donor will be the LCA of top hits from the distal group.
+
 ### Program behavior
 
 Option | Default | Description

diff --git a/doc/config.md b/doc/config.md
@@ -32,7 +32,7 @@ Therefore one may prepare analysis-specific configuration files, if the same set
 Note: If you installed HGTector using Conda, this file may be located at:
 
 ```
-<conda_dir>/envs/hgtector/lib/python3.7/site-packages/hgtector/config.yml
+<conda_dir>/envs/hgtector/lib/python3.<x>/site-packages/hgtector/config.yml
 ```
 
 Note for HGTector1 users: the configuration file used to be a must and a headache. But for HGTector2, you can completely ignore it! It is only relevant when you want to save typing commands.
diff --git a/doc/install.md b/doc/install.md
@@ -20,12 +20,12 @@ HGTector is written in Python 3. One needs at least Python 3.6 to run the progra
 ```bash
 conda create -n hgtector python=3 pyyaml pandas matplotlib scikit-learn
 conda activate hgtector
-pip install git+https://github.com/DittmarLab/HGTector.git
+pip install git+https://github.com/qiyunlab/HGTector.git
 ```
 
 ### Option 2: Native installation
 
-Download this [repository](https://github.com/DittmarLab/HGTector/archive/master.zip). Unzip. Then execute:
+Download this [repository](https://github.com/qiyunlab/HGTector/archive/master.zip). Unzip. Then execute:
 
 ```bash
 python setup.py install
@@ -59,7 +59,7 @@ A small, pre-compiled test database is also available for [download](https://www
 Just add `--upgrade` or `-U` to the pip command:
 
 ```bash
-pip install -U git+https://github.com/DittmarLab/HGTector.git
+pip install -U git+https://github.com/qiyunlab/HGTector.git
 ```
 
 Note: You can only upgrade from HGTector 2.0b1 or above. You cannot upgrade from older versions, which were written in Perl.

diff --git a/example/README.md b/example/README.md
@@ -11,6 +11,6 @@ hgtector search -i gsul.txt -o .
 hgtector analyze -i gsul.tsv -o .
 ```
 
-The sample output files are provided in `output`. They were generated on 2019-10-16, based on the NCBI nr database by time.
+The sample output files are provided in `output`. They were generated on 2019-10-16, based on the NCBI nr database by time. The file `hgts/gsul.txt` was manually modified to include potential donors (a feature of more recent versions of HGTector).
 
 Detailed instruction for running this example and interpreting outputs is provided in [first run](../doc/1strun.md).
diff --git a/example/output/hgts/gsul.txt b/example/output/hgts/gsul.txt
@@ -1,13 +1,13 @@
-EME26816	0.695565
-EME29520	0.676924
-EME29768	0.544733
-EME29968	0.800982
-EME30135	0.804195
-EME30198	0.793129
-EME31510	0.798976
-EME31723	0.792499
-EME32182	0.658787
-EME32633	0.796032
-EME28073	0.802311
-EME28324	0.795161
-EME28399	0.572625
+EME26816	0.695565	2
+EME29520	0.676924	2
+EME29768	0.544733	2759
+EME29968	0.800982	131567
+EME30135	0.804195	131567
+EME30198	0.793129	2759
+EME31510	0.798976	2759
+EME31723	0.792499	2759
+EME32182	0.658787	2759
+EME32633	0.796032	131567
+EME28073	0.802311	2759
+EME28324	0.795161	131567
+EME28399	0.572625	2759
diff --git a/hgtector/__init__.py b/hgtector/__init__.py
@@ -9,11 +9,11 @@
 # ----------------------------------------------------------------------------
 
 __name__ = 'hgtector'
-__version__ = '2.0b2'
+__version__ = '2.0b3'
 __description__ = (
     'Genome-wide detection of putative horizontal gene transfer (HGT) events '
     'based on sequence homology search hit distribution statistics')
 __license__ = 'BSD-3-Clause'
-__author__ = 'Qiyun Zhu',
-__email__ = '[email protected]',
-__url__ = 'https://github.com/DittmarLab/HGTector'
+__author__ = 'Qiyun Zhu'
+__email__ = '[email protected]'
+__url__ = 'https://github.com/qiyunlab/HGTector'
diff --git a/hgtector/analyze.py b/hgtector/analyze.py
@@ -70,9 +70,6 @@
                      'this rank'],
     ['--close-size', 'for auto-inference: "close" group must have at least '
                      'this number of taxa', {'type': int}],
-    ['--distal-top', 'find match in "distal" group which is LCA of hits with '
-                     'bit score at most this percentage lower than best hit',
-                     {'type': int}],
 
     'scoring',
     ['--weighted',   'score is sum of weighted bit scores; otherwise simple '
@@ -101,6 +98,14 @@
     ['--self-low',   'HGT has low "self" score (an optional criterion)',
                      {'choices': ['yes', 'no']}],
 
+    'donor reporting',
+    ['--distal-top', 'find match in "distal" group which is LCA of hits with '
+                     'bit score at most this percentage lower than best hit',
+                     {'type': int}],
+    ['--donor-name', 'report taxon name instead of taxId of donor.',
+                     {'action': 'store_true'}],
+    ['--donor-rank', 'report donor at this rank (genus, phylum, etc.).'],
+
     'program behavior',
     ['--from-scores', 'if score table already exists, use it and skip search '
                       'result parsing and taxonomy inference; otherwise '
@@ -202,14 +207,17 @@ def set_parameters(self, args):
         get_config(self, 'evalue', 'analyze.evalue', float)
         for key in ('maxhits', 'identity', 'coverage'):
             get_config(self, key, f'analyze.{key}')
-        for key in ('input_cov', 'self_rank', 'close_size', 'distal_top'):
+        for key in ('input_cov', 'self_rank', 'close_size'):
             get_config(self, key, f'grouping.{key.replace("_", "")}')
         for key in ('weighted', 'outliers', 'orphans', 'bandwidth', 'bw_steps',
                     'low_part', 'noise', 'fixed', 'silhouette', 'self_low'):
             get_config(self, key, f'predict.{key.replace("_", "")}')
+        get_config(self, 'distal_top', 'donor.distaltop')
+        for key in ('name', 'rank'):
+            get_config(self, f'donor_{key}', f'donor.{key}')
 
         # convert boolean values
-        for key in ('weighted', 'orphans', 'self_low'):
+        for key in ('weighted', 'orphans', 'self_low', 'donor_name'):
             setattr(self, key, arg2bool(getattr(self, key, None)))
 
         # convert fractions to percentages
@@ -572,17 +580,14 @@ def calc_scores(self):
 
     def find_match(self, df):
         """Find a taxId that best describes top hits.
-
         Parameters
         ----------
         df : pd.DataFrame
             hit table
-
         Returns
         -------
         str
             taxId of match, or '0' if not found
-
         Notes
         -----
         The best match taxId is the LCA of top hits. The "top hits" are
@@ -786,17 +791,43 @@ def predict_hgt(self):
         print('Predicted HGTs by sample:')
         makedirs(join(self.output, 'hgts'), exist_ok=True)
         for sample in self.df['sample'].unique():
-            df_ = self.df[self.df['hgt'] & (self.df['sample'] == sample)]
-            print(f'  {sample}: {df_.shape[0]}.')
-            df_[['protein', 'silh']].to_csv(
-                join(self.output, 'hgts', f'{sample}.txt'),
-                sep='\t', index=False, header=False, float_format='%g')
+            self.write_hgt_list(sample)
         print('Prediction results saved to hgts/.')
 
         # plot prediction results
         self.plot_hgts()
         return self.df[self.df['hgt']].shape[0]
 
+    def write_hgt_list(self, sample):
+        """Write a list of predicted HGTs to an output file.
+
+        Parameters
+        ----------
+        sample : str
+            sample Id
+
+        Notes
+        -----
+        The output file has three columns: protein Id, silhouette score,
+        potential donor.
+
+        The donor will be the LCA of top hits as determined by `find_match`.
+        However, if `donor_rank` is specified, a donor below this rank will be
+        raise to this rank; a donor above this rank will be discarded.
+        """
+        taxdump, name, rank = self.taxdump, self.donor_name, self.donor_rank
+        df_ = self.df[self.df['hgt'] & (self.df['sample'] == sample)]
+        print(f'  {sample}: {df_.shape[0]}.')
+        with open(join(self.output, 'hgts', f'{sample}.txt'), 'w') as f:
+            for row in df_[['protein', 'silh', 'match']].itertuples():
+                # format donor taxon
+                match = row.match
+                if rank and match != '0':
+                    match = taxid_at_rank(match, rank, self.taxdump) or '0'
+                if name:
+                    match = taxdump[match]['name'] if match != '0' else 'N/A'
+                f.write(f'{row.protein}\t{row.silh:g}\t{match}\n')
+
     def cluster_kde(self, group):
         """Cluster data by KDE.
 

diff --git a/hgtector/config.yml b/hgtector/config.yml
@@ -212,11 +212,6 @@ grouping:
   # statistically informative, but consider biological question)
   closesize: 10
 
-  # find a taxId that best describes the potential donor of a gene if it is
-  # HGT-derived; it is the LCA of distal hits with bit score at most this
-  # percentage lower than the best distal hit
-  distaltop: 10
-
 
 ## HGT prediction statistics
 predict:
@@ -269,4 +264,19 @@ predict:
   # low
   selflow: no
 
+
+## Potential donor reporting
+donor:
+
+  # find a taxId that best describes the potential donor of a gene if it is
+  # HGT-derived; it is the LCA of distal hits with bit score at most this
+  # percentage lower than the best distal hit
+  distaltop: 10
+
+  # report taxon name instead of taxId
+  name: no
+
+  # report donor at this rank
+  rank:
+
 ...