Merge pull request #42 from Darcy220606/dev

Merge dev into main
Darcy220606 · Oct 18, 2022 · 362eca0
2 parents 4d799b6 + ac30c8a
commit 362eca0

Showing 9 changed files with 113 additions and 62 deletions.
diff --git a/README.md b/README.md
@@ -1,10 +1,10 @@
-# AMPcombi : Antimicorbial peptides parsing and functional classification tool
+# AMPcombi : AntiMicrobial Peptides parsing and functional classification tool
 
 # ![Logo](docs/amp-combi-logo.png)
 
-This tool parses the results of amp prediction tools into a single table and aligns the hits against a reference database of antimicrobial peptides for functional classifications.
+This tool parses the results of antimicrobial peptide (AMP) prediction tools into a single table and aligns the hits against a reference AMP database for functional classifications.
 
-For parsing: AMpcombi is developed to parse the output of these **amp prediction tools**:
+For parsing: AMPcombi is developed to parse the output of these **AMP prediction tools**:
 
 | Tool | Version | Link |
 | ------------- | ------------- | ------------- |
@@ -15,7 +15,7 @@ For parsing: AMpcombi is developed to parse the output of these **amp prediction
 | EnsembleAMPpred  | - | https://pubmed.ncbi.nlm.nih.gov/33494403/ |
 | NeuBI  | -  | https://github.com/nafizh/NeuBI |
 
-For classification: AMPcombi is developed to offer functional annotation of the detcted AMPs by alignemnt to **AMP reference databases**, for e.g.,:
+For classification: AMPcombi is developed to offer functional annotation of the detected AMPs by alignment to **AMP reference databases**, e.g.:
 
 | Tool | Version | Link |
 | ------------- | ------------- | ------------- |
@@ -29,7 +29,7 @@ Alignment to the reference database is done using [diamond blastp v.2.0.15](http
 
 To install AMPcombi:
 
-Add dependencies of the tool; python > 3.0, biopython, pandas and diamond.
+Add dependencies of the tool; `python` > 3.0, `biopython`, `pandas` and `diamond`.
 Installation can be done using:
 
  - pip installation
@@ -46,7 +46,7 @@ conda env create -f ampcombi/environment.yml
 ```
 or
 ```
- conda install AMPcombi
+ conda install -c bioconda AMPcombi
 ```
 
 ======================
@@ -57,14 +57,18 @@ There are two basic commands to run AMPcombi:
 
 1. Using `--amp_results`
 ```console
-ampcombi --amp_results path/to/my/result_folder/ --faa_folder path/to/sample_faa_files/
+ampcombi \
+--amp_results path/to/my/result_folder/ \
+--faa_folder path/to/sample_faa_files/
 ```
 
 Here the head folder containing output files has to be given. AMPcombi finds and summarizes the output files from different tools, if the folder is structured  and named as: `/result_folder/toolsubdir/samplesubdir/sample.tool.filetype`. 
- - Note that the filetype ending might vary and can be specified with `--tooldict`, if it is different from the default.
+ - Note that the filetype ending might vary and can be specified with `--tooldict` if it differs from the default. When passing a dictionary on the command line, it has to be given as a string in single quotes `' '`, with the dictionary keys and values in double quotes `" "`, e.g. `'{"key1":"item1", "key2":"item2"}'`
 
 The path to the folder containing the respective protein fasta files has to be provided with `--faa_folder`. The files have to be named with `<samplename>.faa`.
 
+Structure of the results folder:
+
 ```console
 amp_results/
 ├── tool_1/
@@ -87,43 +91,68 @@ amp_results/
 2. Using `--path_list` and `--sample_list`
 
 ```console
-ampcombi --path_list [[list of paths to sample_1-outputs][list of paths to sample_2-outputs]] --sample_list [sample_1, sample_2] --faa_folder path/to/sample_faa_files/
+ampcombi \
+--path_list path_to_sample_1_tool_1.csv path_to_sample_1_tool_2.csv \
+--path_list path_to_sample_2_tool_1.csv path_to_sample_2_tool_2.csv \
+--sample_list sample_1 sample_2 \
+--faa_folder path/to/sample_faa_files/
 ```
 
-Here the paths to the output-files to be summarized can be given as a list for each sample. Together with this option a list of sample-names has to be supplied. 
+Here the paths to the output-files to be summarized can be given by `--path_list` for each sample. Together with this option a list of sample-names has to be supplied.
 The path to the folder containing the respective protein fasta files has to be provided with `--faa_folder`. The files have to be named with `<samplename>.faa`.
 
 
 ### Input options:
 | command | definition | default | example |
 | ------------- | ------------- | ------------- | ------------- |
 | --amp_results | path to the folder containing different tool's output files | ./test_files/ | ../amp_results/ |
-| --sample_list  | list of samples' names | [] | [sample_1, sample_2] |
-| --path_list  | list of paths to output files | [] | [[paths to sample_1 output], [paths to sample_2 outputs]] |
-| --outdir  | name of the output directory | ./ampcombi_results/ | ./ampcombi_results/ |
+| --sample_list  | list of samples' names | - | sample_1 sample_2 |
+| --path_list  | list of paths to output files | - | path_to_sample_1_tool_1.csv path_to_sample_1_tool_2.csv |
 | --cutoff  | probability cutoff to filter AMPs | 0 | 0.5 |
 | --faa_folder  | path to the folder containing the samples' .faa files. Filenames have to contain the corresponding sample name, e.g. sample_1.faa | ./test_faa/ | ./faa_files/|
-| --tooldict | dictionary of AMP-tools and their respective output file endings | {'ampir':'ampir.tsv', 'amplify':'amplify.tsv', 'macrel':'macrel.tsv', 'hmmer_hmmsearch':'hmmsearch.txt', 'ensembleamppred':'ensembleamppred.txt'} | - |
+| --tooldict | dictionary of AMP-tools and their respective output file endings | '{"ampir":"ampir.tsv", "amplify":"amplify.tsv", "macrel":"macrel.tsv", "hmmer_hmmsearch":"hmmsearch.txt", "ensembleamppred":"ensembleamppred.txt"}' | - |
 | --amp_database | path to the folder containing the reference database files: (1) a fasta file with <.fasta> file extension and (2) the corresponding table with functional and taxonomic classifications in <.tsv> file extension | [DRAMP 'general amps'](http://dramp.cpu-bioinfor.org/downloads/) database | ./amp_ref_database/ |
+| --complete_summary | Concatenates all samples' summarized tables into one | False | True |
+| --log  | print messages into log file instead of stdout | False | True |
+| --version  | print the version number into stdout | - | 0.1.4 |
 
- - Note: The fasta file corresponding to the AMP database should not contain any characters other than ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W',',Y']
-  - Note: The refernce database table should be tab delimited.
+ - Note: The fasta file corresponding to the AMP database should not contain any characters other than ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y']
+  - Note: The reference database table should be tab delimited.
 
+### Output:
+The output will be written into your working directory, containing the following files and folders:
+```console
+<pwd>/
+├── amp_ref_database/
+|   ├── amp_ref.dmnd
+|   ├── general_amps_<DATE>_clean.fasta
+|   └── general_amps_<DATE>.tsv
+├── sample_1/
+|   ├── sample_1_amp.faa
+|   ├── sample_1_ampcombi.csv
+|   └── sample_1_diamond_matches.txt
+├── sample_2/
+|   ├── sample_2_amp.faa
+|   ├── sample_2_ampcombi.csv
+|   └── sample_2_diamond_matches.txt
+├── AMPcombi_summary.csv
+└── ampcombi.log
+```
 
 ======================
 ## Contribution:
 ======================
 
-AMPcombi is a tool developed for parsing results from published AMP prediction tools. We therfore welcome fellow contributers who would like to add new AMP prediction tools results for parsing and alignment.
+AMPcombi is a tool developed for parsing results from published AMP prediction tools. We therefore welcome fellow contributors who would like to add new AMP prediction tools results for parsing and alignment.
 
 ### Adding a new tool to AMPcombi
-In `ampcombi/reformat_tables.py` 
-- add a new tool function to read the output to a pandas dataframe
+In `ampcombi/reformat_tables.py`
+- add a new tool function to read the output to a pandas dataframe and return two columns named `contig_id` and `prob_<toolname>`
 - add the new function to the `read_path` function
 
 
 In `ampcombi/main.py`
-- add your default `tool:tool.fileending`to the default of `--tooldict`
+- add your default `tool:tool.fileending` to the default of `--tooldict`
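
As an illustration of the first step, a reader for a new tool might look like the sketch below. The tool name `mytool`, its input columns `seq_id` and `score`, the function name, and the `>=` comparison against the cutoff are all hypothetical; only the returned column names `contig_id` and `prob_<toolname>` come from the README.

```python
import io
import pandas as pd

def mytool_reader(path, p):
    """Hypothetical reader: load a tab-separated tool output and return
    the two columns AMPcombi expects, filtered by the probability cutoff p."""
    df = pd.read_csv(path, sep="\t")
    # 'seq_id' and 'score' are assumed column names of the imaginary tool
    df = df[["seq_id", "score"]].rename(
        columns={"seq_id": "contig_id", "score": "prob_mytool"}
    )
    # assumption: hits at or above the cutoff are kept
    return df[df["prob_mytool"] >= p].reset_index(drop=True)

# Small in-memory example instead of a real output file
example = io.StringIO("seq_id\tscore\ncontig_1\t0.9\ncontig_2\t0.2\n")
result = mytool_reader(example, 0.5)
print(result)
```

The new function would then still need to be registered in `read_path` and given a default file ending in `--tooldict`, as the list above describes.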
 
 
 ======================

diff --git a/ampcombi/ampcombi.py b/ampcombi/ampcombi.py
@@ -3,13 +3,16 @@
 import os
 import argparse
 import warnings
+from contextlib import redirect_stdout
+from version import __version__
+import json
+import os.path
 # import functions from sub-scripts to main:
 from reformat_tables import *
 from amp_fasta import *
 from check_input import *
 from amp_database import *
 from print_header import *
-from contextlib import redirect_stdout
 
 # Define input arguments:
 parser = argparse.ArgumentParser(prog = 'ampcombi', formatter_class=argparse.RawDescriptionHelpFormatter,
@@ -26,22 +29,23 @@
 
 parser.add_argument("--amp_results", dest="amp", nargs='?', help="Enter the path to the folder that contains the different tool's output files in sub-folders named by sample name. \n If paths are to be inferred, sub-folders in this results-directory have to be organized like '/amp_results/toolsubdir/samplesubdir/tool.sample.filetype' \n (default: %(default)s)",
                     type=str, default="./test_files/")
-parser.add_argument("--sample_list", dest="samples", nargs='*', help="Enter a list of sample-names, e.g. ['sample_1', 'sample_2', 'sample_n']. \n If not given, the sample-names will be inferred from the folder structure",
+parser.add_argument("--sample_list", dest="samples", nargs='*', help="Enter a list of sample-names, e.g. sample_1 sample_2 sample_n. \n If not given, the sample-names will be inferred from the folder structure",
                     default=[])
-parser.add_argument("--path_list", dest="files", nargs='*', action='append', help="Enter the list of paths to the files to be summarized as a list of lists, e.g. [['path/to/my/sample1.ampir.tsv', 'path/to/my/sample1.amplify.tsv'], ['path/to/my/sample2.ampir.tsv', 'path/to/my/sample2.amplify.tsv']]. \n If not given, the file-paths will be inferred from the folder structure",
+parser.add_argument("--path_list", dest="files", nargs='*', action='append', help="Enter the list of paths to the files to be summarized as a list of lists, e.g. --path_list path/to/my/sample1.ampir.tsv path/to/my/sample1.amplify.tsv --path_list path/to/my/sample2.ampir.tsv path/to/my/sample2.amplify.tsv. \n If not given, the file-paths will be inferred from the folder structure",
                     default=[])
-parser.add_argument("--outdir", dest="out", help="Enter the name of the output directory \n (default: %(default)s)",
-                    type=str, default="./ampcombi_results/")
 parser.add_argument("--cutoff", dest="p", help="Enter the probability cutoff for AMPs \n (default: %(default)s)",
                     type=int, default=0)
 parser.add_argument("--faa_folder", dest="faa", help="Enter the path to the folder containing the reference .faa files. Filenames have to contain the corresponding sample-name, i.e. sample_1.faa \n (default: %(default)s)",
                     type=str, default='./test_faa/')
 parser.add_argument("--tooldict", dest="tools", help="Enter a dictionary of the AMP-tools used with their output file endings (as they appear in the directory tree), \n Tool-names have to be written as in default:\n default={'ampir':'ampir.tsv', 'amplify':'amplify.tsv', 'macrel':'macrel.tsv', 'hmmer_hmmsearch':'hmmsearch.txt', 'ensembleamppred':'ensembleamppred.txt'}",
-                    type=dict, default={'ampir':'ampir.tsv', 'amplify':'amplify.tsv', 'macrel':'macrel.tsv', 'neubi':'neubi.fasta', 'hmmer_hmmsearch':'hmmsearch.txt', 'ensembleamppred':'ensembleamppred.txt'})
+                    type=str, default='{"ampir":"ampir.tsv", "amplify":"amplify.tsv", "macrel":"macrel.tsv", "neubi":"neubi.fasta", "hmmer_hmmsearch":"hmmsearch.txt", "ensembleamppred":"ensembleamppred.txt"}')
 parser.add_argument("--amp_database", dest="ref_db", nargs='?', help="Enter the path to the folder containing the reference database files (.fa and .tsv); a fasta file and the corresponding table with functional and taxonomic classifications. \n (default: DRAMP database)",
                     type=str, default=None)
-parser.add_argument("--log", dest="log_file", nargs='?', help="Silences the standardoutput and captures it in a log file)",
+parser.add_argument("--complete_summary", dest="complete", nargs='?', help="Concatenates all sample summaries to one final summary",
                     type=bool, default=False)
+parser.add_argument("--log", dest="log_file", nargs='?', help="Silences the standard output and captures it in a log file",
+                    type=bool, default=False)
+parser.add_argument('--version', action='version', version='%(prog)s ' + __version__)
 
 # get command line arguments
 args = parser.parse_args()
@@ -50,21 +54,18 @@
 path = args.amp
 samplelist_in = args.samples
 filepaths_in = args.files
-outdir = args.out
 p = args.p
 faa_path = args.faa
-tooldict = args.tools
+tooldict = json.loads(args.tools)
 database = args.ref_db
+complete_summary = args.complete
 
 # additional variables
 # extract list of tools from input dictionary. If not given, default dict contains all possible tools
 tools = [key for key in tooldict]
 # extract list of tool-output file-endings. If not given, default dict contains default endings.
 fileending = [val for val in tooldict.values()]
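
The parsing steps above can be tried in isolation. The sketch below is not the AMPcombi parser; it is a reduced stand-in with a hypothetical `build_parser` helper. It assumes that `--cutoff` is meant to accept fractional probabilities (hence `type=float`) and that `--log` is meant as an on/off switch (hence `action='store_true'`). With `type=bool`, argparse calls `bool()` on the raw string, so any non-empty value, including `False`, evaluates to `True`.

```python
import argparse
import json

def build_parser():
    parser = argparse.ArgumentParser(prog="ampcombi-sketch")
    # assumption: probabilities are fractional, so float rather than int
    parser.add_argument("--cutoff", dest="p", type=float, default=0.0)
    # the dictionary arrives as a JSON string, as in the commit
    parser.add_argument("--tooldict", dest="tools", type=str,
                        default='{"ampir":"ampir.tsv", "macrel":"macrel.tsv"}')
    # assumption: a plain switch instead of type=bool
    parser.add_argument("--log", dest="log_file", action="store_true")
    return parser

args = build_parser().parse_args(
    ["--cutoff", "0.5", "--tooldict", '{"ampir":"ampir.tsv"}', "--log"]
)
tooldict = json.loads(args.tools)      # JSON string -> dict
tools = list(tooldict)                 # tool names
fileending = list(tooldict.values())   # file endings

print(args.p, args.log_file, tools, fileending)
# bool() on a non-empty string is always True, which is why type=bool misleads
print(bool("False"))
```

Because JSON requires double-quoted strings, the single-quoted Python-style dictionary from the old default would not pass `json.loads`, which is why the README asks for outer single quotes and inner double quotes.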
 
-# create output directory
-os.makedirs(outdir, exist_ok=True)
-
 # supress panda warnings
 warnings.simplefilter(action='ignore', category=FutureWarning)
 
@@ -81,7 +82,11 @@ def main_workflow():
     # check input filepaths and create list of list of filepaths per sample if input empty
     filepaths = check_pathlist(filepaths_in, samplelist, fileending, path)
     # check amp_ref_database filepaths and create a directory if input empty
-    db = check_ref_database(database, outdir)
+    db = check_ref_database(database)
+
+    # initiate a final_summary dataframe to concatenate each new sample-summary
+    if (complete_summary):
+        complete_summary_df = pd.DataFrame([])
 
     # generate summary for each sample
     amp_faa_paths = []
@@ -90,29 +95,48 @@ def main_workflow():
         main_list = []
         print('\n ########################################################## ')
         print(f'Processing AMP-files from sample: {samplelist[i]}')
-        os.makedirs(outdir + '/'+ samplelist[i], exist_ok=True)
+        os.makedirs(samplelist[i], exist_ok=True)
         # fill main_list with tool-output filepaths for sample i
         read_path(main_list, filepaths[i], p, tooldict, faa_path, samplelist[i])
         # use main_list to create the summary file for sample i
-        summary_df = summary(main_list, samplelist[i], faa_path, outdir)
+        summary_df = summary(main_list, samplelist[i], faa_path)
         # Generate the AMP-faa.fasta for sample i
-        out_path = outdir+ '/'+samplelist[i] +'/'+samplelist[i]+'_amp.faa'
+        out_path = samplelist[i] +'/'+samplelist[i]+'_amp.faa'
         faa_name = faa_path+samplelist[i]+'.faa'
         amp_fasta(summary_df, faa_name, out_path)
         amp_faa_paths.append(out_path)
-        print(f'The fasta containing AMP sequences for {samplelist[i]} was saved to {outdir}/{samplelist[i]}/ \n')
-        amp_matches = outdir + '/'+samplelist[i] +'/'+samplelist[i]+'_diamond_matches.txt'
+        print(f'The fasta containing AMP sequences for {samplelist[i]} was saved to {samplelist[i]}/ \n')
+        amp_matches = samplelist[i] +'/'+samplelist[i]+'_diamond_matches.txt'
         print(f'The diamond alignment for {samplelist[i]} in process....')
         diamond_df = diamond_alignment(db, amp_faa_paths, amp_matches)
-        print(f'The diamond alignment for {samplelist[i]} was saved to {outdir}/{samplelist[i]}/.')
+        print(f'The diamond alignment for {samplelist[i]} was saved to {samplelist[i]}/.')
         # Merge summary_df and diamond_df
-        complete_summary_df = pd.merge(summary_df, diamond_df, on = 'contig_id', how='left')
-        complete_summary_df.to_csv(outdir +'/'+samplelist[i] +'/'+samplelist[i]+'_ampcombi.csv', sep=',')
-        print(f'The summary file for {samplelist[i]} was saved to {outdir}/{samplelist[i]}/.')
-
+        sample_summary_df = pd.merge(summary_df, diamond_df, on = 'contig_id', how='left')
+        # Insert column with sample name on position 0
+        sample_summary_df.insert(0, 'name', samplelist[i])
+        # Write sample summary into sample output folder
+        sample_summary_df.to_csv(samplelist[i] +'/'+samplelist[i]+'_ampcombi.csv', sep=',', index=False)
+        print(f'The summary file for {samplelist[i]} was saved to {samplelist[i]}/.')
+        if (complete_summary):
+        # concatenate the sample summary to the complete summary and overwrite it
+            complete_summary_df = pd.concat([complete_summary_df, sample_summary_df])
+            complete_summary_df.to_csv('AMPcombi_summary.csv', sep=',', index=False)
+        else: 
+            continue
+    if (complete_summary):
+        print(f'\n FINISHED: The AMPcombi_summary.csv file was saved to your current working directory.')
+    else: 
+        print(f'\n FINISHED: AMPcombi created summaries for all input samples.')
+
 def main():
-    if args.log_file == True:
-        with open(f'{outdir}/ampcombi.log', 'w') as f:
+    if (args.log_file == True and not os.path.exists('ampcombi.log')):
+        with open(f'ampcombi.log', 'w') as f:
+            #print(f'AMPcombi version: {args.version}')
+            with redirect_stdout(f):
+                main_workflow()
+    elif(args.log_file == True and os.path.exists('ampcombi.log')):
+        with open(f'ampcombi.log', 'a') as f:
+            #print(f'AMPcombi version: {args.version}')
             with redirect_stdout(f):
                 main_workflow()
     else: main_workflow()
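
The two logging branches differ only in the file mode. Since opening a file with mode `'a'` creates it when it does not yet exist, a single branch may be enough. A minimal sketch, with a stand-in `main_workflow` and a hypothetical helper `run` that are not part of the commit:

```python
import os
import tempfile
from contextlib import redirect_stdout

def main_workflow():
    # stand-in for the real workflow
    print("processing samples")

def run(log_to_file, log_path="ampcombi.log"):
    if log_to_file:
        # 'a' appends to an existing log and creates a new one otherwise
        with open(log_path, "a") as f, redirect_stdout(f):
            main_workflow()
    else:
        main_workflow()

# Demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "ampcombi.log")
    run(True, log_path)   # creates the file
    run(True, log_path)   # appends to it
    with open(log_path) as f:
        log_lines = f.read().splitlines()

print(log_lines)
```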

diff --git a/ampcombi/check_input.py b/ampcombi/check_input.py
@@ -29,10 +29,10 @@ def check_pathlist(filepaths, samplelist, fileending, path):
     else:
         return filepaths
 
-def check_ref_database(database, outdir):
+def check_ref_database(database):
     if(database==None):
         print('<--AMP_database> was not given, the current DRAMP general-AMP database will be downloaded and used')
-        database = os.path.join(outdir, r'amp_ref_database')
+        database = 'amp_ref_database'
         os.makedirs(database, exist_ok=True)
         db = database
         download_DRAMP(db)

diff --git a/ampcombi/print_header.py b/ampcombi/print_header.py
@@ -2,12 +2,12 @@
 
 def print_header():
     print("""
-#$$$$$$\  $$\      $$\ $$$$$$$\                                     $$\       $$\|
-#$$  __$$\ $$$\    $$$ |$$  __$$\                                   $$ |      \__|
-#$ /   $$ |$$$$\  $$$$ |$$ |  $$ | $$$$$$$\  $$$$$$\  $$$$$$\$$$$\  $$$$$$$\  $$\ 
-#$$$$$$$$ |$$\$$\$$ $$ |$$$$$$$  |$$  _____|$$  __$$\ $$  _$$  _$$\ $$  __$$\ $$ |
-#$$  __$$ |$$ \$$$  $$ |$$  ____/ $$ /      $$ /  $$ |$$ / $$ / $$ |$$ |  $$ |$$ |
-#$$ |  $$ |$$ |\$  /$$ |$$ |      $$ |      $$ |  $$ |$$ | $$ | $$ |$$ |  $$ |$$ |
-#$$ |  $$ |$$ | \_/ $$ |$$ |      \$$$$$$$\ \$$$$$$  |$$ | $$ | $$ |$$$$$$$  |$$ |
-#\__|  \__|\__|     \__|\__|       \_______| \______/ \__| \__| \__|\_______/ \__|
+$$$$$$\  $$\      $$\ $$$$$$$\                                     $$\       $$\|
+$$  __$$\ $$$\    $$$ |$$  __$$\                                   $$ |      \__|
+$ /   $$ |$$$$\  $$$$ |$$ |  $$ | $$$$$$$\  $$$$$$\  $$$$$$\$$$$\  $$$$$$$\  $$\ 
+$$$$$$$$ |$$\$$\$$ $$ |$$$$$$$  |$$  _____|$$  __$$\ $$  _$$  _$$\ $$  __$$\ $$ |
+$$  __$$ |$$ \$$$  $$ |$$  ____/ $$ /      $$ /  $$ |$$ / $$ / $$ |$$ |  $$ |$$ |
+$$ |  $$ |$$ |\$  /$$ |$$ |      $$ |      $$ |  $$ |$$ | $$ | $$ |$$ |  $$ |$$ |
+$$ |  $$ |$$ | \_/ $$ |$$ |      \$$$$$$$\ \$$$$$$  |$$ | $$ | $$ |$$$$$$$  |$$ |
+\__|  \__|\__|     \__|\__|       \_______| \______/ \__| \__| \__|\_______/ \__|
 """)