Merge pull request #42 from Darcy220606/dev
Merge dev into main
louperelo authored Oct 18, 2022
2 parents 4d799b6 + ac30c8a commit 362eca0
Showing 9 changed files with 113 additions and 62 deletions.
69 changes: 49 additions & 20 deletions README.md
@@ -1,10 +1,10 @@
# AMPcombi : Antimicorbial peptides parsing and functional classification tool
# AMPcombi : AntiMicrobial Peptides parsing and functional classification tool

# ![Logo](docs/amp-combi-logo.png)

This tool parses the results of amp prediction tools into a single table and aligns the hits against a reference database of antimicrobial peptides for functional classifications.
This tool parses the results of antimicrobial peptide (AMP) prediction tools into a single table and aligns the hits against a reference AMP database for functional classifications.

For parsing: AMpcombi is developed to parse the output of these **amp prediction tools**:
For parsing: AMPcombi is developed to parse the output of these **AMP prediction tools**:

| Tool | Version | Link |
| ------------- | ------------- | ------------- |
@@ -15,7 +15,7 @@ For parsing: AMpcombi is developed to parse the output of these **amp prediction
| EnsembleAMPpred | - | https://pubmed.ncbi.nlm.nih.gov/33494403/ |
| NeuBI | - | https://github.com/nafizh/NeuBI |

For classification: AMPcombi is developed to offer functional annotation of the detcted AMPs by alignemnt to **AMP reference databases**, for e.g.,:
For classification: AMPcombi is developed to offer functional annotation of the detected AMPs by alignment to an **AMP reference database**, e.g.:

| Tool | Version | Link |
| ------------- | ------------- | ------------- |
@@ -29,7 +29,7 @@ Alignment to the reference database is done using [diamond blastp v.2.0.15](http

To install AMPcombi:

Add dependencies of the tool; python > 3.0, biopython, pandas and diamond.
Add dependencies of the tool; `python` > 3.0, `biopython`, `pandas` and `diamond`.
Installation can be done using:

- pip installation
@@ -46,7 +46,7 @@ conda env create -f ampcombi/environment.yml
```
or
```
conda install AMPcombi
conda install -c bioconda AMPcombi
```

======================
@@ -57,14 +57,18 @@ There are two basic commands to run AMPcombi:

1. Using `--amp_results`
```console
ampcombi --amp_results path/to/my/result_folder/ --faa_folder path/to/sample_faa_files/
ampcombi \
--amp_results path/to/my/result_folder/ \
--faa_folder path/to/sample_faa_files/
```

Here the head folder containing output files has to be given. AMPcombi finds and summarizes the output files from different tools, if the folder is structured and named as: `/result_folder/toolsubdir/samplesubdir/sample.tool.filetype`.
- Note that the filetype ending might vary and can be specified with `--tooldict`, if it is different from the default.
- Note that the filetype ending might vary and can be specified with `--tooldict`, if it is different from the default. When passing a dictionary via command line, this has to be done as a string with single quotes `' '` and the dictionary keys and items with double quotes `" "`. i.e. `'{"key1":"item1", "key2":"item2"}'`
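The quoting rule for `--tooldict` follows from the value being parsed as JSON, which only accepts double-quoted keys and strings. A minimal sketch of that parsing step, assuming a helper named `parse_tooldict` (a hypothetical name, not part of AMPcombi):

```python
import json


def parse_tooldict(raw: str) -> dict:
    """Parse a JSON object string such as '{"ampir":"ampir.tsv"}' into a dict.

    `parse_tooldict` is a hypothetical helper used only for illustration.
    """
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("--tooldict must be a JSON object")
    return parsed


# The shell strips the outer single quotes, so the program receives this string:
tooldict = parse_tooldict('{"ampir":"ampir.tsv", "macrel":"macrel.tsv"}')
tools = list(tooldict)                 # tool names
fileending = list(tooldict.values())   # matching output file endings

# A Python-style dictionary with single-quoted keys is not valid JSON:
try:
    parse_tooldict("{'ampir':'ampir.tsv'}")
    single_quotes_accepted = True
except json.JSONDecodeError:
    single_quotes_accepted = False
```

This is why the outer quotes on the command line have to be single quotes and the inner ones double quotes.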

The path to the folder containing the respective protein fasta files has to be provided with `--faa_folder`. The files have to be named with `<samplename>.faa`.

Structure of the results folder:

```console
amp_results/
├── tool_1/
@@ -87,43 +91,68 @@ amp_results/
2. Using `--path_list` and `--sample_list`

```console
ampcombi --path_list [[list of paths to sample_1-outputs][list of paths to sample_2-outputs]] --sample_list [sample_1, sample_2] --faa_folder path/to/sample_faa_files/
ampcombi \
--path_list path_to_sample_1_tool_1.csv path_to_sample_1_tool_2.csv \
--path_list path_to_sample_2_tool_1.csv path_to_sample_2_tool_2.csv \
--sample_list sample_1 sample_2 \
--faa_folder path/to/sample_faa_files/
```

Here the paths to the output-files to be summarized can be given as a list for each sample. Together with this option a list of sample-names has to be supplied.
Here the paths to the output-files to be summarized can be given by `--path_list` for each sample. Together with this option a list of sample-names has to be supplied.
The path to the folder containing the respective protein fasta files has to be provided with `--faa_folder`. The files have to be named with `<samplename>.faa`.
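Repeating `--path_list` once per sample yields one inner list per sample. The sketch below reproduces only the two relevant `argparse` options (the parser name `ampcombi-sketch` and the file names are illustrative assumptions) to show the resulting list of lists:

```python
import argparse

# Stand-alone parser with the same nargs/action settings as the two options.
parser = argparse.ArgumentParser(prog="ampcombi-sketch")
parser.add_argument("--path_list", dest="files", nargs="*", action="append", default=[])
parser.add_argument("--sample_list", dest="samples", nargs="*", default=[])

args = parser.parse_args([
    "--path_list", "sample_1.ampir.tsv", "sample_1.amplify.tsv",
    "--path_list", "sample_2.ampir.tsv", "sample_2.amplify.tsv",
    "--sample_list", "sample_1", "sample_2",
])

# Each --path_list occurrence becomes one inner list, in command-line order.
print(args.files)
print(args.samples)
```

The order of the `--path_list` occurrences should therefore match the order of the names given to `--sample_list`.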


### Input options:
| command | definition | default | example |
| ------------- | ------------- | ------------- | ------------- |
| --amp_results | path to the folder containing different tool's output files | ./test_files/ | ../amp_results/ |
| --sample_list | list of samples' names | [] | [sample_1, sample_2] |
| --path_list | list of paths to output files | [] | [[paths to sample_1 output], [paths to sample_2 outputs]] |
| --outdir | name of the output directory | ./ampcombi_results/ | ./ampcombi_results/ |
| --sample_list | list of samples' names | - | sample_1 sample_2 |
| --path_list | list of paths to output files | - | path_to_sample_1_tool_1.csv path_to_sample_1_tool_2.csv |
| --cutoff | probability cutoff to filter AMPs | 0 | 0.5 |
| --faa_folder | path to the folder containing the samples' .faa files. Filenames have to contain the corresponding sample-name, i.e. sample_1.faa | ./test_faa/ | ./faa_files/|
| --tooldict | dictionary of AMP-tools and their respective output file endings | {'ampir':'ampir.tsv', 'amplify':'amplify.tsv', 'macrel':'macrel.tsv', 'hmmer_hmmsearch':'hmmsearch.txt', 'ensembleamppred':'ensembleamppred.txt'} | - |
| --tooldict | dictionary of AMP-tools and their respective output file endings | '{"ampir":"ampir.tsv", "amplify":"amplify.tsv", "macrel":"macrel.tsv", "hmmer_hmmsearch":"hmmsearch.txt", "ensembleamppred":"ensembleamppred.txt"}' | - |
| --amp_database | path to the folder containing the reference database files: (1) a fasta file with <.fasta> file extension and (2) the corresponding table with functional and taxonomic classifications in <.tsv> file extension | [DRAMP 'general amps'](http://dramp.cpu-bioinfor.org/downloads/) database | ./amp_ref_database/ |
| --complete_summary | Concatenates all samples' summarized tables into one | False | True |
| --log | print messages into log file instead of stdout | False | True |
| --version | print the version number into stdout | - | 0.1.4 |

- Note: The fasta file corresponding to the AMP database should not contain any characters other than ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W',',Y']
- Note: The refernce database table should be tab delimited.
- Note: The fasta file corresponding to the AMP database should not contain any characters other than ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y']
- Note: The reference database table should be tab delimited.
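A custom reference fasta can be checked against the allowed alphabet before it is passed to `--amp_database`. The following is a small sketch of such a check; the helper `invalid_residues` is hypothetical and not part of AMPcombi:

```python
# The 20 standard amino acid letters listed in the note above.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def invalid_residues(sequence: str) -> list:
    """Return the sorted characters in `sequence` that are not standard residues.

    Hypothetical helper for illustration only.
    """
    return sorted(set(sequence.upper()) - VALID_RESIDUES)


clean = invalid_residues("GIGKFLHSAKKFGKAFVGEIMNS")  # only standard letters
dirty = invalid_residues("ACDX*")                    # contains 'X' and '*'
```

An empty list means the sequence contains only the permitted characters.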

### Output:
The output will be written into your working directory, containing the following files and folders:
```console
<pwd>/
├── amp_ref_database/
| ├── amp_ref.dmnd
| ├── general_amps_<DATE>_clean.fasta
| └── general_amps_<DATE>.tsv
├── sample_1/
| ├── sample_1_amp.faa
| ├── sample_1_ampcombi.csv
| └── sample_1_diamond_matches.txt
├── sample_2/
| ├── sample_2_amp.faa
| ├── sample_2_ampcombi.csv
| └── sample_2_diamond_matches.txt
├── AMPcombi_summary.csv
└── ampcombi.log
```

======================
## Contribution:
======================

AMPcombi is a tool developed for parsing results from published AMP prediction tools. We therfore welcome fellow contributers who would like to add new AMP prediction tools results for parsing and alignment.
AMPcombi is a tool developed for parsing results from published AMP prediction tools. We therefore welcome contributors who would like to add parsing and alignment support for the results of new AMP prediction tools.

### Adding a new tool to AMPcombi
In `ampcombi/reformat_tables.py`
- add a new tool function to read the output to a pandas dataframe
In `ampcombi/reformat_tables.py`
- add a new tool function to read the output to a pandas dataframe and return two columns named `contig_id` and `prob_<toolname>`
- add the new function to the `read_path` function
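As an illustration of the first step, a new tool function could look roughly like the sketch below. The tool name `mytool`, the function signature, and the input column names `seq_id` and `probability` are all assumptions made for this example; only the two returned column names come from the description above.

```python
import io

import pandas as pd


def mytool(path, p):
    """Read a hypothetical tab-separated tool output and return two columns.

    The column names `seq_id` and `probability` are assumptions about the
    tool's output format, not something AMPcombi defines.
    """
    df = pd.read_csv(path, sep="\t")
    # Keep only hits at or above the probability cutoff.
    df = df[df["probability"] >= p]
    df = df.rename(columns={"seq_id": "contig_id", "probability": "prob_mytool"})
    return df[["contig_id", "prob_mytool"]]


# Usage with an in-memory table standing in for a real output file.
table = "seq_id\tprobability\tlength\ncontig_1\t0.9\t30\ncontig_2\t0.2\t12\n"
hits = mytool(io.StringIO(table), 0.5)
```

Keeping the `contig_id` name identical across tools matters because the per-sample summary is later merged with the diamond results on that column.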


In `ampcombi/main.py`
- add your default `tool:tool.fileending`to the default of `--tooldict`
- add your default `tool:tool.fileending` to the default of `--tooldict`


======================
74 changes: 49 additions & 25 deletions ampcombi/ampcombi.py
100755 → 100644
@@ -3,13 +3,16 @@
import os
import argparse
import warnings
from contextlib import redirect_stdout
from version import __version__
import json
import os.path
# import functions from sub-scripts to main:
from reformat_tables import *
from amp_fasta import *
from check_input import *
from amp_database import *
from print_header import *
from contextlib import redirect_stdout

# Define input arguments:
parser = argparse.ArgumentParser(prog = 'ampcombi', formatter_class=argparse.RawDescriptionHelpFormatter,
@@ -26,22 +29,23 @@

parser.add_argument("--amp_results", dest="amp", nargs='?', help="Enter the path to the folder that contains the different tool's output files in sub-folders named by sample name. \n If paths are to be inferred, sub-folders in this results-directory have to be organized like '/amp_results/toolsubdir/samplesubdir/tool.sample.filetype' \n (default: %(default)s)",
type=str, default="./test_files/")
parser.add_argument("--sample_list", dest="samples", nargs='*', help="Enter a list of sample-names, e.g. ['sample_1', 'sample_2', 'sample_n']. \n If not given, the sample-names will be inferred from the folder structure",
parser.add_argument("--sample_list", dest="samples", nargs='*', help="Enter a list of sample-names, e.g. sample_1 sample_2 sample_n. \n If not given, the sample-names will be inferred from the folder structure",
default=[])
parser.add_argument("--path_list", dest="files", nargs='*', action='append', help="Enter the list of paths to the files to be summarized as a list of lists, e.g. [['path/to/my/sample1.ampir.tsv', 'path/to/my/sample1.amplify.tsv'], ['path/to/my/sample2.ampir.tsv', 'path/to/my/sample2.amplify.tsv']]. \n If not given, the file-paths will be inferred from the folder structure",
parser.add_argument("--path_list", dest="files", nargs='*', action='append', help="Enter the list of paths to the files to be summarized as a list of lists, e.g. --path_list path/to/my/sample1.ampir.tsv path/to/my/sample1.amplify.tsv --path_list path/to/my/sample2.ampir.tsv path/to/my/sample2.amplify.tsv. \n If not given, the file-paths will be inferred from the folder structure",
default=[])
parser.add_argument("--outdir", dest="out", help="Enter the name of the output directory \n (default: %(default)s)",
type=str, default="./ampcombi_results/")
parser.add_argument("--cutoff", dest="p", help="Enter the probability cutoff for AMPs \n (default: %(default)s)",
type=float, default=0)
parser.add_argument("--faa_folder", dest="faa", help="Enter the path to the folder containing the reference .faa files. Filenames have to contain the corresponding sample-name, i.e. sample_1.faa \n (default: %(default)s)",
type=str, default='./test_faa/')
parser.add_argument("--tooldict", dest="tools", help="Enter a dictionary of the AMP-tools used with their output file endings (as they appear in the directory tree), \n Tool-names have to be written as in default:\n default={'ampir':'ampir.tsv', 'amplify':'amplify.tsv', 'macrel':'macrel.tsv', 'hmmer_hmmsearch':'hmmsearch.txt', 'ensembleamppred':'ensembleamppred.txt'}",
type=dict, default={'ampir':'ampir.tsv', 'amplify':'amplify.tsv', 'macrel':'macrel.tsv', 'neubi':'neubi.fasta', 'hmmer_hmmsearch':'hmmsearch.txt', 'ensembleamppred':'ensembleamppred.txt'})
type=str, default='{"ampir":"ampir.tsv", "amplify":"amplify.tsv", "macrel":"macrel.tsv", "neubi":"neubi.fasta", "hmmer_hmmsearch":"hmmsearch.txt", "ensembleamppred":"ensembleamppred.txt"}')
parser.add_argument("--amp_database", dest="ref_db", nargs='?', help="Enter the path to the folder containing the reference database files (.fa and .tsv); a fasta file and the corresponding table with functional and taxonomic classifications. \n (default: DRAMP database)",
type=str, default=None)
parser.add_argument("--log", dest="log_file", nargs='?', help="Silences the standardoutput and captures it in a log file)",
parser.add_argument("--complete_summary", dest="complete", nargs='?', help="Concatenates all sample summaries to one final summary",
type=bool, default=False)
parser.add_argument("--log", dest="log_file", nargs='?', help="Silences the standard output and captures it in a log file",
type=bool, default=False)
parser.add_argument('--version', action='version', version='%(prog)s ' + __version__)

# get command line arguments
args = parser.parse_args()
@@ -50,21 +54,18 @@
path = args.amp
samplelist_in = args.samples
filepaths_in = args.files
outdir = args.out
p = args.p
faa_path = args.faa
tooldict = args.tools
tooldict = json.loads(args.tools)
database = args.ref_db
complete_summary = args.complete

# additional variables
# extract list of tools from input dictionary. If not given, default dict contains all possible tools
tools = [key for key in tooldict]
# extract list of tool-output file-endings. If not given, default dict contains default endings.
fileending = [val for val in tooldict.values()]

# create output directory
os.makedirs(outdir, exist_ok=True)

# suppress pandas warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

@@ -81,7 +82,11 @@ def main_workflow():
# check input filepaths and create list of list of filepaths per sample if input empty
filepaths = check_pathlist(filepaths_in, samplelist, fileending, path)
# check amp_ref_database filepaths and create a directory if input empty
db = check_ref_database(database, outdir)
db = check_ref_database(database)

# initiate a final_summary dataframe to concatenate each new sample-summary
if (complete_summary):
complete_summary_df = pd.DataFrame([])

# generate summary for each sample
amp_faa_paths = []
@@ -90,29 +95,48 @@ def main_workflow():
main_list = []
print('\n ########################################################## ')
print(f'Processing AMP-files from sample: {samplelist[i]}')
os.makedirs(outdir + '/'+ samplelist[i], exist_ok=True)
os.makedirs(samplelist[i], exist_ok=True)
# fill main_list with tool-output filepaths for sample i
read_path(main_list, filepaths[i], p, tooldict, faa_path, samplelist[i])
# use main_list to create the summary file for sample i
summary_df = summary(main_list, samplelist[i], faa_path, outdir)
summary_df = summary(main_list, samplelist[i], faa_path)
# Generate the AMP-faa.fasta for sample i
out_path = outdir+ '/'+samplelist[i] +'/'+samplelist[i]+'_amp.faa'
out_path = samplelist[i] +'/'+samplelist[i]+'_amp.faa'
faa_name = faa_path+samplelist[i]+'.faa'
amp_fasta(summary_df, faa_name, out_path)
amp_faa_paths.append(out_path)
print(f'The fasta containing AMP sequences for {samplelist[i]} was saved to {outdir}/{samplelist[i]}/ \n')
amp_matches = outdir + '/'+samplelist[i] +'/'+samplelist[i]+'_diamond_matches.txt'
print(f'The fasta containing AMP sequences for {samplelist[i]} was saved to {samplelist[i]}/ \n')
amp_matches = samplelist[i] +'/'+samplelist[i]+'_diamond_matches.txt'
print(f'The diamond alignment for {samplelist[i]} in process....')
diamond_df = diamond_alignment(db, amp_faa_paths, amp_matches)
print(f'The diamond alignment for {samplelist[i]} was saved to {outdir}/{samplelist[i]}/.')
print(f'The diamond alignment for {samplelist[i]} was saved to {samplelist[i]}/.')
# Merge summary_df and diamond_df
complete_summary_df = pd.merge(summary_df, diamond_df, on = 'contig_id', how='left')
complete_summary_df.to_csv(outdir +'/'+samplelist[i] +'/'+samplelist[i]+'_ampcombi.csv', sep=',')
print(f'The summary file for {samplelist[i]} was saved to {outdir}/{samplelist[i]}/.')

sample_summary_df = pd.merge(summary_df, diamond_df, on = 'contig_id', how='left')
# Insert column with sample name on position 0
sample_summary_df.insert(0, 'name', samplelist[i])
# Write sample summary into sample output folder
sample_summary_df.to_csv(samplelist[i] +'/'+samplelist[i]+'_ampcombi.csv', sep=',', index=False)
print(f'The summary file for {samplelist[i]} was saved to {samplelist[i]}/.')
if (complete_summary):
# concatenate the sample summary to the complete summary and overwrite it
complete_summary_df = pd.concat([complete_summary_df, sample_summary_df])
complete_summary_df.to_csv('AMPcombi_summary.csv', sep=',', index=False)
else:
continue
if (complete_summary):
print(f'\n FINISHED: The AMPcombi_summary.csv file was saved to your current working directory.')
else:
print(f'\n FINISHED: AMPcombi created summaries for all input samples.')

def main():
if args.log_file == True:
with open(f'{outdir}/ampcombi.log', 'w') as f:
if (args.log_file == True and not os.path.exists('ampcombi.log')):
with open(f'ampcombi.log', 'w') as f:
#print(f'AMPcombi version: {args.version}')
with redirect_stdout(f):
main_workflow()
elif(args.log_file == True and os.path.exists('ampcombi.log')):
with open(f'ampcombi.log', 'a') as f:
#print(f'AMPcombi version: {args.version}')
with redirect_stdout(f):
main_workflow()
else: main_workflow()
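
The two logging branches in `main()` differ only in the file mode (`'w'` for a new log, `'a'` for an existing one). Since append mode also creates a missing file, the same behaviour could be sketched with a single branch. The helper `run_logged` and the demo workflow below are hypothetical and only illustrate the idea:

```python
import os
import tempfile
from contextlib import redirect_stdout


def run_logged(workflow, log_path="ampcombi.log"):
    """Run `workflow` with stdout redirected into `log_path`.

    Mode 'a' creates the file if it does not exist and appends otherwise.
    `run_logged` is a hypothetical helper, not part of AMPcombi.
    """
    with open(log_path, "a") as f, redirect_stdout(f):
        workflow()


def demo_workflow():
    # Stand-in for main_workflow(); prints one line per run.
    print("Processing AMP-files from sample: sample_1")


# Two consecutive runs end up in the same log file.
log_path = os.path.join(tempfile.mkdtemp(), "ampcombi.log")
run_logged(demo_workflow, log_path)
run_logged(demo_workflow, log_path)
with open(log_path) as f:
    log_lines = f.read().splitlines()
```

This keeps the log of earlier runs, which appears to be the intent of the `os.path.exists` check.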
4 changes: 2 additions & 2 deletions ampcombi/check_input.py
@@ -29,10 +29,10 @@ def check_pathlist(filepaths, samplelist, fileending, path):
else:
return filepaths

def check_ref_database(database, outdir):
def check_ref_database(database):
if(database==None):
print('<--AMP_database> was not given, the current DRAMP general-AMP database will be downloaded and used')
database = os.path.join(outdir, r'amp_ref_database')
database = 'amp_ref_database'
os.makedirs(database, exist_ok=True)
db = database
download_DRAMP(db)
16 changes: 8 additions & 8 deletions ampcombi/print_header.py
@@ -2,12 +2,12 @@

def print_header():
print("""
#$$$$$$\ $$\ $$\ $$$$$$$\ $$\ $$\|
#$$ __$$\ $$$\ $$$ |$$ __$$\ $$ | \__|
#$ / $$ |$$$$\ $$$$ |$$ | $$ | $$$$$$$\ $$$$$$\ $$$$$$\$$$$\ $$$$$$$\ $$\
#$$$$$$$$ |$$\$$\$$ $$ |$$$$$$$ |$$ _____|$$ __$$\ $$ _$$ _$$\ $$ __$$\ $$ |
#$$ __$$ |$$ \$$$ $$ |$$ ____/ $$ / $$ / $$ |$$ / $$ / $$ |$$ | $$ |$$ |
#$$ | $$ |$$ |\$ /$$ |$$ | $$ | $$ | $$ |$$ | $$ | $$ |$$ | $$ |$$ |
#$$ | $$ |$$ | \_/ $$ |$$ | \$$$$$$$\ \$$$$$$ |$$ | $$ | $$ |$$$$$$$ |$$ |
#\__| \__|\__| \__|\__| \_______| \______/ \__| \__| \__|\_______/ \__|
$$$$$$\ $$\ $$\ $$$$$$$\ $$\ $$\|
$$ __$$\ $$$\ $$$ |$$ __$$\ $$ | \__|
$ / $$ |$$$$\ $$$$ |$$ | $$ | $$$$$$$\ $$$$$$\ $$$$$$\$$$$\ $$$$$$$\ $$\
$$$$$$$$ |$$\$$\$$ $$ |$$$$$$$ |$$ _____|$$ __$$\ $$ _$$ _$$\ $$ __$$\ $$ |
$$ __$$ |$$ \$$$ $$ |$$ ____/ $$ / $$ / $$ |$$ / $$ / $$ |$$ | $$ |$$ |
$$ | $$ |$$ |\$ /$$ |$$ | $$ | $$ | $$ |$$ | $$ | $$ |$$ | $$ |$$ |
$$ | $$ |$$ | \_/ $$ |$$ | \$$$$$$$\ \$$$$$$ |$$ | $$ | $$ |$$$$$$$ |$$ |
\__| \__|\__| \__|\__| \_______| \______/ \__| \__| \__|\_______/ \__|
""")