diff --git a/README.md b/README.md index 384d3d3..b46f0d6 100644 --- a/README.md +++ b/README.md @@ -1,10 +1,10 @@ -# AMPcombi : Antimicorbial peptides parsing and functional classification tool +# AMPcombi : AntiMicrobial Peptides parsing and functional classification tool # ![Logo](docs/amp-combi-logo.png) -This tool parses the results of amp prediction tools into a single table and aligns the hits against a reference database of antimicrobial peptides for functional classifications. +This tool parses the results of antimicrobial peptide (AMP) prediction tools into a single table and aligns the hits against a reference AMP database for functional classifications. -For parsing: AMpcombi is developed to parse the output of these **amp prediction tools**: +For parsing: AMPcombi is developed to parse the output of these **AMP prediction tools**: | Tool | Version | Link | | ------------- | ------------- | ------------- | @@ -15,7 +15,7 @@ For parsing: AMpcombi is developed to parse the output of these **amp prediction | EnsembleAMPpred | - | https://pubmed.ncbi.nlm.nih.gov/33494403/ | | NeuBI | - | https://github.com/nafizh/NeuBI | -For classification: AMPcombi is developed to offer functional annotation of the detcted AMPs by alignemnt to **AMP reference databases**, for e.g.,: +For classification: AMPcombi is developed to offer functional annotation of the detected AMPs by alignment to **AMP reference databases**, e.g.: | Tool | Version | Link | | ------------- | ------------- | ------------- | @@ -29,7 +29,7 @@ Alignment to the reference database is done using [diamond blastp v.2.0.15](http To install AMPcombi: -Add dependencies of the tool; python > 3.0, biopython, pandas and diamond. +Add dependencies of the tool; `python` > 3.0, `biopython`, `pandas` and `diamond`. 
Installation can be done using: - pip installation @@ -46,7 +46,7 @@ conda env create -f ampcombi/environment.yml ``` or ``` - conda install AMPcombi + conda install -c bioconda AMPcombi ``` ====================== @@ -57,14 +57,18 @@ There are two basic commands to run AMPcombi: 1. Using `--amp_results` ```console -ampcombi --amp_results path/to/my/result_folder/ --faa_folder path/to/sample_faa_files/ +ampcombi \ +--amp_results path/to/my/result_folder/ \ +--faa_folder path/to/sample_faa_files/ ``` Here the head folder containing output files has to be given. AMPcombi finds and summarizes the output files from different tools, if the folder is structured and named as: `/result_folder/toolsubdir/samplesubdir/sample.tool.filetype`. - - Note that the filetype ending might vary and can be specified with `--tooldict`, if it is different from the default. + - Note that the filetype ending might vary and can be specified with `--tooldict`, if it is different from the default. When passing a dictionary via command line, this has to be done as a string with single quotes `' '` and the dictionary keys and items with double quotes `" "`. i.e. `'{"key1":"item1", "key2":"item2"}'` The path to the folder containing the respective protein fasta files has to be provided with `--faa_folder`. The files have to be named with `.faa`. +Structure of the results folder: + ```console amp_results/ ├── tool_1/ @@ -87,10 +91,14 @@ amp_results/ 2. Using `--path_list` and `--sample_list` ```console -ampcombi --path_list [[list of paths to sample_1-outputs][list of paths to sample_2-outputs]] --sample_list [sample_1, sample_2] --faa_folder path/to/sample_faa_files/ +ampcombi \ +--path_list path_to_sample_1_tool_1.csv path_to_sample_1_tool_1.csv \ +--path_list path_to_sample_2_tool_1.csv path_to_sample_2_tool_1.csv \ +--sample_list sample_1 sample_2 \ +--faa_folder path/to/sample_faa_files/ ``` -Here the paths to the output-files to be summarized can be given as a list for each sample. 
Together with this option a list of sample-names has to be supplied. +Here the paths to the output-files to be summarized can be given by `--path_list` for each sample. Together with this option a list of sample-names has to be supplied. The path to the folder containing the respective protein fasta files has to be provided with `--faa_folder`. The files have to be named with `.faa`. @@ -98,32 +106,53 @@ The path to the folder containing the respective protein fasta files has to be p | command | definition | default | example | | ------------- | ------------- | ------------- | ------------- | | --amp_results | path to the folder containing different tool's output files | ./test_files/ | ../amp_results/ | -| --sample_list | list of samples' names | [] | [sample_1, sample_2] | -| --path_list | list of paths to output files | [] | [[paths to sample_1 output], [paths to sample_2 outputs]] | -| --outdir | name of the output directory | ./ampcombi_results/ | ./ampcombi_results/ | +| --sample_list | list of samples' names | - | sample_1 sample_2 | +| --path_list | list of paths to output files | - | path_to_sample_1_tool_1.csv path_to_sample_1_tool_1.csv | | --cutoff | probability cutoff to filter AMPs | 0 | 0.5 | | --faa_folder | path to the folder containing the samples` .faa files, Filenames have to contain the corresponding sample-name, i.e. 
sample_1.faa | ./test_faa/ | ./faa_files/| -| --tooldict | dictionary of AMP-tools and their respective output file endings | {'ampir':'ampir.tsv', 'amplify':'amplify.tsv', 'macrel':'macrel.tsv', 'hmmer_hmmsearch':'hmmsearch.txt', 'ensembleamppred':'ensembleamppred.txt'} | - | +| --tooldict | dictionary of AMP-tools and their respective output file endings | '{"ampir":"ampir.tsv", "amplify":"amplify.tsv", "macrel":"macrel.tsv", "hmmer_hmmsearch":"hmmsearch.txt", "ensembleamppred":"ensembleamppred.txt"}' | - | | --amp_database | path to the folder containing the reference database files: (1) a fasta file with <.fasta> file extension and (2) the corresponding table with with functional and taxonomic classifications in <.tsv> file extension | [DRAMP 'general amps'](http://dramp.cpu-bioinfor.org/downloads/) database | ./amp_ref_database/ | +| --complete_summary | Concatenates all samples' summarized tables into one | False | True | +| --log | print messages into log file instead of stdout | False | True | +| --version | print the version number into stdout | - | 0.1.4 | - - Note: The fasta file corresponding to the AMP database should not contain any characters other than ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W',',Y'] - - Note: The refernce database table should be tab delimited. + - Note: The fasta file corresponding to the AMP database should not contain any characters other than ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y'] + - Note: The reference database table should be tab delimited. 
+### Output: +The output will be written into your working directory, containing the following files and folders: +```console +/ +├── amp_ref_database/ +| ├── amp_ref.dmnd +| ├── general_amps__clean.fasta +| └── general_amps_.tsv +├── sample_1/ +| ├── sample_1_amp.faa +| ├── sample_1_ampcombi.csv +| └── sample_1_diamond_matches.txt +├── sample_2/ +| ├── sample_2_amp.faa +| ├── sample_2_ampcombi.csv +| └── sample_2_diamond_matches.txt +├── AMPcombi_summary.csv +└── ampcombi.log +``` ====================== ## Contribution: ====================== -AMPcombi is a tool developed for parsing results from published AMP prediction tools. We therfore welcome fellow contributers who would like to add new AMP prediction tools results for parsing and alignment. +AMPcombi is a tool developed for parsing results from published AMP prediction tools. We therefore welcome fellow contributors who would like to add new AMP prediction tools results for parsing and alignment. ### Adding a new tool to AMPcombi -In `ampcombi/reformat_tables.py` -- add a new tool function to read the output to a pandas dataframe +In `ampcombi/reformat_tables.py` +- add a new tool function to read the output to a pandas dataframe and return two columns named `contig_id` and `prob_` - add the new function to the `read_path` function In `ampcombi/main.py` -- add your default `tool:tool.fileending`to the default of `--tooldict` +- add your default `tool:tool.fileending` to the default of `--tooldict` ====================== diff --git a/ampcombi/ampcombi.py b/ampcombi/ampcombi.py old mode 100755 new mode 100644 index fb5a1d0..cc4e7df --- a/ampcombi/ampcombi.py +++ b/ampcombi/ampcombi.py @@ -3,13 +3,16 @@ import os import argparse import warnings +from contextlib import redirect_stdout +from version import __version__ +import json +import os.path # import functions from sub-scripts to main: from reformat_tables import * from amp_fasta import * from check_input import * from amp_database import * from 
print_header import * -from contextlib import redirect_stdout # Define input arguments: parser = argparse.ArgumentParser(prog = 'ampcombi', formatter_class=argparse.RawDescriptionHelpFormatter, @@ -26,22 +29,23 @@ parser.add_argument("--amp_results", dest="amp", nargs='?', help="Enter the path to the folder that contains the different tool's output files in sub-folders named by sample name. \n If paths are to be inferred, sub-folders in this results-directory have to be organized like '/amp_results/toolsubdir/samplesubdir/tool.sample.filetype' \n (default: %(default)s)", type=str, default="./test_files/") -parser.add_argument("--sample_list", dest="samples", nargs='*', help="Enter a list of sample-names, e.g. ['sample_1', 'sample_2', 'sample_n']. \n If not given, the sample-names will be inferred from the folder structure", +parser.add_argument("--sample_list", dest="samples", nargs='*', help="Enter a list of sample-names, e.g. sample_1 sample_2 sample_n. \n If not given, the sample-names will be inferred from the folder structure", default=[]) -parser.add_argument("--path_list", dest="files", nargs='*', action='append', help="Enter the list of paths to the files to be summarized as a list of lists, e.g. [['path/to/my/sample1.ampir.tsv', 'path/to/my/sample1.amplify.tsv'], ['path/to/my/sample2.ampir.tsv', 'path/to/my/sample2.amplify.tsv']]. \n If not given, the file-paths will be inferred from the folder structure", +parser.add_argument("--path_list", dest="files", nargs='*', action='append', help="Enter the list of paths to the files to be summarized as a list of lists, e.g. --path_list path/to/my/sample1.ampir.tsv path/to/my/sample1.amplify.tsv --path_list path/to/my/sample2.ampir.tsv path/to/my/sample2.amplify.tsv. 
\n If not given, the file-paths will be inferred from the folder structure", default=[]) -parser.add_argument("--outdir", dest="out", help="Enter the name of the output directory \n (default: %(default)s)", - type=str, default="./ampcombi_results/") parser.add_argument("--cutoff", dest="p", help="Enter the probability cutoff for AMPs \n (default: %(default)s)", type=int, default=0) parser.add_argument("--faa_folder", dest="faa", help="Enter the path to the folder containing the reference .faa files. Filenames have to contain the corresponding sample-name, i.e. sample_1.faa \n (default: %(default)s)", type=str, default='./test_faa/') parser.add_argument("--tooldict", dest="tools", help="Enter a dictionary of the AMP-tools used with their output file endings (as they appear in the directory tree), \n Tool-names have to be written as in default:\n default={'ampir':'ampir.tsv', 'amplify':'amplify.tsv', 'macrel':'macrel.tsv', 'hmmer_hmmsearch':'hmmsearch.txt', 'ensembleamppred':'ensembleamppred.txt'}", - type=dict, default={'ampir':'ampir.tsv', 'amplify':'amplify.tsv', 'macrel':'macrel.tsv', 'neubi':'neubi.fasta', 'hmmer_hmmsearch':'hmmsearch.txt', 'ensembleamppred':'ensembleamppred.txt'}) + type=str, default='{"ampir":"ampir.tsv", "amplify":"amplify.tsv", "macrel":"macrel.tsv", "neubi":"neubi.fasta", "hmmer_hmmsearch":"hmmsearch.txt", "ensembleamppred":"ensembleamppred.txt"}') parser.add_argument("--amp_database", dest="ref_db", nargs='?', help="Enter the path to the folder containing the reference database files (.fa and .tsv); a fasta file and the corresponding table with functional and taxonomic classifications. 
\n (default: DRAMP database)", type=str, default=None) -parser.add_argument("--log", dest="log_file", nargs='?', help="Silences the standardoutput and captures it in a log file)", +parser.add_argument("--complete_summary", dest="complete", nargs='?', help="Concatenates all sample summaries to one final summary", type=bool, default=False) +parser.add_argument("--log", dest="log_file", nargs='?', help="Silences the standard output and captures it in a log file", + type=bool, default=False) +parser.add_argument('--version', action='version', version='%(prog)s ' + __version__) # get command line arguments args = parser.parse_args() @@ -50,11 +54,11 @@ path = args.amp samplelist_in = args.samples filepaths_in = args.files -outdir = args.out p = args.p faa_path = args.faa -tooldict = args.tools +tooldict = json.loads(args.tools) database = args.ref_db +complete_summary = args.complete # additional variables # extract list of tools from input dictionary. If not given, default dict contains all possible tools @@ -62,9 +66,6 @@ # extract list of tool-output file-endings. If not given, default dict contains default endings. 
fileending = [val for val in tooldict.values()] -# create output directory -os.makedirs(outdir, exist_ok=True) - # supress panda warnings warnings.simplefilter(action='ignore', category=FutureWarning) @@ -81,7 +82,11 @@ def main_workflow(): # check input filepaths and create list of list of filepaths per sample if input empty filepaths = check_pathlist(filepaths_in, samplelist, fileending, path) # check amp_ref_database filepaths and create a directory if input empty - db = check_ref_database(database, outdir) + db = check_ref_database(database) + + # initiate a final_summary dataframe to concatenate each new sample-summary + if (complete_summary): + complete_summary_df = pd.DataFrame([]) # generate summary for each sample amp_faa_paths = [] @@ -90,29 +95,48 @@ def main_workflow(): main_list = [] print('\n ########################################################## ') print(f'Processing AMP-files from sample: {samplelist[i]}') - os.makedirs(outdir + '/'+ samplelist[i], exist_ok=True) + os.makedirs(samplelist[i], exist_ok=True) # fill main_list with tool-output filepaths for sample i read_path(main_list, filepaths[i], p, tooldict, faa_path, samplelist[i]) # use main_list to create the summary file for sample i - summary_df = summary(main_list, samplelist[i], faa_path, outdir) + summary_df = summary(main_list, samplelist[i], faa_path) # Generate the AMP-faa.fasta for sample i - out_path = outdir+ '/'+samplelist[i] +'/'+samplelist[i]+'_amp.faa' + out_path = samplelist[i] +'/'+samplelist[i]+'_amp.faa' faa_name = faa_path+samplelist[i]+'.faa' amp_fasta(summary_df, faa_name, out_path) amp_faa_paths.append(out_path) - print(f'The fasta containing AMP sequences for {samplelist[i]} was saved to {outdir}/{samplelist[i]}/ \n') - amp_matches = outdir + '/'+samplelist[i] +'/'+samplelist[i]+'_diamond_matches.txt' + print(f'The fasta containing AMP sequences for {samplelist[i]} was saved to {samplelist[i]}/ \n') + amp_matches = samplelist[i] 
+'/'+samplelist[i]+'_diamond_matches.txt' print(f'The diamond alignment for {samplelist[i]} in process....') diamond_df = diamond_alignment(db, amp_faa_paths, amp_matches) - print(f'The diamond alignment for {samplelist[i]} was saved to {outdir}/{samplelist[i]}/.') + print(f'The diamond alignment for {samplelist[i]} was saved to {samplelist[i]}/.') # Merge summary_df and diamond_df - complete_summary_df = pd.merge(summary_df, diamond_df, on = 'contig_id', how='left') - complete_summary_df.to_csv(outdir +'/'+samplelist[i] +'/'+samplelist[i]+'_ampcombi.csv', sep=',') - print(f'The summary file for {samplelist[i]} was saved to {outdir}/{samplelist[i]}/.') - + sample_summary_df = pd.merge(summary_df, diamond_df, on = 'contig_id', how='left') + # Insert column with sample name on position 0 + sample_summary_df.insert(0, 'name', samplelist[i]) + # Write sample summary into sample output folder + sample_summary_df.to_csv(samplelist[i] +'/'+samplelist[i]+'_ampcombi.csv', sep=',', index=False) + print(f'The summary file for {samplelist[i]} was saved to {samplelist[i]}/.') + if (complete_summary): + # concatenate the sample summary to the complete summary and overwrite it + complete_summary_df = pd.concat([complete_summary_df, sample_summary_df]) + complete_summary_df.to_csv('AMPcombi_summary.csv', sep=',', index=False) + else: + continue + if (complete_summary): + print(f'\n FINISHED: The AMPcombi_summary.csv file was saved to your current working directory.') + else: + print(f'\n FINISHED: AMPcombi created summaries for all input samples.') + def main(): - if args.log_file == True: - with open(f'{outdir}/ampcombi.log', 'w') as f: + if (args.log_file == True and not os.path.exists('ampcombi.log')): + with open(f'ampcombi.log', 'w') as f: + #print(f'AMPcombi version: {args.version}') + with redirect_stdout(f): + main_workflow() + elif(args.log_file == True and os.path.exists('ampcombi.log')): + with open(f'ampcombi.log', 'a') as f: + #print(f'AMPcombi version: 
{args.version}') with redirect_stdout(f): main_workflow() else: main_workflow() diff --git a/ampcombi/check_input.py b/ampcombi/check_input.py index 0e942fa..bf5fa86 100755 --- a/ampcombi/check_input.py +++ b/ampcombi/check_input.py @@ -29,10 +29,10 @@ def check_pathlist(filepaths, samplelist, fileending, path): else: return filepaths -def check_ref_database(database, outdir): +def check_ref_database(database): if(database==None): print('<--AMP_database> was not given, the current DRAMP general-AMP database will be downloaded and used') - database = os.path.join(outdir, r'amp_ref_database') + database = 'amp_ref_database' os.makedirs(database, exist_ok=True) db = database download_DRAMP(db) diff --git a/ampcombi/print_header.py b/ampcombi/print_header.py index 7d26b48..bd5ed25 100644 --- a/ampcombi/print_header.py +++ b/ampcombi/print_header.py @@ -2,12 +2,12 @@ def print_header(): print(""" -#$$$$$$\ $$\ $$\ $$$$$$$\ $$\ $$\| -#$$ __$$\ $$$\ $$$ |$$ __$$\ $$ | \__| -#$ / $$ |$$$$\ $$$$ |$$ | $$ | $$$$$$$\ $$$$$$\ $$$$$$\$$$$\ $$$$$$$\ $$\ -#$$$$$$$$ |$$\$$\$$ $$ |$$$$$$$ |$$ _____|$$ __$$\ $$ _$$ _$$\ $$ __$$\ $$ | -#$$ __$$ |$$ \$$$ $$ |$$ ____/ $$ / $$ / $$ |$$ / $$ / $$ |$$ | $$ |$$ | -#$$ | $$ |$$ |\$ /$$ |$$ | $$ | $$ | $$ |$$ | $$ | $$ |$$ | $$ |$$ | -#$$ | $$ |$$ | \_/ $$ |$$ | \$$$$$$$\ \$$$$$$ |$$ | $$ | $$ |$$$$$$$ |$$ | -#\__| \__|\__| \__|\__| \_______| \______/ \__| \__| \__|\_______/ \__| +$$$$$$\ $$\ $$\ $$$$$$$\ $$\ $$\| +$$ __$$\ $$$\ $$$ |$$ __$$\ $$ | \__| +$ / $$ |$$$$\ $$$$ |$$ | $$ | $$$$$$$\ $$$$$$\ $$$$$$\$$$$\ $$$$$$$\ $$\ +$$$$$$$$ |$$\$$\$$ $$ |$$$$$$$ |$$ _____|$$ __$$\ $$ _$$ _$$\ $$ __$$\ $$ | +$$ __$$ |$$ \$$$ $$ |$$ ____/ $$ / $$ / $$ |$$ / $$ / $$ |$$ | $$ |$$ | +$$ | $$ |$$ |\$ /$$ |$$ | $$ | $$ | $$ |$$ | $$ | $$ |$$ | $$ |$$ | +$$ | $$ |$$ | \_/ $$ |$$ | \$$$$$$$\ \$$$$$$ |$$ | $$ | $$ |$$$$$$$ |$$ | +\__| \__|\__| \__|\__| \_______| \______/ \__| \__| \__|\_______/ \__| """) \ No newline at end of file diff --git 
a/ampcombi/reformat_tables.py b/ampcombi/reformat_tables.py index 2699710..e4623e8 100755 --- a/ampcombi/reformat_tables.py +++ b/ampcombi/reformat_tables.py @@ -149,7 +149,7 @@ def read_path(df_list, file_list, p, dict, faa_path, samplename): # FUNCTION: MERGE DATAFRAMES ######################################### # merge dataframes from list to summary output per sample -def summary(df_list, samplename, faa_path, outdir): +def summary(df_list, samplename, faa_path): #initiate merge_df merge_df = pd.DataFrame(columns=['contig_id']) #merge all dfs in the df-list on contig_id @@ -164,8 +164,6 @@ def summary(df_list, samplename, faa_path, outdir): merge_df = merge_df.set_index('contig_id') merge_df['p_sum']= merge_df.sum(axis=1)#.sort_values(ascending=False) merge_df = merge_df.sort_values('p_sum', ascending=False).drop('p_sum', axis=1).reset_index() - # write summary to outdir - #merge_df.to_csv(outdir+'/'+samplename+'_AMPsummary.csv', sep=',') return merge_df ######################################### diff --git a/ampcombi/version.py b/ampcombi/version.py new file mode 100644 index 0000000..bad32e9 --- /dev/null +++ b/ampcombi/version.py @@ -0,0 +1 @@ +__version__ = '0.1.4' \ No newline at end of file diff --git a/samplesheet.csv b/samplesheet.csv deleted file mode 100644 index 8ace0ed..0000000 --- a/samplesheet.csv +++ /dev/null @@ -1,2 +0,0 @@ -sample,fasta -sample_2,https://raw.githubusercontent.com/nf-core/test-datasets/funcscan/wastewater_metagenome_contigs_2.fasta.gz diff --git a/setup.py b/setup.py index 53b7f25..371fbee 100644 --- a/setup.py +++ b/setup.py @@ -5,7 +5,7 @@ setup( name='AMPcombi', - version='0.1.3', + version='0.1.4', author='Anan Ibrahim, Louisa Perelo', author_email='ananhamido@hotmail.com, louperelo@gmail.com', packages=['ampcombi'], @@ -16,7 +16,8 @@ 'ampcombi/diamond_alignment.sh', 'ampcombi/diamond_makedb.sh', 'ampcombi/reformat_tables.py', - 'ampcombi/print_header.py'], + 'ampcombi/print_header.py', + 'ampcombi/version.py'], 
url='http://pypi.python.org/pypi/AMPcombi/', license='LICENSE.txt', description='A parsing tool for AMP tools.', diff --git a/test_files/amplify/sample_1/sample_1_amplify.tsv b/test_files/amplify/sample_1/sample_1.amplify.tsv similarity index 100% rename from test_files/amplify/sample_1/sample_1_amplify.tsv rename to test_files/amplify/sample_1/sample_1.amplify.tsv
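The argument changes in `ampcombi/ampcombi.py` above can be tried in isolation. The following is a minimal sketch, not the AMPcombi parser itself: `build_parser` and `parse_cli` are hypothetical helper names, the `ampcombi-sketch` program name and the shortened `--tooldict` default are assumptions, and `--cutoff` is declared as `float` here (the diff uses `type=int`, which would reject the README example value `0.5`). It shows that repeating `--path_list` yields one inner list per occurrence, that the `--tooldict` string is decoded with `json.loads`, and that `type=bool` converts any non-empty string, including `False`, to `True`.

```python
import argparse
import json


def build_parser():
    # Reduced, hypothetical parser mirroring a few arguments from the diff;
    # it is not the full AMPcombi parser.
    parser = argparse.ArgumentParser(prog="ampcombi-sketch")
    # nargs='*' with action='append': each --path_list occurrence adds one inner list.
    parser.add_argument("--path_list", dest="files", nargs="*", action="append", default=[])
    parser.add_argument("--sample_list", dest="samples", nargs="*", default=[])
    # The dictionary is passed as a JSON string and decoded after parsing.
    parser.add_argument("--tooldict", dest="tools", type=str,
                        default='{"ampir":"ampir.tsv", "amplify":"amplify.tsv"}')
    # Assumption: float instead of the int used in the diff, so that 0.5 parses.
    parser.add_argument("--cutoff", dest="p", type=float, default=0)
    # Same declaration style as --log and --complete_summary in the diff.
    parser.add_argument("--log", dest="log_file", nargs="?", type=bool, default=False)
    return parser


def parse_cli(argv):
    # Parse a list of command-line tokens and decode the tool dictionary.
    args = build_parser().parse_args(argv)
    tooldict = json.loads(args.tools)
    return args.files, args.samples, tooldict, args.p, args.log_file


if __name__ == "__main__":
    files, samples, tooldict, cutoff, log_file = parse_cli([
        "--path_list", "sample_1.ampir.tsv", "sample_1.amplify.tsv",
        "--path_list", "sample_2.ampir.tsv", "sample_2.amplify.tsv",
        "--sample_list", "sample_1", "sample_2",
        "--tooldict", '{"ampir":"ampir.tsv"}',
        "--cutoff", "0.5",
        "--log", "False",
    ])
    print(files)     # one inner list per --path_list occurrence
    print(tooldict)  # a regular dict after json.loads
    print(log_file)  # bool("False") is True, so this prints True
```

If the boolean options are meant to act as switches, `action='store_true'` is a common `argparse` alternative; whether that fits AMPcombi's interface is a design decision for the authors.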