Configuration parameters

Freya Arthen edited this page Feb 20, 2024 · 4 revisions

Minimal required information

  • fasta_path path to genomic FASTA file
  • gff_path path to GFF file
  • output_path path to output directory
  • taxon_id NCBI taxonomy ID of the query species
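As an illustration of the four required entries, here is a minimal sketch in Python. The dictionary representation, the helper `missing_required`, and all paths and the taxon ID are placeholders chosen for this example; they are not taken from taXaminer itself.

```python
# Hypothetical helper: checks that the four required parameters are present.
REQUIRED_KEYS = ("fasta_path", "gff_path", "output_path", "taxon_id")


def missing_required(config):
    """Return the required keys that are absent or empty in `config`."""
    return [key for key in REQUIRED_KEYS if config.get(key) in (None, "")]


# Placeholder values for illustration only.
example_config = {
    "fasta_path": "genome.fasta",
    "gff_path": "annotation.gff",
    "output_path": "taxaminer_out/",
    "taxon_id": 9606,  # NCBI taxonomy ID of Homo sapiens, used as an example
}

print(missing_required(example_config))          # nothing missing
print(missing_required({"fasta_path": "g.fa"}))  # three keys missing
```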

General options

  • threads: X/"auto" number of threads to be used by Bowtie2 and DIAMOND
    • default: 'auto' → auto-detection of all available cores by DIAMOND (Bowtie2 uses one thread)
    • X → X threads are used by DIAMOND and Bowtie2
  • force: True/False overwrite existing results
    • does not overwrite sequence similarity search results ('tax_assignment_path'); to overwrite them, either delete the file or use the option 'compute_tax_assignment'
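The 'threads' rule described above can be sketched as follows. This is only an illustration: the function name `resolve_threads` is hypothetical, and `os.cpu_count()` stands in for DIAMOND's own core auto-detection, which the real tool performs itself.

```python
import os


def resolve_threads(threads="auto"):
    """Return (diamond_threads, bowtie2_threads) for a 'threads' setting.

    'auto' -> DIAMOND uses all available cores, Bowtie2 uses one thread.
    X      -> both tools use X threads.
    """
    if threads == "auto":
        # Assumption: os.cpu_count() approximates DIAMOND's auto-detection.
        return (os.cpu_count() or 1, 1)
    count = int(threads)
    if count < 1:
        raise ValueError("threads must be a positive integer or 'auto'")
    return (count, count)


print(resolve_threads(4))
print(resolve_threads("auto"))
```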

Coverage options

  • include_coverage: TRUE/FALSE whether to include coverage information in the analysis
    • default: inferred from whether a file exists at any of 'pbc_path', 'bam_path' or 'read_paths'
  • bam_path_X path to BAM file for coverage set X
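The default for 'include_coverage' could be inferred roughly like this. The sketch assumes the configuration is available as a dictionary and that coverage-related keys start with 'pbc_path', 'bam_path' or 'read_paths' (for example 'bam_path_1'); the function name and that key matching are assumptions for illustration, not taXaminer's actual code.

```python
import os

COVERAGE_KEY_PREFIXES = ("pbc_path", "bam_path", "read_paths")


def infer_include_coverage(config):
    """Return the explicit setting if given, else infer it from file existence."""
    explicit = config.get("include_coverage")
    if explicit is not None:
        if isinstance(explicit, str):
            return explicit.strip().upper() == "TRUE"
        return bool(explicit)

    # Collect every path given under a coverage-related key.
    candidates = []
    for key, value in config.items():
        if key.startswith(COVERAGE_KEY_PREFIXES) and value:
            if isinstance(value, (list, tuple)):
                candidates.extend(value)
            else:
                candidates.append(value)

    # Coverage is included as soon as one of the files exists.
    return any(os.path.exists(path) for path in candidates)
```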

Taxonomic assignment options

  • database_path path to DIAMOND-formatted NCBI NR protein database (set this up according to the instructions in Installation)
    • default: 'db.dmnd' in the directory that was specified at 'taxaminer.setup -d'
  • compute_tax_assignment: TRUE/FALSE run sequence similarity search with DIAMOND
    • default: inferred from existence of file(s) at 'tax_assignment_path'
  • extract_proteins: TRUE/FALSE automatic generation of protein FASTA file based on genomic FASTA and GFF; saved to 'proteins_path'
    • default: inferred from existence of file at 'proteins_path'
  • proteins_path path to FASTA file containing the protein sequences
    • generated automatically if the file does not exist (or if 'extract_proteins' == TRUE)
    • can be specified by the user; otherwise the default is used
    • default: 'output_path/proteins.faa'
  • tax_assignment_path hit file(s) of sequence similarity search in database
    • when 'assignment_mode' == 'quick' and only one path is provided, the suffixes '_1' and '_2' are added; to state both files explicitly, give them as a comma-separated list in brackets
    • can be specified by the user; otherwise the default is used
    • default: 'output_path/taxonomic_hits.txt' / ['output_path/taxonomic_hits_1.txt', 'output_path/taxonomic_hits_2.txt']
  • target_exclude: TRUE/FALSE exclude self-hits in similarity search (query taxon is either in- or excluded)
    • default: TRUE
  • exclusion_rank: <rank> taxonomic rank at which hits are excluded in taxonomic assignment (based on the query species)
    • taxa which are in the same <exclusion_rank> as the query species are discarded from taxonomic assignment
    • default: 'species'
  • assignment_mode: "exhaustive"/"quick" mode in which to perform similarity search
    • "exhaustive" → default mode
    • "quick" → speeds up the similarity search: genes that most likely originate from the query species are identified by an initial search in a small subset of the database; the remaining genes are then searched against the whole database
    • default: 'exhaustive'
  • quick_mode_search_rank taxonomic rank at which to create the subset of the database for the initial filtering search
    • either a taxonomic rank such as phylum or order (then resolved based on the query species) or an NCBI taxon ID
    • default: 'kingdom'
  • quick_mode_match_rank taxonomic rank that the taxonomic assignment of a gene has to reach to be accepted in the first search, i.e. to be identified as belonging to the query species
    • either a taxonomic rank such as phylum or order (then resolved based on the query species) or an NCBI taxon ID
    • default: 'order'
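How 'tax_assignment_path' resolves in the two assignment modes can be sketched as below. The function name is hypothetical, and placing the '_1'/'_2' suffix before the file extension is an assumption based on the default file names listed above.

```python
import posixpath


def resolve_tax_assignment_paths(output_path, assignment_mode="exhaustive",
                                 tax_assignment_path=None):
    """Return the list of hit file paths used for the similarity search."""
    if tax_assignment_path is None:
        tax_assignment_path = posixpath.join(output_path, "taxonomic_hits.txt")

    # Two (or more) paths stated explicitly: use them unchanged.
    if isinstance(tax_assignment_path, (list, tuple)):
        return list(tax_assignment_path)

    if assignment_mode != "quick":
        return [tax_assignment_path]

    # Quick mode with a single path: derive one file per search round.
    # Assumption: the suffix goes before the extension, as in the defaults.
    root, ext = posixpath.splitext(tax_assignment_path)
    return [root + "_1" + ext, root + "_2" + ext]


print(resolve_tax_assignment_paths("out"))
print(resolve_tax_assignment_paths("out", assignment_mode="quick"))
```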

Plot output options

  • num_groups_plot: x/"all" number of distinct taxonomic groups to display in the plots
    • x → only x labels are displayed; taxonomic assignments are iteratively merged to higher ranks until the number of labels is reduced to x
    • "all" → every taxonomic assignment is displayed
    • default: 25
  • merging_labels: <NCBI IDs>/<rank>/<rank>-all manually control how taxonomic assignments are merged
    • NCBI IDs → comma-separated list of NCBI taxon IDs; taxonomic assignments are merged at each of these IDs (please make sure the IDs are not within the same lineage)
    • <rank> → a taxonomic rank; the taxon at which taxonomic assignments are merged is inferred from this rank for the query species
    • <rank>-all → a taxonomic rank with suffix '-all'; all taxonomic assignments will be generalized to this rank
    • default: None
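The three accepted forms of 'merging_labels' can be told apart with a small parser such as the one below. This is a hedged sketch: `parse_merging_labels` and its return labels ('taxon_ids', 'rank', 'rank_all', 'none') are names invented for this example, and the taxon IDs in the usage lines are arbitrary placeholders.

```python
def parse_merging_labels(value):
    """Classify a 'merging_labels' value into (kind, payload)."""
    if value is None:
        return ("none", None)

    text = str(value).strip()
    if not text:
        return ("none", None)

    # A comma-separated list of integers is read as NCBI taxon IDs.
    parts = [part.strip() for part in text.split(",")]
    if all(part.isdigit() for part in parts):
        return ("taxon_ids", [int(part) for part in parts])

    # '<rank>-all' generalizes every assignment to that rank.
    if text.endswith("-all"):
        return ("rank_all", text[: -len("-all")])

    # Otherwise the value is taken as a plain taxonomic rank.
    return ("rank", text)


print(parse_merging_labels("6656,7742"))
print(parse_merging_labels("phylum"))
print(parse_merging_labels("class-all"))
```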

Gene info options

  • include_pseudogenes: TRUE/FALSE include pseudogenes in the analysis
    • default: FALSE

PCA options

  • input_variables variables to be used for the PCA
    • comma-separated list of variables without spaces; the whole list is enclosed in double quotes (" ")
    • default: "c_name,c_num_of_genes,c_len,c_genelenm,c_genelensd,g_len,g_lendev_c,g_abspos,g_terminal,c_cov,c_covsd,g_cov,g_covsd,g_covdev_c,c_pearson_r,g_pearson_r_o,g_pearson_r_c"
    • see Additional information for details on options
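The format rule for 'input_variables' (comma-separated, no spaces) can be checked with a short sketch like this one. The function name is hypothetical; the default string is copied from the list above.

```python
DEFAULT_INPUT_VARIABLES = (
    "c_name,c_num_of_genes,c_len,c_genelenm,c_genelensd,g_len,g_lendev_c,"
    "g_abspos,g_terminal,c_cov,c_covsd,g_cov,g_covsd,g_covdev_c,"
    "c_pearson_r,g_pearson_r_o,g_pearson_r_c"
)


def parse_input_variables(value=DEFAULT_INPUT_VARIABLES):
    """Split an 'input_variables' string into a list of variable names."""
    if any(char.isspace() for char in value):
        raise ValueError("input_variables must not contain spaces")
    names = value.split(",")
    if "" in names:
        raise ValueError("input_variables contains an empty entry")
    return names


print(parse_input_variables("c_len,g_len"))
print(len(parse_input_variables()))
```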