KITSUNE: K-mer-length Iterative Selection for UNbiased Ecophylogenomics

KITSUNE is a toolkit for evaluating the length of k-mer in a given genome dataset for alignment-free phylogenomic analysis.

K-mer based approach is simple and fast yet has been widely used in many applications including biological sequence comparison. However, selection of an appropriate k-mer length to obtain good information content for comparison is normally overlooked. The optimum k-mer length is a prerequisite to obtain biological meaningful genomic distance for assessment of phylogenetic relationships. Therefore, we have developed KITSUNE to aid k-mer length selection process in a systematic way, based on a three-steps approach described in Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of k-mer.

KITSUNE will calculate the three matrices across considered k-mer range:

Cumulative Relative Entropy (CRE)
Average number of Common Features (ACF)
Observed Common Features (OCF)

Moreover, KITSUNE also provides various genomic distance calculations from the k-mer frequency vectors that can be used for species identification or phylogenomic tree construction.

Note: If you use KITSUNE in your research, please cite: KITSUNE: A Tool for Identifying Empirically Optimal K-mer Length for Alignment-free Phylogenomic Analysis

Installation

Kitsune is developed under python version 3 environment. We recommend users use python >= v3.5.

Requirement packages: scipy >= 0.18.1, numpy >= 1.1.0, tqdm >= 4.32

Kitsune also requires Jellyfish for k-mer counting as an external software dependency. Thus, you need to install it before running the tool: https://github.com/gmarcais/Jellyfish

Install with pip

pip install kitsune

Install from source

# Clone the GitHub repository
git clone https://github.com/natapol/kitsune

# Move to the kitsune folder
cd kitsune/

# Install
python setup.py install

Usage

Overview of kitsune

command for listing help

$ kitsune --help

usage: kitsune <command> [<args>]

Available commands:
  acf      Compute average number of common features between signatures
  cre      Compute cumulative relative entropy
  dmatrix  Compute distance matrix
  kopt     Compute recommended choice (optimal) of kmer within a given kmer interval for a set of genomes using the cre, acf and ofc
  ofc      Compute observed feature frequencies

Use --help in conjunction with one of the commands above for a list of available options (e.g. kitsune acf --help)

Calculate CRE, ACF, and OFC value for specific kmer

Kitsune provides three commands to calculate an appropiate k-mer using CRE, ACF, and OCF:

Calculate CRE

$ kitsune cre -h

usage: kitsune (cre) [-h] --filename FILENAME [--fast] [--canonical] -ke KEND
                     [-kf KFROM] [-t THREAD] [-o OUTPUT]

Calculate k-mer from cumulative relative entropy of all genomes

optional arguments:
  -h, --help            show this help message and exit
  --filename FILENAME   A genome file in fasta format (default: None)
  --fast                Jellyfish one-pass calculation (faster) (default:
                        False)
  --canonical           Jellyfish count only canonical mer (default: False)
  -ke KEND, --kend KEND
                        Last k-mer (default: None)
  -kf KFROM, --kfrom KFROM
                        Calculate from k-mer (default: 4)
  -t THREAD, --thread THREAD
  -o OUTPUT, --output OUTPUT
                        Output filename (default: None)

Calculate ACF

$ kitsune acf -h

usage: kitsune (acf) [-h] --filenames FILENAMES [FILENAMES ...] [--fast]
                     [--canonical] -k KMERS [KMERS ...] [-t THREAD]
                     [-o OUTPUT]

Calculate an average number of common features pairwise between one genome
against others

optional arguments:
  -h, --help            show this help message and exit
  --filenames FILENAMES [FILENAMES ...]
                        Genome files in fasta format (default: None)
  --fast                Jellyfish one-pass calculation (faster) (default:
                        False)
  --canonical           Jellyfish count only canonical mer (default: False)
  -k KMERS [KMERS ...], --kmers KMERS [KMERS ...]
                        Have to state before (default: None)
  -t THREAD, --thread THREAD
  -o OUTPUT, --output OUTPUT
                        Output filename (default: None)

Calculate OFC

$ kitsune ofc -h

usage: kitsune (ofc) [-h] --filenames FILENAMES [FILENAMES ...] [--fast]
                     [--canonical] -k KMERS [KMERS ...] [-t THREAD]
                     [-o OUTPUT]

Calculate an observe feature frequency

optional arguments:
  -h, --help            show this help message and exit
  --filenames FILENAMES [FILENAMES ...]
                        Genome files in fasta format (default: None)
  --fast                Jellyfish one-pass calculation (faster) (default:
                        False)
  --canonical           Jellyfish count only canonical mer (default: False)
  -k KMERS [KMERS ...], --kmers KMERS [KMERS ...]
  -t THREAD, --thread THREAD
  -o OUTPUT, --output OUTPUT
                        Output filename (default: None)

General Example

kitsune cre --filename genome1.fna -kf 5 -ke 10
kitsune acf --filenames genome1.fna genome2.fna -k 5
kitsune ofc --filenames genome_fasta/* -k 5

Calculate genomic distance at specific k-mer from kmer frequency vectors of two of genomes

Kitsune provides a commands to calculate genomic distance using different distance estimation method. Users can assess the impact of a selected k-mer length on the genomic distnace of choice below.

distance option	name
braycurtis	Bray-Curtis distance
canberra	Canberra distance
chebyshev	Chebyshev distance
cityblock	City Block (Manhattan) distance
correlation	Correlation distance
cosine	Cosine distance
euclidean	Euclidean distance
jensenshannon	Jensen-Shannon distance
sqeuclidean	Squared Euclidean distance
dice	Dice dissimilarity
hamming	Hamming distance
jaccard	Jaccard-Needham dissimilarity
kulsinski	Kulsinski dissimilarity
rogerstanimoto	Rogers-Tanimoto dissimilarity
russellrao	Russell-Rao dissimilarity
sokalmichener	Sokal-Michener dissimilarity
sokalsneath	Sokal-Sneath dissimilarity
yule	Yule dissimilarity
mash	MASH distance
jsmash	MASH Jensen-Shannon distance
jaccarddistp	Jaccard-Needham dissimilarity Probability
euclidean_of_frequency	Euclidean distance of Frequency

Kitsune provides a choice of distance transformation proposed by Fan et.al.

Calculate a distance matrix

$ kitsune dmatrix -h

usage: kitsune (dmatrix) [-h] [--filenames [FILENAMES [FILENAMES ...]]]
                         [--fast] [--canonical] -k KMER [-i INPUT] [-o OUTPUT]
                         [-t THREAD] [--transformed]
                         [-d {braycurtis,canberra,jsmash,chebyshev,cityblock,correlation,cosine,dice,euclidean,hamming,jaccard,kulsinsk,matching,rogerstanimoto,russellrao,sokalmichener,sokalsneath,sqeuclidean,yule,mash,jaccarddistp}]
                         [-f FORMAT]

Calculate a distance matrix

optional arguments:
  -h, --help            show this help message and exit
  --filenames [FILENAMES [FILENAMES ...]]
                        Genome files in fasta format (default: None)
  --fast                Jellyfish one-pass calculation (faster) (default:
                        False)
  --canonical           Jellyfish count only canonical mer (default: False)
  -k KMER, --kmer KMER
  -i INPUT, --input INPUT
                        List of genome files in txt (default: None)
  -o OUTPUT, --output OUTPUT
                        Output filename (default: None)
  -t THREAD, --thread THREAD
  --transformed
  -d {braycurtis,canberra,jsmash,chebyshev,cityblock,correlation,cosine,dice,euclidean,hamming,jaccard,kulsinsk,matching,rogerstanimoto,russellrao,sokalmichener,sokalsneath,sqeuclidean,yule,mash,jaccarddistp}, --distance {braycurtis,canberra,jsmash,chebyshev,cityblock,correlation,cosine,dice,euclidean,hamming,jaccard,kulsinsk,matching,rogerstanimoto,russellrao,sokalmichener,sokalsneath,sqeuclidean,yule,mash,jaccarddistp}
  -f FORMAT, --format FORMAT

Example of choosing distance option:

kitsune dmatrix --filenames genome1.fna genome2.fna -k 11 -d jaccard --canonical --fast -o output.txt
kitsune dmatrix --filenames genome1.fna genome2.fna -k 11 -d hensenshannon --canonical --fast -o output.txt

Find optimum k-mer from a given set of genomes

Kitsune provides a wrap-up comand to find optimum k-mer length for a given set of genome within a given kmer interval.

$ kitsune kopt -h

usage: kitsune (kopt) [-h] [--acf-cutoff ACF_CUTOFF] [--canonical]
                      [--closely-related] [--cre-cutoff CRE_CUTOFF] [--fast]
                      --filenames FILENAMES [--hashsize HASHSIZE]
                      [--in-memory] [--k-min K_MIN] --k-max K_MAX
                      [--lower LOWER] [--nproc NPROC] [--output OUTPUT]
                      [--threads THREADS]

Optimal kmer size selection for a set of genomes using Average number of
Common Features (ACF), Cumulative Relative Entropy (CRE), and Observed Common
Features (OCF). Example: kitsune kopt --filenames genomeList.txt --k-min 4
--k-max 12 --canonical --fast

optional arguments:
  -h, --help            show this help message and exit
  --acf-cutoff ACF_CUTOFF
                        Cutoff to use in selecting kmers whose ACFs are >=
                        (cutoff * max(ACF)) (default: 0.1)
  --canonical           Jellyfish count only canonical kmers (default: False)
  --closely-related     Use in case of closely related genomes (default:
                        False)
  --cre-cutoff CRE_CUTOFF
                        Cutoff to use in selecting kmers whose CREs are <=
                        (cutoff * max(CRE)) (default: 0.1)
  --fast                Jellyfish one-pass calculation (faster) (default:
                        False)
  --filenames FILENAMES
                        Path to the file with the list of genome files paths.
                        There should be at list 2 input genomes (default:
                        None)
  --hashsize HASHSIZE   Jellyfish initial hash size (default: 100M)
  --in-memory           Keep Jellyfish counts in memory (default: False)
  --k-min K_MIN         Minimum kmer size (default: 4)
  --k-max K_MAX         Maximum kmer size (default: None)
  --lower LOWER         Do not let Jellyfish output kmers with count < --lower
                        (default: 1)
  --nproc NPROC         Maximum number of CPUs to make it parallel (default:
                        1)
  --output OUTPUT       Path to the output file (default: None)
  --threads THREADS     Maximum number of threads for Jellyfish (default: 1)

Example dataset

First download the example files. Download

kitsune kopt --filenames genome_list --k-min 6 --k-max 21 --canonical --fast --threads 4 --nproc 2 --output out.txt

⚠️ Please be aware that this command will use big computational resources when large number of genomes and/or large genome size are used as the input.

Name		Name	Last commit message	Last commit date
Latest commit History 169 Commits
.github/workflows		.github/workflows
doc		doc
examples		examples
kitsune		kitsune
recipe		recipe
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Makefile		Makefile
README.md		README.md
logoKITSUNE.png		logoKITSUNE.png
pylintrc		pylintrc
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

KITSUNE: K-mer-length Iterative Selection for UNbiased Ecophylogenomics

Installation

Install with pip

Install from source

Usage

Overview of kitsune

Calculate CRE, ACF, and OFC value for specific kmer

Calculate CRE

Calculate ACF

Calculate OFC

General Example

Calculate genomic distance at specific k-mer from kmer frequency vectors of two of genomes

Calculate a distance matrix

Find optimum k-mer from a given set of genomes

Example dataset

About

Releases 6

Packages

Contributors 5

Languages

License

natapol/kitsune

Folders and files

Latest commit

History

Repository files navigation

KITSUNE: K-mer-length Iterative Selection for UNbiased Ecophylogenomics

Installation

Install with pip

Install from source

Usage

Overview of kitsune

Calculate CRE, ACF, and OFC value for specific kmer

Calculate CRE

Calculate ACF

Calculate OFC

General Example

Calculate genomic distance at specific k-mer from kmer frequency vectors of two of genomes

Calculate a distance matrix

Find optimum k-mer from a given set of genomes

Example dataset

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 6

Packages 0

Contributors 5

Languages

Packages