fimpera is a simple strategy for reducing false positive calls from any Approximate Membership Query data structure (AMQ) that supports abundance queries.
The fimpera implementation proposed here uses a counting Bloom filter. It proposes a way to index and query k-mers from biological sequences (fastq or fasta, gzipped or not).
fimpera relies on templates, hence it can be easily adapted to any other AMQ that supports abundance queries, for any usage.
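As described in the fimpera paper cited at the end of this document, a k-mer is not stored directly: only its s-mers (with s = k - z) are stored in the AMQ, and the abundance reported for a k-mer is the minimum abundance of its z + 1 overlapping s-mers. The sketch below illustrates that query rule only. `ToyAmq`, `query_smer` and `query_kmer` are hypothetical names, not part of the fimpera code base, and an exact hash map stands in for a real AMQ such as a counting Bloom filter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Stand-in for an AMQ with abundance queries (assumption for illustration).
// A real AMQ may overestimate abundances; this exact map never does.
using ToyAmq = std::unordered_map<std::string, std::uint64_t>;

// Abundance of an s-mer according to the toy AMQ (0 if absent).
std::uint64_t query_smer(const ToyAmq& amq, const std::string& smer) {
    const auto it = amq.find(smer);
    return it == amq.end() ? 0 : it->second;
}

// Abundance of a k-mer: the minimum over its z + 1 overlapping s-mers,
// where s = k - z.
std::uint64_t query_kmer(const ToyAmq& amq, const std::string& kmer, std::size_t z) {
    assert(z < kmer.size());  // s must be at least 1
    const std::size_t s = kmer.size() - z;
    std::uint64_t abundance = query_smer(amq, kmer.substr(0, s));
    for (std::size_t i = 1; i <= z; ++i) {
        abundance = std::min(abundance, query_smer(amq, kmer.substr(i, s)));
    }
    return abundance;
}
```

With this rule, a k-mer is wrongly reported as present only if all of its z + 1 s-mers are reported as present, which is what lowers the false positive rate compared to querying the AMQ once per k-mer.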
You must first install zlib. It is likely already installed; if not, you can try:
sudo apt update
sudo apt-cache search zlib # zlib1g-dev on Ubuntu 20.04.
sudo apt-get install zlib1g-dev # or whatever you found with apt-cache search
git clone --recursive https://github.com/lrobidou/fimpera
cd fimpera
./install.sh
Note: fimpera needs a file containing the abundance of each k-mer in order to index them. Each line should contain a k-mer (not an s-mer), one tab, and the abundance associated with that k-mer. KMC can provide such a file, but you are free to use another k-mer counting program.
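To check that a count file from another k-mer counter matches this layout, each line can be validated before indexing. The following is a minimal sketch, assuming exactly one tab per line and a non-negative decimal abundance; `parse_count_line` is a hypothetical helper, not a function provided by fimpera.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: split one line of the form "<k-mer>\t<abundance>".
// Returns false (leaving the outputs untouched) if the line is malformed.
bool parse_count_line(const std::string& line, std::string& kmer,
                      std::uint64_t& abundance) {
    const std::size_t tab = line.find('\t');
    if (tab == std::string::npos || tab == 0 || tab + 1 >= line.size()) {
        return false;  // no tab, empty k-mer, or empty abundance
    }
    const std::string count = line.substr(tab + 1);
    // Accept plain decimal digits only; at most 19 digits always fits in
    // 64 bits, so std::stoull cannot overflow.
    if (count.size() > 19 ||
        count.find_first_not_of("0123456789") != std::string::npos) {
        return false;
    }
    kmer = line.substr(0, tab);
    abundance = std::stoull(count);
    assert(!kmer.empty());  // guaranteed by the tab == 0 check above
    return true;
}
```

Lines read with `std::getline` can be passed to this helper directly, since `std::getline` already strips the trailing newline.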
# indexing a file
./bin/fimpera_index <KMCfile (.txt)> <index_name> <filter size in bits> [ -b <number of bits per bucket in the filter> -k <k> -z <z> --canonical ]
# querying a file
./bin/fimpera_query <index_name> <your query file>
After installing fimpera, you can test it using the sample data files.
Index a simple KMC output file using a filter of 1000000 bits, k-mers of size 35, 5 bits per bucket in the counting Bloom filter and z=3 (so s-mers of size 35 - 3 = 32 are stored in the Bloom filter).
./bin/fimpera_index tests/unit/data/1000LinesTest.txt index.idx 1000000 -k 35 -b 5 -z 3
This creates a file named index.idx. It can then be used to perform queries:
./bin/fimpera_query index.idx tests/unit/data/exemple_query.fasta
This displays the result of queries against every input read from exemple_query.fasta:
>genome.1 CRG080910-3E6:8:1:522:318 length=70
T A C A A T G A A G A A C T C A A T C G T A T G C C G T C T T T T G T T A A T G A A G A A C T C A A T C A T C G T C G C C G T C T T T T G T T
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>genome.2 CRG080910-3E6:8:1:351:650 length=70
G A C A T C A C A T G C T G G G T C G T A T G C C G T C T T T T T C T T A A T G A A G A A C T C A A T C T C G T A T G C C G T C T T T T G T T
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>genome.3 CRG080910-3E6:8:1:754:111 length=70
T T A C T C T T T A A A G A G C T G G A A C G T G A A A A T C G T G A A C T G C G C C G A A A G A G C T G G A A C G T G A A A A T C G T G A A
0 0 0 0 0 0 0 0 0 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>genome.4 CRG080910-3E6:8:1:804:412 length=70
T T C T G G T G T T C A C T G T T T C T T T T G C C G T A A T G A A G A A C T C A A T C T T A A T G A A G A A C T C A A T C C A C G G G A T
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 0
>genome.5 CRG080910-3E6:8:1:847:96 length=70
T A C A C C G G A C A C G G G A T T G G A T G C C C T C T T T T T G T T G G A C A C G G G A T T G G A T T G G A T T G G A T T G G A T T G G
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
Under each letter (base), the abundance of the k-mer starting at that position is displayed. The abundance is displayed as -1 if the complexity of the k-mer is too low. Here, a single k-mer has a non-zero abundance: AAAGAGCTGGAACGTGAAAATCGTGAACTGCGCCG, with abundance 13.
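Reading such output by eye is tedious for long reads, so the k-mers with a positive abundance can be extracted programmatically. The sketch below assumes the layout seen in the sample above (one line of space-separated bases followed by one line of space-separated integer abundances); `positive_kmers` is a hypothetical helper, not part of fimpera.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: pair each positive abundance with the k-mer that
// starts at the same position in the read. Zero and -1 (low complexity)
// entries are skipped.
std::vector<std::pair<std::string, long>> positive_kmers(
    const std::string& bases_line, const std::string& abundances_line,
    std::size_t k) {
    assert(k > 0);
    // Rebuild the read by dropping the spaces between bases.
    std::string read;
    for (char c : bases_line) {
        if (c != ' ') read += c;
    }
    std::vector<std::pair<std::string, long>> hits;
    std::istringstream in(abundances_line);
    long abundance = 0;
    for (std::size_t pos = 0; in >> abundance; ++pos) {
        // Ignore positions where a full k-mer no longer fits in the read.
        if (abundance > 0 && pos + k <= read.size()) {
            hits.emplace_back(read.substr(pos, k), abundance);
        }
    }
    return hits;
}
```

Applied to the two lines printed for genome.3 with k = 35, this should return the single k-mer discussed above.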
Explanation of fimpera headers can be found here.
Explanation on how to use another AMQ is here.
This file contains the definitions of multiple surjective functions that link abundances to their values. Those functions are implemented here. To implement your own surjective function, first define a new class that exposes two public static functions: static uint64_t fct(const uint64_t& abundance), the surjective function, which takes an abundance as a parameter, and static std::string name(), which gives the function a name. You must then pass an instantiation of that class as the first parameter of the fimpera constructor.
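As an illustration of that interface, here is a minimal sketch of such a class. `BitLengthAbundance` and the mapping it implements (the number of bits needed to write the abundance) are assumptions chosen for the example, not one of the functions shipped with fimpera. The call to the fimpera constructor is left out because its exact signature is not given here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical example class exposing the two required static functions.
class BitLengthAbundance {
   public:
    // Maps an abundance to its bit length: 0 -> 0, 1 -> 1, 2..3 -> 2,
    // 4..7 -> 3, and so on. Many abundances share the same value.
    static std::uint64_t fct(const std::uint64_t& abundance) {
        std::uint64_t bits = 0;
        for (std::uint64_t a = abundance; a > 0; a >>= 1) {
            ++bits;
        }
        assert(bits <= 64);  // a 64-bit value never needs more than 64 bits
        return bits;
    }

    // Name associated with this function.
    static std::string name() { return "bit_length"; }
};
```

A logarithmic mapping like this one trades precision on large abundances for a smaller range of stored values, which is the kind of compromise a custom function lets you choose.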
Lucas Robidou: [email protected]
Pierre Peterlongo: [email protected]
Robidou, Lucas, and Pierre Peterlongo. "fimpera: drastic improvement of Approximate Membership Query data-structures with counts." Bioinformatics 39.5 (2023): btad305