If you find this repository useful, please cite
@InProceedings{ddps2023sea,
author = {D{\'\i}az-Dom{\'\i}nguez, Diego and D\"{o}nges, Saska and Puglisi, Simon J. and Salmela, Leena},
title = {{Simple Runs-Bounded FM-Index Designs Are Fast}},
booktitle = {21st International Symposium on Experimental Algorithms (SEA 2023)},
pages = {7:1--7:16},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-279-2},
ISSN = {1868-8969},
year = {2023},
volume = {265},
editor = {Georgiadis, Loukas},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2023/18357},
URN = {urn:nbn:de:0030-drops-183579},
doi = {10.4230/LIPIcs.SEA.2023.7},
annote = {Keywords: data structures, efficient algorithms}
}
Doi link and full text: Simple Runs-bounded FM-index Designs are Fast
BWT indexing by splitting into blocks of rank
queries while not sacrificing too much space.
Supports access, rank and pattern counting.
Mochrobenchmarks for predecessor search support structure related to blocks of
Given a plain text BWT file /path/to/bwt.txt
, to make the index bwt.rlbwt
with associated data bwt_data.rlbwt
run the following in the repository root
$ make make_alphabet_header
$ ./make_alphabet_header -i /path/to/bwt.txt > include/custom_alphabet.hpp
$ make make_bwt
$ ./make_bwt -i /path/to/bwt.txt bwt.rlbwt
This will create and index with block size ./make_bwt
for more information on how to generate different versions of the indexes.
Given a default index bwt.rlbwt
and a pattern file patterns.txt
containing one pattern per line do:
$ make count_matches
$ ./count_matches bwt.rlbwt patterns.txt
To count the number of matches for each pattern in bwt.rlbwt
. Results for each query will be output to standard out, and summary statistics to std::cerr. Run ./count_matches
for information on how to benchmark other index variants.
Include the headers in the include
directory in your project. types.hpp
contains default index types that can be used. Given a default index bwt.rlbwt
the following can be used to load and query the index.
#include <iostream>
#include "include/types.hpp"
main() {
bbwt::two_byte<> bwt("bwt.rlbwt");
std::cout << bwt.count("Einstein") << std::endl;
}
Other hopefully useful defualt index variants are `bbwt::runs<>´ and ´bbwt::vbyte<>´. Different blocks sizes can be entered as template parameters.
Compilation and execution has been tested on multiple modern x86-64 systems and GCC supporting -std=c++2a
. Code should be compatible with other compilers, but is not expected to compile correctly on compilers not supporting C++ 20. The project should also work on new ARM based apple systems.
- Include functionality of
make_alphabe_header
inmake_bwt
to eliminate additional compilations steps in index construction - See what effect eliminating super blocks would have. Probably almost nothing but less code is better.
- Make namespaces clearer to make the project more usefull as a library header for end users.
- See about entropy encoding run heads. Should be able to implement without significant performance hit, and could save significant space.
- Extract rare symbols?
- Separate block encodings for blocks with different properties?