Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

document the FST format #109

Open
rw opened this issue Jul 13, 2020 · 1 comment
Open

document the FST format #109

rw opened this issue Jul 13, 2020 · 1 comment

Comments

@rw
Copy link

rw commented Jul 13, 2020

I'm trying to understand this library better. Since every fst can be used with mmap, I know that the build process creates a serialized fst representation, and at read time, this representation is used in a zero-alloc way. After reading the code in the raw crate, I can kind of follow along, but I still don't understand the file format.

Is there an overview of the file format anywhere? Maybe just an example, to start? Thanks!

@BurntSushi
Copy link
Owner

Unfortunately I've never written docs for the format. The folks over at Couchbase did port this library to Go though, and in the course of doing it, they sketched out a format in writing: https://github.com/couchbase/vellum/blob/master/docs/format.md --- I haven't combed over it detail though, so I don't know how accurate it is. And there have been some minor changes (like support for CRC32) since that document was written.

Overall, the format itself was my own devising, but took a lot of inspiration from the various papers linked in the docs with respect to compressing finite automata. An FST is a "compressed data structure," meaning that its native form is itself also compressed. So for example, most states inside an FST are represented by a single byte.

@BurntSushi BurntSushi changed the title What is the structure of the underlying representation? document the FST format Jan 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants