Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to customize freeze function? #131

Open
windrunner414 opened this issue Jul 31, 2021 · 6 comments
Open

How to customize freeze function? #131

windrunner414 opened this issue Jul 31, 2021 · 6 comments

Comments

@windrunner414
Copy link

Like lucene, it can use a customize freeze function that only write if the number of children nodes > MIN, otherwise it will be dropped.
This allows only save part of data('prefix') in FST and save other('suffix') in disk.This makes the FST small enough to be fully loaded into memory.

@BurntSushi
Copy link
Owner

Why not just pick the prefixes yourself ahead of time?

I think I really need more details here. An example would be helpful.

@BurntSushi
Copy link
Owner

The direct answer to your question is that there is no way to customize this in the current API.

@windrunner414
Copy link
Author

Because I can't know which prefixes are needed (unless traverse all documents once before build FST)
The documents are from user post contents, will query all data from database and divide to many words to build fst.(It's dynamic so will rebuild FST when certain conditions are met)
When building FST, once a prefix frozen, we can know how many words this prefix has.Lucene does this, it will drop all prefixes with less than 25 words

@windrunner414
Copy link
Author

for example, we have a set: {"ab", "abc", "abd", "abe", "abf", "ac", "ad"}
we can only insert "ab" to fst and drop all others except "ad" because of it has not been frozen yet.
In this way, only a small FST is generated on a very large data set, which can be completely loaded into memory.
Of course, search must to query the suffix on the disk. Maybe use structures such as skip list to store suffixes and posting list on the disk.

@BurntSushi
Copy link
Owner

Because I can't know which prefixes are needed (unless traverse all documents once before build FST)

You have to do this anyway, because an FST requires providing the keys in lexicographic order. So you need to know all of your keys before you start building the FST.

for example, we have a set: {"ab", "abc", "abd", "abe", "abf", "ac", "ad"}
we can only insert "ab" to fst and drop all others except "ad" because of it has not been frozen yet.

It still isn't clear to me why this has to be in FST construction and not in a pre-processing phase. Even if FST construction did this automatically, you'd still need to know what the suffixes that weren't written are so that you can write them somewhere else (as you mention).

If I'm still not understanding you, then I think talking about this at a conceptual level isn't going to work. In that case, I'd suggest that you put forward a proposal that adds a new API along with a sketch of how the implementation of that new API might work.

@windrunner414
Copy link
Author

You have to do this anyway, because an FST requires providing the keys in lexicographic order. So you need to know all of your keys before you start building the FST.

Yes it requires providing the keys in order.But I don't have just one fst. I will generate many fsts in segments.So I can only know the prefix in a particular segments.I will merge fsts finally, during the final FST build, I should know the prefixes in all documents.

you'd still need to know what the suffixes that weren't written are so that you can write them somewhere else (as you mention).

Once a prefix is frozen, all its suffixes are determined and immutable, and we can clearly know all its suffixes.We can record all input after "ab" before "ac", once "ac" is inserted, and we found "", "c", "d", "e", "f" >= the MIN constant, we can write "ac" to fst and write all suffixes elsewhere

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants