How to customize freeze function? #131

windrunner414 · 2021-07-31T15:57:34Z

Like lucene, it can use a customize freeze function that only write if the number of children nodes > MIN, otherwise it will be dropped.
This allows only save part of data('prefix') in FST and save other('suffix') in disk.This makes the FST small enough to be fully loaded into memory.

BurntSushi · 2021-07-31T19:26:13Z

Why not just pick the prefixes yourself ahead of time?

I think I really need more details here. An example would be helpful.

BurntSushi · 2021-07-31T19:26:32Z

The direct answer to your question is that there is no way to customize this in the current API.

windrunner414 · 2021-07-31T19:38:14Z

Because I can't know which prefixes are needed (unless traverse all documents once before build FST)
The documents are from user post contents, will query all data from database and divide to many words to build fst.(It's dynamic so will rebuild FST when certain conditions are met)
When building FST, once a prefix frozen, we can know how many words this prefix has.Lucene does this, it will drop all prefixes with less than 25 words

windrunner414 · 2021-07-31T19:45:50Z

for example, we have a set: {"ab", "abc", "abd", "abe", "abf", "ac", "ad"}
we can only insert "ab" to fst and drop all others except "ad" because of it has not been frozen yet.
In this way, only a small FST is generated on a very large data set, which can be completely loaded into memory.
Of course, search must to query the suffix on the disk. Maybe use structures such as skip list to store suffixes and posting list on the disk.

BurntSushi · 2021-07-31T19:48:56Z

Because I can't know which prefixes are needed (unless traverse all documents once before build FST)

You have to do this anyway, because an FST requires providing the keys in lexicographic order. So you need to know all of your keys before you start building the FST.

for example, we have a set: {"ab", "abc", "abd", "abe", "abf", "ac", "ad"}
we can only insert "ab" to fst and drop all others except "ad" because of it has not been frozen yet.

It still isn't clear to me why this has to be in FST construction and not in a pre-processing phase. Even if FST construction did this automatically, you'd still need to know what the suffixes that weren't written are so that you can write them somewhere else (as you mention).

If I'm still not understanding you, then I think talking about this at a conceptual level isn't going to work. In that case, I'd suggest that you put forward a proposal that adds a new API along with a sketch of how the implementation of that new API might work.

windrunner414 · 2021-07-31T20:01:04Z

You have to do this anyway, because an FST requires providing the keys in lexicographic order. So you need to know all of your keys before you start building the FST.

Yes it requires providing the keys in order.But I don't have just one fst. I will generate many fsts in segments.So I can only know the prefix in a particular segments.I will merge fsts finally, during the final FST build, I should know the prefixes in all documents.

you'd still need to know what the suffixes that weren't written are so that you can write them somewhere else (as you mention).

Once a prefix is frozen, all its suffixes are determined and immutable, and we can clearly know all its suffixes.We can record all input after "ab" before "ac", once "ac" is inserted, and we found "", "c", "d", "e", "f" >= the MIN constant, we can write "ac" to fst and write all suffixes elsewhere

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to customize freeze function? #131

How to customize freeze function? #131

windrunner414 commented Jul 31, 2021

BurntSushi commented Jul 31, 2021

BurntSushi commented Jul 31, 2021

windrunner414 commented Jul 31, 2021

windrunner414 commented Jul 31, 2021

BurntSushi commented Jul 31, 2021

windrunner414 commented Jul 31, 2021

How to customize freeze function? #131

How to customize freeze function? #131

Comments

windrunner414 commented Jul 31, 2021

BurntSushi commented Jul 31, 2021

BurntSushi commented Jul 31, 2021

windrunner414 commented Jul 31, 2021

windrunner414 commented Jul 31, 2021

BurntSushi commented Jul 31, 2021

windrunner414 commented Jul 31, 2021