Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Methods to filter fst into new fst based on bytes #84

Open
joshtriplett opened this issue Oct 28, 2019 · 5 comments
Open

Methods to filter fst into new fst based on bytes #84

joshtriplett opened this issue Oct 28, 2019 · 5 comments
Labels

Comments

@joshtriplett
Copy link

joshtriplett commented Oct 28, 2019

For some uses of an FST, before applying a large number of queries to the fst, the caller may be able to prune out states and transitions from the FST based on knowledge of valid keys. For instance, if using an FST to store a dictionary, and performing lookups in that dictionary for a game, the caller may know that only a subset of letters are available, and could start by pre-processing the FST to prune any states or transitions that use unavailable letters.

Please consider providing ways to efficiently process an FST and prune out states and transitions based on inputs. The simplest could be a filter on valid u8 inputs for transitions.

@BurntSushi
Copy link
Owner

@joshtriplett I'm having a hard time understanding this request. In particular, it sounds like you just want something like this?

let mut builder = Builder::new();
for key in dictionary {
    if key.contains("kyz") {
        continue;
    }
    builder.add(key).unwrap();
}
let fst = builder.into_inner();

Or something like that? What am I missing?

@joshtriplett
Copy link
Author

Not quite, no. I'm not looking for an operation that filters on keys; I'm looking to filter an fst based on states and state transitions.

The operation I'm looking for would accept two functions, one Fn(u8) -> bool that filters out states by their byte, and one Fn(u8, u8) -> bool that filters out state transitions by their current and next byte. fst would filter states and transitions, throwing away states and transitions for which the corresponding filter function returns false (as well as transitions to states that no longer exist and states with no remaining transitions), and then compact the resulting fst.

This would hopefully be a reasonably efficient operation, and would not require "materializing" each of the keys.

@BurntSushi
Copy link
Owner

I think I see. So to be clear, you're looking for something like this:

fn filter(fst: &Fst, filter_state: impl Fn(u8) -> bool, filter_trans: impl Fn(u8, u8) -> bool) -> Fst

I think that it would in theory be possible to implement this, and it would in theory be faster than reconstructing the FST from the keys, but 1) this would, I believe, be a ton of work to implement and 2) it's not an obvious win to me. An FST's representation is not like a normal state machine. There are oodles of compression tricks that happen which would need to be completely re-done anyway in a case like this. That's likely less cost than rebuilding from keys, but I don't know how much less cost it would be. And doing this without loading the entire FST into memory also seems like a potential challenge, although I confess that I have not thought through it.

Is there a higher level use case you have that you could elaborate on? I think it's probably safe to say that unless there is a simple implementation for this, it's very unlikely that this would happen.

@joshtriplett
Copy link
Author

@BurntSushi Note that I'd be fine if this loaded the FST fully into memory; I don't expect or need this to be a "streaming" operation. I also don't mind if this doesn't optimize the fst as extensively as it could, such as by trying to compress states together that weren't previously compressed together; if I need space efficiency, I don't mind having to regenerate the fst. Might that simplify the implementation enough to be feasible?

Use case: I have a large fst of keys, and a relatively small graph of bytes, and I need to search for keys in the former as paths within the latter. (This is the same use case I have in #5.) The search for keys within the graph can take seconds or minutes. I'd like to do a quick first pass that filters out states with bytes that don't appear anywhere in the graph, and transitions between bytes that don't appear adjacent in the graph, which I expect would result in a much smaller fst, which would be faster to access, easier to keep in cache, and take up less memory after the initial pruning.

@BurntSushi
Copy link
Owner

Might that simplify the implementation enough to be feasible?

@joshtriplett Unfortunately not. What I meant wasn't that it would be hard to re-compress everything, but rather, than because it's compressed you can't just remove things. For example, if you remove a transition from a state that has two transitions, then that state's encoding is going to completely change with one transition. I guess it's not so much "redoing" compression, but rather, building a separate pipeline to construct FSTs from pre-existing states.

Have you tried doing this by filtering the keys and rebuilding the FST? Is it too slow?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants