Methods to filter fst into new fst based on bytes #84

joshtriplett · 2019-10-28T14:10:27Z

For some uses of an FST, before applying a large number of queries to the fst, the caller may be able to prune out states and transitions from the FST based on knowledge of valid keys. For instance, if using an FST to store a dictionary, and performing lookups in that dictionary for a game, the caller may know that only a subset of letters are available, and could start by pre-processing the FST to prune any states or transitions that use unavailable letters.

Please consider providing ways to efficiently process an FST and prune out states and transitions based on inputs. The simplest could be a filter on valid u8 inputs for transitions.

BurntSushi · 2020-03-05T15:25:04Z

@joshtriplett I'm having a hard time understanding this request. In particular, it sounds like you just want something like this?

let mut builder = Builder::new();
for key in dictionary {
    if key.contains("kyz") {
        continue;
    }
    builder.add(key).unwrap();
}
let fst = builder.into_inner();

Or something like that? What am I missing?

joshtriplett · 2020-03-05T16:21:30Z

Not quite, no. I'm not looking for an operation that filters on keys; I'm looking to filter an fst based on states and state transitions.

The operation I'm looking for would accept two functions, one Fn(u8) -> bool that filters out states by their byte, and one Fn(u8, u8) -> bool that filters out state transitions by their current and next byte. fst would filter states and transitions, throwing away states and transitions for which the corresponding filter function returns false (as well as transitions to states that no longer exist and states with no remaining transitions), and then compact the resulting fst.

This would hopefully be a reasonably efficient operation, and would not require "materializing" each of the keys.

BurntSushi · 2020-03-05T16:28:35Z

I think I see. So to be clear, you're looking for something like this:

fn filter(fst: &Fst, filter_state: impl Fn(u8) -> bool, filter_trans: impl Fn(u8, u8) -> bool) -> Fst

I think that it would in theory be possible to implement this, and it would in theory be faster than reconstructing the FST from the keys, but 1) this would, I believe, be a ton of work to implement and 2) it's not an obvious win to me. An FST's representation is not like a normal state machine. There are oodles of compression tricks that happen which would need to be completely re-done anyway in a case like this. That's likely less cost than rebuilding from keys, but I don't know how much less cost it would be. And doing this without loading the entire FST into memory also seems like a potential challenge, although I confess that I have not thought through it.

Is there a higher level use case you have that you could elaborate on? I think it's probably safe to say that unless there is a simple implementation for this, it's very unlikely that this would happen.

joshtriplett · 2020-03-05T21:31:38Z

@BurntSushi Note that I'd be fine if this loaded the FST fully into memory; I don't expect or need this to be a "streaming" operation. I also don't mind if this doesn't optimize the fst as extensively as it could, such as by trying to compress states together that weren't previously compressed together; if I need space efficiency, I don't mind having to regenerate the fst. Might that simplify the implementation enough to be feasible?

Use case: I have a large fst of keys, and a relatively small graph of bytes, and I need to search for keys in the former as paths within the latter. (This is the same use case I have in #5.) The search for keys within the graph can take seconds or minutes. I'd like to do a quick first pass that filters out states with bytes that don't appear anywhere in the graph, and transitions between bytes that don't appear adjacent in the graph, which I expect would result in a much smaller fst, which would be faster to access, easier to keep in cache, and take up less memory after the initial pruning.

BurntSushi · 2020-03-05T22:42:06Z

Might that simplify the implementation enough to be feasible?

@joshtriplett Unfortunately not. What I meant wasn't that it would be hard to re-compress everything, but rather, than because it's compressed you can't just remove things. For example, if you remove a transition from a state that has two transitions, then that state's encoding is going to completely change with one transition. I guess it's not so much "redoing" compression, but rather, building a separate pipeline to construct FSTs from pre-existing states.

Have you tried doing this by filtering the keys and rebuilding the FST? Is it too slow?

BurntSushi added the question label Mar 5, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Methods to filter fst into new fst based on bytes #84

Methods to filter fst into new fst based on bytes #84

joshtriplett commented Oct 28, 2019 •

edited

Loading

BurntSushi commented Mar 5, 2020

joshtriplett commented Mar 5, 2020

BurntSushi commented Mar 5, 2020

joshtriplett commented Mar 5, 2020

BurntSushi commented Mar 5, 2020

Methods to filter fst into new fst based on bytes #84

Methods to filter fst into new fst based on bytes #84

Comments

joshtriplett commented Oct 28, 2019 • edited Loading

BurntSushi commented Mar 5, 2020

joshtriplett commented Mar 5, 2020

BurntSushi commented Mar 5, 2020

joshtriplett commented Mar 5, 2020

BurntSushi commented Mar 5, 2020

joshtriplett commented Oct 28, 2019 •

edited

Loading