Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization for sparse automatons #78

Open
vorner opened this issue Apr 18, 2019 · 9 comments
Open

Optimization for sparse automatons #78

vorner opened this issue Apr 18, 2019 · 9 comments
Labels

Comments

@vorner
Copy link
Contributor

vorner commented Apr 18, 2019

Hello

The fst can be queried with a specific key or with an automaton. The key is optimized by just „going forward“, while the automaton is „offered“ one byte at a time to modify the state.

Now, I have a sparse automaton ‒ one that doesn't describe one specific key only, but it has only few branching points and is just single possible next byte in most of the states.

I wonder if it's possible to somehow make the querying faster, given the above. Instead of offering it all the bytes that won't match for sure, just going forward on the straight parts. I could write a code specific to my use case, but is there something that could be done in general?

Maybe extending the automaton with a fn hint(&self) -> Option<&[u8]> (default to None) that would not store the intermediate visited states and just go forward? Would it make it faster? Would it make sense to have something like that as part fst itself? Or, is there a better way? If so, I might try to implement it, but I want to ask first before I invest the time.

Thank you

@BurntSushi
Copy link
Owner

BurntSushi commented Apr 18, 2019

Hmm. I don't think I quite understand your question quite yet. Could you maybe write out a simple example?

Additionally, do either the Automaton::can_match or Automaton::will_always_match routines help?

@vorner
Copy link
Contributor Author

vorner commented Apr 18, 2019

OK, I'll try to explain a little better.

I have a [u8] „pattern“. In its basic form, it matches directly, like Map::get would. However, it's also allowed for the pattern to skip some part of it, based on some marks. So, let's say the patter is

foo/bar/baz:end

Allowed matches are the pattern itself, but also foo:end, foo/bar:end. So my automaton accepts the next character, but if the current position points to /, it also accepts the : and „jumps“ there.

I do use the can_match and I guess it helps a lot. But I still feel like fst feeding it a, then b, then c (if the fst contains them), going into an OutOfPattern state that returns false for can_match is a waste when I already know the only one that can match at the beginning is f. So I was wondering if I could just do something similar to what get does inside except for the places with /.

Is this easier to understand, or am I still confusing? 😇

@BurntSushi
Copy link
Owner

OK, so I'm going through some of the issues on this repo and I've re-read your comments here. But I still just don't quite grok what's going on. It's probably because I haven't fully context switched back into this library, but I'm not sure.

Is the patch you're envisioning difficult? If not, the best path forward might be to just submit the patch and hopefully that will help make what you're trying to do clearer to me.

If it is difficult to write, then I'm not sure, maybe a different way of explaining it?

@vorner
Copy link
Contributor Author

vorner commented Feb 21, 2020

I've also context-switched to other things since, and this is not a big priority for me right now, but I'll try.

I don't know how hard the full patch would be, I think the implementation might require some amount of work. But I'll try to write an actual proposed interface change and what the usage I envisioned would look like.

First, the (simplified) use case. Let's say I create a huge fst::Map with a lot of data in it. Now, if I want to query it and see if it contains foo, I can either do map.get("foo"), or I can use an automaton like bellow and use map.search(MyAutomaton(b"foo"))

struct MyAutomaton<'a>(&'a [u8]);

enum State {
    InStr(usize),
    Outside,
}

impl Automaton for MyAutomaton {
   type State = State;
   fn start(&self) -> State {
      State(0)
   }
   fn is_match(&self, state: State) -> bool {
      match state {
         State::InStr(n) if n == self.0.len() => true,
         _ => false,
      }
   }
   fn accept(&self, state: State, byte: u8) -> State {
     match state {
        State::InStr(n) if self.0.get(n) == Some(byte) => State::InStr(n + 1),
        _ => State::Outside,
     }
   }
   fn can_match(&self, state: State) -> bool {
     match state {
       State::InStr(_) => true,
       State::Outside => false,
     }
   }
}

The .get way will be faster, because it knows where to go through the fst. The latter will do (assuming the fst is dense) something like this, wasting time by doing these little „peeks“ into dead branches. These dead branches will be only 1 character long, thanks to the can_match, but they'll still be there:

  1. Start: "", State::InStr(0)
  2. Try 'd' as the first character: "d", State::Outside → ✘
  3. Try 'f' as the first character: "f", State::InStr(1) → continue
  4. Try 'a' as the second character: "fa", State::Outside → ✘
  5. Try 'o' as the second character: "fo", State::InStr(2) → continue

etc.

While the .get will work something like this:

  1. Does the λ node of the fst have f as a child?
  2. Does that child has o as a child?
    ...

Apparently, the .get does much less work. But if I don't want an exact match, but something like almost exact match ‒ eg. I want to match either fooX or fooY, I can either call get twice or I can use the automaton and enrich it a little bit by another State, let's say ExpectXOrY. This means the automaton will still use the more-work method in the easy foo prefix part.

So my proposal was to add a (tentatively named) fn prefix_hint(&self, state: State) -> &[u8] method to the Automaton trait. Such method would default to returning an empty slice and the contract would be that all matches in the current state will have the given prefix. In such case, the search algorithm could switch to the simpler .get-style way of traversing the fst for the whole prefix and then switch back to calling accept to each next possible byte in fst.

In my simplified example, the method would look something like this:

fn prefix_hint(&self, state: State) -> &[u8] {
    match state {
      State::InStr(n) => self.0.get(n..).unwrap_or(b""),
      _ => b"",
    }
}

Is this explanation with example better?

(It's all too possible the example contains some +-1 errors, I hope they don't matter)

@BurntSushi
Copy link
Owner

@vorner Ahhhh okay, thank you so much, that helps me understand things quite a bit better. I really appreciate you taking the time to write that. The key to my understanding was the inequality between get and search. You are of course correct that get is somewhat privileged in this regard.

But not totally privileged. IIRC, get can actually be re-implemented by callers using the raw fst::raw::Node API. With that in mind, maybe a better path forward for you would be to implement your own get rather than trying to go the generic Automaton route.

That's not to say that I think your prefix_hint API is a bad idea. It just doesn't quite seem obviously right to me yet. But I can keep noodling on this. I have a project coming up where I hope to use fst quite a bit more, so I will hopefully context switch back into this project and be able to provide more helpful advice.

Note that get is actually quite simple, so "re-implementing" it shouldn't be too horrific:

fst/src/raw/mod.rs

Lines 410 to 429 in 4993d14

#[inline(never)]
pub fn get<B: AsRef<[u8]>>(&self, key: B) -> Option<Output> {
let mut node = self.root();
let mut out = Output::zero();
for &b in key.as_ref() {
node = match node.find_input(b) {
None => return None,
Some(i) => {
let t = node.transition(i);
out = out.cat(t.out);
self.node(t.addr)
}
}
}
if !node.is_final() {
None
} else {
Some(out.cat(node.final_output()))
}
}

Although, now that I write this out, I'm realizing that your whole point is that you want get for only part of your search, but would otherwise want a normal automaton search once you get past the known prefix part of your query. So I am perhaps quite off here. (You could re-implement search itself, but search is not trivial at all and quite a bit more complicated than get. search isn't much code, but god damn, I remember it hard to get right.)

@vorner
Copy link
Contributor Author

vorner commented Feb 22, 2020

Although, now that I write this out, I'm realizing that your whole point is that you want get for only part of your search, but would otherwise want a normal automaton search once you get past the known prefix part of your query.

Kind of, yes. I want something a little bit more powerful ‒ I want to interleave the full-featured automaton search with the simplified get-style traversal. In the pattern I search, there are interesting points (where the full power of the automaton is very useful) and paths where nothing interesting happens. It's similar to the optimization of trie vs. compressed trie. The hint would always return the prefix of the rest of the match from the current state.

To put some more context to it, I've used fst to implement a service at work and it turned out searching the fst was the most prominent function in the profile (which made sense, since it did the heavy lifting of the service, the rest was mostly just wrapping it into http). I was thinking about how to improve that and using the raw API to skip the simple parts of matching was a possibility.

Nevertheless, I felt like using it would be a bit painful and would produce code that would be much less readable. We also have a policy that if some internal project needs a change in an opensource library, it's OK to create the pull request as part of work time and publish it. So I looked for a way to abstract some part that could be usable by others that'd also make my code more readable and use the work time to implement it. The idea is this extension of the trait, but it turned out the service is fast enough as it is. The evil master plan to use company time to improve fst won't fly, I'm still willing to attempt it if you like the interface in my free time, but it might take a while to find some. If you think it would only complicate matters and wouldn't bring enough improvement, I'm OK with that too.

@BurntSushi
Copy link
Owner

@vorner Thanks very much! I am nearly sold, but not 100%, so I'd say hold off on a patch for now. I will keep on noodling about this though and might just do it myself.

One other question: does this optimization only apply to prefixes? It sounds like it ought to be generalizable to other parts. e.g., The prefix might be variable but the suffix might be fixed. Although I guess it's not possible to take advantage of this because one does not know how many transitions remain to the next match state at any given point in traversal. So I guess skipping past prefixes is a special case.

I think prefix_hint (or equivalent) can also be forwarded in some cases. e.g., If you intersect A with B, then the prefix_hint of the intersection is either empty if neither is a prefix of the other and both are non-empty, or is the longer of the two prefixes in all other cases. But if you union A and B, then neither can have its prefix_hint forwarded unless one is a prefix of the other.

Does that all make sense to you?

@BurntSushi
Copy link
Owner

Also, given that you're using this in production, it would be good to give my plans in #96 a quick perusal to make sure that makes sense to you. (And if you have any suggestions for breaking changes, now would be a good time! :-))

@vorner
Copy link
Contributor Author

vorner commented Feb 27, 2020

Well, as I envision it, it would be prefix of the rest of the match (so maybe the word prefix might be a bit misleading). If I wanted to match (foo|bar)bar, I could make it so I check the first letter and then return either oobar or arbar as the prefix. I guess one could be little bit more inventive on what to return there, depending on the situation. But sure, if you wanted to match .*foo, then you probably couldn't use the prefix hind as an optimisation here. Maybe that's what you meant. I don't know if there's a way to make this more generic or work in the suffix case (except building the fst from reversed inputs).

With the forwarding and combining of automatons, I think you're right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants