Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getting the biggest key smaller than an id #129

Open
dureuill opened this issue Jun 23, 2021 · 7 comments
Open

Getting the biggest key smaller than an id #129

dureuill opened this issue Jun 23, 2021 · 7 comments

Comments

@dureuill
Copy link

Hello! Thank you for making fst!

I read your blog post and thought that fst might be suitable for my purpose: I'm observing some datum that changes value over discrete time intervals, and want to get the value associated to the datum at any point in time as efficiently as possible.

Currently I'm storing the successive values for that datum in fst::Map, and as the value doesn't change on each unit of time interval, I'm storing the value (with the time interval as key) only when it changes.

For example, possible pairs of (time, value) could be:
(0, 42), (10, 28), (1000, 0), (1001, 1), ... meaning the value was initially 42 at the origin of time, then changed to 28 at time interval 10, then stayed at 10 until it reached time interval 1000 where it was reset to 0, ...
The time unit can be expressed as a u64, stored in its big endian [u8] representation (for insertions in lexicographical order). The value is a simple u64.

Everything is well, I can store good amounts of data in a map. Now, I would like to efficiently recover the value associated with a point in time. The small difficulty is that the point in time can be, or not, actually a key in the map, since the map only contains keys where a change of value occurred. Whenever a point in time that is not in the map is requested, the value associated with the biggest key that is smaller than the requested point in time should be returned.

Here are the solutions I tried and why I didn't choose them:

  1. I tried using fst::Map::range.le, but unfortunately the returned stream can only be iterated on in lexicographic order, which means that the full stream must be consumed to get the greatest key that is less or equal to the request time point. As I may store billions of time points in the Map, recovering the late time points are going to iterate on almost the entire map, which is going to take too much time. If using a std::BTreeMap, the range function actually returns a type implementing DoubleEndedIterator, which allows to efficiently recover the biggest smaller key.
  2. I tried storing every time points in the map, as in my understanding two consecutive time points should share a prefix and so be space efficient. Unfortunately, this still results in a bigger map (165MB to 32MB).
  3. I tried tinkering with automatons, but I'm not sure I'm seeing how they could address the use case (at least without backtracking I guess?), although it might be me being too new to the API.

Am I missing something, or is there a missing piece in the Map API to cover that use case? Or maybe I'm mistaken in my understanding of the data structure, and that particular use case cannot be served well by fsts?

Thank you for your time reading this!

@BurntSushi
Copy link
Owner

BurntSushi commented Jun 23, 2021

Hmm yes, I see the conundrum. The range API was the first thing my brain jumped to as well, but since you want the biggest value <= to the target value, I don't see a way to make that work.

It's been a while since I've been in the guts of this crate, but it seems to me like it should be possible to implement reverse streaming like BTreeMap does, although that would be a lot of effort I expect.

For your specific case though, I don't think you need a stream (which is also more costly to construct). I think you can just write a variant of key lookup:

fst/src/raw/mod.rs

Lines 681 to 700 in 3676d3e

#[inline]
fn get(&self, key: &[u8]) -> Option<Output> {
let mut node = self.root();
let mut out = Output::zero();
for &b in key {
node = match node.find_input(b) {
None => return None,
Some(i) => {
let t = node.transition(i);
out = out.cat(t.out);
self.node(t.addr)
}
}
}
if !node.is_final() {
None
} else {
Some(out.cat(node.final_output()))
}
}

But instead of return None; when find_input doesn't find anything, just look for the biggest byte that is smaller that your target. If you keep following this until you reach the end of your key, then you should be at a final state (or perhaps no state at all, if no such key exists) that represents precisely the key that is closest to your target without going over. Does that make sense?

I believe all of the routines for walking the the automaton that get uses above are public. You'll also want to use the transitions routine for the case where find_input fails. (The API is expressive enough where you could implement binary search instead, although I'd start with the simple linear scan.)

@dureuill
Copy link
Author

Fantastic answer! Indeed I think I will need to use the lower-level Node API (for this purposes and for others). I'll see what I can come up with and report back with an implementation.

Thank you for answering.

@BurntSushi
Copy link
Owner

Yeah, the lower level API isn't as convenient, but it gives you full access to the underlying automaton. So there's a lot you can do with it!

@dureuill
Copy link
Author

dureuill commented Jun 30, 2021

Alright, I think I have a working implementation. It was a bit more complicated than discussed above, due to the case where we select a branch whose key prefix is equal to the searched key but that ends up become larger than the searched key. We need to remember the latest byte of the key where we could have chosen a lower byte rather an equal one, so that we can backtrack to it when it turns out that we're in a dead end.

The implementation is available here. As discussed, it uses a linear scan as a first step. I didn't attempt to profile/optimize it.

I share this implementation in the hope it is correct and will be useful to someone, but do keep in mind that it is running late here, so it could definitely be faulty despite the few test cases 😅.

@BurntSushi
Copy link
Owner

Ah I see, that makes sense. I'll leave this issue open for now, as I don't quite have the time to process this fully. It might be the case that this should lead to some API editions to make this simpler, but clearly, my intuition here was wrong. So perhaps my intuition about providing reverse streams is also wrong. If reverse streams are possible (without any sort of backtracking), then it should also be possible to resolve your query without any kind of backtracking either.

@dureuill
Copy link
Author

dureuill commented Aug 19, 2021

Hello, sorry for the delay in coming back to you. Life.

If reverse streams are possible (without any sort of backtracking)

Allegedly I don't know enough of the implementation to answer this so could be completely wrong, but in my mental model of the fst, I don't see how to implement reverse streams without some kind of backtracking. Again, this is my superficial understanding, which might be tainted by the implementation of my use case using the existing API.


If it can help with the design of the API, my specific use case involves both retrieving the "biggest key smaller than an id" and also the first key bigger than that id, preferably by querying the map only once.

To explain the use case, the "biggest key smaller than an id" allows for fast random access to the value associated with any id, while the first key bigger than an id improves the sequential access (by caching the returned key/value, we only need to query the map again when we request an id bigger than the returned key).

A way to achieve that would maybe be to return an iterator rather than just the key/value bigger than the id. That way, for continued sequential access we could only use the iterator.

With this changes, the pub fn get_before(&self, key: u64) -> Option<u64> from my linked source file would become a pub fn get_from_before(&self, key: u64) -> (Option<u64>, fst::map::Stream<'_>), with the first element from the stream being the first with a bigger key that the passed id.

A naive implementation (that walks the map twice) for this function could be:

pub get_from_before(&self, key: u64) -> (Option<u64>, fst::map::Stream<'_>) {
    let before = self.get_before(key);
    // FIXME: might be gt, need to check the behavior of get_before when key is in map
    let stream = self.range.ge(key);
    (before, stream)
}

@bonsairobo
Copy link

I have also found myself needing to implement this kind of search. It is not trivial!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants