consider adding a tokenizer/scanning routine #103

BurntSushi · 2020-04-03T13:50:35Z

@llogiq brought up this use case where one might have a really big set of tokens that one wants to use to scan some text. It turns out that this is pretty easy to support using existing public APIs, but it might be nice to actually make this a proper part of the API. Here is a candidate implementation:

use fst::raw::Fst;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let fst = Fst::from_iter_map(vec![
        (" ", 0),
        ("a", 1),
        ("ab", 2),
        ("abc", 3),
        ("bc", 4),
        ("uvwxyz", 5),
    ]).unwrap();

    let mut input = "abc a uvwxyz uvwxyz ab ab a ab abc";
    while !input.is_empty() {
        let (end, id) = match scan(&fst, input.as_bytes()) {
            None => return Err(From::from("unrecognized token")),
            Some((end, id)) => (end, id),
        };
        // don't print whitespace
        if id != 0 {
            println!("{:?} :: {:?}", id, &input[..end]);
        }
        input = &input[end..];
    }

    Ok(())
}

fn scan<D: AsRef<[u8]>>(fst: &Fst<D>, input: &[u8]) -> Option<(usize, u64)> {
    let mut node = fst.root();
    let mut out = fst::raw::Output::zero();
    // If the empty string is a token, then `last_match` probably needs to
    // be initialized in a smarter way.
    let mut last_match: Option<(usize, u64)> = None;
    for (i, &b) in input.iter().enumerate() {
        node = match node.find_input(b) {
            None => return last_match,
            Some(trans_index) => {
                let t = node.transition(trans_index);
                out = out.cat(t.out);
                fst.node(t.addr)
            }
        };
        if node.is_final() {
            last_match = Some((i+1, out.value()));
        }
    }
    last_match
}

And the output:

$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/fst-longest`
3 :: "abc"
1 :: "a"
5 :: "uvwxyz"
5 :: "uvwxyz"
2 :: "ab"
2 :: "ab"
1 :: "a"
2 :: "ab"
3 :: "abc"

The text was updated successfully, but these errors were encountered:

llogiq · 2020-04-03T18:42:28Z

I think if node is final we need to cat the final_output of the node. So I replaced the loop with

    for (i, &b) in input.iter().enumerate() {
        match node.find_input(b) {
            None => return last_match,
            Some(trans_index) => {
                let t = node.transition(trans_index);
                node = fst.node(t.addr);
                if node.is_final() {
                    out = out.cat(node.final_output());
                    last_match = Some((i + 1, out.value()));
                } else {
                    out = out.cat(t.out);
                }
            }
        }
    }

This gives me correct results.

BurntSushi · 2020-04-03T18:49:21Z

Ah yeah, great catch!

llogiq · 2020-04-03T22:40:25Z

There's still one error, because we prematurely cat the final output, but this "final" node may still be a prefix of another node that's valid. So with that corrected:

    for (i, &b) in input.iter().enumerate() {
        if let Some(trans_index) = node.find_input(b) {
            let t = node.transition(trans_index);
            node = fst.node(t.addr);
            if node.is_final() {
                last_match = Some((i + 1, out.cat(node.final_output()).value()));
            }
            out = out.cat(t.out);
        } else {
            return last_match;
        }
    }

llogiq · 2020-04-04T06:02:19Z

Regarding the name, I think scan is short, but not too descriptive. Perhaps find_longest_prefix?

tapeinosyne · 2020-05-19T17:41:30Z

Having done exactly this, I agree that inclusion in the public API would be desirable. (Perhaps more generally, prefix matching would be a natural inclusion given the trie's internal structure, though expectations vary when it comes to APIs.)

BurntSushi added the enhancement label Apr 3, 2020

llogiq linked a pull request Apr 8, 2020 that will close this issue

add fst::raw::Fst::find_longest_prefix #104

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

consider adding a tokenizer/scanning routine #103

consider adding a tokenizer/scanning routine #103

BurntSushi commented Apr 3, 2020

llogiq commented Apr 3, 2020

BurntSushi commented Apr 3, 2020

llogiq commented Apr 3, 2020

llogiq commented Apr 4, 2020

tapeinosyne commented May 19, 2020

consider adding a tokenizer/scanning routine #103

consider adding a tokenizer/scanning routine #103

Comments

BurntSushi commented Apr 3, 2020

llogiq commented Apr 3, 2020

BurntSushi commented Apr 3, 2020

llogiq commented Apr 3, 2020

llogiq commented Apr 4, 2020

tapeinosyne commented May 19, 2020