Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consider adding a tokenizer/scanning routine #103

Open
BurntSushi opened this issue Apr 3, 2020 · 5 comments · May be fixed by #104
Open

consider adding a tokenizer/scanning routine #103

BurntSushi opened this issue Apr 3, 2020 · 5 comments · May be fixed by #104

Comments

@BurntSushi
Copy link
Owner

@llogiq brought up this use case where one might have a really big set of tokens that one wants to use to scan some text. It turns out that this is pretty easy to support using existing public APIs, but it might be nice to actually make this a proper part of the API. Here is a candidate implementation:

use fst::raw::Fst;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let fst = Fst::from_iter_map(vec![
        (" ", 0),
        ("a", 1),
        ("ab", 2),
        ("abc", 3),
        ("bc", 4),
        ("uvwxyz", 5),
    ]).unwrap();

    let mut input = "abc a uvwxyz uvwxyz ab ab a ab abc";
    while !input.is_empty() {
        let (end, id) = match scan(&fst, input.as_bytes()) {
            None => return Err(From::from("unrecognized token")),
            Some((end, id)) => (end, id),
        };
        // don't print whitespace
        if id != 0 {
            println!("{:?} :: {:?}", id, &input[..end]);
        }
        input = &input[end..];
    }

    Ok(())
}

fn scan<D: AsRef<[u8]>>(fst: &Fst<D>, input: &[u8]) -> Option<(usize, u64)> {
    let mut node = fst.root();
    let mut out = fst::raw::Output::zero();
    // If the empty string is a token, then `last_match` probably needs to
    // be initialized in a smarter way.
    let mut last_match: Option<(usize, u64)> = None;
    for (i, &b) in input.iter().enumerate() {
        node = match node.find_input(b) {
            None => return last_match,
            Some(trans_index) => {
                let t = node.transition(trans_index);
                out = out.cat(t.out);
                fst.node(t.addr)
            }
        };
        if node.is_final() {
            last_match = Some((i+1, out.value()));
        }
    }
    last_match
}

And the output:

$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/fst-longest`
3 :: "abc"
1 :: "a"
5 :: "uvwxyz"
5 :: "uvwxyz"
2 :: "ab"
2 :: "ab"
1 :: "a"
2 :: "ab"
3 :: "abc"
@llogiq
Copy link

llogiq commented Apr 3, 2020

I think if node is final we need to cat the final_output of the node. So I replaced the loop with

    for (i, &b) in input.iter().enumerate() {
        match node.find_input(b) {
            None => return last_match,
            Some(trans_index) => {
                let t = node.transition(trans_index);
                node = fst.node(t.addr);
                if node.is_final() {
                    out = out.cat(node.final_output());
                    last_match = Some((i + 1, out.value()));
                } else {
                    out = out.cat(t.out);
                }
            }
        }
    }

This gives me correct results.

@BurntSushi
Copy link
Owner Author

Ah yeah, great catch!

@llogiq
Copy link

llogiq commented Apr 3, 2020

There's still one error, because we prematurely cat the final output, but this "final" node may still be a prefix of another node that's valid. So with that corrected:

    for (i, &b) in input.iter().enumerate() {
        if let Some(trans_index) = node.find_input(b) {
            let t = node.transition(trans_index);
            node = fst.node(t.addr);
            if node.is_final() {
                last_match = Some((i + 1, out.cat(node.final_output()).value()));
            }
            out = out.cat(t.out);
        } else {
            return last_match;
        }
    }

@llogiq
Copy link

llogiq commented Apr 4, 2020

Regarding the name, I think scan is short, but not too descriptive. Perhaps find_longest_prefix?

@llogiq llogiq linked a pull request Apr 8, 2020 that will close this issue
@tapeinosyne
Copy link
Contributor

Having done exactly this, I agree that inclusion in the public API would be desirable. (Perhaps more generally, prefix matching would be a natural inclusion given the trie's internal structure, though expectations vary when it comes to APIs.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants