Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

find_matches_as_indexes returns bogus values #83

Closed
nijel opened this issue Sep 21, 2023 · 22 comments · Fixed by #84
Closed

find_matches_as_indexes returns bogus values #83

nijel opened this issue Sep 21, 2023 · 22 comments · Fixed by #84

Comments

@nijel
Copy link
Contributor

nijel commented Sep 21, 2023

At WeblateOrg/weblate#10002 we observe find_matches_as_indexes to return bogus values (bigger than sys.maxsize). While I'm still trying to understand the issue, I looked at the ahocorasick_rs code and I have one question about find_matches_as_indexes:

https://github.com/G-Research/ahocorasick_rs/blob/7bcfe13711ef81a43cbae1ecff87f7d3847a4744/src/lib.rs#L211C1-L231C6

It uses as_u64 to convert pattern index, while byte_to_code_point to convert start and end positions. Couldn't this be the place where something could go wrong? I would expect all to be converted as integers.

@BurntSushi
Copy link

Can you provide a small Python program that reproduces the problem?

@nijel
Copy link
Contributor Author

nijel commented Sep 21, 2023

Okay, it is caused by a blank string (looks like a bug on our side) and some unicode characters.

Minimal reproducer:

import ahocorasick_rs

automaton = ahocorasick_rs.AhoCorasick([""])

print(automaton.find_matches_as_indexes('”'))

It prints:

[(0, 0, 0), (0, 18446744073709551615, 18446744073709551615), (0, 18446744073709551615, 18446744073709551615), (0, 1, 1)]

@BurntSushi
Copy link

Yeah it looks like you might be right that it's an issue with this wrapper. I can at least confirm that the underlying Rust library is doing the right thing here:

fn main() -> anyhow::Result<()> {
    let ac = aho_corasick::AhoCorasick::new([""]).unwrap();
    let hay = "”";
    for m in ac.find_iter(hay) {
        dbg!(m);
    }
    Ok(())
}

And running it:

$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/i83`
[main.rs:5] m = Match {
    pattern: PatternID(
        0,
    ),
    span: 0..0,
}
[main.rs:5] m = Match {
    pattern: PatternID(
        0,
    ),
    span: 1..1,
}
[main.rs:5] m = Match {
    pattern: PatternID(
        0,
    ),
    span: 2..2,
}
[main.rs:5] m = Match {
    pattern: PatternID(
        0,
    ),
    span: 3..3,
}

The empty string is somewhat of a pathological case that usually requires a special code path.

@nijel
Copy link
Contributor Author

nijel commented Sep 21, 2023

I think the problem is that matching is byte wise, so it is trying to map matched [0, 1, 2, 3] into a string of length 1. This really cannot happen with a non-blank strings, thanks to the way UTF-8 is encoded.

@itamarst
Copy link
Collaborator

Thanks for the bug report and reproducer, I will try to get this fixed soon.

@itamarst
Copy link
Collaborator

itamarst commented Sep 21, 2023

I think I will just filter out empty patterns, it's not a useful thing to match on. That will make the storing-patterns code path more annoying, though.

@itamarst
Copy link
Collaborator

But... filtering them out will distort the results, so guess I won't.

@BurntSushi
Copy link

One possible thing you could do: when searching a Unicode string, remove all zero-length matches that split a codepoint. You should probably leave them in when searching a byte string, but matches that split a codepoint when searching a Unicode string are a little weird.

(FWIW, this is what the Rust regex crate does. When searching a &str, empty matches that split a codepoint are never returned. But they are allowed when searching a &[u8].)

@itamarst
Copy link
Collaborator

Is there any use case for zero-length matches? It just feels like GIGO, a mistake on the user's part when they inserted this sort of pattern.

@itamarst
Copy link
Collaborator

(As side note, this is plausibly what caused the panic in #51).

@BurntSushi
Copy link

BurntSushi commented Sep 21, 2023

For regex? Definitely. For multi-substring matching, it's hard to articulate one off the top of my head. Historically speaking, all single and multi substring search algorithms handle the empty substring correctly. And production implementations (that I'm aware of) do as well. So by choosing not to handle it, you might wind up surprising users. And by choosing not to handle it, you might in turn require callers to handle it themselves. For example, callers might feed user generated patterns into an Aho-Corasick searcher. If your library doesn't handle the case of the empty pattern, then it follows the caller likely needs to handle it somehow. It's a non-obvious and typically pathological case, so it's a fair bet that callers will not know to handle it. From that perspective, it is IMO better to just handle the empty pattern in the library. Aside from the Unicode issue of how to deal with byte offset splitting a codepoint, it has a straight-forward semantic: it should match at every position. Finally, the underlying library you're wrapping explicitly handles the empty pattern. So you might as well do it too. :-)

It is indeed plausible that providing an empty pattern is a mistake. But that mistake can be easily observed when the search ends up matching at every position.

@itamarst
Copy link
Collaborator

Hm. My current prototype handles it by filtering out all 0-length matches. This doesn't seem like a bad way to handle it but it is inconsistent. I guess I can compare to the "is this invalid UTF-8 boundary" case and see how I feel about it.

@nijel
Copy link
Contributor Author

nijel commented Sep 21, 2023

I consider this bug in our code and fixed that. There is nothing useful we could do with zero length matches. Constructor failing with a ValueError on such string would work well as it would make it clear that this is wrong. If empty matches are expected to work, filtering out positions inside utf8 might be needed.

@itamarst
Copy link
Collaborator

I think I'll go with the ValueError, yeah.

@itamarst
Copy link
Collaborator

itamarst commented Sep 21, 2023

My reasoning for posterity—given kinda bogus inputs, the options are:

  • Complain early. Annoying for users in that sometimes it breaks, but very informative, and informs users of a likely bug in their code, as @nijel noted. Small performance hit on construction, smaller if I factor it right.
  • Filter out all bogus 0-length results. Adds (a tiny bit of) performance overhead, masks bugs in user code.
  • Filter out 0-length results that happen on the wrong unicode boundary. This is a more significant performance overhead depending on which API is used, though still pretty small. Again, masks bugs in user code.

The first option seemed to have the best tradeoffs in terms of where performance hit is taken and long-term user success.

@BurntSushi
Copy link

I don't understand the performance impacts here. You should only ever need to do anything when a pattern and/or match is zero length. Checking for zero length should be nearly free relative to the work being done to produce the match in the first place. If it's zero length, then the overhead of filtering them out should be acceptable IMO because there is already a lot of overhead if you're producing a match at every position.

I think it's bad juju for any multi or single substring implementation to just reject the empty pattern altogether. In my experience this becomes an annoying limitation that needs to be worked around. I'm thinking specifically about the case where one accepts user supplied patterns.

I also disagree that allowing an empty pattern masks bugs in user code. If you supply an empty pattern you should get a match at every position. It's not clear to me how that's masking a bug.

@itamarst
Copy link
Collaborator

Filtering out all of them is cheap, yeah. Filtering out the ones on bad unicode boundaries requires coming up with that mapping in some cases where currently we don't need to calculate it. Again, probably pretty cheap, but a bit more work.

Mostly this is about user intentions: why would you ever want to match an empty string?

@itamarst
Copy link
Collaborator

If you the author of a tool have user gave you an empty string, yes you have to handle it, but did your user intended to give an empty string? Probably not meaningful to them either. So you the tool author have to handle it but the upstream user is now getting more meaningful results (or perhaps they had a bug).

@BurntSushi
Copy link

Filtering out the ones on bad unicode boundaries requires coming up with that mapping in some cases where currently we don't need to calculate it. Again, probably pretty cheap, but a bit more work.

The key here is that you should only ever have to do this when there's a pattern of zero length. Otherwise a zero length match is impossible.

If you the author of a tool have user gave you an empty string, yes you have to handle it, but did your user intended to give an empty string? Probably not meaningful to them either.

They might have, yes. It's a not-uncommon idiom to do something like grep -e foo -e bar -e '' to show the entire contents of a file with matches of foo and bar highlighted. The empty string at the end ensures that every line matches, but not necessarily every line shows a highlighted match.

Basically, an empty pattern in the context of, say, a Python or Rust program is pretty weird and it is difficult for me to articulate a use case. (Although I would argue that consistency with other multi and substring search implementations should move the needle here in favor of making it work.) But once you move to asking an end user for patterns, then use cases tend to pop up because the end user has a limited interface to work with. Things like an empty pattern can provoke useful behavior in some cases.

So you the tool author have to handle it but the upstream user is now getting more meaningful results (or perhaps they had a bug).

The problem is that they may have a bug. But your change is proposing that it is definitely a bug.

Tool authors are unlikely to handle this case up-front. Pretty much everyone forgets about the empty pattern. I did too. I went through exactly the same dance you're going through. It was just a long time ago.

@itamarst
Copy link
Collaborator

It's a not-uncommon idiom to do something like grep -e foo -e bar -e '' to show the entire contents of a file with matches of foo and bar highlighted. The empty string at the end ensures that every line matches, but not necessarily every line shows a highlighted match.

Doesn't this suggest grep should have a --show-all-lines option?

@BurntSushi
Copy link

grep is example. I don't want to argue about what grep itself should or shouldn't support. The point is that this is what happens when patterns become the user interface: users are limited in what they can express and the empty pattern is a useful means of expressing certain idioms without expanding the user interface. Expanding the user interface may be desirable in some cases and not in others. It really just depends. We can't legislate everything here.

I think I'm about spent on this issue. I think I've made my case as well as I can.

@itamarst
Copy link
Collaborator

I'm going to stick to current plan for now; can always remove the restriction if someone has a sufficiently compelling use case. I do appreciate the feedback, @BurntSushi!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants