Skip to content
This repository has been archived by the owner on May 14, 2020. It is now read-only.

pmf and case-sensitive matching #998

Open
fgsch opened this issue Jan 22, 2018 · 6 comments
Open

pmf and case-sensitive matching #998

fgsch opened this issue Jan 22, 2018 · 6 comments

Comments

@fgsch
Copy link
Contributor

fgsch commented Jan 22, 2018

After c1a7652, all @pmf occurrences have the lowercase transformation.

IMO this was not the right way to fix it since it implies @pmf is no longer case insensitive as the documentation states.
Furthermore, although disabled there is code in the Aho-Cora implementation to support UTF8.
If this ever gets enabled and rules start depending on it, having lowercase alone would not be enough.

We should either revert this or update the documentation (and code) to treat the input as case-sensitive.

@dune73
Copy link
Contributor

dune73 commented Jan 25, 2018

I am not quite following you; too many negations for my brain.

If I remember by correctly, we made this adjustment to make sure we can work with case sensitive and case insensitive. All data files are supposed to be lowercase, all payloads transformed to lowercase.

Do you understand how UTF-8 plays into this? Could you explain.

Nothing wrong with updating the documentation and making this all transparent, though.

@fgsch
Copy link
Contributor Author

fgsch commented Feb 19, 2018

Apologies for the delay on this.

There are a number of issues IMO with this change:

  1. We're doing the conversion twice (one in Aho-Corasick and the other in the transformation).
  2. The A-C code will lowercase one letter at the time; the lowercase transformation will do the whole string. This has performance implications, in particular for large inputs.
  3. There is a mismatch between the rules and the documentation; the documentation says pmf is case insensitive but the rules add the lowercase transformation, implying otherwise.
  4. Because of number 3, other implementations following the description of pmf will end up doing the conversion twice.
  5. The A-C code supports UTF-8, albeit disabled by default. The lowercase transformation does not support UTF-8. An implementation depending on the lowercase transformation will not behave the same iff UTF-8 is enabled.

Hope that is better but let me know otherwise.

@github-actions
Copy link

This issue has been open 120 days with no activity. Remove the stale label or comment, or this will be closed in 14 days

@github-actions github-actions bot added the Stale issue This issue has been open 120 days with no activity. label Nov 20, 2019
@fgsch fgsch removed the Stale issue This issue has been open 120 days with no activity. label Nov 20, 2019
@fgsch
Copy link
Contributor Author

fgsch commented Nov 20, 2019

@dune73 ping.

@dune73
Copy link
Contributor

dune73 commented Jan 8, 2020

OK. I got you now.

We are based on an engine that defines @pmf as case-insensitive, yet we act as if it was sensitive in order to comfort implementations that do not follow the definition and implement a case sensitive A-C algorithm. Consequently users of compliant implementations suffer the performance penalty of an unnecessary t:lowercase transformation. And on top of it, we're likely to face UTF-8 problems with out transformation.

Correct?

If yes, it would indeed make sense to roll back the change.

It would be nice, if @p0pr0ck5 could chime in, before we perform the roll back, though.

@dune73
Copy link
Contributor

dune73 commented Mar 4, 2020

This was on the agenda for the community chat on March 2, 2020. But we postponed the discussion.

#1683 (comment)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants