Performance enhancements to enable large inputs #147
Hi Vítor, thanks for reaching out to me. This tool is simply not meant to scale to billion-sized inputs. I would not even know how to accomplish that. For DFA generation, I use a separate library, so I'm limited by that one. The tool's primary purpose is to help create regular expressions with a focus on correctness (which is hard enough), not performance. I will work on a new feature that allows creating various character classes and ranges of characters so that more generalized expressions are possible. Feel free to open a PR if you see a way to improve performance or other aspects. Even though I'm relatively proficient in Rust, I'm far from being an expert.
Thanks Peter!
I appreciate you taking the time to reply and not dismissing this issue
completely.
My apologies if this sounded absurd at first glance, but it should be generally feasible if there is a way to distribute the computation.
Using rayon [1], which is already a transitive dependency of this project via criterion [2, 3], would be an easy and effective first step toward parallelism on a single node.
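To make the suggestion concrete, here is a minimal sketch of rayon's drop-in style, assuming a hypothetical vector of test-case strings and per-item work that merely stands in for grex's real processing:

```rust
use rayon::prelude::*;

fn main() {
    // Hypothetical stand-in for a large list of test-case strings.
    let inputs: Vec<String> = (0..1_000_000).map(|i| format!("input{i}")).collect();

    // Sequential version: inputs.iter().map(...).collect()
    // rayon only requires swapping iter() for par_iter():
    let lengths: Vec<usize> = inputs.par_iter().map(|s| s.chars().count()).collect();

    println!("processed {} strings", lengths.len());
}
```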
After that, the hard part would be distributing across more than one node. In scientific computing we generally use OpenMPI for that; I don't know the Rust equivalent well, but there is mpi [4], which I guess would do.
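And a minimal sketch of what the mpi crate [4] looks like; this only demonstrates rank discovery, not any grex integration, and assumes an MPI runtime is installed (run with e.g. mpiexec -n 4):

```rust
use mpi::traits::*;

fn main() {
    // Each MPI rank would be handed one independent subset of the input.
    let universe = mpi::initialize().unwrap();
    let world = universe.world();

    println!("rank {} of {} ready for its subset", world.rank(), world.size());
}
```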
As for memory, the working set just needs to stay within a single node's total memory in both the sequential and the parallel code paths.
The hardest part, however, is determining at the project level what needs to change from an architecture standpoint to include parallel constructs.
As an example, if it were possible to create expressions for independent subsets of the input data and then merge those expressions, this would completely solve the problem even if DFA generation is slow. MPI is not even needed in this case, since each node can independently generate an expression from its own subset and the results can be merged afterwards. Ideally, though, it would be nice to introduce some relaxation to lower computational cost, such as merging character ranges.
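A sketch of the divide-and-merge idea using grex's library API (RegExpBuilder) together with rayon; the merge step here is a naive alternation of the per-chunk expressions, which is correct but not minimal, whereas a real implementation would union the underlying DFAs and re-minimize:

```rust
use grex::RegExpBuilder;
use rayon::prelude::*;

fn main() {
    let inputs = vec!["440a", "440b", "725 (II)", "725 (III)", "696 (IV)", "696 (V)"];

    // Generate one expression per independent chunk of the input, in parallel.
    let parts: Vec<String> = inputs
        .par_chunks(2)
        .map(|chunk| {
            let expr = RegExpBuilder::from(chunk).build();
            // grex anchors its output with ^...$; strip the anchors so the
            // per-chunk alternatives can be recombined below.
            expr.trim_start_matches('^').trim_end_matches('$').to_string()
        })
        .collect();

    // Naive merge: the union of the per-chunk expressions.
    println!("^(?:{})$", parts.join("|"));
}
```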
I am aware this is a lot to ask and outside the scope of your project, so please feel free to read this issue simply as a reminder to consider larger-sized problems.
I will work on a new feature that allows creating various character classes and ranges of characters so that more generalized expressions are possible.
That would be incredible, thank you!
Feel free to open a PR if you see a way to improve performance or other
aspects. Even though I'm relatively proficient in Rust, I'm far from being
an expert.
If I do I'll be sure to send it your way.
[1] https://docs.rs/rayon/latest/rayon/index.html
[2] https://github.com/pemistahl/grex/blob/v1.4.1/Cargo.lock#L209-L233
[3] https://github.com/pemistahl/grex/blob/v1.4.1/Cargo.lock#L909-L919
[4] https://docs.rs/mpi/latest/mpi/index.html
Thank you for your suggestions. Looks interesting. I will leave this issue open and come back to it as soon as I continue working on performance improvements.
I changed the title so it looks less like clickbait to outside viewers. I'll share a few test sets to consider and more as they come along.
[Four test sets with their expected outputs were attached here; the attachments did not survive extraction.]
These test cases also only use a single thread. Lots of operations, including (independent) iterations and sorting, could be exchanged for parallel counterpart calls for a free win; see the sketch below.
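A sketch of the kind of drop-in swap meant here, assuming a hypothetical vector of strings; rayon's par_sort mirrors the standard sort:

```rust
use rayon::prelude::*;

fn main() {
    let mut words = vec!["banana", "apple", "cherry", "apricot"];

    // Sequential: words.sort(); rayon's parallel counterpart:
    words.par_sort();

    println!("{words:?}");
}
```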
Hey, I have a very simple data set: digits, followed either by one or two lowercase non-digit characters, or by one or two lowercase non-digit characters plus a roman numeral from I to VI in parentheses. I would offer the following regex. Surely this can also be done more simply, but here: ^[0-9]+[a-z]{0,2}(\s\((IV|V?I{0,3})\))?$ (regex101 analysis). Test file: test.txt
Here's what I got with grex:
^(?:(?:(?:13|69)6 \(II|440 \(II|725 \(II|(?:2(?:56j|85|98)|4(?:2(?:2b|9)|1[07]|4[38])|7(?:10e|02|22)|1(?:63|75|81)|3(?:43|12)|5(?:15|83)|6(?:0[0358]|1[34])|89[08]|502) \(I|(?:13|69)6 \(V|(?:(?:783|(?:(?:23|31)|44))|883) \(I)I?\)|(?:296|409|708|821) \(I(?:II?\)|\))|9(?:(?:57i|22) \(II?\)|9(?:8 \(I(?:II?\)|\))|9a|9|[0-7])?|5(?:7[ad-fh]|[0-689])|7(?:7[a-h]|[0-689])|57?|77?|1[0-9ab]|2[013-9]|6[013-9a]|[0348][0-9]|1|2|6|[0348])?|(?:255 \(I[IV]|(?:13|69)6 \(IV|440 \(IV|725 \(IV)\)|2(?:5(?:5 \(I\)|6(?:a[a-s]|[bce-ik-oq-y])|7[a-c]|[0-489])|1[0-57-9]|4[0-9a-c]|6[0-4689]|7[1-689]|8[0-46-8]|3[0-9]|9[0-579])|(?:13|69)6 \(I\)|440 \(I\)|7(?:2(?:5 \(I\)|[1346-9])|1(?:0(?:[ab][a-z]|[df-rt-z])|[1-57-9])|10c[a-i]|7(?:4(?:a[a-i]|[b-kw-z])|[0-35-8])|3(?:9[ac-g]|[0-8])|8(?:3[ab]|[124-69])|0[013-79])|1(?:0(?:1(?:9[a-j]|[0-8])|2[01346-9]|6[4-9a-f]|7[013-9])|1(?:0[0-7]|[1-9])|3(?:6b|[0-579])|6(?:2[bc]|[014-689])|2[0-25-9]|4[013-689]|5[235-9]|7[0-46-9a-c]|8[02-9])|(?:1(?:060|54|67)|277|35[58]|7(?:16|20|8[78]))[ab]|(?:1(?:0(?:22|63|72)|50)|265|3(?:4[12]|60)|4(?:24|35)|7(?:79|80))a|1(?:0(?:19?|2|6|7)?|10?|62?|2|3|4|5|7|8)?|2(?:5(?:6a?)?|1|4|6|7|8)?|7(?:1(?:0[ab]?)?|10c|7(?:4a?)?|39?|0|2|8)?|1(?:060|54|67)|1(?:0(?:22|63|72)|50)|105[013-9]|1(?:38|47)[a-e]|3(?:2(?:4[a-eg-l]|[0-35-9])|45[a-e]|8(?:4[abd-f]|6[a-e]|[0-357-9])|0[013-9]|3[0-79]|4[046-9a-k]|5[0-49]|6[1-9]|7[0-46-9]|1[013-79])|4(?:2(?:2[ac]|[0135-8])|0[0-8]|1[1-689]|3[0-46-9ab]|4[124-79])|108[02-9]|(?:1(?:42|51)|267)[a-d]|(?:270|3(?:38|56|75))[a-c]|79(?:0[ab]|[1-9])|8(?:3(?:6[a-fh-mq-z]|[0-57-9])|4(?:5[a-d]|8[a-c]|[0-4679ab])|5(?:0[ab]|[1-9])|7(?:5[a-l]|[0-46-9])|8(?:3[a-c]|[0-24-9])|0[02-8]|1[02-9]|2[02-9]|6[0-9a]|9[1-79a]|[a-h])|(?:1(?:0[0349]|9)|2[02]|39|4[5-9]|5[2-7]|6[2-68]|7[45])[0-9]|105|1(?:38|47)|3(?:24?|45|8(?:4|6)?|0|3|4|5|6|7)?|4(?:22?|0|1|3)?|108|1(?:42|51)|267|270|3(?:38|56|75)|277|35[58]|7(?:16|20|8[78])|790?|8(?:36?|4(?:5|8)?|50?|75?|0|1|2|6|9)?|1(?:0[0349]|9)|265|3(?:4[12]|60)|4(?:24|35)|7(?:79|80)|5(?:1[01346-9]|8[0-24-9]|9[0-9a-d])|6(?:0[124679]|1[0-25-9]|7[0-9c]|9[0-57-9])|76[0-9a]|50[013-9]|5(?:1|8)?|6(?:0|1|7|9)?|76|50|2[02]|39|4[5-9]|5[2-7]|6[2-68]|7[45])$
I tried to find out with regex101's analysis whether grex follows some kind of rule set for its grouping, hoping that might provide helpful information for this enhancement, since I don't really have the skills to get into the code right now. But I love regex. The pattern grex tries to build by grouping is too wild for me to put into words. regex101's analysis indicates that grex has formed one large non-capturing group followed by 59 alternatives. I do play around with
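For anyone who wants to sanity-check the hand-written pattern, here is a small sketch using the regex crate; the sample lines are hypothetical, chosen to fit the format described above:

```rust
use regex::Regex;

fn main() {
    // The hand-written pattern from above.
    let pattern = Regex::new(r"^[0-9]+[a-z]{0,2}(\s\((IV|V?I{0,3})\))?$").unwrap();

    // Hypothetical sample lines in the described format.
    for line in ["440", "256j", "725 (II)", "696 (IV)", "not-a-match"] {
        println!("{line}: {}", pattern.is_match(line));
    }
}
```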
If this comment does not provide important or purposeful information, let me know and I will delete it. I just wanted to share what I was playing around with.
Hi @scheidelerl, thank you for your valuable input, very much appreciated. :) I know about the shortcomings of the current algorithm; that's why issues #48 and #122 already exist. I want to work on these, I'm just lacking free time at the moment. But the project is still active and I will continue working on it. I just can't tell you when.
Hi,
I want to run this on a billion-sized, low-entropy, case-insensitive UTF-8 input with a variable number of characters, up to approximately 256 bytes per entry.
The data comply with a specification format, but empirically they have very low entropy, using only a very narrow subset of the specification and a very narrow subset of UTF-8 characters.
Think of URI schemes like mailto [1]. The standard makes ample provisions for the format, but empirically only a narrow subset is used.
However, DFA generation is very slow even for very small samples. In addition, even though I am not very proficient in Rust, I cannot see how to introduce multiprocessing or MPI constructs. DFA minimization also returns impractical expressions, and I cannot see how to relax, penalize, or restrict the size of the expression; coalesce (arbitrary) character ranges or groups; include prior knowledge of the superset expression defined by the specification format; or take advantage of data preorderings.
Finding an expression that matches legal and legitimate data based on the empirical set is useful for spotting divergent data.
As an example, if some subexpression is known by the specification to be constrained to \w characters, and in the data all lowercase alphabet characters except one were present, it would be fine to merge the ranges to [a-z]. The point here is that even though this subexpression may contain any \w character per the specification, only [a-z] is used in practice; if some new datum contained numbers, that would be rather suspicious even though the specification allows it, given that a billion samples did not.
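A sketch of the kind of relaxation meant here, written as a hypothetical post-processing step; relax_to_class, the threshold, and the inputs are all illustrative and not part of grex's API:

```rust
use std::collections::BTreeSet;

/// Hypothetical rule: if the observed characters cover "enough" of a class
/// permitted by the specification, widen to the whole class.
fn relax_to_class(observed: &BTreeSet<char>, class: &[char], label: &str, threshold: f64) -> Option<String> {
    let covered = class.iter().filter(|c| observed.contains(*c)).count();
    if covered as f64 / class.len() as f64 >= threshold {
        Some(label.to_string())
    } else {
        None
    }
}

fn main() {
    // All lowercase letters except 'q' were observed in the data.
    let observed: BTreeSet<char> = ('a'..='z').filter(|&c| c != 'q').collect();
    let lowercase: Vec<char> = ('a'..='z').collect();

    // 25 of 26 letters present: widen to [a-z] instead of enumerating ranges.
    match relax_to_class(&observed, &lowercase, "[a-z]", 0.9) {
        Some(class) => println!("relaxed to {class}"),
        None => println!("kept the exact enumeration"),
    }
}
```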
The constraints imposed by the DFA solution as implemented are too strict in this regard, and I cannot see where to go from here.
To summarize, I would greatly appreciate it if you could point me towards scaling this library to work on large inputs.
I hope this tiny discussion interests you to some degree.
Vítor
[1] https://datatracker.ietf.org/doc/html/rfc6068#section-2