Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inline flags #25

Open
SonOfLilit opened this issue Sep 1, 2022 · 10 comments
Open

Inline flags #25

SonOfLilit opened this issue Sep 1, 2022 · 10 comments

Comments

@SonOfLilit
Copy link
Owner

Implement syntax for inline flags, e.g.:

[ignore_case "hello, world, my name is " [ignore_case:unset #uppercase [1+ #lowercase]]]

Should compile to:

(?i)hello, world, my name is (?-i:[A-Z][a-z]+)

See https://docs.python.org/3/library/re.html section (?aiLmsux-imsx:...)

@PrakritiRaw
Copy link

can you please elaborate the issue a little more.

@SonOfLilit
Copy link
Owner Author

Python supports a feature called "inline flags", which you can read about in the python regular expression docs.

See https://docs.python.org/3/library/re.html section (?aiLmsux-imsx:...)

Currently, there is no syntax in Kleenexp that would compile to this regex syntax.

This issue proposes syntax for this. e.g.

[ignore_case "hello, world, my name is " [ignore_case:unset #uppercase [1+ #lowercase]]]

Should compile to:

(?i)hello, world, my name is (?-i:[A-Z][a-z]+)

This would involve adding a line to builtin_operators in compile.py, adding a class to asm.py that implements an inline flag, and implementing methods like to_regex() on it (see other similar asm classes like Capture or Lookahead).

Then we could do something similar for the rust compiler, the mapping is pretty straightforward but with different names (Regexable instead of Asm).

@SonOfLilit
Copy link
Owner Author

SonOfLilit commented Oct 4, 2022

Names for all the settings:

[ascii_only ...]
[ignore_case ...]
[locale_dependent ...]
[multiline ...]
[any_matches_all ...]
[unicode ...]

We obviously don't need to support the verbose flag.

@Yehuda-blip
Copy link

Yehuda-blip commented Apr 1, 2023

Hey, I started implementing this feature - a few questions:
a. I found a syntax error in asm.py on line 228, not really sure what that is or why it's there.
b. Assuming I want to open a draft PR, should I push changes to here or to my own fork? I
am not sure about the standards you are trying to withhold and some think that you might
consider some of the computation I do more fitting for the compile stage rather than the
assembly stage (mainly: validating that a flag cannot be set and unset in the same flag-setting.
the expression re.compile(r"(?i-i:example)" raises "bad inline flags" and I figured that so should
"ignore_case ignore_case:unset "example"). Since I'm new here (and to open-source as a whole)
I figured somebody will have a lot of comments.
c. Running the pytest command shows a NotImplemented failure on asm.py line 21 - again, not
sure if that's meant to be there.
d. I've yet to write any tests (or figure out how the testing works), but I have some code here: https://github.com/Yehuda-blip/kleenexp/tree/feature/add_inline_flags
(yehudaKLEIN and yehuda-blip are both myself, I'll fix that problem at some point)

e. Almost forgot: There seems to be a special case for inline flags in the beginning of a regular expression
that's considered as some global parameter to the compiling function (in the example above, that would
be the '(?i)' in '(?i)hello...'. If I remember correctly a syntax node in the current implementation holds a position
value that can be used in the special case parenthesis wrapping of these global flags, but since these flags
are apparently meant to replace actual parameters to the re.compile(), and since the original re has a
different syntax for these, I thought maybe a special syntax should be considered here? I figured it might
prove useful when building the auto-convert tool for python (although very likely it will not be, just figured
I'd ask).

@SonOfLilit
Copy link
Owner Author

Hi Yehuda, good to hear you're trying your hand at this. Welcome to kleenexp!

a. Seems like someone accidentally pressed Enter while a symbol was selected. Fixed.

b. Fork, push to fork, open a PR against my repo. Specifically, I don't think we should raise for this case, just use the innermost setting, but I could be convinced otherwise.

c. I'm going to guess you're testing the rust compiler. Try KLEENEXP_RUST=0 pytest, do you still get an error?

d. Look at test_api.py for how to write the simplest kind of test, e.g. test_ip() has both test of ke.re() and of ke.match().

e. As far as I understand, we never need this special case, right? Nothing wrong will happen if we only implement the local version and not the global version, and nothing will be felt as missing by our users?

@Yehuda-blip
Copy link

Yehuda-blip commented Apr 4, 2023

b. Innermost setting it is.

c. I know for a fact that I'm running python tests because my tests keep breaking :) the error is still there.
image

e. From what I can see there is no way to use an operator outside of a "[]", which means that it's impossible to set a flag and for it to "feel" global - however I don't think this is reason enough to add special Behavior.

f. Just to make sure I'm not missing something, It IS impossible to use an operator outside box braces or without anything to operate on, right?
good: [ascii_only "operator input"]
bad: [ascii_only], [ascii_only "operator input" ignore_case "operator input 2"] , [ascii_only ignore_case operator input 2]
not-an-operator: ascii_only "operator input"

g. There are some changes on flag compatibilty between python 3.6 and 3.7. I'm currently writing under the assumption that this feature is compatible with python 3.10 (meaning it is compatible with 3.7 and does not support the weird noflag thing).

h. Again - almost forgot: In the last 4 hours I learned that a string being a bytes-like affects the validity of locale_dependent and unicode flags. I also saw that there is some implementation regarding passing flags to the compile function, but I didn't see any validations on this matter. I'm not about to deal with it in this PR, but I figured it's worth noting.

@SonOfLilit
Copy link
Owner Author

c. I think this line is your fix to the typo you found in (a)? Use my fix instead (return False) and it will work.

e. Agreed. This is a design mistake in Regex that we're fixing.

f. I don't remember if bad[0] is always a syntax error or sometimes just meaningless, but otherwise you're correct.

g. I just went over all the "Changed in version 3.x" notes in the docs, none of them seems to affect us in the sense that users will get the expected behavior from their version (e.g. flags are correctly ints in old versions and RegexFlag in new versions, NOFLAG behaves identically in all 3.x since it was always just an alias for 0 and I define it this way if it doesn't exist in the re module)

h. If a user tries to use LOCALE with a non-byteslike, or pass in LOCALE | ASCII, they will get an error from re.compile(), as intended. We aim for full bug-for-bug compatibility by wrapping the existing API.

@SonOfLilit
Copy link
Owner Author

Also, you're being very thorough, that's a good habit to have!

@Yehuda-blip
Copy link

g + h: My PR includes some validations on the flags locale_dependent, ascii_only and unicode. From what I understand, you would want these removed, and the error messages should come from regex and not kleenexp?

@SonOfLilit
Copy link
Owner Author

I'll have to have a look at the code before I form an opinion, but my gut instinct is to rely on the re validations and maybe just wrap their exception with a CompileError.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants