add new operator \J (skip search) #278 #299

Open, wants to merge 1 commit into master

kkos (Owner) commented Jun 7, 2024

For "aaa..." (9000 chars)

/a+b/
real 0m0.174s
user 0m0.168s
sys 0m0.004s

/a+\Jb/
real 0m0.007s
user 0m0.002s
sys 0m0.004s

RedCMD commented Jun 7, 2024

Thank you, this looks very promising.

Will \J cancel the search instantly, or only after getting to the end of the regex?
E.g. will a+\Jb match aaaaaaab?
Or will it need to be something like a+(b|\J)?

kkos (Owner, Author) commented Jun 8, 2024

/a+\Jb/ matches "aaaaaaab".
\J has no effect on the current match attempt and always succeeds.
It only takes effect when the current match attempt fails, by changing the starting position of the next attempt.
The next attempt starts at the most advanced string position at which \J was reached. (However, if other optimizations yield a more advanced position, \J has no effect.)
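
To make the timing difference above concrete, here is a minimal Python sketch of the search loop as described in this comment. It is a toy model hard-coded for /a+b/ and /a+\Jb/, not Oniguruma's implementation; the function name `search_a_plus_b` and the `steps` counter are assumptions made for illustration.

```python
# Toy model (an assumption for illustration, not Oniguruma code) of a search
# for /a+b/ (use_skip=False) or /a+\Jb/ (use_skip=True).
def search_a_plus_b(text, use_skip):
    steps = 0                     # number of character tests performed
    start = 0
    n = len(text)
    while start < n:
        skip_to = start           # most advanced position at which \J was reached
        # a+ : consume as many 'a' characters as possible (greedy).
        end = start
        while end < n and text[end] == "a":
            end += 1
            steps += 1
        # Backtrack a+ from the longest to the shortest repetition.
        for pos in range(end, start, -1):
            steps += 1
            if use_skip:
                # \J always succeeds and only records the position.
                skip_to = max(skip_to, pos)
            if pos < n and text[pos] == "b":
                return (start, pos + 1), steps
        # The attempt failed: choose where the next attempt starts.
        start = max(start + 1, skip_to) if use_skip else start + 1
    return None, steps


if __name__ == "__main__":
    subject = "a" * 9000
    print(search_a_plus_b(subject, use_skip=False))
    print(search_a_plus_b(subject, use_skip=True))
    print(search_a_plus_b("aaaaaaab", use_skip=True))
```

In this model the attempt itself still backtracks fully, which matches the statement that \J does not affect the current match attempt. Only the restart position changes, so the step count for a failing search grows linearly instead of quadratically.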

tonco-miyazawa commented

@kkos (Owner)

Thank you for adding this great feature.

I would like to use a "retraction only" version of (*MISMATCH).

/doc/CALLOUTS.BUILTIN : (*MISMATCH)

* MISMATCH (progress)

Suppose it is written (*MISMATCH{<}).

Then either of the following has the same effect as (*SKIP) in Perl5:

(*MISMATCH{<})\J

or

\J(*MISMATCH{<})


In the following test, the backtracking of "a+" is never executed, and the search skips to the position of "b".

test(enc, mp, "a+(*MISMATCH{<})\\J$/", "aaaaaaab");


Alternatively, \J itself could incorporate the functionality of (*MISMATCH{<}).
That would make \J the same as Perl5's (*SKIP).


Below are my past notes on the behavior of (*SKIP) in Perl5. (Japanese only)
https://github.com/tonco-miyazawa/regex_etc/blob/master/MEMO_perl5/Backtrack_ctrl/SKIP.txt


kkos (Owner, Author) commented Jun 10, 2024

I read it, but I did not clearly understand the SKIP specification.
Does the fact that it affects another alternative mean that SKIP can influence other parts of a single match attempt?
My implementation is simple: it does not affect matching, only the next search position.
There is no need to use (*MISMATCH{<}), because \J only updates the position when it advances.

I noticed that there is no need to use the operator \J.
I had forgotten about the callout.
I am thinking of canceling this PR and using (*SKIP).

tonco-miyazawa commented

Sorry, I didn't explain it well.

When backtracking reaches (*SKIP) in Perl5, it immediately ends the match process at the current position.

In the following test,

test(enc, mp, "ab(*SKIP)c|ab", "abd");

backtracking reaches (*SKIP) because the "c" of "abc" is not found.

In this case, (*SKIP) immediately ends the match process at the current position,

so the "ab" in the latter half of the regular expression is not tried at the same position.

The search then skips to the next match start position (after "ab", before "d").

As a result, this test returns "search fail".

This behavior of immediately ending the match process at the current position is the same as (*MISMATCH).

The difference is direction: (*MISMATCH) acts only when moving forward, whereas (*SKIP) in Perl5 ends the match process at the current position only when moving backward.

Below is sample code in Perl5.
You can turn (*SKIP) on or off by moving the comment marker "#".

#!/usr/bin/perl

use Encode;
use utf8;

my $str = 'abd';

if ($str =~ /ab(*SKIP)c|ab/) {       # (*SKIP)  exists
# if ($str =~ /abc|ab/) {            # no (*SKIP) exists

  print "Match! '$&'\n";
}
else {
  print "Not match!\n";
}
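
For readers without Perl at hand, the behavior described above can be sketched in Python. This is a deliberately tiny model that handles only alternatives made of literal characters plus a `SKIP` marker; it is an assumption-laden illustration of the description in this comment, not Perl's or Oniguruma's algorithm, and the names `SKIP` and `search` are hypothetical.

```python
# Hypothetical marker standing in for (*SKIP) inside a token list.
SKIP = object()


def search(alternatives, text):
    # Each alternative is a list of single-character literals and/or SKIP.
    start = 0
    while start <= len(text):
        next_start = start + 1
        for alt in alternatives:
            pos = start
            skip_pos = None       # position at which SKIP was passed, if any
            matched = True
            for tok in alt:
                if tok is SKIP:
                    skip_pos = pos
                elif pos < len(text) and text[pos] == tok:
                    pos += 1
                else:
                    matched = False
                    break
            if matched:
                return text[start:pos]
            if skip_pos is not None:
                # Backtracking reached SKIP: abandon this start position
                # (later alternatives are not tried) and resume the search
                # at the position where SKIP was passed.
                next_start = max(next_start, skip_pos)
                break
        start = next_start
    return None


if __name__ == "__main__":
    with_skip = [["a", "b", SKIP, "c"], ["a", "b"]]    # models ab(*SKIP)c|ab
    without_skip = [["a", "b", "c"], ["a", "b"]]       # models abc|ab
    print(search(with_skip, "abd"))
    print(search(without_skip, "abd"))
```

With the marker present, the second alternative is never tried at position 0 and the search resumes before "d", so nothing matches. Without it, the second alternative matches "ab".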


kkos (Owner, Author) commented Jun 12, 2024

I don't feel that the ability to exit when backing out is essential to (*SKIP).
(*MISMATCH) can only be used when moving forward, so something for retreating might be added later.

tonco-miyazawa commented

@kkos (Owner)

> I don't feel that the ability to exit when backing out is essential to (*SKIP).

Thank you for your consideration. I have no objection to your decision.
I think that the behavior of "\J" is simpler, easier to understand, and more flexible to use.

> (*MISMATCH) can only be used when moving forward, so something for retreating might be added later.

It seems that this can be achieved by simply rewriting (*MISMATCH{<}) internally to (?:|(*MISMATCH)).

(*MISMATCH) and (*MISMATCH{<}) can be converted to each other as follows:

(*MISMATCH{<}) == (?:|(*MISMATCH))
(*MISMATCH) == (*MISMATCH{<})(*FAIL)
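
The two identities above can be checked on a small model. The following Python sketch uses continuation-passing matchers in which an exception plays the role of "end the match process at the current position". All names here (`Mismatch`, `mismatch_back`, `match_at`, and so on) are hypothetical, and `mismatch_back` models the proposed (*MISMATCH{<}), which this thread only assumes; none of this is Oniguruma code.

```python
class Mismatch(Exception):
    """Abandon the match attempt at the current start position."""


def lit(s):
    # Literal string: continue with k at the position after s, or fail.
    def m(text, pos, k):
        return k(pos + len(s)) if text.startswith(s, pos) else None
    return m


def empty(text, pos, k):          # empty alternative: always succeeds
    return k(pos)


def fail(text, pos, k):           # models (*FAIL): plain failure, backtrack
    return None


def mismatch(text, pos, k):       # models (*MISMATCH): acts when moving forward
    raise Mismatch


def mismatch_back(text, pos, k):  # models the proposed (*MISMATCH{<})
    end = k(pos)                  # moving forward: no effect
    if end is None:               # backtracked into: end the attempt
        raise Mismatch
    return end


def seq(*parts):
    def m(text, pos, k):
        def run(i, p):
            if i == len(parts):
                return k(p)
            return parts[i](text, p, lambda q: run(i + 1, q))
        return run(0, pos)
    return m


def alt(*choices):
    def m(text, pos, k):
        for choice in choices:
            end = choice(text, pos, k)
            if end is not None:
                return end
        return None
    return m


def match_at(pattern, text, pos=0):
    # Returns the end position of a match starting at pos, or None.
    try:
        return pattern(text, pos, lambda p: p)
    except Mismatch:
        return None


if __name__ == "__main__":
    for x in (mismatch_back, alt(empty, mismatch)):      # first identity
        p = alt(seq(lit("ab"), x, lit("c")), lit("ab"))
        print(match_at(p, "abd"), match_at(p, "abc"))
    for y in (mismatch, seq(mismatch_back, fail)):       # second identity
        p = alt(seq(lit("ab"), y), lit("ab"))
        print(match_at(p, "abd"))
```

In this model both forms of each identity give the same results, while a plain `fail` in the same place would still let the second alternative "ab" match.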


The same applies to (*ERROR) and (*ERROR{<}).
However, because (*ERROR) takes an optional argument, the syntax for writing {<} needs some thought.

(*ERROR)

* ERROR (progress)
(*ERROR{n::LONG})
Terminates Search/Match process.
Return value is the argument 'n'. (The value must be less than -1)
'n' is an optional argument. (default value is ONIG_ABORT)

The idea below from @RedCMD is an alternative to (*ERROR{<}).
#278 (comment)
(The above comment was very helpful. Thanks @RedCMD.)


As an aside, (*MISMATCH{<}) behaves the same as Perl5's (*PRUNE).
In other words, because it can be written as (?:|(*MISMATCH)), Oniguruma in effect already has Perl5's (*PRUNE).

perlre.pod : (*PRUNE)
https://metacpan.org/dist/perl/view/pod/perlre.pod#Verbs



kkos added a commit that referenced this pull request Jun 15, 2024