Posix classes are not strict enough [:upper:] #253

Closed
RedCMD opened this issue Apr 19, 2022 · 9 comments

Comments

RedCMD commented Apr 19, 2022

Multiple POSIX classes can be opened at once with many left brackets [: and then all be closed by the same right bracket :].

The regex below runs and matches all upper-case letters as well as a, b, c and :.
image
Note how there are many opening brackets, but only 2 closing.

And if a POSIX value is inserted in the middle, the engine breaks:
image
The engine also breaks if any one of the : characters is removed.

Initial report from: microsoft/vscode-textmate#165

Owner

kkos commented Apr 24, 2022

Aren't you going to do a little research before you make three posts?

  1. The ability to nest itself does not exist in POSIX Brackets.
  2. However, character classes can be nested.
  3. With a few exceptions (POSIX Bracket is one of them), any characters can be written in any order in the character classes.

The only thing that could be debatable would be which case would be an error when an incomplete POSIX Bracket appears.
But that would not be of interest to most people.
It is possible to change it so that it does not error at all for incomplete POSIX Brackets (i.e., it is considered just writing characters in the character class), but most people would not even notice if I did so.

Author

RedCMD commented Apr 25, 2022

Sorry that I wasn't clear.
What I'm trying to get at is that certain arrangements of certain characters in and around POSIX Brackets cause them to behave in unexpected ways.
I know you can nest character classes (and have multiple next to each other): [ [] [ [] [] ] ]
and that POSIX Brackets do not nest (and must be inside a character class): [ [:upper:] ]

For instance, I would expect this regex to be an error: [[:::]
The first character class has no closing bracket ], the POSIX class has an invalid value :, and square brackets [] need a backslash to be literal: \[\]
But instead it unexpectedly matches [ and :
Regex101.com: image

and this rather confusing one
[[:\]:] matches [, ] and :
But [[:\[:] fails
and [[:\]] also fails

But as you said, fixing it would affect almost no one.

The only reason I'm raising it is so I know whether I can rely on this behaviour or not.


tonco-miyazawa commented Apr 26, 2022

oniguruma/doc/RE

Lines 214 to 215 in 08d3611

* If you want to use '[', '-', or ']' as a normal character
in character class, you should escape them with '\'.

If you want to make it clear that it is not a Posix class, you can write [\: instead.

The following behavior may be a trap for anyone who does not intend to use the Posix class:
oniguruma/sample/callout.c
test(enc, mp, "[[:]]", ":]");
match at 0
0: (0-1)

test(enc, mp, "[[::]]", ":]");
COMPILE ERROR: -121: invalid POSIX bracket type

test(enc, mp, "[[:::]]", ":]");
match at 0
0: (0-2)

Escaping the colon avoids the above behavior.

test(enc, mp, "[[\\:]]", ":]");
test(enc, mp, "[[\\::]]", ":]");
test(enc, mp, "[[\\:::]]", ":]");

These all give the same result:

match at 0
0: (0-1)
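
The escaping advice above can be sketched as a small helper. This is a minimal Python sketch, not part of oniguruma: `escape_for_char_class` and `literal_char_class` are hypothetical names, and escaping `\` and `^` in addition to the characters named above is my own assumption.

```python
# Characters this sketch treats as special inside a character class.
# '[', '-' and ']' come from the doc/RE excerpt, ':' follows the "[\:" tip;
# '\\' and '^' are added here as an assumption, not taken from the thread.
_SPECIAL = set("[]-:\\^")


def escape_for_char_class(chars: str) -> str:
    """Backslash-escape every special character so it is read literally."""
    return "".join("\\" + c if c in _SPECIAL else c for c in chars)


def literal_char_class(chars: str) -> str:
    """Build a character class that matches exactly the given characters."""
    return "[" + escape_for_char_class(chars) + "]"
```

For example, `literal_char_class("[:")` produces `[\[\:]`, which leaves no unescaped `[:` that could be read as the start of a Posix bracket.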


Remarks: Perl 5
https://metacpan.org/dist/perl/view/pod/perlrecharclass.pod

A [ is not special inside a character class, unless it's the start of a POSIX character class (see "POSIX Character Classes" below). It normally does not need escaping.

Owner

kkos commented Apr 28, 2022

There was a problem with the processing after "[:", so a corrected version was added as the issue_253 branch.
The next release will not include this change, but I will include it in the release after that.
This change breaks compatibility, but only if a character class contains "[:" and ":]" and they do not constitute a POSIX Bracket, which is not a problem in practice, I think.


tonco-miyazawa commented May 13, 2022

translation by google translate (August 24, 2023)

Background on this issue

  1. Unlike Perl, oniguruma allows nesting of normal character classes other than Posix brackets.

  2. Posix bracket format can be interpreted as a normal character class.
    As a result, the boundaries in interpretation tend to be ambiguous, and they tend to be a source of confusion.

Proposal 1

I think that the misspelling of the Posix class name should be treated as an error.
(e.g. [[:alnun:]] should be an error)

If users do not notice their own spelling mistake, the name is interpreted character by character,
and it makes me a little uneasy in practice that it then produces unintended matches without even a warning.

By the way, spelling mistakes in \p{property-name} are errors.

If I use sample/callout.c and try the below code with misspelling inside \p{ } ..
test(enc, mp, "\\p{Alnun}", "a");

I get an error like:
COMPILE ERROR: -223: invalid character property name {Alnun}

Because a misspelling within \p{ } is an error, there is a risk that users will assume
a misspelling within a Posix bracket is also an error. Therefore, I think it is desirable to make both of them errors consistently.
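
Proposal 1 could look roughly like the following Python sketch. It is only an illustration, not oniguruma's implementation: `check_posix_name` and `InvalidPosixBracket` are hypothetical names, and the name list is the set of POSIX brackets I believe doc/RE documents, so it should be checked against the actual source.

```python
# Assumed list of POSIX bracket names (verify against oniguruma's doc/RE).
POSIX_NAMES = frozenset({
    "alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph",
    "lower", "print", "punct", "space", "upper", "xdigit", "word",
})


class InvalidPosixBracket(ValueError):
    """Raised for an unknown name, analogous to the \\p{...} property error."""


def check_posix_name(name: str) -> str:
    """Return the name if it is known, otherwise fail loudly."""
    if name not in POSIX_NAMES:
        raise InvalidPosixBracket(f"invalid POSIX bracket type: {name!r}")
    return name
```

With this check, `check_posix_name("alnun")` raises instead of silently falling back to a character-by-character interpretation.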

RedCMD, who opened issue 253, did not clearly write "what the problem is and how it should be fixed",
so I think his intention did not come across well to Kosako (Owner).

Rather than seeing "an incomplete Posix bracket is an error" as the problem,
I suspect RedCMD was concerned that "what should be an error is not an error".

(I'm reading through Google Translate, so I'm sorry if I'm wrong)

Proposal 2

Perl plans to introduce [= =] and [. .] in the future.
If these are introduced to oniguruma in the future, oniguruma will need further specification changes.

If [= =] and [. .] are kept as reserved words while fixing issue 253,
it might mean one less hassle in the future.

Perl5: [= =] and [. .]
https://metacpan.org/pod/perlrecharclass#%5B=-=%5D-and-%5B.-.%5D

Perl recognizes the POSIX character classes [=class=] and [.class.],
but does not (yet?) support them. Any attempt to use either construct raises an exception.

However, I think that it will be a long time before these are implemented in Perl.
Even if you keep these as reserved words in oniguruma, they may never be used and the reservation may be wasted.
So you don't need to do this if you don't feel the need.

Investigation result of boundary of Posix bracket using Perl 5.35.11

I don't know whether this will be useful as a reference for oniguruma development, but I am posting it just in case.

The boundary between Posix bracket and not

May 13, 2022 (English translation : August 26, 2023 )

I used the Perl 5.35.11 Development release for this research.
If I got the "POSIX class [: :] unknown in regex;" error,
I decided that it was "recognized as a Posix bracket".

Recognized as a normal character class

[[:aa:]]               #  Name part is 2 characters or less
[[:aabbccddeeffggh:]]  #  Name part has more than 15 characters
[[:UPPER:]]            #  Name part is upper case
(?i)[[:UPPER:]]
[[!:upper:]]
[[:upper:!]]
[[:up!!!per:]]         #  3 or more symbols
[[:up!p!e!r:]]
[[:___upper:]]         #  3 or more underscores

[[:aa\\\\a:]]          #  "\" for escaping also counts as one character
[[:aa\\\a:]]
[[:aa\\!a:]]
[[:aa]]a:]]            #  There is a “]” and the symbol is more than 2 characters
[[:aa\]a:]]

[[:aa\]]a:]]
[[:aa\]\]a:]]
[[:aa[[a:]]            #  There is a “[” and the symbol is more than 2 characters
[[:aa\[[a:]]
[[:aa\[\[a:]]
[[:upper]]             #  If you forget the trailing ":"

[[:aaあいうえa:]]       #  there are 4 multibyte characters   (UTF-8)
[[:あいうえお:]]
[[::]]

[[.abc!!!de.]]
[[.abc!!de.]]
[[.abc!de.]]
[[.aa\_aa.]]
[[.aa[aa.]]
[[.aa\[aa.]]
[[.aa]aa.]]
[[.aa\]aa.]]

Recognized as Posix bracket

[[:aab:]]              #  3 to 14 characters for the name part
[[:upper:]]            #  Correct Posix class name
[[:uper:]]             #  Misspelled Posix class name
[[:up!per:]]           #  1 symbol
[[:up!!per:]]          #  2 symbol
[[:!upper:]]           #  symbol at the beginning of the name
[[:8upper:]]
[[:__upper:]]          #  2 underscores or less
[[:upper!:]]           #  symbol at the end of the name
[[:aa\\a:]]
[[:aa]a:]]
[[:aa[a:]]
[[:aa\[a:]]

[:aaあa:]
[:aaあいa:]
[:aaあいうa:]
[:あいう:]
[:あいうえ:]

[[.abcde.]]
[[._______________________________.]]       #  lots of underscores
[[.8888888888888888888888888888888.]]       #  lots of numbers
[[.BBBBBBBBBBBBBBBBBBBBBBBBBB.]]            #  lots of capital letters
(?i)[[.BBBBBBBBBBBBBBBBBBBBBBBBBB.]]
[[=abcde=]]
[[..]]
[[==]]
[[.aabbccddeeffgghhiijjkkllmmnnooppqqrrssttuuvvwwxxyyzz.]]
[[=aabbccddeeffgghhiijjkkllmmnnooppqqrrssttuuvvwwxxyyzz=]]

regular expressions on Issue 253

Unlike oniguruma, Perl does not allow normal character class nesting.
So there is no point in examining how it works in Perl.
The following are not worth reading.

Recognized as a normal character class

[[:::]                    ==  [\[:::]
[[:::]]                   ==  [\[:::]\]
[[:\]:]                   ==  [\[:\]:]
[[:\[:]                   ==  [\[:\[:]
[[:upp\]er:]]             ==  [\[:upp\]er:]\]
[[:a[:b[:c[:upper:]]      ==  [\[:a\[:b\[:c[:upper:]]       * Excluding [:upper:]
[[:a[:upper[:c[:upper:]]  ==  [\[:a\[:upper\[:c[:upper:]]   * Excluding [:upper:]

Recognized as Posix bracket

[[:upp]er:]] # POSIX class [:upp]er:] unknown in regex;

  • "[[:::]" was an error on Regex101.com (PCRE2?), the testing site
    shown by RedCMD. However, Perl 5.35.11 does not give an error.

end.

  • I am not asking for oniguruma's behavior to be made exactly the same as Perl's.
    Perl's specification changes frequently, so that would be pointless.

Sorry for #234, I apologize again.

[DOC] oniguruma_253.doc
https://github.com/tonco-miyazawa/regex_etc/raw/master/MEMO_onig/Issues/oniguruma_253.doc


Owner

kkos commented May 18, 2022

It would be nice to make spelling errors an error.
The question, however, is what should count as a misspelled POSIX bracket.
(1) All cases where a "[:" and ":]" pair appears.
ex. "[:a%bc2:]"
(2) Cases where there are one or more characters between a "[:" and ":]" pair, and all of them are alphabetic.
ex. "[:abdc:]"

For my part, I would like to treat only (2) as a misspelling, and consider (1) not to be a POSIX bracket.
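
Option (2) can be sketched as a three-way classification of a "[: ... :]" token. This is a minimal Python sketch under stated assumptions, not the code on the issue_253 branch: `classify` is a hypothetical name, "alphabetic" is taken to mean ASCII letters, and the name list is the one I believe doc/RE documents.

```python
import re

# Assumed list of POSIX bracket names (verify against oniguruma's doc/RE).
POSIX_NAMES = frozenset({
    "alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph",
    "lower", "print", "punct", "space", "upper", "xdigit", "word",
})

# Rule (2): one or more characters between "[:" and ":]", all alphabetic.
_ALPHA_BODY = re.compile(r"\[:([A-Za-z]+):\]")


def classify(token: str) -> str:
    """Return 'posix', 'misspelled' (an error), or 'literal' (plain characters)."""
    m = _ALPHA_BODY.fullmatch(token)
    if m is None:
        return "literal"  # case (1) only: not treated as a POSIX bracket
    return "posix" if m.group(1) in POSIX_NAMES else "misspelled"
```

Under these assumptions `classify("[:abdc:]")` is `'misspelled'`, while `classify("[:a%bc2:]")` is `'literal'`.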

It would be good to reserve [= =] and [. .].
However, I would like this too to apply only when all the characters in between are alphabetic. Is that OK?
And I have no plans to implement the [= =] and [. .] features.
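
As a rough illustration of that condition, the Python sketch below recognizes only [=name=] and [.name.] forms whose body is entirely alphabetic. `is_reserved_bracket` is a hypothetical name, "alphabetic" is assumed to mean ASCII letters, and nothing here reflects actual oniguruma code.

```python
import re

# [=name=] or [.name.]: the same delimiter on both sides, and a body of
# one or more ASCII letters (assumption about what "alphabetic" means).
_RESERVED = re.compile(r"\[([=.])[A-Za-z]+\1\]")


def is_reserved_bracket(token: str) -> bool:
    """True if the token would fall under the proposed reservation."""
    return _RESERVED.fullmatch(token) is not None
```

Tokens such as `[==]` or `[=a1=]` are not reserved under this reading and would stay ordinary character-class content.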


tonco-miyazawa commented May 19, 2022

translation by google translate (August 23, 2023)

(1) or (2)

(2) is OK.
I think either (1) or (2) is fine.
As long as simple spelling mistakes can be prevented, that is good enough for me.

[= =] and [. .]

If you don't plan to implement these, I don't think a reservation is needed, but
if you think you should make one, please do so.
I am fine with either choice.

all the letters in between are alphabetic

This is OK. I will leave it to you to decide whether to make a reservation or not.

----------------------- Added on June 2, 2022 -----------------------------
I checked the behavior after the [: :] fix.
The behavior after the fix is ideal. Thank you very much.



tonco-miyazawa commented Jun 15, 2022

translation by google translate (August 23, 2023)

The negated form gives the same result as the positive form.

test(enc, mp, "[[:upper:]]", "a"); // search fail
test(enc, mp, "[[:^upper:]]", "a"); // search fail

------------------ Added on June 16, 2022 ------------------

This bug has been fixed. Thank you for fixing it.
Just to be sure I also checked \p{^ }, \P{ } and \P{^ }, and they are fine, too.
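
For reference, the semantics I would expect from the negated form can be modelled as below. This is a toy ASCII-only Python sketch, not oniguruma's code; `posix_bracket_matches` and `ASCII_CLASSES` are hypothetical names, and only three classes are included.

```python
import string

# Toy ASCII-only model of a few POSIX classes (assumption: no Unicode).
ASCII_CLASSES = {
    "upper": set(string.ascii_uppercase),
    "lower": set(string.ascii_lowercase),
    "digit": set(string.digits),
}


def posix_bracket_matches(body: str, ch: str) -> bool:
    """Evaluate the body of [:name:] or [:^name:] against one character."""
    negated = body.startswith("^")
    name = body[1:] if negated else body
    in_class = ch in ASCII_CLASSES[name]
    # A leading '^' must flip the result; before the fix it did not.
    return in_class != negated
```

In this model `posix_bracket_matches("upper", "a")` is false and `posix_bracket_matches("^upper", "a")` is true, so the two patterns in the report should not both fail on "a".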


Owner

kkos commented Jun 25, 2022

Assume the current master head is accepted.
