Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Named back-references validated too early #17

Open
nbtrap opened this issue Feb 21, 2014 · 12 comments
Open

Named back-references validated too early #17

nbtrap opened this issue Feb 21, 2014 · 12 comments

Comments

@nbtrap
Copy link
Contributor

nbtrap commented Feb 21, 2014

CL-PPCRE validates register names too early in the regex compilation phase. The converter should wait until it has seen all register names before asserting that named back-references refer to existing registers.

(setf *allow-named-registers* t)
==> T

(scan "\\k<regOne>(?<regOne>.)" "a")
; Evaluation aborted on #<PPCRE-SYNTAX-ERROR "Illegal back-reference: ~S." {1006D50F93}>.

(scan "\\1(.)" "a")
==> NIL

Compare this to Perl:

say 'a' =~ /\k<regOne>(?<regOne>.)/ ? 'true' : 'false';
==> false

say 'a' =~ /\k<regTwo>(?<regOne>.)/ ? 'true' : 'false';
==> Reference to nonexistent named group in regex...

I plan on fixing this in my subpattern references branch, which is nearing completion.

@hanshuebner
Copy link
Member

Why does perl's behavior make sense?

2014-02-21 3:26 GMT+01:00 nbtrap [email protected]:

CL-PPCRE validates register names too early in the regex compilation
phase. The converter should wait until it has seen all register names
before asserting that named back-references refer to existing registers.

(setf allow-named-registers t)==> T
(scan "\k(?.)" "a"); Evaluation aborted on #<PPCRE-SYNTAX-ERROR "Illegal back-reference: ~S." {1006D50F93}>.
(scan "\1(.)" "a")==> NIL

Compare this to Perl:

say 'a' =~ /\k(?.)/ ? 'true' : 'false';==> false
say 'a' =~ /\k(?.)/ ? 'true' : 'false';==> Reference to nonexistent named group in regex...

I plan on fixing this in my subpattern references branch, which is nearing
completion.

Reply to this email directly or view it on GitHubhttps://github.com//issues/17
.

@nbtrap
Copy link
Contributor Author

nbtrap commented Feb 22, 2014

I can think of two reasons:

  1. Perl allows you to incrementally build regexes. For example, you can do this:

    $rx1 = qr/\k<foobar>/
    $rx2 = qr/(?<foobar>.)$rx1/

    If backreference names were validated as early as they are in CL-PPCRE, the first qr would be impossible. However, CL-PPCRE doesn't have anything like this, so it's not relevant apart from maintaining Perl compatibility

  2. Subpattern references make forward and self-referential back references meaningful. For example, in Perl, you can do things like this:

    print 'abcb' =~ /^(.\k<regTwo>?)(?<regTwo>.)(?1)$/ ? "true\n" : "false\n"
    => true

    In that example, the \k<regTwo> fails on the first encounter, but the matching phase continues thanks to the ?. Then, inside the (?1) subpattern call, the \k<regTwo> refers to what the (?<regTwo>.) matched up on the call stack, namely, "b", and it succeeds. CL-PPCRE currently doesn't support subpattern references, but I'm nearly finished implementing them and will be submitting a merge request in the near future.

By the way, Perl itself doesn't always handle things like the second example above correctly, but the developers have confirmed that it's a bug.

@nbtrap
Copy link
Contributor Author

nbtrap commented Feb 22, 2014

Crap. I didn't know CL-PPCRE tries to match named backreferences against every register with a given name. This complicates things. Here are the options as I see them:

  1. Extend the current behavior to include matching registers that come after the backreference.
  2. Break backwards compatibility, and behave like Perl.
  3. Leave things the way they are.

My favorite is 2. Do you think anyone relies on this behavior? We should try to conform to Perl as much as possible. (Come to think of it, what's the deal with *ALLOW-NAMED-REGISTERS*? Seems unnecessary. I'd be happy to remove it too.)

3 is a bad idea if we're going to merge subpattern references, so I think 1 is the other viable option.

@hanshuebner
Copy link
Member

I have no opinion regarding this. Maybe Edi has one.

-Hans

2014-02-22 15:12 GMT+01:00 nbtrap [email protected]:

Crap. I didn't know CL-PPCRE tries to match named backreferences against
every register with a given name. This complicates things. Here are the
options as I see them:

Extend the current behavior to include matching registers that come
after the backreference.
2.

Break backwards compatibility, and behave like Perl.
3.

Leave things the way they are.

My favorite is 2. Do you think anyone relies on this behavior? We should
try to conform to Perl as much as possible. (Come to think of it, what's
the deal with ALLOW-NAMED-REGISTERS? Seems unnecessary. I'd be happy to
remove it too.)

3 is a bad idea if we're going to merge subpattern references, so I think
1 is the other viable option.

Reply to this email directly or view it on GitHubhttps://github.com//issues/17#issuecomment-35803448
.

@nbtrap
Copy link
Contributor Author

nbtrap commented Feb 24, 2014

Are you in a position to ask him?

@hanshuebner
Copy link
Member

I Cc'd him in my previous comment.
Am 24.02.2014 02:22 schrieb "nbtrap" [email protected]:

Are you in a position to ask him?

Reply to this email directly or view it on GitHubhttps://github.com//issues/17#issuecomment-35851033
.

@nhabedi
Copy link
Member

nhabedi commented Feb 26, 2014

Named registers came in late and were modeled after AllegroCL's named registers, not after Perl's.

I don't think ALLOW-NAMED-REGISTERS is unnecessary. Changing its value obviously changes the semantics of some regular expressions.

While CL-PPCRE is based on Perl's syntax, I wouldn't get out of the way to copy each of its features. There are several regex variants out there which are close to but not similar to what Perl does and people seem to be able to cope with it.

I generally think that being backwards compatible is a good thing. Just because you haven't used a feature, that doesn't mean nobody else has. In my private code repository I count almost 100 commercial Lisp projects I've done in the last decade (and that doesn't include code with its own repository). Much of this code is still in use and sometimes there are requirements to extend it or to fix something. I'd rather have code that, say, "uses CL-PPCRE" and not "uses CL-PPCRE, but can only use it up to version x.y.z".

Having said that, the most important feature for me in case of any changes would be clear documentation.

@nbtrap
Copy link
Contributor Author

nbtrap commented Feb 26, 2014

To be clear, I wasn't saying named registers are unnecessary. What I am saying is that it's unnecessary to have to bind a variable to use them. Deprecating the *ALLOW-NAMED-REGISTERS* variable would not break backwards compatibility. It would merely make using named registers easier.

Rather, the feature that I was talking about changing was this idea that named back-references can refer to multiple registers, whereas numbered registers have only one referent. The whole point of named registers, in my mind, is to help more clearly disambiguate registers when the regex contains so many as might otherwise confuse the person reading the code. That being the case, I don't think named registers should have different semantics from regular registers.

I agree with what you say about not having to be like Perl in every respect. In implementing subpattern references, I discovered that Perl does not have clearly defined semantics for that feature. I'm trying to convince them to adopt the definition I've implemented (they seem rather open to it), but if they reject it, I'll still probably stick with what I already have.

@nhabedi
Copy link
Member

nhabedi commented Feb 26, 2014

Deprecating ALLOW-NAMED-REGISTERS would break backwards compatibility in two ways:

  1. Every piece of code that uses CL-PPCRE and sets ALLOWED-NAMED-REGISTERS to a true value has to be changed because the variable doesn't exist anymore.
  2. As I already said, the semantics of some regular expressions will be changed. Try to parse "\k" (from the documentation) for example with and without ALLOWED-NAMED-REGISTERS being NIL.

As to your second point: With the current behavior, you can for example use a regex like "(?[1-3])..(?[4-6])\" to match "12341" as well as "12344". I don't think I've ever used that, but my experience from 15 years of open source development tells me that just because you and I can't come up with a use case for this feature doesn't mean nobody else ever came up with one... :)

@nbtrap
Copy link
Contributor Author

nbtrap commented Feb 26, 2014

  1. I don't want to remove *ALLOW-NAMED-REGISTERS*, I'm just suggesting allowing named registers regardless of its value.
  2. Yes, deprecating *ALLOW-NAMED-REGISTERS* might break anomalous code that relies on PARSE-STRING throwing an exception because *ALLOW-NAMED-REGISTERS* is not bound to T. I think that's a stretch though. (Technically, you could argue that such code is relying on a non-api anyway. There's nothing in the documentation that describes how *ALLOW-NAMED-REGISTERS* is handled. It only says that binding it to T will allow you to use named regsiters.)
  3. I don't understand the point you try to make at then end. Check your syntax. I think you meant to write something else.

@nbtrap
Copy link
Contributor Author

nbtrap commented Feb 26, 2014

That said, I don't want to get carried away with the *ALLOW-NAMED-REGISTERS* bit. This issue is about how CL-PPCRE handles named backreferences.

@nhabedi
Copy link
Member

nhabedi commented Feb 26, 2014

I have updated my previous comment. I forgot a couple of backslashes for GitHub's markdown.

Otherwise, I've said what I wanted to say on this issue. The discussion is getting a bit too ridiculous for my taste.

nbtrap added a commit to nbtrap/cl-ppcre that referenced this issue Feb 27, 2014
The issue referred to there is one of the the subjects of edicl#17.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants