Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix LIKE with escapes #6703

Merged

Conversation

findepi
Copy link
Member

@findepi findepi commented Nov 8, 2024

Description

Fix LIKE processing for patterns containing escapes

  • the starts_with / ends_with optimization did not correctly check for
    escapes when checking rest of the pattern for being literal or not
  • the pattern to regexp compiler incorrectly processed \ followed by a
    character other than % or _. In PostgreSQL '\x' pattern matches single
    'x'.

There are two tests

  • like_escape_many was generated using PostgreSQL with the code attached
    below for verification
  • like_escape is hand-picked test cases that are more interesting.
    Lower cardinality of hand-picked test cases allows for exercising all
    scalar/array vs scalar/array combinations.

Test case generation

Show

The below script isn't simples possible, because it was attempted to
generate more test cases by adding padding. Hence e.g.
is_like_without_dangling_escape. Since this is attached for reference,
should be attached as-is.

#!/usr/bin/env python
import psycopg2

data = r"""
\
\\
\\\
\\\\
a
\a
\\a
%
\%
\\%
%%
\%%
\\%%
_
\_
\\_
__
\__
\\__
abc
a_c
a\bc
a\_c
%abc
\%abc
a\\_c%
""".split('\n')

# Deduplicate, keeping order
data = list(dict.fromkeys(data))

conn = psycopg2.connect(host='localhost', port=5432, user='postgres', password='mysecretpassword')
conn.set_session(autocommit=True)
cursor = conn.cursor()
for r in data:
    try:
        # PostgreSQL verifies dandling escape only sometimes
        cursor.execute(f"SELECT %s LIKE %s", (r, r))
        is_like, = cursor.fetchone()
        has_dandling_escape = False
        pg_pattern = r
    except Exception as e:
        if 'LIKE pattern must not end with escape character' not in str(e):
            raise e
        has_dandling_escape = True
        pg_pattern = r + '\\'

    for l in data:
        # print()
        # print('     '.join(str(v) for v in (l, r, has_dandling_escape, postgres_pattern)))
        cursor.execute(f"SELECT %s LIKE %s", (l, pg_pattern))
        is_like, = cursor.fetchone()
        assert type(is_like) is bool

        if not is_like and has_dandling_escape:
            pattern_without_escaped_dandling_escape = pg_pattern[:-2]
            cursor.execute(f"SELECT %s LIKE %s", (l, pattern_without_escaped_dandling_escape))
            is_like_without_dangling_escape, = cursor.fetchone()
            assert type(is_like_without_dangling_escape) is bool
        else:
            is_like_without_dangling_escape = False
        assert '"' not in l
        assert '"' not in r
        print('(r"%s", r"%s", %s),' % (
            l, r,
            str(is_like).lower(),
            # str(has_dandling_escape).lower(),
            # str(is_like_without_dangling_escape).lower(),
        ))

Which issue does this PR close?

Fixes #6702

Rationale for this change

Correctness fix for LIKE.

What changes are included in this PR?

Correctness fix for LIKE.

Are there any user-facing changes?

Yes

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 8, 2024
@findepi findepi force-pushed the findepi/fix-like-with-escapes-792c56 branch 2 times, most recently from b151949 to e463db6 Compare November 8, 2024 14:17
@@ -350,7 +346,7 @@ mod tests {
#[test]
fn test_replace_like_wildcards_with_multiple_escape_chars() {
let a_eq = "\\\\%";
let expected = "^\\\\%$";
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after Rust parsing, this is \\% input. which should match any string starting with \
so the regex should be ^\\ which looks like "r\\\\" in Rust source code.
I.e. the value expected here wasn't quite correct

@findepi
Copy link
Member Author

findepi commented Nov 8, 2024

@alamb @tustvold please take a look

cc @samuelcolvin @Dandandan @ozankabak

@tustvold
Copy link
Contributor

tustvold commented Nov 8, 2024

Please can you refrain from repeatedly tagging people on your PRs, we will get to your PRs when we have time, but spamming people needlessly is just rude

@findepi
Copy link
Member Author

findepi commented Nov 8, 2024

Apologies if i tagged you (or anyone else) repeatedly on a single PR, especially if it was in a short period of time.

Fix LIKE processing for patterns containing escapes

- the starts_with / ends_with optimization did not correctly check for
  escapes when checking rest of the pattern for being literal or not
- the pattern to regexp compiler incorrectly processed \ followed by a
  character other than % or _. In PostgreSQL '\x' pattern matches single
  'x'.

There are two tests

- like_escape_many was generated using PostgreSQL with the code attached
  below for verification
- like_escape is hand-picked test cases that are more interesting.
  Lower cardinality of hand-picked test cases allows for exercising all
  scalar/array vs scalar/array combinations.

The below script isn't simples possible, because it was attempted to
generate more test cases by adding padding. Hence e.g.
is_like_without_dangling_escape.  Since this is attached for reference,
should be attached as-is.

```python
import psycopg2

data = r"""
\
\\
\\\
\\\\
a
\a
\\a
%
\%
\\%
%%
\%%
\\%%
_
\_
\\_
__
\__
\\__
abc
a_c
a\bc
a\_c
%abc
\%abc
a\\_c%
""".split('\n')

data = list(dict.fromkeys(data))

conn = psycopg2.connect(host='localhost', port=5432, user='postgres', password='mysecretpassword')
conn.set_session(autocommit=True)
cursor = conn.cursor()
for r in data:
    try:
        # PostgreSQL verifies dandling escape only sometimes
        cursor.execute(f"SELECT %s LIKE %s", (r, r))
        is_like, = cursor.fetchone()
        has_dandling_escape = False
        pg_pattern = r
    except Exception as e:
        if 'LIKE pattern must not end with escape character' not in str(e):
            raise e
        has_dandling_escape = True
        pg_pattern = r + '\\'

    for l in data:
        # print()
        # print('     '.join(str(v) for v in (l, r, has_dandling_escape, postgres_pattern)))
        cursor.execute(f"SELECT %s LIKE %s", (l, pg_pattern))
        is_like, = cursor.fetchone()
        assert type(is_like) is bool

        if not is_like and has_dandling_escape:
            pattern_without_escaped_dandling_escape = pg_pattern[:-2]
            cursor.execute(f"SELECT %s LIKE %s", (l, pattern_without_escaped_dandling_escape))
            is_like_without_dangling_escape, = cursor.fetchone()
            assert type(is_like_without_dangling_escape) is bool
        else:
            is_like_without_dangling_escape = False
        assert '"' not in l
        assert '"' not in r
        print('(r"%s", r"%s", %s),' % (
            l, r,
            str(is_like).lower(),
            # str(has_dandling_escape).lower(),
            # str(is_like_without_dangling_escape).lower(),
        ))
```
Reduce test code boilerplate and make it easier to see what are the test
cases.
@findepi findepi force-pushed the findepi/fix-like-with-escapes-792c56 branch from 053bef3 to 6a7c624 Compare November 8, 2024 17:32
@findepi
Copy link
Member Author

findepi commented Nov 8, 2024

Rebased to resolve conflict with #6662

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @findepi -- I looked at the code in this PR and it looks quite good to me. I am just running the like benchmarks now to make sure there isn't any performance regression

&& !pattern.ends_with("\\%")
&& !contains_like_pattern(&pattern[..pattern.len() - 1])
{
} else if pattern.ends_with('%') && !contains_like_pattern(&pattern[..pattern.len() - 1]) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 the contains_like_pattern check now also checks for (not) ending with % too 👍 )

Copy link
Member Author

@findepi findepi Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, and for (not) ending with \ too (which is a change in that method)

@alamb
Copy link
Contributor

alamb commented Nov 8, 2024

Benchmark results (basically no change in my opinion):

++ critcmp master findepi_fix-like-with-escapes-792c56
group                                              findepi_fix-like-with-escapes-792c56    master
-----                                              ------------------------------------    ------
ilike_utf8 scalar complex                          1.00      2.6±0.07ms        ? ?/sec     1.01      2.7±0.06ms        ? ?/sec
ilike_utf8 scalar contains                         1.03      4.3±0.05ms        ? ?/sec     1.00      4.1±0.05ms        ? ?/sec
ilike_utf8 scalar ends with                        1.00  1202.7±35.59µs        ? ?/sec     1.03  1236.9±23.92µs        ? ?/sec
ilike_utf8 scalar equals                           1.00   747.4±16.29µs        ? ?/sec     1.05   784.2±23.50µs        ? ?/sec
ilike_utf8 scalar starts with                      1.00  1102.1±30.61µs        ? ?/sec     1.02  1127.3±28.58µs        ? ?/sec
ilike_utf8_scalar_dyn dictionary[10] string[4])    1.00     77.6±0.08µs        ? ?/sec     1.00     77.7±0.10µs        ? ?/sec
like_utf8 scalar complex                           1.00  1845.0±21.89µs        ? ?/sec     1.02  1889.4±18.40µs        ? ?/sec
like_utf8 scalar contains                          1.00  1549.6±14.84µs        ? ?/sec     1.01  1570.9±13.61µs        ? ?/sec
like_utf8 scalar ends with                         1.00   410.3±14.51µs        ? ?/sec     1.02   420.4±10.65µs        ? ?/sec
like_utf8 scalar equals                            1.00    127.0±0.52µs        ? ?/sec     1.00    127.0±0.27µs        ? ?/sec
like_utf8 scalar starts with                       1.00   336.7±19.19µs        ? ?/sec     1.01    339.4±5.83µs        ? ?/sec
like_utf8_scalar_dyn dictionary[10] string[4])     1.00     77.5±0.12µs        ? ?/sec     1.00     77.4±0.05µs        ? ?/sec
like_utf8view scalar complex                       1.00    182.4±0.72ms        ? ?/sec     1.00    183.1±0.44ms        ? ?/sec
like_utf8view scalar contains                      1.03    142.5±0.23ms        ? ?/sec     1.00    138.8±0.22ms        ? ?/sec
like_utf8view scalar ends with 13 bytes            1.00     44.3±0.38ms        ? ?/sec     1.08     47.8±0.10ms        ? ?/sec
like_utf8view scalar ends with 4 bytes             1.00     45.0±0.18ms        ? ?/sec     1.09     48.9±0.18ms        ? ?/sec
like_utf8view scalar ends with 6 bytes             1.00     44.8±0.13ms        ? ?/sec     1.08     48.6±0.17ms        ? ?/sec
like_utf8view scalar equals                        1.00     35.8±0.16ms        ? ?/sec     1.01     36.0±0.07ms        ? ?/sec
like_utf8view scalar starts with 13 bytes          1.01     47.8±0.15ms        ? ?/sec     1.00     47.3±0.16ms        ? ?/sec
like_utf8view scalar starts with 4 bytes           1.00     34.3±0.09ms        ? ?/sec     1.00     34.3±0.10ms        ? ?/sec
like_utf8view scalar starts with 6 bytes           1.01     48.2±0.20ms        ? ?/sec     1.00     47.8±0.15ms        ? ?/sec
nilike_utf8 scalar complex                         1.00      2.6±0.06ms        ? ?/sec     1.02      2.7±0.07ms        ? ?/sec
nilike_utf8 scalar contains                        1.03      4.2±0.06ms        ? ?/sec     1.00      4.1±0.05ms        ? ?/sec
nilike_utf8 scalar ends with                       1.00  1189.1±24.87µs        ? ?/sec     1.03  1229.4±22.89µs        ? ?/sec
nilike_utf8 scalar equals                          1.00   755.0±17.17µs        ? ?/sec     1.04   787.2±27.12µs        ? ?/sec
nilike_utf8 scalar starts with                     1.00  1124.4±45.36µs        ? ?/sec     1.01  1136.6±33.68µs        ? ?/sec
nlike_utf8 scalar complex                          1.00  1855.2±39.69µs        ? ?/sec     1.02  1887.3±29.24µs        ? ?/sec
nlike_utf8 scalar contains                         1.00  1548.3±12.27µs        ? ?/sec     1.01  1568.2±12.93µs        ? ?/sec
nlike_utf8 scalar ends with                        1.00   411.8±17.08µs        ? ?/sec     1.02    420.5±7.53µs        ? ?/sec
nlike_utf8 scalar equals                           1.00    126.7±0.14µs        ? ?/sec     1.00    127.0±0.39µs        ? ?/sec
nlike_utf8 scalar starts with                      1.00    328.4±5.42µs        ? ?/sec     1.05   345.2±13.53µs        ? ?/sec
++ popd

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks (again) @findepi

@tustvold tustvold merged commit 5ad621f into apache:master Nov 9, 2024
18 checks passed
@findepi findepi deleted the findepi/fix-like-with-escapes-792c56 branch November 9, 2024 10:18
findepi added a commit to sdf-labs/arrow-rs that referenced this pull request Nov 14, 2024
* Fix LIKE with escapes

Fix LIKE processing for patterns containing escapes

- the starts_with / ends_with optimization did not correctly check for
  escapes when checking rest of the pattern for being literal or not
- the pattern to regexp compiler incorrectly processed \ followed by a
  character other than % or _. In PostgreSQL '\x' pattern matches single
  'x'.

There are two tests

- like_escape_many was generated using PostgreSQL with the code attached
  below for verification
- like_escape is hand-picked test cases that are more interesting.
  Lower cardinality of hand-picked test cases allows for exercising all
  scalar/array vs scalar/array combinations.

The below script isn't simples possible, because it was attempted to
generate more test cases by adding padding. Hence e.g.
is_like_without_dangling_escape.  Since this is attached for reference,
should be attached as-is.

```python
import psycopg2

data = r"""
\
\\
\\\
\\\\
a
\a
\\a
%
\%
\\%
%%
\%%
\\%%
_
\_
\\_
__
\__
\\__
abc
a_c
a\bc
a\_c
%abc
\%abc
a\\_c%
""".split('\n')

data = list(dict.fromkeys(data))

conn = psycopg2.connect(host='localhost', port=5432, user='postgres', password='mysecretpassword')
conn.set_session(autocommit=True)
cursor = conn.cursor()
for r in data:
    try:
        # PostgreSQL verifies dandling escape only sometimes
        cursor.execute(f"SELECT %s LIKE %s", (r, r))
        is_like, = cursor.fetchone()
        has_dandling_escape = False
        pg_pattern = r
    except Exception as e:
        if 'LIKE pattern must not end with escape character' not in str(e):
            raise e
        has_dandling_escape = True
        pg_pattern = r + '\\'

    for l in data:
        # print()
        # print('     '.join(str(v) for v in (l, r, has_dandling_escape, postgres_pattern)))
        cursor.execute(f"SELECT %s LIKE %s", (l, pg_pattern))
        is_like, = cursor.fetchone()
        assert type(is_like) is bool

        if not is_like and has_dandling_escape:
            pattern_without_escaped_dandling_escape = pg_pattern[:-2]
            cursor.execute(f"SELECT %s LIKE %s", (l, pattern_without_escaped_dandling_escape))
            is_like_without_dangling_escape, = cursor.fetchone()
            assert type(is_like_without_dangling_escape) is bool
        else:
            is_like_without_dangling_escape = False
        assert '"' not in l
        assert '"' not in r
        print('(r"%s", r"%s", %s),' % (
            l, r,
            str(is_like).lower(),
            # str(has_dandling_escape).lower(),
            # str(is_like_without_dangling_escape).lower(),
        ))
```

* Compact tests for regex_like

Reduce test code boilerplate and make it easier to see what are the test
cases.

* Add more test cases for regex_like
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect like results for pattern starting/ending with % percent and containing escape characters
3 participants