fix: regexp_extract returns match in mismatched group #12109

Open
wants to merge 2 commits into
base: main

HolyLow
Contributor

@HolyLow HolyLow commented Jan 17, 2025

Re2Extract has a bug: it can treat a capture group that did not participate in the match as a MATCHED empty string "" rather than as MISMATCHED (std::nullopt).

For example, consider the call regexp_extract("rat cat\nbat dog", "ra(.)|blah(.)(.)", 2).
Here the result for group 2 must be std::nullopt, because the match comes from the first alternative ra(.), and no substring matches the second alternative blah(.)(.), which contains group 2.
The current implementation, however, mistakes the successful match of ra(.) (group 1) for an empty match of group 2 and returns an empty string, which is wrong.

This PR fixes the bug in Re2Extract.

Note that Re2ExtractAll has the same bug, but this PR does not change Re2ExtractAll because its existing unit tests already rely on the current behavior.
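
To make the distinction concrete, here is a minimal sketch of the intended semantics. It uses std::regex from the C++ standard library rather than RE2, and extractGroup is a hypothetical helper written for illustration, not the Velox implementation. Returning std::nullopt for an out-of-range group is also an assumption of the sketch.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper, not Velox's Re2Extract: returns the text captured by
// `groupId` in the first match, or std::nullopt when there is no match or the
// group did not participate in the match.
std::optional<std::string> extractGroup(
    const std::string& input,
    const std::string& pattern,
    size_t groupId) {
  const std::regex re(pattern);
  std::smatch match;
  // Assumption: an out-of-range group id is reported as "no result" here.
  if (!std::regex_search(input, match, re) || groupId >= match.size()) {
    return std::nullopt;
  }
  // sub_match::matched is false for a group on an alternative that was not
  // taken. str() would still return "" for it, which is exactly the
  // ambiguity described above, so check the flag first.
  if (!match[groupId].matched) {
    return std::nullopt;
  }
  return match[groupId].str();
}
```

With this helper, extractGroup("rat cat\nbat dog", "ra(.)|blah(.)(.)", 2) yields std::nullopt, while a group that genuinely matches nothing, such as group 1 of "rat(x*)" on "rat cat", yields an engaged optional holding "".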

@facebook-github-bot added the CLA Signed label Jan 17, 2025

netlify bot commented Jan 17, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit b7ed8c7
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/678a1e5bf91dc1000836a73b

@HolyLow
Contributor Author

HolyLow commented Jan 17, 2025

@kgpai @mbasmanova Could you kindly help review this PR? Thanks a lot.

Any suggestion is welcome.

@mbasmanova
Contributor

> Also note that this bug behavior exists in Re2ExtractAll as well, but this PR doesn't modify in Re2ExtractAll because existing UTs of Re2ExtractAll already rely on this behavior.

Sounds like the implementation is buggy and the test expectations are wrong. In this case, we need to fix both the implementation and the tests.

Would you check Presto Java to see if Velox behavior matches it?
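
For reference, a sketch of what an extract-all variant could look like if it preserved the same distinction per match. It again uses std::regex instead of RE2, and extractAllGroups is a hypothetical helper, not Velox's Re2ExtractAll. Whether a non-participating group should appear as a NULL element is an assumption here; that is the behavior the comparison with Presto Java would need to confirm.

```cpp
#include <optional>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper, not Velox's Re2ExtractAll: collects `groupId` from
// every match, keeping std::nullopt for matches in which that group did not
// participate (assumed semantics, to be checked against Presto Java).
std::vector<std::optional<std::string>> extractAllGroups(
    const std::string& input,
    const std::string& pattern,
    size_t groupId) {
  const std::regex re(pattern);
  std::vector<std::optional<std::string>> results;
  for (std::sregex_iterator it(input.begin(), input.end(), re), end;
       it != end;
       ++it) {
    const std::smatch& match = *it;
    if (groupId < match.size() && match[groupId].matched) {
      results.emplace_back(match[groupId].str());
    } else {
      // Group was on an alternative that this match did not take.
      results.emplace_back(std::nullopt);
    }
  }
  return results;
}
```

For example, with the pattern "ra(.)|ca(.)" on "rat cat" and group 2, the first match ("rat") contributes std::nullopt and the second ("cat") contributes "t".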

@mbasmanova
Contributor

Looks like Presto Java returns NULL:

presto:di> SELECT regexp_extract('rat cat\nbat dog', 'ra(.)|blah(.)(.)', 2);
 _col0
-------
 NULL
(1 row)

@HolyLow Would you create a GitHub issue to describe this problem? Then, reference the issue in the PR description.

Contributor

@mbasmanova mbasmanova left a comment

Thank you for the fix!

CC: @amitkdutta @kevinwilfong @kagamiori

@mbasmanova added the ready-to-merge label Jan 17, 2025
@kagamiori changed the title from "bugfix: regexp_extract returns match in mismatched group" to "fix: regexp_extract returns match in mismatched group" Jan 17, 2025
@facebook-github-bot
Contributor

@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
