Support multi string contains [databricks] #11413

Open
wants to merge 4 commits into
base: branch-24.12

res-life
Collaborator

@res-life res-life commented Aug 30, 2024

Combine the targets of multiple string `contains` expressions on the same input and invoke a single kernel that returns one boolean column per target.
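As a rough illustration of the idea only, here is a plain-Python sketch of a combined contains: one pass over the input rows that fills one boolean column per target. The function name `multi_contains` and its signature are assumptions made for this sketch; it is not the cuDF kernel or the plugin code.

```python
from typing import List, Optional, Sequence


def multi_contains(
    column: Sequence[Optional[str]], targets: Sequence[str]
) -> List[List[Optional[bool]]]:
    """Hypothetical helper: check every target against each row in one pass.

    Returns one result column per target; null rows stay null.
    """
    results: List[List[Optional[bool]]] = [[] for _ in targets]
    for value in column:
        # Each row is visited once and tested against all targets,
        # instead of scanning the whole column once per target.
        for out, target in zip(results, targets):
            out.append(None if value is None else target in value)
    return results
```

Calling `multi_contains(["abc", None, "xbz"], ["b", "z"])` yields two columns, one for `"b"` and one for `"z"`, with the null row preserved in both.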

Depends on cuDF PR

Verified: `contains` expressions inside CASE WHEN were combined when running pytest.

Perf test

summary

| strings pattern | total kernel time change | kernel speedup | end to end time change | end to end speedup |
|---|---|---|---|---|
| short strings | 125ms -> 38ms | 3.5x | 12s -> 12s | not obvious |
| long strings | 24s -> 11s | 2.18x | 59s -> 53s | 1.11x |

details

short strings

image
The top part shows the kernel time of the new kernel.
The bottom part shows the baseline.

long strings

Kernel time:
Baseline:
image

New:
image

@res-life res-life changed the title [Do not review] Support multi string contians Support multi string contians Sep 2, 2024
@res-life res-life marked this pull request as ready for review September 2, 2024 09:29
@res-life
Collaborator Author

res-life commented Sep 2, 2024

The build failed because this PR depends on rapidsai/cudf#16641.

@sameerz sameerz added the performance A performance related task/issue label Sep 2, 2024
@res-life res-life changed the title Support multi string contians Support multi string contians [databricks] Sep 3, 2024
revans2 previously approved these changes Sep 4, 2024
Collaborator

@revans2 revans2 left a comment

The code looks good and the performance numbers look good.

gerashegalov previously approved these changes Sep 7, 2024
Collaborator

@gerashegalov gerashegalov left a comment

LGTM, nits

@@ -379,3 +379,30 @@ def test_case_when_all_then_values_are_scalars_with_nulls():
"tab",
sql_without_else,
conf = {'spark.rapids.sql.case_when.fuse': 'true'})

@pytest.mark.parametrize('combine_string_contains_enabled', ['true', 'false'])

nit: prefer Python constants

Suggested change
@pytest.mark.parametrize('combine_string_contains_enabled', ['true', 'false'])
@pytest.mark.parametrize('combine_string_contains_enabled', [True, False])

However, the pytest case IDs will be more readable if the parameters are descriptive strings rather than booleans:

@pytest.mark.parametrize('string_contains_mode', ['multiContains', 'singleContains'], ids=idfn)
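To make the suggestion concrete, here is a minimal sketch of how a descriptive string parameter could be mapped back to a boolean conf value inside the test. The conf key `example.combine.string.contains`, the helper `conf_for`, and the simplified `idfn` are all hypothetical names introduced for this illustration; the real key and id helper come from the PR and the test suite.

```python
import pytest

# Hypothetical conf key, for illustration only; the real key is defined by the PR.
COMBINE_CONF_KEY = 'example.combine.string.contains'


def idfn(val):
    # Simplified stand-in for the test suite's id helper: use the value as the id.
    return str(val)


def conf_for(string_contains_mode):
    # Translate the readable mode name into the 'true'/'false' string the conf expects.
    enabled = string_contains_mode == 'multiContains'
    return {COMBINE_CONF_KEY: str(enabled).lower()}


@pytest.mark.parametrize('string_contains_mode', ['multiContains', 'singleContains'], ids=idfn)
def test_conf_for_mode(string_contains_mode):
    conf = conf_for(string_contains_mode)
    assert conf[COMBINE_CONF_KEY] in ('true', 'false')
```

With string parameters the generated case IDs read `test_conf_for_mode[multiContains]` and `test_conf_for_mode[singleContains]`, which say more in a test report than `[True]` and `[False]`.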

Comment on lines +471 to +473
override def equals(o: Any): Boolean = o match {
case other: ContainsCombiner => exp.left.semanticEquals(other.exp.left) &&
exp.right.isInstanceOf[GpuLiteral] && other.exp.right.isInstanceOf[GpuLiteral]

nit: if you make ContainsCombiner a case class, you can use pattern matching instead of manual isInstanceOf checks:

Suggested change
override def equals(o: Any): Boolean = o match {
case other: ContainsCombiner => exp.left.semanticEquals(other.exp.left) &&
exp.right.isInstanceOf[GpuLiteral] && other.exp.right.isInstanceOf[GpuLiteral]
override def equals(o: Any): Boolean = (o, exp) match {
case (ContainsCombiner(GpuContains(combLeft, GpuLiteral(_, _))), GpuContains(expLeft, GpuLiteral(_, _))) =>
expLeft.semanticEquals(combLeft)

@ttnghia ttnghia changed the title Support multi string contians [databricks] Support multi string contains [databricks] Sep 27, 2024
@res-life res-life changed the base branch from branch-24.10 to branch-24.12 October 9, 2024 02:09
@res-life res-life dismissed stale reviews from gerashegalov and revans2 October 9, 2024 02:09

The base branch was changed.
