Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Blog post on using UDFs in python #17

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

timsaucer
Copy link
Contributor

This PR adds a blog post describing using UDFs, and in particular on how to combine third party rust UDFs with datafusion-python.

_posts/2024-08-06-datafusion-python-udf-comparisons.md Outdated Show resolved Hide resolved
_posts/2024-08-06-datafusion-python-udf-comparisons.md Outdated Show resolved Hide resolved
_posts/2024-08-06-datafusion-python-udf-comparisons.md Outdated Show resolved Hide resolved
_posts/2024-08-06-datafusion-python-udf-comparisons.md Outdated Show resolved Hide resolved
_posts/2024-08-06-datafusion-python-udf-comparisons.md Outdated Show resolved Hide resolved
return pa.array(result)


is_of_interest = udf(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
is_of_interest = udf(
# Wrap our custom function with `datafusion.udf`, annotating expected
# parameter and return types
is_of_interest = udf(

As a separate note, it wouldn't be hard to convert this udf function wrapper into a Python decorator, so we could do

@udf(args=(pa.int64(), pa.int64(), pa.utf8()), returns=pa.bool_(), "stable")
def is_of_interest(
    partkey_arr: pa.Array,
    suppkey_arr: pa.Array,
    returnflag_arr: pa.Array,
) -> pa.Array: ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea. I've added it to the issue list apache/datafusion-python#806

_posts/2024-08-06-datafusion-python-udf-comparisons.md Outdated Show resolved Hide resolved
_posts/2024-08-06-datafusion-python-udf-comparisons.md Outdated Show resolved Hide resolved
returnflag_arr: pa.Array,
) -> pa.Array:
results = None
for partkey, suppkey, returnflag in values_of_interest:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can use pyarrow.is_in to speed this up, instead of doing an equality check multiple times: https://arrow.apache.org/docs/python/generated/pyarrow.compute.is_in.html

_posts/2024-08-06-datafusion-python-udf-comparisons.md Outdated Show resolved Hide resolved
_posts/2024-08-06-datafusion-python-udf-comparisons.md Outdated Show resolved Hide resolved
_posts/2024-08-06-datafusion-python-udf-comparisons.md Outdated Show resolved Hide resolved
Comment on lines +117 to +120
partkey_arr: pa.Array,
suppkey_arr: pa.Array,
returnflag_arr: pa.Array,
) -> pa.Array:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be helpful here to describe your problem a little bit. Say what partkey_arr is representing, and how it relates to your values_of_interest above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a smaller statement in the earlier section, but I've expanded it because it was easy to pass over.

_posts/2024-08-06-datafusion-python-udf-comparisons.md Outdated Show resolved Hide resolved
_posts/2024-08-06-datafusion-python-udf-comparisons.md Outdated Show resolved Hide resolved
Comment on lines +479 to +485
let values = partkey_arr
.values()
.iter()
.zip(suppkey_arr.values().iter())
.zip(returnflag_arr.iter())
.map(|((a, b), c)| (a, b, c.unwrap_or_default()))
.map(|v| values_to_search.contains(&v));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is faster I suppose because it's not doing a boolean check on each individual array in its entirety and then ORing them? It's doing it all at once in a single pass?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I didn't dive any deeper but my expectation is that by doing a single pass through the iteration we'll get a small speed improvement. It my modest test it only accounted for about a 5% boost.

_posts/2024-08-06-datafusion-python-udf-comparisons.md Outdated Show resolved Hide resolved
@timsaucer
Copy link
Contributor Author

Huge tip of the hat to @kylebarron for the thorough feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants