Blog post on using UDFs in python #17

timsaucer · 2024-08-06T14:05:17Z

This PR adds a blog post on using UDFs in Python, and in particular on how to combine third-party Rust UDFs with datafusion-python.

_posts/2024-08-06-datafusion-python-udf-comparisons.md

kylebarron · 2024-08-12T18:36:50Z

_posts/2024-08-06-datafusion-python-udf-comparisons.md

+    return pa.array(result)
+
+
+is_of_interest = udf(


Suggested change

-is_of_interest = udf(
+# Wrap our custom function with `datafusion.udf`, annotating expected
+# parameter and return types
+is_of_interest = udf(

As a separate note, it wouldn't be hard to convert this `udf` function wrapper into a Python decorator, so we could write:

    @udf(args=(pa.int64(), pa.int64(), pa.utf8()), returns=pa.bool_(), volatility="stable")
    def is_of_interest(
        partkey_arr: pa.Array,
        suppkey_arr: pa.Array,
        returnflag_arr: pa.Array,
    ) -> pa.Array:
        ...

Great idea. I've added it to the issue list apache/datafusion-python#806
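For anyone following along, here is a minimal sketch of how a wrapper function can also serve as a decorator factory. The `udf` below is a hypothetical stand-in, not the real `datafusion.udf`. The keyword names `args`, `returns`, and `volatility` follow the suggestion above, and plain strings stand in for the `pyarrow` types so the snippet has no dependencies. The key values are purely illustrative.

```python
import functools
from typing import Callable, Optional, Sequence


def udf(
    func: Optional[Callable] = None,
    *,
    args: Sequence[str] = (),
    returns: str = "",
    volatility: str = "volatile",
):
    """Hypothetical stand-in for a UDF wrapper; NOT datafusion's real `udf`."""

    def wrap(f: Callable) -> Callable:
        @functools.wraps(f)
        def wrapped(*call_args):
            return f(*call_args)

        # Record the declared signature on the wrapped function.
        wrapped.signature = {
            "args": tuple(args),
            "returns": returns,
            "volatility": volatility,
        }
        return wrapped

    # Called as udf(f, ...): wrap immediately.
    # Called as @udf(...): return the decorator to apply to the function.
    return wrap if func is None else wrap(func)


# Decorator form, as proposed in the comment above.
@udf(args=("int64", "int64", "utf8"), returns="bool", volatility="stable")
def is_of_interest(partkey, suppkey, returnflag):
    # Illustrative values only.
    return (partkey, suppkey, returnflag) == (1530, 4031, "N")


# The existing call form keeps working with the same stand-in.
identity = udf(lambda x: x, args=("int64",), returns="int64")
```

Accepting the function as an optional first argument lets the call-style usage and the decorator-style usage share one entry point.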

kylebarron · 2024-08-12T18:41:49Z

_posts/2024-08-06-datafusion-python-udf-comparisons.md

+    returnflag_arr: pa.Array,
+) -> pa.Array:
+    results = None
+    for partkey, suppkey, returnflag in values_of_interest:


I think you can use `pyarrow.compute.is_in` to speed this up, instead of doing an equality check multiple times: https://arrow.apache.org/docs/python/generated/pyarrow.compute.is_in.html

kylebarron · 2024-08-12T18:46:21Z

_posts/2024-08-06-datafusion-python-udf-comparisons.md

+    partkey_arr: pa.Array,
+    suppkey_arr: pa.Array,
+    returnflag_arr: pa.Array,
+) -> pa.Array:


I think it might be helpful here to describe your problem a little bit. Say what partkey_arr is representing, and how it relates to your values_of_interest above.

I had a smaller statement in the earlier section, but I've expanded it because it was easy to pass over.

kylebarron · 2024-08-12T18:52:54Z

_posts/2024-08-06-datafusion-python-udf-comparisons.md

+        let values = partkey_arr
+            .values()
+            .iter()
+            .zip(suppkey_arr.values().iter())
+            .zip(returnflag_arr.iter())
+            .map(|((a, b), c)| (a, b, c.unwrap_or_default()))
+            .map(|v| values_to_search.contains(&v));


This is faster I suppose because it's not doing a boolean check on each individual array in its entirety and then ORing them? It's doing it all at once in a single pass?

Yes. I didn't dig any deeper, but my expectation is that doing a single pass through the data gives a small speed improvement. In my modest test it only accounted for about a 5% boost.
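To make the two strategies concrete, here is a pure-Python sketch over plain lists. The data and the function names are made up, and using a `set` for the lookup is an assumption (the container behind `values_to_search` is not shown in the hunk above). It is meant to show the shape of each approach, not to reproduce the timing.

```python
# Illustrative tuples and columns; not taken from the blog post.
values_of_interest = [(1530, 4031, "N"), (6158, 3619, "R")]

partkeys = [1530, 1530, 6158, 9]
suppkeys = [4031, 3619, 3619, 4031]
returnflags = ["N", "N", "R", None]


def multi_pass(partkeys, suppkeys, returnflags):
    """One full pass over the columns per tuple of interest, ORing the masks."""
    results = [False] * len(partkeys)
    for pk, sk, rf in values_of_interest:
        mask = [
            a == pk and b == sk and c == rf
            for a, b, c in zip(partkeys, suppkeys, returnflags)
        ]
        results = [r or m for r, m in zip(results, mask)]
    return results


def single_pass(partkeys, suppkeys, returnflags):
    """Zip the columns once and test each row tuple for membership."""
    lookup = set(values_of_interest)
    # A missing return flag becomes "", mirroring `unwrap_or_default()`.
    return [
        (a, b, c if c is not None else "") in lookup
        for a, b, c in zip(partkeys, suppkeys, returnflags)
    ]


multi = multi_pass(partkeys, suppkeys, returnflags)
single = single_pass(partkeys, suppkeys, returnflags)
```

The multi-pass version touches every row once per tuple of interest and builds intermediate masks, while the single-pass version touches every row once in total.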

timsaucer · 2024-08-13T01:34:03Z

Huge tip of the hat to @kylebarron for the thorough feedback!

Blog post on using UDFs in python

8af9b04

timsaucer mentioned this pull request Aug 6, 2024

Document how to use rust UDF extensions of datafusion-python apache/datafusion-python#792

Open

kylebarron reviewed Aug 12, 2024

timsaucer-may added 2 commits August 12, 2024 21:10

Addressing review comments

2c7627d

Small typo

cd04f33

timsaucer mentioned this pull request Aug 13, 2024

Add udf / udaf decorators apache/datafusion-python#806

Open

timsaucer-may added 3 commits August 12, 2024 21:20

Capitalization

685af51

Small language adjustments

718f1ac

Add more thorough description of the problem

7b39dc2
