-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Blog post on using UDFs in python #17
base: main
Are you sure you want to change the base?
Conversation
return pa.array(result) | ||
|
||
|
||
is_of_interest = udf( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_of_interest = udf( | |
# Wrap our custom function with `datafusion.udf`, annotating expected | |
# parameter and return types | |
is_of_interest = udf( |
As a separate note, it wouldn't be hard to convert this udf
function wrapper into a Python decorator, so we could do
@udf(args=(pa.int64(), pa.int64(), pa.utf8()), returns=pa.bool_(), "stable")
def is_of_interest(
partkey_arr: pa.Array,
suppkey_arr: pa.Array,
returnflag_arr: pa.Array,
) -> pa.Array: ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea. I've added it to the issue list apache/datafusion-python#806
returnflag_arr: pa.Array, | ||
) -> pa.Array: | ||
results = None | ||
for partkey, suppkey, returnflag in values_of_interest: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can use pyarrow.is_in
to speed this up, instead of doing an equality check multiple times: https://arrow.apache.org/docs/python/generated/pyarrow.compute.is_in.html
partkey_arr: pa.Array, | ||
suppkey_arr: pa.Array, | ||
returnflag_arr: pa.Array, | ||
) -> pa.Array: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be helpful here to describe your problem a little bit. Say what partkey_arr
is representing, and how it relates to your values_of_interest
above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a smaller statement in the earlier section, but I've expanded it because it was easy to pass over.
let values = partkey_arr | ||
.values() | ||
.iter() | ||
.zip(suppkey_arr.values().iter()) | ||
.zip(returnflag_arr.iter()) | ||
.map(|((a, b), c)| (a, b, c.unwrap_or_default())) | ||
.map(|v| values_to_search.contains(&v)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is faster I suppose because it's not doing a boolean check on each individual array in its entirety and then ORing them? It's doing it all at once in a single pass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I didn't dive any deeper but my expectation is that by doing a single pass through the iteration we'll get a small speed improvement. It my modest test it only accounted for about a 5% boost.
Huge tip of the hat to @kylebarron for the thorough feedback! |
This PR adds a blog post describing using UDFs, and in particular on how to combine third party rust UDFs with datafusion-python.