Document how to use rust UDF extensions of datafusion-python #792

Open
timsaucer opened this issue Jul 29, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@timsaucer
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When we work on a set of new features or data analytics, we frequently find recurring patterns in our code. We typically abstract these into a Python package to share across the team. Sometimes the existing pyarrow operations are insufficient for our needs, and we must fall back on UDFs that perform poorly because they convert the data to Python objects.

I would like to be able to author these UDFs in Rust and use them from Python. Currently there is a blocker in that data cannot be shared from one Rust crate to another. PyO3/pyo3#1444 is a tracking issue that discusses this in more detail.

Describe the solution you'd like

It appears at first glance that the PyCapsule approach is the best way forward, based on examples in pyo3_arrow and rust-numpy.

I have a minimal example that I will try to put into a repo later this week to demonstrate what I would like to accomplish. Basically, I should be able to define a UDF in Rust so that I can make it highly performant, add a pyo3 wrapper function around it, and use the resulting Python module alongside the existing datafusion-python.

Describe alternatives you've considered

I'm uncertain at this point about other options besides a very painful reexport of the entire datafusion python module.

Additional context

I will try to push up my minimal demonstration repo later this week.

timsaucer added the enhancement (New feature or request) label Jul 29, 2024
timsaucer changed the title from "Allow for rust UDF extensions of datafusion-python" to "Document how to use rust UDF extensions of datafusion-python" Jul 31, 2024
@timsaucer
Contributor Author

Based on excellent feedback on the Discord server, I have a minimal demonstration of this that does not require any code changes in datafusion-python. I'm going to document what I have.

@timsaucer
Contributor Author

https://github.com/timsaucer/tuple_filter_example/blob/main/src/lib.rs

I think my code must be making a copy of the data because it is slower than I would expect. I will continue to investigate.

@Michael-J-Ward
Contributor

> I think my code must be making a copy of the data because it is slower than I would expect. I will continue to investigate.

Try using maturin develop --release in tuple_filter_example

Running with maturin develop

❯ python python-udf-comparisons.py 
Simple filtering has number 32 rows and took 1.7959909439086914 s
This is the incorrect number of rows!
Explicit filtering has number 21 rows and took 0.9604465961456299 s
UDF filtering has number 21 rows and took 8.786761999130249 s
UDF filtering using pyarrow compute has number 21 rows and took 0.7557330131530762 s
UDF filtering using a custom rust function has number 21 rows and took 2.760791063308716 s

Re-running with maturin develop --release

❯ python python-udf-comparisons.py 
Simple filtering has number 32 rows and took 1.6849780082702637 s
This is the incorrect number of rows!
Explicit filtering has number 21 rows and took 0.9972596168518066 s
UDF filtering has number 21 rows and took 8.215271472930908 s
UDF filtering using pyarrow compute has number 21 rows and took 0.777594804763794 s
UDF filtering using a custom rust function has number 21 rows and took 0.6748173236846924 s
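
To put the two runs side by side, here is a small sketch that computes the speedups from the timings printed above. The numbers are copied from the output; the variable and function names are only illustrative and do not come from the example repo.

```python
# Timings in seconds, copied from the two runs above.
debug_seconds = 2.760791063308716      # custom Rust UDF, built with `maturin develop`
release_seconds = 0.6748173236846924   # custom Rust UDF, built with `maturin develop --release`

# Other approaches from the release run, for comparison.
pyarrow_seconds = 0.777594804763794    # UDF using pyarrow compute
python_udf_seconds = 8.215271472930908  # plain Python UDF


def speedup(slow: float, fast: float) -> float:
    """Return how many times faster `fast` is than `slow`."""
    return slow / fast


release_vs_debug = speedup(debug_seconds, release_seconds)
rust_vs_python_udf = speedup(python_udf_seconds, release_seconds)
rust_vs_pyarrow = speedup(pyarrow_seconds, release_seconds)

print(f"release build vs debug build: {release_vs_debug:.2f}x")
print(f"Rust UDF vs plain Python UDF: {rust_vs_python_udf:.2f}x")
print(f"Rust UDF vs pyarrow compute UDF: {rust_vs_pyarrow:.2f}x")
```

On these numbers the release build of the Rust UDF is roughly four times faster than the debug build and slightly ahead of the pyarrow compute UDF. These are single runs, so small differences should be read with caution.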

NOTE:

I had to convert all int32 types to int64. Before that change, the script failed with a type coercion error:

❯ python python-udf-comparisons.py 
Simple filtering has number 32 rows and took 1.810662031173706 s
This is the incorrect number of rows!
Explicit filtering has number 21 rows and took 0.9524128437042236 s
UDF filtering has number 21 rows and took 8.34911322593689 s
Traceback (most recent call last):
  File "/home/mike/workspace/tuple_filter_example/python-example/python-udf-comparisons.py", line 185, in <module>
    num_rows = df_udf_pyarrow_compute.count()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mike/workspace/datafusion-python/dev/python/datafusion/dataframe.py", line 508, in count
    return self.df.count()
           ^^^^^^^^^^^^^^^
Exception: type_coercion
caused by
Error during planning: Coercion from [Int64, Int64, Utf8] to the signature Exact([Int32, Int32, Utf8]) failed.
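
The error message lists the column types next to the types the UDF declares, so the mismatch can be read off position by position. The plain-Python sketch below does only that comparison. `find_signature_mismatches` is a hypothetical helper, not part of DataFusion, and it does not model DataFusion's actual coercion rules. It also assumes that the int32 to int64 conversion mentioned in the note refers to the UDF's declared input types.

```python
from typing import List, Tuple


def find_signature_mismatches(
    argument_types: List[str], declared_types: List[str]
) -> List[Tuple[int, str, str]]:
    """Return (position, actual, declared) for every argument whose type differs."""
    if len(argument_types) != len(declared_types):
        raise ValueError("argument count does not match the declared signature")
    return [
        (position, actual, declared)
        for position, (actual, declared) in enumerate(zip(argument_types, declared_types))
        if actual != declared
    ]


# Type lists copied from the error message above.
column_types = ["Int64", "Int64", "Utf8"]
udf_signature = ["Int32", "Int32", "Utf8"]

mismatches = find_signature_mismatches(column_types, udf_signature)
for position, actual, declared in mismatches:
    print(f"argument {position}: column is {actual}, UDF declares {declared}")

# Assumed fix: declare the first two inputs as Int64 so they match the columns.
fixed_signature = ["Int64", "Int64", "Utf8"]
remaining = find_signature_mismatches(column_types, fixed_signature)
```

With the declared types changed to match the columns, no mismatches remain, which is consistent with the script running to completion in the earlier outputs.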

@timsaucer
Contributor Author

Awesome! Thank you!

@timsaucer
Contributor Author

If this PR goes in, we can possibly just update the user documentation with a minimal example and link to the blog post: apache/datafusion-site#17
