Expose API to register a foreign TableProvider #823
Comments
I'm trying to think through how best to support this. From preliminary investigation into how we're handling arrow dataset, we create a table provider in |
For Table Provider, what I've been investigating is how we could do something like On the one hand, I can see wanting to just implement |
A I think pola-rs has built their plugins around a C interface; I haven't dived too deep into their internals yet |
No, sorry, I wasn't clear in what I was suggesting. I was wondering if we should expose something along the lines of I think this topic is complex. Since different versions of the rust compiler and different versions of datafusion would all lead to different binary layouts, it's really not as simple as exposing Lowest level effort to get this up and running would be the last idea. Maybe that's okay? I'd hate to add that kind of build dependency. Also anyone who is developing would have to make sure they either use the same or build both wheels locally. My thoughts are a little jumbled. |
I've done some additional testing with mixed success.
Approach 1: Direct Expose
In this approach we basically just expose a function
Approach 2: Create FFI Table Provider
In this approach we define a true FFI friendly Table Provider. We expose a PyCapsule with this table provider.
Evaluation
I have each of these approaches working in a minimal fashion. For the direct expose it is working except I have an odd failure when trying to do a For the FFI table provider, I've got the round trip working where we can get the schema from the table provider through FFI and I intentionally built them in different compiler modes to ensure the internal representations differed. The part I'm stuck on here is how much would have to be exposed to get all required functions of I'm open to thoughts and suggestions. |
Approach 2 looks the most promising, since many different versions of deltalake could then be compatible with datafusion-python. Regarding the session, what exactly is configurable from Python? Wouldn't those config settings be easy to pass across? |
I'm in favor of the second approach also, but the concern is the depth of options that have to be exposed. I suppose instead of trying to expose |
I don't know if this is useful, but here is a POC that creates a PyTableProvider, exposing part of the TableProvider trait to Python, the same way it is done for UDAF (Accumulator). The idea would be to wrap any Python object that returns RecordBatch(es). This goes through Python, though. In Python the dev would have to do something like this |
That's a really good idea. I was thinking of it entirely from the direction of exposing the api but maybe what we should be doing is leaning on going through python like you suggest. |
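To make the "go through Python" idea concrete, here is a minimal standard-library sketch of a duck-typed provider object. Everything in it is an assumption made for illustration: the `PythonTableProvider` protocol, the `schema` / `scan` method names, and the `InMemoryProvider` class are hypothetical and are not taken from the POC or from datafusion-python. A plain dict of column lists stands in for a pyarrow RecordBatch.

```python
from typing import Dict, Iterator, List, Optional, Protocol, Sequence

# Stand-in for pyarrow.RecordBatch: column name -> list of values.
Batch = Dict[str, List[object]]


class PythonTableProvider(Protocol):
    """Hypothetical shape an engine could require from a Python object."""

    def schema(self) -> Dict[str, str]: ...

    def scan(
        self,
        projection: Optional[Sequence[str]] = None,
        limit: Optional[int] = None,
    ) -> Iterator[Batch]: ...


class InMemoryProvider:
    """Toy provider that serves fixed columns in small batches."""

    def __init__(self, schema: Dict[str, str], columns: Batch, batch_size: int = 2):
        self._schema = dict(schema)
        self._columns = columns
        self._batch_size = batch_size
        # All columns are assumed to have the same length.
        self._num_rows = len(next(iter(columns.values()), []))

    def schema(self) -> Dict[str, str]:
        return dict(self._schema)

    def scan(self, projection=None, limit=None) -> Iterator[Batch]:
        # Projection selects columns; limit caps the total number of rows.
        names = list(projection) if projection is not None else list(self._columns)
        total = self._num_rows if limit is None else min(self._num_rows, limit)
        for start in range(0, total, self._batch_size):
            stop = min(start + self._batch_size, total)
            yield {name: self._columns[name][start:stop] for name in names}
```

With this shape, the engine only needs to call `schema()` once and iterate `scan(...)`, so no Rust binary layout crosses the boundary. As noted in the thread, the cost is that every batch passes through Python.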
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In delta-rs we have a TableProvider on DeltaTable on the Rust side. We would like to leverage this in datafusion-python so that we can make use of the scanning capabilities of DataFusion. However, I don't see an API for registering a table that is implemented as a TableProvider in Rust.
Describe the solution you'd like
Provide a means of registering, through Python, tables that implement TableProvider in Rust.
Describe alternatives you've considered
Couldn't find any.
Related issues
delta-io/delta-rs#1204