Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose API to register a foreign TableProvider #823

Closed
ion-elgreco opened this issue Aug 19, 2024 · 9 comments · Fixed by #921
Closed

Expose API to register a foreign TableProvider #823

ion-elgreco opened this issue Aug 19, 2024 · 9 comments · Fixed by #921
Labels
enhancement New feature or request

Comments

@ion-elgreco
Copy link
Contributor

ion-elgreco commented Aug 19, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In delta-rs we have a TableProvider on DeltaTable on the rust side, we would like to leverage this in datafusion-python, so that we can make use the scanning capabilities of datafusion. However I don't see an API where we can register a table which as TableProvider in rust

Describe the solution you'd like
Provide a means to allow registering Tables that implement TableProvider in rust through python

Describe alternatives you've considered
Couldn't find any.

Related issues
delta-io/delta-rs#1204

@timsaucer
Copy link
Contributor

I'm trying to think through how best to support this. From preliminary investigation into how we're handling arrow dataset, we create a table provider in src/dataset.rs but a fair amount of the actual interface code is in src/dataset_exec.rs. I suspect what we need to do is implement a PyCapsule approach and do some checking to ensure the versions are compatible. I still need to investigate further, but this seems like a useful thing to do.

@timsaucer
Copy link
Contributor

For Table Provider, what I've been investigating is how we could do something like register_table_provider. But even scratching the surface of making this stable across versions is a deep rabbit hole. Suppose I only wanted to support scan in a Table Provider. I need to export a stable version of Session, Expr, and ExecutionPlan. But then Session and ExecutionPlan require even more.

On the one hand, I can see wanting to just implement register_delta_table and add code specific to the python interface on delta-rs. That might be the fastest way to get us where we want to be but it's not the best long-term solution probably.

@ion-elgreco
Copy link
Contributor Author

For Table Provider, what I've been investigating is how we could do something like register_table_provider. But even scratching the surface of making this stable across versions is a deep rabbit hole. Suppose I only wanted to support scan in a Table Provider. I need to export a stable version of Session, Expr, and ExecutionPlan. But then Session and ExecutionPlan require even more.

On the one hand, I can see wanting to just implement register_delta_table and add code specific to the python interface on delta-rs. That might be the fastest way to get us where we want to be but it's not the best long-term solution probably.

A register_delta_table in Python Datafusion? That might bloat your binary since you include delta-rs and it will limit your Datafusion version.

I think pola-rs has built their plugins around a C interface, haven't dived to deep in their internals yet

@timsaucer
Copy link
Contributor

No, sorry, I wasn't clear in what I was suggesting. I was wondering if we should expose something along the lines of register_delta_table and then provide a FFI stable interface that narrowly supports the needs of delta-rs. As I'm diving in deeper I'm discovering more difficulties.

I think this topic is complex. Since different versions of the rust compiler and different versions of datafusion would all lead to different binary layouts, it's really not as simple as exposing register_table_provider. One way around that could be to create a compiler dependency for any official wheels - but then we're adding a new requirement on delta-rs which seems fragile long term, especially if you want to update rustc versions (or we do and you want to keep them where they are).

Lowest level effort to get this up and running would be the last idea. Maybe that's okay? I'd hate to add that kind of build dependency. Also anyone who is developing would have to make sure they either use the same or build both wheels locally.

My thoughts are a little jumbled.

@timsaucer
Copy link
Contributor

I've done some additional testing with mixed success.

Approach 1: Direct Expose

In this approach we basically just expose a function register_table_provider that takes a Arc<dyn TableProvider> and put this in a simple structure we can expose via PyCapsule.

  • Advantages: This is simple in terms of datafusion-python. It's really a matter of exposing the PyCapsule and puts all of the responsibility of making it work on the consumer.
  • Disadvantages: This requires not only the datafusion version to match exactly between the consumer and datafusion-python but they must also use the same arrow dependencies and the same compiler version to get compatible binaries. This means that every version of delta-rs would have to update in lock step with datafusion - including sub dependencies AND compiler version. From a release perspective this means users would have to be specific about which versions of datafusion and delta-rs packages they use, leading to a lot of difficult edge cases.

Approach 2: Create FFI Table Provider

In this approach we define a true FFI friendly Table Provider. We expose a PyCapsule with this table provider.

  • Advantages: As long as there have been no changes to the FFI definition any version of datafusion will work with any version of delta-rs. Also, it opens the door for integrating non-rust table providers, but I don't know if that's really important.
  • Disadvantages: This is a huge lift because we need to expose Session as well, which then exposes a lot more. Now, if we could make some argument about only allowing external table providers that do not require the session that would vastly simplify the problem but I do see that delta-rs is using things like the runtime environment and the session config.

Evaluation

I have each of these approaches working in a minimal fashion.

For the direct expose it is working except I have an odd failure when trying to do a show() on the data frame. Oddly, I can execute the dataframe and do count() and other operations. There is some odd dependency along the line that is causing some fault during projection that I'm struggling to troubleshoot.

For the FFI table provider, I've got the round trip working where we can get the schema from the table provider through FFI and I intentionally built them in different compiler modes to ensure the internal representations differed. The part I'm stuck on here is how much would have to be exposed to get all required functions of TableProvider working.

I'm open to thoughts and suggestions.

@ion-elgreco
Copy link
Contributor Author

Approach 2 looks to be most promising since we could have many different versions of deltalake be compatible with Datafusion-python.

Regarding session, what is exactly configurable from python? Wouldn't those config settings be easy to pass across?

@timsaucer
Copy link
Contributor

I'm in favor of the second approach also, but the concern is the depth of options that have to be exposed. I suppose instead of trying to expose Session we can limit it to SessionConfig that would work, especially since we can pass a set of strings and create a SessionConfig from them. The other one it seems like delta-rs uses is RuntimeEnv. I don't know how that's used.

@mesejo
Copy link
Contributor

mesejo commented Sep 10, 2024

I don't know if this is useful, but in here you have a POC creating a PyTableProvider, exposing part of TableProvider trait to Python, the same way is done for UDAF (Accumulator). The idea would be to wrap any Python object that returns RecordBatch(es). This goes through Python, though.

In Python the dev would have to do something like this

@timsaucer
Copy link
Contributor

That's a really good idea. I was thinking of it entirely from the direction of exposing the api but maybe what we should be doing is leaning on going through python like you suggest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants