Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete the rewrite_table_scan from the ExecutionPlan and use it within an AnalyzerRule #61

Open
hozan23 opened this issue Sep 19, 2024 · 4 comments

Comments

@hozan23
Copy link
Collaborator

hozan23 commented Sep 19, 2024

Hello,

We are thinking of moving the recently added rewrite_table_scan function into an AnalyzerRule instead of running it in the ExecutionPlan. This change makes sense since the AnalyzerRule is responsible for the semantic changes when transforming the LogicalPlans.

What do you think? @phillipleblanc

cc: @backkem

@phillipleblanc
Copy link
Contributor

Makes sense to me. One caveat we've run into is that you can't blindly rewrite the plan on the DataFusion side, since DataFusion still needs the table names it knows about to work properly. The LogicalPlan generated by rewrite_table_scan is only useful as input to the Unparser

@backkem
Copy link
Collaborator

backkem commented Sep 19, 2024

Yea, good point.

We have an implementation where we basically have two separate schemas: the external or "model" schema represented by a ModelSchemaProvider and ModelTableProviders and a internal/storage schema represented by a StorageSchemaProvider and StorageTableProviders. The former "model" schema is used during query parsing. The ModelTableProviders have a method to fetch the corresponding StorageTableProvider. It's the rewrite analyser that rewrites the entire LogicalPlan from using ModelTableProviders to StorageTableProviders. The storage schema is used during execution.

I'd love to hear your insight on that approach. I think it's more general. For example, it would work when using plain table providers or when trying to federate using Substrait plans. On the other hand, it could introduce too many new abstraction into the mix.

@phillipleblanc
Copy link
Contributor

In general I'm a fan of simpler systems when possible. I'm not really convinced that adding these abstractions buys us all that much. i.e. for plain table providers, its relatively straightforward to keep state on the internal name and when you call .scan() on it, you don't even have to care what the name datafusion has for you (we do this for all of our non-federated custom providers). For Substrait plan federation, I would just have its implementation call rewrite_table_scan directly - similar to how the SQL federation is doing it.

I do see how it would be nice though if that was already handled at a different layer so that the SQL/Substrait federation providers can just call the unparser directly on the plan its given.

@backkem
Copy link
Collaborator

backkem commented Sep 20, 2024

Yea, I guess our use-case for this is somewhat different from federation. It's more related to creating a separate "model" or "semantic layer" on/over top of the storage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants