Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(examples): use memtables for examples to allow backends that do not support temporary tables to support examples #10094

Closed
wants to merge 6 commits into from

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented Sep 11, 2024

Fixes an import bug that can show up when calling register on the datafusion backend without having imported pyarrow.dataset before calling register, as well as moving to memtables for examples to bypass temp=True shenanigans.

The import bug wasn't caught due to importing pyarrow.dataset at the top of the test module.

@cpcloud cpcloud added this to the 9.5 milestone Sep 11, 2024
@cpcloud cpcloud added bug Incorrect behavior inside of ibis datafusion The Apache DataFusion backend labels Sep 11, 2024
# directly.
obj = ibis.memtable(table)
return backend.create_table(table_name, obj, temp=True, overwrite=True)
return ibis.memtable(table, name=table_name)
Copy link
Member

@jcrist jcrist Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this returns a memtable not bound to a backend, calling .execute() on the result will always execute on the default backend, rather than the backend passed in to the backend arg to fetch (except for duckdb or polars, which take a different fast path). This would be a breaking change and seems non-ideal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, true. 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an idea and it's kind of gross.

return backend.create_table(table_name, obj, temp=True, overwrite=True)
obj = ibis.memtable(table, name=table_name)
backend._register_in_memory_tables(obj)
return backend.table(table_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this drop obj (the memtable) at the end of the function, causing it to be collected and the table removed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ARG, yes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if I immediately call .cache()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should just allow temp=True datafusion. Seems like a lot of effort just to disable a behavior that wasn't so bad.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might have weird consequences for the ddl work @ncclementi is doing? That said, I do think that an engine like datafusion (where all tables are kinda temporary anyway) ignoring temp completely does make sense in isolation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long term I do want to make examples work with as many backends as possible, while also not pooping a bunch of example tables around the user's database.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memtable seems like currently the only sane way to do that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll push up the cache "solution" and we can discuss further.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might have weird consequences for the ddl work @ncclementi is doing? That said, I do think that an engine like datafusion (where all tables are kinda temporary anyway) ignoring temp completely does make sense in isolation.

I know this is closed, but for completion of the discussion this wouldn't affect the work itself. My only concern would be if someone creates a "temporary" and then tries to do a con.ddl.list_temp_tables() it will get a not implemented error, because the create_table swallows the TEMPORARY and such table shows a BASE_TABLE in the information_schema, meaning that it will and is listed when con.ddl.list_tables() is invoked.

That being said, there is an upstream issue to raise when TEMPORARY is used in the SQL which will cause the same problem down the road.

@cpcloud
Copy link
Member Author

cpcloud commented Sep 11, 2024

An alternative solution might be to invert the dependency here and allow backends to control how they produce an example table.

return backend.create_table(table_name, obj, temp=True, overwrite=True)
obj = ibis.memtable(table, name=table_name)
backend._register_in_memory_tables(obj)
return backend.table(table_name).cache()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're still in the same place here (but more roundabout), since datafusion doesn't implement cache?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOL, what a mess!

@jcrist
Copy link
Member

jcrist commented Sep 11, 2024

Could we add datafusion to _DIRECT_BACKENDS? I sympathize with wanting to fix the general problem here (#9110), but since we want to release 9.5 today maybe a more minimal change would be good for now?

@cpcloud
Copy link
Member Author

cpcloud commented Sep 11, 2024

Yeah, that makes sense.

@cpcloud
Copy link
Member Author

cpcloud commented Sep 11, 2024

We can't do that because datafusion can't read the compressed CSV file without additional arguments.

@cpcloud
Copy link
Member Author

cpcloud commented Sep 11, 2024

Closing this for now.

@cpcloud cpcloud closed this Sep 11, 2024
@cpcloud cpcloud deleted the datafusion-dataset-fix branch September 11, 2024 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis datafusion The Apache DataFusion backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants