fix(examples): use memtables for examples to allow backends that do not support temporary tables to support examples #10094
ibis/examples/__init__.py
Outdated
# directly.
obj = ibis.memtable(table)
return backend.create_table(table_name, obj, temp=True, overwrite=True)
return ibis.memtable(table, name=table_name)
Since this returns a memtable not bound to a backend, calling .execute() on the result will always execute on the default backend, rather than the backend passed in to the backend arg to fetch (except for duckdb or polars, which take a different fast path). This would be a breaking change and seems non-ideal.
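To make the concern concrete, here is a minimal toy model in plain Python. It is not ibis code: ToyBackend, ToyExpr, DEFAULT_BACKEND, fetch_unbound and fetch_bound are all hypothetical names, and the fallback rule is an assumption that mirrors the behavior described in the comment above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyBackend:
    """Hypothetical stand-in for a backend connection."""
    name: str


# Assumption: unbound expressions fall back to one process-wide default.
DEFAULT_BACKEND = ToyBackend("default")


@dataclass
class ToyExpr:
    """Hypothetical stand-in for a table expression."""
    table_name: str
    backend: Optional[ToyBackend] = None  # None models an unbound memtable

    def execute(self) -> str:
        # An unbound expression has no backend of its own, so it uses the default.
        target = self.backend if self.backend is not None else DEFAULT_BACKEND
        return f"{self.table_name} executed on {target.name}"


def fetch_unbound(table_name: str, backend: ToyBackend) -> ToyExpr:
    # Models returning a bare memtable: the backend argument is ignored.
    return ToyExpr(table_name)


def fetch_bound(table_name: str, backend: ToyBackend) -> ToyExpr:
    # Models the previous behavior: the result is tied to the requested backend.
    return ToyExpr(table_name, backend=backend)
```

In this model, passing a datafusion-like backend to fetch_unbound still runs on the default backend, which is the breaking change being pointed out.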
Hm, true. 🤔
I have an idea and it's kind of gross.
ibis/examples/__init__.py
Outdated
return backend.create_table(table_name, obj, temp=True, overwrite=True)
obj = ibis.memtable(table, name=table_name)
backend._register_in_memory_tables(obj)
return backend.table(table_name)
Won't this drop obj (the memtable) at the end of the function, causing it to be collected and the table removed?
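The lifetime problem raised here can be sketched with the standard library alone. This is a toy model, not the ibis implementation: ToyMemtable, ToyBackend and both fetch functions are hypothetical, and the assumption is that the registration is torn down by a finalizer attached to the memtable object.

```python
import gc
import weakref


class ToyMemtable:
    """Hypothetical in-memory table object."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows


class ToyBackend:
    """Hypothetical backend that tracks registered in-memory tables."""

    def __init__(self):
        self.tables = {}

    def register_in_memory_table(self, memtable):
        name = memtable.name
        self.tables[name] = list(memtable.rows)
        # Assumption: the registration is removed once the memtable is collected.
        weakref.finalize(memtable, self.tables.pop, name, None)


def fetch_dropping_memtable(backend, name, rows):
    obj = ToyMemtable(name, rows)
    backend.register_in_memory_table(obj)
    return name  # obj goes out of scope here, so its finalizer can run


def fetch_keeping_memtable(backend, name, rows):
    obj = ToyMemtable(name, rows)
    backend.register_in_memory_table(obj)
    return obj  # the caller's reference keeps the registration alive
```

Under these assumptions, the table registered by fetch_dropping_memtable disappears as soon as the function returns, while the one from fetch_keeping_memtable survives for as long as the caller holds the returned object.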
ARG, yes
What if I immediately call .cache()?
I wonder if we should just allow temp=True for datafusion. Seems like a lot of effort just to disable a behavior that wasn't so bad.
That might have weird consequences for the ddl work @ncclementi is doing? That said, I do think that an engine like datafusion (where all tables are kinda temporary anyway) ignoring temp completely does make sense in isolation.
Long term I do want to make examples work with as many backends as possible, while also not pooping a bunch of example tables around the user's database.
memtable seems like currently the only sane way to do that.
I'll push up the cache "solution" and we can discuss further.
That might have weird consequences for the ddl work @ncclementi is doing? That said, I do think that an engine like datafusion (where all tables are kinda temporary anyway) ignoring temp completely does make sense in isolation.
I know this is closed, but for completeness of the discussion: this wouldn't affect the work itself. My only concern would be if someone creates a "temporary" table and then calls con.ddl.list_temp_tables(). They will get a not implemented error, because create_table swallows the TEMPORARY and the table shows up as a BASE_TABLE in the information_schema, meaning that it is listed when con.ddl.list_tables() is invoked.
That being said, there is an upstream issue to raise when TEMPORARY is used in the SQL which will cause the same problem down the road.
…ot support temporary tables to support examples
8e5cbc0 to a2cc439
An alternative solution might be to invert the dependency here and allow backends to control how they produce an example table.
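One way to read "invert the dependency" is a per-backend hook that the examples module calls instead of deciding on temp=True itself. The sketch below is a toy model under that assumption; load_example, supports_temp_tables and both backend classes are hypothetical names and not part of ibis.

```python
class ToyBackend:
    """Hypothetical backend that supports temporary tables."""

    supports_temp_tables = True

    def __init__(self):
        self.tables = {}

    def create_table(self, name, rows, temp=False):
        if temp and not self.supports_temp_tables:
            raise NotImplementedError("temporary tables are not supported")
        self.tables[name] = {"rows": list(rows), "temp": temp}
        return name

    def load_example(self, name, rows):
        # Default hook: keep example data out of the user's permanent tables.
        return self.create_table(name, rows, temp=True)


class ToyNoTempBackend(ToyBackend):
    """Hypothetical backend without temporary tables, in the spirit of datafusion."""

    supports_temp_tables = False

    def load_example(self, name, rows):
        # The backend decides for itself how an example table is materialized.
        return self.create_table(name, rows, temp=False)


def fetch(backend, name, rows):
    # The examples code no longer needs to know about temp-table support.
    return backend.load_example(name, rows)
```

The design choice here is that fetch stays identical for every backend, and any special handling lives next to the backend that needs it.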
return backend.create_table(table_name, obj, temp=True, overwrite=True)
obj = ibis.memtable(table, name=table_name)
backend._register_in_memory_tables(obj)
return backend.table(table_name).cache()
I think we're still in the same place here (but more roundabout), since datafusion doesn't implement cache?
LOL, what a mess!
Could we add
Yeah, that makes sense.
We can't do that because datafusion can't read the compressed CSV file without additional arguments.
Closing this for now.
Fixes an import bug that can show up when calling register on the datafusion backend without having imported pyarrow.dataset before calling register, as well as moving to memtables for examples to bypass temp=True shenanigans. The import bug wasn't caught due to importing pyarrow.dataset at the top of the test module.
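The class of import bug described above can be reproduced with a throwaway package, since importing a package does not import its submodules unless the package's own __init__ does so. This sketch uses only the standard library; toy_arrow and its dataset submodule are hypothetical stand-ins for pyarrow and pyarrow.dataset.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_toy_package(root: Path) -> None:
    # Build a package whose __init__ does not import its dataset submodule.
    pkg = root / "toy_arrow"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "dataset.py").write_text(
        "def dataset(path):\n    return f'dataset({path})'\n"
    )


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        make_toy_package(Path(tmp))
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            import toy_arrow

            # The submodule is not an attribute until something imports it.
            missing_before = not hasattr(toy_arrow, "dataset")

            # The fix: import the submodule explicitly where it is used.
            import toy_arrow.dataset

            result = toy_arrow.dataset.dataset("x.parquet")
        finally:
            # Clean up so the toy package does not leak into later imports.
            sys.path.remove(tmp)
            sys.modules.pop("toy_arrow.dataset", None)
            sys.modules.pop("toy_arrow", None)
    return missing_before, result
```

A test module that imports the submodule at the top masks this, because the attribute already exists by the time the code under test runs.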