-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(duckdb): use duckdb's default temp_directory #10433
base: main
Are you sure you want to change the base?
Conversation
89bee38
to
0090b0c
Compare
0090b0c
to
1f724d4
Compare
Also, only create parent of the temp dir. duckdb will make dir itself only if it needs. Maybe the user can avoid a temp dir if they are only doing low-memory stuff. fixes ibis-project#10433
1f724d4
to
48baa07
Compare
config["temp_directory"] = str(temp_directory) | ||
|
||
if temp_directory := config.get("temp_directory"): | ||
# Only create parent. duckdb will make dir itself only if it needs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this behavior documented somewhere? We were doing this because at the time, DuckDB did not create a temp directory for in-memory connections unless the user explicitly specified one.
ibis.duckdb.connect(temp_directory=path) | ||
assert path.exists() | ||
con = ibis.duckdb.connect(temp_directory=path, max_memory="1MB") | ||
con.raw_sql("CREATE TABLE big AS SELECT range(100_000) as i") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see.
Thanks for adding this test, I agree this is much better than the previous one.
@@ -394,19 +389,10 @@ def do_connect( | |||
("md:", "motherduck:", ":memory:") | |||
): | |||
database = Path(database).absolute() | |||
|
|||
if temp_directory is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a good idea to get rid of.
You're effectively making it so that in-memory connections no longer work on larger-than-memory data unless temp_directory
is passed. That seems strictly worse than being inconsistent with duckdb.connect
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other words, your new test should pass without needing to provide the temp_directory
argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to https://duckdb.org/docs/configuration/overview.html#global-configuration-options, it says the default temp_directory is "⟨database_name⟩.tmp or .tmp (in in-memory mode)", which makes me think that either way, duckdb is gonna make the temp dir
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, just saw max_temp_directory_size. yeah I will add a test that checks that if you
- do not pass temp dir
- set the memory limit low and do some computation
- then we expect
./.tmp/
to be created and have some content
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, nice, it looks like it's now behaving as desired :)
In that case, can we remove the explicit option entirely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that the only other behavior that we want to test for? Or is there another test case I should add that we want to pass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we can't, because duckdb won't create intermediate directories
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I'd much rather have duckdb pooping files in the XDG-standard place (usually ~/.cache/$APPLICATION
), rather than in the current directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're aiming for maximum transparency then we should just call it a breaking change and pass the argument through as part of **kwargs
and verify that out-of-core processing works.
Let's make the pass-through from ibis to duckdb more transparent, so you get closer to the same behavior as when using
import duckdb; duckdb.connect()
.I considered also removing the nicety of creating the temp_dir for users, but decided to keep it.
Now, we only create parent. duckdb will make dir itself only if it needs. This means that for low-memory loads,
a dir might never be needed. This better matches duckdb's behavior.
Also, adjust our test to really check for the end desired behavior, not an implementation detail
of if WE create the dir.
Related to #6974