Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(duckdb): use duckdb's default temp_directory #10433

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

NickCrews
Copy link
Contributor

@NickCrews NickCrews commented Nov 4, 2024

Let's make the pass-through from ibis to duckdb more transparent, so you get closer to the same behavior as when using import duckdb; duckdb.connect().

I considered also removing the nicety of creating the temp_dir for users, but decided to keep it.
Now, we only create parent. duckdb will make dir itself only if it needs. This means that for low-memory loads,
a dir might never be needed. This better matches duckdb's behavior.
Also, adjust our test to really check for the end desired behavior, not an implementation detail
of if WE create the dir.

Related to #6974

@github-actions github-actions bot added the duckdb The DuckDB backend label Nov 4, 2024
@github-actions github-actions bot added tests Issues or PRs related to tests sql Backends that generate SQL labels Nov 4, 2024
NickCrews added a commit to NickCrews/ibis that referenced this pull request Nov 4, 2024
Also, only create parent of the temp dir. duckdb will make dir itself only if it needs. Maybe the user can avoid a temp dir
if they are only doing low-memory stuff.

fixes ibis-project#10433
config["temp_directory"] = str(temp_directory)

if temp_directory := config.get("temp_directory"):
# Only create parent. duckdb will make dir itself only if it needs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this behavior documented somewhere? We were doing this because at the time, DuckDB did not create a temp directory for in-memory connections unless the user explicitly specified one.

ibis.duckdb.connect(temp_directory=path)
assert path.exists()
con = ibis.duckdb.connect(temp_directory=path, max_memory="1MB")
con.raw_sql("CREATE TABLE big AS SELECT range(100_000) as i")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see.

Thanks for adding this test, I agree this is much better than the previous one.

@@ -394,19 +389,10 @@ def do_connect(
("md:", "motherduck:", ":memory:")
):
database = Path(database).absolute()

if temp_directory is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a good idea to get rid of.

You're effectively making it so that in-memory connections no longer work on larger-than-memory data unless temp_directory is passed. That seems strictly worse than being inconsistent with duckdb.connect.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other words, your new test should pass without needing to provide the temp_directory argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to https://duckdb.org/docs/configuration/overview.html#global-configuration-options, it says the default temp_directory is "⟨database_name⟩.tmp or .tmp (in in-memory mode)", which makes me think that either way, duckdb is gonna make the temp dir

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, just saw max_temp_directory_size. yeah I will add a test that checks that if you

  • do not pass temp dir
  • set the memory limit low and do some computation
  • then we expect ./.tmp/ to be created and have some content

Copy link
Member

@cpcloud cpcloud Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, nice, it looks like it's now behaving as desired :)

In that case, can we remove the explicit option entirely?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that the only other behavior that we want to test for? Or is there another test case I should add that we want to pass?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we can't, because duckdb won't create intermediate directories

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I'd much rather have duckdb pooping files in the XDG-standard place (usually ~/.cache/$APPLICATION), rather than in the current directory.

Copy link
Member

@cpcloud cpcloud Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're aiming for maximum transparency then we should just call it a breaking change and pass the argument through as part of **kwargs and verify that out-of-core processing works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
duckdb The DuckDB backend sql Backends that generate SQL tests Issues or PRs related to tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants