Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache Parquet metadata in pg_analytics #57

Open
philippemnoel opened this issue Aug 8, 2024 · 9 comments
Open

Cache Parquet metadata in pg_analytics #57

philippemnoel opened this issue Aug 8, 2024 · 9 comments
Labels
feature New feature or request good first issue Good for newcomers priority-low Low priority issue

Comments

@philippemnoel
Copy link
Collaborator

What feature are you requesting?

DuckDB has a feature, enable_object_cache, which can cache Parquet metadata in-memory. This is known to improve query performance. We would like to cache this on-disk, so that it can persist across queries.

Why are you requesting this feature?

See above ^

What is your proposed implementation for this feature?

I'm not sure, see #30 for reference

Full Name:

Philippe Noël

Affiliation:

ParadeDB

@philippemnoel philippemnoel added the good first issue Good for newcomers label Aug 8, 2024
@philippemnoel philippemnoel transferred this issue from paradedb/paradedb Aug 8, 2024
@philippemnoel philippemnoel added feature New feature or request priority-high High priority issue labels Aug 8, 2024
@dharanad
Copy link

interesting. I would like to work on this issue. Before that let me take a look at the codebase.

@philippemnoel
Copy link
Collaborator Author

It’s yours!

@dharanad
Copy link

Hey @philippemnoel I am finding it hard to allocate time for this issue. Since this is marked as high priority. Please feel free to re-assign this issues to someone else

@philippemnoel
Copy link
Collaborator Author

Hey @philippemnoel I am finding it hard to allocate time for this issue. Since this is marked as high priority. Please feel free to re-assign this issues to someone else

Sounds good!

@shamb0
Copy link

shamb0 commented Aug 20, 2024

Hi @philippemnoel, I am interested to work on this. Could you please assign it to me.

@philippemnoel
Copy link
Collaborator Author

Yes

@philippemnoel
Copy link
Collaborator Author

Hi @philippemnoel, I am interested to work on this. Could you please assign it to me.

Done! Excited to see :)

@shamb0
Copy link

shamb0 commented Sep 2, 2024

Hi @philippemnoel,

Could you please reassign the issue to me?

  • I've completed a draft version, but this implementation depends on PR#115, so I’m waiting for it to be reviewed and merged before proceeding further.

  • You can find the draft prototype here: link to draft prototype.

  • Please kindly find the traces for reference.

2024-09-02 22:27:30.592  INFO test_mlp_auto_sales::fixtures::tables::auto_sales: Completed data upload to S3
2024-09-02 22:27:30.593  INFO sqlx::postgres::notice: table "auto_sales" does not exist, skipping
2024-09-02 22:27:30.593  INFO sqlx::postgres::notice: server "auto_sales_server" does not exist, skipping
2024-09-02 22:27:30.593  INFO sqlx::postgres::notice: foreign-data wrapper "parquet_wrapper" does not exist, skipping
2024-09-02 22:27:30.594  INFO sqlx::postgres::notice: server "auto_sales_server" does not exist, skipping
2024-09-02 22:27:43.928  INFO test_mlp_auto_sales::fixtures::tables::auto_sales: Average execution time with cache disabled: 479.805658ms
2024-09-02 22:27:43.928  INFO test_mlp_auto_sales::fixtures::tables::auto_sales: Average execution time with cache enabled: 272.384816ms
2024-09-02 22:27:43.928  INFO test_mlp_auto_sales::fixtures::tables::auto_sales: Average execution time after disabling cache: 476.185084ms
2024-09-02 22:27:43.928  INFO test_mlp_auto_sales::fixtures::tables::auto_sales: Performance improvement with cache: 42.80%
test test_duckdb_object_cache_performance ... ok

Thank you!

@philippemnoel
Copy link
Collaborator Author

Hi @philippemnoel,

Could you please reassign the issue to me?

  • I've completed a draft version, but this implementation depends on PR#115, so I’m waiting for it to be reviewed and merged before proceeding further.
  • You can find the draft prototype here: link to draft prototype.
  • Please kindly find the traces for reference.
2024-09-02 22:27:30.592  INFO test_mlp_auto_sales::fixtures::tables::auto_sales: Completed data upload to S3
2024-09-02 22:27:30.593  INFO sqlx::postgres::notice: table "auto_sales" does not exist, skipping
2024-09-02 22:27:30.593  INFO sqlx::postgres::notice: server "auto_sales_server" does not exist, skipping
2024-09-02 22:27:30.593  INFO sqlx::postgres::notice: foreign-data wrapper "parquet_wrapper" does not exist, skipping
2024-09-02 22:27:30.594  INFO sqlx::postgres::notice: server "auto_sales_server" does not exist, skipping
2024-09-02 22:27:43.928  INFO test_mlp_auto_sales::fixtures::tables::auto_sales: Average execution time with cache disabled: 479.805658ms
2024-09-02 22:27:43.928  INFO test_mlp_auto_sales::fixtures::tables::auto_sales: Average execution time with cache enabled: 272.384816ms
2024-09-02 22:27:43.928  INFO test_mlp_auto_sales::fixtures::tables::auto_sales: Average execution time after disabling cache: 476.185084ms
2024-09-02 22:27:43.928  INFO test_mlp_auto_sales::fixtures::tables::auto_sales: Performance improvement with cache: 42.80%
test test_duckdb_object_cache_performance ... ok

Thank you!

Done! Really exciting stuff, I had removed you since you were still working on the partition PR. Excited to see it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment