Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support custom partitioning schemes based on "patterns" in the OPTIONS files parameter of CREATE FOREIGN TABLE #79

Open
pdpark opened this issue Aug 14, 2024 · 4 comments
Labels
feature New feature or request priority-low Low priority issue question Further information is requested user-request This issue was directly requested by a user

Comments

@pdpark
Copy link

pdpark commented Aug 14, 2024

What feature are you requesting?

The ability to specify a custom partitioning scheme through the use of a pattern in the files option when creating foreign tables, like this:

CREATE FOREIGN TABLE my_table ()
SERVER parquet_server
OPTIONS (files 's3://bucket/data_{id_1}_{id_2}.parquet')

Why are you requesting this feature?

To support existing custom partitioning scheme.

What is your proposed implementation for this feature?

Foreign tables could be created like this:

CREATE FOREIGN TABLE my_table ()
SERVER parquet_server
OPTIONS (files 's3://bucket/data_{id_1}_{id_2}.parquet')

...or this:

CREATE FOREIGN TABLE my_table ()
SERVER parquet_server
OPTIONS (files 's3://bucket/data_{id_1}_*.parquet')

The values in brackets must correspond with column names defined in the referenced parquet files or the statement will fail.

When running a query like this on the first table defined above:

select *
from my_table
where id_1 = '1234'
and id_2 = '0987'

...the id_1 and id_2 column values from the sql where clause will be substituted into the files pattern producing a string that must correspond with an actual parquet file at the specified s3 location:

s3://bucket/data_1234_0987.parquet

A query on the second table table defined above:

select *
from my_table
where id_1 = '1234'

...will produce a files pattern after substitution that looks like this:

s3://bucket/data_1234_*.parquet

Full Name:

Patrick Park

Affiliation:

Payzer

@philippemnoel philippemnoel added feature New feature or request good first issue Good for newcomers priority-high High priority issue labels Aug 23, 2024
@Weijun-H
Copy link
Contributor

Weijun-H commented Sep 22, 2024

It seems that we could introduce hive_partitioning setting to fix this ticket.

@philippemnoel
Copy link
Collaborator

philippemnoel commented Sep 22, 2024

It seems that we could introduce hive_partitioning setting to fix this ticket.

@shamb0 has made a PR to document hive partitioned, we just need to review and merge it. As for custom partitioning scheme that is not Hive, I'm not convinced we want to expose that as it is probably an edge case. Unless you have an idea that we haven't considered

EDIT: We're still open to considering this, but are waiting for more user requests

@philippemnoel philippemnoel added priority-low Low priority issue question Further information is requested user-request This issue was directly requested by a user and removed good first issue Good for newcomers priority-high High priority issue labels Sep 22, 2024
@Weijun-H
Copy link
Contributor

It seems that we could introduce hive_partitioning setting to fix this ticket.

@shamb0 has made a PR to document hive partitioned, we just need to review and merge it. As for custom partitioning scheme that is not Hive, I'm not convinced we want to expose that as it is probably an edge case. Unless you have an idea that we haven't considered

I see, it makes sense to me.
Btw we should add an example in https://docs.paradedb.com/ingest/import/parquet#parquet-options for the hive partitioned

@philippemnoel
Copy link
Collaborator

It seems that we could introduce hive_partitioning setting to fix this ticket.

@shamb0 has made a PR to document hive partitioned, we just need to review and merge it. As for custom partitioning scheme that is not Hive, I'm not convinced we want to expose that as it is probably an edge case. Unless you have an idea that we haven't considered

I see, it makes sense to me. Btw we should add an example in https://docs.paradedb.com/ingest/import/parquet#parquet-options for the hive partitioned

Agreed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request priority-low Low priority issue question Further information is requested user-request This issue was directly requested by a user
Projects
None yet
Development

No branches or pull requests

3 participants