feat(planner): Allowing setting sort order of parquet files without specifying the schema #12466
Original file line number | Diff line number | Diff line change | ||||||
---|---|---|---|---|---|---|---|---|
|
@@ -228,3 +228,50 @@ OPTIONS ( | |||||||
format.delimiter '|', | ||||||||
has_header false, | ||||||||
compression gzip); | ||||||||
|
||||||||
# Create an external parquet table and infer schema to order by | ||||||||
|
||||||||
# query should succeed | ||||||||
statement ok | ||||||||
CREATE EXTERNAL TABLE t STORED AS parquet LOCATION '../../parquet-testing/data/alltypes_plain.parquet' WITH ORDER (id); | ||||||||
Review comment: Can you also add a test that shows the table is actually ordered correctly? Reply: Can do |
||||||||
|
||||||||
## Verify that the table is created with a sort order. Explain should show output_ordering=[id@0 ASC NULLS LAST] | ||||||||
query TT | ||||||||
EXPLAIN SELECT id FROM t ORDER BY id ASC; | ||||||||
---- | ||||||||
logical_plan | ||||||||
01)Sort: t.id ASC NULLS LAST | ||||||||
02)--TableScan: t projection=[id] | ||||||||
physical_plan ParquetExec: file_groups={1 group: [[WORKSPACE_ROOT/parquet-testing/data/alltypes_plain.parquet]]}, projection=[id], output_ordering=[id@0 ASC NULLS LAST] | ||||||||
|
||||||||
## Test a DESC order and verify that output_ordering is ASC from the previous ORDER BY | ||||||||
query TT | ||||||||
EXPLAIN SELECT id FROM t ORDER BY id DESC; | ||||||||
---- | ||||||||
logical_plan | ||||||||
01)Sort: t.id DESC NULLS FIRST | ||||||||
02)--TableScan: t projection=[id] | ||||||||
physical_plan | ||||||||
01)SortExec: expr=[id@0 DESC], preserve_partitioning=[false] | ||||||||
02)--ParquetExec: file_groups={1 group: [[WORKSPACE_ROOT/parquet-testing/data/alltypes_plain.parquet]]}, projection=[id], output_ordering=[id@0 ASC NULLS LAST] | ||||||||
|
||||||||
statement ok | ||||||||
DROP TABLE t; | ||||||||
|
||||||||
# Create table with non default sort order | ||||||||
statement ok | ||||||||
CREATE EXTERNAL TABLE t STORED AS parquet LOCATION '../../parquet-testing/data/alltypes_plain.parquet' WITH ORDER (id DESC NULLS FIRST); | ||||||||
|
||||||||
## Verify that the table is created with a sort order. Explain should show output_ordering=[id@0 DESC NULLS FIRST] | ||||||||
Review comment: I think this test shows one small bug (the output ordering is […]). I think this is due to the fact that this PR computes nulls first like this: `let nulls_first = ordered_expr.nulls_first.unwrap_or(true);` But the SQL planner computes it like this: datafusion/datafusion/sql/src/expr/order_by.rs, lines 105 to 107 in 94d178e
|
||||||||
query TT | ||||||||
EXPLAIN SELECT id FROM t; | ||||||||
---- | ||||||||
logical_plan TableScan: t projection=[id] | ||||||||
physical_plan ParquetExec: file_groups={1 group: [[WORKSPACE_ROOT/parquet-testing/data/alltypes_plain.parquet]]}, projection=[id], output_ordering=[id@0 DESC] | ||||||||
|
||||||||
statement ok | ||||||||
DROP TABLE t; | ||||||||
|
||||||||
# query should fail with bad column | ||||||||
statement error DataFusion error: Error during planning: Column foo is not in schema | ||||||||
CREATE EXTERNAL TABLE t STORED AS parquet LOCATION '../../parquet-testing/data/alltypes_plain.parquet' WITH ORDER (foo); | ||||||||
Review comment: Another reason this will fail is that there is already a table named […] Reply: 👍 |
This looks great. Thank you @devanbenz -- perfect
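
The nulls-ordering mismatch raised in the review comment on the `DESC NULLS FIRST` test can be sketched in a few lines of Rust. This is only an illustration, not the code from the PR or from `order_by.rs`: both function names are hypothetical, and the direction-dependent default in the second function is an assumption inferred from the logical plans in the tests above (`ASC NULLS LAST`, `DESC NULLS FIRST`), since the referenced planner lines are not shown here.

```rust
/// Hypothetical helper mirroring the expression quoted in the review comment:
/// a missing NULLS clause always becomes NULLS FIRST, regardless of direction.
fn nulls_first_fixed_default(nulls_first: Option<bool>) -> bool {
    nulls_first.unwrap_or(true)
}

/// Hypothetical helper for a direction-dependent default (an assumption based
/// on the logical plans in the tests): ascending sorts put nulls last,
/// descending sorts put nulls first, unless the user says otherwise.
fn nulls_first_by_direction(asc: Option<bool>, nulls_first: Option<bool>) -> bool {
    // A missing ASC/DESC keyword means ascending.
    let asc = asc.unwrap_or(true);
    // An explicit NULLS FIRST / NULLS LAST always wins over the default.
    nulls_first.unwrap_or(!asc)
}
```

Under these assumptions the two defaults agree for `WITH ORDER (id DESC)` but disagree for a bare `WITH ORDER (id)`: the fixed default yields nulls first, while the direction-dependent default yields nulls last, which matches the `ASC NULLS LAST` shown in the first `EXPLAIN` output.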