
Allow DynamicFileCatalog to support querying partitioned files #12671

Closed
goldmedal opened this issue Sep 29, 2024 · 2 comments · Fixed by #12683
Labels: enhancement (New feature or request)

@goldmedal (Contributor)

Is your feature request related to a problem or challenge?

#11035 added support for querying files through their URLs. If the target dataset is partitioned, DynamicFileCatalog can't recognize the partition columns.
Given a file structure like:

partitioned_table
     c_date=2018-11-13
         data.csv 
     c_date=2018-12-13
         data.csv

If we try to query it through the dynamic file catalog:

    // Reproduction: query the partitioned directory through the dynamic file catalog.
    use datafusion::error::Result;
    use datafusion::execution::session_state::SessionStateBuilder;
    use datafusion::prelude::{SessionConfig, SessionContext};

    #[tokio::main]
    async fn main() -> Result<()> {
        let sql = "SELECT * FROM 'datafusion/core/tests/data/partitioned_table'";
        let session_config =
            SessionConfig::new().set_str("datafusion.catalog.has_header", "false");
        let state = SessionStateBuilder::default()
            .with_default_features()
            .with_config(session_config)
            .build();
        let ctx = SessionContext::new_with_state(state).enable_url_table();
        ctx.sql(sql).await?.show().await?;
        Ok(())
    }

The result is

+----------+-------------------------+
| column_1 | column_2                |
+----------+-------------------------+
| Jorge    | 2018-12-13T12:12:10.011 |
| Andrew   | 2018-11-13T17:11:10.011 |
| Jorge    | 2018-12-13T12:12:10.011 |
| Andrew   | 2018-11-13T17:11:10.011 |
+----------+-------------------------+

The partition column c_date is not included in the result.

Describe the solution you'd like

When inferring the ListingTableConfig, we can register the table partition columns automatically. I think we can invoke ListingOptions::infer_partitions to infer the required partition columns at runtime, as sketched below.
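
A rough sketch of the idea (my own illustration, not necessarily what #12683 ends up doing): when the dynamic file catalog builds the ListingTableConfig for a raw path, infer the key=value partition directories and register them on the listing options before inferring the schema. The infer_partitions call below follows the proposal above; its exact name, signature, and return type are assumptions, and the partition column types are defaulted to Utf8 for simplicity.

    use std::sync::Arc;

    use datafusion::arrow::datatypes::DataType;
    use datafusion::datasource::listing::{
        ListingTable, ListingTableConfig, ListingTableUrl,
    };
    use datafusion::error::Result;
    use datafusion::execution::session_state::SessionState;

    // Sketch: build a ListingTable for a raw path, registering inferred
    // partition columns. `infer_partitions` is the assumed helper.
    async fn url_to_listing_table(
        state: &SessionState,
        url: &str,
    ) -> Result<Arc<ListingTable>> {
        let table_url = ListingTableUrl::parse(url)?;

        // Infer the file format and listing options from the path, as the
        // dynamic file catalog already does today.
        let config = ListingTableConfig::new(table_url.clone())
            .infer_options(state)
            .await?;

        // Assumed step: discover `key=value` directories under the table path
        // and register them as partition columns (types defaulted to Utf8).
        let options = config.options.clone().expect("options inferred above");
        let partition_cols = options.infer_partitions(state, &table_url).await?;
        let options = options.with_table_partition_cols(
            partition_cols
                .into_iter()
                .map(|col| (col, DataType::Utf8))
                .collect(),
        );

        let config = config
            .with_listing_options(options)
            .infer_schema(state)
            .await?;
        Ok(Arc::new(ListingTable::try_new(config)?))
    }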

Describe alternatives you've considered

No response

Additional context

No response

goldmedal added the enhancement (New feature or request) label on Sep 29, 2024
@goldmedal (Contributor, Author)

take

@goldmedal (Contributor, Author)

There is a test case asserting that querying a partitioned table fails:

# dynamic file query doesn't support partitioned table
statement error DataFusion error: Error during planning: table 'datafusion.public.../core/tests/data/partitioned_table_arrow' not found
SELECT * FROM '../core/tests/data/partitioned_table_arrow' ORDER BY f0;

However, I found that it's not actually a partitioned-table-related issue. The root cause is that the input files, partitioned_table_arrow/part=123/data.arrow and partitioned_table_arrow/part=456/data.arrow, have different Arrow schemas:

{
   "index_columns":[
      {
         "kind":"range",
         "name":null,
         "start":0,
         "stop":2,
         "step":1
      }
   ],
   "column_indexes":[
      {
         "name":null,
         "field_name":null,
         "pandas_type":"unicode",
         "numpy_type":"object",
         "metadata":{
            "encoding":"UTF-8"
         }
      }
   ],
   "columns":[
      {
         "name":"f0",
         "field_name":"f0",
         "pandas_type":"int64",
         "numpy_type":"int64",
         "metadata":null
      },
      {
         "name":"f1",
         "field_name":"f1",
         "pandas_type":"unicode",
         "numpy_type":"object",
         "metadata":null
      },
      {
         "name":"f2",
         "field_name":"f2",
         "pandas_type":"bool",
         "numpy_type":"object",
         "metadata":null
      }
   ],
   "creator":{
      "library":"pyarrow",
      "version":"15.0.0"
   },
   "pandas_version":"2.2.1"
}

and

{
   "index_columns":[
      {
         "kind":"range",
         "name":null,
         "start":0,
         "stop":2,
         "step":1
      }
   ],
   "column_indexes":[
      {
         "name":null,
         "field_name":null,
         "pandas_type":"unicode",
         "numpy_type":"object",
         "metadata":{
            "encoding":"UTF-8"
         }
      }
   ],
   "columns":[
      {
         "name":"f0",
         "field_name":"f0",
         "pandas_type":"int64",
         "numpy_type":"int64",
         "metadata":null
      },
      {
         "name":"f1",
         "field_name":"f1",
         "pandas_type":"unicode",
         "numpy_type":"object",
         "metadata":null
      },
      {
         "name":"f2",
         "field_name":"f2",
         "pandas_type":"bool",
         "numpy_type":"bool",
         "metadata":null
      }
   ],
   "creator":{
      "library":"pyarrow",
      "version":"15.0.0"
   },
   "pandas_version":"2.2.1"
}

The numpy_type of f2 differs between them: one is bool and the other is object. This causes an ArrowError when merging the schemas at

let merged_schema = Schema::try_merge(schemas)?;
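
To make the failure mode concrete, here is a minimal standalone example (not from the issue; the JSON payloads are shortened stand-ins for the embedded pyarrow "pandas" blobs above) showing that Schema::try_merge rejects schemas whose schema-level metadata values conflict for the same key:

    use std::collections::HashMap;

    use arrow_schema::{DataType, Field, Schema};

    fn main() {
        // Same fields, but conflicting values for the same schema-level
        // metadata key (standing in for pyarrow's "pandas" blob).
        let schema_a = Schema::new_with_metadata(
            vec![Field::new("f2", DataType::Boolean, true)],
            HashMap::from([(
                "pandas".to_string(),
                r#"{"numpy_type":"object"}"#.to_string(),
            )]),
        );
        let schema_b = Schema::new_with_metadata(
            vec![Field::new("f2", DataType::Boolean, true)],
            HashMap::from([(
                "pandas".to_string(),
                r#"{"numpy_type":"bool"}"#.to_string(),
            )]),
        );

        // try_merge rejects schemas whose metadata values conflict for a key.
        let merged = Schema::try_merge(vec![schema_a, schema_b]);
        assert!(merged.is_err());
    }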

If we create an external table for them instead, we can query them without any problem:

CREATE EXTERNAL TABLE arrow_partitioned (
part Int,
f0 Bigint,
f1 String,
f2 Boolean
)
STORED AS ARROW
LOCATION '../core/tests/data/partitioned_table_arrow/'
PARTITIONED BY (part);

I guess it may be an issue with ArrowFormat::infer_schema or arrow_schema::Schema::try_merge 🤔.
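
For what it's worth, one way to sidestep this particular conflict (a sketch of an idea, not what ArrowFormat::infer_schema currently does) would be to drop per-file schema-level metadata before merging:

    use arrow_schema::{ArrowError, Schema};

    // Sketch: merge per-file schemas after stripping their schema-level
    // metadata, so embedded blobs like pyarrow's "pandas" key cannot conflict.
    // Field-level metadata is left untouched.
    fn merge_ignoring_schema_metadata(
        schemas: impl IntoIterator<Item = Schema>,
    ) -> Result<Schema, ArrowError> {
        Schema::try_merge(
            schemas
                .into_iter()
                .map(|s| Schema::new(s.fields().clone())),
        )
    }
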
@alamb WDYT?
