Allow DynamicFileCatalog support to query partitioned file #12671

goldmedal · 2024-09-29T09:33:58Z

Is your feature request related to a problem or challenge?

#11035 supports to query files through their URL. If the target dataset is partitioned, DynamicFileCatalog can't recognize the partitioned columns well.
Given the file structure like:

partitioned_table
     c_date=2018-11-13
         data.csv 
     c_date=2018-12-13
         data.csv

If we tried to query it through the dynamic file catalog

    let sql = "SELECT * FROM 'datafusion/core/tests/data/partitioned_table'";
    let session_config =
        SessionConfig::new().set_str("datafusion.catalog.has_header", "false");
    let state = SessionStateBuilder::default()
        .with_default_features()
        .with_config(session_config)
        .build();
    let ctx = SessionContext::new_with_state(state).enable_url_table();
    ctx.sql(sql).await?.show().await?;

The result is

+----------+-------------------------+
| column_1 | column_2                |
+----------+-------------------------+
| Jorge    | 2018-12-13T12:12:10.011 |
| Andrew   | 2018-11-13T17:11:10.011 |
| Jorge    | 2018-12-13T12:12:10.011 |
| Andrew   | 2018-11-13T17:11:10.011 |
+----------+-------------------------+

The partitioned column c_date won't be used.

Describe the solution you'd like

When inferring the ListingTableConfig, we can register the table partition column automatically. I think we can invoke ListingOption::infer_partitions to infer the required partition columns at the runtime.

Describe alternatives you've considered

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

goldmedal · 2024-09-29T09:34:44Z

take

goldmedal · 2024-09-29T09:58:09Z

There is a test case to assert querying partitioned table fails

datafusion/datafusion/sqllogictest/test_files/dynamic_file.slt

Lines 28 to 30 in 810e908

    
           # dynamic file query doesn't support partitioned table 
        
           statement error DataFusion error: Error during planning: table 'datafusion.public.../core/tests/data/partitioned_table_arrow' not found 
        
           SELECT * FROM '../core/tests/data/partitioned_table_arrow' ORDER BY f0;

However, I found that it's not a partitioned-table-related issue. The root cause is the input files, partitioned_table_arrow/part=123/data.arrow and partitioned_table_arrow/part=456/data.arrow, have different arrow schema:

{
   "index_columns":[
      {
         "kind":"range",
         "name":null,
         "start":0,
         "stop":2,
         "step":1
      }
   ],
   "column_indexes":[
      {
         "name":null,
         "field_name":null,
         "pandas_type":"unicode",
         "numpy_type":"object",
         "metadata":{
            "encoding":"UTF-8"
         }
      }
   ],
   "columns":[
      {
         "name":"f0",
         "field_name":"f0",
         "pandas_type":"int64",
         "numpy_type":"int64",
         "metadata":null
      },
      {
         "name":"f1",
         "field_name":"f1",
         "pandas_type":"unicode",
         "numpy_type":"object",
         "metadata":null
      },
      {
         "name":"f2",
         "field_name":"f2",
         "pandas_type":"bool",
         "numpy_type":"object",
         "metadata":null
      }
   ],
   "creator":{
      "library":"pyarrow",
      "version":"15.0.0"
   },
   "pandas_version":"2.2.1"
}

and

{
   "index_columns":[
      {
         "kind":"range",
         "name":null,
         "start":0,
         "stop":2,
         "step":1
      }
   ],
   "column_indexes":[
      {
         "name":null,
         "field_name":null,
         "pandas_type":"unicode",
         "numpy_type":"object",
         "metadata":{
            "encoding":"UTF-8"
         }
      }
   ],
   "columns":[
      {
         "name":"f0",
         "field_name":"f0",
         "pandas_type":"int64",
         "numpy_type":"int64",
         "metadata":null
      },
      {
         "name":"f1",
         "field_name":"f1",
         "pandas_type":"unicode",
         "numpy_type":"object",
         "metadata":null
      },
      {
         "name":"f2",
         "field_name":"f2",
         "pandas_type":"bool",
         "numpy_type":"bool",
         "metadata":null
      }
   ],
   "creator":{
      "library":"pyarrow",
      "version":"15.0.0"
   },
   "pandas_version":"2.2.1"
}

Their numpy_type of f2 are different. One is bool and another is object. It causes an ArrowError for merging schema at

datafusion/datafusion/core/src/datasource/file_format/arrow.rs

Line 154 in c4b48d7

let merged_schema = Schema::try_merge(schemas)?;

We can create an external table for them and query them well.

datafusion/datafusion/sqllogictest/test_files/arrow_files.slt

Lines 53 to 61 in a0a635a

    
           CREATE EXTERNAL TABLE arrow_partitioned ( 
        
               part Int, 
        
               f0 Bigint, 
        
               f1 String, 
        
               f2 Boolean 
        
           ) 
        
           STORED AS ARROW 
        
           LOCATION '../core/tests/data/partitioned_table_arrow/' 
        
           PARTITIONED BY (part);

I guess it may be an issue of ArrowFormat::infer_schema or arrow_schema::Schema::try_merge(schemas) 🤔.
@alamb WDYT?

goldmedal added the enhancement New feature or request label Sep 29, 2024

github-actions bot assigned goldmedal Sep 29, 2024

goldmedal mentioned this issue Sep 30, 2024

Allow DynamicFileCatalog support to query partitioned file #12683

Merged

alamb closed this as completed in #12683 Oct 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow DynamicFileCatalog support to query partitioned file #12671

Allow DynamicFileCatalog support to query partitioned file #12671

goldmedal commented Sep 29, 2024

goldmedal commented Sep 29, 2024

goldmedal commented Sep 29, 2024

Allow DynamicFileCatalog support to query partitioned file #12671

Allow DynamicFileCatalog support to query partitioned file #12671

Comments

goldmedal commented Sep 29, 2024

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

goldmedal commented Sep 29, 2024

goldmedal commented Sep 29, 2024