Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request for support for struct and arry data types #3617

Closed
kesavkolla opened this issue Sep 26, 2022 · 14 comments
Closed

Feature request for support for struct and arry data types #3617

kesavkolla opened this issue Sep 26, 2022 · 14 comments
Labels
enhancement New feature or request

Comments

@kesavkolla
Copy link

datafusion doesn't support all possible data types the arrow supports. What is the roadmap for supporting for structs, lists etc...? It would be good to support some pushdowns to the complex data to arrow.

@alamb
Copy link
Contributor

alamb commented Sep 28, 2022

You may find more information on #2326 -- it would be great to get some idea of what you are trying to do / what datafusion can't do for you today

@kesavkolla
Copy link
Author

I want to be able to specify so.e kind of path expression like select a.b.c for nested structs. For list types also some notation to access the index.

My data is a heavy nested and list structs. Currently I can't query them individual fields can't use nested columns in filters.

@alamb
Copy link
Contributor

alamb commented Oct 6, 2022

I want to be able to specify so.e kind of path expression like select a.b.c for nested structs. For list types also some notation to access the index.

Have you tried the [] syntax?

Like struct_column["b"]["c"] for nested structs and list_column[3] for list access? I was pleasantly surprised that it worked when I last tried it

@kesavkolla
Copy link
Author

I tried for list list_column[0] it didn't work.

I get following exception:

thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: ArrowError(ComputeError("concat requires input of at least one array"))', /home/kesav/.cargo/git/checkouts/arrow-datafusion-71ae82d9dec9a01c/8df5496/datafusion/common/src/scalar.rs:1383:18

@ahmedriza
Copy link
Contributor

If we take the Parquet provided by @kesavkolla, we have the following column, text whose Parquet schema is:

 |-- text: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- extension: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- status: string (nullable = true)
 |    |-- div: string (nullable = true)

and sample data:

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| text                                                                                                                                                                                   |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                                                        |
| {"id": null, "extension": null, "status": "generated", ...

I tried the following SQL to select one of the fields in the struct:

ctx.register_parquet("t", "t.parquet", ParquetReadOptions::default()).await?;
ctx.sql("select t.text['id'] from t").await?;

However, this resulted in the following error:

Error: Arrow error: External error: Execution error: Job zlH3pzz failed: Error planning job zlH3pzz: DataFusionError(Internal("physical_plan::to_proto() unsupported expression GetIndexedFieldExpr { arg: Column { name: \"text\", index: 0 }, key: Utf8(\"id\") }"))

Looking at datafusion/proto/src/physical_plan/to_proto.rs it does appear that this is not supported at present. Or perhaps I have made a mistake in my SQL?

@alamb
Copy link
Contributor

alamb commented Feb 15, 2023

If we take the Parquet provided by @kesavkolla, we have the following column, text whose Parquet schema is:

Hi @ahmedriza -- I am not sure what your system is doing exactly, but that error appears to be related to protobuf serialization

I looked at to_proto and it seems like it has the right code
https://github.com/apache/arrow-datafusion/blob/d05647c65e14d865b854a845ae970797a6086e2c/datafusion/proto/src/logical_plan/to_proto.rs#L859C30-L867

Could you share the file you are using on this ticket so I can give it a try? Maybe we have fixed this in another version of DataFusion

@ahmedriza
Copy link
Contributor

ahmedriza commented Feb 15, 2023

@alamb Apologies, I should have been more clear. The Parquet file mentioned was in #2439. Attaching here as well:
part-00000-f6337bce-7fcd-4021-9f9d-040413ea83f8-c000.snappy.parquet.zip

The above mentioned error was when I ran the SQL from ballista and I've checked that ballista on the master branch is currently using datafusion version 18.0.0.

Hence, I just wrote two little tests, using the datafusion and the ballista context respectively.

SQL from the datafusion context works, whilst the one that uses the ballista context fails. Test code:

#[tokio::test]
async fn test_datafusion_sql() {
    let ctx = SessionContext::new();
    let filename = "part-00000-f6337bce-7fcd-4021-9f9d-040413ea83f8-c000.snappy.parquet";
    ctx.register_parquet("t", filename, ParquetReadOptions::default()).await.unwrap();
    let df = ctx.sql("select t.text['status'] from t").await.unwrap();
    df.show().await.unwrap();
}

Output:

+----------------+
| t.text[status] |
+----------------+
|                |
| generated      |
| generated      |
| generated      |
| generated      |
| generated      |
| generated      |
| generated      |
| generated      |
| generated      |
+----------------+
#[tokio::test]
async fn test_ballista_sql() {
    let config = BallistaConfig::builder().build().unwrap();
    let ctx = BallistaContext::standalone(&config, 10).await.unwrap();
    let filename = "part-00000-f6337bce-7fcd-4021-9f9d-040413ea83f8-c000.snappy.parquet";
    ctx.register_parquet("t", filename, ParquetReadOptions::default()).await.unwrap();
    let df = ctx.sql("select t.text['status'] from t").await.unwrap();
    df.show().await.unwrap();
}

Output:

thread 'query::test::test_ballista_sql' panicked at 'called `Result::unwrap()` on an `Err` value: ArrowError(ExternalError(Execution("Job QeRwZCh failed: Error planning job QeRwZCh: DataFusionError(Internal(\"physical_plan::to_proto() unsupported expression GetIndexedFieldExpr { arg: Column { name: \\\"text\\\", index: 0 }, key: Utf8(\\\"status\\\") }\"))")))', src/query.rs:44:25

Please note that the error is coming from datafusion/proto/src/physical_plan/to_proto.rs and not the logical_plan.

Here are the relevant parts of my Cargo.toml.

ballista = { git = "https://github.com/apache/arrow-ballista", features = ["s3"] }
ballista-cli = { git = "https://github.com/apache/arrow-ballista", features = ["s3"] }
ballista-core = { git = "https://github.com/apache/arrow-ballista", features = ["s3"] }
datafusion = "18.0.0"

futures = "0.3"
object_store = "0.5"
tokio = { version = "1", features = ["full"] }

@alamb
Copy link
Contributor

alamb commented Feb 16, 2023

Yeah, it seems to work just fine for me in datafusion-cli. Thus I think we should close this ticket in datafusion. I am not sure what is going on with ballista.

alamb@MacBook-Pro-8:~/Downloads$ datafusion-cli
DataFusion CLI v18.0.0
❯ select text['status'] from 'part-00000-f6337bce-7fcd-4021-9f9d-040413ea83f8-c000.snappy.parquet' limit 10;
+----------------------------------------------------------------------------------+
| part-00000-f6337bce-7fcd-4021-9f9d-040413ea83f8-c000.snappy.parquet.text[status] |
+----------------------------------------------------------------------------------+
|                                                                                  |
| generated                                                                        |
| generated                                                                        |
| generated                                                                        |
| generated                                                                        |
| generated                                                                        |
| generated                                                                        |
| generated                                                                        |
| generated                                                                        |
| generated                                                                        |
+----------------------------------------------------------------------------------+
10 rows in set. Query took 0.010 seconds.

@alamb alamb closed this as completed Feb 16, 2023
@alamb alamb reopened this Feb 16, 2023
@alamb
Copy link
Contributor

alamb commented Feb 16, 2023

No I take it back, what we should really do is probably start categorizing what works and what doesn't for these structureed types. Like array access via ["foo"] works, but not list access, for example.

@ahmedriza
Copy link
Contributor

ahmedriza commented Feb 16, 2023

Yeah, it's a good idea to document the features that already work for nested types as there are several cases that work already.

I'll take a look to see if I can find out where ballista is going wrong, although, of course, ultimately, the call ends in datafusion.

Looks like the protobuf serialisation is missing in ballista (for the GetIndexedFieldExpr), which is specific to ballista from what I can read. Hopefully an easy fix.

@alamb
Copy link
Contributor

alamb commented Feb 16, 2023

Thanks @ahmedriza !

@ahmedriza
Copy link
Contributor

ahmedriza commented Feb 17, 2023

Raised #5324 that will fix the issue with ballista.

@alamb
Copy link
Contributor

alamb commented Jun 23, 2023

#2326 is tracking such support

@alamb
Copy link
Contributor

alamb commented Jun 19, 2024

I think we have significant support now for structs / lists so closing this down in favor of more specific asks / feature requests

@alamb alamb closed this as completed Jun 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants