Feature request: support for struct and array data types #3617
Comments
You may find more information on #2326 -- it would be great to get some idea of what you are trying to do / what datafusion can't do for you today |
I want to be able to specify some kind of path expression, like select a.b.c, for nested structs. For list types, I would also like some notation to access an element by index. My data consists of heavily nested structs and lists. Currently I can't query their individual fields or use nested columns in filters. |
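To make the request above concrete, here is a minimal, std-only sketch of how a path expression such as a.b[1] could be resolved against nested struct and list values. It is not DataFusion's implementation: the `Value` and `Step` types and the `parse_path` and `resolve` functions are hypothetical names invented for illustration, and zero-based list indexing is an assumption of this sketch rather than a statement about any SQL dialect.

```rust
use std::collections::BTreeMap;

/// Toy nested value standing in for struct and list columns (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Text(String),
    Struct(BTreeMap<String, Value>),
    List(Vec<Value>),
}

/// One step of a path expression: a struct field name or a list index.
#[derive(Debug, Clone, PartialEq)]
enum Step {
    Field(String),
    Index(usize),
}

/// Parse a path such as `a.b[0]` into steps. Returns None on malformed input.
fn parse_path(path: &str) -> Option<Vec<Step>> {
    let mut steps = Vec::new();
    for part in path.split('.') {
        // Split `name[0][1]` into the field name and its index suffixes.
        let (name, mut rest) = match part.find('[') {
            Some(pos) => part.split_at(pos),
            None => (part, ""),
        };
        if name.is_empty() {
            return None;
        }
        steps.push(Step::Field(name.to_string()));
        while !rest.is_empty() {
            if !rest.starts_with('[') {
                return None;
            }
            let close = rest.find(']')?;
            let index = rest[1..close].parse::<usize>().ok()?;
            steps.push(Step::Index(index));
            rest = &rest[close + 1..];
        }
    }
    Some(steps)
}

/// Walk the steps from the root value. Returns None if a field or index is
/// missing, or if a step does not match the kind of value it is applied to.
fn resolve<'a>(root: &'a Value, steps: &[Step]) -> Option<&'a Value> {
    let mut current = root;
    for step in steps {
        current = match (current, step) {
            (Value::Struct(fields), Step::Field(name)) => fields.get(name)?,
            (Value::List(items), Step::Index(i)) => items.get(*i)?,
            _ => return None,
        };
    }
    Some(current)
}

/// Build a sample row shaped like {a: {b: [10, 20], c: "ok"}}.
fn sample_row() -> Value {
    let inner = BTreeMap::from([
        ("b".to_string(), Value::List(vec![Value::Int(10), Value::Int(20)])),
        ("c".to_string(), Value::Text("ok".to_string())),
    ]);
    Value::Struct(BTreeMap::from([("a".to_string(), Value::Struct(inner))]))
}

fn main() {
    let row = sample_row();
    let steps = parse_path("a.b[1]").expect("valid path");
    println!("{:?}", resolve(&row, &steps));
}
```

A query engine would apply the same idea per column rather than per row, but the lookup semantics (field by name, element by position, null or error when the path does not match) are the part this issue asks for.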
Have you tried the Like |
For a list I tried list_column[0], but it didn't work. I get the following exception:
|
If we take the Parquet provided by @kesavkolla, we have the following column,
and sample data:
I tried the following SQL to select one of the fields in the
However, this resulted in the following error:
Looking at |
Hi @ahmedriza -- I am not sure what your system is doing exactly, but that error appears to be related to protobuf serialization. I looked at Could you share the file you are using on this ticket so I can give it a try? Maybe we have fixed this in another version of DataFusion |
@alamb Apologies, I should have been clearer. The Parquet file mentioned was in #2439. Attaching here as well: The above-mentioned error was when I ran the SQL from Hence, I just wrote two little tests, using the SQL from the
use datafusion::prelude::{ParquetReadOptions, SessionContext};

#[tokio::test]
async fn test_datafusion_sql() {
    let ctx = SessionContext::new();
    let filename = "part-00000-f6337bce-7fcd-4021-9f9d-040413ea83f8-c000.snappy.parquet";
    // Register the Parquet file as table `t`, then select one field of the `text` struct column.
    ctx.register_parquet("t", filename, ParquetReadOptions::default()).await.unwrap();
    let df = ctx.sql("select t.text['status'] from t").await.unwrap();
    df.show().await.unwrap();
}
Output:
use ballista::prelude::{BallistaConfig, BallistaContext};
// ParquetReadOptions is re-exported from datafusion::prelude.

#[tokio::test]
async fn test_ballista_sql() {
    let config = BallistaConfig::builder().build().unwrap();
    // Run the same query through a standalone Ballista context.
    let ctx = BallistaContext::standalone(&config, 10).await.unwrap();
    let filename = "part-00000-f6337bce-7fcd-4021-9f9d-040413ea83f8-c000.snappy.parquet";
    ctx.register_parquet("t", filename, ParquetReadOptions::default()).await.unwrap();
    let df = ctx.sql("select t.text['status'] from t").await.unwrap();
    df.show().await.unwrap();
}
Output:
Please note that the error is coming from Here are the relevant parts of my
ballista = { git = "https://github.com/apache/arrow-ballista", features = ["s3"] }
ballista-cli = { git = "https://github.com/apache/arrow-ballista", features = ["s3"] }
ballista-core = { git = "https://github.com/apache/arrow-ballista", features = ["s3"] }
datafusion = "18.0.0"
futures = "0.3"
object_store = "0.5"
tokio = { version = "1", features = ["full"] }
|
Yeah, it seems to work just fine for me in datafusion-cli. Thus I think we should close this ticket in datafusion. I am not sure what is going on with ballista.
|
No, I take it back. What we should really do is probably start categorizing what works and what doesn't for these structured types. Like array access via |
Yeah, it's a good idea to document the features that already work for nested types as there are several cases that work already. I'll take a look to see if I can find out where Looks like the |
Thanks @ahmedriza ! |
Raised #5324 that will fix the issue with |
#2326 is tracking such support |
I think we have significant support now for structs / lists so closing this down in favor of more specific asks / feature requests |
DataFusion doesn't support all of the data types that Arrow supports. What is the roadmap for supporting structs, lists, etc.? It would also be good to support some pushdowns on complex data to Arrow.