Empty strings in CSV files aren't being interpreted as null when using a Dictionary(_, Utf8) #12041

Open
rumpuslabs opened this issue Aug 17, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@rumpuslabs

Describe the bug

Related to #7797

Empty strings in CSV files aren't being interpreted as null when using a Dictionary(_, Utf8)

To Reproduce

Create a simple input.csv file like this:

id,name
1,
2,bob

Run the following code:

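use std::sync::Arc;

use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::datasource::file_format::csv::CsvFormat;
use datafusion::datasource::listing::ListingOptions;
use datafusion::error::DataFusionError;
use datafusion::prelude::*;

#[tokio::main]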
async fn main() -> Result<(), DataFusionError> {
    let ctx = SessionContext::new();

    let format = CsvFormat::default();
    let listing_options = ListingOptions::new(Arc::new(format));
    ctx.register_listing_table(
        "input",
        "input.csv",
        listing_options.clone(),
        Some(Arc::new(Schema::new(vec![
            Field::new("id", DataType::Utf8, false),
            Field::new(
                "name",
                DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8)),
                true,
            ),
        ]))),
        None,
    )
    .await?;

    let results = ctx
        .table("input")
        .await?
        .filter(col("name").is_not_null())?
        .collect()
        .await?;

    let pretty_results = arrow::util::pretty::pretty_format_batches(&results)?.to_string();

    println!("{}", pretty_results);

    Ok(())
}

Expected behavior

I was expecting the output to look like this:

+----+------+
| id | name |
+----+------+
| 2  | bob  |
+----+------+

But the full dataset is returned instead:

+----+------+
| id | name |
+----+------+
| 1  |      |
| 2  | bob  |
+----+------+

Additional context

Tested on v41.0.0

Replacing DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8)) with DataType::Utf8 in the schema produces the expected output.
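
To make the expected semantics concrete, here is a minimal standalone sketch of a dictionary-encoded column in which an empty CSV field becomes a null key rather than an empty-string dictionary value. It uses only the standard library; `DictColumn`, `from_csv_fields`, `non_null_rows`, and `get` are hypothetical names made up for illustration and are not the arrow or DataFusion API. It does not claim to show where the actual bug lives, only the behavior I would expect from the reader.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a Dictionary(UInt8, Utf8) column:
/// `keys[i]` indexes into `values`, and `None` marks a null row.
#[derive(Debug, Default)]
struct DictColumn {
    keys: Vec<Option<u8>>,
    values: Vec<String>,
}

impl DictColumn {
    /// Encode raw CSV fields, mapping empty fields to null keys
    /// instead of storing "" as a dictionary value.
    fn from_csv_fields(fields: &[&str]) -> Self {
        let mut col = DictColumn::default();
        let mut seen: HashMap<String, u8> = HashMap::new();
        for field in fields {
            if field.is_empty() {
                col.keys.push(None);
                continue;
            }
            let key = match seen.get(*field) {
                Some(k) => *k,
                None => {
                    // Sketch only: assumes fewer than 256 distinct values.
                    let k = col.values.len() as u8;
                    col.values.push(field.to_string());
                    seen.insert(field.to_string(), k);
                    k
                }
            };
            col.keys.push(Some(key));
        }
        col
    }

    /// Row indices whose key is not null, mirroring an `is_not_null()` filter.
    fn non_null_rows(&self) -> Vec<usize> {
        self.keys
            .iter()
            .enumerate()
            .filter(|(_, k)| k.is_some())
            .map(|(i, _)| i)
            .collect()
    }

    /// Look up the string for a row, or `None` if the row is null.
    fn get(&self, row: usize) -> Option<&str> {
        self.keys[row].map(|k| self.values[k as usize].as_str())
    }
}

fn main() {
    // The "name" column from input.csv: row 1 is empty, row 2 is "bob".
    let col = DictColumn::from_csv_fields(&["", "bob"]);
    for row in col.non_null_rows() {
        println!("{:?}", col.get(row)); // prints Some("bob")
    }
}
```

With this encoding the empty field never enters `values`, so a not-null filter keeps only the "bob" row, which matches the output I get with DataType::Utf8.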

@rumpuslabs rumpuslabs added the bug Something isn't working label Aug 17, 2024
@edmondop
Contributor

take

@edmondop
Contributor

@alamb shouldn't the CSV reader also throw an error because "bob" is not a valid dictionary?
