
Potential regression in Schema / nullability calculations after upgrade to 42.0.0 #12560

Closed
alamb opened this issue Sep 20, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@alamb
Contributor

alamb commented Sep 20, 2024

Describe the bug

@phillipleblanc and @itsjunetime have both hit issues related to nullability and other metadata in schemas after the DataFusion 42 upgrade.

In addition, @ion-elgreco has hit something similar while upgrading delta.rs (see delta-io/delta-rs#2886)

I am filing this ticket to give this more visibility

To Reproduce

Not sure (maybe someone could create a self-contained reproducer of the problem)
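For anyone attempting one, here is a minimal sketch of what such a reproducer might look like, assuming the symptom described in the comments below (field metadata present in the logical schema but missing from the physical one). The table name, column name, and metadata key are made up for illustration:

use std::collections::HashMap;
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, StringArray};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // A field carrying custom metadata that we expect to survive planning
    let field = Field::new("c1", DataType::Utf8, true).with_metadata(HashMap::from([(
        "some_key".to_string(),
        "some_value".to_string(),
    )]));
    let schema = Arc::new(Schema::new(vec![field]));
    let column: ArrayRef = Arc::new(StringArray::from(vec![Some("a"), None]));
    let batch = RecordBatch::try_new(schema, vec![column])?;

    let ctx = SessionContext::new();
    ctx.register_batch("t", batch)?;

    // Compare the field metadata on the logical schema vs the physical plan schema
    let df = ctx.sql("SELECT c1 AS alias1 FROM t").await?;
    println!("logical schema:  {:?}", df.schema());

    let physical_plan = df.create_physical_plan().await?;
    println!("physical schema: {:?}", physical_plan.schema());
    Ok(())
}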

Expected behavior

No response

Additional context

This might have been introduced here: #11989

There is a discussion happening here: #11989 (comment)

@alamb added the bug (Something isn't working) label on Sep 20, 2024
@itsjunetime
Contributor

I'm running into this behavior after #11989, specifically seeing schema mismatches where the only difference is that a field's metadata disappears at some point (the schemas are otherwise identical). E.g.:

&physical_input_schema = Schema {
    fields: [
        Field {
            name: "alias1",
            data_type: Utf8,
            nullable: true,
            dict_id: 0,
            dict_is_ordered: false,
            metadata: {},
        },
    ],
    metadata: {},
}
&physical_input_schema_from_logical = Schema {
    fields: [
        Field {
            name: "alias1",
            data_type: Utf8,
            nullable: true,
            dict_id: 0,
            dict_is_ordered: false,
            metadata: {
                "some_key": "some_value"
            },
        },
    ],
    metadata: {},
}

I've yet to figure out exactly where the metadata is being dropped, and I haven't figured out a reproducer either. I suggested comparing only the fields' non-metadata properties here, but @jayzhan211 pointed out that that's more of a workaround than an actual fix, as it's still a problem if the metadata is disappearing.
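For illustration, the comparison I suggested would look roughly like the sketch below (illustrative only, not the check DataFusion actually performs): treat two schemas as equal if they match on everything except per-field metadata.

use datafusion::arrow::datatypes::Schema;

// Sketch: schema equality that ignores per-field metadata
fn schemas_equal_ignoring_field_metadata(a: &Schema, b: &Schema) -> bool {
    a.fields().len() == b.fields().len()
        && a.fields().iter().zip(b.fields().iter()).all(|(fa, fb)| {
            fa.name() == fb.name()
                && fa.data_type() == fb.data_type()
                && fa.is_nullable() == fb.is_nullable()
        })
}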

The issue that I'm running into, though, seems to be somewhat different from the one that others (like @phillipleblanc) are running into, where some fields completely disappear from the schema (see here). I don't think these are exactly the same issue (since they manifest differently), but they may share the same root cause/solution, so I think it's fair to keep them all under this issue unless we need to split them.

I'll work on getting a fix or a reproducer today.

@hveiga
Contributor

hveiga commented Sep 24, 2024

Just to add visibility: we are also observing the same behavior after updating to 42.0.0. This is the error message we are getting, to help with debugging:

Caused by:
    0: Arrow error: Invalid argument error: Column 'column1' is declared as non-nullable but contains null values
    1: Invalid argument error: Column 'column1' is declared as non-nullable but contains null values

We are getting this error when reading some Parquet files using DataFusion. I have verified with other tools (the parquet CLI, DuckDB) that column1 does not contain any null values and that the metadata of the files is correct.
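For context, this looks like the validation error that arrow-rs raises when a RecordBatch is built with nulls in a column whose field is declared non-nullable. A standalone sketch of that general check (not the DataFusion read path; the column name is only for illustration):

use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Int64Array};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;

fn main() {
    // Field declared non-nullable, but the column data contains a null
    let schema = Arc::new(Schema::new(vec![Field::new(
        "column1",
        DataType::Int64,
        false,
    )]));
    let column: ArrayRef = Arc::new(Int64Array::from(vec![Some(1), None, Some(3)]));

    // Fails with: Invalid argument error: Column 'column1' is declared as
    // non-nullable but contains null values
    let result = RecordBatch::try_new(schema, vec![column]);
    assert!(result.is_err());
    println!("{:?}", result.unwrap_err());
}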

@jayzhan211
Contributor

jayzhan211 commented Sep 24, 2024

Is there any small parquet file that produces the same error? If we can reproduce the error, it is easier to find the root cause.

@hveiga
Contributor

hveiga commented Sep 24, 2024

We have been digging more into this error and it seems it is not related to the DataFusion upgrade; I apologize for the confusion. Therefore, there's no file I can provide :(

@alamb
Contributor Author

alamb commented Sep 30, 2024

We have also been digging into what we saw in InfluxDB 3.0 and I filed #12687 to track it separately. Let's close this omnibus issue and file individual issues for specific problems as we find them.

@alamb closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 30, 2024