Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Field Metadata Lost on COUNT DISTINCT queries resulting in Internal Error: Physical input schema should be the same as the one converted from logical input schema #12687

Closed
Tracked by #12733 ...
alamb opened this issue Sep 30, 2024 · 6 comments · Fixed by #12691
Assignees
Labels
bug Something isn't working regression Something that used to work no longer does

Comments

@alamb
Copy link
Contributor

alamb commented Sep 30, 2024

Describe the bug

During upgrade of DataFusion in InfluxDB @itsjunetime found that somewhere Field level metadata (aka Field::with_metadata is being lost during DataFusion logical planning

This manifests itself as an error during the physical planning

Internal Error: Physical input schema should be the same as the one converted from logical input schema

@itsjunetime found a workaround in #12631 (which is to ignore the Field metadata). This ticket tracks fixing it for real.

To Reproduce

(we are working on a reproducer)

Expected behavior

No error during physical planning

Additional context

Copying @itsjunetime's description here: #12560 (comment)

I'm running into this behavior after #11989, specifically seeing schema mismatches where the only thing that is different is that a field's metadata disappears at some point (so the schemas are the same except for a field's metadata). E.g.:

&physical_input_schema = Schema {
    fields: [
        Field {
            name: "alias1",
            data_type: Utf8,
            nullable: true,
            dict_id: 0,
            dict_is_ordered: false,
            metadata: {},
        },
    ],
    metadata: {},
}
&physical_input_schema_from_logical = Schema {
    fields: [
        Field {
            name: "alias1",
            data_type: Utf8,
            nullable: true,
            dict_id: 0,
            dict_is_ordered: false,
            metadata: {
                "some_key": "some_value"
            },
        },
    ],
    metadata: {},
}

I've yet to figure out exactly where the metadata is being dropped and I haven't figured out a reproducer either. I suggested comparing only the fields' non-metadata fields here, but @jayzhan211 pointed out that that's more of a workaround than an actual fix, as it's still a problem if the metadata is disappearing.

The issue that I'm running into, though, seems to be somewhat different than the issue that others (like @phillipleblanc) are running into, where some fields completely disappear from the schema (see here). I don't think these are the same issue, exactly (since they manifest differently), but they may have the same root cause/solution, so I think it's fair to keep them all under this issue unless needed otherwise.

I'll work on getting a fix or reproducer today

@alamb
Copy link
Contributor Author

alamb commented Sep 30, 2024

Here is my suggestion of how to try and reproduce the issue we see:

  1. Add a new sqllogictest like metadata.rs or something: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest
  2. Add custom / rust setup code here that creates a table that has schemas with metadata (both metadata on the schema and on the fields):
    "joins.slt" => {
    info!("Registering partition table tables");
    let example_udf = create_example_udf();
    test_ctx.ctx.register_udf(example_udf);
    register_partition_table(&mut test_ctx).await;
    }
  3. Add various queries in metadata.slt with the appropriate plan nodes (GROUP BY, ORDER BY, etc) and see if you can provoke the same error.

@alamb
Copy link
Contributor Author

alamb commented Sep 30, 2024

I am happy to help with the queries if someone can help set up the infrastructure

@wiedld
Copy link
Contributor

wiedld commented Sep 30, 2024

take

@alamb alamb added the regression Something that used to work no longer does label Oct 1, 2024
@alamb
Copy link
Contributor Author

alamb commented Oct 1, 2024

Labeling this as a regression as it used to work and now does nt

@itsjunetime
Copy link
Contributor

I think this specific issue is fixed, but after #12691, I'm now seeing issues with added field metadata (as opposed to missing field metadata) at the same comparison point. I'll open another issue for it.

@alamb alamb changed the title Field Metadata Lost resulting in Internal Error: Physical input schema should be the same as the one converted from logical input schema Field Metadata Lost on COUNT DISTINCT queries resulting in Internal Error: Physical input schema should be the same as the one converted from logical input schema Oct 3, 2024
@alamb alamb mentioned this issue Oct 8, 2024
4 tasks
@alamb
Copy link
Contributor Author

alamb commented Oct 17, 2024

We plan to include a fix for this issue in 42.1.0 #12813

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working regression Something that used to work no longer does
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants