column types must match schema types occurs after unnest_columns on another column #14218

Open
ion-elgreco opened this issue Jan 20, 2025 · 0 comments
Labels
bug Something isn't working

Describe the bug

I am rewriting our CDF operation in delta-rs. The code looks roughly like this:

    let mut projected = if should_cdc {
        operation_count
            .clone()
            .with_column(
                CDC_COLUMN_NAME,
                when(col(TARGET_DELETE_COLUMN).is_null(), lit("delete")) // nulls are equal to True
                    .when(col(DELETE_COLUMN).is_null(), lit("source_delete"))
                    .when(col(TARGET_COPY_COLUMN).is_null(), lit("copy"))
                    .when(col(TARGET_INSERT_COLUMN).is_null(), lit("insert"))
                    .when(col(TARGET_UPDATE_COLUMN).is_null(), lit("update"))
                    .end()?,
            )?
            // .drop_columns(&["__delta_rs_path"])? // WEIRD bug caused by interaction with unnest_columns, has to be dropped otherwise throws schema error
            .with_column(
                "__delta_rs_update_expanded",
                when(
                    col(CDC_COLUMN_NAME).eq(lit("update")),
                    lit(ScalarValue::List(ScalarValue::new_list(
                        &[
                            ScalarValue::Utf8(Some("update_preimage".into())),
                            ScalarValue::Utf8(Some("update_postimage".into())),
                        ],
                        &DataType::List(Field::new("element", DataType::Utf8, false).into()),
                        true,
                    ))),
                )
                .end()?,
            )?
            .unnest_columns(&["__delta_rs_update_expanded"])?
            .with_column(
                CDC_COLUMN_NAME,
                when(
                    col(CDC_COLUMN_NAME).eq(lit("update")),
                    col("__delta_rs_update_expanded"),
                )
                .otherwise(col(CDC_COLUMN_NAME))?,
            )?
            .drop_columns(&["__delta_rs_update_expanded"])?
            .select(write_projection_with_cdf)?
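
For readers unfamiliar with the pattern, the row-level effect of the `when` / `unnest_columns` / `when` sequence above can be sketched with plain Rust values instead of a DataFrame. This is a toy model, not delta-rs or DataFusion code: `expand_updates` is a hypothetical helper, and it assumes that unnesting a NULL list keeps the row and yields a single NULL element (which I believe is DataFusion's default `preserve_nulls` behaviour, but treat that as an assumption).

```rust
/// Toy model of the pipeline: each input label becomes one or two output labels.
/// Assumption: unnest keeps rows whose list is NULL, emitting one NULL element.
fn expand_updates(ops: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    for &op in ops {
        // Step 1: only "update" rows get the two-element list; others get NULL.
        let expanded: Option<[&str; 2]> = if op == "update" {
            Some(["update_preimage", "update_postimage"])
        } else {
            None
        };
        // Step 2: unnest -> one row per list element, or one row holding NULL.
        let unnested: Vec<Option<&str>> = match expanded {
            Some(items) => items.iter().copied().map(Some).collect(),
            None => vec![None],
        };
        // Step 3: "update" rows take the unnested value; others keep their label.
        for element in unnested {
            let label = match (op, element) {
                ("update", Some(e)) => e,
                _ => op,
            };
            out.push(label.to_string());
        }
    }
    out
}
```

Under that assumption, calling `expand_updates(&["insert", "update", "delete"])` yields four labels: the `update` row is split into a preimage and a postimage row, and the other rows pass through unchanged.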

I noticed that after calling unnest_columns on one column, execution fails with a schema error about a different, untouched column:

Result::unwrap()` on an `Err` value: Arrow { source: InvalidArgumentError("column types must match schema types, expected Utf8 but found Dictionary(UInt16, Utf8) at column index 7") }

Since I don't need the column, I can safely drop it beforehand, but I don't understand why Dictionary(UInt16, Utf8) isn't simply coerced to Utf8.
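
On the coercion question, a dependency-free sketch may help show why this surfaces as a hard error rather than a silent conversion. The types below are a toy model, not Arrow's own, and `validate` and `decode_dictionary` are hypothetical names. The assumption is that the batch is validated with an exact type-equality check, which is what the error wording suggests, so a dictionary-encoded column is rejected until something casts it explicitly.

```rust
// Toy model only: these are NOT Arrow's types, they just mimic the shape of the check.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    UInt16,
    Dictionary(Box<DataType>, Box<DataType>),
}

/// Strict validation: every column type must equal the declared schema type.
/// No coercion is attempted, mirroring the behaviour the error message implies.
fn validate(schema: &[DataType], columns: &[DataType]) -> Result<(), String> {
    for (i, (expected, found)) in schema.iter().zip(columns).enumerate() {
        if expected != found {
            return Err(format!(
                "column types must match schema types, expected {:?} but found {:?} at column index {}",
                expected, found, i
            ));
        }
    }
    Ok(())
}

/// Explicit "cast" step: unwrap a dictionary to its value type, leave others alone.
fn decode_dictionary(dt: &DataType) -> DataType {
    match dt {
        DataType::Dictionary(_, value) => (**value).clone(),
        other => other.clone(),
    }
}
```

With a schema of `[Utf8]` and a single column typed `Dictionary(UInt16, Utf8)`, `validate` returns an `Err`; after mapping the column types through `decode_dictionary`, the same call returns `Ok(())`. In real code the explicit step would presumably be a cast (for example arrow's `cast` kernel or DataFusion's `cast` expression); whether DataFusion should insert it automatically in this plan is the open question here.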

To Reproduce

This is a bit difficult to reproduce in isolation, but you can grab my branch: https://github.com/ion-elgreco/delta-rs/tree/refactor--combine_execution_plans

Then run the test test_merge_cdc_enabled_simple with this line commented out: .drop_columns(&["__delta_rs_path"])?

Expected behavior

I would expect the Dictionary(UInt16, Utf8) column to be coerced to Utf8 gracefully instead of raising a schema error.

Additional context

No response
