Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support casting to and from unions #6218

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

samuelcolvin
Copy link
Contributor

@samuelcolvin samuelcolvin commented Aug 9, 2024

Which issue does this PR close?

Closes apache/datafusion#10180.

Rationale for this change

I want to cast from unions to primitive types (I've also added support for casting to a union here).

What changes are included in this PR?

Support casting to and from unions (hopefully), tests.

Casting between unions is not yet supported.

Are there any user-facing changes?

no

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 9, 2024
@samuelcolvin samuelcolvin changed the title WIP support casting to and from unions Support casting to and from unions Aug 9, 2024
Comment on lines +2536 to +2579
child.slice(*offset as usize, 1)
} else {
new_null_array(data_type, 1)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how slow this will be?

In theory we could do something cleverer where we build up chunks rather than having a separate array for each element, but I've no idea how much difference it will make.

More to the point I wonder if there's a completely different way to do this that's much faster?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will definitely be slow.

I think the better API is this: https://docs.rs/arrow/latest/arrow/array/struct.MutableArrayData.html (which lets you copy chunks of data from different arrays)

Though if what you are trying to do is to get one child array and use the nulls of the parent, you could potentially just replace the null buffer directly

I can probably have some more specific suggestions once we sort out what the semantics of casting a union array are supposed to be (see my other comment)

@samuelcolvin samuelcolvin marked this pull request as ready for review August 9, 2024 16:45
@samuelcolvin
Copy link
Contributor Author

I think this is ready to review.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @samuelcolvin -- I am sorry for the delay in reviewing.

I am not sure about the desired semantics of a union cast. Thus I think we should at minimum document what casting Union means in https://docs.rs/arrow/latest/arrow/compute/fn.cast_with_options.html#behavior

It would probably help to research what other systems (e.g. pyarrow) do when casitng UnionArray to other types

Once we are clear / agree on the semantics we can then sort out the implementation

@@ -296,7 +296,7 @@ impl UnionArray {
}

/// Returns whether the `UnionArray` is dense (or sparse if `false`).
fn is_dense(&self) -> bool {
pub fn is_dense(&self) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


(Union(_, _), Union(_, _)) => false,
(Union(from_fields, _), _) => {
from_fields.iter().any(|(_, f)| can_cast_types(f.data_type(), to_type))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this verify we can cast ALL the possible union field types to the target type , not just one of the fields?

)),

// unions
// we might be able to support this, but it's complex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 -- it is also not well specified if there are multiple field types that could be the target

For example, what would Union(Int32, Int64) --> Union(Float64, Decimal128) do? Would it cast all the integers to Float64 Or to Decimal128? Or something else 🤔

///
/// # Panics
/// Panics if `type_id` is not present in the union.
fn union_field_array(union_array: &UnionArray, type_id: i8) -> Result<Cow<ArrayRef>, ArrowError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since an ArrayRef is already an Arc I suspect the extra Cow here doesn't matter much

Comment on lines +9736 to +9738
let cast_array = cast(&union_array, &DataType::Int64).unwrap();
let as_int64 = as_int_vec::<Int64Type>(&cast_array);
assert_eq!(as_int64, vec![Some(1), None, None, None]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the expected semantics of casting a UnionArray to another type?

What this test seems to verify is some sort of TAKE semantics, where casting from UnionArray to Int64 basically picks the Union variant that is the closest and takes

So it does

cast(union of [{A=1}, {A=}, {B=3.2}], Int64) --> [1, NULL, NULL]

I sort of expected that the result of casting a UnionArray would be to cast all the elements individually to the target type. So in this example

cast (union of [{A=1}, {A=}, {B=3.2}]) --> [1, NULL, 3]

Where the 3 results from casting 3.2 to 3

Comment on lines +2536 to +2579
child.slice(*offset as usize, 1)
} else {
new_null_array(data_type, 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will definitely be slow.

I think the better API is this: https://docs.rs/arrow/latest/arrow/array/struct.MutableArrayData.html (which lets you copy chunks of data from different arrays)

Though if what you are trying to do is to get one child array and use the nulls of the parent, you could potentially just replace the null buffer directly

I can probably have some more specific suggestions once we sort out what the semantics of casting a union array are supposed to be (see my other comment)

@samuelcolvin
Copy link
Contributor Author

@alamb ye, I think you make a very good point. I hadn't thought about the case where multiple children (perhaps all children) of the union are valid in the output type.

I'll do some research.

@samuelcolvin
Copy link
Contributor Author

Let's discuss on #6247.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"Cannot infer common argument type for comparison operation Union..."
2 participants