Introduce `Scalar` type for `ColumnarValue` #12536
Conversation
The main change is contained in this file: https://github.com/apache/datafusion/pull/12536/files#diff-2cc034babb8e7f8601dda34ecaa2119104eccd76d8ad2f9e19b26b01463634d6 (it takes a while to jump to the file in the diff once the page loads). All other files are mostly updates to support the change from `ScalarValue` to `Scalar`.
FYI @findepi

I will try and find time to review this tomorrow -- thank you @notfilippo

Thank you @notfilippo for working on this!

Can you elaborate on what this is needed for?
@findepi -- You can find some more context here: #11978 (comment) As @jayzhan211 mentions in #11978 (comment) and #11978 (comment) it's a step towards the goal, but it does not represent the full decoupling of physical and logical types.
I'm fine with this change. Let's wait for others' views
Before merging I was also planning to add some rustdoc on the new type in order to explain its purpose and future plans.
```rust
pub struct Scalar {
    value: ScalarValue,
    data_type: DataType,
}
```
Should I assume that constructing `Scalar { value: ScalarValue::Int32(..), data_type: DataType::Null }` is not valid?
We need to define constraints on these values and ideally enforce them.
I think those constraints can be made implicit if we consider limiting the ways the `Scalar` object can be constructed. By only keeping the `::from(ScalarValue)` and the `::try_from_array(...)` methods (the latter is used for casting) we artificially limit the possible values that `data_type` can hold to the ones that are cast-compatible with `value`.
Another potential way to represent this idea would be to use an enum, perhaps, like

```rust
pub enum Scalar {
    /// A ScalarValue (transitory)
    ScalarValue(ScalarValue),
    /// A logical String with associated Arrow physical type
    String {
        value: Option<String>,
        data_type: DataType,
    },
}
```
I think we could modify this in the future as well
@alamb what would be "a logical String with associated data type"? i.e. what value of the target data type does it represent and how do we obtain it? And is it costly to obtain during the optimization process?

I didn't think about this too deeply yet, but would it be possible for the logical IR to avoid ambiguities?
I think those constraints can be made implicit if we consider limiting the ways the `Scalar` object can be constructed. By only keeping the `::from(ScalarValue)` and the `::try_from_array(...)` methods (...)
@notfilippo that's great that there are constraints. Let's make sure they are documented -- it would be useful for readers, and would also capture the design intent in the code.
@alamb what would be "a logical String with associated data type"? i.e. what value of the target data type does it represent and how do we obtain it? And is it costly to obtain during the optimization process?
I am not sure what this question is asking
I was thinking it would be possible to make Scalar more explicit in terms of its type rather than wrapping a ScalarValue which is itself a wrapper around a value
```rust
fn from(value: ScalarValue) -> Self {
    Self {
        data_type: value.data_type(),
        value,
    }
}
```
This code shows that `Scalar` is a redundant wrapper around `ScalarValue`. I understand this is transitional. Can we capture the desired move forward in the code?

It's great if the code self-documents, but without any commentary this code self-documents itself as being redundant.
Totally understandable. As I mentioned in #12536 (comment), I wanted to first gauge opinions about my implementation (which proved successful, as you raised a lot of interesting points to account for) before committing to some documentation prose.
As I understood the idea, eventually the `DataType` of the scalar will be different than the underlying representation.

I 100% agree with the sentiment that this intent should be captured in code comments, as it is very much non-obvious from the code.
before committing to some documentation prose

That's a good point. And actually, I have a hard time with long documentation prose -- I am maybe a slow reader. What I am looking for is exactly what you need: to gauge opinions you need to understand what the code says today, but also what (today) it is supposed to say tomorrow.
As I understood the idea, eventually the DataType of the scalar will be different than the underlying representation

@alamb, at face value this seems to be the opposite of what I understood from @notfilippo's #12536 (comment). I don't have a strong opinion yet on what scalar should or should not be, but it would be great to understand the desired end state. Can you please help me with that?
As I understood the idea, eventually the DataType of the scalar will be different than the underlying representation

@alamb, at face value this seems to be the opposite of what I understood from @notfilippo's #12536 (comment). I don't have a strong opinion yet on what scalar should or should not be, but it would be great to understand the desired end state. Can you please help me with that?
The main point of this PR is to remove variants of `ScalarValue` and represent them instead through the `DataType` contained in the `Scalar` type (which is what @alamb is saying). The constraints put in place by the current constructors of `Scalar` allow it to only contain a `DataType` fully compatible with the `ScalarValue` provided (see `::from(ScalarValue)`) or cast-compatible (see `::try_from_array(...)`). Will try to make it clear via docs later today.
The main point of this PR is to remove variants of `ScalarValue` and represent them instead through the `DataType` contained in the `Scalar` type (which is what @alamb is saying).
Yes I think that was my intention
```diff
@@ -89,7 +91,7 @@ pub enum ColumnarValue {
     /// Array of values
     Array(ArrayRef),
     /// A single value
-    Scalar(ScalarValue),
+    Scalar(Scalar),
```
When I read @jayzhan211's #12488 (comment) I thought that ScalarValue was going to be left in ColumnarValue (least overhead, as relating to @alamb's #12488 (review)). But here it's being replaced.

What's the next step here? What's the end state?
The next step is to slowly remove variants of `ScalarValue` while still accounting for them via `data_type` (which in my opinion is the least intrusive change to support this effort). After we are satisfied with the variants that remain, we reconsider the logic in expressions and operators in order to support what @jayzhan211 is proposing:

I think we could have LogicalType without any Arrow's DataType contained in it in the future

by sourcing the `data_type` directly from the execution schema.
- Is `ScalarValue` supposed to be in the logical types layer, or physical?
- Is `Scalar` supposed to be in the logical types layer, or physical?
- Is `ColumnarValue` purely physical layer?

remove variants of `ScalarValue` while still accounting for them via `data_type`

What do we need this for? (Maybe that's obvious for someone with clarity on the preceding questions.)
I think it is hard to tell the role given we are in the intermediate state. I would say all of `ColumnarValue`, `ScalarValue`, and `Scalar` are physical types now, given they have either `ArrayRef` or `DataType`.

The end state in my mind is that `ColumnarValue` stores `ArrayRef` for the multiple-row case and `ScalarValue` for the single-row case. `ScalarValue` has either a native type like `i64` or `arrow::Scalar`.

We will have `LogicalType`, which could be resolved from `DataType` given the mapping we determined, and which should be customizable for user-defined types as well. This is similar to the role of `Scalar` in this PR, which records the relationship between `ScalarValue` and `DataType`.
remove variants of ScalarValue while still accounting for them via data_type

Our first step is to minimize the number of type variants, so we no longer need `ScalarValue::Utf8` and `ScalarValue::LargeUtf8`, but instead `Scalar`, which has `ScalarValue::String` + `DataType::Utf8` or `ScalarValue::String` + `DataType::LargeUtf8`. We determine the actual physical type based on Arrow's `DataType`. `ScalarValue::String` is then something close to a logical type.
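A sketch of what that collapse could look like, using stub enums rather than the real DataFusion types: a single logical `ScalarValue::String` variant, with the concrete physical encoding carried alongside it in `Scalar`:

```rust
// Stub enums (not the real DataFusion types): one logical string variant,
// with the concrete physical encoding carried next to it.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    // Single variant replacing the Utf8 / LargeUtf8 / Utf8View value variants.
    String(Option<String>),
}

#[derive(Debug, Clone, PartialEq)]
struct Scalar {
    value: ScalarValue,
    data_type: DataType,
}

fn main() {
    // Same logical value, two different physical encodings.
    let small = Scalar {
        value: ScalarValue::String(Some("hi".into())),
        data_type: DataType::Utf8,
    };
    let large = Scalar {
        value: ScalarValue::String(Some("hi".into())),
        data_type: DataType::LargeUtf8,
    };
    assert_eq!(small.value, large.value); // logically equal
    assert_ne!(small.data_type, large.data_type); // physically distinct
}
```

The point of the shape is that code comparing or folding scalars can look only at `value`, while code producing arrays consults `data_type`.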
I think there is no difference for the literal case; we still get `Literal(ScalarValue)`.
`ScalarValue` should be used all over DataFusion; it is more of a value and has no logical or physical concept. We just need to bring along the `DataType` so we understand how to decode the value in the physical layer and what logical type it is when interacting with the logical plan.

`ScalarValue` along with `DataType` is a transition state; I guess we could get the `DataType` from the schema so we don't even require `DataType` in `ScalarValue`.
The rough idea is the following: as long as we have a `Schema` we can get the `DataType`, and we can get the `LogicalType` from it if each `DataType` has at most one `LogicalType`. `ScalarValue` is just a wrapper for the scalar case of `ArrayRef` or a Rust native type. We can carry `String` for `Utf8` / `Utf8View` / `LargeUtf8` and decode it if needed, and also carry `Scalar<ArrayRef>` for `List` / `LargeList` and so on.
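Under the stated assumption that each `DataType` maps to at most one `LogicalType` (an assumption questioned just below), the resolution step could be sketched with stub enums like this:

```rust
// Stub types; the mapping below assumes (as the comment above does) that
// each physical DataType resolves to at most one LogicalType.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    LargeUtf8,
    Utf8View,
    List,
    LargeList,
}

#[derive(Debug, Clone, PartialEq)]
enum LogicalType {
    Int32,
    String,
    List,
}

// Resolve the logical type from the physical type found in the schema.
fn logical_type(dt: &DataType) -> LogicalType {
    match dt {
        DataType::Int32 => LogicalType::Int32,
        DataType::Utf8 | DataType::LargeUtf8 | DataType::Utf8View => LogicalType::String,
        DataType::List | DataType::LargeList => LogicalType::List,
    }
}

fn main() {
    // All three string encodings resolve to the same logical type.
    assert_eq!(logical_type(&DataType::Utf8), LogicalType::String);
    assert_eq!(logical_type(&DataType::Utf8View), LogicalType::String);
}
```

The inverse direction is where the at-most-one assumption bites: several physical types share a logical type, so a logical value needs the schema (or a carried `DataType`) to pick its encoding.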
we can get the LogicalType from it if each DataType has at most one LogicalType
We should not assume this to be the case.
I suppose #12644 may convey why, but even if we don't do Extension Types, this assumption will be very limiting.
Brief code-level documentation -- what this new Scalar is, what it is not, what the constraints are, and what the goal of this evolution is -- is a must for this. I would like to review again once this is added, so that I understand better what we're trying to accomplish. The referenced issue #11513 is a good record of all the discussions, but it's not clear which comments are opinions that were later refined and which are normative. Thanks for the patience.
First of all thank you @notfilippo and @findepi and @jayzhan211 -- I think this is a very nice step towards splitting out logical and physical types.
I took a brief look through this PR and it seems pretty good to me (other than lack of comments)
So one way to proceed, if we agree on the basic idea, is to create a new "Epic" ticket that lists the steps planned (it is fine if we don't know them all now). Here is an example: apache/arrow-rs#5374
The first few steps might be:

- Add `Scalar`
- Remove `ScalarValue::LargeString` in favor of `Scalar`
- ...
Having such an epic would also make it clear what the current plan was (as opposed to the various options that had been previously discussed)
cc @ozankabak as this project appears to be moving along
```diff
@@ -266,7 +266,7 @@ macro_rules! decode_to_array {

 impl Encoding {
     fn encode_scalar(self, value: Option<&[u8]>) -> ColumnarValue {
-        ColumnarValue::Scalar(match self {
+        ColumnarValue::from(match self {
```
The switch from explicit use of a variant to a `From` conversion is a nice change regardless, FWIW, and we could potentially break it out into its own PR.
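For illustration, a stub sketch (not the actual DataFusion definitions) of why routing construction through `From` helps: call sites no longer name the variant's payload type, so that type can change later by updating only the impl:

```rust
// Stub definitions (not the actual DataFusion code) showing the pattern.
#[derive(Debug)]
enum ScalarValue {
    Binary(Option<Vec<u8>>),
}

#[derive(Debug)]
enum ColumnarValue {
    Scalar(ScalarValue),
}

// Call sites say `ColumnarValue::from(...)` instead of naming the variant,
// so the variant's payload type can later change (e.g. ScalarValue -> Scalar)
// by updating only this impl.
impl From<ScalarValue> for ColumnarValue {
    fn from(value: ScalarValue) -> Self {
        ColumnarValue::Scalar(value)
    }
}

fn main() {
    let v = ColumnarValue::from(ScalarValue::Binary(Some(vec![1, 2])));
    assert!(matches!(v, ColumnarValue::Scalar(_)));
}
```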
Will carefully review next week. Thank you for working on this.

I love the idea, @alamb! Just note that only maintainers can add to such an epic.

That would be helpful indeed!
I love the idea! My starting suggestion for potential issues is:

Also, @findepi -- I've finally settled on some docs. 😄
I think the author of a ticket can do so too. Another way that has worked in the past is just to leave comments, and I or another maintainer can copy the links to the description. We could also explore adding people who are interested as "triage"rs (access to tickets/PRs without write access to code): https://github.com/apache/infrastructure-asfyaml?tab=readme-ov-file#triage
```rust
/// physical type as [`DataType`] alongside the value represented as
/// [`ScalarValue`].
///
/// [`Scalar`] cannot be constructed directly as not all [`DataType`] variants
```
I don't think this is intentional (that some types can not be handled)
I'm not sure what you exactly mean 🤔. Currently (but potentially not in the future) the `DataType` of `Scalar` should be at least logically equivalent to the data type reported by `ScalarValue::data_type()`.
Thanks -- this is looking good to me
Let's let @ozankabak have a chance to look at it and then figure out how to move forward
Even if we just get to the point that ScalarValue has a different representation I think it would be an improvement in readability, though maybe that doesn't justify the code churn.
Maybe we can set up a zoom meeting for anyone interested in this topic to discuss and get alignment if possible (with a writeup afterwards) as suggested by @findepi
This looks good to me. The main challenge I had was extrapolating/imagining what the post-transitional state of things will really look like. At this point parts of the code seem redundant (as in @findepi's inline comments). I am unclear on what is the best way to proceed. I fear we may run into unforeseen issues (in downstream projects and/or upstream DF) as we take the subsequent steps (i.e. evolve

One way we can proceed is to create a branch and make progress on it and rebase/merge main weekly. We can discover upstream DF issues this way without disrupting the progress on

What do you all think about the path forward?
I love this approach. I think having this + the [Epic] tracking ticket will be a great way of managing progress for the proposal.

I think we are still discussing this PR, so clearing the approval flag so we don't merge it by accident.

So, should we progress with the approach @ozankabak suggested? cc @alamb @jayzhan211

Sure, we should work on the branch until the end state is clear to us.

Sounds good 🚀 So should I switch the target of this PR to the new branch? Can a maintainer create this branch?

I have created the branch https://github.com/apache/datafusion/tree/logical-types

I've changed the base branch of this PR to `logical-types`.

Let's merge to the branch and keep iterating there!
This PR represents the first step originating from experiment #11978, which itself stems from the broad objective described in proposal #11513.
Rationale for this change
This change introduces a `Scalar` type which identifies the physical representation of a scalar value. `Scalar` will replace `ScalarValue` in the `ColumnarValue` enum. This is done in order to allow, in future PRs, removing some redundant variants of `ScalarValue` and transitioning to a logical representation of scalar values in DataFusion.

Design rationale
The new `Scalar` type embeds both a `ScalarValue` and a `DataType`. While the type information might seem redundant considering the current implementation of `ScalarValue`, it will be beneficial once we start to remove redundant variants (such as replacing `ScalarValue::Utf8View` and `ScalarValue::LargeUtf8` while keeping `ScalarValue::Utf8`) in order to keep track of the represented type during physical execution.

Alternatives considered: `arrow::array::Scalar`
Arrow's array crate provides a `Scalar` type which implements `Datum` and which holds a reference to a `len() == 1` arrow array. While this type fully identifies the physical representation of a scalar value, it falls short when considering the overhead of copying a fully rendered `ArrayRef` instead of primitive types.

This possibility was also already discussed here: #7353 (comment).
What changes are included in this PR?

- Add the `Scalar` struct
- Replace `ScalarValue` with `Scalar` in `ColumnarValue`'s `Scalar` variant

Are these changes tested?
This change should be a no-op as it doesn't include any behavioural change. The existing test suite is sufficient to verify this change.
TODO