Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce Scalar type for ColumnarValue #12536

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

notfilippo
Copy link

@notfilippo notfilippo commented Sep 19, 2024

This PR represents the first step originating from experiment #11978, which itself stems from the broad objective described in proposal #11513.


Rationale for this change

This change introduces the knowledge of Scalar type which identifies the physical representation of a scalar value. Scalar will replace ScalarValue in the ColumnarValue enum. This is done in order to allow, in future PRs, to remove some redundant variants of ScalarValue and transition to logical representation of scalar values in datafusion.

Design rationale

The new Scalar type embeds both a ScalarValue and a DataType. While the type information might seem redundant considering the current implementation ScalarValue it will be beneficial once we start to remove redundant variants (such as replacing ScalarValue::Utf8View and ScalarValue::LargeUtf8 while keeping ScalarValue::Utf8) in order to keep track of the represented type during physical execution.

Alternatives considered: arrow::array::Scalar

Arrow's array crate provides a Scalar type which implements Datum and which holds a reference to a len() = 1 arrow array. While this type fully identifies the physical representation of a scalar value it falls short when considering the overhead of copying a fully rendered ArrayRef instead of primitives types.

This possibility was also already discussed here: #7353 (comment).

What changes are included in this PR?

  • Introduced the Scalar struct
  • Replaced ScalarValue with Scalar in ColumnarValue's Scalar variant
  • Adapted existing code to the Scalar variant change

Are these changes tested?

This change should be a no-op as it doesn't include any behavioural change. The existing test suite is sufficient to verify this change.

TODO

  • rustdoc on new types and functions

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate common Related to common crate functions labels Sep 19, 2024
@notfilippo
Copy link
Author

notfilippo commented Sep 19, 2024

The main change is contained in this file: https://github.com/apache/datafusion/pull/12536/files#diff-2cc034babb8e7f8601dda34ecaa2119104eccd76d8ad2f9e19b26b01463634d6 (takes a while to jump to the file in the diff once the page load).

All other files are mostly updates to support the change from ScalarValue to Scalar in ColumnarValue

@alamb
Copy link
Contributor

alamb commented Sep 19, 2024

FYI @findepi

@alamb
Copy link
Contributor

alamb commented Sep 19, 2024

I will try and find time to review this tomorrow -- thank you @notfilippo

@findepi
Copy link
Member

findepi commented Sep 20, 2024

Thank you @notfilippo for working on this!

ScalarValue::Utf8View and ScalarValue::LargeUtf8 while keeping ScalarValue::Utf8) in order to keep track of the represented type during physical execution.

Can you elaborate what this is needed for?

@notfilippo
Copy link
Author

Can you elaborate what this is needed for?

@findepi -- DataType in Scalar is needed as a way to retain the physical type that the ScalarValues represents. That knowledge will be used when creating arrays from scalar types. It comes especially useful when converting a vector of scalars into an array of the correct physical type.

You can find some more context here: #11978 (comment)

As @jayzhan211 mentions in #11978 (comment) and #11978 (comment) it's a step towards the goal but it does not represent the full decoupling of physical and logical types.

jayzhan211
jayzhan211 previously approved these changes Sep 20, 2024
Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this change. Let's wait for others' views

@notfilippo
Copy link
Author

Before merging I was also planning to add some rustdoc on the new type in order to explain its purpose and future plans.

Comment on lines +25 to +27
pub struct Scalar {
value: ScalarValue,
data_type: DataType,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I assume that constructing Scalar { value: ScalarValue::Int32(..), data_type: DataType::Null } is not valid?
We need to define constraints on these values and ideally enforce them.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think those constraint can be made implicit if we consider editing the possible way the Scalar object can be constructed. By only keeping the ::from(ScalarValue) and the ::try_from_array(...) method (which is used for casting) we artificially limit the possible values that data_type can hold to the ones that are cast-compatible with value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another potential way to represent this idea would be to use an enum, perhaps, like

pub enum Scalar {
  /// A ScalarValue (transitory)
  ScalarValue(ScalarValue),
  /// A logical String with associated Arrow physical typ
  String {
    value: Option<String>,
    data_type: DataType,
  },
}

I think we could modify this in the future as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb what would be "a logical String with associated data type"? i.e. what value of target data type does it represent and how to obtain it? And it it costly to obtain it during optimization process?
i didn't think about this too deeply yet, but would it be possible for the logical IR to avoid ambiguities?

I think those constraint can be made implicit if we consider editing the possible way the Scalar object can be constructed. By only keeping the ::from(ScalarValue) and the ::try_from_array(...) method (

@notfilippo that's great there are constraints.
let's make sure they are documented -- would be useful for the readers, and also would capture the design intent in the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb what would be "a logical String with associated data type"? i.e. what value of target data type does it represent and how to obtain it? And it it costly to obtain it during optimization process?

I am not sure what this question is asking

I was thinking it would be possible to make Scalar more explict in terms of its type rather than wrapping a ScalarValue which is itself a wrapper around a value

Comment on lines +31 to +34
fn from(value: ScalarValue) -> Self {
Self {
data_type: value.data_type(),
value,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code shows that Scalar is a redundant wrapper around ScalarValue.
I understand this is transitionary. Can we capture the desired move forward in the code?
It's great if the code self-documents, but without any commentary this code self-documents itself as being redundant

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally understandable. As I mentioned #12536 (comment) I wanted to first gauge opinions about my implementation (which proved successful as you raised a lot of interesting points to account for) before committing on some documentation prose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understood the idea, eventually the DataType of the scalar will be different than the underlying representation

I 100% agree with the sentiment that this intent should be captured in code comments as it is very much non obvious from the code

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before committing on some documentation prose.

that's a good point
and actually, i have hard time with long documentation prose -- i am maybe a slow reader.
what i am looking for is exactly what you need -- to gauge opinions you need to understand what the code says today, but also what (today) it is supposed to say tomorrow

As I understood the idea, eventually the DataType of the scalar will be different than the underlying representation

@alamb , on the face value it seems to be opposite to what i understood from @notfilippo 's #12536 (comment). I don't have strong opinion yet what scalar should or should not be, but would be great to understand the desired end state. Can you please help me with that?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understood the idea, eventually the DataType of the scalar will be different than the underlying representation

@alamb , on the face value it seems to be opposite to what i understood from @notfilippo 's #12536 (comment). I don't have strong opinion yet what scalar should or should not be, but would be great to understand the desired end state. Can you please help me with that?

The main point of this PR is to remove variants of ScalarValue and represent them instead through the DataType contained in the Scalar type (which is what @\alamb is saying). The constraints put in place by the current constructors of Scalar allow it to only contain DataType fully compatible with the ScalarType provided (see ::from(ScalarValue)) or cast-compatible (see ::try_from_array(...)). Will try to make it clear via docs later today.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main point of this PR is to remove variants of ScalarValue and represent them instead through the DataType contained in the Scalar type (which is what @\alamb is saying).

Yes I think that was my intention

@@ -89,7 +91,7 @@ pub enum ColumnarValue {
/// Array of values
Array(ArrayRef),
/// A single value
Scalar(ScalarValue),
Scalar(Scalar),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when i read @jayzhan211 's #12488 (comment) o thought that ScalarValue is going to be left in ColumnarValue (least overhead, as relating to @alamb 's #12488 (review)).
But here is't being replaced.

What's the next step here? what's the end state?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The next step is to slowly remove variants of ScalarValue while still accounting for them via data_type (which in my opinion is the least intrusive change to support this effort). After we are satisfied with the variants that remain we reconsider the logic in expressions and operators in order to support what @jayzhan211 in proposing:

I think we could have LogicalType without any Arrow's DataType contained in it in the future

by sourcing the data_type directly from the execution schema.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Is ScalarValue supposed to be in the logical types layer, or physical?
  • is Scalar supposed to be in the logical types layer, or physical?
  • is ColumnarValue purely physical layer?

remove variants of ScalarValue while still accounting for them via data_type

what do we need this for?
(maybe that's obvious for someone with clarity on the preceding questions)

Copy link
Contributor

@jayzhan211 jayzhan211 Sep 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is hard to tell the role given we are in the intermediate state. I would said all of ColumnarValue and ScalarValue and Scalar are physical types now, given they have either ArrayRef or DataType

The end state in my mind is that,
ColumnarValue stores ArrayRef for multiple rows and ScalarValue for single rows case. ScalarValue has either native type like i64 or arrow::Scalar.

We will have LogicalType which could be resolved from DataType given the mapping we determined, which should be customizable as well for user defined type. Similar to the role Scalar in this PR, which record the relationship between ScalarValue to DataType.

Copy link
Contributor

@jayzhan211 jayzhan211 Sep 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove variants of ScalarValue while still accounting for them via data_type

Our first step is to minimize the variant of types, so we don't need ScalarValue::Utf8, ScalarValue::LargeUtf8 any more but Scalar which has ScalarValue::String + DataType::Utf8 or ScalarValue::String + DataType::LargeUtf8. We determine the actual physical type based on arrow's DataType. ScalarValue::String is then something close to logical type

Copy link
Contributor

@jayzhan211 jayzhan211 Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is nothing difference for literal case, we still get Literal(ScalarValue).

Copy link
Contributor

@jayzhan211 jayzhan211 Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ScalarValue should be used all over datafusion, it is more a value and has no logical or physical concept. We just need to bring along with DataType so we understand how to decode the value in physical layer and understand what logical type it is to interact within the logical plan

ScalarValue along with DataType is a transition state, I guess we could get the DataType from schema so we don't even require DataType in ScalarValue

Copy link
Contributor

@jayzhan211 jayzhan211 Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rough idea is as the following:

As long as we have Schema we can get the DataType, and we can get the LogicalType from it if each DataType has at most one LogicalType. ScalarValue is just a wrapper for the scalar case of ArrayRef or rust native type. We can carry String for Utf8 / Utf8View / LargeUtf8 and decode it if needed, and also carry Scalar<ArrayRef> for List / LargeList and so on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can get the LogicalType from it if each DataType has at most one LogicalType

We should not assume this to be the case.
I suppose #12644 may convey why, but even if we don't do Extension Types, this assumption will be very limiting.

This comment was marked as outdated.

@findepi
Copy link
Member

findepi commented Sep 20, 2024

Brief code-level documentation -- what this new Scalar is, what it is not, what are the constraints and what's the goal of this evolution is a must for this.

I would like to review again once this is added, so that I understand better what we're trying to accomplish.

The referenced issue #11513 is a good record of all the discussions, but it's not clear which comments are opinions that were later refined and which are normative.

Thanks for the patience.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all thank you @notfilippo and @findepi and @jayzhan211 -- I think this is a very nice step towards splitting out logical and physical types.

I took a brief look through this PR and it seems pretty good to me (other than lack of comments)

So one way to proceed if we are agreed on the basic idea is to create a new "Epic" ticket that lists the steps planned (it is fine if we don't know them all now). Here is an example apache/arrow-rs#5374

The first few steps might be

  • Add Scalar
  • Remove ScalarValue::LargeString in favor of Scalar
  • ...

Having such an epic would also make it clear what the current plan was (as opposed to the various options that had been previously discussed)

cc @ozankabak as this project appears to be moving along

Comment on lines +31 to +34
fn from(value: ScalarValue) -> Self {
Self {
data_type: value.data_type(),
value,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understood the idea, eventually the DataType of the scalar will be different than the underlying representation

I 100% agree with the sentiment that this intent should be captured in code comments as it is very much non obvious from the code

@@ -266,7 +266,7 @@ macro_rules! decode_to_array {

impl Encoding {
fn encode_scalar(self, value: Option<&[u8]>) -> ColumnarValue {
ColumnarValue::Scalar(match self {
ColumnarValue::from(match self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The switch from explicit use of a variant to a From clause is a nice change regardless, FWIW, and we could potentially break it out into its own PR

datafusion/expr-common/src/scalar.rs Show resolved Hide resolved
Comment on lines +25 to +27
pub struct Scalar {
value: ScalarValue,
data_type: DataType,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another potential way to represent this idea would be to use an enum, perhaps, like

pub enum Scalar {
  /// A ScalarValue (transitory)
  ScalarValue(ScalarValue),
  /// A logical String with associated Arrow physical typ
  String {
    value: Option<String>,
    data_type: DataType,
  },
}

I think we could modify this in the future as well

@ozankabak
Copy link
Contributor

Will carefully review next week. Thank you for working on this

@findepi
Copy link
Member

findepi commented Sep 21, 2024

So one way to proceed if we are agreed on the basic idea is to create a new "Epic" ticket that lists the steps planned (it is fine if we don't know them all now). Here is an example apache/arrow-rs#5374

I love the idea, @alamb !

just note that only maintainers can add to such epic.

Having such an epic would also make it clear what the current plan was (as opposed to the various options that had been previously discussed)

that would be helpful indeed!

@notfilippo
Copy link
Author

So one way to proceed if we are agreed on the basic idea is to create a new "Epic" ticket that lists the steps planned (it is fine if we don't know them all now). Here is an example apache/arrow-rs#5374

I love the idea! My starting suggestion for potential issues is:

  • Add Scalar
  • Use Scalar in Expr::Literal
  • Introduce LogicalType enum and the logically equivalent notion
  • Remove ScalarValue::LargeUtf8 and ScalarValue::Utf8View in favour of ScalarValue::Utf8
  • Remove ScalarValue::LargeBinary and ScalarValue::BinaryView in favour of ScalarValue::Utf8
  • Remove ScalarValue::Dictionary (from Remove ScalarValue::Dictionary #12488)

@notfilippo
Copy link
Author

Also, @findepi -- I've finally settled on some docs. 😄

@alamb
Copy link
Contributor

alamb commented Sep 24, 2024

just note that only maintainers can add to such epic.

I think the author of a ticket can do so too.

Another way that has worked in the past is just to leave comments and I or another maintainer can copy the links to the description

We could also explore adding people who are interested as "triage"ers (access to tickets/Prs without write access to code): https://github.com/apache/infrastructure-asfyaml?tab=readme-ov-file#triage

/// physical type as [`DataType`] alongside the value represented as
/// [`ScalarValue`].
///
/// [`Scalar`] cannot be constructed directly as not all [`DataType`] variants
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is intentional (that some types can not be handled)

Copy link
Author

@notfilippo notfilippo Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you exactly mean 🤔. Currently (but potentially not in the future) the DataType of Scalar should be at least logically equivalent to the data type reported by ScalarValue::data_type()

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks -- this is looking good to me

Let's let @ozankabak have a chance to look at it and then figure out how to move forward

Even if we just get to the point that ScalarValue has a different representation I think it would be an improvement in readability, though maybe that doesn't justify the code churn.

Maybe we can set up a zoom meeting for anyone interested in this topic to discuss and get alignment if possible (with a writeup afterwards) as suggested by @findepi

Comment on lines +25 to +27
pub struct Scalar {
value: ScalarValue,
data_type: DataType,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb what would be "a logical String with associated data type"? i.e. what value of target data type does it represent and how to obtain it? And it it costly to obtain it during optimization process?

I am not sure what this question is asking

I was thinking it would be possible to make Scalar more explict in terms of its type rather than wrapping a ScalarValue which is itself a wrapper around a value

Comment on lines +31 to +34
fn from(value: ScalarValue) -> Self {
Self {
data_type: value.data_type(),
value,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main point of this PR is to remove variants of ScalarValue and represent them instead through the DataType contained in the Scalar type (which is what @\alamb is saying).

Yes I think that was my intention

@ozankabak
Copy link
Contributor

ozankabak commented Sep 25, 2024

This looks good to me. The main challenge I had was extrapolating/imagining what the post-transitionary state of things will really look like. At this point part of the code seem redundant (as in @findepi's inline comments).

I am unclear on what is the best way to proceed. I fear we may run into unforeseen issues (in downstream projects and/or upstream DF) as we take the subsequent steps (i.e. evolve ScalarValue and remove certain data types from it).

One way we can proceed is to create a branch and make progress on it and rebase/merge main weekly. We can discover upstream DF issues this way without disrupting the progress on main. Also, downstream projects can experiment with their cargo setup to point to this branch and see what happens. This would give us the opportunity to discover issues without causing disruptions and ensure a smooth transition into the new paradigm if things still look good as we go through the motions. Obviously this will introduce some development burden, but it will avoid any unintended bad impacts, reverts etc.

What do you all think about the path forward?

@notfilippo
Copy link
Author

One way we can proceed is to create a branch and make progress on it and rebase/merge main weekly. We can discover upstream DF issues this way without disrupting the progress on main. Also, downstream projects can experiment with their cargo setup to point to this branch and see what happens.

I love this approach. I think having this + the [Epic] tracking ticket will be a great way of managing progress for the proposal.

@alamb alamb dismissed jayzhan211’s stale review September 29, 2024 11:26

I think we are still discussing this PR so clearing the approval flag so we don't merge it by accident

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate functions logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants