Introduce `Scalar` type for `ColumnarValue` #12536
Conversation
The main change is contained in this file: https://github.com/apache/datafusion/pull/12536/files#diff-2cc034babb8e7f8601dda34ecaa2119104eccd76d8ad2f9e19b26b01463634d6 (it takes a while to jump to the file in the diff once the page loads). All other files are mostly updates to support the change from `ScalarValue` to `Scalar`.
FYI @findepi

I will try and find time to review this tomorrow -- thank you @notfilippo

Thank you @notfilippo for working on this!

Can you elaborate on what this is needed for?
@findepi -- You can find some more context here: #11978 (comment) As @jayzhan211 mentions in #11978 (comment) and #11978 (comment) it's a step towards the goal, but it does not represent the full decoupling of physical and logical types.
I'm fine with this change. Let's wait for others' views
Before merging I was also planning to add some rustdoc on the new type in order to explain its purpose and future plans.
```rust
pub struct Scalar {
    value: ScalarValue,
    data_type: DataType,
}
```
Should I assume that constructing `Scalar { value: ScalarValue::Int32(..), data_type: DataType::Null }` is not valid?
We need to define constraints on these values and ideally enforce them.
I think those constraints can be made implicit if we consider limiting the ways the `Scalar` object can be constructed. By only keeping the `::from(ScalarValue)` and the `::try_from_array(...)` methods (the latter is used for casting) we artificially limit the possible values that `data_type` can hold to the ones that are cast-compatible with `value`.
Another potential way to represent this idea would be to use an enum, perhaps, like

```rust
pub enum Scalar {
    /// A ScalarValue (transitory)
    ScalarValue(ScalarValue),
    /// A logical String with associated Arrow physical type
    String {
        value: Option<String>,
        data_type: DataType,
    },
}
```
I think we could modify this in the future as well
@alamb what would be "a logical String with associated data type"? i.e. what value of the target data type does it represent and how do we obtain it? And is it costly to obtain during the optimization process?

I didn't think about this too deeply yet, but would it be possible for the logical IR to avoid ambiguities?
I think those constraints can be made implicit if we consider limiting the ways the `Scalar` object can be constructed. By only keeping the `::from(ScalarValue)` and the `::try_from_array(...)` methods (...)
@notfilippo that's great that there are constraints. Let's make sure they are documented -- it would be useful for readers, and would also capture the design intent in the code.
@alamb what would be "a logical String with associated data type"? i.e. what value of the target data type does it represent and how do we obtain it? And is it costly to obtain during the optimization process?
I am not sure what this question is asking
I was thinking it would be possible to make Scalar more explicit in terms of its type rather than wrapping a ScalarValue which is itself a wrapper around a value
```rust
fn from(value: ScalarValue) -> Self {
    Self {
        data_type: value.data_type(),
        value,
    }
}
```
This code shows that `Scalar` is a redundant wrapper around `ScalarValue`. I understand this is transitional. Can we capture the desired move forward in the code?

It's great if the code self-documents, but without any commentary this code self-documents itself as being redundant.
Totally understandable. As I mentioned in #12536 (comment), I wanted to first gauge opinions about my implementation (which proved successful, as you raised a lot of interesting points to account for) before committing to some documentation prose.
As I understood the idea, eventually the `DataType` of the scalar will be different than the underlying representation.

I 100% agree with the sentiment that this intent should be captured in code comments, as it is very much non-obvious from the code.
before committing to some documentation prose

That's a good point. And actually, I have a hard time with long documentation prose -- I am maybe a slow reader. What I am looking for is exactly what you need: to gauge opinions you need to understand what the code says today, but also what (today) it is supposed to say tomorrow.
As I understood the idea, eventually the DataType of the scalar will be different than the underlying representation

@alamb, at face value this seems to be the opposite of what I understood from @notfilippo's #12536 (comment). I don't have a strong opinion yet on what scalar should or should not be, but it would be great to understand the desired end state. Can you please help me with that?
As I understood the idea, eventually the DataType of the scalar will be different than the underlying representation

@alamb, at face value this seems to be the opposite of what I understood from @notfilippo's #12536 (comment). I don't have a strong opinion yet on what scalar should or should not be, but it would be great to understand the desired end state. Can you please help me with that?
The main point of this PR is to remove variants of `ScalarValue` and represent them instead through the `DataType` contained in the `Scalar` type (which is what @alamb is saying). The constraints put in place by the current constructors of `Scalar` allow it to only contain a `DataType` fully compatible with the `ScalarValue` provided (see `::from(ScalarValue)`) or cast-compatible (see `::try_from_array(...)`). Will try to make it clear via docs later today.
The main point of this PR is to remove variants of `ScalarValue` and represent them instead through the `DataType` contained in the `Scalar` type (which is what @alamb is saying).
Yes I think that was my intention
```diff
@@ -89,7 +91,7 @@ pub enum ColumnarValue {
     /// Array of values
     Array(ArrayRef),
     /// A single value
-    Scalar(ScalarValue),
+    Scalar(Scalar),
```
When I read @jayzhan211's #12488 (comment) I thought that ScalarValue was going to be left in ColumnarValue (least overhead, as relating to @alamb's #12488 (review)). But here it's being replaced.

What's the next step here? What's the end state?
The next step is to slowly remove variants of `ScalarValue` while still accounting for them via `data_type` (which in my opinion is the least intrusive change to support this effort). After we are satisfied with the variants that remain, we reconsider the logic in expressions and operators in order to support what @jayzhan211 is proposing:

I think we could have LogicalType without any Arrow's DataType contained in it in the future

by sourcing the `data_type` directly from the execution schema.
- Is `ScalarValue` supposed to be in the logical types layer, or physical?
- Is `Scalar` supposed to be in the logical types layer, or physical?
- Is `ColumnarValue` purely physical layer?

remove variants of `ScalarValue` while still accounting for them via `data_type`

What do we need this for? (Maybe that's obvious for someone with clarity on the preceding questions.)
I think it is hard to tell the role given we are in the intermediate state. I would say all of `ColumnarValue`, `ScalarValue`, and `Scalar` are physical types now, given they have either `ArrayRef` or `DataType`.

The end state in my mind is that `ColumnarValue` stores `ArrayRef` for the multiple-row case and `ScalarValue` for the single-row case. `ScalarValue` has either a native type like `i64` or `arrow::Scalar`.

We will have `LogicalType`, which could be resolved from `DataType` given the mapping we determined, and which should be customizable for user-defined types as well. This is similar to the role of `Scalar` in this PR, which records the relationship between `ScalarValue` and `DataType`.
remove variants of ScalarValue while still accounting for them via data_type

Our first step is to minimize the number of type variants, so we no longer need `ScalarValue::Utf8` and `ScalarValue::LargeUtf8`, but instead `Scalar`, which has `ScalarValue::String` + `DataType::Utf8` or `ScalarValue::String` + `DataType::LargeUtf8`. We determine the actual physical type based on Arrow's `DataType`. `ScalarValue::String` is then something close to a logical type.
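A sketch of what that collapse could look like, using stub enums rather than the real DataFusion types: a single logical `ScalarValue::String` variant, with the concrete physical encoding carried alongside it in `Scalar`:

```rust
// Stub enums (not the real DataFusion types): one logical string variant,
// with the concrete physical encoding carried next to it.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    // Single variant replacing the Utf8 / LargeUtf8 / Utf8View value variants.
    String(Option<String>),
}

#[derive(Debug, Clone, PartialEq)]
struct Scalar {
    value: ScalarValue,
    data_type: DataType,
}

fn main() {
    // Same logical value, two different physical encodings.
    let small = Scalar {
        value: ScalarValue::String(Some("hi".into())),
        data_type: DataType::Utf8,
    };
    let large = Scalar {
        value: ScalarValue::String(Some("hi".into())),
        data_type: DataType::LargeUtf8,
    };
    assert_eq!(small.value, large.value); // logically equal
    assert_ne!(small.data_type, large.data_type); // physically distinct
}
```

The point of the shape is that code comparing or folding scalars can look only at `value`, while code producing arrays consults `data_type`.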
I think there is no difference for the literal case; we still get `Literal(ScalarValue)`.
`ScalarValue` should be used all over DataFusion; it is more of a value and has no logical or physical concept. We just need to bring along the `DataType` so we understand how to decode the value in the physical layer and what logical type it is when interacting with the logical plan.

`ScalarValue` along with `DataType` is a transition state; I guess we could get the `DataType` from the schema so we don't even require `DataType` in `ScalarValue`.
The rough idea is the following: as long as we have a `Schema` we can get the `DataType`, and we can get the `LogicalType` from it if each `DataType` has at most one `LogicalType`. `ScalarValue` is just a wrapper for the scalar case of `ArrayRef` or a Rust native type. We can carry `String` for `Utf8` / `Utf8View` / `LargeUtf8` and decode it if needed, and also carry `Scalar<ArrayRef>` for `List` / `LargeList` and so on.
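Under the stated assumption that each `DataType` maps to at most one `LogicalType` (an assumption questioned just below), the resolution step could be sketched with stub enums like this:

```rust
// Stub types; the mapping below assumes (as the comment above does) that
// each physical DataType resolves to at most one LogicalType.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    LargeUtf8,
    Utf8View,
    List,
    LargeList,
}

#[derive(Debug, Clone, PartialEq)]
enum LogicalType {
    Int32,
    String,
    List,
}

// Resolve the logical type from the physical type found in the schema.
fn logical_type(dt: &DataType) -> LogicalType {
    match dt {
        DataType::Int32 => LogicalType::Int32,
        DataType::Utf8 | DataType::LargeUtf8 | DataType::Utf8View => LogicalType::String,
        DataType::List | DataType::LargeList => LogicalType::List,
    }
}

fn main() {
    // All three string encodings resolve to the same logical type.
    assert_eq!(logical_type(&DataType::Utf8), LogicalType::String);
    assert_eq!(logical_type(&DataType::Utf8View), LogicalType::String);
}
```

The inverse direction is where the at-most-one assumption bites: several physical types share a logical type, so a logical value needs the schema (or a carried `DataType`) to pick its encoding.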
we can get the LogicalType from it if each DataType has at most one LogicalType
We should not assume this to be the case.
I suppose #12644 may convey why, but even if we don't do Extension Types, this assumption will be very limiting.
Brief code-level documentation -- what this new Scalar is, what it is not, what the constraints are, and what the goal of this evolution is -- is a must for this. I would like to review again once this is added, so that I understand better what we're trying to accomplish. The referenced issue #11513 is a good record of all the discussions, but it's not clear which comments are opinions that were later refined and which are normative. Thanks for the patience.
First of all thank you @notfilippo and @findepi and @jayzhan211 -- I think this is a very nice step towards splitting out logical and physical types.
I took a brief look through this PR and it seems pretty good to me (other than lack of comments)
So one way to proceed, if we agree on the basic idea, is to create a new "Epic" ticket that lists the steps planned (it is fine if we don't know them all now). Here is an example: apache/arrow-rs#5374
The first few steps might be:

- Add `Scalar`
- Remove `ScalarValue::LargeString` in favor of `Scalar`
- ...
Having such an epic would also make it clear what the current plan was (as opposed to the various options that had been previously discussed)
cc @ozankabak as this project appears to be moving along
```diff
@@ -266,7 +266,7 @@ macro_rules! decode_to_array {

 impl Encoding {
     fn encode_scalar(self, value: Option<&[u8]>) -> ColumnarValue {
-        ColumnarValue::Scalar(match self {
+        ColumnarValue::from(match self {
```
The switch from explicit use of a variant to a `From` conversion is a nice change regardless, FWIW, and we could potentially break it out into its own PR.
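For illustration, a stub sketch (not the actual DataFusion definitions) of why routing construction through `From` helps: call sites no longer name the variant's payload type, so that type can change later by updating only the impl:

```rust
// Stub definitions (not the actual DataFusion code) showing the pattern.
#[derive(Debug)]
enum ScalarValue {
    Binary(Option<Vec<u8>>),
}

#[derive(Debug)]
enum ColumnarValue {
    Scalar(ScalarValue),
}

// Call sites say `ColumnarValue::from(...)` instead of naming the variant,
// so the variant's payload type can later change (e.g. ScalarValue -> Scalar)
// by updating only this impl.
impl From<ScalarValue> for ColumnarValue {
    fn from(value: ScalarValue) -> Self {
        ColumnarValue::Scalar(value)
    }
}

fn main() {
    let v = ColumnarValue::from(ScalarValue::Binary(Some(vec![1, 2])));
    assert!(matches!(v, ColumnarValue::Scalar(_)));
}
```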
Will carefully review next week. Thank you for working on this.

I love the idea, @alamb! Just note that only maintainers can add to such an epic.

That would be helpful indeed!
I love the idea! My starting suggestion for potential issues is:

Also, @findepi -- I've finally settled on some docs. 😄
I think the author of a ticket can do so too. Another way that has worked in the past is just to leave comments, and I or another maintainer can copy the links to the description. We could also explore adding people who are interested as "triage"rs (access to tickets/PRs without write access to code): https://github.com/apache/infrastructure-asfyaml?tab=readme-ov-file#triage
```rust
/// physical type as [`DataType`] alongside the value represented as
/// [`ScalarValue`].
///
/// [`Scalar`] cannot be constructed directly as not all [`DataType`] variants
```
I don't think this is intentional (that some types can not be handled)
I'm not sure what you exactly mean 🤔. Currently (but potentially not in the future) the `DataType` of `Scalar` should be at least logically equivalent to the data type reported by `ScalarValue::data_type()`.
Thanks -- this is looking good to me
Let's let @ozankabak have a chance to look at it and then figure out how to move forward
Even if we just get to the point that ScalarValue has a different representation I think it would be an improvement in readability, though maybe that doesn't justify the code churn.
Maybe we can set up a zoom meeting for anyone interested in this topic to discuss and get alignment if possible (with a writeup afterwards) as suggested by @findepi
This looks good to me. The main challenge I had was extrapolating/imagining what the post-transitional state of things will really look like. At this point parts of the code seem redundant (as in @findepi's inline comments). I am unclear on what is the best way to proceed. I fear we may run into unforeseen issues (in downstream projects and/or upstream DF) as we take the subsequent steps (i.e. evolve

One way we can proceed is to create a branch and make progress on it and rebase/merge main weekly. We can discover upstream DF issues this way without disrupting the progress on

What do you all think about the path forward?
I love this approach. I think having this + the [Epic] tracking ticket will be a great way of managing progress for the proposal.

I think we are still discussing this PR, so clearing the approval flag so we don't merge it by accident.

So, should we progress with the approach @ozankabak suggested? cc @alamb @jayzhan211

Sure, we should work on the branch until the end state is clear to us.

Sounds good 🚀 So should I switch the target of this PR to the new branch? Can a maintainer create this branch?

I have created the branch https://github.com/apache/datafusion/tree/logical-types

I've changed the base branch of this PR to `logical-types`.

Let's merge to the branch and keep iterating there!
This PR represents the first step originating from experiment #11978, which itself stems from the broad objective described in proposal #11513.
Rationale for this change
This change introduces a `Scalar` type which identifies the physical representation of a scalar value. `Scalar` will replace `ScalarValue` in the `ColumnarValue` enum. This is done in order to allow, in future PRs, removing some redundant variants of `ScalarValue` and transitioning to a logical representation of scalar values in DataFusion.

Design rationale
The new `Scalar` type embeds both a `ScalarValue` and a `DataType`. While the type information might seem redundant considering the current implementation of `ScalarValue`, it will be beneficial once we start to remove redundant variants (such as replacing `ScalarValue::Utf8View` and `ScalarValue::LargeUtf8` while keeping `ScalarValue::Utf8`) in order to keep track of the represented type during physical execution.

Alternatives considered: `arrow::array::Scalar`
Arrow's array crate provides a `Scalar` type which implements `Datum` and which holds a reference to a `len() == 1` arrow array. While this type fully identifies the physical representation of a scalar value, it falls short when considering the overhead of copying a fully rendered `ArrayRef` instead of primitive types.

This possibility was also already discussed here: #7353 (comment).
What changes are included in this PR?

- Add the `Scalar` struct
- Replace `ScalarValue` with `Scalar` in `ColumnarValue`'s `Scalar` variant

Are these changes tested?
This change should be a no-op as it doesn't include any behavioural change. The existing test suite is sufficient to verify this change.
TODO