Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: specify the storage_dtype in ExtDType #1007

Merged
merged 4 commits into from
Nov 1, 2024
Merged

feat: specify the storage_dtype in ExtDType #1007

merged 4 commits into from
Nov 1, 2024

Conversation

a10y
Copy link
Contributor

@a10y a10y commented Oct 9, 2024

Adds scalars_dtype to ExtDType.

This PR adds scalars_dtype (alternative name options: canonical_dtype, storage_dtype) to ExtDType.

This is desirable for a few reasons

  • Makes it possible to canonicalize an empty chunked ExtensionArray
  • Makes it possible to determine the storage DType for a ConstantArray without examining its value
  • Makes it possible for Vortex to reason about externally authored extension types. This is still not fully complete, as an ideal experience would allow extension authors to override IntoCanonical, IntoArrow, Display, etc.

To avoid duplicating the nullability, we remove top-level nullability from the DType::Extension variant, instead nullability is accessed through the inner ExtDType.

This has the unfortunate effect of bringing size_of::<DType>() from 40 -> 48 bytes, which obviously makes every array's metadata 8 bytes larger.

let storage_dtype = storage_chunks
.first()
.ok_or_else(|| vortex_err!("Expected at least one chunk in ChunkedArray"))?
.dtype()
.clone();
let storage_dtype = ext_dtype.scalars_dtype().clone();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this works now

let scalar_ext = ExtScalar::try_new(self.dtype(), self.scalar_value())
.vortex_expect("Expected an extension scalar");

// FIXME(ngates): there's not enough information to get the storage array.
let n = self.dtype().nullability();
let storage_dtype = match scalar_ext.value() {
ScalarValue::Bool(_) => DType::Binary(n),
ScalarValue::Primitive(pvalue) => DType::Primitive(pvalue.ptype(), n),
ScalarValue::Buffer(_) => DType::Binary(n),
ScalarValue::BufferString(_) => DType::Utf8(n),
ScalarValue::List(_) => vortex_panic!("List not supported"),
ScalarValue::Null => DType::Null,
};

let storage_dtype = self.ext_dtype().scalars_dtype().clone();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this works now too

xtask/src/main.rs Show resolved Hide resolved
Comment on lines 25 to 26
#[lints]
#workspace = true
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't figure out a better way to kill the clippy errors on the generated code.

I tried both #[allow(clippy)] and #[allow(clippy::all)] on the module declarations and that didn't do it 🤷

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clippy::all doesn’t mean “all” it means the things that are not pedantic or nursery or something like that

@@ -1,22 +1,18 @@
// This file is @generated by prost-build.
#[allow(clippy::derive_partial_eq_without_eq)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was changed in tokio-rs/prost#1115 so we now need to allow(clippy::nursery) in the module level

vortex-array/src/canonical.rs Outdated Show resolved Hide resolved
@a10y a10y marked this pull request as ready for review October 30, 2024 21:20
@a10y a10y changed the title add scalars_dtype to ExtDType add storage_dtype to ExtDType Oct 30, 2024
@a10y a10y changed the title add storage_dtype to ExtDType feat: add storage_dtype to ExtDType Oct 30, 2024
@a10y a10y changed the title feat: add storage_dtype to ExtDType feat: specify the storage_dtype in ExtDType Oct 30, 2024
pyvortex/src/python_repr.rs Outdated Show resolved Hide resolved
@@ -14,7 +14,7 @@ use vortex_error::vortex_panic;
use vortex_scalar::{PValue, Scalar, ScalarValue};

pub fn scalar_into_py(py: Python, x: Scalar, copy_into_python: bool) -> PyResult<PyObject> {
let (value, dtype) = x.into_parts();
let (dtype, value) = x.into_parts();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👀

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol i guess dan wrote the same method at the same time as me we just wrote the args in a different order

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i just sort of yolo resolved the conflict and this was the outcome

ext.metadata()
.map(|m| format!(", {:?}", m))
.unwrap_or_else(|| "".to_string()),
n
ext.storage_dtype().nullability(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the nullability split thing here might read weird (it will say NonNullable above... and then maybe-Nullable here?)

@a10y a10y enabled auto-merge (squash) November 1, 2024 20:58
@a10y a10y merged commit 2d439a2 into develop Nov 1, 2024
5 checks passed
@a10y a10y deleted the aduffy/ext-types branch November 1, 2024 21:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants