[draft] Add LogicalType, try to support user-defined types #8143

Conversation
The current PR has some unresolved issues requiring collaboration and discussion.
I've organized the logic for the mutual conversion between `DFSchema` and `Schema`; there is no need to change `DefaultPhysicalPlanner`.
Still to be changed: the `Schema` to `DFSchema` conversion.
Thanks @yukkit -- I plan to give this a look, but probably will not have time until tomorrow.
@@ -661,6 +667,28 @@ impl SessionContext {
        }
    }

    async fn create_type(&self, cmd: CreateType) -> Result<DataFrame> {
This is the core processing logic for creating SQL UDT.
self.register_data_type(
    type_signature.to_owned_type_signature(),
    extension_type,
)?;
Register the extension types created through SQL UDT.
@@ -1846,6 +1895,22 @@ impl SessionState {
        }
    }

impl TypeManager for SessionState {
`SessionState` implements `TypeManager` for registering and retrieving extension data types.
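The registration flow described here (a `TypeManager` keyed by `TypeSignature`, returning registered `ExtensionType`s) can be sketched with minimal stand-in types. Everything below is an illustration under assumed shapes, not the PR's actual definitions:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-in for the PR's `TypeSignature`: uniquely identifies a type.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct TypeSignature {
    pub name: String,
    pub params: Vec<String>,
}

impl TypeSignature {
    pub fn new(name: &str) -> Self {
        Self { name: name.to_string(), params: vec![] }
    }
}

// Hypothetical stand-in for the PR's `ExtensionType` abstraction.
pub trait ExtensionType {
    /// Human-readable name of the extension type.
    fn name(&self) -> &str;
}

/// Sketch of a `TypeManager`: register and look up extension types by signature.
pub trait TypeManager {
    fn register_data_type(&mut self, sig: TypeSignature, ty: Arc<dyn ExtensionType>);
    fn data_type(&self, sig: &TypeSignature) -> Option<Arc<dyn ExtensionType>>;
}

/// A minimal in-memory implementation, as `SessionState` might hold one.
#[derive(Default)]
pub struct Registry {
    types: HashMap<TypeSignature, Arc<dyn ExtensionType>>,
}

impl TypeManager for Registry {
    fn register_data_type(&mut self, sig: TypeSignature, ty: Arc<dyn ExtensionType>) {
        self.types.insert(sig, ty);
    }
    fn data_type(&self, sig: &TypeSignature) -> Option<Arc<dyn ExtensionType>> {
        self.types.get(sig).cloned()
    }
}

// An example user-defined type.
struct Geometry;
impl ExtensionType for Geometry {
    fn name(&self) -> &str { "geometry" }
}

fn main() {
    let mut registry = Registry::default();
    registry.register_data_type(TypeSignature::new("geometry"), Arc::new(Geometry));
    let found = registry.data_type(&TypeSignature::new("geometry")).unwrap();
    println!("registered: {}", found.name());
}
```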
@@ -37,7 +37,7 @@ use std::sync::Arc;
/// trait to allow expr to typable with respect to a schema
pub trait ExprSchemable {
    /// given a schema, return the type of the expr
-    fn get_type<S: ExprSchema>(&self, schema: &S) -> Result<DataType>;
+    fn get_type<S: ExprSchema>(&self, schema: &S) -> Result<LogicalType>;
Replace all arrow `DataType` used in the logical planning stage with `LogicalType` (literal `ScalarValue` TBD).
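The split being described — planning against `LogicalType` while execution stays on the arrow `DataType` — can be sketched as follows. The enum shapes and the `physical_type` mapping are assumptions for illustration, not the PR's real definitions:

```rust
/// Minimal stand-in for Arrow's physical `DataType` (the real one lives in arrow-schema).
#[derive(Clone, Debug, PartialEq)]
pub enum DataType {
    Int64,
    Utf8,
}

/// Sketch of the PR's `LogicalType`: what the planner reasons about.
/// An extension type keeps its own identity but still lowers to a physical type.
#[derive(Clone, Debug, PartialEq)]
pub enum LogicalType {
    Int64,
    Utf8,
    /// A user-defined type and the logical type it is stored as.
    Extension { name: String, storage: Box<LogicalType> },
}

impl LogicalType {
    /// Lower a logical type to the physical Arrow type used at execution time.
    pub fn physical_type(&self) -> DataType {
        match self {
            LogicalType::Int64 => DataType::Int64,
            LogicalType::Utf8 => DataType::Utf8,
            LogicalType::Extension { storage, .. } => storage.physical_type(),
        }
    }
}

fn main() {
    // A hypothetical geometry type stored as Utf8 (e.g. WKT text).
    let wkt = LogicalType::Extension {
        name: "geometry".to_string(),
        storage: Box::new(LogicalType::Utf8),
    };
    // Logically distinct from plain Utf8, but physically stored the same way.
    assert_ne!(wkt, LogicalType::Utf8);
    assert_eq!(wkt.physical_type(), DataType::Utf8);
    println!("geometry lowers to {:?}", wkt.physical_type());
}
```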
// TODO not convert to DataType
let data_types = data_types
    .into_iter()
    .map(|e| e.physical_type())
    .collect::<Vec<_>>();

Ok((fun.return_type)(&data_types)?.as_ref().clone().into())
TODO: Function signatures for udf/udaf use `TypeSignature` instead of the arrow `DataType`.
.ok_or_else(|| {
    plan_datafusion_err!(
// FIXME support logical data types, not convert to DataType
let data_type = comparison_coercion(
Modify the function `comparison_coercion` to accept `LogicalType` as parameters.
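A hedged sketch of what a `LogicalType`-aware `comparison_coercion` could look like. The variant set and the single widening rule are illustrative only, not DataFusion's actual coercion table:

```rust
#[derive(Clone, Debug, PartialEq)]
pub enum LogicalType {
    Int32,
    Int64,
    Utf8,
}

/// Sketch of `comparison_coercion` over logical types: find a common type
/// both sides of a comparison can be cast to, or None if they are incompatible.
pub fn comparison_coercion(lhs: &LogicalType, rhs: &LogicalType) -> Option<LogicalType> {
    match (lhs, rhs) {
        // identical types need no coercion
        _ if lhs == rhs => Some(lhs.clone()),
        // widen integers to the larger width
        (LogicalType::Int32, LogicalType::Int64)
        | (LogicalType::Int64, LogicalType::Int32) => Some(LogicalType::Int64),
        // no valid common type
        _ => None,
    }
}

fn main() {
    assert_eq!(
        comparison_coercion(&LogicalType::Int32, &LogicalType::Int64),
        Some(LogicalType::Int64)
    );
    assert_eq!(comparison_coercion(&LogicalType::Utf8, &LogicalType::Int64), None);
    println!("coercion sketch ok");
}
```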
@@ -886,39 +887,34 @@ pub(crate) fn find_column_indexes_referenced_by_expr(
/// can this data type be used in hash join equal conditions??
/// data types here come from function 'equal_rows', if more data types are supported
/// in equal_rows(hash join), add those data types here to generate join logical plan.
-pub fn can_hash(data_type: &DataType) -> bool {
+pub fn can_hash(data_type: &LogicalType) -> bool {
This function `can_hash` might be abstracted as a property of logical data types.
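As a sketch of that abstraction, `can_hash` could become a method on the logical type itself instead of a free function over Arrow's `DataType`. The exact set of hashable types below is illustrative, not DataFusion's real list:

```rust
#[derive(Clone, Debug, PartialEq)]
pub enum LogicalType {
    Int64,
    Utf8,
    Float64,
    /// list with an element type
    List(Box<LogicalType>),
}

impl LogicalType {
    /// `can_hash` expressed as a property of the logical type:
    /// can values of this type be used in hash join equal conditions?
    pub fn can_hash(&self) -> bool {
        match self {
            LogicalType::Int64 | LogicalType::Utf8 => true,
            // illustrative exclusions: floats have NaN != NaN semantics,
            // nested types would need per-element logic
            LogicalType::Float64 | LogicalType::List(_) => false,
        }
    }
}

fn main() {
    assert!(LogicalType::Int64.can_hash());
    assert!(!LogicalType::List(Box::new(LogicalType::Int64)).can_hash());
    println!("can_hash sketch ok");
}
```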
-let right_type = pattern.get_type(&self.schema)?;
+// FIXME like_coercion use LogicalType
+let left_type = expr.get_type(&self.schema)?.physical_type();
+let right_type = pattern.get_type(&self.schema)?.physical_type();
let coerced_type = like_coercion(&left_type, &right_type).ok_or_else(|| {
TODO: Modify the function `like_coercion` to accept `LogicalType` as parameters.
@@ -295,15 +303,16 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
    })
}

-pub(crate) fn convert_data_type(&self, sql_type: &SQLDataType) -> Result<DataType> {
+pub(crate) fn convert_data_type(
Convert data types in the AST to logical data types.
SQLDataType::Custom(name, params) => {
    let name = object_name_to_string(name);
    let params = params.iter().map(Into::into).collect();
    let type_signature = TypeSignature::new_with_params(name, params);
    if let Some(data_type) = self.context_provider.get_data_type(&type_signature) {
        return Ok(data_type);
    }

    plan_err!("User-Defined SQL type {sql_type:?} not found")
}
Convert custom data types.
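The resolution path in the hunk above — build a `TypeSignature` from the custom type's name and parameters, ask the context provider, and error on unknown types — might be sketched like this with simplified stand-ins for the planner types:

```rust
use std::collections::HashMap;

// Stand-in for the PR's `TypeSignature`.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct TypeSignature {
    pub name: String,
    pub params: Vec<String>,
}

#[derive(Clone, Debug, PartialEq)]
pub enum LogicalType {
    Utf8,
    Extension(String),
}

/// Stand-in for the planner's `ContextProvider` lookup of registered UDTs.
pub struct ContextProvider {
    pub types: HashMap<TypeSignature, LogicalType>,
}

impl ContextProvider {
    pub fn get_data_type(&self, sig: &TypeSignature) -> Option<LogicalType> {
        self.types.get(sig).cloned()
    }
}

/// Sketch of resolving `SQLDataType::Custom(name, params)`:
/// build a signature, look it up, and fail on unknown types.
pub fn convert_custom_type(
    provider: &ContextProvider,
    name: &str,
    params: &[&str],
) -> Result<LogicalType, String> {
    let sig = TypeSignature {
        name: name.to_string(),
        params: params.iter().map(|p| p.to_string()).collect(),
    };
    provider
        .get_data_type(&sig)
        .ok_or_else(|| format!("User-Defined SQL type {name:?} not found"))
}

fn main() {
    let mut types = HashMap::new();
    types.insert(
        TypeSignature { name: "geometry".to_string(), params: vec![] },
        LogicalType::Extension("geometry".to_string()),
    );
    let provider = ContextProvider { types };
    println!("{:?}", convert_custom_type(&provider, "geometry", &[]));
}
```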
This is an epic PR @yukkit -- I looked through it and I really liked the structure and the overall code. Really nice 👏
I think the next steps for this PR are to circulate it more widely -- perhaps via a note to the arrow mailing list, and discord / slack channels (I can help with this) and potentially have people gauge how much disruption this will cause with downstream crates (I can test with IOx perhaps, and see what that experience is like)
Again, thanks for pushing this forward
-    _ => dt1 == dt2,
-}
+fn datatype_is_logically_equal(dt1: &LogicalType, dt2: &LogicalType) -> bool {
+    dt1 == dt2
that is certainly nicer
data_type: LogicalType,
nullable: bool,
/// A map of key-value pairs containing additional custom meta data.
metadata: HashMap<String, String>,
Maybe we could store these in a `BTreeMap` to avoid sorting them each time 🤔
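A small illustration of the reviewer's point: `HashMap` iteration order is unspecified, so comparing or serializing metadata deterministically means sorting keys every time, while `BTreeMap` iterates in key order for free:

```rust
use std::collections::{BTreeMap, HashMap};

fn main() {
    // HashMap: iteration order is unspecified, so deterministic output
    // requires collecting and sorting the keys first.
    let mut hashed: HashMap<String, String> = HashMap::new();
    hashed.insert("b".into(), "2".into());
    hashed.insert("a".into(), "1".into());
    let mut hash_keys: Vec<_> = hashed.keys().cloned().collect();
    hash_keys.sort(); // extra work on every comparison/serialization

    // BTreeMap: entries are kept ordered by key, so iteration is already sorted.
    let mut ordered: BTreeMap<String, String> = BTreeMap::new();
    ordered.insert("b".into(), "2".into());
    ordered.insert("a".into(), "1".into());
    let btree_keys: Vec<_> = ordered.keys().cloned().collect();

    // Both are sorted now, but only one needed an explicit sort.
    assert_eq!(hash_keys, btree_keys);
    println!("keys in order: {:?}", btree_keys);
}
```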
use arrow_schema::{DataType, Field, IntervalUnit, TimeUnit};

#[derive(Clone, Debug)]
pub enum LogicalType {
Eventually we should add doc comments here, but it also makes sense to avoid overdoing it on an RFC / draft.
        .collect::<Vec<_>>();
    DataType::Struct(fields.into())
}
other => panic!("not implemented {other:?}"),
Is the idea that `DataType::Dictionary` and `DataType::RunEndEncoded` would also be included here? If so it makes sense.
Thanks for the heads up. This should include `DataType::Dictionary` and `DataType::RunEndEncoded` here.
@@ -253,6 +253,7 @@ pub struct ListingOptions {
pub format: Arc<dyn FileFormat>,
/// The expected partition column names in the folder structure.
/// See [Self::with_table_partition_cols] for details
/// TODO this maybe LogicalType
I think one use case is to use dictionary encoding for the partition columns to minimize the overhead of creating such columns. As long as they can be represented / specified with `LogicalType`, I think it is a good idea to try changing this too.
What's the status of this PR? This should be a very useful feature.
I think this PR is stalled and I don't have any update.
Please accept my apologies for the delay. Due to personal circumstances, I have been unable to attend to any work. I will now proceed to resume work on this PR.
No worries at all -- I hope all is well and we look forward to this work.
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
Hello, sorry if this is a redundant question. What is the status of this PR?
I think it is stale and on track to be closed from what I can see.
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
FYI #11160 tracks a new proposal for this feature. It seems to be gaining traction |
Which issue does this PR close?

Closes #7923.

The current pull request is an experimental demo for validating the feasibility of logical types.

Rationale for this change

What changes are included in this PR?

Features
- Support `UDT` as a function signature in `udf/udaf`.
- A `register_data_type` function in the `SessionContext`.

New Additions
- `LogicalType` struct.
- `ExtensionType` trait: an abstraction for extension types.
- `TypeSignature` struct: uniquely identifies a data type.

Major Changes
- Added a `get_data_type(&self, _name: &TypeSignature) -> Option<LogicalType>` function to the `ContextProvider` trait.
- `DFSchema`/`DFField` now use `LogicalType`, removing the arrow `Field` and retaining only `data_type`, `nullable`, and `metadata`, since `dict_id` and `dict_is_ordered` are not necessary at the logical stage.
- `ExprSchemable` and `ExprSchema` now use `LogicalType`.
- `ast`-to-logical-plan conversion now uses `LogicalType`.

To Be Implemented
- `TypeCoercionRewriter` in the analyze stage uses logical types, for example in functions like `comparison_coercion`, `get_input_types`, `get_valid_types`, etc.
- `udf/udaf` use `TypeSignature` instead of the existing `DataType` for ease of use in `udf/udaf`.
To Be Determined
- Should `ScalarValue` use `LogicalType` or the arrow `DataType`?
- Should `TableSource` return `DFSchema` or the arrow `Schema`? (Returning `DFSchema` requires converting `DFSchema` to `Schema`; logical plans use `DFSchema`, physical plans use `Schema`.)
- How to convert `Schema` to `DFSchema`? In the `TableScan` node, obtain the arrow `Schema` through `TableSource/TableProvider` and then convert it to `DFSchema`; alternatively, have `TableSource/TableProvider` return `DFSchema` instead of `Schema`.
- How to convert `DFSchema` to `Schema`? Obtain the arrow `Schema` from `TableSource` in the physical planner, so no conversion is needed; alternatively, convert the `DFSchema` returned by `TableSource` to `Schema` in the physical planner stage.

Some Thoughts
The question of converting `dyn Array` to `LineStringArray` or `MultiPointArray` was raised. In my perspective, assuming there is a function specifically designed for handling `LineString` data, the function signature can be defined as `LineString`, ensuring that the input data must be of a type acceptable by `LineStringArray`.
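The `LineString` idea above can be sketched as a planning-time check: a function signature declared in terms of a logical extension type accepts only arguments of that logical type, even when the physical storage would be the same as a plain string. All names here are invented for illustration:

```rust
#[derive(Clone, Debug, PartialEq)]
pub enum LogicalType {
    Utf8,
    /// hypothetical user-defined type, identified by name (e.g. "LineString")
    Extension(String),
}

/// Sketch of a UDF signature declared in terms of logical types.
pub struct Signature {
    pub arg_types: Vec<LogicalType>,
}

/// Type-check a call site against the signature at planning time:
/// a `LineString` argument is accepted, a plain string is rejected.
pub fn check_call(sig: &Signature, args: &[LogicalType]) -> Result<(), String> {
    if sig.arg_types.as_slice() == args {
        Ok(())
    } else {
        Err(format!("expected {:?}, got {:?}", sig.arg_types, args))
    }
}

fn main() {
    let line_string = LogicalType::Extension("LineString".to_string());
    let sig = Signature { arg_types: vec![line_string.clone()] };
    // Matching logical type: accepted.
    assert!(check_call(&sig, &[line_string]).is_ok());
    // Plain Utf8, even if physically identical storage: rejected.
    assert!(check_call(&sig, &[LogicalType::Utf8]).is_err());
    println!("signature check sketch ok");
}
```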
Are these changes tested?
Are there any user-facing changes?