[DO NOT MERGE] Refactor type system to provide better extensibility of types and functions #315

jianqiao · 2017-10-11T21:32:08Z

This is a preliminary PR that is not ready to be merged but provides an overall view of the type system refactoring work. Many constructs are still initial designs and may be further improved.

The PR aims at reviewing the refactoring designs at the "architecture" level. Detailed code style and unit test issues may be addressed later in subsequent concrete PRs.

The overall purpose of the refactoring is to improve the extensibility of the existing type/function system (i.e. support more kinds of types/functions and make it easier to add new types and functions), while retaining the performance of the current system.

Major Changes

Part I. Type System

1. Categorize all types into four memory layouts.

The four memory layouts are:

CxxInlinePod (C++ plain old data)
ParInlinePod (Parameterized inline plain old data)
ParOutOfLinePod (Parameterized out-of-line plain old data)
CxxGeneric (C++ generic types)

Memory layout decides how the corresponding type's values are stored and represented.

Briefly speaking,

CxxInlinePod corresponds to C++ primitive types or POD structs.
- E.g. int, double, struct { double x; double y; }.
- The size of a CxxInlinePod value is known at C++ compile time (e.g. double has size 8, struct { double x; double y; } has size 16).
ParInlinePod corresponds to database-defined "fixed length" types.
- E.g. Char(8), Char(20).
- The size of such a type's values is not known at C++ compile time. Instead, the type is parameterized by an unsigned integer whose value is known at SQL query compile time (which is C++ run-time).
ParOutOfLinePod corresponds to database-defined "variable length" types.
- E.g. Varchar(20).
- The size of such a type's values is not known until SQL query run-time.
CxxGeneric corresponds to general C++ types (i.e. any C++ type).
- E.g. std::set<int>, std::vector<const Type*>.
- Such types have to implement serialization/deserialization methods to have storage support.
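
As a rough illustration of the classification above, the sketch below encodes the four layouts and the point at which a value's byte size becomes known. The enum names, SizeKnown(), CharByteSize() and VarcharByteSize() are hypothetical and are not taken from the PR's code; the size helpers ignore any terminator or header bytes the real storage format may add.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical enum mirroring the four layouts named in this PR.
enum class MemoryLayout { kCxxInlinePod, kParInlinePod, kParOutOfLinePod, kCxxGeneric };

// Hypothetical: the phase at which the byte size of a value becomes known.
enum class SizeKnownAt { kCxxCompileTime, kSqlCompileTime, kSqlRunTime, kNotApplicable };

constexpr SizeKnownAt SizeKnown(MemoryLayout layout) {
  return layout == MemoryLayout::kCxxInlinePod      ? SizeKnownAt::kCxxCompileTime
         : layout == MemoryLayout::kParInlinePod    ? SizeKnownAt::kSqlCompileTime
         : layout == MemoryLayout::kParOutOfLinePod ? SizeKnownAt::kSqlRunTime
         // CxxGeneric values rely on serialization/deserialization instead.
         : SizeKnownAt::kNotApplicable;
}

// CxxInlinePod: the C++ compiler fixes the size.
struct Point { double x; double y; };
static_assert(sizeof(Point) == 2 * sizeof(double), "POD size is a compile-time constant");

// ParInlinePod: Char(n) occupies n bytes, where n is the type parameter
// that is fixed once the SQL query has been compiled.
inline std::size_t CharByteSize(std::size_t n) { return n; }

// ParOutOfLinePod: Varchar(n) has a per-value size bounded by n,
// so it is only known when the query actually runs.
inline std::size_t VarcharByteSize(const std::string &value, std::size_t max_length) {
  assert(value.size() <= max_length);
  return value.size();
}
```

For example, CharByteSize(8) is 8 for every Char(8) value, while VarcharByteSize("abc", 20) is 3 and differs from value to value.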

2. Use TypeIDTrait to make much type information available at compile time.

With this per-type trait information, we can avoid much boilerplate code in each subclass of Type by using template techniques and specializing on the memory layout. See TypeSynthesizer and TypeFactory.

TypeIDTrait is also extensively used in many other places as it provides all the required compile-time information about a type.
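
A minimal sketch of the idea, assuming a trait keyed by a type ID: the member names (cpptype, kLayout) and the StaticByteSize helper are illustrative assumptions and may differ from the actual TypeIDTrait in this PR. It shows how one generic template can be specialized on the memory layout instead of repeating code in every Type subclass.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

enum class TypeID { kInt, kDouble, kChar };
enum class MemoryLayout { kCxxInlinePod, kParInlinePod, kParOutOfLinePod, kCxxGeneric };

// Hypothetical per-type trait: one specialization per TypeID.
template <TypeID id>
struct TypeIDTrait;

template <>
struct TypeIDTrait<TypeID::kInt> {
  using cpptype = std::int32_t;
  static constexpr MemoryLayout kLayout = MemoryLayout::kCxxInlinePod;
};

template <>
struct TypeIDTrait<TypeID::kDouble> {
  using cpptype = double;
  static constexpr MemoryLayout kLayout = MemoryLayout::kCxxInlinePod;
};

template <>
struct TypeIDTrait<TypeID::kChar> {
  using cpptype = void;  // Char(n) has no single C++ value type.
  static constexpr MemoryLayout kLayout = MemoryLayout::kParInlinePod;
};

// Default case: the byte size is not a C++ compile-time constant.
template <TypeID id, typename Enable = void>
struct StaticByteSize {
  static constexpr bool kKnown = false;
  static constexpr std::size_t kValue = 0;
};

// Specialization selected for every CxxInlinePod type at once.
template <TypeID id>
struct StaticByteSize<id, typename std::enable_if<
    TypeIDTrait<id>::kLayout == MemoryLayout::kCxxInlinePod>::type> {
  static constexpr bool kKnown = true;
  static constexpr std::size_t kValue = sizeof(typename TypeIDTrait<id>::cpptype);
};

static_assert(StaticByteSize<TypeID::kDouble>::kValue == sizeof(double), "inline POD size");
static_assert(!StaticByteSize<TypeID::kChar>::kKnown, "Char(n) size depends on n");
```

Adding another inline POD type in this sketch only requires a new TypeIDTrait specialization; StaticByteSize picks it up without changes.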

3. Support more types.

Details will be written later about how to add a new type into the Quickstep system.

The current PR has some example types added:

The Bool type. It will be used later for connecting scalar functions and predicates.
The Text type. A general non-parameterized string type.
- TODO: We need some updates in the storage block module (potentially also other places) to handle the "infinite maximum byte size" types.
The MetaType type. It is the "type of types", i.e. a value of MetaType has C++ type const Type*.
The Array type. A generic type that represents an array. This type takes a MetaType value as its parameter, which specifies the array's element type.
- TODO: We need specialized array types such as IntArray and TextArray for performance reasons.

4. Improve the type casting mechanism.

Type casting (coercion) is an important feature that is frequently needed in practice.

This PR's design defines an overall template

template <typename SourceType, typename TargetType, typename Enable = void>
struct CastFunctor;

which is then specialized by different source/target types.

The coercibility between two types is then inferred according to whether the corresponding specialization exists. Thus it suffices to just specialize CastFunctor when adding a new casting operation, and all the dependent places (e.g. Type::isCoercibleFrom()) will mostly be auto-generated by the system (unless the target type is a parameterized type and you want to do some further checks).
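
One way such inference can work is to detect at compile time whether a CastFunctor specialization is defined. The sketch below is an assumption about the technique, not the PR's actual code: IntType, DoubleType, TextType, IsCoercible and Coerce are hypothetical stand-ins, and only the CastFunctor declaration is taken from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <type_traits>

// Hypothetical stand-ins for Type subclasses, each naming its C++ value type.
struct IntType { using cpptype = std::int32_t; };
struct DoubleType { using cpptype = double; };
struct TextType { using cpptype = std::string; };

// Primary template is declared but never defined.
template <typename SourceType, typename TargetType, typename Enable = void>
struct CastFunctor;

template <>
struct CastFunctor<IntType, DoubleType> {
  double operator()(std::int32_t value) const { return static_cast<double>(value); }
};

template <>
struct CastFunctor<IntType, TextType> {
  std::string operator()(std::int32_t value) const { return std::to_string(value); }
};

// Coercibility is inferred from whether CastFunctor<S, T> is a complete type:
// sizeof on an incomplete type is a substitution failure, so the primary
// (false) template is chosen when no specialization exists.
template <typename S, typename T, typename = void>
struct IsCoercible : std::false_type {};

template <typename S, typename T>
struct IsCoercible<S, T, decltype(void(sizeof(CastFunctor<S, T>)))> : std::true_type {};

// Generic entry point that needs no per-cast code.
template <typename S, typename T>
typename T::cpptype Coerce(const typename S::cpptype &value) {
  static_assert(IsCoercible<S, T>::value, "no CastFunctor specialization for this cast");
  return CastFunctor<S, T>()(value);
}
```

In this sketch, IsCoercible<IntType, TextType>::value is true and IsCoercible<TextType, IntType>::value is false, purely because of which specializations exist; adding a Text-to-Int cast would only require one more CastFunctor specialization.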

Note that safe-coercibility is a separate issue and needs to be taken care of mostly manually, by overriding Type::isSafelyCoercibleFrom().

Explicit casting is supported with a PostgreSQL-like syntax. E.g.

(1)

SELECT (i::text + (i+1)::text)::int AS result FROM generate_series(1, 3) AS g(i);

--
+-----------+
|result     |
+-----------+
|         12|
|         23|
|         34|
+-----------+

(2)

CREATE TABLE r(x varchar(16));

INSERT INTO r SELECT pow(10, i)::varchar(10) FROM generate_series(1, 3) AS g(i);

SELECT 'There are ' + length(x)::varchar(10) + ' characters in ' + x AS result FROM r;

--
+---------------------------------------------------+
|result                                             |
+---------------------------------------------------+
|                       There are 2 characters in 10|
|                      There are 3 characters in 100|
|                     There are 4 characters in 1000|
+---------------------------------------------------+

(3)

SELECT {1,2,3}::array(double) AS result from generate_series(1, 1);

--
+--------------------------------+
|result                          |
+--------------------------------+
|                         {1,2,3}|
+--------------------------------+

NOTE: The work is not yet fully completed so there may be LOG(FATAL) aborts for some combinations of queries.

Implicit coercion is supported when resolving scalar functions. For example, we have support for the sqrt function where the parameter can be a Float or Double value. Consider the query

SELECT sqrt(x) FROM r;

where x has type Int. An implicit coercion from Int to Float is then added.

5. Add GenericValue to represent typed-values of all four memory layouts.

The original TypedValue is not sufficient to represent CxxGeneric values, as we need to embed the overall Type information in order to handle value allocation/copy/destruction. However, for performance reasons, we cannot simply replace TypedValue with a more generic but slower implementation. Thus, a separate GenericValue is added and we still use TypedValue when handling storage-related operations.
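
To make the "embed the Type information" point concrete, here is a minimal sketch of a value that carries a pointer to its type and delegates copy and destruction to it. The cloneValue()/destroyValue() hooks, CxxGenericType and getLiteral() are hypothetical names chosen for illustration; the real GenericValue interface may look quite different.

```cpp
#include <cassert>
#include <set>
#include <utility>

// Hypothetical minimal Type interface: the type knows how to copy and
// destroy its own values, which a plain byte copy cannot do for CxxGeneric.
class Type {
 public:
  virtual ~Type() {}
  virtual void* cloneValue(const void *value) const = 0;
  virtual void destroyValue(void *value) const = 0;
};

// A CxxGeneric type backed by an arbitrary copyable C++ type T.
template <typename T>
class CxxGenericType : public Type {
 public:
  void* cloneValue(const void *value) const override {
    return new T(*static_cast<const T*>(value));
  }
  void destroyValue(void *value) const override {
    delete static_cast<T*>(value);
  }
};

class GenericValue {
 public:
  // Takes ownership of 'value', which must be a value of 'type'.
  // 'type' must outlive this GenericValue.
  GenericValue(const Type &type, void *value) : type_(&type), value_(value) {
    assert(value != nullptr);
  }
  GenericValue(const GenericValue &other)
      : type_(other.type_), value_(other.type_->cloneValue(other.value_)) {}
  GenericValue& operator=(GenericValue other) {  // copy-and-swap
    std::swap(type_, other.type_);
    std::swap(value_, other.value_);
    return *this;
  }
  ~GenericValue() { type_->destroyValue(value_); }

  // The caller is responsible for asking for the right C++ type.
  template <typename T>
  const T& getLiteral() const { return *static_cast<const T*>(value_); }

 private:
  const Type *type_;
  void *value_;
};
```

The virtual calls on every copy and destruction are the kind of overhead that motivates keeping TypedValue for storage-related paths.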

6. Move type resolving from parser to resolver.

This avoids the need to modify SqlParser.ypp when adding a new type.

See ParseDataType and Resolver::resolveDataType().

~

Part II. Scalar Function

1. Implement UnaryOperationSynthesizer/UncheckedUnaryOperatorSynthesizer to make it easier to add unary functions.

Example unary functions:

2. Implement BinaryOperationSynthesizer/UncheckedBinaryOperatorSynthesizer to make it easier to add binary functions.

Example binary functions:

3. Use OperationSignature and OperationFactory to support general operation resolution.

See OperationFactory::OperationFactory() for how operations are registered.
See Resolver::resolveScalarFunction() for how a function in a SQL query gets resolved.
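
The registration and resolution flow can be pictured roughly as follows. This is a simplified sketch under stated assumptions: the fields of OperationSignature, the registerOperation() and resolve() methods, the two-pass matching order and the coercion table are all hypothetical, and only the class names and the sqrt example come from this description.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class TypeID { kInt, kFloat, kDouble };

// Hypothetical signature: operation name plus argument types.
struct OperationSignature {
  std::string name;
  std::vector<TypeID> argument_types;
};

// Hypothetical implicit coercion table: Int -> Float -> Double.
inline bool IsImplicitlyCoercible(TypeID from, TypeID to) {
  if (from == to) return true;
  if (from == TypeID::kInt) return to == TypeID::kFloat || to == TypeID::kDouble;
  if (from == TypeID::kFloat) return to == TypeID::kDouble;
  return false;
}

class OperationFactory {
 public:
  // All built-in operations are registered in the constructor.
  OperationFactory() {
    registerOperation({"sqrt", {TypeID::kFloat}});
    registerOperation({"sqrt", {TypeID::kDouble}});
  }

  void registerOperation(OperationSignature signature) {
    assert(!signature.name.empty());
    signatures_.push_back(std::move(signature));
  }

  // Returns the matching signature, or nullptr if none applies.
  const OperationSignature* resolve(const std::string &name,
                                    const std::vector<TypeID> &args) const {
    // Pass 1: exact match on name and argument types.
    for (const OperationSignature &sig : signatures_) {
      if (sig.name == name && sig.argument_types == args) return &sig;
    }
    // Pass 2: first registered signature reachable by implicit coercion.
    for (const OperationSignature &sig : signatures_) {
      if (sig.name != name || sig.argument_types.size() != args.size()) continue;
      bool coercible = true;
      for (std::size_t i = 0; i < args.size(); ++i) {
        coercible = coercible && IsImplicitlyCoercible(args[i], sig.argument_types[i]);
      }
      if (coercible) return &sig;
    }
    return nullptr;
  }

 private:
  std::vector<OperationSignature> signatures_;
};
```

With this sketch, resolving sqrt with an Int argument finds no exact match and falls back to the sqrt(Float) signature, which matches the implicit Int-to-Float coercion described in Part I.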

~

Part III. TODOs

There are a lot of TODO(refactor-type) markers in the code to be fixed.
Refactor the predicate system (we will have something like ComparisonSynthesizer).
A lot of unit tests are broken (due to API changes) and need to be fixed.
Improve the comments and style of the template metaprogramming code.
More to be added ...

hbdeshmukh · 2017-10-12T14:34:15Z

@jianqiao This seems like a massive effort! Would you please create a JIRA issue for this PR? You can just copy the content of this PR description in the issue text.

jianqiao added 13 commits October 11, 2017 13:37

Refactor type system and operations.

cb56450

Some updates

0200550

Updates for adding generic types

ebf44cd

Add array expression

a7031a3

Continue the work

b6fd31f

Updates for array type

1e69fb1

Updates to meta type

bef66ad

Add text type

3a3772d

Type as first class citizen

0957264

More updates to types

9cb664c

More updates, refactor names

1cb97e3

Updates to casts

477c385

Updates to implicit casts

a3aec8e
