This repository has been archived by the owner on Feb 8, 2019. It is now read-only.

[DO NOT MERGE] Refactor type system to provide better extensibility of types and functions #315

Open
wants to merge 13 commits into
base: master

jianqiao
Contributor

This is a preliminary PR that is not ready to be merged but provides an overall view of the type system refactoring work. Many constructs are at their initial designs and may be further improved.

The PR aims at reviewing the refactoring designs at the "architecture" level. Detailed code style and unit test issues may be addressed later in subsequent concrete PRs.

The overall purpose of the refactoring is to improve the extensibility of the existing type/function system (i.e. support more kinds of types/functions and make it easier to add new types and functions), while retaining the performance of the current system.

Major Changes

Part I. Type System


1. Categorize all types into four memory layouts.

The four memory layouts are:

  • CxxInlinePod (C++ plain old data)
  • ParInlinePod (Parameterized inline plain old data)
  • ParOutOfLinePod (Parameterized out-of-line plain old data)
  • CxxGeneric (C++ generic types)

Memory layout decides how the corresponding type's values are stored and represented.

Briefly speaking,

  • CxxInlinePod corresponds to C++ primitive types or POD structs.
    • E.g. int, double, struct { double x; double y; }.
    • The size of a CxxInlinePod value is known at C++ compile time (e.g. double has size 8, struct { double x; double y; } has size 16).
  • ParInlinePod corresponds to database defined "fixed length" types.
    • E.g. Char(8), Char(20).
    • The size of such a type's values is not known at C++ compile time. Instead, the type is parameterized by an unsigned integer whose value is known at SQL query compile time (which is C++ run-time).
  • ParOutOfLinePod corresponds to database defined "variable length" types.
    • E.g. Varchar(20).
    • The size of such a type's values is not known until SQL query run-time.
  • CxxGeneric corresponds to C++ general types (i.e. any C++ type).
    • E.g. std::set<int>, std::vector<const Type*>.
    • Such types have to implement serialization/deserialization methods to have storage support.
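
As a rough illustration of the categorization above, the sketch below shows where the size of a value comes from under each layout. All names here (MemoryLayout, Point, InlinePodSize, CharSize, OutOfLineValue) are hypothetical and are not taken from the PR's code; CharSize also assumes one byte per character with no terminator.

```cpp
#include <cstddef>

// Hypothetical enum; the PR's actual definition may differ.
enum class MemoryLayout {
  kCxxInlinePod,     // size fixed at C++ compile time
  kParInlinePod,     // size fixed by a type parameter at SQL compile time
  kParOutOfLinePod,  // size known only per value, at SQL run time
  kCxxGeneric        // arbitrary C++ object, needs (de)serialization
};

// A CxxInlinePod-style struct.
struct Point {
  double x;
  double y;
};

// CxxInlinePod: sizeof() answers the question at C++ compile time.
template <typename T>
constexpr std::size_t InlinePodSize() {
  return sizeof(T);
}

// ParInlinePod: the size is a function of the type parameter, e.g. Char(20).
// Assumption: one byte per character, no terminator.
constexpr std::size_t CharSize(std::size_t length_parameter) {
  return length_parameter;
}

// ParOutOfLinePod: the parameter only bounds the size, so every value
// has to carry its own length next to a pointer to the bytes.
struct OutOfLineValue {
  const void* data;
  std::size_t size;
};
```

CxxGeneric has no counterpart here because its values are ordinary C++ objects whose storage form is whatever their serialization methods produce.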

2. Use TypeIDTrait to make much type information available at compile time.

With this per-type trait information, we can avoid much boilerplate code in each subclass of Type by using template techniques and specializing on the memory layout. See TypeSynthesizer and TypeFactory.

TypeIDTrait is also extensively used in many other places as it provides all the required compile-time information about a type.
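
A minimal sketch of what such a per-type trait could look like is given below. The member names (cpptype, kMemoryLayout), the TypeID values, and the helper IsParameterized are assumptions made for illustration; the real TypeIDTrait in the PR may expose different members.

```cpp
#include <cstdint>

// Hypothetical type IDs and layouts, for illustration only.
enum class TypeID { kInt, kDouble, kChar, kVarChar };
enum class MemoryLayout { kCxxInlinePod, kParInlinePod, kParOutOfLinePod, kCxxGeneric };

// Primary template: one specialization per type ID.
template <TypeID id>
struct TypeIDTrait;

template <>
struct TypeIDTrait<TypeID::kInt> {
  typedef std::int32_t cpptype;
  static constexpr MemoryLayout kMemoryLayout = MemoryLayout::kCxxInlinePod;
};

template <>
struct TypeIDTrait<TypeID::kDouble> {
  typedef double cpptype;
  static constexpr MemoryLayout kMemoryLayout = MemoryLayout::kCxxInlinePod;
};

template <>
struct TypeIDTrait<TypeID::kChar> {
  typedef void cpptype;  // no single C++ value type; the size is a parameter
  static constexpr MemoryLayout kMemoryLayout = MemoryLayout::kParInlinePod;
};

template <>
struct TypeIDTrait<TypeID::kVarChar> {
  typedef void cpptype;
  static constexpr MemoryLayout kMemoryLayout = MemoryLayout::kParOutOfLinePod;
};

// Generic code can branch on the trait at compile time instead of
// repeating the same logic in every Type subclass.
template <TypeID id>
constexpr bool IsParameterized() {
  return TypeIDTrait<id>::kMemoryLayout == MemoryLayout::kParInlinePod ||
         TypeIDTrait<id>::kMemoryLayout == MemoryLayout::kParOutOfLinePod;
}
```

Because everything is a constant expression, a check such as IsParameterized<TypeID::kVarChar>() costs nothing at run time.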


3. Support more types.

Details will be written later about how to add a new type into the Quickstep system.

The current PR has some example types added:

  • The Bool type. It will be used later for connecting scalar functions and predicates.
  • The Text type. A general non-parameterized string type.
    • TODO: We need some updates in the storage block module (potentially also other places) to handle the "infinite maximum byte size" types.
  • The MetaType type. It is "type of type". I.e. a value of MetaType has C++ type const Type*.
  • The Array type. A generic type that represents an array. This type takes a MetaType value as parameter, where the parameter specifies the array's element type.
    • TODO: We need specialized array types such as IntArray and TextArray for performance reasons.

4. Improve the type casting mechanism.

Type casting (coercion) is an important feature that is needed in practice from time to time.

This PR's design defines an overall template

template <typename SourceType, typename TargetType, typename Enable = void>
struct CastFunctor;

which is then specialized by different source/target types.

The coercibility between two types is then inferred according to whether the corresponding specialization exists. Thus it suffices to just specialize CastFunctor when adding a new casting operation, and all the dependent places (e.g. Type::isCoercibleFrom()) will mostly be auto-generated by the system (unless the target type is a parameterized type and you want to do some further checks).

Note that safe-coercibility is a separate issue and needs to be taken care of mostly manually, by overriding Type::isSafelyCoercibleFrom().
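
One way to infer coercibility from the existence of a specialization is the "is this type complete?" SFINAE idiom, sketched below. Only the CastFunctor declaration comes from the description above; the two specializations and the IsCoercible helper are hypothetical, use plain C++ types instead of Quickstep Type classes, and may not match how the PR actually performs the check.

```cpp
#include <string>
#include <type_traits>

// Primary template: declared but never defined, as in the description.
template <typename SourceType, typename TargetType, typename Enable = void>
struct CastFunctor;

// Hypothetical specialization: any arithmetic type to any arithmetic type.
template <typename S, typename T>
struct CastFunctor<S, T,
                   typename std::enable_if<std::is_arithmetic<S>::value &&
                                           std::is_arithmetic<T>::value>::type> {
  T operator()(const S& value) const { return static_cast<T>(value); }
};

// Hypothetical specialization: int to string.
template <>
struct CastFunctor<int, std::string> {
  std::string operator()(const int& value) const { return std::to_string(value); }
};

// Defaults to false; the partial specialization below is selected only
// when CastFunctor<S, T> is a complete type, i.e. a specialization exists.
template <typename S, typename T, typename Enable = void>
struct IsCoercible : std::false_type {};

template <typename S, typename T>
struct IsCoercible<S, T, decltype(void(sizeof(CastFunctor<S, T>)))>
    : std::true_type {};
```

With this in place, IsCoercible<int, std::string>::value is true and IsCoercible<std::string, int>::value is false, without any hand-written coercibility table. One caveat of the idiom: all CastFunctor specializations must be visible before IsCoercible is first instantiated for a given pair.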

Explicit casting is supported with a PostgreSQL-like syntax. E.g.

(1)

SELECT (i::text + (i+1)::text)::int AS result FROM generate_series(1, 3) AS g(i);

--
+-----------+
|result     |
+-----------+
|         12|
|         23|
|         34|
+-----------+

(2)

CREATE TABLE r(x varchar(16));

INSERT INTO r SELECT pow(10, i)::varchar(10) FROM generate_series(1, 3) AS g(i);

SELECT 'There are ' + length(x)::varchar(10) + ' characters in ' + x AS result FROM r;

--
+---------------------------------------------------+
|result                                             |
+---------------------------------------------------+
|                       There are 2 characters in 10|
|                      There are 3 characters in 100|
|                     There are 4 characters in 1000|
+---------------------------------------------------+

(3)

SELECT {1,2,3}::array(double) AS result FROM generate_series(1, 1);

--
+--------------------------------+
|result                          |
+--------------------------------+
|                         {1,2,3}|
+--------------------------------+

NOTE: The work is not yet fully completed so there may be LOG(FATAL) aborts for some combinations of queries.

Implicit coercion is supported when resolving scalar functions, see here. For example, we have support for the sqrt function where the parameter can be a Float or Double value. Consider the query

SELECT sqrt(x) FROM r;

where x has Int type, then an implicit coercion from Int to Float will be added.
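
The resolution step for the sqrt example can be sketched roughly as follows: try an exact match against the accepted parameter types first, then fall back to the first accepted type that the argument can be safely coerced to. The names TypeID, IsSafelyCoercible, Resolution, and ResolveUnary are hypothetical, and the coercion table is a simplified assumption rather than Quickstep's actual rules.

```cpp
#include <cassert>
#include <vector>

enum class TypeID { kInt, kFloat, kDouble, kVarChar };

// Simplified, assumed coercion rules: Int -> Float/Double, Float -> Double.
bool IsSafelyCoercible(TypeID from, TypeID to) {
  if (from == to) return true;
  switch (from) {
    case TypeID::kInt:
      return to == TypeID::kFloat || to == TypeID::kDouble;
    case TypeID::kFloat:
      return to == TypeID::kDouble;
    default:
      return false;
  }
}

struct Resolution {
  TypeID parameter_type;  // the parameter type the function will be called with
  bool needs_cast;        // whether an implicit cast must be inserted
};

// 'accepted' lists the parameter types a function supports, in preference order.
bool ResolveUnary(const std::vector<TypeID>& accepted, TypeID argument,
                  Resolution* out) {
  assert(out != nullptr);
  // Pass 1: exact match, no cast needed.
  for (TypeID candidate : accepted) {
    if (candidate == argument) {
      *out = Resolution{candidate, false};
      return true;
    }
  }
  // Pass 2: first accepted type reachable by a safe implicit coercion.
  for (TypeID candidate : accepted) {
    if (IsSafelyCoercible(argument, candidate)) {
      *out = Resolution{candidate, true};
      return true;
    }
  }
  return false;
}
```

For sqrt with accepted types {Float, Double} and an Int argument, this picks Float and flags that a cast is needed, which matches the behavior described above.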


5. Add GenericValue to represent typed-values of all four memory layouts.

The original TypedValue is not sufficient to represent CxxGeneric values, as we need to embed the overall Type information in order to handle value allocation/copy/destruction. However, for performance reasons, we cannot simply replace TypedValue with a more generic but slower implementation. Thus, a separate GenericValue is added and we still use TypedValue when handling storage-related operations.
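
To illustrate why the Type has to travel with the value, here is a minimal sketch in which the value is an opaque pointer and every copy or destruction is delegated to the embedded Type. The interface shown (cloneValue, destroyValue, CxxGenericType, get) is hypothetical and much smaller than what a real GenericValue would need.

```cpp
#include <cassert>
#include <set>

// Hypothetical minimal Type interface: only the hooks needed to manage
// the lifetime of a value whose C++ type is not known statically.
class Type {
 public:
  virtual ~Type() {}
  virtual void* cloneValue(const void* value) const = 0;
  virtual void destroyValue(void* value) const = 0;
};

// A CxxGeneric-style type wrapping an arbitrary copyable C++ type T.
template <typename T>
class CxxGenericType : public Type {
 public:
  void* cloneValue(const void* value) const override {
    return new T(*static_cast<const T*>(value));
  }
  void destroyValue(void* value) const override {
    delete static_cast<T*>(value);
  }
};

class GenericValue {
 public:
  // Takes ownership of 'value', which must have been allocated with new.
  GenericValue(const Type* type, void* value) : type_(type), value_(value) {
    assert(type_ != nullptr);
  }
  // Copying goes through the Type, since only it knows the real C++ type.
  GenericValue(const GenericValue& other)
      : type_(other.type_), value_(other.type_->cloneValue(other.value_)) {}
  GenericValue& operator=(const GenericValue&) = delete;
  ~GenericValue() { type_->destroyValue(value_); }

  // The caller is responsible for asking for the right T.
  template <typename T>
  const T& get() const {
    return *static_cast<const T*>(value_);
  }

 private:
  const Type* type_;
  void* value_;
};
```

The virtual calls on every copy and destruction are the kind of overhead that the storage-oriented TypedValue avoids, which is the trade-off described above.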


6. Move type resolving from parser to resolver.

This avoids the need to modify SqlParser.ypp when adding a new type.

See ParseDataType and Resolver::resolveDataType().
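
The division of labor can be sketched as follows: the parser only records a type name and its raw parameters, and the resolver looks the name up in a registry, so registering a new type does not touch the grammar. Apart from the name ParseDataType, everything here (its fields, ResolvedType, TypeRegistry) is a hypothetical simplification and not the PR's actual API.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// What the parser hands over: an unresolved name plus raw parameters.
// The fields are assumed for illustration.
struct ParseDataType {
  std::string name;
  std::vector<std::size_t> parameters;
};

// Simplified stand-in for a resolved Type.
struct ResolvedType {
  std::string name;
  std::size_t length;  // 0 for non-parameterized types
};

// Hypothetical registry consulted by the resolver.
class TypeRegistry {
 public:
  // Adding a type is a registration call, not a grammar change.
  void registerType(const std::string& name, std::size_t num_parameters) {
    arity_[name] = num_parameters;
  }

  // Fails for unknown names or a wrong number of parameters.
  bool resolve(const ParseDataType& parsed, ResolvedType* out) const {
    assert(out != nullptr);
    std::map<std::string, std::size_t>::const_iterator it = arity_.find(parsed.name);
    if (it == arity_.end() || it->second != parsed.parameters.size()) {
      return false;
    }
    out->name = parsed.name;
    out->length = parsed.parameters.empty() ? 0 : parsed.parameters[0];
    return true;
  }

 private:
  std::map<std::string, std::size_t> arity_;
};
```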

~

Part II. Scalar Function


1. Implement UnaryOperationSynthesizer/UncheckedUnaryOperatorSynthesizer to make it easier to add unary functions.

Example unary functions:

2. Implement BinaryOperationSynthesizer/UncheckedBinaryOperatorSynthesizer to make it easier to add binary functions.

Example binary functions:

3. Use OperationSignature and OperationFactory to support general operation resolution.

~

Part III. TODOs

  • A lot of TODO(refactor-type) in the code to be fixed.
  • Refactor the predicate system (we will have something like ComparisonSynthesizer).
  • A lot of unit tests are broken (due to API change) and need to be fixed.
  • Comments and style of template metaprogramming code.
  • More to be added ...

@hbdeshmukh
Contributor

@jianqiao This seems like a massive effort! Would you please create a JIRA issue for this PR? You can just copy the content of this PR description in the issue text.
