# Batch Structure
Drill's internal data structures are highly complex because Drill, in pursuit of flexibility, gave up the concepts that manage complexity in typical SQL engines.
A typical engine defines a schema which in turn defines a row format. Since SQL is tuple oriented, this is straightforward: we define a table of columns, each entry pointing to that column's location in the underlying storage. For example, for a row-based system:
Name | Type | Offset |
---|---|---|
a | INT | 0 |
b | VARCHAR | 4 |
Row layout:

```
     a         b               Varchar buffer
| . . . . | . . . . | ... |   | . . . . . . .
^              |                ^
r              +----------------+
```
The table provides offsets into a tuple structure. Column b (VARCHAR) holds a 4-byte pointer into a separate buffer area. This is just one model; there are many.
The key point is that, given a schema, we can follow indirections into the row. Here, given a row start address (r above), we can compute column locations.
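As a sketch of the point (names hypothetical; no real engine is quite this simple), resolving a column in a row format is just a table lookup plus pointer arithmetic:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical row-format schema: given the table above, a fixed-width
// column's address is the row start plus a constant offset.
final class RowSchema {
  private final Map<String, Integer> offsets = new HashMap<>();

  void define(String name, int offset) { offsets.put(name, offset); }

  // Absolute address of a column within the row starting at rowStart (r).
  long columnAddr(long rowStart, String name) {
    return rowStart + offsets.get(name);
  }
}

// Usage: column a is the INT at r + 0; column b holds a 4-byte pointer
// at r + 4 that we then follow into the VARCHAR buffer.
//   RowSchema schema = new RowSchema();
//   schema.define("a", 0);
//   schema.define("b", 4);
//   long bPtrAddr = schema.columnAddr(r, "b");
```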
In Drill, however, things are far more complex.
- Data is stored in vectors. Given a row index, we can locate a column value within its vector, which is good. But there is no good way to correlate row offsets across vectors. (Drill has multiple, inconsistent ways to do this.)
- There is no good way to go from a schema to a vector. Vectors are stored in unordered "containers". The most general case requires a name lookup for each column access, which is slow (see the sketch below).
- Drill allows nested types: MAP, UNION, DICT. As a result, indexing into a column is complex: we can't simply assume that column "a" corresponds to some vector; each vector has structure (and so nested columns require name lookups).
- Drill allows arrays (including of MAPs). Thus, the row index for a set of vectors is not consistent: scalar a uses one index, but scalar b (within a map within an array) uses a different index.
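To make the lookup cost concrete, here is a minimal sketch (hypothetical classes, not Drill's real ones) contrasting per-access name lookup with resolving a vector once:

```java
import java.util.Map;

// Hypothetical stand-ins for a vector and an unordered container.
interface IntColumn { int getInt(int rowIndex); }

final class UnorderedContainer {
  private final Map<String, IntColumn> byName;
  UnorderedContainer(Map<String, IntColumn> byName) { this.byName = byName; }

  // General case: a hash lookup on every single value access.
  int valueSlow(String col, int row) {
    return byName.get(col).getInt(row);
  }

  // Better: resolve once, then index by row.
  IntColumn resolve(String col) { return byName.get(col); }
}

// Usage: hoist the name lookup out of the per-row loop.
//   IntColumn a = container.resolve("a");
//   long sum = 0;
//   for (int r = 0; r < rowCount; r++) { sum += a.getInt(r); }
```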
The net result is that, over time, Drill has gotten ever more complex. One cannot simply create parallel structures which all reference the same column. For example, for a PROJECT operation:
Name | Type | Offset | Source | Transform |
---|---|---|---|---|
a | INT | 0 | $in | $copy |
b | VARCHAR | 4 | $fn | $concat |
Instead, since one cannot easily address either vectors or values, Drill uses complex structures. Each structure must be a tree (to handle arbitrary data structures). Each column in the table corresponds to a separate tree structure, and each tree structure must link to the next one in line. Something like:
```
 Projection           Accessor          Vector          DrillBuf

     () --------------> () ---------> (Container)
    /  \               /  \             /  \
  (a)  (b) --------> (a)  (b) ----- (vv)  (vv) -----> (buf) (buf)
```
And so on, through as many layers of MAP, ARRAY, DICT and other structures as exist. Linkages are pointers. This causes odd behavior: operators bind to vectors, but the contents of those vectors change. Accessors contain pointers to vectors, and so work well enough for single-input operators, but require complex pointer indirection for multiple-input operators (such as merge) or for hyper-batches.
Each vector type has different semantics. So, each new layer requires that we implement a whole new set of structures: one for each vector type. Each time we add a new vector type, we must add a new class to each layer. This ends up with O(n^2) complexity, which is not a "Good Thing."
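A miniature of the explosion (all names hypothetical): every layer repeats the vector type hierarchy, so each new type adds one class per layer.

```java
// Hypothetical miniature of the type-times-layer explosion.
// Layer 1, vectors: one class per type.
class IntVector {}
class VarCharVector {}
class MapVector {}

// Layer 2, accessors: the hierarchy repeats.
class IntReader {}
class VarCharReader {}
class MapReader {}

// Layer 3, projection: and repeats again.
class IntProjector {}
class VarCharProjector {}
class MapProjector {}

// Adding, say, a DICT type now requires DictVector, DictReader,
// DictProjector: one new class in every layer, so the class count
// grows as (types x layers).
```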
SQL systems are designed to interpret programs written in SQL. To do so, they apply a large number of transforms to each column value in each row. A tuple layout allows the interpreter to walk tables that say what to do for each row.
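For contrast, a sketch of the table-driven style a tuple layout permits (hypothetical types): the interpreter walks a flat list of per-column actions, with no per-type class hierarchy.

```java
import java.util.List;

// Hypothetical table-driven projection: one action per output column,
// applied uniformly to every row.
interface ColumnAction {
  void apply(int inRow, int outRow);   // e.g. $copy or $concat from the table
}

final class Projector {
  private final List<ColumnAction> actions;   // one entry per output column
  Projector(List<ColumnAction> actions) { this.actions = actions; }

  void projectRow(int inRow, int outRow) {
    for (ColumnAction action : actions) {
      action.apply(inRow, outRow);
    }
  }
}
```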
In Drill, we try to model arbitrary tree data structures. And, we allow the tree structures to change order and content at any time. We want to interpret operations on these trees. Doing so is very, very complex.
Since we cannot use simple tables, we end up adding a new tree structure at each level of abstraction. Since each column behaves differently, we reproduce the type hierarchy at each level. Each node points to its children in the current abstraction, and to its counterpart in a lower-level abstraction. And since the structure can change, the entire hierarchy must be dynamic, and can change at any batch boundary.
This is far too complex. SQL is not designed for arbitrary tree structures. So, even if we got it to work, SQL can't exploit the structure.
We need a simpler solution.
The goal of this analysis is to answer the following questions:
- What are the major existing abstractions (the layers of trees)?
- Can we find a way to convert the trees into tables via effective column indexing?
- How can we simplify the abstraction layers?
## Containers
- `DrillBuf` - Semi-typed, physical direct-memory representation of a column.
- `ValueVector` - Typed collection of column values.
- `VectorAccessible` - Iterator over a set of vectors.
- `VectorContainer` - Vectors plus `BatchSchema` plus SV mode.
- `RecordBatch` - Container plus SVs plus all of the operator machinery.
- `WritableBatch` - "Specialized version of record batch that can moves out buffers and preps them for writing."
- `RowSet` and subclasses - Generalized wrapper around a set of vectors.
- `BatchAccessor` - Subset of `RecordBatch` methods for accessing a batch & SVs.
## Accessors
- `ValueVector.Mutator`, `Accessor` - Low-level per-vector access.
- `ComplexWriter`, `ComplexReader` - Older JSON-like accessors.
- `ColumnWriter`, `ColumnReader` - Newer accessors (shape sketched below).
- Generated code - Directly manipulates value vectors.
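A rough sketch of the shape of the newer accessors, using minimal hypothetical interfaces (Drill's real `ColumnWriter`/`ColumnReader` are richer, and the names here are approximate):

```java
// Minimal hypothetical interfaces mirroring the newer accessors' shape.
interface ScalarReader { int getInt(); String getString(); }
interface ScalarWriter { void setInt(int v); void setString(String v); }
interface RowReader { boolean next(); ScalarReader scalar(String col); }
interface RowWriter { ScalarWriter scalar(String col); void save(); }

final class CopyExample {
  // Copy two columns row by row through the accessor layer; the vectors
  // themselves stay hidden behind the reader and writer.
  static void copyRows(RowReader reader, RowWriter writer) {
    while (reader.next()) {
      writer.scalar("a").setInt(reader.scalar("a").getInt());
      writer.scalar("b").setString(reader.scalar("b").getString());
      writer.save();   // commit the output row
    }
  }
}
```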
## Metadata
- `MaterializedField` - Description of a vector (and its children).
- `BatchSchema` - List of `MaterializedField` in no specific order.
- `ColumnMetadata` - Enriched alternative to `MaterializedField`.
- `TupleMetadata` - Enriched form of `BatchSchema` (and schema for a map).
- `SerializedField` - Protobuf column description for RPC.
- `RecordBatchDef` - Protobuf batch definition for RPC.
## Operators
- `RecordBatch` - Combination of a batch of vectors plus a Volcano iterator. Is a `VectorAccessible`.
- `OperatorRecordBatch` - Wrapper around the newer, simpler operator structure to make it look like a `RecordBatch`.
- `OperatorExec` - Interface for the "guts" of an operator.
- ?? - Shim around a `RecordBatch` to make it look like an `OperatorExec` (see the sketch below).
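The relationship between the last three items can be sketched as follows (hypothetical interfaces that only mirror the shape described above, not Drill's actual signatures):

```java
// Hypothetical sketch: OperatorExec-style "guts" wrapped by a shim that
// speaks the legacy Volcano-style RecordBatch protocol.
interface OperatorGuts {
  boolean buildSchema();   // deliver the first, schema-only batch
  boolean next();          // produce the next data batch; false when done
  void close();
}

final class VolcanoShim {
  private final OperatorGuts guts;
  VolcanoShim(OperatorGuts guts) { this.guts = guts; }

  // Stands in for RecordBatch.next() returning an IterOutcome.
  String next() {
    return guts.next() ? "OK" : "NONE";
  }
}
```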
## Proposed Abstractions

- Column schema - Description of one column.
- Tuple schema - Description of a set of columns, indexed by name and position.
- Column vector - Opaque (to clients) storage of column data.
- Column bundle - Collection of vectors, indexed by name (slow) or position (see the interface sketch after this list).
- Column accessors - Read/write access to a bundle of vectors.
- Record set - Vector bundle plus a record count and schema.
- Record batch - Record set plus a possible SV.
- Operator - Follows the Volcano protocol; provides an output record batch.
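A sketch of how the first few of these might look as interfaces (all names hypothetical). The key property: both the schema and the bundle are indexed by position as well as by name, so names are resolved once, then access is positional.

```java
// Hypothetical interfaces for the proposed abstractions.
interface ColumnSchema {
  String name();
  // type, cardinality, children, ... omitted
}

interface TupleSchema {
  int columnCount();
  ColumnSchema column(int index);   // positional: O(1)
  int index(String name);           // by name: resolve once, then go positional
}

interface ColumnVector { /* opaque (to clients) column storage */ }

interface ColumnBundle {
  TupleSchema schema();
  ColumnVector vector(int index);   // fast positional access
  int rowCount();
}
```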
Rough structure:

```
Column vector o-- ValueVector o-- DrillBuf
              o-- Mutator, Accessor
              o-- MaterializedField
              o-- ColumnMetadata

Column bundle o-- Column vector *
              o-- TupleMetadata
              o-- Row count

Column batch  o-- Column bundle
              o-- Selection vector mode
              o-- Selection vector
```
Conversions:

```
Column vector --> ColumnWriter
              --> ColumnReader

Column bundle --> RowSetWriter
              --> RowSetReader

Column batch  --> RowSetReader (with SV)
              --> WritableBatch
```
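Building on the hypothetical `ColumnBundle` sketched earlier, a reader constructed from a bundle can do all name resolution up front:

```java
// Hypothetical reader over a ColumnBundle (interfaces sketched above):
// names are resolved once in the constructor, never per row.
final class BundleReader {
  private final ColumnVector aVec;
  private final ColumnVector bVec;
  private final int rowCount;
  private int row = -1;

  BundleReader(ColumnBundle bundle) {
    TupleSchema schema = bundle.schema();
    aVec = bundle.vector(schema.index("a"));   // one name lookup
    bVec = bundle.vector(schema.index("b"));   // one name lookup
    rowCount = bundle.rowCount();
  }

  boolean next() { return ++row < rowCount; }
}
```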
Constraints to keep in mind:

- Cannot change the client API: must return existing vector and metadata structures to existing JDBC and C++ clients.
- An a-priori schema can include children; a vector-description schema should not duplicate vector structure.
- Required stickiness of vectors: two operators share pointers to vectors, but this becomes awkward for scan and merge batches.
- SV4 complexity: normal and hyper batches.
External APIs that must not change:
- Serialized form of batches: data, metadata
- Protobuf form of metadata
Leave existing structures alone: wrap them in new ones. Gradually migrate operators. Then remove the old abstractions (which, at that point, are just shims on top of the new ones).
Gradually migrate all vector access to accessors, to allow eventual evolution of underlying physical storage.
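For instance, the new bundle could start as a thin wrapper over the legacy container (hypothetical adapter, reusing the `ColumnBundle` sketch above):

```java
// Hypothetical adapter for the migration strategy: wrap the legacy
// container in the new bundle interface so old and new code coexist.
interface LegacyContainer {            // stands in for VectorContainer
  ColumnVector vectorAt(int index);
  int recordCount();
}

final class VectorBundleAdapter implements ColumnBundle {
  private final LegacyContainer container;
  private final TupleSchema schema;

  VectorBundleAdapter(LegacyContainer container, TupleSchema schema) {
    this.container = container;
    this.schema = schema;
  }

  @Override public TupleSchema schema() { return schema; }
  @Override public ColumnVector vector(int index) { return container.vectorAt(index); }
  @Override public int rowCount() { return container.recordCount(); }
}
```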
Very preliminary mapping:
Current | Revised | Public API? | Notes |
---|---|---|---|
`DrillBuf` | Same | Yes | Now holds the buffer; not used for value read/write. |
`ValueVector` | Same | Yes | Internally, value access is via accessors. |
`ValueVector.Accessor`, `Mutator` | Same | Yes | Deprecated internally. |
`MaterializedField` | Same | Yes | Deprecated internally in favor of `ColumnMetadata`. |
`ComplexWriter`, `ComplexReader` | `ColumnWriter`, `ColumnReader` | Partial | Exposed in UDF definition. |
`VectorContainer` | `VectorBundle` | Check | |
- Define vector bundle as an internal helper abstraction.
- Refactor readers/writers per the `svr-exp3` code.
  - Readers & writers work with vector bundles.
  - Builders convert multiple vector structures to a bundle internally.
- Rationalize the `BatchAccessor`.
  - Original purpose: interface for the batch-related methods of `RecordBatch`.
  - Has morphed into a way of holding and tracking batches. Revise?
- Remove SV4 from the public API; use it only internally.
  - Copier handles SV4 cleanly.
  - Remove the old external sort.
  - Managed sort does SVR internally, so it exposes SV2 or none.
  - What other operators produce or consume SV4?
- Migrate to the batch accessor.
  - Add a batch accessor to `RecordBatch` as a wrapper around itself (see the sketch after this list).
  - Extend it to hold the container and SVs; `RecordBatch` methods defer to the batch accessor.
  - Gradually replace direct `RecordBatch` calls with calls to the batch accessor.
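Sketched concretely (hypothetical names, reusing the earlier interface sketches), the wrapper step might look like:

```java
// Hypothetical batch accessor that a RecordBatch creates around itself;
// callers migrate to this interface, and SV details stay internal.
interface BatchAccessorShape {
  TupleSchema schema();
  int rowCount();
  // SV2/SV4 access would live here too, hidden from the public API.
}

final class SelfAccessor implements BatchAccessorShape {
  private final ColumnBundle batch;    // the wrapped batch's vectors
  SelfAccessor(ColumnBundle batch) { this.batch = batch; }

  @Override public TupleSchema schema() { return batch.schema(); }
  @Override public int rowCount() { return batch.rowCount(); }
}
```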
- `ColumnMetadata` wrapper around a `MaterializedField` (sketched below).
- `TupleMetadata` wrapper around a `BatchSchema`.
- `VectorBundle` to hold a set of vectors.
  - In operators which use ad-hoc structures, replace them with the `VectorBundle`.
- `VectorBundle` wrapper around a `VectorAccessible` or `VectorContainer`.
  - In operators which work with a `VectorContainer`, change to use `VectorBundle` instead.
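The metadata items follow the same adapter pattern; a hypothetical sketch (reusing the `ColumnSchema` interface above):

```java
// Hypothetical sketch: the enriched metadata wraps the legacy form and
// can always hand it back for external APIs that must not change.
interface LegacyField {                // stands in for MaterializedField
  String getName();
}

final class ColumnMetadataWrapper implements ColumnSchema {
  private final LegacyField field;
  ColumnMetadataWrapper(LegacyField field) { this.field = field; }

  @Override public String name() { return field.getName(); }

  // Expose the wrapped legacy form for clients that require it.
  LegacyField materializedField() { return field; }
}
```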