CrateDB Architecture

This document assumes that you are familiar with using CrateDB. If you have not used CrateDB before, please check out our tutorials.

CrateDB is a distributed SQL database. From a high-level perspective, the following components can be found in each CrateDB node:

Components

SQL Engine

CrateDB's core consists of a SQL Engine which takes care of parsing SQL statements and executing them in the cluster.

The engine is comprised of the following components:

Parser: Breaks down the SQL statement into its components and creates an AST (Abstract Syntax Tree).
Analyzer: Performs semantic processing of the statement including verification, type annotation, and normalization.
Planner: Builds an execution plan from the analyzed statement and optimizes it if possible. This is first done on a logical level (LogicalPLan), then broken down to the physical level (ExecutionPlan).
Executor: Executes the plan in the cluster and collects results.

Class entry point:

SqlParser
Analyzer
Planner
LogicalPlan / ExecutionPlan
NodeOperationTree
ExecutionPhasesTask
BatchIterator

Input

CrateDB accepts input from a HTTP/REST interface as well as from a PostgreSQL compatible wire protocol. The admin interface, Crash CLI and some of the connectors use the HTTP interface while the JDBC client uses the PostgreSQL wire protocol.

The PostgreSQL wire protocol support in CrateDB allows you to use CrateDB in applications which were originally built to communicate with PostgreSQL. You can also use PostgreSQL tools like psql with CrateDB.

Class entry points:

CrateRestMainAction
PostgresWireProtocol

Transport

Transport refers to the communication between nodes in a CrateDB cluster. CrateDB manages the cluster state and the transfer of data between cluster nodes. This includes node discovery, partitioning of data, recovery, and replication.

Class entry points:

TransportJobAction
TransportShardUpsertAction

Storage

CrateDB enables storing data in tables like you would in a traditional SQL database with a strict schema. You can dynamically extend the schema or store JSON objects inside columns which you can also query later.

To distribute the data being stored, they are clustered by a column or expression. Partitioned tables allow you to further distribute and split up your data using other columns or expressions.

A table corresponds to an Elasticsearch index. An index may have one or multiple shards which represent the slices of a table which are stored across the cluster. If a table is partitioned, each partition is a separate index.

The following table illustrates the relationship between table, partitions, and indices. Table1 has one index (=no partitions) with four shards. Table2 has two indices (=two partitions) with two shards each. Table3 has one index which is split across three shards.

	Table1	Table2		Table3
Partition	Index	Index1	Index2	Index
Shard1	X	X	X	X
Shard2	X	X	X	X
Shard3	X			X
Shard4	X
...

To be able to retrieve data efficiently and perform aggregations, CrateDB uses Lucene for indexing and storing data. Lucene itself stores a document with the contents of each row. Retrieval is efficient because the fields of the document are indexed. Aggregations can also be performed efficiently due to Lucene's column store feature which stores columns separately in order to quickly perform aggregations with them.

Class entry points:

LuceneQueryBuilder
LuceneBatchIterator

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

architecture.rst

architecture.rst

CrateDB Architecture

Components

SQL Engine

Input

Transport

Storage

Files

architecture.rst

Latest commit

History

architecture.rst

File metadata and controls

CrateDB Architecture

Components

SQL Engine

Input

Transport

Storage