This document assumes that you are familiar with using CrateDB. If you have not used CrateDB before, please check out our tutorials.
CrateDB is a distributed SQL database. From a high-level perspective, the following components can be found in each CrateDB node:
CrateDB's core consists of a SQL Engine which takes care of parsing SQL statements and executing them in the cluster.
The engine is comprised of the following components:
- Parser: Breaks down the SQL statement into its components and creates an AST (Abstract Syntax Tree).
- Analyzer: Performs semantic processing of the statement including verification, type annotation, and normalization.
- Planner: Builds an execution plan from the analyzed statement and optimizes it if possible. This is first done on a logical level (LogicalPLan), then broken down to the physical level (ExecutionPlan).
- Executor: Executes the plan in the cluster and collects results.
Class entry point:
- SqlParser
- Analyzer
- Planner
- LogicalPlan / ExecutionPlan
- NodeOperationTree
- ExecutionPhasesTask
- BatchIterator
CrateDB accepts input from a HTTP/REST interface as well as from a PostgreSQL compatible wire protocol. The admin interface, Crash CLI and some of the connectors use the HTTP interface while the JDBC client uses the PostgreSQL wire protocol.
The PostgreSQL wire protocol support in CrateDB allows you to use CrateDB in applications which were originally built to communicate with PostgreSQL. You can also use PostgreSQL tools like psql with CrateDB.
Class entry points:
Transport refers to the communication between nodes in a CrateDB cluster. CrateDB manages the cluster state and the transfer of data between cluster nodes. This includes node discovery, partitioning of data, recovery, and replication.
Class entry points:
CrateDB enables storing data in tables like you would in a traditional SQL database with a strict schema. You can dynamically extend the schema or store JSON objects inside columns which you can also query later.
To distribute the data being stored, they are clustered by a column or expression. Partitioned tables allow you to further distribute and split up your data using other columns or expressions.
A table corresponds to an Elasticsearch index. An index may have one or multiple shards which represent the slices of a table which are stored across the cluster. If a table is partitioned, each partition is a separate index.
The following table illustrates the relationship between table, partitions, and indices. Table1 has one index (=no partitions) with four shards. Table2 has two indices (=two partitions) with two shards each. Table3 has one index which is split across three shards.
Table1 | Table2 | Table3 | ... | ||
---|---|---|---|---|---|
Partition | Index | Index1 | Index2 | Index | |
Shard1 | X | X | X | X | |
Shard2 | X | X | X | X | |
Shard3 | X | X | |||
Shard4 | X | ||||
... |
To be able to retrieve data efficiently and perform aggregations, CrateDB uses Lucene for indexing and storing data. Lucene itself stores a document with the contents of each row. Retrieval is efficient because the fields of the document are indexed. Aggregations can also be performed efficiently due to Lucene's column store feature which stores columns separately in order to quickly perform aggregations with them.
Class entry points: