docs: add mdBook based documentation (#2)

## Problem

We need to record our design and retain knowledge of the codebase as a team. Documentation is one of the best ways to do it.

## Summary of changes

- Check in mdBook.
- Add an initial overview that includes a wish list.
- Add an initial glossary containing definitions and terms that optd developers will use.
- Add a `rfcs/` directory with a template.
- Add a pull request template (in `.github`).

## Future Work

- Set up gh-pages for the documentation. It might conflict with the benchmark site. We will delay deploying mdBook to gh-pages until a decision is made.

---------

Signed-off-by: Yuchen Liang <[email protected]>
Co-authored-by: Connor Tsui <[email protected]>
cmu-db · Jan 15, 2025 · de96ad3
1 parent 36ed2ce
commit de96ad3
Showing 12 changed files with 178 additions and 0 deletions.
diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
@@ -0,0 +1,3 @@
+## Problem
+
+## Summary of changes
diff --git a/README.md b/README.md
@@ -1,2 +1,6 @@
 # optd
 Query Optimizer Service
+
+## Documentation
+
+The [docs](docs/) directory contains high-level documentation of optd and RFCs in the mdBook format.
diff --git a/docs/.gitignore b/docs/.gitignore
@@ -0,0 +1 @@
+book
diff --git a/docs/README.md b/docs/README.md
@@ -0,0 +1,12 @@
+# Development Documentation
+
+The `src` folder contains the documentation on the optd query optimizer in the `mdBook` format.
+
+
+To view the documentation locally, you can follow the [`mdBook` installation guide](https://rust-lang.github.io/mdBook/guide/installation.html) to set up the environment. After installing `mdBook`, run the following command from the root of the optd repository:
+
+```shell
+mdbook serve --open docs/
+```
+
+If you want to edit or add a chapter to the book, start from [SUMMARY.md](./src/SUMMARY.md), which lists the table of contents. For more information, please check out the [mdBook documentation](https://rust-lang.github.io/mdBook/format/index.html).
diff --git a/docs/book.toml b/docs/book.toml
@@ -0,0 +1,9 @@
+[book]
+authors = ["Yuchen Liang"]
+language = "en"
+multilingual = false
+src = "src"
+title = "The optd Query Optimizer Documentation"
+
+[output.html]
+additional-css = ["custom.css"]
diff --git a/docs/custom.css b/docs/custom.css
@@ -0,0 +1,5 @@
+.content img {
+    margin-left: auto;
+    margin-right: auto;
+    display: block;
+}
diff --git a/docs/src/SUMMARY.md b/docs/src/SUMMARY.md
@@ -0,0 +1,16 @@
+# Summary
+
+[Overview](./overview.md)
+
+# Architecture
+
+- [Glossary](./architecture/glossary.md)
+
+# Contributor Guide
+
+- [Installation]()
+
+# RFCs
+
+- [Writing an RFC](./rfcs/README.md)
+- [RFC-0001: The Core Objects in a Cascades-style Query Optimizer and How to Store Them]()
diff --git a/docs/src/architecture/glossary.md b/docs/src/architecture/glossary.md
@@ -0,0 +1,76 @@
+# Glossary
+
+Definitions in query optimization can get very overloaded. Below is the language optd developers speak.
+
+### Relational operator
+A **relational operator** (`RelNode`) describes an operation that can be evaluated to obtain a bag of tuples. In other literature this is also referred to as a query plan. A relational operator can be either logical or physical.
+
+### Scalar operator
+
+A **scalar operator** (`ScalarNode`) describes an operation that can be evaluated to obtain a single value. In other literature this is also referred to as a SQL expression or a row expression.
+
+## Cascades
+
+### Expressions
+
+A **logical expression** is a tree/DAG of logical operators.
+
+A **physical expression** is a tree/DAG of physical operators.
+
+The term **expression** in the context of Cascades can refer to either a relational or a scalar expression.
+
+### Properties
+
+**Properties** are metadata computed (and sometimes stored) for each node in an expression.
+Properties of an expression may be **required** by the original SQL query or **derived** from **physical properties of one of its inputs.**
+
+
+**Logical properties** describe the structure and content of data returned by an expression.
+
+- Examples: row count, operator type, statistics, whether relational output columns can contain nulls.
+
+**Physical properties** are characteristics of an expression that
+impact its layout, presentation, or location, but not its logical content.
+
+- Examples: order and data distribution.
+
+
+### Equivalence
+
+Two logical expressions are equivalent if the logical properties of the two expressions are the same. They should produce the same set of rows and columns.
+
+Two physical expressions are equivalent if their logical and physical properties are the same.
+
+A logical expression with a required physical property is equivalent to a physical expression if the physical expression has the same logical properties and delivers the required physical property.
+
+
+### Group
+
+A **group** consists of equivalent logical expressions.
+
+A **relational group** consists of logically equivalent logical relational operators.
+
+A **scalar group** consists of logically equivalent logical scalar operators.
+
+### Rule
+
+A **rule** in Cascades transforms an expression into equivalent expressions. It has the following interface:
+
+```rust
+trait Rule {
+    /// Checks whether the rule is applicable on the input expression.
+    fn check_pattern(expr: Expr) -> bool;
+    /// Transforms the expression into one or more equivalent expressions.
+    fn transform(expr: Expr) -> Vec<Expr>;
+}
+```
+
+A **transformation rule** transforms a **part** of a logical expression into equivalent logical expressions. This is also called a logical-to-logical transformation in other systems.
+
+An **implementation rule** transforms a **part** of a logical expression into an equivalent physical expression with physical properties.
+
+In Cascades, you don't need to materialize the entire query tree when applying rules. Instead, you can materialize expressions on demand while leaving unrelated parts of the tree as group identifiers.
+
+Other systems also have physical-to-physical expression transformations for execution-engine-specific optimization, physical property enforcement, or distributed planning. At the moment, we are **not** considering physical-to-physical transformations.
+
+**Enforcer rule:** *TODO!*
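
Note on the glossary's `Rule` interface: the block below is a minimal sketch, not code from this commit, of how a transformation rule could implement that trait while leaving part of the tree as a group identifier. The `Expr` enum, the `GroupId` type, the `JoinCommute` rule, and the `commute_example` helper are all hypothetical names introduced for illustration. The glossary does not define `Expr`, so its shape here is an assumption.

```rust
/// Hypothetical identifier of a group in the memo table (assumption).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct GroupId(u32);

/// Hypothetical, heavily simplified logical expression (assumption).
/// A child is either a materialized sub-expression or an unexpanded group.
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Scan(String),
    Join(Box<Expr>, Box<Expr>),
    Group(GroupId),
}

/// The interface from the glossary, reproduced as written there.
trait Rule {
    /// Checks whether the rule is applicable on the input expression.
    fn check_pattern(expr: Expr) -> bool;
    /// Transforms the expression into one or more equivalent expressions.
    fn transform(expr: Expr) -> Vec<Expr>;
}

/// Illustrative transformation rule: Join(a, b) => Join(b, a).
struct JoinCommute;

impl Rule for JoinCommute {
    fn check_pattern(expr: Expr) -> bool {
        // Only the root operator is inspected; the children stay opaque.
        matches!(expr, Expr::Join(_, _))
    }

    fn transform(expr: Expr) -> Vec<Expr> {
        match expr {
            Expr::Join(left, right) => vec![Expr::Join(right, left)],
            // The rule does not apply, so it produces no alternatives.
            _ => Vec::new(),
        }
    }
}

/// Applies the rule to a join whose right input is still a group identifier.
fn commute_example() -> Vec<Expr> {
    let expr = Expr::Join(
        Box::new(Expr::Scan("t1".to_string())),
        Box::new(Expr::Group(GroupId(7))),
    );
    if JoinCommute::check_pattern(expr.clone()) {
        JoinCommute::transform(expr)
    } else {
        Vec::new()
    }
}
```

In this sketch the rule only looks at the root `Join`, so the `Group(GroupId(7))` child never has to be expanded. This mirrors the glossary's point about materializing expressions on demand.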
diff --git a/docs/src/citations.md b/docs/src/citations.md
@@ -0,0 +1,5 @@
+[1] Bailu Ding, Vivek Narasayya and Surajit Chaudhuri (2024), "Extensible Query Optimizers in Practice", Foundations and Trends® in Databases: Vol. 14: No. 3-4, pp 186-402. http://dx.doi.org/10.1561/1900000077
+
+[2] Guido Moerkotte, Pit Fender, and Marius Eich. 2013. On the correct and complete enumeration of the core search space. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD '13). Association for Computing Machinery, New York, NY, USA, 493–504. https://doi.org/10.1145/2463676.2465314
+
+[3] Florian M. Waas and Joseph M. Hellerstein. 2009. Parallelizing extensible query optimizers. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data (SIGMOD '09). Association for Computing Machinery, New York, NY, USA, 871–878. https://doi.org/10.1145/1559845.1559938
diff --git a/docs/src/overview.md b/docs/src/overview.md
@@ -0,0 +1,13 @@
+# Introduction
+
+optd is a database query optimizer service. The project is in active development.
+
+## Our Wishlist
+
+- Correct and complete enumeration of the search space.
+- An accurate cost model powered by advanced statistics that can differentiate plans under a mix of optimization objectives.
+- An efficient search algorithm to navigate the vast search space.
+- A persistent storage (cache) of query optimizer state that allows us to reuse past optimizations for future queries.
+- An explainable, self-correcting, and human-assisted optimization process by producing and consuming a trail of breadcrumbs that could explain every decision that the optimizer makes.
+- An intelligent scheduler that exploits parallelism in modern hardware to boost search performance.
+- An extensible operator and rule system.
diff --git a/docs/src/rfcs/0000-template.md b/docs/src/rfcs/0000-template.md
@@ -0,0 +1,27 @@
+- Feature Name: (fill me in with a unique feature name)
+- Authors: (fill me in with the name of the authors)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [cmu-db/optd#0000](https://github.com/cmu-db/optd/pull/0000)
+- Tracking Issue: [cmu-db/optd#0000](https://github.com/cmu-db/optd/issues/0000)
+
+## Summary
+
+## Motivation
+
+## Non Goals (if relevant)
+
+## Impacted components (e.g. core, memo table, representation, rule engine, etc.)
+
+## Proposed implementation
+
+### Reliability, failure modes and corner cases (if relevant)
+
+### Scalability (if relevant)
+
+### Unresolved questions (if relevant)
+
+## Alternative implementation (if relevant)
+
+## Pros/cons of proposed approaches (if relevant)
+
+## Definition of Done (if relevant)
diff --git a/docs/src/rfcs/README.md b/docs/src/rfcs/README.md
@@ -0,0 +1,7 @@
+# RFCs
+
+This section contains RFCs for features and technical concepts proposed for integration into the system. In some cases, they may be retroactive and document why certain design decisions were made. Writing RFCs enables us to validate our concepts early and keep peers informed about the design of the codebase. Since it is not our goal to keep these documents up to date, please refer to the Architecture section if you are looking for the most up-to-date description of the system. However, with context, these RFCs should still provide useful insights into the reasoning and thought process behind certain design decisions.
+
+To write a new RFC, copy `docs/src/rfcs/0000-template.md` to a new file and start editing the header and all the other sections.
+
+For your RFC PR, you should also make sure to edit [SUMMARY.md](../SUMMARY.md) to make it visible in the mdBook docs. Make sure your RFC number does not collide with RFCs that other people have written but that have not been merged yet.