Quantco · pavelzw · Jul 13, 2024 · Jul 13, 2024 · Aug 14, 2024
diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
@@ -0,0 +1,23 @@
+name: Docs
+on:
+  pull_request:
+  push:
+    branches:
+      - main
+permissions:
+  contents: write
+jobs:
+  docs:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout branch
+        uses: actions/checkout@v4
+      - name: Set up pixi
+        uses: prefix-dev/setup-pixi@ba3bb36eb2066252b2363392b7739741bb777659
+        with:
+          environments: docs
+      - name: Build docs
+        run: pixi run -e docs docs-build
+      - name: Deploy docs
+        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
+        run: pixi run -e docs mkdocs gh-deploy --force
diff --git a/.readthedocs.yml b/.readthedocs.yml
diff --git a/docs/Makefile b/docs/Makefile
diff --git a/docs/api-documentation.md b/docs/api-documentation.md
@@ -0,0 +1,3 @@
+# API Documentation
+
+::: datajudge
diff --git a/docs/development.md b/docs/development.md
@@ -0,0 +1,52 @@
+# Development
+
+`datajudge` development relies on [pixi](https://pixi.sh/latest/).
+In order to work on `datajudge`, you can create a development environment as follows:
+
+```bash
+git clone https://github.com/Quantco/datajudge
+cd datajudge
+pixi run postinstall
+```
+
+Unit tests can be run by executing
+
+```bash
+pixi run test
+```
+
+Integration tests are run against a specific backend at a time. As of now, we provide helper
+scripts to spin up either a Postgres or MSSQL backend.
+
+To run integration tests against Postgres, first start a docker container with a Postgres database:
+
+```bash
+./start_postgres.sh
+```
+
+In your current environment, install the `psycopg2` package.
+After this, you may execute integration tests as follows:
+
+```bash
+pixi run -e postgres-py38 test
+```
+
+Analogously, for MSSQL, run
+
+```bash
+./start_mssql.sh
+```
+
+and
+
+```bash
+pixi run -e mssql-py310 test
+```
+
+or
+
+```bash
+pixi run -e mssql-py310 test_freetds
+```
+
+depending on the driver you'd like to use.
diff --git a/docs/examples/company-data.md b/docs/examples/company-data.md
@@ -0,0 +1,128 @@
+# Company data
+
+To get started, we will create a sample database using sqlite that contains a list of companies.
+
+The table "companies_archive" contains three entries:
+
+**companies_archive**
+
+| id  | name    | num_employees |
+| --- | ------- | ------------- |
+| 1   | QuantCo | 90            |
+| 2   | Google  | 140,000       |
+| 3   | BMW     | 110,000       |
+
+While "companies" contains an additional entry:
+
+**companies**
+
+| id  | name    | num_employees |
+| --- | ------- | ------------- |
+| 1   | QuantCo | 100           |
+| 2   | Google  | 150,000       |
+| 3   | BMW     | 120,000       |
+| 4   | Apple   | 145,000       |
+
+```python
+import sqlalchemy as sa
+
+eng = sa.create_engine('sqlite:///example.db')
+
+with eng.connect() as con:
+    con.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT, num_employees INTEGER)")
+    con.execute("INSERT INTO companies (name, num_employees) VALUES ('QuantCo', 100), ('Google', 150000), ('BMW', 120000), ('Apple', 145000)")
+    con.execute("CREATE TABLE companies_archive (id INTEGER PRIMARY KEY, name TEXT, num_employees INTEGER)")
+    con.execute("INSERT INTO companies_archive (name, num_employees) VALUES ('QuantCo', 90), ('Google', 140000), ('BMW', 110000)")
+```
+
+As an example, we will run 4 tests on this table:
+
+1. Does the table "companies" contain a column named "name"?
+2. Does the table "companies" contain at least 1 entry with the name "QuantCo"?
+3. Does the column "num_employees" of the "companies" table have all positive values?
+4. Does the column "name" of the table "companies" contain at least all the values of
+   the corresponding column in "companies_archive"?
+
+```python
+import pytest
+import sqlalchemy as sa
+
+from datajudge import (
+    Condition,
+    WithinRequirement,
+    BetweenRequirement,
+)
+from datajudge.pytest_integration import collect_data_tests
+
+
+# We create a Requirement, within a table. This object will contain
+# all the constraints we want to test on the specified table.
+# To test another table or test the same table against another table,
+# we would create another Requirement object.
+companies_req = WithinRequirement.from_table(
+    db_name="example", schema_name=None, table_name="companies"
+)
+
+# Constraint 1: Does the table "companies" contain a column named "name"?
+companies_req.add_column_existence_constraint(columns=["name"])
+
+# Constraint 2: Does the table "companies" contain at least 1 entry with the name "QuantCo"?
+condition = Condition(raw_string="name = 'QuantCo'")
+companies_req.add_n_rows_min_constraint(n_rows_min=1, condition=condition)
+
+# Constraint 3: Does the column "num_employees" of the "companies" table have all
+# positive values?
+companies_req.add_numeric_min_constraint(column="num_employees", min_value=1)
+
+# We create a new Requirement, this time between different tables.
+# Concretely, we intent to test constraints between the table "companies"
+# and the table "companies_archive".
+companies_between_req = BetweenRequirement.from_tables(
+    db_name1="example",
+    schema_name1=None,
+    table_name1="companies",
+    db_name2="example",
+    schema_name2=None,
+    table_name2="companies_archive",
+)
+
+# Constraint 4: Does the column "name" of the table "companies" contain at least all
+# the values of the corresponding column in "companies_archive"?
+companies_between_req.add_row_superset_constraint(
+    columns1=['name'], columns2=['name'], constant_max_missing_fraction=0
+)
+
+# collect_data_tests expects a pytest fixture with the name
+# "datajudge_engine" that is a SQLAlchemy engine
+
+@pytest.fixture()
+def datajudge_engine():
+    return sa.create_engine("sqlite:///example.db")
+
+# We gather our distinct Requirements in a list.
+requirements = [companies_req, companies_between_req]
+
+# "collect_data_tests" takes all requirements and turns their respective
+# Constraints into individual tests. pytest will be able to pick
+# up these tests.
+test_constraint = collect_data_tests(requirements)
+```
+
+Saving this file as `specification.py` and running `$ pytest specification.py`
+will verify that all constraints are satisfied. The output you see in the terminal
+should be similar to this:
+
+```
+=================================== test session starts ===================================
+...
+collected 4 items
+
+specification.py::test_constraint[ColumnExistence::companies] PASSED                [ 25%]
+specification.py::test_constraint[NRowsMin::companies] PASSED                       [ 50%]
+specification.py::test_constraint[NumericMin::companies] PASSED                     [ 75%]
+specification.py::test_constraint[RowSuperset::companies|companies_archive] PASSED  [100%]
+
+==================================== 4 passed in 0.31s ====================================
+```
+
+You can also use a formatted html report using the `--html=report.html` flag.