The great_expectations folder is a Great Expectations project.
- Core directories
  - expectations: JSON files indicating the tests to be run (similar to the yml files containing schema tests in DBT); these are not in version control, but are generated by the tool from dot.configured_tests
  - notebooks: Jupyter notebooks automatically created by Great Expectations during setup to provide a more convenient front-end for editing the JSON files in expectations (check out the flow described in the Great Expectations documentation)
  - plugins: Additional code for customizing this Great Expectations project. The most important file here is custom_expectations.py, which is where tests requiring arbitrary Python should be added as methods under the CustomSqlAlchemyDataset class (somewhat similar to the custom SQL tests in DBT, except written in Python).
- Non-version-controlled directories (automatically created by Great Expectations)
  - uncommitted: Generic place to capture all Great Expectations files that should not be version controlled, including logs and database connection details
- Scripts
  - great_expectations.py: Utility script to automatically run GE tests and create coverage reports. GE's own CLI commands for this are much more verbose and harder to remember than DBT's.
- Config files (not in version control; these are either managed by the tool or set in the project-dependent config)
  - batch_config.json: Defines datasources, test suites, and tables to be included when running Great Expectations (a rough sketch follows this list)
  - great_expectations.yml: Main config file for Great Expectations (similar to dbt_project.yml)
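As a purely hypothetical sketch of the kind of information batch_config.json carries (the field names below are invented for illustration; the real schema is whatever the existing file in this project uses), shown here as a Python dict mirroring the JSON:

```python
# Hypothetical sketch only -- these field names are invented; consult the
# project's actual batch_config.json for the real schema. Conceptually the file
# ties a datasource to the expectation suites and tables included in a run.
batch_config = {
    "datasource_name": "dot_db",  # connection configured via great_expectations.yml / uncommitted
    "expectation_suite_names": ["ootb_suite", "custom_suite"],
    "tables": ["some_schema.some_table"],  # tables batched for OOTB expectations
}
```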
- An expectation is a particular function accepting one or multiple parameters (defined in Python)
- A test is an instance of an expectation, with a specific set of parameters (defined in a JSON file)
- Out Of The Box (OOTB) expectations are provided by Great Expectations and built into the library's codebase.
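To make the distinction concrete: a test is simply an entry in a suite file that names an expectation and fixes its parameters. A rough sketch, shown as a Python dict mirroring the JSON (the suite name and column are made up, but the field layout follows the standard Great Expectations suite format):

```python
# Illustrative only -- the suite name and column are made up.
example_suite = {
    "expectation_suite_name": "example_suite",
    "expectations": [
        {
            # the expectation: a reusable, parameterised function...
            "expectation_type": "expect_column_values_to_not_be_null",
            # ...and the test: that expectation bound to concrete parameters
            "kwargs": {"column": "patient_id"},
        }
    ],
}
```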
- Tests are defined in JSON, akin to DBT schema tests in YAML
  - These live in a suite called great_expectations/expectations/<FILE>.json
- Tests can be used OOTB, but most of the time they are written custom
  - Custom expectations can operate on any tables passed as parameters, but OOTB expectations will only be applied to the table selected in batch_config.json (see the extra notes below for details)
  - OOTB tests can use views defined in DBT
  - OOTB tests can be defined directly in the JSON file
  - Custom expectations need to be added as decorated methods in plugins/custom_expectations.py (see the sketch after this list)
  - Custom tests can run arbitrary Python/pandas (even though this isn't well documented in the published GE docs)
  - Once added to custom_expectations.py, custom tests can be defined in the JSON file similarly to OOTB tests
  - OOTB tests have a variety of outputs and therefore might not conform to the format expected by the Data Integrity framework; whenever possible, use DBT tests instead
  - If a mix of OOTB and custom expectations is needed, it is suggested to keep them in two separate test suites to manage their differences efficiently
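A minimal sketch of what such a decorated method can look like, assuming the Great Expectations v2-style SqlAlchemyDataset API (which the decorated-methods-on-a-dataset-class pattern implies); the expectation name, arguments, and query below are illustrative, so follow the format of the existing methods in plugins/custom_expectations.py rather than this example:

```python
import sqlalchemy as sa
from great_expectations.dataset import SqlAlchemyDataset


class CustomSqlAlchemyDataset(SqlAlchemyDataset):
    _data_asset_type = "CustomSqlAlchemyDataset"

    # Illustrative expectation: the name, parameters, and query are made up.
    # The generic @expectation decorator registers the method so it can be
    # referenced by name from a suite JSON file, like any OOTB expectation.
    @SqlAlchemyDataset.expectation(["table_name", "column", "min_value"])
    def expect_column_minimum_in_table_to_exceed(self, table_name, column, min_value):
        # The table is passed in as a parameter (rather than taken from the
        # batch) because data-integrity checks often span more than one table.
        observed = self.engine.execute(
            sa.text(f"SELECT MIN({column}) FROM {table_name}")
        ).scalar()
        return {
            # "success" is required by GE; copy any additional result fields from
            # the existing custom expectations so the post-processing can parse them.
            "success": observed is not None and observed > min_value,
            "result": {"observed_value": observed},
        }
```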
The data integrity tool makes a few assumptions about what an expectation should accept and return.
- We create views out of the DOT results with PostgreSQL-specific syntax. If you're using any other database engine, please adapt the query in great_expectations.py.
- An expectation accepts both column names and table names as arguments. Great Expectations generally has table-agnostic suites running on specific single tables, but we're changing this model a bit because data integrity queries often depend on more than one table. Therefore, a default empty dataset is added in batch_config.json for all custom expectations, and the relevant table name should be passed to the expectation in the suite definition (see the sketch below); the default dataset won't be read at all and is only used as a placeholder.
- Custom expectations are found in custom_expectations.py under plugins; it is recommended to follow their format and to add your own custom expectations as methods of that same class.
- The tool's post-processing step expects a few specific fields in the output of the expectations (refer to the example custom expectations to see how they are implemented)
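A rough sketch of how a table name can then be passed through the suite definition, again shown as a Python dict mirroring the JSON (the expectation name and kwargs are the hypothetical ones from the sketch above, not real entries from this project):

```python
# Hypothetical suite entry -- it references the illustrative custom expectation
# sketched earlier, not a real one from this project.
custom_suite_entry = {
    "expectation_type": "expect_column_minimum_in_table_to_exceed",
    "kwargs": {
        # tells the custom expectation which table to query; the batch itself
        # only points at the empty placeholder dataset from batch_config.json
        "table_name": "some_schema.some_table",
        "column": "age",
        "min_value": 0,
    },
}
```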