Validating Community Health Data with Great Expectations

Directory structure

The great_expectations folder is a Great Expectations project.

  • Core directories

    • expectations: JSON files defining the tests to be run (similar to the yml files containing schema tests in DBT); these are not in version control but are generated by the tool from dot.configured_tests
    • notebooks: Jupyter notebooks automatically created by Great Expectations during setup, providing a more convenient front end for editing the JSON files in expectations (see the flow described in the Great Expectations documentation)
    • plugins: Additional code for customizing this Great Expectations project. The most important file here is custom_expectations.py, where tests requiring arbitrary Python should be added as methods of the CustomSqlAlchemyDataset class (somewhat similar to custom SQL tests in DBT, except written in Python).
  • Non-version controlled directories (automatically created by Great Expectations)

    • uncommitted: Generic place to capture all Great Expectations files that should not be version controlled, including logs and database connection details

    • Scripts

      • great_expectations.py: Utility script to automatically run GE tests and create coverage reports. GE's CLI commands for this are much more verbose and harder to remember than DBT's.
    • Config files (not in version control; these are either managed by the tool or set in the project-dependent config)

      • batch_config.json: Defines datasources, test suites, and tables to be included when running Great Expectations
      • great_expectations.yml: Main config file for Great Expectations (similar to dbt_project.yml)
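
The exact contents of batch_config.json are project-dependent; as a rough, hypothetical illustration of the kind of information it holds (datasources, suite names, and tables to include), it could be generated with a short script like the one below. All keys and values shown are assumptions, not the tool's actual schema.

```python
# Hypothetical sketch only: the real batch_config.json schema is set by the
# project config, so treat every key below as illustrative.
import json

batch_config = {
    "datasource_name": "dot_db",                  # assumed: datasource to validate against
    "expectation_suite_name": "chv_tests",        # assumed: suite under expectations/
    "tables": ["assess_view", "follow_up_view"],  # assumed: tables/views included in the run
}

with open("great_expectations/batch_config.json", "w") as f:
    json.dump(batch_config, f, indent=2)
```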

Terminology

  • An expectation is a particular function accepting one or more parameters (defined in Python)
  • A test is an instance of an expectation, with a specific set of parameters (defined in a JSON file)
  • Out Of The Box (OOTB) expectations are provided by Great Expectations and built into the library's codebase.
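
To make the distinction concrete, a test in an expectations/<FILE>.json suite is simply an expectation referenced by name plus its parameters. The entry below mirrors such a suite entry, shown as a Python dict for readability; the column name is hypothetical.

```python
# A "test" is one instance of an "expectation" with concrete parameters.
# This mirrors an entry in an expectations/<FILE>.json suite (column name is hypothetical).
test_instance = {
    "expectation_type": "expect_column_values_to_not_be_null",  # OOTB expectation (a Python function in GE)
    "kwargs": {"column": "patient_id"},                         # parameters for this particular test
}
```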

Structuring tests

  • Tests are defined in JSON, akin to DBT schema tests defined in yaml

    • These live in suite files at great_expectations/expectations/<FILE>.json
  • Tests can use OOTB expectations, but most of the time they are written as custom expectations

    • Custom expectations can operate on any tables passed as parameters, but OOTB expectations will only be applied to the table selected in batch_config.json (see the extra notes below for details)
    • OOTB tests can use views defined in DBT
    • OOTB tests can be defined directly in the JSON file
      • Custom expectations need to be added as decorated methods in plugins/custom_expectations.py (a hedged sketch follows below)
        • Custom tests can run arbitrary Python/pandas (even though this isn't well documented in the published GE docs)
          • Once added to custom_expectations.py, tests can be defined in the JSON file similarly to OOTB tests
      • OOTB tests have a variety of outputs and therefore might not conform to the format expected by the Data Integrity framework; whenever possible, use DBT tests instead

If a mix of OOTB and custom expectations is needed, it is suggested to keep them in two separate test suites so their differences can be managed efficiently
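
A minimal sketch of what a custom expectation in plugins/custom_expectations.py could look like, assuming Great Expectations' legacy SqlAlchemyDataset API; the method name, parameters, and returned fields are illustrative, so follow the format of the existing methods in that file rather than this sketch.

```python
# Hedged sketch: assumes the legacy SqlAlchemyDataset API and illustrative names.
import pandas as pd
from great_expectations.data_asset import DataAsset
from great_expectations.dataset import SqlAlchemyDataset


class CustomSqlAlchemyDataset(SqlAlchemyDataset):
    _data_asset_type = "CustomSqlAlchemyDataset"

    @DataAsset.expectation(["table_name", "column"])
    def expect_no_negative_values_in_table_column(self, table_name, column):
        # Arbitrary Python/pandas is allowed here: read the table passed in as a
        # parameter (not the placeholder dataset) and check it.
        df = pd.read_sql(f"SELECT {column} FROM {table_name}", self.engine)
        bad_rows = df[df[column] < 0]
        return {
            "success": bad_rows.empty,
            "result": {"unexpected_count": len(bad_rows)},  # result fields are illustrative
        }
```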

Extra notes

The data integrity tool makes a few assumptions about what an expectation should accept and return.

  1. We create views out of the DOT results with PostgreSQL-specific syntax. If you're using any other database engine, please adapt the query in great_expectations.py.

  2. An expectation accepts both column names and table names as arguments. Great Expectations generally has table-agnostic suites running on specific single tables, but we're changing this model a bit because data integrity queries often depend on more than one table. Therefore, a default empty dataset is added in batch_config.json for all custom expectations, and the relevant table name should be passed to the expectation in the suite definition. The default dataset is never actually read; it is only a placeholder.

  3. Custom expectations are found in custom_expectations.py under plugins; it is recommended to follow their format and add your own custom expectations as methods of that same class.

  4. The tool's post-processing step expects a few specific fields in the output of the expectations (refer to the example custom expectations to see how they're implemented)
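
To make points 2 and 4 concrete: a suite entry for a custom expectation passes the real table name as a keyword argument (the suite itself still runs against the placeholder dataset from batch_config.json), and the dict returned by the expectation carries the fields the post-processing reads. The names below are hypothetical; check the existing suites and custom expectations for the exact keys.

```python
# Hypothetical suite entry for the custom expectation sketched above; the table
# and column names are illustrative only.
custom_test_instance = {
    "expectation_type": "expect_no_negative_values_in_table_column",
    "kwargs": {"table_name": "chv_followups", "column": "days_since_visit"},
}
```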