Upgrade Great Expectations integration to 1.0.4 #3025

Open · wants to merge 22 commits into base: develop

Changes from 17 commits
120 changes: 29 additions & 91 deletions docs/book/component-guide/data-validators/great-expectations.md
@@ -6,17 +6,16 @@ description: >-

# Great Expectations

- The Great Expectations [Data Validator](./data-validators.md) flavor provided with the ZenML integration uses [Great Expectations](https://greatexpectations.io/) to run data profiling and data quality tests on the data circulated through your pipelines. The test results can be used to implement automated corrective actions in your pipelines. They are also automatically rendered into documentation for further visual interpretation and evaluation.
+ The Great Expectations [Data Validator](./data-validators.md) flavor provided with the ZenML integration uses [Great Expectations](https://greatexpectations.io/) to run data validation tests on the data circulated through your pipelines. The test results can be used to implement automated corrective actions in your pipelines. They are also automatically rendered into documentation for further visual interpretation and evaluation.

### When would you want to use it?

- [Great Expectations](https://greatexpectations.io/) is an open-source library that helps keep the quality of your data in check through data testing, documentation, and profiling, and to improve communication and observability. Great Expectations works with tabular data in a variety of formats and data sources, of which ZenML currently supports only `pandas.DataFrame` as part of its pipelines.
+ [Great Expectations](https://greatexpectations.io/) is an open-source library that helps keep the quality of your data in check through data testing and documentation, and improves communication and observability. Great Expectations works with tabular data in a variety of formats and data sources, of which ZenML currently supports only `pandas.DataFrame` as part of its pipelines.

You should use the Great Expectations Data Validator when you need the following data validation features that are possible with Great Expectations:

- * [Data Profiling](https://docs.greatexpectations.io/docs/oss/guides/expectations/creating_custom_expectations/how_to_add_support_for_the_auto_initializing_framework_to_a_custom_expectation/#build-a-custom-profiler-for-your-expectation): generates a set of validation rules (Expectations) automatically by inferring them from the properties of an input dataset.
- * [Data Quality](https://docs.greatexpectations.io/docs/oss/guides/validation/checkpoints/how_to_pass_an_in_memory_dataframe_to_a_checkpoint/): runs a set of predefined or inferred validation rules (Expectations) against an in-memory dataset.
- * [Data Docs](https://docs.greatexpectations.io/docs/reference/learn/terms/data_docs_store/): generate and maintain human-readable documentation of all your data validation rules, data quality checks and their results.
+ * [Data Validation](https://docs.greatexpectations.io/docs/core/trigger_actions_based_on_results/run_a_checkpoint): runs a set of predefined or inferred validation rules (Expectations) against an in-memory dataset.
+ * [Data Docs](https://docs.greatexpectations.io/docs/core/configure_project_settings/configure_data_docs/): generate and maintain human-readable documentation of all your data validation rules, data quality checks and their results.
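
To make these concrete, the snippet below is a minimal sketch of both features in raw Great Expectations 1.x, outside of ZenML. Every name in it (`my_datasource`, `my_data_asset`, `my_suite`, the `X_Minimum` column) is illustrative rather than taken from this integration, and the fluent-API calls follow the Great Expectations 1.x documentation:

```python
import great_expectations as gx
import pandas as pd

df = pd.DataFrame({"X_Minimum": [1.0, 2.0, None]})

context = gx.get_context()

# Register the in-memory DataFrame as a data asset with a whole-frame batch.
datasource = context.data_sources.add_pandas("my_datasource")
asset = datasource.add_dataframe_asset(name="my_data_asset")
batch_definition = asset.add_batch_definition_whole_dataframe("my_batch")

# Data Validation: run an Expectation Suite against the batch.
suite = context.suites.add(gx.ExpectationSuite(name="my_suite"))
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="X_Minimum")
)
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})
print(batch.validate(suite).success)

# Data Docs: rebuild the human-readable documentation site.
context.build_data_docs()
```

The ZenML step described below wraps this workflow, so the context, data source, and Data Docs management happen behind a single pipeline step.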

You should consider one of the other [Data Validator flavors](./data-validators.md#data-validator-flavors) if you need a different set of data validation features.

@@ -105,80 +104,16 @@ For more, up-to-date information on the Great Expectations Data Validator config

The core Great Expectations concepts that you should be aware of when using it within ZenML pipelines are Expectations / Expectation Suites, Validations and Data Docs.

- ZenML wraps the Great Expectations' functionality in the form of two standard steps:
- * a Great Expectations data profiler that can be used to automatically generate Expectation Suites from an input `pandas.DataFrame` dataset
- * a Great Expectations data validator that uses an existing Expectation Suite to validate an input `pandas.DataFrame` dataset
+ ZenML wraps Great Expectations' functionality in the form of a standard data validator step that uses an existing Expectation Suite or a list of Expectations to validate an input `pandas.DataFrame` dataset.

You can visualize Great Expectations Suites and Results in Jupyter notebooks or view them directly in the ZenML dashboard.
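
In a notebook, that might look roughly like the following. This is a hedged sketch: the pipeline and step names are illustrative, and the shape of `outputs` varies across ZenML versions:

```python
from zenml.client import Client

# Hedged sketch: fetch the latest validation run and render its results inline.
# Depending on your ZenML version, `outputs` may map each output name to a
# single artifact version or to a list of them.
run = Client().get_pipeline("validation_pipeline").last_run
step = run.steps["great_expectations_validator_step"]
artifact = step.outputs["output"]
if isinstance(artifact, list):  # newer ZenML versions return a list
    artifact = artifact[0]
artifact.visualize()
```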

- #### The Great Expectation's data profiler step
-
- The standard Great Expectation's data profiler step builds an Expectation Suite automatically by running a [`UserConfigurableProfiler`](https://docs.greatexpectations.io/docs/guides/expectations/how\_to\_create\_and\_edit\_expectations\_with\_a\_profiler) on an input `pandas.DataFrame` dataset. The generated Expectation Suite is saved in the Great Expectations Expectation Store, but also returned as an `ExpectationSuite` artifact that is versioned and saved in the ZenML Artifact Store. The step automatically rebuilds the Data Docs.
-
- At a minimum, the step configuration expects a name to be used for the Expectation Suite:
-
- ```python
- from zenml.integrations.great_expectations.steps import (
-     great_expectations_profiler_step,
- )
-
- ge_profiler_step = great_expectations_profiler_step.with_options(
-     parameters={
-         "expectation_suite_name": "steel_plates_suite",
-         "data_asset_name": "steel_plates_train_df",
-     }
- )
- ```
-
- The step can then be inserted into your pipeline where it can take in a pandas dataframe, e.g.:
-
- ```python
- from zenml import pipeline
-
- docker_settings = DockerSettings(required_integrations=[SKLEARN, GREAT_EXPECTATIONS])
-
- @pipeline(settings={"docker": docker_settings})
- def profiling_pipeline():
-     """Data profiling pipeline for Great Expectations.
-
-     The pipeline imports a reference dataset from a source then uses the builtin
-     Great Expectations profiler step to generate an expectation suite (i.e.
-     validation rules) inferred from the schema and statistical properties of the
-     reference dataset.
-
-     Args:
-         importer: reference data importer step
-         profiler: data profiler step
-     """
-     dataset, _ = importer()
-     ge_profiler_step(dataset)
-
-
- profiling_pipeline()
- ```
-
- As can be seen from the [step definition](https://apidocs.zenml.io/latest/integration\_code\_docs/integrations-great\_expectations/#zenml.integrations.great\_expectations.steps.ge\_profiler.great\_expectations\_profiler\_step), the step takes in a `pandas.DataFrame` dataset, and it returns a Great Expectations `ExpectationSuite` object:
-
- ```python
- @step
- def great_expectations_profiler_step(
-     dataset: pd.DataFrame,
-     expectation_suite_name: str,
-     data_asset_name: Optional[str] = None,
-     profiler_kwargs: Optional[Dict[str, Any]] = None,
-     overwrite_existing_suite: bool = True,
- ) -> ExpectationSuite:
-     ...
- ```
-
- You can view [the complete list of configuration parameters](https://apidocs.zenml.io/latest/integration\_code\_docs/integrations-great\_expectations/#zenml.integrations.great\_expectations.steps.ge\_profiler.great\_expectations\_profiler\_step) in the SDK docs.

#### The Great Expectations data validator step

- The standard Great Expectations data validator step validates an input `pandas.DataFrame` dataset by running an existing Expectation Suite on it. The validation results are saved in the Great Expectations Validation Store, but also returned as a `CheckpointResult` artifact that is versioned and saved in the ZenML Artifact Store. The step automatically rebuilds the Data Docs.
+ The standard Great Expectations data validator step validates an input `pandas.DataFrame` dataset by running an existing Expectation Suite or a list of Expectations on it. The validation results are saved in the Great Expectations Validation Store, but also returned as a `CheckpointResult` artifact that is versioned and saved in the ZenML Artifact Store. The step automatically rebuilds the Data Docs.

- At a minimum, the step configuration expects the name of the Expectation Suite to be used for the validation:
+ At a minimum, the step configuration expects the name of the Expectation Suite or a list of Expectations to be used for the validation. In the example below, we use a list of Expectations. Each expectation is defined as a `GreatExpectationExpectationConfig` object, with the name of the expectation written in snake case and its arguments defined as a dictionary:

```python
from zenml.integrations.great_expectations.steps import (
@@ -187,13 +122,18 @@ from zenml.integrations.great_expectations.steps import (

ge_validator_step = great_expectations_validator_step.with_options(
    parameters={
-       "expectation_suite_name": "steel_plates_suite",
-       "data_asset_name": "steel_plates_train_df",
-   }
+       "expectations_list": [
+           GreatExpectationExpectationConfig(
+               expectation_name="expect_column_values_to_not_be_null",
+               expectation_args={"column": "X_Minimum"},
+           )
+       ],
+       "data_asset_name": "my_data_asset",
+   },
)
```
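
If you already maintain an Expectation Suite in the Great Expectations store, the step can instead be pointed at it by name. A minimal sketch, reusing the suite and asset names from the previous version of this example and the `expectation_suite_name` parameter from the step signature further down:

```python
# Alternative configuration: reference an existing Expectation Suite by name
# instead of passing an inline expectations_list (names are illustrative).
ge_validator_step = great_expectations_validator_step.with_options(
    parameters={
        "expectation_suite_name": "steel_plates_suite",
        "data_asset_name": "steel_plates_train_df",
    }
)
```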

- The step can then be inserted into your pipeline where it can take in a pandas dataframe and a bool flag used solely for order reinforcement purposes, e.g.:
+ The step can then be inserted into your pipeline where it can take in a pandas DataFrame, e.g.:
> **hyperlint-ai[bot]** (Contributor), conversation marked as resolved.
>
> Suggested change:
> - The step can then be inserted into your pipeline where it can take in a pandas dataframe e.g.:
> + The step can then be inserted into your pipeline where it can take in a pandas DataFrame e.g.:
>
> Issue: Style Guide (spelling): the pandas object is spelled `DataFrame`, which aligns with the official pandas documentation and ensures consistency in technical writing.


```python
docker_settings = DockerSettings(required_integrations=[SKLEARN, GREAT_EXPECTATIONS])
@@ -211,23 +151,26 @@ def validation_pipeline():
        validator: dataset validation step
        checker: checks the validation results
    """
-   dataset, condition = importer()
-   results = ge_validator_step(dataset, condition)
+   dataset = importer()
+   results = ge_validator_step(dataset)
    message = checker(results)


validation_pipeline()
```
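
The `checker` step above is not defined in this diff; a hedged sketch of what it might look like, assuming the GE 1.x module path for `CheckpointResult`:

```python
from great_expectations.checkpoint.checkpoint import (  # assumed GE 1.x path
    CheckpointResult,
)
from zenml import step


@step
def checker(results: CheckpointResult) -> str:
    """Turn the validation outcome into a human-readable message."""
    if not results.success:
        return "Validation failed, check the Data Docs for details."
    return "Validation passed."
```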

- As can be seen from the [step definition](https://apidocs.zenml.io/latest/integration\_code\_docs/integrations-great\_expectations/#zenml.integrations.great\_expectations.steps.ge\_validator.great\_expectations\_validator\_step), the step takes in a `pandas.DataFrame` dataset and a boolean `condition` and it returns a Great Expectations `CheckpointResult` object. The boolean `condition` is only used as a means of ordering steps in a pipeline (e.g. if you must force it to run only after the data profiling step generates an Expectation Suite):
+ As can be seen from the [step definition](https://apidocs.zenml.io/latest/integration\_code\_docs/integrations-great\_expectations/#zenml.integrations.great\_expectations.steps.ge\_validator.great\_expectations\_validator\_step), the step takes in a `pandas.DataFrame` dataset and returns a Great Expectations `CheckpointResult` object:

```python
@step
def great_expectations_validator_step(
    dataset: pd.DataFrame,
-   expectation_suite_name: str,
+   expectation_suite_name: Optional[str] = None,
    data_asset_name: Optional[str] = None,
-   action_list: Optional[List[Dict[str, Any]]] = None,
+   action_list: Optional[List[ge.checkpoint.actions.ValidationAction]] = None,
+   expectation_parameters: Optional[Dict[str, Any]] = None,
+   expectations_list: Optional[List[GreatExpectationExpectationConfig]] = None,
+   result_format: str = "SUMMARY",
+   exit_on_error: bool = False,
) -> CheckpointResult:
```
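
The new `result_format` and `exit_on_error` parameters can be combined with either configuration style. A hedged sketch, assuming `"COMPLETE"` is accepted since it is one of Great Expectations' standard result formats, and that `exit_on_error` fails the step when validation does not pass:

```python
# Assumed usage of the new parameters; result formats in Great Expectations
# are BOOLEAN_ONLY, BASIC, SUMMARY, and COMPLETE.
ge_validator_step = great_expectations_validator_step.with_options(
    parameters={
        "expectation_suite_name": "steel_plates_suite",
        "data_asset_name": "steel_plates_train_df",
        "result_format": "COMPLETE",
        "exit_on_error": True,
    }
)
```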
@@ -262,18 +205,13 @@ def create_custom_expectation_suite(
```python
    # context = ge.get_context()

    expectation_suite_name = "custom_suite"
-   suite = context.create_expectation_suite(
-       expectation_suite_name=expectation_suite_name
-   )
-   expectation_configuration = ExpectationConfiguration(...)
-   suite.add_expectation(expectation_configuration=expectation_configuration)
-   ...
-   context.save_expectation_suite(
-       expectation_suite=suite,
-       expectation_suite_name=expectation_suite_name,
-   )
+   expectation_suite = ExpectationSuite(
+       name=expectation_suite_name,
+       expectations=[],
+   )
+   context.suites.add(expectation_suite)
    context.build_data_docs()
-   return suite
+   return expectation_suite
```
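
Under the 1.x API, expectations are added to a suite as expectation objects rather than `ExpectationConfiguration` entries. A minimal sketch of populating the suite above, assuming the standard `great_expectations.expectations` classes:

```python
    # Hedged sketch: add a GE 1.x expectation object to the suite created above.
    expectation_suite.add_expectation(
        ge.expectations.ExpectColumnValuesToNotBeNull(column="X_Minimum")
    )
```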

The same approach must be used if you are using a Great Expectations configuration managed by ZenML and are using the Jupyter notebooks generated by the Great Expectations CLI.
4 changes: 4 additions & 0 deletions docs/mocked_libs.json
@@ -101,6 +101,10 @@
"google.oauth2",
"great_expectations",
"great_expectations.checkpoint",
"great_expectations.checkpoint.checkpoint",
"great_expectations.core.batch_definition",
"great_expectations.expectations",
"great_expectations.expectations.expectation",
"great_expectations.checkpoint.types",
"great_expectations.checkpoint.types.checkpoint_result",
"great_expectations.core",
2 changes: 1 addition & 1 deletion examples/e2e_nlp/.copier-answers.yml
@@ -1,5 +1,5 @@
# Changes here will be overwritten by Copier
- _commit: 2024.08.29
+ _commit: 2024.09.23
_src_path: gh:zenml-io/template-nlp
accelerator: cpu
cloud_of_choice: aws
1 change: 0 additions & 1 deletion examples/e2e_nlp/config.yaml
@@ -28,7 +28,6 @@ settings:
- mlflow
- discord
requirements:
-   - accelerate
- zenml[server]

extra:
Expand Down
2 changes: 1 addition & 1 deletion examples/e2e_nlp/requirements.txt
@@ -1,4 +1,4 @@
- torchvision
+ accelerate
gradio
zenml[server]>=0.56.3
datasets>=2.12.0,<3.0.0
2 changes: 1 addition & 1 deletion examples/mlops_starter/.copier-answers.yml
@@ -1,5 +1,5 @@
# Changes here will be overwritten by Copier
- _commit: 2024.08.28
+ _commit: 2024.09.23
_src_path: gh:zenml-io/template-starter
email: [email protected]
full_name: ZenML GmbH
8 changes: 4 additions & 4 deletions examples/mlops_starter/README.md
@@ -24,7 +24,7 @@ Along the way we will also show you how to:

You can use Google Colab to see ZenML in action, no signup / installation required!

- <a href="https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/quickstart/quickstart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
+ <a href="https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/mlops_starter/quickstart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## :computer: Run Locally

@@ -36,7 +36,7 @@ pip install "zenml[server]"

# clone the ZenML repository
git clone https://github.com/zenml-io/zenml.git
- cd zenml/examples/quickstart
+ cd zenml/examples/mlops_starter
```

Now we're ready to start. You have two options for running the quickstart locally:
@@ -45,13 +45,13 @@ Now we're ready to start. You have two options for running the quickstart locall
```bash
pip install notebook
jupyter notebook
- # open notebooks/quickstart.ipynb
+ # open quickstart.ipynb
```

#### Option 2 - Execute the whole ML pipeline from a Python script:
```bash
# Install required zenml integrations
- zenml integration install sklearn -y
+ zenml integration install sklearn pandas -y

# Initialize ZenML
zenml init
1 change: 1 addition & 0 deletions examples/mlops_starter/configs/feature_engineering.yaml
@@ -3,6 +3,7 @@ settings:
docker:
required_integrations:
- sklearn
+   - pandas
requirements:
- pyarrow

1 change: 1 addition & 0 deletions examples/mlops_starter/configs/inference.yaml
@@ -3,6 +3,7 @@ settings:
docker:
required_integrations:
- sklearn
+   - pandas
requirements:
- pyarrow

1 change: 1 addition & 0 deletions examples/mlops_starter/configs/training_rf.yaml
@@ -3,6 +3,7 @@ settings:
docker:
required_integrations:
- sklearn
+   - pandas
requirements:
- pyarrow

1 change: 1 addition & 0 deletions examples/mlops_starter/configs/training_sgd.yaml
@@ -3,6 +3,7 @@ settings:
docker:
required_integrations:
- sklearn
+   - pandas
requirements:
- pyarrow

2 changes: 1 addition & 1 deletion examples/mlops_starter/quickstart.ipynb
@@ -31,7 +31,7 @@
"required!\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](\n",
"https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/quickstart/quickstart.ipynb)"
"https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/mlops_starter/quickstart.ipynb)"
]
},
{
1 change: 1 addition & 0 deletions examples/mlops_starter/requirements.txt
@@ -2,3 +2,4 @@ zenml[server]>=0.50.0
notebook
scikit-learn
pyarrow
+ pandas
2 changes: 1 addition & 1 deletion src/zenml/integrations/great_expectations/__init__.py
@@ -30,7 +30,7 @@ class GreatExpectationsIntegration(Integration):
"""Definition of Great Expectations integration for ZenML."""

NAME = GREAT_EXPECTATIONS
REQUIREMENTS = ["great-expectations>=0.17.15,<1.0"]
wjayesh marked this conversation as resolved.
Show resolved Hide resolved
REQUIREMENTS = ["great-expectations~=1.0.0"]

REQUIREMENTS_IGNORED_ON_UNINSTALL = ["pandas"]

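
Note that `great-expectations~=1.0.0` is a PEP 440 compatible-release specifier, equivalent to `>=1.0.0,<1.1.0`, so the integration picks up 1.0.x patch releases such as the 1.0.4 this PR targets, but not 1.1 and later.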