[ETL-686] Use S3 as GX validations store #151

Merged 4 commits on Nov 5, 2024

Conversation

philerooski (Contributor)

A lot of files are touched here, but most of the changes just involve shuffling parameters around. The primary changes are:

src/glue/jobs/run_great_expectations_on_parquet.py

  • A large part of the GX data context configuration was moved to src/glue/resources/great_expectations.yml, which is downloaded and formatted by configure_gx_config. Configuring the datasource (get_batch_request, which now uses fluent datasources rather than the legacy datasource API) and the expectation suite (add_expectations_from_json) still happens in Python (see the first sketch after this list).
  • In order to have a cumulative collection of validation reports in our docs, I had to change two things. First, the validation store must persist between job runs, which meant using S3 as the store backend rather than the local file system. Second, the data docs are now updated via checkpoints rather than the context.build_data_docs method, which rebuilds the entire site from scratch each time and, judging from some GX issue reports, can apparently take quite a while once we are adding to the data docs each day. Checkpoints were necessary because they let us use the UpdateDataDocsAction to update our data docs; as far as I can tell, there is no data context method that does this. We also get a StoreValidationResultAction with checkpoints, which saves us a step (see the second sketch after this list).
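As a rough, illustrative sketch of the fluent-datasource setup described above (the datasource, asset, and DataFrame names here are hypothetical, not the identifiers the job actually uses):

```python
import great_expectations as gx
from pyspark.sql import SparkSession

# Illustrative only -- the real job builds its context from the downloaded and
# formatted great_expectations.yml; the names below are placeholders.
spark = SparkSession.builder.getOrCreate()
dataframe = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

context = gx.get_context()

# Fluent datasource API instead of the legacy datasource configuration.
datasource = context.sources.add_spark(name="spark_datasource")
asset = datasource.add_dataframe_asset(name="parquet_dataset")

# The batch request wraps the DataFrame that the job reads from Parquet.
batch_request = asset.build_batch_request(dataframe=dataframe)
```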
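And, continuing from that sketch, a rough illustration of validating through a checkpoint whose action list stores results (to the S3-backed validations store) and updates the data docs incrementally. StoreValidationResultAction and UpdateDataDocsAction are the GX actions mentioned above; the checkpoint and suite names are placeholders:

```python
# Hypothetical suite name; the job's suite is built by add_expectations_from_json.
context.add_or_update_expectation_suite("my_expectation_suite")

checkpoint = context.add_or_update_checkpoint(
    name="parquet_checkpoint",
    validations=[
        {
            "batch_request": batch_request,
            "expectation_suite_name": "my_expectation_suite",
        }
    ],
    action_list=[
        # Persist the validation result to the configured validations store
        # (backed by S3 rather than the local file system).
        {
            "name": "store_validation_result",
            "action": {"class_name": "StoreValidationResultAction"},
        },
        # Update the data docs site incrementally instead of rebuilding it
        # from scratch the way context.build_data_docs() does.
        {
            "name": "update_data_docs",
            "action": {"class_name": "UpdateDataDocsAction"},
        },
    ],
)
result = checkpoint.run()
```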

tests/test_run_great_expectations_on_parquet.py

Because of the big refactor above, the tests changed substantially as well.

src/scripts/manage_artifacts/artifacts.py

I added the namespaced location of our data docs to the --remove option of this script, so feature branch data docs ought to be cleaned up upon branch deletion (rough sketch below).
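A minimal sketch of what that cleanup could look like; the bucket, prefix layout, and helper function are hypothetical, not the script's actual code:

```python
import boto3


def remove_data_docs(bucket: str, namespace: str) -> None:
    """Delete the namespaced data docs prefix for a feature branch.

    Hypothetical helper; the real logic lives in
    src/scripts/manage_artifacts/artifacts.py under the --remove option.
    """
    s3 = boto3.resource("s3")
    # Placeholder prefix layout for the namespaced data docs location.
    prefix = f"great_expectations/data_docs/{namespace}/"
    s3.Bucket(bucket).objects.filter(Prefix=prefix).delete()
```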

@philerooski requested a review from a team as a code owner on October 30, 2024 at 21:36
@rxu17 (Contributor) left a comment

Nice work! So much to redo just because of the non-persistent data context before :'). Just had some comments/clarification questions

sonarcloud bot commented Nov 5, 2024
