[ETL-686] Use S3 as GX validations store #151
Nice work! So much to redo just because of the non-persistent data context before :'). Just had some comments/clarification questions
LGTM!
Just a failing test: https://github.com/Sage-Bionetworks/recover/actions/runs/11637550526/job/32410945073
Quality Gate passed
A lot of files touched here, but most of those changes involve shuffling parameters around. The primary changes are:
src/glue/jobs/run_great_expectations_on_parquet.py
`src/glue/resources/great_expectations.yml`: This file is downloaded and formatted by `configure_gx_config`. The configuration of the datasource (`get_batch_request`, which now uses fluent datasources rather than the older datasource API) and the expectation suite (`add_expectations_from_json`) is still done in Python.
The `context.build_data_docs` method builds the entire site from scratch each time and, judging from some GX issue reports, can apparently take quite a while if we are adding to the data docs each day. Checkpoints were necessary, since they allow us to use the `UpdateDataDocsAction` to update our data docs. As far as I can tell, there is no way to do this using a data context method. We also get a `StoreValidationResultAction` with checkpoints, so that saves us a step.
tests/test_run_great_expectations_on_parquet.py
Because of the big refactor above, the tests changed a lot, as well.
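For reviewers less familiar with checkpoints, here is a rough sketch of the kind of action list described for `run_great_expectations_on_parquet.py`. It is not the code in this PR: the helper `build_checkpoint_config`, the checkpoint and suite names, and the placeholder batch request are all made up for illustration, and the dictionary layout assumes the 0.18-style checkpoint configuration.

```python
from __future__ import annotations

from typing import Any


def build_checkpoint_config(
    checkpoint_name: str,
    expectation_suite_name: str,
    batch_request: Any,
) -> dict[str, Any]:
    """Build keyword arguments for a checkpoint (hypothetical helper).

    The two actions mirror the ones named in the PR description:
    one stores the validation result, the other updates the data docs
    incrementally instead of rebuilding the whole site.
    """
    return {
        "name": checkpoint_name,
        "validations": [
            {
                "batch_request": batch_request,
                "expectation_suite_name": expectation_suite_name,
            }
        ],
        "action_list": [
            {
                "name": "store_validation_result",
                "action": {"class_name": "StoreValidationResultAction"},
            },
            {
                "name": "update_data_docs",
                "action": {"class_name": "UpdateDataDocsAction"},
            },
        ],
    }


# Placeholder values; in the job the batch request would come from
# get_batch_request and the names from the job arguments.
config = build_checkpoint_config(
    checkpoint_name="example_checkpoint",
    expectation_suite_name="example_suite",
    batch_request={"placeholder": "batch request"},
)

# Assumed usage with a real data context (0.18-style API, not run here):
#   checkpoint = context.add_or_update_checkpoint(**config)
#   result = checkpoint.run()
```

Keeping the configuration as a plain dictionary makes it easy to assert on in unit tests without constructing a data context.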
src/scripts/manage_artifacts/artifacts.py
I added the namespaced location of our data docs to the `--remove` option of this script, so feature branch data docs ought to be cleaned up upon branch deletion.
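As a loose illustration of the cleanup idea, the sketch below collects the S3 key prefixes that a `--remove` run might delete for one namespace. The function name, the `DATA_DOCS_PREFIX` constant, and the key layout are assumptions made for this example and are not taken from `artifacts.py`.

```python
from __future__ import annotations

# Hypothetical top-level prefix under which data docs are stored.
DATA_DOCS_PREFIX = "great_expectation_reports"


def prefixes_to_remove(namespace: str) -> list[str]:
    """Return the key prefixes to delete for a feature-branch namespace.

    Refuses an empty namespace so that a bad argument can never expand
    to a prefix covering every namespace in the bucket.
    """
    cleaned = namespace.strip().strip("/")
    if not cleaned:
        raise ValueError("namespace must be a non-empty string")
    return [
        f"{cleaned}/",                      # namespaced artifacts
        f"{DATA_DOCS_PREFIX}/{cleaned}/",   # namespaced data docs (assumed layout)
    ]


prefixes = prefixes_to_remove("etl-686")
```

Each returned prefix could then be handed to whatever deletion routine the script already uses.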