This lab module focuses on Auto Data Quality checks with bring-your-own (BYO) AGGREGATE SQL, where you specify your own rule. In this lab, the rule is a volume check: do we have enough data in the table?
Prerequisites
Successful completion of prior modules

Duration
5 minutes or less

Documentation resources
- Data Quality Overview
- About Auto Data Quality
- Use Auto Data Quality

Learning objectives
- Understand the options for data quality in Dataplex
- Gain practical experience running the Auto Data Quality SQL AGGREGATE checks feature
We will use the same table as in the Data Profiling lab module.
Familiarize yourself with the table from the BigQuery UI by running the SQL below:
SELECT * FROM oda_dq_scratch_ds.customer_master LIMIT 20
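Because the rule in this module is a volume check, it is also worth capturing the table's current row count as a baseline. One simple way to do that, assuming the same dataset and table as above, is:
SELECT COUNT(client_id) AS row_count FROM oda_dq_scratch_ds.customer_master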
We will need to grant permissions to the Dataplex Google-managed service account so it can run the aggregate query. Run the following in Cloud Shell in the Cloud Console:
# Capture the active project ID and project number
PROJECT_ID=`gcloud config list --format "value(core.project)" 2>/dev/null`
PROJECT_NBR=`gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | tr -d "'" | xargs`

# Fully qualified name of the Dataplex Google-managed service account
DATAPLEX_GMSA_FQN="service-${PROJECT_NBR}@gcp-sa-dataplex.iam.gserviceaccount.com"

# Grant the service account the roles it needs to run the aggregate query
gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:$DATAPLEX_GMSA_FQN \
--role="roles/dataplex.serviceAgent"

gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:$DATAPLEX_GMSA_FQN \
--role="roles/bigquery.user"

gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:$DATAPLEX_GMSA_FQN \
--role="roles/bigquery.jobUser"
2.3. Define Data Quality Rules - a SQL AGGREGATE check to ensure the table contains the expected volume of data
Click on the scan and define rules. Select the SQL aggregate rule type and enter the rule below:
COUNT(client_id) > 2000
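If you want to preview what this aggregate rule evaluates to before the scan runs, the same expression can be tested directly in BigQuery against the table used earlier:
SELECT COUNT(client_id) > 2000 AS volume_check_passed FROM oda_dq_scratch_ds.customer_master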
Check all the fields for the quality scan and click "Run now".
Next, experiment with the rule to make the DQ check fail by increasing the count threshold to 5000:
COUNT(client_id) > 5000
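To see why the 2000 threshold passes while the 5000 threshold fails, you can compare the table's row count against both thresholds directly in BigQuery:
SELECT
  COUNT(client_id) AS row_count,
  COUNT(client_id) > 2000 AS passes_2000_check,
  COUNT(client_id) > 5000 AS passes_5000_check
FROM oda_dq_scratch_ds.customer_master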
This concludes the module. Proceed to the next module.