Dataplex Quickstart for Cloud Architects and Engineers

1. About

Dataplex is a Google Cloud service for Data Governance and Management. Using Dataplex and complementary Data Analytics services, enterprises can stand up Data Mesh architecture in their Google Cloud data estate. To get started with Data Mesh on Google Cloud, one of the prerequisites is knowledge of Dataplex.

This repository is designed to demystify Dataplex features through a series of self-contained instructional lab modules, with minimal automation and detailed instructions with screenshots, for the full developer experience. Once you are well versed with Dataplex, you can proceed to the advanced labs that feature Data Mesh. The labs are product sponsored, and you can expect new modules to be released as new features and feature updates are announced.



**NOTE: There have been changes to the Dataplex APIs. The labs are community contributed and dated, and are kept current on a best-effort basis. We recommend reading the docs, trying out the lab well ahead of any demos you may be planning, and fixing any issues arising from API changes.**



2. Format & Duration

The lab is fully scripted (no research needed), with (fully automated) environment setup, data, code, commands, notebooks, orchestration, and configuration. Clone the repo and follow the step-by-step instructions for an end-to-end developer experience.

Expect to spend ~8 hours to fully understand and execute the labs if you are new to GCP and its services, and at least ~6 hours otherwise.


3. Level

L200 - L300 (includes Apache Spark code, Apache Airflow orchestration, Data Science notebooks and more)


4. Audience

The intended audience is anyone with an interest in architecting governance and Data Mesh on Google Cloud.


5. Prerequisites

Foundational knowledge of governance and GCP products is beneficial but not strictly required, given the format of the lab. Access to Google Cloud is a must unless you only want to read the content.


6. Goal

Simplify your learning and adoption journey of our product stack for governance with:

  1. Just enough product knowledge of Dataplex for governance
  2. Quick start code that can be repurposed for your use cases
  3. Terraform for provisioning a variety of Google Cloud data services that can be repurposed for your use cases

7. Use cases covered

Various use cases are covered, including Chicago Crimes Analytics, TelCo Customer Churn Prediction, Cell Tower Anomaly Detection, Icecream Sales Forecasting and more. This is an ever-evolving lab series; we recommend reviewing the release history for updates on use cases.


8. Flow of the lab

[Figure LP-00: flow of the lab]


For your convenience, all the code is pre-authored, so you can focus on understanding product features and integration.


9. The lab modules

Complete the lab modules in sequence. For a better lab experience, read all the modules before you start working on them.

| # | Feature | Module | Duration (minutes) |
| --- | --- | --- | --- |
| 01 | | Lab environment overview | 10 |
| 02 | | Lab environment provisioning with Terraform | 45 |
| 03 | Organize | Organize your data lake with Dataplex | 15 |
| 04 | Organize | Register assets into your Dataplex lake zones | 15 |
| 05 | Discovery | Discovery of structured Cloud Storage objects - study of entities, schemas, automated external table definitions in Dataproc Metastore Service and BigQuery | 15 |
| 06 | Catalog | Dataplex Catalog basics | 10 |
| 07 | Catalog | Creating a tag template in Dataplex and populating tags | 15 |
| 08 | Catalog | Creating a custom metadata entry in Dataplex Catalog | 10 |
| 09 | Catalog | Creating a custom metadata filesets entry in Dataplex Catalog | 10 |
| 10 | Catalog | Create an overview of a Dataplex Catalog entry | 10 |
| 11 | Catalog | Searching the Dataplex Catalog | 10 |
| 12 | Lineage | Out of the box lineage capture for BigQuery objects | 15 |
| 13 | Lineage | BigQuery lineage with Apache Airflow on Cloud Composer for orchestration | 15 |
| 14 | Lineage | Custom lineage for Apache Spark applications on Cloud Dataproc with Apache Airflow on Cloud Composer pipelines | 30 |
| 15 | Lineage | Custom lineage for custom entries in Catalog | 15 |
| 16 | Lineage | Manage lineage with lineage API | 15 |
| 17 | Lineage | Out of the box lineage capture for Dataproc Spark jobs | 20 |
| 18 | Profiling | Data profiling by example | 15 |
| 19 | Quality | Auto Data Quality for completeness - null checks | 15 |
| 20 | Quality | Auto Data Quality for validity - pattern checks | 15 |
| 21 | Quality | Auto Data Quality for validity - allowed values checks | 15 |
| 22 | Quality | Auto Data Quality for uniqueness - cell value checks | 15 |
| 23 | Quality | Auto Data Quality for validity - date checks with SQL row function | 15 |
| 24 | Quality | Data profiling by example | 15 |
| 25 | Quality | Auto Data Quality for completeness - null checks | 15 |
| 26 | Quality | Auto Data Quality for validity - pattern checks | 15 |
| 27 | Quality | Auto Data Quality for validity - allowed values checks | 15 |
| 28 | Quality | Auto Data Quality for uniqueness - duplicate checks | 15 |
| 29 | Quality | Auto Data Quality for validity - date checks with SQL row function | 15 |
| 30 | Quality | Auto Data Quality for validity - volume checks with SQL aggregate function | 15 |
| 31 | Quality | Auto Data Quality for validity - data freshness checks with SQL aggregate function | 15 |
| 32 | Quality | Auto Data Quality challenge lab | 15 |
| 33 | Quality | Data Quality Task - YAML authoring primer - 1 | 30 |
| 34 | Quality | Data Quality Task - YAML authoring primer - 2 | 10 |
| 35 | Quality | Data Quality Incident Management | 10 |
| 36 | Quality | Data Quality Dashboard | 10 |
| 37 | Quality | Data Quality Score Tags in Dataplex Catalog tags | 15 |
| 38 | Quality | Data Quality process automation with Apache Airflow on Cloud Composer | 15 |
| 39 | Quality | Data Quality operationalization end to end | 15 |
| 40 | BigLake | Upgrading external tables to BigLake and performance acceleration | 30 |

10. Don't forget to

Shut down/delete resources when done to avoid unnecessary billing.


11. Credits

| # | Google Cloud Collaborators | Contribution |
| --- | --- | --- |
| 1. | Anagha Khanolkar | Creator, Primary author, and Maintainer |
| 2. | Mansi Maharana | Data Quality Task labs are evolved from Banking Data Mesh labs |
| 3. | Jay O'Leary | Contributor |

12. Contributions welcome

Community contributions to improve the lab are very much appreciated.


13. Getting help

If you have any questions or find any problems with this repository, please report them through GitHub issues.


14. Release History

| Date | Details |
| --- | --- |
| 20230227 | Initial release |
| 20230320 | Added modules for Dataplex Auto Data Quality |
| 20230321 | Added modules for Dataplex Data Quality Tasks |
| 20230328 | Added additional modules for Dataplex Data Quality Tasks |
| 20230411 | Added BigLake module |
| 20230921 | Module on Data Profiling redone from BQ UI |
| 20230921 | New module for Dataproc Lineage |
| 20230921 | Removed references to Explore |
| 20240207 | Added example of Filesets Catalog Entry Type |