Dataplex is a Google Cloud service for Data Governance and Management. Using Dataplex and complementary Data Analytics services, enterprises can stand up a Data Mesh architecture in their Google Cloud data estate. Knowledge of Dataplex is one of the prerequisites for getting started with Data Mesh on Google Cloud.
This repository is designed to demystify Dataplex features through a series of self-contained instructional lab modules, with minimal automation and detailed instructions with screenshots, for the full developer experience. Once you are well versed with Dataplex, you can proceed to the advanced labs that feature Data Mesh. The labs are product sponsored, and you can expect new modules to be released as features are announced or updated.
**NOTE: There have been changes to the Dataplex APIs. The labs are community contributed, dated, and kept current on a best-effort basis. We recommend reading the docs, trying out the lab well ahead of any demos you may be planning, and fixing any issues arising from API changes.**
The lab is fully scripted (no research needed), with (fully automated) environment setup, data, code, commands, notebooks, orchestration, and configuration. Clone the repo and follow the step-by-step instructions for an end-to-end developer experience.
Expect to spend ~8 hours to fully understand and execute the lab if you are new to GCP and its services, and at least ~6 hours otherwise.
L200 - L300 (includes Apache Spark code, Apache Airflow orchestration, Data Science notebooks and more)
The intended audience is anyone interested in architecting governance and Data Mesh on Google Cloud.
Foundational knowledge of governance and GCP products is beneficial but not strictly required, given the format of the lab. Access to Google Cloud is a must unless you only want to read the content.
Simplify your learning and adoption journey of our product stack for governance with:
- Just enough product knowledge of Dataplex for governance
- Quick start code that can be repurposed for your use cases
- Terraform for provisioning a variety of Google Cloud data services that can be repurposed for your use case
Various use cases are covered, including Chicago Crimes Analytics, TelCo Customer Churn Prediction, Cell Tower Anomaly Detection, Icecream Sales Forecasting, and more. This is an ever-evolving lab series; we recommend reviewing the release history for updates on use cases.
For your convenience, all the code is pre-authored, so you can focus on understanding product features and integration.
Complete the lab modules sequentially. For a better lab experience, read through all the modules before you start working on them.
Shut down/delete resources when done to avoid unnecessary billing.
| # | Google Cloud Collaborators | Contribution |
|---|---|---|
| 1 | Anagha Khanolkar | Creator, Primary author, and Maintainer |
| 2 | Mansi Maharana | Data Quality Task labs are evolved from Banking Data Mesh labs |
| 3 | Jay O'Leary | Contributor |
Community contributions to improve the lab are very much appreciated.
If you have any questions or find any problems with this repository, please report them through GitHub issues.
Date | Details |
---|---|
20230227 | Initial release |
20230320 | Added modules for Dataplex Auto Data Quality |
20230321 | Added modules for Dataplex Data Quality Tasks |
20230328 | Added additional modules for Dataplex Data Quality Tasks |
20230411 | Added BigLake module |
20230921 | Module on Data Profiling redone from BQ UI |
20230921 | New module for Dataproc Lineage |
20230921 | Removed references to Explore |
20240207 | Added example of Filesets Catalog Entry Type |