
Standing Up a Cloud-Native Regulatory Reporting Architecture with BigQuery


Introduction

This document shows you how to get started with a regulatory reporting solution for cloud and run a basic pipeline. It's intended for data engineers in financial institutions who want to familiarize themselves with an architecture and best practices for producing stable, reliable regulatory reports.

In this tutorial, you establish a working example of a regulatory data processing platform on Google Cloud resources. The example platform demonstrates how you can implement a data processing pipeline that maintains data quality, auditability, and ease of change and deployment, and that also meets the following requirements of regulatory reporting:

  • Ingestion of data from source
  • Processing of large volumes of granular data
  • Aggregation of data into reports

This document assumes that you're familiar with Terraform version 1.1.7, data build tool (dbt) version 1.0.4, Cloud Storage, and BigQuery.

Objectives

  • Create infrastructure from a cloned repository.
  • Load manufactured data into BigQuery.
  • Extract regulatory metrics from granular data.
  • Containerize the extraction pipeline.

Costs

This tutorial uses the following billable components of Google Cloud: BigQuery, Cloud Storage, Cloud Build, Artifact Registry, and Cloud Composer.

To generate a cost estimate based on your projected usage, use the pricing calculator.

When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Clean up.


Before you begin

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.

In the Google Cloud console, activate Cloud Shell.

Prepare your environment

  1. In Cloud Shell, specify the project that you want to use for this tutorial:

    gcloud config set project PROJECT_ID
    

    Replace PROJECT_ID with the ID of the project that you selected or created for this tutorial.

    If a dialog displays, click Authorize.

  2. Specify a default region to use for infrastructure creation:

     gcloud config set compute/region REGION
    
  3. Create and activate a Python virtual environment:

    python -m venv reg-rpt-env
    source reg-rpt-env/bin/activate
    

    You see that your command-line prompt is prefixed with the name of the virtual environment.

  4. Clone the repository:

    git clone \
    "https://github.com/GoogleCloudPlatform/reg-reporting-blueprint"
    
  5. Install Terraform. To learn how to do this installation, see the HashiCorp documentation.

  6. Verify the installation.
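
    For example, you can verify the installation by checking the installed Terraform version:

    terraform --version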

  7. Install dbt:

    pip3 install dbt-bigquery --upgrade
    
  8. Verify the dbt installation:

    dbt --version
    

    You see the installation details displayed.

  9. Initialize the environment variables:

    cd reg-reporting-blueprint && source environment-variables.sh
    
  10. Run the setup script:

    cd common_components && ./setup_script.sh
    
  11. Enable the Service Usage API and Resource Manager API:

    gcloud services enable serviceusage.googleapis.com cloudresourcemanager.googleapis.com
    
  12. Run terraform to create the required infrastructure:

    NOTE: The APIs that you enabled in the previous step can take a few minutes to propagate, so the Terraform commands might fail at first. If they do, wait a few minutes and then retry.

    cd orchestration/infrastructure/
    terraform init -upgrade
    terraform plan
    terraform apply
    

    Type 'yes' when you see the confirmation prompt.

  13. To verify that an ingest bucket has been created, in the Google Cloud console, go to the Cloud Storage page and check for a bucket with a name that's similar to your PROJECT_ID value.
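
    Alternatively, you can list the buckets in your project from Cloud Shell and check that the ingest bucket appears in the output:

    gsutil ls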

  14. Go to the BigQuery page and verify that the following datasets have been created:

    regrep
    regrep_source
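
    Alternatively, you can list the datasets from Cloud Shell with the bq command-line tool (this uses your default project, which you set earlier):

    bq ls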
    

Upload the sample data

In this section, you explore the contents of the repository's data and data_load folders, and load sample data to BigQuery.

  1. In the Cloud Shell Editor instance, navigate to the data folder in the repository:

    cd ../../../use_cases/examples/home_loan_delinquency/data/
    

    This folder contains two subfolders, named input and expected.

    Inspect the contents of the input folder. This folder contains CSV files with sample input data. This sample data is provided only for test purposes.

    Inspect the contents of the expected folder. This folder contains the CSV files specifying the expected results once the transformations are applied.
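
    For example, you can list the sample files and preview a few rows from Cloud Shell (the exact file names depend on the use case):

    ls input expected
    head -n 5 input/*.csv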

  2. Open and inspect the data_load/schema folder, which contains files specifying the schema of the staging data:

    cd ../data_load
    

    The scripts in this folder allow the data to be loaded into Cloud Storage first, and then into BigQuery. The data conforms to the expected schema for the example regulatory reporting pipeline use case in this tutorial.
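
    For example, to see the schema files from the data_load folder:

    ls schema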

  3. Load the data into Cloud Storage:

    ./load_to_gcs.sh ../data/input
    ./load_to_gcs.sh ../data/expected
    

    The data is now available in your Cloud Storage ingest bucket.
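
    To spot-check the upload, you can list the bucket contents. This assumes that the ingest bucket name starts with your project ID and that PROJECT_ID is set by environment-variables.sh; adjust the pattern if your bucket name differs:

    gsutil ls gs://${PROJECT_ID}*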

  4. Load the data from the Cloud Storage ingest bucket to BigQuery:

    ./load_to_bq.sh
    
  5. To verify that the data has been loaded into BigQuery, in the Google Cloud console, go to the BigQuery page, select the regrep_source dataset, and open a table whose name starts with homeloan_ref and a table whose name starts with homeloan_expected.

    Select the Preview tab for each table, and confirm that each table has data.
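
    Alternatively, you can list the loaded tables from Cloud Shell (this assumes that your default project is the tutorial project):

    bq ls regrep_source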

Run the regulatory reporting pipeline

  1. In your development environment, initialize the dependencies of dbt:

    cd ../dbt/
    dbt deps
    

    This command installs the dbt dependencies that your dbt project needs.

  2. Test the connection between your local dbt installation and your BigQuery datasets:

    dbt debug
    

    At the end of the connectivity, configuration, and dependency information returned by the command, you should see the following message: All checks passed!

    In the models folder, open a SQL file and inspect the logic of the sample reporting transformations implemented in dbt.

  3. Run the reporting transformations to create the regulatory reporting metrics:

    dbt run
    
  4. Run the transformations for a date of your choice:

    dbt run --vars '{"reporting_day": "2021-09-03"}'
    

    Notice the variables that control the execution of the transformations. The reporting_day variable indicates the as-of date of the portfolio. It's a best practice to provide this value when you run the dbt run command.

  5. In the Google Cloud console, go to the BigQuery page and inspect the homeloan_dev dataset. Notice how the data has been populated, and how the reporting_day variable that you passed is used in the control.reporting_day field of the wh_denormalised view.
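
    As a quick, hypothetical spot-check from Cloud Shell (this assumes that control is a nested field of the wh_denormalised view, as described above, and that your default project is the tutorial project):

    bq query --use_legacy_sql=false \
      'SELECT DISTINCT control.reporting_day FROM homeloan_dev.wh_denormalised'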

  6. Inspect the models/schema.yml file:

    models:
     - <<: *src_current_accounts_attributes
       name: src_current_accounts_attributes
       columns:
         - name: ACCOUNT_KEY
           tests:
              - unique
              - not_null
    

    Notice how the file defines the columns and their associated data quality tests. For example, the ACCOUNT_KEY field in the src_current_accounts_attributes table must be unique and not null.

  7. Run the data quality tests that are specified in the config files:

    dbt test -s test_type:generic
    
  8. Inspect the code in the use_cases/examples/home_loan_delinquency/dbt/tests folder, which contains singular tests. Notice that the tests in this folder implement a table comparison between the actual results that are output by the dbt run command, and the expected results that are saved in the homeloan_expectedresults dataset.

  9. Run the singular tests:

    dbt test -s test_type:singular
    
  10. Generate the documentation for the project:

    dbt docs generate && dbt docs serve
    
  11. In the output, find and then click the following URL: http://127.0.0.1:8080

    Your browser opens a new tab that shows the dbt documentation web interface.

  12. Inspect the lineage of the models and their documentation. You see that the documentation includes all of the code and the documentation for the models (as specified in the models/schema.yml files).

  13. In Cloud Shell, press the following key combination:

    Ctrl + c
    

    Cloud Shell stops hosting the dbt web interface.

Optional: Containerize the transformations

  1. In Cloud Shell, create containers for the data load and dbt steps, and push the containers to Artifact Registry:

    cd ../../../../  # the gcloud command should be executed from the root 
    gcloud builds submit \
      --config use_cases/examples/home_loan_delinquency/cloudbuild.yaml \
      --substitutions "_SOURCE_URL=${SOURCE_URL},_REGISTRY_URL=${REGISTRY_URL},COMMIT_SHA=main" \
      --service-account projects/${PROJECT_ID}/serviceAccounts/builder@${PROJECT_ID}.iam.gserviceaccount.com \
      --gcs-source-staging-dir gs://${PROJECT_ID}-cloudbuild-source-staging-bucket/source
    

    The Dockerfile in the dbt and data_load directories enables containerization, which simplifies orchestration of the workflow.
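
    Optionally, to confirm that the images were pushed, you can list them in Artifact Registry. This assumes that REGISTRY_URL (set by environment-variables.sh) points at your Artifact Registry repository:

    gcloud artifacts docker images list ${REGISTRY_URL}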

  2. Retrieve the Airflow UI URL and the Cloud Storage path for DAGs, and store them in environment variables:

    cd common_components/orchestration/infrastructure/ 
    AIRFLOW_DAG_GCS=$(terraform output --raw airflow_dag_gcs_prefix)
    AIRFLOW_UI=$(terraform output --raw airflow_uri)
    
  3. Upload the home loan delinquency DAG:

    cd ../../../use_cases/examples/home_loan_delinquency/deploy/
    gsutil cp run_homeloan_dag.py $AIRFLOW_DAG_GCS
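
    To confirm that the DAG file was uploaded, you can list the DAG bucket:

    gsutil ls $AIRFLOW_DAG_GCS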
    
  4. To go to the Airflow UI, run the following command and then click the link that it displays:

    echo $AIRFLOW_UI
    

Optional: Set up the operations dashboard

  1. From the root of the repository, retrieve the Looker Studio template dashboard URL:

    echo "Operations Dashboard: $(cd common_components/orchestration/infrastructure; terraform output --raw lookerstudio_operations_dashboard_url)"
    

    Click the URL in the terminal and explore the dashboard. If you want to keep and evolve the dashboard, edit and share it.

Optional: Repeat for all the use cases

  1. Use a shell that has the environment variables set. Ensure that you have run the following command from the root of the repository:

    source environment-variables.sh
    
  2. Build all of the use cases:

    for build in use_cases/examples/*/cloudbuild.yaml ; do
       gcloud builds submit \
         --config $build \
         --substitutions "_SOURCE_URL=${SOURCE_URL},_REGISTRY_URL=${REGISTRY_URL},COMMIT_SHA=main" \
         --service-account projects/${PROJECT_ID}/serviceAccounts/builder@${PROJECT_ID}.iam.gserviceaccount.com \
         --gcs-source-staging-dir gs://${PROJECT_ID}-cloudbuild-source-staging-bucket/source
    done
    
  3. Deploy all of the DAGs:

    AIRFLOW_DAG_GCS=$(cd common_components/orchestration/infrastructure && terraform output --raw airflow_dag_gcs_prefix)
    for dag in use_cases/examples/*/deploy/*.py ; do
      gsutil cp $dag $AIRFLOW_DAG_GCS
    done
    

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the individual resources

To avoid incurring further charges, delete the individual resources that you use in this tutorial:

cd common_components/orchestration/infrastructure
terraform destroy

What's next