Merge branch 'main' into refactor-payments-rides-device-transactions
lauriemerrell committed Dec 15, 2023
2 parents d07ac4e + 29d614b commit 9c4ed1f
Showing 55 changed files with 2,695 additions and 975 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/lint.yml
@@ -12,6 +12,8 @@ jobs:
steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
with:
python-version: '3.12.0'
- uses: pre-commit/[email protected]
- uses: crate-ci/typos@master
with:
6 changes: 5 additions & 1 deletion .pre-commit-config.yaml
@@ -3,6 +3,7 @@ repos:
rev: v4.4.0
hooks:
- id: trailing-whitespace
exclude_types: ['markdown']
- id: end-of-file-fixer
exclude_types: ['jupyter']
- id: check-yaml
@@ -64,7 +65,10 @@ repos:
rev: 0.7.16
hooks:
- id: mdformat
exclude: ^warehouse/(?!README.md)
# list of exclusions: https://stackoverflow.com/a/75560858
# mdformat does not play nice with GitHub callouts: https://github.com/orgs/community/discussions/16925
# so skip README files that use them
exclude: 'README.md|warehouse/.*'
args: ["--number"]
additional_dependencies:
- mdformat-gfm
35 changes: 19 additions & 16 deletions README.md
@@ -6,8 +6,8 @@ Documentation for this codebase lives at [docs.calitp.org/data-infra](https://do

## Repository Structure

- [./airflow](./airflow) contains the local dev setup and source code for Airflow DAGs (i.e. ETL)
- [./ci](./ci) contains continuous integration and deployment scripts using GitHub actions.
- [./airflow](./airflow) contains the local dev setup and source code for Airflow DAGs (i.e. ETL).
- [./ci](./ci) contains continuous integration and deployment scripts using GitHub Actions.
- [./docs](./docs) builds the [docs site](https://docs.calitp.org/data-infra).
- [./kubernetes](./kubernetes) contains helm charts, scripts and more for deploying apps/services (e.g. Metabase, JupyterHub) on our kubernetes cluster.
- [./images](./images) contains images we build and deploy for use by services such as JupyterHub.
@@ -16,22 +16,25 @@ Documentation for this codebase lives at [docs.calitp.org/data-infra](https://do

## Contributing

- Follow the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) standard for all commits
- Use Conventional Commit format for PR titles
- Use GitHub's *draft* status to indicate PRs that are not ready for review/merging
- Do not use GitHub's "update branch" button or merge the `main` branch back into a PR branch to update it. Instead, rebase PR branches to update them and resolve any merge conflicts.
- We use GitHub's "code owners" functionality to designate a person or group of people who are in the line of approval for changes to some parts of this repository - if one or more people are automatically tagged as reviewers by GitHub when you create a PR, an approving review from at least one of them is required to merge. This does not automatically place the PR review in somebody's list of priorities, so please reach out to a reviewer to get eyes on your PR if it's time-sensitive.
### Pre-commit

This repository uses pre-commit hooks to format code, including [Black](https://black.readthedocs.io/en/stable/index.html). This ensures baseline consistency in code formatting.

> [!IMPORTANT]
> Before contributing to this project, please install pre-commit locally by running `pip install pre-commit` and `pre-commit install` in the root of the repo.
## Linting and type-checking
Once installed, pre-commit checks will run before you can make commits locally. If a pre-commit check fails, it will need to be addressed before you can make your commit. Many formatting issues are fixed automatically within the pre-commit actions, so check the changes made by pre-commit on failure -- they may have automatically addressed the issues that caused the failure, in which case you can simply re-add the files, re-attempt the commit, and the checks will then succeed.

### pre-commit
Installing pre-commit locally saves time dealing with formatting issues on pull requests. There is a [GitHub Action](./.github/workflows/lint.yml)
that runs pre-commit on all files, not just changed ones, as part of our continuous integration.

This repository uses pre-commit hooks to format code, including black. To install
pre-commit locally, run `pip install pre-commit` & `pre-commit install`
in the root of the repo. There is a [GitHub Action](./.github/workflows/lint.yml)
that runs pre-commit on all files, not just changed ones. sqlfluff is currently
disabled in the CI run due to flakiness, but it will still lint any SQL files
you attempt to commit locally.
> [!NOTE]
> [SQLFluff](https://sqlfluff.com/) is currently disabled in the CI run due to flakiness, but it will still lint any SQL files you attempt to commit locally. You will need to manually correct SQLFluff errors because we found that SQLFluff's automated fixes could be too aggressive and could change the meaning and function of affected code.
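
For orientation, here is a minimal sketch of how a lint-only SQLFluff hook can be wired into `.pre-commit-config.yaml`; the `rev` pin and `files` filter below are illustrative placeholders, not necessarily this repository's actual configuration:

```yaml
# Hypothetical sketch of a lint-only SQLFluff hook; pin `rev` to whatever
# version the project actually standardizes on.
repos:
  - repo: https://github.com/sqlfluff/sqlfluff
    rev: 2.3.5                      # placeholder version pin
    hooks:
      - id: sqlfluff-lint           # lint only; fixes are applied manually, per the note above
        files: ^warehouse/.*\.sql$  # placeholder filter for SQL files
```
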
### Pull requests
- Use GitHub's *draft* status to indicate PRs that are not ready for review/merging
- Do not use GitHub's "update branch" button or merge the `main` branch back into a PR branch to update it. Instead, rebase PR branches to update them and resolve any merge conflicts.
- We use GitHub's "code owners" functionality to designate a person or group of people who are in the line of approval for changes to some parts of this repository - if one or more people are automatically tagged as reviewers by GitHub when you create a PR, an approving review from at least one of them is required to merge. This does not automatically place the PR review in somebody's list of priorities, so please reach out to a reviewer to get eyes on your PR if it's time-sensitive.

### mypy

@@ -52,7 +55,7 @@ and `shapely` (until stubs are available, if ever). We recommend including
comments where additional asserts or other weird-looking code exist to make mypy
happy.

## Configuration via Environment Variables
### Configuration via Environment Variables

Generally we try to configure things via environment variables. In the Kubernetes
world, these get configured via Kustomize overlays ([example](./kubernetes/apps/overlays/gtfs-rt-archiver-v3-prod/archiver-channel-vars.yaml)).
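
For orientation, a minimal sketch of that pattern, assuming a strategic-merge patch on a Deployment; the resource, container, and variable names and values are illustrative placeholders, not the contents of the linked overlay file:

```yaml
# Illustrative Kustomize overlay patch: a partial Deployment that only sets
# environment variables and gets merged onto the base manifest. All names and
# values here are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gtfs-rt-archiver             # must match the base Deployment's name
spec:
  template:
    spec:
      containers:
        - name: archiver             # must match the container name in the base
          env:
            - name: ARCHIVE_BUCKET   # placeholder variable name
              value: "gs://example-rt-archive"
            - name: LOG_LEVEL
              value: "INFO"
```
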
2 changes: 0 additions & 2 deletions airflow/.env.sample

This file was deleted.

43 changes: 43 additions & 0 deletions airflow/dags/create_external_tables/README.md
@@ -3,3 +3,46 @@
Type: [Now / Scheduled](https://docs.calitp.org/data-infra/airflow/dags-maintenance.html)

This DAG orchestrates the creation of [external tables](https://cloud.google.com/bigquery/docs/external-data-sources), which serve as the interface between our raw / parsed data (stored in Google Cloud Storage) and our data warehouse (BigQuery). Most of our external tables are [hive-partitioned](https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs).

Here is an annotated example external table YAML file showing what the fields mean:

```yaml
# throughout this example, <> brackets denote sample content to be filled in based on your use case and should be removed
operator: operators.ExternalTable # the name of the operator; this does not change
bucket: gs://<your bucket name> # fill in the name of your source data bucket here
prefix_bucket: true # boolean for whether the bucket name should get a `test-` prefix when you're running from local Airflow (use this if there's a `test-` bucket used for testing)
post_hook: | # this is optional; you can provide an example query to check that the external table was created successfully. this query will run every time the external table DAG runs
SELECT *
FROM `{{ get_project_id() }}`.<your dataset as defined below under destination_project_dataset_table>.<your table name as defined below under destination_project_dataset_table>
LIMIT 1;
source_objects: # this tells the external table which path & file format to look in for the objects that will be queryable through this external table
- "<the top level folder name within your bucket that should be used for this external table like my_data>/*.<your file extension, most likely '.jsonl.gz'>"
destination_project_dataset_table: "<desired dataset name like external_my_data_source>.<desired table name, may be like topic_name__specific_data_name>" # this defines the external table name (dataset and table name) through which the data will be accessible in BigQuery
source_format: NEWLINE_DELIMITED_JSON # file format of raw data; generally should not change -- allowable options are specified here: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#ExternalDataConfiguration.FIELDS.source_format
use_bq_client: true # this option only exists for backwards compatibility; should always be true for new tables
hive_options: # this section provides information about how hive-partitioning is used
mode: CUSTOM # options are CUSTOM and AUTO. if CUSTOM, you need to define the hive partitions and their datatypes in the source_uri_prefix below; if you use AUTO, you only need to provide the top-level directory in the source_uri_prefix
require_partition_filter: false # default is true: if true, users will have to provide a filter to query this data; false is usually fine except for very large data like GTFS-RT
source_uri_prefix: "<the top level folder name within your bucket that should be used for this external table (should match what's entered in source_objects above)>/{<if CUSTOM under mode above: hive partition name: hive partition data type like 'dt:DATE'>}" # this tells the hive partitioning where to look. if mode = CUSTOM, should be something like "my_data/{dt:DATE}/{ts:TIMESTAMP}/{some_label:STRING}/" with the entire hive path defined; if mode = AUTO, should be like "my_data/"
schema_fields: # here you fill in the schema of the actual files, which will become the schema of the external table
# make one list item per column that you want to be available in BigQuery
# if there are columns in the source data that you don't want in BigQuery, you don't have to include them here
# hive partition path elements (like "date", if present) will be added as columns automatically and should not be specified here
# if you don't specify a schema, BigQuery will attempt to auto-detect the schema: https://cloud.google.com/bigquery/docs/schema-detect#schema_auto-detection_for_external_data_sources
- name: <column_name> # this should match the key name for this data in the source JSONL file; see https://cloud.google.com/bigquery/docs/schemas#column_names for BQ naming rules
mode: <column mode> # see https://cloud.google.com/bigquery/docs/schemas#modes
type: <column data type> # see https://cloud.google.com/bigquery/docs/schemas#standard_sql_data_types
- name: <second_column_name>
mode: <second column mode>
type: <second column data type>
```
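
To make the template concrete, here is a hypothetical filled-in example; the bucket, dataset, table, and column names are invented for illustration:

```yaml
# Hypothetical filled-in example; the bucket, dataset, table, and columns are
# invented and do not correspond to a real configuration.
operator: operators.ExternalTable
bucket: gs://calitp-example-widgets
prefix_bucket: true
post_hook: |
  SELECT *
  FROM `{{ get_project_id() }}`.external_example_widgets.widgets__events
  LIMIT 1;
source_objects:
  - "widgets_events/*.jsonl.gz"
destination_project_dataset_table: "external_example_widgets.widgets__events"
source_format: NEWLINE_DELIMITED_JSON
use_bq_client: true
hive_options:
  mode: CUSTOM
  require_partition_filter: false
  source_uri_prefix: "widgets_events/{dt:DATE}/{ts:TIMESTAMP}/"
schema_fields:
  - name: widget_id
    mode: REQUIRED
    type: STRING
  - name: event_count
    mode: NULLABLE
    type: INTEGER
```
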
## Testing

When testing external table creation locally, pay attention to test environment details:
* Check the `prefix_bucket` setting in your external table DAG task YAML. If `prefix_bucket` is `true`, a local Airflow run will look for a `test-` prefixed bucket and will point the external table at that test data.
* If there is test data in the `test-` bucket with a different schema than you want for the external table (for example, if during ingest development someone was changing individual field data types), that may cause errors and you may need to delete the test data with the outdated schema.
* There will usually be less data present in a `test-` bucket than in production, and the data that is present may be unrepresentative or out of date.
* External tables created by local Airflow will be created in the `cal-itp-data-infra-staging` environment.
* If you're trying to test dbt changes that rely on unmerged external table changes, you can set the `DBT_SOURCE_DATABASE` environment variable to `cal-itp-data-infra-staging`. This will cause the dbt project to use the staging environment's external tables. If the staging external tables are pointed at `test-` buckets (as described in the bullet above), then the dbt project will run on that test data, which may lead to unexpected results.
* For this reason, it is often easier to make external table updates in one pull request, get that approved and merged, and then make dbt changes once the external tables are already updated in production so you can test against the production source data.