Merge branch 'main' into refactor-payments-rides-device-transactions
lauriemerrell committed Dec 15, 2023
2 parents d07ac4e + 29d614b commit 9c4ed1f
Showing 55 changed files with 2,695 additions and 975 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/lint.yml
@@ -12,6 +12,8 @@ jobs:
steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
with:
python-version: '3.12.0'
- uses: pre-commit/[email protected]
- uses: crate-ci/typos@master
with:
6 changes: 5 additions & 1 deletion .pre-commit-config.yaml
@@ -3,6 +3,7 @@ repos:
rev: v4.4.0
hooks:
- id: trailing-whitespace
exclude_types: ['markdown']
- id: end-of-file-fixer
exclude_types: ['jupyter']
- id: check-yaml
@@ -64,7 +65,10 @@ repos:
rev: 0.7.16
hooks:
- id: mdformat
exclude: ^warehouse/(?!README.md)
# list of exclusions: https://stackoverflow.com/a/75560858
# mdformat does not play nice with GitHub callouts: https://github.com/orgs/community/discussions/16925
# so skip README files that use them
exclude: 'README.md|warehouse/.*'
args: ["--number"]
additional_dependencies:
- mdformat-gfm
35 changes: 19 additions & 16 deletions README.md
@@ -6,8 +6,8 @@ Documentation for this codebase lives at [docs.calitp.org/data-infra](https://do

## Repository Structure

- [./airflow](./airflow) contains the local dev setup and source code for Airflow DAGs (i.e. ETL)
- [./ci](./ci) contains continuous integration and deployment scripts using GitHub actions.
- [./airflow](./airflow) contains the local dev setup and source code for Airflow DAGs (i.e. ETL).
- [./ci](./ci) contains continuous integration and deployment scripts using GitHub Actions.
- [./docs](./docs) builds the [docs site](https://docs.calitp.org/data-infra).
- [./kubernetes](./kubernetes) contains helm charts, scripts and more for deploying apps/services (e.g. Metabase, JupyterHub) on our kubernetes cluster.
- [./images](./images) contains images we build and deploy for use by services such as JupyterHub.
@@ -16,22 +16,25 @@ Documentation for this codebase lives at [docs.calitp.org/data-infra](https://do

## Contributing

- Follow the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) standard for all commits
- Use Conventional Commit format for PR titles
- Use GitHub's *draft* status to indicate PRs that are not ready for review/merging
- Do not use GitHub's "update branch" button or merge the `main` branch back into a PR branch to update it. Instead, rebase PR branches to update them and resolve any merge conflicts.
- We use GitHub's "code owners" functionality to designate a person or group of people who are in the line of approval for changes to some parts of this repository - if one or more people are automatically tagged as reviewers by GitHub when you create a PR, an approving review from at least one of them is required to merge. This does not automatically place the PR review in somebody's list of priorities, so please reach out to a reviewer to get eyes on your PR if it's time-sensitive.
### Pre-commit

This repository uses pre-commit hooks to format code, including [Black](https://black.readthedocs.io/en/stable/index.html). This ensures baseline consistency in code formatting.

> [!IMPORTANT]
> Before contributing to this project, please install pre-commit locally by running `pip install pre-commit` and `pre-commit install` in the root of the repo.
## Linting and type-checking
Once installed, pre-commit checks will run before you can make commits locally. If a pre-commit check fails, it will need to be addressed before you can make your commit. Many formatting issues are fixed automatically within the pre-commit actions, so check the changes made by pre-commit on failure -- they may have automatically addressed the issues that caused the failure, in which case you can simply re-add the files, re-attempt the commit, and the checks will then succeed.

### pre-commit
Installing pre-commit locally saves time dealing with formatting issues on pull requests. There is a [GitHub Action](./.github/workflows/lint.yml)
that runs pre-commit on all files, not just changed ones, as part of our continuous integration.

This repository uses pre-commit hooks to format code, including black. To install
pre-commit locally, run `pip install pre-commit` & `pre-commit install`
in the root of the repo. There is a [GitHub Action](./.github/workflows/lint.yml)
that runs pre-commit on all files, not just changed ones. sqlfluff is currently
disabled in the CI run due to flakiness, but it will still lint any SQL files
you attempt to commit locally.
> [!NOTE]
> [SQLFluff](https://sqlfluff.com/) is currently disabled in the CI run due to flakiness, but it will still lint any SQL files you attempt to commit locally. You will need to manually correct SQLFluff errors because we found that SQLFluff's automated fixes could be too aggressive and could change the meaning and function of affected code.
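
For orientation, here is a minimal sketch of how a lint-only SQLFluff hook can be wired into `.pre-commit-config.yaml`; the `rev` pin and `files` filter below are illustrative placeholders, not necessarily this repository's actual configuration:

```yaml
# Hypothetical sketch of a lint-only SQLFluff hook; pin `rev` to whatever
# version the project actually standardizes on.
repos:
  - repo: https://github.com/sqlfluff/sqlfluff
    rev: 2.3.5                      # placeholder version pin
    hooks:
      - id: sqlfluff-lint           # lint only; fixes are applied manually, per the note above
        files: ^warehouse/.*\.sql$  # placeholder filter for SQL files
```
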
### Pull requests
- Use GitHub's *draft* status to indicate PRs that are not ready for review/merging
- Do not use GitHub's "update branch" button or merge the `main` branch back into a PR branch to update it. Instead, rebase PR branches to update them and resolve any merge conflicts.
- We use GitHub's "code owners" functionality to designate a person or group of people who are in the line of approval for changes to some parts of this repository - if one or more people are automatically tagged as reviewers by GitHub when you create a PR, an approving review from at least one of them is required to merge. This does not automatically place the PR review in somebody's list of priorities, so please reach out to a reviewer to get eyes on your PR if it's time-sensitive.

### mypy

@@ -52,7 +55,7 @@ and `shapely` (until stubs are available, if ever). We recommend including
comments where additional asserts or other weird-looking code exist to make mypy
happy.

## Configuration via Environment Variables
### Configuration via Environment Variables

Generally we try to configure things via environment variables. In the Kubernetes
world, these get configured via Kustomize overlays ([example](./kubernetes/apps/overlays/gtfs-rt-archiver-v3-prod/archiver-channel-vars.yaml)).
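
For orientation, a minimal sketch of that pattern, assuming a strategic-merge patch on a Deployment; the resource, container, and variable names and values are illustrative placeholders, not the contents of the linked overlay file:

```yaml
# Illustrative Kustomize overlay patch: a partial Deployment that only sets
# environment variables and gets merged onto the base manifest. All names and
# values here are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gtfs-rt-archiver             # must match the base Deployment's name
spec:
  template:
    spec:
      containers:
        - name: archiver             # must match the container name in the base
          env:
            - name: ARCHIVE_BUCKET   # placeholder variable name
              value: "gs://example-rt-archive"
            - name: LOG_LEVEL
              value: "INFO"
```
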
2 changes: 0 additions & 2 deletions airflow/.env.sample

This file was deleted.

43 changes: 43 additions & 0 deletions airflow/dags/create_external_tables/README.md
@@ -3,3 +3,46 @@
Type: [Now / Scheduled](https://docs.calitp.org/data-infra/airflow/dags-maintenance.html)

This DAG orchestrates the creation of [external tables](https://cloud.google.com/bigquery/docs/external-data-sources), which serve as the interface between our raw / parsed data (stored in Google Cloud Storage) and our data warehouse (BigQuery). Most of our external tables are [hive-partitioned](https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs).

Here is an annotated example external table YAML file showing what the fields mean:

```yaml
# throughout this example, <> brackets denote sample content to be filled in based on your use case and should be removed
operator: operators.ExternalTable # the name of the operator; this does not change
bucket: gs://<your bucket name> # fill in the name of your source data bucket here
prefix_bucket: true # boolean for whether the bucket name should get a `test-` prefix when you're running from local Airflow (use this if there's a `test-` bucket used for testing)
post_hook: | # this is optional; you can provide an example query to check that the external table was created successfully. this query will run every time the external table DAG runs
SELECT *
FROM `{{ get_project_id() }}`.<your dataset as defined below under destination_project_dataset_table>.<your table name as defined below under destination_project_dataset_table>
LIMIT 1;
source_objects: # this tells the external table which path & file format to look in for the objects that will be queryable through this external table
- "<the top level folder name within your bucket that should be used for this external table like my_data>/*.<your file extension, most likely '.jsonl.gz'>"
destination_project_dataset_table: "<desired dataset name like external_my_data_source>.<desired table name, may be like topic_name__specific_data_name>" # this defines the external table name (dataset and table name) through which the data will be accessible in BigQuery
source_format: NEWLINE_DELIMITED_JSON # file format of raw data; generally should not change -- allowable options are specified here: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#ExternalDataConfiguration.FIELDS.source_format
use_bq_client: true # this option only exists for backwards compatibility; should always be true for new tables
hive_options: # this section provides information about how hive-partitioning is used
mode: CUSTOM # options are CUSTOM and AUTO. if CUSTOM, you need to define the hive partitions and their datatypes in the source_uri_prefix below; if you use AUTO, you only need to provide the top-level directory in the source_uri_prefix
require_partition_filter: false # default is true: if true, users will have to provide a filter to query this data; false is usually fine except for very large data like GTFS-RT
source_uri_prefix: "<the top level folder name within your bucket that should be used for this external table (should match what's entered in source_objects above)>/{<if CUSTOM under mode above: hive partition name: hive partition data type like 'dt:DATE'>}" # this tells the hive partitioning where to look. if mode = CUSTOM, should be something like "my_data/{dt:DATE}/{ts:TIMESTAMP}/{some_label:STRING}/" with the entire hive path defined; if mode = AUTO, should be like "my_data/"
schema_fields: # here you fill in the schema of the actual files, which will become the schema of the external table
# make one list item per column that you want to be available in BigQuery
# if there are columns in the source data that you don't want in BigQuery, you don't have to include them here
# hive partition path elements (like "date", if present) will be added as columns automatically and should not be specified here
# if you don't specify a schema, BigQuery will attempt to auto-detect the schema: https://cloud.google.com/bigquery/docs/schema-detect#schema_auto-detection_for_external_data_sources
- name: <column_name> # this should match the key name for this data in the source JSONL file; see https://cloud.google.com/bigquery/docs/schemas#column_names for BQ naming rules
mode: <column mode> # see https://cloud.google.com/bigquery/docs/schemas#modes
type: <column data type> # see https://cloud.google.com/bigquery/docs/schemas#standard_sql_data_types
- name: <second_column_name>
mode: <second column mode>
type: <second column data type>
```
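
To make the template concrete, here is a hypothetical filled-in example; the bucket, dataset, table, and column names are invented for illustration:

```yaml
# Hypothetical filled-in example; the bucket, dataset, table, and columns are
# invented and do not correspond to a real configuration.
operator: operators.ExternalTable
bucket: gs://calitp-example-widgets
prefix_bucket: true
post_hook: |
  SELECT *
  FROM `{{ get_project_id() }}`.external_example_widgets.widgets__events
  LIMIT 1;
source_objects:
  - "widgets_events/*.jsonl.gz"
destination_project_dataset_table: "external_example_widgets.widgets__events"
source_format: NEWLINE_DELIMITED_JSON
use_bq_client: true
hive_options:
  mode: CUSTOM
  require_partition_filter: false
  source_uri_prefix: "widgets_events/{dt:DATE}/{ts:TIMESTAMP}/"
schema_fields:
  - name: widget_id
    mode: REQUIRED
    type: STRING
  - name: event_count
    mode: NULLABLE
    type: INTEGER
```
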
## Testing

When testing external table creation locally, pay attention to test environment details:
* Check the `prefix_bucket` setting in your external table DAG task YAML. If `prefix_bucket` is `true`, a local Airflow run will look for a `test-` prefixed bucket and will point the external table at that test data.
* If there is test data in the `test-` bucket with a different schema than you want for the external table (for example, if during ingest development someone was changing individual field data types), that may cause errors and you may need to delete the test data with the outdated schema.
* There will usually be less data present in a `test-` bucket than in production, and the data that is present may be unrepresentative or out of date.
* External tables created by local Airflow will be created in the `cal-itp-data-infra-staging` environment.
* If you're trying to test dbt changes that rely on unmerged external table changes, you can set the `DBT_SOURCE_DATABASE` environment variable to `cal-itp-data-infra-staging`. This will cause the dbt project to use the staging environment's external tables. If the staging external tables are pointed at `test-` buckets (as described in the bullet above), then the dbt project will run on that test data, which may lead to unexpected results.
* For this reason, it is often easier to make external table updates in one pull request, get that approved and merged, and then make dbt changes once the external tables are already updated in production so you can test against the production source data.