Airtable docs updates (#2868)
* wip updates

* foreign key docs and other updates

* rearrange navigation

* remove legacy docs section

* address failures in docs build - remove unused airflow page and fix toc

* rename airtable page

* remove references to contacting charlie

* update link to refactored architecture data page

* phrasing update per pr review and add link to the google sheet
lauriemerrell authored Aug 10, 2023
1 parent c99733d commit d81d53b
Showing 9 changed files with 88 additions and 155 deletions.
9 changes: 3 additions & 6 deletions docs/_toc.yml
@@ -31,13 +31,11 @@ parts:
- file: warehouse/overview
sections:
- file: warehouse/warehouse_starter_kit
- file: warehouse/navigating_dbt_docs
- file: warehouse/what_is_agency
- file: warehouse/developing_dbt_models
- file: warehouse/adding_oneoff_data
- file: warehouse/what_is_gtfs
- file: datasets_and_tables/overview
sections:
- file: datasets_and_tables/transitdatabase
- file: publishing/overview
sections:
- glob: publishing/sections/*
@@ -47,9 +45,8 @@ parts:
sections:
- file: architecture/services
- file: architecture/data
- file: airflow/overview
sections:
- file: airflow/dags-maintenance
- file: airflow/dags-maintenance
- file: transit_database/transitdatabase
- file: kubernetes/README
sections:
- file: kubernetes/JupyterHub
4 changes: 2 additions & 2 deletions docs/airflow/dags-maintenance.md
@@ -1,7 +1,7 @@
(dags-maintenance)=
# Production DAGs Maintenance
# Airflow Operational Considerations

We use [Airflow](https://airflow.apache.org/) to orchestrate our data ingest processes. This page describes how to handle cases where an Airflow [DAG task](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/tasks.html) fails.
We use [Airflow](https://airflow.apache.org/) to orchestrate our data ingest processes. This page describes how to handle cases where an Airflow [DAG task](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/tasks.html) fails. For general information about Airflow development, see the [Airflow README in the data-infra GitHub repo](https://github.com/cal-itp/data-infra/blob/main/airflow/README.md).

## Monitoring DAGs

6 changes: 0 additions & 6 deletions docs/airflow/overview.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/analytics_onboarding/overview.md
@@ -42,7 +42,7 @@
 
(get-help)=
```{admonition} Still need access to a non-Caltrans tool above?
DM Charlie <a href="https://cal-itp.slack.com/team/U027GAVHFST" target="_blank">on Cal-ITP Slack using this link</a>, or <a href="mailto: [email protected]?subject=Cal-ITP Access Issues&body=I need access to:" target="_blank">by email</a>.
Ask on the `#services-team` channel in the Cal-ITP Slack.
```

## New Analyst Training Curriculum
2 changes: 1 addition & 1 deletion docs/analytics_tools/jupyterhub.md
@@ -28,7 +28,7 @@ This avoids the need to set up a local environment, provides dedicated storage,

JupyterHub currently lives at [notebooks.calitp.org](https://notebooks.calitp.org/).

Note: you will need to have been added to the Cal-ITP organization on GitHub to obtain access. If you have yet to be added to the organization and need to be, DM Charlie on Cal-ITP Slack <a href="https://cal-itp.slack.com/team/U027GAVHFST" target="_blank">using this link</a>.
Note: you will need to have been added to the Cal-ITP organization on GitHub to obtain access. If you have yet to be added to the organization and need to be, ask in the `#services-team` channel in Slack.

(connecting-to-warehouse)=
### Connecting to the Warehouse
2 changes: 1 addition & 1 deletion docs/contribute/contribute-best-practices.md
@@ -33,7 +33,7 @@ If you feel a new section is warranted, make sure you follow Jupyter Book's guid
(new-pages)=
### New Pages and Chapters
Add new pages and chapters only as truly needed. If you're unsure of whether a new page or chapter is necessary, reach out to `@Charlie Costanzo` on `Cal-ITP Slack`.
Add new pages and chapters only as truly needed.

If you are adding new pages or chapters, you will need to also update the `_toc.yml` file. You can find more information at Jupyter Book's resource [Structure and organize content](https://jupyterbook.org/basics/organize.html).

131 changes: 0 additions & 131 deletions docs/datasets_and_tables/transitdatabase.md

This file was deleted.

80 changes: 80 additions & 0 deletions docs/transit_database/transitdatabase.md
@@ -0,0 +1,80 @@
# Transit Database (Airtable)

The Cal-ITP Airtable Transit Database stores key relationships about how transit services are organized and operated in California as well as how well they are performing. See Evan or post in the `#airtable-data` Slack channel to get a link and gain access.

Important Airtable documentation is maintained elsewhere:

* [Airtable Data Documentation Google Doc](https://docs.google.com/document/d/1KvlYRYB8cnyTOkT1Q0BbBmdQNguK_AMzhSV5ELXiZR4/edit#heading=h.u7y2eosf0i1d) - documentation of specific fields in Airtable
* [California Transit Data - Operating Procedures Google Doc](https://docs.google.com/document/d/1IO8x9-31LjwmlBDH0Jri-uWI7Zygi_IPc9nqd7FPEQM/edit#) - outlines the processes by which Airtable data is maintained

In addition, some documentation is available automatically within Airtable (these require Airtable authentication to access):
* Airtable creates an API documentation page for each base (for example, [here is the page for California Transit](https://airtable.com/appPnJWrQ7ui4UmIl/api/docs)). This page provides technical information about field types and relationships. Airtable does not currently have an effective mechanism to programmatically download your data schema (they have paused issuing keys to their metadata API).
* When looking at a base, there is an `Extensions` tab at the far upper right corner (below the share, notifications, and user icons). If you click that, an extensions sidebar will open. In that sidebar, there is an extension called `Base schema` (you may have to open it fullscreen to actually see it.) This extension will let you see an auto-generated visualization of the technical relationships among fields in the base.

Cal-ITP uses two main Airtable bases:

| **Base** | **Description** |
| :------------ | :-------------- |
| [**California Transit**](#california-transit) | Defines key organizational relationships and properties. Organizations, geography, funding programs, transit services, service characteristics, transit datasets such as GTFS, and the intersection between transit datasets and services. |
| [**Transit Technology Stacks**](#transit-technology-stacks) | Defines operational setups at transit provider organizations. Defines relationships between vendor organizations, transit provider and operator organizations, products, contracts to provide products, transit stack components, and how they relate to one another. |

The rest of this page outlines stray technical considerations associated with Airtable and its ingestion into the data warehouse.

## Primary Keys

Airtable forces the use of the left-most field of each table as its primary key: the field that other tables must reference, similar to a VLOOKUP in a spreadsheet. Unlike many databases, Airtable doesn't enforce uniqueness in the values of the primary key field. Instead, it assigns each record an underlying, mostly hidden unique [`RECORD ID`](https://support.airtable.com/hc/en-us/articles/360051564873-Record-ID), which can be exposed by creating a formula field that references it.
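
As a minimal sketch of why this matters, the Python below checks a handful of records for repeated primary field values. The records are made up, and the field name `Name` and the helper `duplicated_primary_values` are hypothetical; only the `id`/`fields` record shape follows what the Airtable API returns.

```python
from collections import Counter

# Hypothetical records shaped like Airtable API output: each record has a
# back-end "id" plus a "fields" mapping. All values here are invented.
records = [
    {"id": "recAAA111", "fields": {"Name": "City of Anaheim"}},
    {"id": "recBBB222", "fields": {"Name": "City of Anaheim"}},
    {"id": "recCCC333", "fields": {"Name": "City of Arcata"}},
]


def duplicated_primary_values(records, primary_field="Name"):
    """Return primary-field values that appear on more than one record."""
    counts = Counter(r["fields"].get(primary_field) for r in records)
    return sorted(
        value for value, n in counts.items() if n > 1 and value is not None
    )


# The primary field repeats, but the record IDs stay unique.
print(duplicated_primary_values(records))
print(len({r["id"] for r in records}) == len(records))
```

A check like this suggests why joins are safer on the record ID than on the primary field value.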

## Importing Airtable data into the Cal-ITP data warehouse

We ingest data from Airtable into the Cal-ITP data warehouse. For an overview of the data ingest process/architecture, see [the pipeline architecture documentation](architecture-data). For pointers to where Airtable-specific code and artifacts live, see [the pipeline reference Google Sheet](https://docs.google.com/spreadsheets/d/1bv1K5lZMnq1eCSZRy3sPd3MgbdyghrMl4u8HvjNjWPw/edit#gid=0).

To ingest a new Airtable table or base and make it available in the warehouse, you need to make updates throughout the data ingest flow, from the Airtable scraper Airflow DAG all the way to dbt mart tables. See [data infra PR #2781](https://github.com/cal-itp/data-infra/pull/2781) for an example of what this can look like. Ingesting new columns in an existing table is similar; see [data infra PR #2383](https://github.com/cal-itp/data-infra/pull/2383) for an example.

### Gotchas
Bringing Airtable data into the warehouse can involve a few tricky situations. Here are a few we've encountered so far, with suggested resolutions.

#### Foreign keys and bridge tables
Airtable allows users to define links between tables, to create relationships between records of different types. In the Airtable UI, these links display the primary field for the linked record in the relevant column (so, for example, the `Services.provider` column contains an organization's name like `City of Anaheim`). However, these foreign key links are exported via the Airtable API as an array of the back-end record IDs (so, instead of a single organization name like `City of Anaheim`, that `Services.provider` field will appear as an array containing a record ID, like `[rec0123asdf]`). The API does this even if the given field only ever contains exactly one foreign key (i.e., the value is an array even if every array has only one entry).

This means:
* All foreign keys need to be unpacked from arrays in the warehouse to become useful for joins. See below for more on this.
* If a linked field is severed in Airtable (if the foreign key relationship is removed, but the columns that contained the links are not deleted) it can break our data ingest, because these array-type fields will become string-type fields. Ideally, it is best to just delete any associated columns when a foreign key relationship/link is ended. If this is not done and the data ingest does break, the solution is to suppress the broken column from the associated table by removing it from the external table schema. If the external table uses schema auto-detect, you may have to define a schema for the table that does not include the broken column. See [data infra PR #2441](https://github.com/cal-itp/data-infra/pull/2441) for an example of this process (though addressing a different issue.)
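
The two points above can be sketched in Python. This only illustrates the shape of the transformation and is not the warehouse code: the sample records and the helper `unpack_foreign_keys` are hypothetical, and raising an error on a string value is just one possible way to surface a severed link.

```python
# Hypothetical service records; `provider` is a linked field, so the API
# returns it as an array of record IDs even when there is only one.
services = [
    {"id": "recSVC1", "fields": {"provider": ["rec0123asdf"]}},
    {"id": "recSVC2", "fields": {"provider": ["rec0123asdf", "rec4567qwer"]}},
    {"id": "recSVC3", "fields": {}},  # no link set on this record
]


def unpack_foreign_keys(records, link_field):
    """Yield one (record_id, linked_record_id) pair per array element."""
    for record in records:
        linked_ids = record["fields"].get(link_field) or []
        if isinstance(linked_ids, str):
            # A severed link can arrive as a plain string, not an array.
            raise TypeError(
                f"{link_field!r} on {record['id']} is a string, not an array"
            )
        for linked_id in linked_ids:
            yield record["id"], linked_id


pairs = list(unpack_foreign_keys(services, "provider"))
print(pairs)
```

Each resulting pair holds a single, non-array foreign key, which is the form that is useful for joins.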

Airtable foreign keys in the warehouse also require some special handling because:

* Most Airtable data is treated as dimensions (i.e., entities that we version over time)
* Some Airtable data contains many-to-many relationships

The mechanism that we have used to deal with both of these is the **bridge table**, [described in our dbt docs](https://dbt-docs.calitp.org/#!/overview). The bridge table stores the foreign key pairs to allow you to traverse a relationship, instead of trying to store these on each of the tables in the relationship itself. Trying to store the foreign keys on the tables directly opens you up to issues:

* You have to either store the foreign keys as an array or change the cardinality of the table (to account for the fact that one record may need to store multiple foreign keys, either to capture versioning on the foreign table or to capture relationships with multiple records). Metabase does not natively allow unnesting arrays to do joins in the GUI query editor, so we try to have non-array foreign keys in mart tables.
* You risk infinite loops if you try to version a record that includes a versioned foreign key on both sides of the relationship (which is how Airtable stores these relationships). For example, you have an organization and a service that are linked, with both containing a foreign key to the other. An attribute is changed on the service, creating a new versioned key. You need to add that new versioned service key to the organization record. But now that has triggered a change on the organization record, which makes a new versioned key on the organization record. So now you have to update the organization versioned key on the service record. And thus to infinity. Another solution here is to only store the relationship on one side, but then you still have the first problem of arrays and cardinality.

Bridge tables do introduce some complexity in handling fanout from joins, but they remove that complexity from the dimension tables themselves. Another solution would be to only store the unversioned natural key for the foreign key, in which case you would only need bridge tables for true many-to-many relationships (to handle the array/cardinality issue), but that would still create fanout without the explicit artifact of the bridge table to help troubleshoot.
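
A rough Python sketch of the bridge table idea follows. It ignores versioning entirely, and the record IDs, names, and helper functions are all hypothetical; the real bridge tables are built in dbt.

```python
# Hypothetical services with an array-valued `provider` link, plus a lookup
# of organization names by record ID. All IDs and names are invented.
services = [
    {"id": "recSVC1", "fields": {"name": "Route A", "provider": ["recORG1", "recORG2"]}},
    {"id": "recSVC2", "fields": {"name": "Route B", "provider": ["recORG1"]}},
]
organizations = {"recORG1": "Org One", "recORG2": "Org Two"}


def build_bridge(services):
    """Store each service/organization pair as its own row."""
    return [
        {"service_id": service["id"], "organization_id": org_id}
        for service in services
        for org_id in service["fields"].get("provider", [])
    ]


def organizations_for_service(bridge, organizations, service_id):
    """Traverse the relationship through the bridge instead of an array."""
    return [
        organizations[row["organization_id"]]
        for row in bridge
        if row["service_id"] == service_id
    ]


bridge = build_bridge(services)
print(len(bridge))  # more rows than services: this is the join fanout
print(organizations_for_service(bridge, organizations, "recSVC1"))
```

Neither the service nor the organization records need to store keys for the other side here, and the fanout is visible as extra rows in the bridge itself.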

#### Synced tables
Airtable allows you to "sync" a table from one base to another, where it appears with all the data from its source location and can be linked to records in the second base. An example in our Airtable is the `California Transit.organizations` table, which is synced to `Transit Technology Stacks.organizations`; you will see a little lightning icon to show that it is a synced table.

This requires special handling when importing to the warehouse, because Airtable assigns new back-end record IDs in the synced table, which means that foreign keys to the synced table in the second base will not match record IDs in the source table. We resolve this by mapping all foreign keys to point to the source table in a base layer in dbt. See [data infra PR #2781](https://github.com/cal-itp/data-infra/pull/2781) for an example.
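
The remapping can be sketched in Python as below. This page does not say how synced records are matched to their source records, so matching on a shared `Name` value is purely an assumption for illustration (and it would be unreliable wherever primary field values repeat). The record IDs and helper names are hypothetical; the actual mapping lives in dbt.

```python
# Hypothetical records: the same organization has a different back-end
# record ID in the source base and in the synced copy.
source_orgs = [{"id": "recSRC1", "fields": {"Name": "City of Anaheim"}}]
synced_orgs = [{"id": "recSYN1", "fields": {"Name": "City of Anaheim"}}]


def build_id_map(source_records, synced_records, key_field="Name"):
    """Map synced record IDs to source record IDs via a shared key field.

    Matching on `key_field` is an assumption made for this sketch.
    """
    source_by_key = {r["fields"][key_field]: r["id"] for r in source_records}
    return {
        r["id"]: source_by_key[r["fields"][key_field]]
        for r in synced_records
        if r["fields"][key_field] in source_by_key
    }


def remap_foreign_keys(linked_ids, id_map):
    """Point foreign keys at the source table, leaving unknown IDs as-is."""
    return [id_map.get(linked_id, linked_id) for linked_id in linked_ids]


id_map = build_id_map(source_orgs, synced_orgs)
print(remap_foreign_keys(["recSYN1"], id_map))
```

After remapping, foreign keys from the second base match record IDs in the source table, so both bases can join to one set of organization records.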

## Entity Relationship Diagrams

The following entity relationship diagrams were last updated in 2022 but are preserved for general reference purposes.

### California Transit

[![](https://mermaid.ink/img/pako:eNqVVEtv4jAQ_iuWz0W9c1stbbWHbhFw5DLEEzJax07HDqss4b_vOCQQXlLLBSX6XuP5nL3OvEE91cgzgi1DuXZKfh-8BUf_IJJ36tBOJn6vlsg7ynCq1roEB1sMa_0ltK-QIT6C-w7-FvMwgwgBY6Jk3oW6_BalYm_q7HuUemMpFEfOY9b4XaJRUBUwuqj8GO3zu9btQxEwJTkKEZnc9sta7a3W-_zjebGa_1C5ZxULVJIO0sN3RCRQolYWnEt5oI6FZ4rNQ9VHw7bqp69dbN7QS6OqounlCwTzWQPLwGgUuUHnCq3atgs4t5DhhYbBkDFtMKiso0ws7tCKkoQqjwFG8V4l77KR4y0HxeuRoaosiVr05wL0vQ3D8kc9Pu4dIoMLFAer3qx2Rk5tzilu2V2C9oKcC-DUzQUZ5AV-1sRYpiLdml1mS6QXS1uSwsbm5HLDsvLvgtCwA5Pt93czX3ck_Y3oX6WLkTQYc4tZlBVtmsF7dHGG2e4wKS2mrJiCkM8VXkH4c_c8Bep3GJ7_ehaAPxXiFdG8Y2TKwtCoq5t7bkIqJqOVne5QiaLnCE7mO9v_Xs1-SUMGpXEJQtIqvDXinueUEdgEV0acujz6SZco3SIj38h90ltrcSxxrY8xcqhtTE4HgdaVEPHFUPSsp5FrfNJyjfyycdnwfMT0H1s9zcEGPPwHJNjt_A)](https://mermaid-js.github.io/mermaid-live-editor/edit/#pako:eNqVVEtv4jAQ_iuWz0W9c1stbbWHbhFw5DLEEzJax07HDqss4b_vOCQQXlLLBSX6XuP5nL3OvEE91cgzgi1DuXZKfh-8BUf_IJJ36tBOJn6vlsg7ynCq1roEB1sMa_0ltK-QIT6C-w7-FvMwgwgBY6Jk3oW6_BalYm_q7HuUemMpFEfOY9b4XaJRUBUwuqj8GO3zu9btQxEwJTkKEZnc9sta7a3W-_zjebGa_1C5ZxULVJIO0sN3RCRQolYWnEt5oI6FZ4rNQ9VHw7bqp69dbN7QS6OqounlCwTzWQPLwGgUuUHnCq3atgs4t5DhhYbBkDFtMKiso0ws7tCKkoQqjwFG8V4l77KR4y0HxeuRoaosiVr05wL0vQ3D8kc9Pu4dIoMLFAer3qx2Rk5tzilu2V2C9oKcC-DUzQUZ5AV-1sRYpiLdml1mS6QXS1uSwsbm5HLDsvLvgtCwA5Pt93czX3ck_Y3oX6WLkTQYc4tZlBVtmsF7dHGG2e4wKS2mrJiCkM8VXkH4c_c8Bep3GJ7_ehaAPxXiFdG8Y2TKwtCoq5t7bkIqJqOVne5QiaLnCE7mO9v_Xs1-SUMGpXEJQtIqvDXinueUEdgEV0acujz6SZco3SIj38h90ltrcSxxrY8xcqhtTE4HgdaVEPHFUPSsp5FrfNJyjfyycdnwfMT0H1s9zcEGPPwHJNjt_A)

[editable source](https://mermaid-js.github.io/mermaid-live-editor/edit/#pako:eNqVVEtv4jAQ_iuWz0W9c1stbbWHbhFw5DLEEzJax07HDqss4b_vOCQQXlLLBSX6XuP5nL3OvEE91cgzgi1DuXZKfh-8BUf_IJJ36tBOJn6vlsg7ynCq1roEB1sMa_0ltK-QIT6C-w7-FvMwgwgBY6Jk3oW6_BalYm_q7HuUemMpFEfOY9b4XaJRUBUwuqj8GO3zu9btQxEwJTkKEZnc9sta7a3W-_zjebGa_1C5ZxULVJIO0sN3RCRQolYWnEt5oI6FZ4rNQ9VHw7bqp69dbN7QS6OqounlCwTzWQPLwGgUuUHnCq3atgs4t5DhhYbBkDFtMKiso0ws7tCKkoQqjwFG8V4l77KR4y0HxeuRoaosiVr05wL0vQ3D8kc9Pu4dIoMLFAer3qx2Rk5tzilu2V2C9oKcC-DUzQUZ5AV-1sRYpiLdml1mS6QXS1uSwsbm5HLDsvLvgtCwA5Pt93czX3ck_Y3oX6WLkTQYc4tZlBVtmsF7dHGG2e4wKS2mrJiCkM8VXkH4c_c8Bep3GJ7_ehaAPxXiFdG8Y2TKwtCoq5t7bkIqJqOVne5QiaLnCE7mO9v_Xs1-SUMGpXEJQtIqvDXinueUEdgEV0acujz6SZco3SIj38h90ltrcSxxrY8xcqhtTE4HgdaVEPHFUPSsp5FrfNJyjfyycdnwfMT0H1s9zcEGPPwHJNjt_A)

### Transit Stacks

[![](https://mermaid.ink/img/pako:eNqdk7tuwzAMRX9F0JzH7jXp0ClF09ELITGyAFs0KClFG-ffS7_6SNu0iEbp3MtLSjppQxZ1oZG3HhxDUwYla8cOgn-F5ClEde6WSzqpPfLRGyxUqRsI4DCW-kecBnxDITGY1PP0HP4PV1Tb6_QDk80jHLGuB3jEp4wbaloKGJIoVqvuS3YfVY5oVSLV1hDWYy9rapEh4Vz3N6NPpcUoSlQF8bqoU-8bk0wa9cH9JbwYi-hapqO3ffaKKbvqo-9HrMcRVb79ZtZ1Q4rL_d50x975AEOcOJ4rMwNzulvNn4Adpht9pwlsIcHeVNhA73gfxSUENEmGkKOkLrVe6Aa5AW_lHZ9651InEchV9hKLB8j1UPMsaG6t3PKd9YlYFweoIy405ET7l2B0kTjjDE0_YqLOb7JEHuQ)](https://mermaid-js.github.io/mermaid-live-editor/edit/#pako:eNqdk7tuwzAMRX9F0JzH7jXp0ClF09ELITGyAFs0KClFG-ffS7_6SNu0iEbp3MtLSjppQxZ1oZG3HhxDUwYla8cOgn-F5ClEde6WSzqpPfLRGyxUqRsI4DCW-kecBnxDITGY1PP0HP4PV1Tb6_QDk80jHLGuB3jEp4wbaloKGJIoVqvuS3YfVY5oVSLV1hDWYy9rapEh4Vz3N6NPpcUoSlQF8bqoU-8bk0wa9cH9JbwYi-hapqO3ffaKKbvqo-9HrMcRVb79ZtZ1Q4rL_d50x975AEOcOJ4rMwNzulvNn4Adpht9pwlsIcHeVNhA73gfxSUENEmGkKOkLrVe6Aa5AW_lHZ9651InEchV9hKLB8j1UPMsaG6t3PKd9YlYFweoIy405ET7l2B0kTjjDE0_YqLOb7JEHuQ)

[editable source](https://mermaid-js.github.io/mermaid-live-editor/edit/#pako:eNqdk7tuwzAMRX9F0JzH7jXp0ClF09ELITGyAFs0KClFG-ffS7_6SNu0iEbp3MtLSjppQxZ1oZG3HhxDUwYla8cOgn-F5ClEde6WSzqpPfLRGyxUqRsI4DCW-kecBnxDITGY1PP0HP4PV1Tb6_QDk80jHLGuB3jEp4wbaloKGJIoVqvuS3YfVY5oVSLV1hDWYy9rapEh4Vz3N6NPpcUoSlQF8bqoU-8bk0wa9cH9JbwYi-hapqO3ffaKKbvqo-9HrMcRVb79ZtZ1Q4rL_d50x975AEOcOJ4rMwNzulvNn4Adpht9pwlsIcHeVNhA73gfxSUENEmGkKOkLrVe6Aa5AW_lHZ9651InEchV9hKLB8j1UPMsaG6t3PKd9YlYFweoIy405ET7l2B0kTjjDE0_YqLOb7JEHuQ)

## Dashboards

## DAGs Maintenance

You can find further information on DAGs maintenance for Transit Database data [on this page](dags-maintenance).
@@ -48,10 +48,3 @@ To examine the documentation for our tables from the `Project` perspective:
2. Within that list, select `models`
3. From here, file directories will appear below.
4. Select the directory of your choice. A dropdown list of tables will appear and you can select a table to view its documentation

# Legacy documentation
In general, the dbt docs should be the main source of all documentation for warehouse entities (sources, views, tables, "models", etc.) but the following pages contain some not-yet-migrated documentation.

| page | description | datasets |
| ---- | ----------- | -------- |
| [Transit Database](./transitdatabase.md) | A representation of Cal-ITP's internal knowledge about our Transit Operators in CA and various pieces of National Transit Database statistics for ease of use | `airtable.*`, `staging.transit_database__*`, `transitstacks.*` |
