Document data flows/architecture (#2847)
* revamp architecture docs

* add details by data source

* minor phrasing

* see if this makes docs preview action pass?

* oops remove quotes

* remove spaces for attempted manual spacing

* move table to google sheet for space

* split architecture page into sections

* actually commit changes to data page

* clarify purpose of ad hoc imports page

* remove refs to charlie and fix link

* actually fix link

* tighten language

* reorder page to put diagram at bottom

* add example PR and label diagram
lauriemerrell authored Aug 9, 2023
1 parent fd5818e commit a4c6ec4
Showing 7 changed files with 230 additions and 145 deletions.
3 changes: 3 additions & 0 deletions docs/_toc.yml
@@ -44,6 +44,9 @@ parts:
- caption: Developers
chapters:
- file: architecture/architecture_overview
sections:
- file: architecture/services
- file: architecture/data
- file: airflow/overview
sections:
- file: airflow/dags-maintenance
2 changes: 0 additions & 2 deletions docs/analytics_tools/saving_code.md
@@ -121,8 +121,6 @@ If you would like to commit directly from the Github User Interface:

1. Navigate to the Github repository and folder where you would like to add your work, and locate the file on your computer that you would like to commit

(Note: if you would like to commit your file to a directory that does not yet exist, <a href="https://cal-itp.slack.com/team/U027GAVHFST" target="_blank">message Charlie on Cal-ITP Slack</a> to add it for you)



![Collection Matrix](assets/step-1-gh-drag-drop.png)
2 changes: 1 addition & 1 deletion docs/analytics_tools/tools_quick_links.md
@@ -17,5 +17,5 @@

&nbsp;
```{admonition} Still need access to a tool on this page?
DM Charlie <a href="https://cal-itp.slack.com/team/U027GAVHFST" target="_blank">on Cal-ITP Slack using this link</a>, or <a href="mailto: [email protected]?subject=Cal-ITP Access Issues&body=I need access to:" target="_blank">by email</a>.
Ask in the `#services-team` channel in the Cal-ITP Slack.
```
173 changes: 37 additions & 136 deletions docs/architecture/architecture_overview.md
@@ -1,159 +1,60 @@
(architecture-overview)=
# Architecture Overview

## Deployed services
The Cal-ITP data infrastructure facilitates several types of data workflows:

| Name | Function | URL | Source code | K8s namespace | Development/test environment? |
|------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|-----------------------------------------------------------------------------------------------------|--------------------|-------------------------------|
| Airflow | General orchestration/automation platform; downloads non-GTFS Realtime data and orchestrates data transformations outside of dbt; executes stateless jobs such as dbt and data publishing | https://o1d2fa0877cf3fb10p-tp.appspot.com/home | https://github.com/cal-itp/data-infra/tree/main/airflow | n/a | Yes (local) |
| GTFS-RT Archiver | Downloads GTFS Realtime data (more rapidly than Airflow can handle) | n/a | https://github.com/cal-itp/data-infra/tree/main/services/gtfs-rt-archiver-v3 | gtfs-rt-v3 | Yes (gtfs-rt-v3-test) |
| Metabase | Web-hosted BI tool | https://dashboards.calitp.org | https://github.com/cal-itp/data-infra/tree/main/kubernetes/apps/charts/metabase | metabase | Yes (metabase-test) |
| Grafana | Application observability (i.e. monitoring and alerting on metrics) | https://monitoring.calitp.org | https://github.com/JarvusInnovations/cluster-template/tree/develop/k8s-common/grafana (via hologit) | monitoring-grafana | No |
| Sentry | Application error observability (i.e. collecting errors for investigation) | https://sentry.calitp.org | https://github.com/cal-itp/data-infra/tree/main/kubernetes/apps/charts/sentry | sentry | No |
| JupyterHub | Kubernetes-driven Jupyter workspace provider | https://notebooks.calitp.org | https://github.com/cal-itp/data-infra/tree/main/kubernetes/apps/charts/jupyterhub | jupyterhub | No |
* `Ingestion`
* `Modeling/transformation`
* `Analysis`

In addition, we have `Infrastructure` tools that monitor the health of the system itself, or that deploy or run other services; these do not directly interact with data or support end-user data access.

The following diagram outlines, in broad terms, the main tools in our data stack (excluding `Infrastructure` tools).

## Code and deployments (unless otherwise specified, deployments occur via GitHub Actions)
```{mermaid}
flowchart TD
%% note that you seemingly cannot have a subgraph that only contains other subgraphs
%% so I am using "label" nodes to make sure each subgraph has at least one direct child
subgraph repos[ ]
repos_label[GitHub repositories]
data_infra_repo[data-infra]
data_analyses_repo[data-analyses]
reports_repo[reports]
flowchart LR
subgraph ingestion[ ]
ingestion_label[Ingestion/Orchestration]
airflow[Airflow]
python[Python scripts]
gcs[Google Cloud Storage]
airflow --schedules/executes--> python
python --save data to-->gcs
end
subgraph kubernetes[ ]
kubernetes_label[Google Kubernetes Engine]
subgraph airflow[us-west2-calitp-airflow2-pr-171e4e47-gke]
airflow_label[Production Airflow <br><i><a href='https://console.cloud.google.com/composer/environments?project=cal-itp-data-infra&supportedpurview=project'>Composer</a></i>]
airflow_dags
airflow_plugins
end
subgraph data_infra_apps_cluster[ ]
data_infra_apps_label[data-infra-apps]
subgraph rt_archiver[GTFS-RT Archiver]
rt_archiver_label[<a href='https://github.com/cal-itp/data-infra/tree/main/services/gtfs-rt-archiver-v3'>RT archiver</a>]
prod_rt_archiver[gtfs-rt-v3 archiver]
test_rt_archiver[gtfs-rt-v3-test archiver]
end
jupyterhub[<a href='https://notebooks.calitp.org'>JupyterHub</a>]
metabase[<a href='https://dashboards.calitp.org'>Metabase</a>]
grafana[<a href='https://monitoring.calitp.org'>Grafana</a>]
sentry[<a href='https://sentry.calitp.org'>Sentry</a>]
end
subgraph modeling[ ]
modeling_label[Modeling/Transformation]
bq[BigQuery]
dbt[dbt]
gcs--data read into-->bq
bq <--SQL data transformations--> dbt
python -- execute--> dbt
end
subgraph netlify[ ]
netlify_label[Netlify]
data_infra_docs[<a href='https://docs.calitp.org/data-infra'>data-infra Docs</a>]
reports_website[<a href='https://reports.calitp.org'>California GTFS Quality Dashboard</a>]
analysis_portfolio[<a href='https://analysis.calitp.org'>Cal-ITP Analysis Portfolio</a>]
subgraph analysis[ ]
analysis_label[Analysis]
metabase[Metabase]
jupyter[JupyterHub]
open_data[California Open Data]
python -- publishes to--> open_data
end
data_infra_repo --> airflow_dags
data_infra_repo --> airflow_plugins
data_infra_repo --> rt_archiver
data_infra_repo --> jupyterhub
data_infra_repo --> metabase
data_infra_repo --> grafana
data_infra_repo --> sentry
bq -- data accessible in--> analysis
data_infra_repo --> data_infra_docs
data_analyses_repo --> jupyterhub --->|portfolio.py| analysis_portfolio
reports_repo --> reports_website
classDef default fill:white, color:black, stroke:black, stroke-width:1px
classDef group_labelstyle fill:#cde6ef, color:black, stroke-width:0px
class repos_label,kubernetes_label,netlify_label group_labelstyle
```

## Data flow
* Dotted lines indicate data flow from external (i.e. non-Cal-ITP) sources, such as agency-hosted GTFS feeds
* Orange lines indicate manual data flows, such as an analyst executing a Jupyter notebook
* Yellow nodes indicate testing/development environments
```{mermaid}
flowchart LR
%% default styles
classDef default fill:white, color:black, stroke:black, stroke-width:1px
linkStyle default stroke:black, stroke-width:4px
classDef test fill:#fdfcd8, color:#000000
classDef group fill:#cde6ef, color:black, stroke-width:0px
classDef subgroup fill:#14A6E0, color:white
%% note that you seemingly cannot have a subgraph that only contains other subgraphs
%% so I am using "label" nodes to make sure each subgraph has at least one direct child
subgraph sources[ ]
data_sources_label[Data Sources]:::group
raw_gtfs[Raw GTFS schedule data]
airtable[<a href='https://airtable.com/'>Airtable</a>]
raw_payment[Raw fare payment]
raw_rt[Raw GTFS RT feeds]
end
subgraph rt_archiver[ ]
rt_archiver_label[<a href='https://github.com/cal-itp/data-infra/tree/main/services/gtfs-rt-archiver-v3'>RT archiver</a>]:::group
prod_rt_archiver[Prod archiver]
test_rt_archiver[Test archiver]:::test
end
subgraph airflow[ ]
airflow_label[Airflow]:::group
airflow_prod[Production Airflow <br><i><a href='https://console.cloud.google.com/composer/environments?project=cal-itp-data-infra&supportedpurview=project'>Composer</a></i>]
airflow_local[Local Airflow <br><i><a href='https://github.com/cal-itp/data-infra/blob/main/airflow/README.md'>Setup</a></i>]:::test
end
subgraph gcp[Google Cloud Project]
subgraph bigquery[<a href='https://console.cloud.google.com/bigquery'>BigQuery</a>]
bq_cal_itp_data_infra[(cal-itp-data-infra)]
bq_cal_itp_data_infra_staging[(cal-itp-data-infra-staging)]:::test
end
subgraph gcs[<a href='https://console.cloud.google.com/storage/browser'>Google Cloud Storage</a>]
gcs_raw[(Raw)]
gcs_parsed[(Parsed)]
gcs_validation[(Validation)]
gcs_analysis[(Analysis artifacts)]
gcs_map_tiles[(Map tiles/GeoJSON)]
gcs_other[(Backups, Composer code, etc.)]
gcs_test[(test-* buckets)]:::test
end
end
subgraph consumers[ ]
consumers_label[Data consumers]:::group
jupyterhub[<a href='https://hubtest.k8s.calitp.jarv.us/hub/'>JupyterHub</a>]
metabase[<a href='https://dashboards.calitp.org/'>Metabase - dashboards.calitp.org</a>]
reports_website[<a href='https://reports.calitp.org'>reports.calitp.org</a>]
end
%% subgraphs cannot be styled in-line
class sources,rt_archiver,airflow,gcp,consumers group
%% manual actions; put first for easier style indexing
%% add indices to linkStyle as new manual connections exist
jupyterhub --> gcs_analysis
jupyterhub --> gcs_map_tiles
linkStyle 0,1 stroke:orange, stroke-width:4px
%% data sources and transforms
raw_gtfs -.-> airflow
airtable --> airflow
raw_payment -.-> airflow
airflow_prod ---> gcs_raw
airflow_local ---> gcs_test
raw_rt -.-> prod_rt_archiver --> gcs_raw
raw_rt -.-> test_rt_archiver --> gcs_test
%% data transformations
gcs_raw -->|<a href='https://github.com/MobilityData/gtfs-validator'>GTFS Schedule validator</a>| gcs_validation
gcs_raw -->|<a href='https://github.com/MobilityData/gtfs-realtime-validator'>GTFS-RT validator</a>| gcs_validation
gcs_raw -->|"GTFS Schedule, RT, Payments, etc. parsing jobs"| gcs_parsed
%% data consumption
gcs_parsed --> bigquery
gcs_validation --> bigquery
bigquery --> consumers
class ingestion_label,modeling_label,analysis_label group_labelstyle
```

This documentation outlines two ways to think of this system and its components from a technical/maintenance perspective:
* [Services](services) that are deployed and maintained (e.g., Metabase, JupyterHub)
* [Data pipelines](data) that ingest specific types of data (e.g., GTFS Schedule, Payments)

## Environments

Across both data and services, we often have a "production" (live, end-user-facing) environment and some type of testing, staging, or development environment.

### production
* Managed Airflow (i.e. Google Cloud Composer)
* Production gtfs-rt-archiver-v3
111 changes: 111 additions & 0 deletions docs/architecture/data.md
@@ -0,0 +1,111 @@
(architecture-data)=

# Data pipelines
In general, our data ingest follows versions of the pattern diagrammed below. For an example PR that ingests a brand new data source from scratch, see [data infra PR #2376](https://github.com/cal-itp/data-infra/pull/2376).

Some of the key attributes of our approach:
* We generate an [`outcomes`](https://github.com/cal-itp/data-infra/blob/main/packages/calitp-data-infra/calitp_data_infra/storage.py#L418) file describing whether scrape, parse, or validate operations were successful. This makes operation outcomes visible in BigQuery, so they can be analyzed (for example: how long has the download operation for X feed been failing?).
* We limit manipulation in Airflow tasks to the bare minimum needed to make the data legible to BigQuery (for example, replacing illegal column names that would break the external tables). We use gzipped JSONL files in GCS as our default parsed data format; see the sketch after this list.
* [External tables](https://cloud.google.com/bigquery/docs/external-data-sources#external_tables) provide the interface between ingested data and BigQuery modeling/transformations.
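
To make the parse step concrete, here is a minimal sketch of writing gzipped JSONL output alongside an outcomes file. It is illustrative only: the names (`ParseOutcome`, `parse_file`, the field set) are assumptions for this sketch rather than the actual `calitp-data-infra` API, and it writes to local paths rather than GCS.

```python
import gzip
import json
from dataclasses import asdict, dataclass


@dataclass
class ParseOutcome:
    """Hypothetical outcome record; the real schema lives in
    calitp_data_infra/storage.py and is richer than this sketch."""

    filename: str
    success: bool
    exception: str | None = None


def parse_file(filename: str, records: list[dict], parsed_path: str) -> ParseOutcome:
    """Write one raw file's records as gzipped JSONL and report the outcome."""
    try:
        with gzip.open(parsed_path, "wt", encoding="utf-8") as f:
            for record in records:
                # Minimal manipulation: only rename keys that BigQuery would
                # reject as external-table column names.
                clean = {k.replace("-", "_").replace(" ", "_"): v for k, v in record.items()}
                f.write(json.dumps(clean) + "\n")
        return ParseOutcome(filename=filename, success=True)
    except Exception as e:
        return ParseOutcome(filename=filename, success=False, exception=str(e))


def write_outcomes(outcomes: list[ParseOutcome], outcomes_path: str) -> None:
    # The outcomes file is itself JSONL, so it can be exposed through an
    # external table and queried in BigQuery like any other data.
    with open(outcomes_path, "w", encoding="utf-8") as f:
        for outcome in outcomes:
            f.write(json.dumps(asdict(outcome)) + "\n")
```

In the real pipeline, both files land in the GCS buckets shown in the diagram below rather than on local disk.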

While many key elements of our architecture are common across data sources, each data source also has some unique aspects. [This spreadsheet](https://docs.google.com/spreadsheets/d/1bv1K5lZMnq1eCSZRy3sPd3MgbdyghrMl4u8HvjNjWPw/edit#gid=0) provides an overview for each data source, outlining the specific code/resources that correspond to each step in the general data flow shown below.
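
The external tables mentioned above are created by Airflow; purely as a rough sketch of the equivalent operation, the following uses the `google-cloud-bigquery` client with made-up project, dataset, and bucket names (the real names are defined by the DAGs and environment configuration).

```python
from google.cloud import bigquery

client = bigquery.Client()

# All identifiers below are placeholders for illustration.
table = bigquery.Table("my-project.external_my_source.parse_outcomes")

config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
config.source_uris = ["gs://my-parsed-bucket/my_source/*.jsonl.gz"]
config.compression = "GZIP"
config.autodetect = True  # or attach an explicit schema instead

# No data is loaded into BigQuery storage; the table definition simply
# points at the gzipped JSONL objects sitting in GCS.
table.external_data_configuration = config
client.create_table(table, exists_ok=True)
```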

## Data ingest diagram

```{mermaid}
flowchart TD
raw_data((Raw data <br> in external source))
airflow_scrape{<b>Airflow</b>: <br> scraper}
airflow_parse{<b>Airflow</b>: <br> parser}
airflow_validate{<b>Airflow</b>: <br> GTFS validator}
airflow_external_tables{<b>Airflow</b>: <br> create<br>external tables}
dbt([<br><b>dbt</b><br><br>])
airflow_dbt{<b>Airflow</b>: <br> run dbt}
subgraph first_bq[ ]
bq_label1[BigQuery<br>External tables dataset]
ext_raw_outcomes[(Scrape outcomes<br>external table)]
ext_parse_data[(Parsed data<br>external table)]
ext_parse_outcomes[(Parse outcomes<br>external table)]
ext_validations[(Validations<br>external table)]
ext_validation_outcomes[(Validation outcomes<br>external table)]
end
subgraph second_bq[ ]
bq_label2[BigQuery<br>Staging, mart, etc. datasets]
bq_table_example[(staging_*.table_names)]
bq_table_example2[(mart_*.table_names)]
bq_table_example3[(etc.)]
end
subgraph first_gcs[ ]
gcs_label1[Google Cloud Storage:<br>Raw bucket]
raw_gcs[Raw data]
raw_outcomes_gcs[Scrape outcomes file]
end
subgraph second_gcs[ ]
gcs_label2[Google Cloud Storage:<br>Parsed bucket]
parse_gcs[Parsed data]
parse_outcomes_gcs[Parse outcomes file]
end
subgraph validated_gcs[ ]
gcs_label3[<i>GTFS ONLY</i> <br> Google Cloud Storage:<br>Validation bucket]
validation_gcs[Validations data]
validation_outcomes_gcs[Validation outcomes file]
end
raw_data -- read by--> airflow_scrape
airflow_scrape -- writes data to--> raw_gcs
airflow_scrape -- writes operation outcomes to--> raw_outcomes_gcs
raw_gcs -- read by--> airflow_parse
airflow_parse -- writes data to--> parse_gcs
airflow_parse -- writes operation outcomes to--> parse_outcomes_gcs
raw_gcs -- if GTFS then read by--> airflow_validate
airflow_validate -- writes data to--> validation_gcs
airflow_validate -- writes operation outcomes to--> validation_outcomes_gcs
parse_outcomes_gcs -- external tables defined by--> airflow_external_tables
parse_gcs -- external tables defined by--> airflow_external_tables
raw_outcomes_gcs -- external tables defined by--> airflow_external_tables
validation_outcomes_gcs -- external tables defined by--> airflow_external_tables
validation_gcs -- external tables defined by--> airflow_external_tables
airflow_external_tables -- defines--> ext_raw_outcomes
airflow_external_tables -- defines--> ext_parse_data
airflow_external_tables -- defines--> ext_parse_outcomes
airflow_external_tables -- defines--> ext_validation_outcomes
airflow_external_tables -- defines--> ext_validations
airflow_dbt -- runs--> dbt
ext_raw_outcomes -- read as source by-->dbt
ext_parse_data -- read as source by-->dbt
ext_parse_outcomes -- read as source by-->dbt
ext_validations -- read as source by-->dbt
ext_validation_outcomes -- read as source by-->dbt
dbt -- orchestrates transformations to create-->second_bq
classDef default fill:white, color:black, stroke:black, stroke-width:1px
classDef gcs_group_boxstyle fill:lightblue, color:black, stroke:black, stroke-width:1px
classDef bq_group_boxstyle fill:lightgreen, color:black, stroke:black, stroke-width:1px
classDef raw_datastyle fill:yellow, color:black, stroke:black, stroke-width:1px
classDef dbtstyle fill:darkgreen, color:white, stroke:black, stroke-width:1px
classDef group_labelstyle fill:lightgray, color:black, stroke-width:0px
class raw_data raw_datastyle
class dbt dbtstyle
class first_gcs,second_gcs,validated_gcs gcs_group_boxstyle
class first_bq,second_bq bq_group_boxstyle
class gcs_label1,gcs_label2,gcs_label3,bq_label1,bq_label2 group_labelstyle
```
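
For orientation, an Airflow DAG wiring these steps together might look like the minimal sketch below. This is an illustration under stated assumptions, not a DAG from this repo: the DAG id, task ids, callables, and schedule are placeholders, and the real definitions live under [`airflow/`](https://github.com/cal-itp/data-infra/tree/main/airflow) in data-infra.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


# Placeholder callables standing in for the real scrape/parse/validate/
# external-table/dbt logic.
def scrape(): ...
def parse(): ...
def validate(): ...
def create_external_tables(): ...
def run_dbt(): ...


with DAG(
    dag_id="example_ingest",  # hypothetical DAG id
    schedule="@hourly",
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    scrape_task = PythonOperator(task_id="scrape", python_callable=scrape)
    parse_task = PythonOperator(task_id="parse", python_callable=parse)
    validate_task = PythonOperator(task_id="validate", python_callable=validate)
    external_tables_task = PythonOperator(
        task_id="create_external_tables", python_callable=create_external_tables
    )
    dbt_task = PythonOperator(task_id="run_dbt", python_callable=run_dbt)

    # Mirrors the diagram: parsing and (for GTFS) validation both read the
    # raw scrape, and their outputs feed the external tables ahead of dbt.
    scrape_task >> [parse_task, validate_task]
    [parse_task, validate_task] >> external_tables_task >> dbt_task
```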