From 7a17a5ee38890701c2b4d692c24e8dceb844bf84 Mon Sep 17 00:00:00 2001 From: Soren Spicknall Date: Thu, 24 Aug 2023 10:42:28 -0500 Subject: [PATCH] mdformat with --number flag --- .github/ISSUE_TEMPLATE/bug_report.md | 1 - .github/ISSUE_TEMPLATE/new-team-member.md | 8 +- .github/ISSUE_TEMPLATE/user-story.md | 8 +- .github/README.md | 21 +- .github/pull_request_template.md | 9 +- .holo/README.md | 4 +- CONTRIBUTING.md | 24 +- README.md | 31 +- airflow/README.md | 4 +- airflow/dags/transform_warehouse/README.md | 9 +- apps/maps/README.md | 1 + ci/README.md | 1 + docs/airflow/dags-maintenance.md | 39 +- .../01-data-analysis-intro.md | 669 +++++++++--------- .../02-data-analysis-intermediate.md | 389 +++++----- .../03-data-management.md | 40 +- docs/analytics_new_analysts/04-notebooks.md | 54 +- .../05-spatial-analysis-basics.md | 219 +++--- .../06-spatial-analysis-intro.md | 472 ++++++------ .../07-spatial-analysis-intermediate.md | 425 ++++++----- .../08-spatial-analysis-advanced.md | 117 +-- docs/analytics_new_analysts/overview.md | 45 +- docs/analytics_onboarding/overview.md | 47 +- docs/analytics_tools/bi_dashboards.md | 50 +- docs/analytics_tools/data_catalogs.md | 26 +- docs/analytics_tools/github_setup.md | 43 +- docs/analytics_tools/jupyterhub.md | 45 +- docs/analytics_tools/knowledge_sharing.md | 138 ++-- .../local_oracle_db_connections.md | 17 +- docs/analytics_tools/overview.md | 21 +- docs/analytics_tools/python_libraries.md | 39 +- docs/analytics_tools/rt_analysis.md | 198 +++--- docs/analytics_tools/saving_code.md | 142 ++-- docs/analytics_tools/scripts.md | 86 ++- docs/analytics_tools/storing_data.md | 32 +- docs/analytics_tools/tools_quick_links.md | 25 +- docs/analytics_welcome/how_we_work.md | 39 +- docs/analytics_welcome/overview.md | 19 +- docs/analytics_welcome/what_is_calitp.md | 5 + docs/architecture/architecture_overview.md | 35 +- docs/architecture/data.md | 8 +- docs/architecture/services.md | 15 +- docs/contribute/content_types.md | 56 +- docs/contribute/contribute-best-practices.md | 31 +- docs/contribute/overview.md | 7 +- docs/contribute/submitting_changes.md | 113 +-- docs/intro.md | 6 +- docs/kubernetes/JupyterHub.md | 6 +- docs/kubernetes/README.md | 22 +- docs/publishing/overview.md | 20 +- .../sections/1_publishing_principles.md | 4 + docs/publishing/sections/2_static_files.md | 2 + docs/publishing/sections/3_github_pages.md | 19 +- .../sections/4_analytics_portfolio_site.md | 224 +++--- .../sections/5_notebooks_styling.md | 168 +++-- docs/publishing/sections/6_metabase.md | 1 + docs/publishing/sections/7_gcs.md | 1 + docs/publishing/sections/8_ckan.md | 4 +- docs/publishing/sections/9_geoportal.md | 12 +- docs/transit_database/transitdatabase.md | 33 +- docs/warehouse/adding_oneoff_data.md | 19 +- docs/warehouse/developing_dbt_models.md | 206 +++--- docs/warehouse/navigating_dbt_docs.md | 33 +- docs/warehouse/overview.md | 11 +- docs/warehouse/warehouse_starter_kit.md | 91 ++- docs/warehouse/what_is_agency.md | 11 +- docs/warehouse/what_is_gtfs.md | 13 +- images/dask/README.md | 1 + images/jupyter-singleuser/README.md | 1 + jobs/gtfs-rt-parser-v2/README.md | 2 + packages/calitp-data-infra/README.md | 1 + runbooks/data/deprecation-stored-files.md | 17 +- runbooks/data/deprecation-warehouse-models.md | 14 +- runbooks/infrastructure/disk-space.md | 13 +- .../rotating-littlepay-aws-keys.md | 9 +- runbooks/pipeline/sentry-triage.md | 49 +- services/gtfs-rt-archiver-v3/README.md | 46 +- warehouse/README.md | 60 +- warehouse/scripts/templates/ci_report.md | 13 +- 
79 files changed, 2655 insertions(+), 2304 deletions(-) diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md index 8a2722e6ba..26ae2cd85a 100644 --- a/.github/ISSUE_TEMPLATE/bug_report.md +++ b/.github/ISSUE_TEMPLATE/bug_report.md @@ -4,7 +4,6 @@ about: Create a report to help us improve title: 'Bug: ' labels: '' assignees: '' - --- **Describe the bug** diff --git a/.github/ISSUE_TEMPLATE/new-team-member.md b/.github/ISSUE_TEMPLATE/new-team-member.md index e6b5bbf2ff..9b446cdd97 100644 --- a/.github/ISSUE_TEMPLATE/new-team-member.md +++ b/.github/ISSUE_TEMPLATE/new-team-member.md @@ -2,9 +2,9 @@ name: New Team Member about: Kick off the onboarding process. title: New Team Member - [Name] -labels: 'new team member' - +labels: new team member --- + Name: Role: Reports to: @@ -14,9 +14,11 @@ GitHub Username: Slack Username: **Set-up:** + - [ ] Technical Onboarding call scheduled - [ ] Added to tools: + - [ ] Github - [ ] Organization: Cal-ITP - [ ] Team: warehouse-users and warehouse-contributors @@ -26,12 +28,14 @@ Slack Username: - [ ] Slack - [ ] Added to meetings: + - [ ] Analyst Round Tables (Tuesday & Thursday) - [ ] Lunch n' Learn - [ ] All-hands - [ ] Data & Digital Services email list - [ ] Added to Slack channels: + - [ ] #data-analyses - [ ] #data-office-hours - [ ] #data diff --git a/.github/ISSUE_TEMPLATE/user-story.md b/.github/ISSUE_TEMPLATE/user-story.md index c1ea2add0e..0c18f13727 100644 --- a/.github/ISSUE_TEMPLATE/user-story.md +++ b/.github/ISSUE_TEMPLATE/user-story.md @@ -4,20 +4,16 @@ about: Submit a user story or feature request title: '' labels: '' assignees: '' - --- ## User story / feature request -_Please describe your need, outlining the key users, the feature being requested, and the goal that that the feature will facilitate. For example: **As a [user or stakeholder type], I want [software feature] so that [some business value]**_ - - +_Please describe your need, outlining the key users, the feature being requested, and the goal that that the feature will facilitate. For example: **As a \[user or stakeholder type\], I want \[software feature\] so that \[some business value\]**_ ### Acceptance Criteria _Please enter something that can be verified to show that this user story is satisfied. For example: **I can join table X with table Y.** or **Column A appears in table Z in Metabase.**_ - - ### Notes + _Please enter any additional information that will facilitate the completion of this ticket. For example: Are there any constraints not mentioned above? Are there any alternatives you have considered?_ diff --git a/.github/README.md b/.github/README.md index ba831f628b..02a55c4f77 100644 --- a/.github/README.md +++ b/.github/README.md @@ -13,32 +13,33 @@ While we're using GCP Composer, "deployment" of Airflow consists of two parts: This workflow builds a static website from the Svelte app and deploys it to Netlify. -## build-*.yml workflows +## build-\*.yml workflows Workflows prefixed with `build-` generally lint, test, and (usually) publish either a Python package or a Docker image. -## service-*.yml workflows +## service-\*.yml workflows Workflows prefixed with `service-` deal with Kubernetes deployments. -* `service-release-candidate.yml` creates candidate branches, using [hologit](https://github.com/JarvusInnovations/hologit) to bring in external Helm charts and remove irrelevant (i.e. 
non-infra) code -* `service-release-diff.yml` renders kubectl diffs on PRs targeting release branches -* `service-release-channel.yml` deploys to a given channel (i.e. environment) on updates to a release branch +- `service-release-candidate.yml` creates candidate branches, using [hologit](https://github.com/JarvusInnovations/hologit) to bring in external Helm charts and remove irrelevant (i.e. non-infra) code +- `service-release-diff.yml` renders kubectl diffs on PRs targeting release branches +- `service-release-channel.yml` deploys to a given channel (i.e. environment) on updates to a release branch Some of these workflows use hologit or invoke. See the READMEs in [.holo](../.holo) and [ci](../ci) for documentation regarding hologit and invoke, respectively. ## GitOps + The workflows described above also define their triggers. In general, developer workflows should follow these steps. 1. Check out a feature branch 2. Put up a PR for that feature branch, targeting `main` - * `service-release-candidate` will run and create a remote branch named `candidate/<+ if you want to run children>"}` using [dbt selection syntax](https://docs.getdbt.com/reference/node-selection/syntax#specifying-resources)) to re-run a specific individual model's lineage. +- Because the tasks in this DAG involve running a large volume of SQL transformations, they risk triggering data quotas if the DAG is run multiple times in a single day. + +- This task can be run with a `dbt_select` statement provided (use the `Trigger DAG w/ config` button (option under the "play" icon in the upper right corner when looking at an individual DAG) in the Airflow UI and provide a JSON configuration like `{"dbt_select": "<+ if you want to run parents><+ if you want to run children>"}` using [dbt selection syntax](https://docs.getdbt.com/reference/node-selection/syntax#specifying-resources)) to re-run a specific individual model's lineage. diff --git a/apps/maps/README.md b/apps/maps/README.md index e4e28776e3..700719e9bb 100644 --- a/apps/maps/README.md +++ b/apps/maps/README.md @@ -38,6 +38,7 @@ Netlify sites deployed via `netlify deploy ...` with `--alias=some-alias` and/or The site is deployed to production on merges to main, as defined in [../../.github/workflows/deploy-apps-maps.yml](../../.github/workflows/deploy-apps-maps.yml). You may also deploy manually with the following: + ```bash (from the apps/maps folder) npm run build diff --git a/ci/README.md b/ci/README.md index cf56ef1d00..d44744e544 100644 --- a/ci/README.md +++ b/ci/README.md @@ -5,6 +5,7 @@ a deployment named `archiver` is configured in [the prod channel](./channels/pro by `invoke` (see below) calling `kubectl` commands. ## invoke (aka pyinvoke) + [invoke](https://docs.pyinvoke.org/en/stable/) is a Python framework for executing subprocesses and building a CLI application. The tasks are defined in `tasks.py` and configuration in `invoke.yaml`; config values under the top-level `calitp` are specific to our defined tasks. diff --git a/docs/airflow/dags-maintenance.md b/docs/airflow/dags-maintenance.md index b2a4549ef6..fefd87cc2c 100644 --- a/docs/airflow/dags-maintenance.md +++ b/docs/airflow/dags-maintenance.md @@ -1,4 +1,5 @@ (dags-maintenance)= + # Airflow Operational Considerations We use [Airflow](https://airflow.apache.org/) to orchestrate our data ingest processes. This page describes how to handle cases where an Airflow [DAG task](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/tasks.html) fails. 
For general information about Airflow development, see the [Airflow README in the data-infra GitHub repo](https://github.com/cal-itp/data-infra/blob/main/airflow/README.md). @@ -16,25 +17,26 @@ Below are considerations to take into account when re-running or clearing DAGs t #### Now vs. data interval processing There are roughly two types of Airflow DAGs in our system: -* "Now" DAGs - mostly for executing code on a schedule (often scraping current data, or a fancy cron job), NOT orchestrating distributed processing of existing data - * **When these DAGs fail, and you'd like to re-run them, you should execute a new manual run rather than clearing a historical run.** - * Only the actual execution time matters if relevant (usually for timestamping data or artifacts) - * Generally safe but not useful to execute multiple times simultaneously - * There is no concept of backfilling via these DAGs -* "Data interval processing" DAGs - these DAGs orchestrate processing of previously-captured data, or data than can be retrieved in a timestamped manner - * **When these DAGs fail, you should clear the historical task instances that failed.** (Generally, these DAGs are expected to be 100% successful.) - * **Failures in these jobs may cause data to be missing from the data warehouse in unexpected ways:** if a parse job fails, then the data that should have been processed will not be available in the warehouse. Sometimes this is resolved easily by clearing the failed parse job so that the data will be picked up in the next warehouse run (orchestrated by [the `transform_warehouse` DAG](https://github.com/cal-itp/data-infra/blob/main/airflow/dags/transform_warehouse/)). However, because the data warehouse uses [incremental models](https://docs.getdbt.com/docs/build/incremental-models), it's possible that if the failed job is not cleared quickly enough the missing data will not be picked up because the incremental lookback period will have passed. - * Relies heavily on the [`execution_date` or `data_interval_start/end`](https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html) concepts - * May not be entirely idempotent though we try; for example, validating RT data depends on Schedule data which may be late-arriving - * Backfilling can generally be performed by clearing past task instances and letting them re-run - * We try to avoid `depends_on_past` DAGs, so parallelization is possible during backfills + +- "Now" DAGs - mostly for executing code on a schedule (often scraping current data, or a fancy cron job), NOT orchestrating distributed processing of existing data + - **When these DAGs fail, and you'd like to re-run them, you should execute a new manual run rather than clearing a historical run.** + - Only the actual execution time matters if relevant (usually for timestamping data or artifacts) + - Generally safe but not useful to execute multiple times simultaneously + - There is no concept of backfilling via these DAGs +- "Data interval processing" DAGs - these DAGs orchestrate processing of previously-captured data, or data than can be retrieved in a timestamped manner + - **When these DAGs fail, you should clear the historical task instances that failed.** (Generally, these DAGs are expected to be 100% successful.) + - **Failures in these jobs may cause data to be missing from the data warehouse in unexpected ways:** if a parse job fails, then the data that should have been processed will not be available in the warehouse. 
Sometimes this is resolved easily by clearing the failed parse job so that the data will be picked up in the next warehouse run (orchestrated by [the `transform_warehouse` DAG](https://github.com/cal-itp/data-infra/blob/main/airflow/dags/transform_warehouse/)). However, because the data warehouse uses [incremental models](https://docs.getdbt.com/docs/build/incremental-models), it's possible that if the failed job is not cleared quickly enough the missing data will not be picked up because the incremental lookback period will have passed. + - Relies heavily on the [`execution_date` or `data_interval_start/end`](https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html) concepts + - May not be entirely idempotent though we try; for example, validating RT data depends on Schedule data which may be late-arriving + - Backfilling can generally be performed by clearing past task instances and letting them re-run + - We try to avoid `depends_on_past` DAGs, so parallelization is possible during backfills #### Scheduled vs. ad-hoc Additionally, DAGs can either be scheduled or ad-hoc: -* **Scheduled** DAGs are designed to be run regularly, based on the [cron schedule](https://airflow.apache.org/docs/apache-airflow/1.10.1/scheduler.html) set in the DAG's `METADATA.yml` file. All "data interval processing" DAGs will be scheduled. -* **Ad-hoc** DAGs are designed to be run as one-offs, to automate a workflow that is risky or difficult for an individual user to run locally. These will have `schedule_interval: None` in their `METADATA.yml` files. Only "now" DAGs can be ad-hoc. +- **Scheduled** DAGs are designed to be run regularly, based on the [cron schedule](https://airflow.apache.org/docs/apache-airflow/1.10.1/scheduler.html) set in the DAG's `METADATA.yml` file. All "data interval processing" DAGs will be scheduled. +- **Ad-hoc** DAGs are designed to be run as one-offs, to automate a workflow that is risky or difficult for an individual user to run locally. These will have `schedule_interval: None` in their `METADATA.yml` files. Only "now" DAGs can be ad-hoc. ### How to clear a DAG or DAG task @@ -46,10 +48,10 @@ Failures can be cleared (re-run) via the Airflow user interface ([accessible via The following DAGs may still be listed in the Airflow UI even though they are **deprecated or indefinitely paused**. They never need to be re-run. (They show up in the UI because the Airflow database has historical DAG/task entries even though the code has been deleted.) -* `amplitude_benefits` -* `check_data_freshness` -* `load-sentry-rtfetchexception-events` -* `unzip_and_validate_gtfs_schedule` +- `amplitude_benefits` +- `check_data_freshness` +- `load-sentry-rtfetchexception-events` +- `unzip_and_validate_gtfs_schedule` ## `PodOperators` @@ -60,6 +62,7 @@ When restarting a failed `PodOperator` run, check the logs before restarting. If From time-to-time some DAGs may need to be re-ran in order to populate new data. 
Subject to the considerations outlined above, backfilling can be performed by clearing historical runs in the web interface, or via the CLI: + ```shell gcloud composer environments run calitp-airflow-prod --location=us-west2 backfill -- --start_date 2021-04-18 --end_date 2021-11-03 -x --reset_dagruns -y -t "gtfs_schedule_history_load" -i gtfs_loader ``` diff --git a/docs/analytics_new_analysts/01-data-analysis-intro.md b/docs/analytics_new_analysts/01-data-analysis-intro.md index b3180f52ad..301d1c97c9 100644 --- a/docs/analytics_new_analysts/01-data-analysis-intro.md +++ b/docs/analytics_new_analysts/01-data-analysis-intro.md @@ -1,335 +1,334 @@ -(pandas-intro)= -# Data Analysis: Intro - -Below are Python tutorials covering the basics of data cleaning and wrangling. [Chris Albon's guide](https://chrisalbon.com/#python) is particularly helpful. Rather than reinventing the wheel, this tutorial instead highlights specific methods and operations that might make your life easier as a data analyst. - -* [Import and export data in Python](#import-and-export-data-in-python) -* [Merge tabular and geospatial data](#merge-tabular-and-geospatial-data) -* [Functions](#functions) -* [Grouping](#grouping) -* [Aggregating](#aggregating) -* [Export aggregated output](#export-aggregated-output) - -## Getting Started - -``` -import numpy as np -import pandas as pd -import geopandas as gpd -``` - -## Import and Export Data in Python -### **Local files** -We import a tabular dataframe `my_csv.csv` and an Excel spreadsheet `my_excel.xlsx`. -``` -df = pd.read_csv('./folder/my_csv.csv') - -df = pd.read_excel('./folder/my_excel.xlsx', sheet_name = 'Sheet1') -``` - -### **GCS** -The data we use outside of the warehouse can be stored in GCS buckets. - -``` -# Read from GCS -df = pd.read_csv('gs://calitp-analytics-data/data-analyses/bucket-name/df_csv.csv') - -#Write to GCS -df.to_csv('gs://calitp-analytics-data/data-analyses/bucket-name/df_csv.csv') -``` - -Refer to the [Data Management best practices](data-management-page) and [Basics of Working with Geospatial Data](geo-intro) for additional information on importing various file types. - - -## Merge Tabular and Geospatial Data -Merging data from multiple sources creates one large dataframe (df) to perform data analysis. Let's say there are 3 sources of data that need to be merged: - -Dataframe #1: `council_population` (tabular) - -| CD | Council_Member | Population | -| ---| ---- | --- | -| 1 | Leslie Knope | 1,500 | -| 2 | Jeremy Jamm | 2,000 -| 3 | Douglass Howser | 2,250 - - -Dataframe #2: `paunch_locations` (geospatial) - -| Store | City | Sales_millions | CD | Geometry | -| ---| ---- | --- | --- | --- | -| 1 | Pawnee | $5 | 1| (x1,y1) -| 2 | Pawnee | $2.5 | 2 | (x2, y2) -| 3 | Pawnee | $2.5 | 3 | (x3, y3) -| 4 | Eagleton | $2 | | (x4, y4) -| 5 | Pawnee | $4 | 1 | (x5, y5) -| 6 | Pawnee | $6 | 2 | (x6, y6) -| 7 | Indianapolis | $7 | | (x7, y7) - - -If `paunch_locations` did not come with the council district information, use a spatial join to attach the council district within which the store falls. More on spatial joins [here](geo-intro). - - -Dataframe #3: `council_boundaries` (geospatial) - -| District | Geometry -| ---| ---- | -| 1 | polygon -| 2 | polygon -| 3 | polygon - - -First, merge `paunch_locations` with `council_population` using the `CD` column, which they have in common. 
- -``` -merge1 = pd.merge(paunch_locations, council_population, on = 'CD', - how = 'inner', validate = 'm:1') - -# m:1 many-to-1 merge means that CD appears multiple times in -# paunch_locations, but only once in council_population. -``` - -Next, merge `merge1` and `council_boundaries`. Columns don't have to have the same names to be matched on, as long as they hold the same values. - -``` -merge2 = pd.merge(merge1, council_boundaries, left_on = 'CD', - right_on = 'District', how = 'left', validate = 'm:1') -``` - -Here are some things to know about `merge2`: -* `merge2` is a geodataframe (gdf) because the ***base,*** `paunch_locations`, is a gdf. -* Pandas allows the merge to take place even if the `Geometry` column appears in both dfs. The resulting df contains 2 renamed `Geometry` columns; `Geometry_x` corresponds to the left df `Geometry` and `Geometry_y` for the right df. -* Geopandas still designates a geometry to use. To see what which geometry column is set, type `merge2.geometry.name`. To change the geometry to a different column, type `merge2 = merge2.set_geometry('new_column')`. - - -`merge2` looks like this: - -| Store | City | Sales_millions | CD | Geometry_x | Council_Member | Population | Geometry_y -| ---| ---- | --- | --- | --- | ---| ---| ---| -| 1 | Pawnee | $5 | 1| (x1,y1) | Leslie Knope | 1,500 | polygon -| 2 | Pawnee | $2.5 | 2 | (x2, y2) | Jeremy Jamm | 2,000 | polygon -| 3 | Pawnee | $2.5 | 3 | (x3, y3) | Douglass Howser | 2,250 | polygon -| 5 | Pawnee | $4 | 1 | (x5, y5) | Leslie Knope | 1,500 | polygon -| 6 | Pawnee | $6 | 2 | (x6, y6) | Jeremy Jamm | 2,000 | polygon - - -## Functions -A function is a set of instructions to *do something*. It can be as simple as changing values in a column or as complicated as a series of steps to clean, group, aggregate, and plot the data. - -### **Lambda Functions** -Lambda functions are quick and dirty. You don't even have to name the function! These are used for one-off functions that you don't need to save for repeated use within the script or notebook. You can use it for any simple function (e.g., if-else statements, etc) you want to apply to all rows of the df. - - -`df`: Andy Dwyer's band names and number of songs played under that name - -| Band | Songs -| ---| ---- | -| Mouse Rat | 30 -| Scarecrow Boat | 15 -| Jet Black Pope | 4 -| Nothing Rhymes with Orange | 6 - -### **If-Else Statements** - -``` -# Create column called duration. If Songs > 10, duration is 'long'. -# Otherwise, duration is 'short'. -df['duration'] = df.apply(lambda row: 'long' if row.Songs > 10 - else 'short', axis = 1) - -# Create column called famous. If Band is 'Mouse Rat', famous is 1, -# otherwise 0. 
-df['famous'] = df.apply(lambda row: 1 if row.Band == 'Mouse Rat' - else 0, axis = 1) - -# An equivalent full function would be: -def tag_famous(row): - if row.Band == 'Mouse Rat': - return 1 - else: - return 0 - -df['famous'] = df.apply(tag_famous, axis = 1) - -df -``` - -| Band | Songs | duration | famous | -| ---| ---- | --- | --- | -| Mouse Rat | 30 | long | 1 | -| Scarecrow Boat | 15 | long | 0 -| Jet Black Pope | 4 | short | 0 -| Nothing Rhymes with Orange | 6 | short | 0 - - -### **Other Lambda Functions** - -``` -# Split the band name at the spaces -# [1] means we want to extract the second word -# [0:2] means we want to start at the first character -# and stop at (but not include) the 3rd character -df['word2_start'] = df.apply(lambda x: - x.Band.split(" ")[1][0:2], axis = 1) -df -``` - -| Band | Songs | word2_start | -| ---| ---- | --- | -| Mouse Rat | 30 | Ra | -| Scarecrow Boat | 15 | Bo -| Jet Black Pope | 4 | Po -| Nothing Rhymes with Orange | 6 | Or - - -### **Apply over Dataframe** -You should use a full function when a function is too complicated to be a lambda function. These functions are defined by a name and are called upon to operate on the rows of a dataframe. You can also write more complex functions that bundle together all the steps (including nesting more functions) you want to execute over the dataframe. - -`df.apply` is one common usage of a function. - -``` -def years_active(row): - if row.Band == 'Mouse Rat': - return '2009-2014' - elif row.Band == 'Scarecrow Boat': - return '2009' - elif (row.Band == 'Jet Black Pope') or (row.Band == - 'Nothing Rhymes with Orange'): - return '2008' - -df['Active'] = df.apply(years_active, axis = 1) -df -``` - -| Band | Songs | Active | -| ---| ---- | --- | -| Mouse Rat | 30 | 2009-2014 | -| Scarecrow Boat | 15 | 2009 -| Jet Black Pope | 4 | 2008 -| Nothing Rhymes with Orange | 6 | 2008 - - - -## Grouping -Sometimes it's necessary to create a new column to group together certain values of a column. Here are two ways to accomplish this: - -Method #1: Write a function using if-else statement and apply it using a lambda function. - -``` -# The function is called elected_year, and it operates on every row. -def elected_year(row): - # For each row, if Council_Member says 'Leslie Knope', then return 2012 - # as the value. - if row.Council_Member == 'Leslie Knope': - return 2012 - elif row.Council_Member == 'Jeremy Jamm': - return 2008 - elif row.Council_Member == 'Douglass Howser': - return 2006 - -# Use a lambda function to apply the elected_year function to all rows in the df. -# Don't forget axis = 1 (apply function to all rows)! -council_population['Elected'] = council_population.apply(lambda row: - elected_year(row), axis = 1) - -council_population -``` - -| CD | Council_Member | Population | Elected -| ---| ---- | --- | --- | -| 1 | Leslie Knope | 1,500 | 2012 -| 2 | Jeremy Jamm | 2,000 | 2008 -| 3 | Douglass Howser | 2,250 | 2006 - - -Method #2: Loop over every value, fill in the new column value, then attach that new column. - -``` -# Create a list to store the new column -sales_group = [] - -for row in paunch_locations['Sales_millions']: - # If sales are more than $3M, but less than $5M, tag as moderate. - if (row >= 3) & (row <= 5) : - sales_group.append('moderate') - # If sales are more than $5M, tag as high. - elif row >=5: - sales_group.append('high') - # Anything else, aka, if sales are less than $3M, tag as low. 
- else: - sales_group.append('low') - -paunch_locations['sales_group'] = sales_group - -paunch_locations -``` - -| Store | City | Sales_millions | CD | Geometry | sales_group -| ---| ---- | --- | --- | --- | --- | -| 1 | Pawnee | $5 | 1| (x1,y1) | moderate -| 2 | Pawnee | $2.5 | 2 | (x2, y2) | low -| 3 | Pawnee | $2.5 | 3 | (x3, y3) | low -| 4 | Eagleton | $2 | | (x4, y4) | low -| 5 | Pawnee | $4 | 1 | (x5, y5) | moderate -| 6 | Pawnee | $6 | 2 | (x6, y6) | high -| 7 | Indianapolis | $7 | | (x7, y7) | high - - -## Aggregating -One of the most common form of summary statistics is aggregating by groups. In Excel, it's called a pivot table. In ArcGIS, it's doing a dissolve and calculating summary statistics. There are two ways to do it in Python: `groupby` and `agg` or `pivot_table`. - -To answer the question of how many Paunch Burger locations there are per Council District and the sales generated per resident, - -``` -# Method #1: groupby and agg -pivot = merge2.groupby(['CD', 'Geometry_y']).agg({'Sales_millions': 'sum', - 'Store': 'count', 'Population': 'mean'}).reset_index() - -# Method #2: pivot table -pivot = merge2.pivot_table(index= ['CD', 'Geometry_y'], - values = ['Sales_millions', 'Store', 'Population'], - aggfunc= {'Sales_millions': 'sum', 'Store': 'count', - 'Population': 'mean'}).reset_index() - - # to only find one type of summary statistic, use aggfunc = 'sum' - -# reset_index() will compress the headers of the table, forcing them to appear -# in 1 row rather than 2 separate rows -``` - -`pivot` looks like this: - -| CD | Geometry_y | Sales_millions | Store | Council_Member | Population -| ---| ---- | --- | --- | --- | ---| -| 1 | polygon | $9 | 2 | Leslie Knope | 1,500 -| 2 | polygon | $8.5 | 2 | Jeremy Jamm | 2,000 -| 3 | polygon | $2.5 | 1 | Douglass Howser | 2,250 - - -## Export Aggregated Output -Python can do most of the heavy lifting for data cleaning, transformations, and general wrangling. But, for charts or tables, it might be preferable to finish in Excel so that visualizations conform to the corporate style guide. - -Dataframes can be exported into Excel and written into multiple sheets. - -``` -import xlsxwriter - -# initiate a writer -writer = pd.ExcelWriter('../outputs/filename.xlsx', engine='xlsxwriter') - -council_population.to_excel(writer, sheet_name = 'council_pop') -paunch_locations.to_excel(writer, sheet_name = 'paunch_locations') -merge2.to_excel(writer, sheet_name = 'merged_data') -pivot.to_excel(writer, sheet_name = 'pivot') - -# Close the Pandas Excel writer and output the Excel file. -writer.save() -``` - -Geodataframes can be exported as a shapefile or GeoJSON to visualize in ArcGIS/QGIS. -``` -gdf.to_file(driver = 'ESRI Shapefile', filename = '../folder/my_shapefile.shp' ) - -gdf.to_file(driver = 'GeoJSON', filename = '../folder/my_geojson.geojson') -``` - -
+(pandas-intro)= + +# Data Analysis: Intro + +Below are Python tutorials covering the basics of data cleaning and wrangling. [Chris Albon's guide](https://chrisalbon.com/#python) is particularly helpful. Rather than reinventing the wheel, this tutorial instead highlights specific methods and operations that might make your life easier as a data analyst. + +- [Import and export data in Python](#import-and-export-data-in-python) +- [Merge tabular and geospatial data](#merge-tabular-and-geospatial-data) +- [Functions](#functions) +- [Grouping](#grouping) +- [Aggregating](#aggregating) +- [Export aggregated output](#export-aggregated-output) + +## Getting Started + +``` +import numpy as np +import pandas as pd +import geopandas as gpd +``` + +## Import and Export Data in Python + +### **Local files** + +We import a tabular dataframe `my_csv.csv` and an Excel spreadsheet `my_excel.xlsx`. + +``` +df = pd.read_csv('./folder/my_csv.csv') + +df = pd.read_excel('./folder/my_excel.xlsx', sheet_name = 'Sheet1') +``` + +### **GCS** + +The data we use outside of the warehouse can be stored in GCS buckets. + +``` +# Read from GCS +df = pd.read_csv('gs://calitp-analytics-data/data-analyses/bucket-name/df_csv.csv') + +#Write to GCS +df.to_csv('gs://calitp-analytics-data/data-analyses/bucket-name/df_csv.csv') +``` + +Refer to the [Data Management best practices](data-management-page) and [Basics of Working with Geospatial Data](geo-intro) for additional information on importing various file types. + +## Merge Tabular and Geospatial Data + +Merging data from multiple sources creates one large dataframe (df) to perform data analysis. Let's say there are 3 sources of data that need to be merged: + +Dataframe #1: `council_population` (tabular) + +| CD | Council_Member | Population | +| --- | --------------- | ---------- | +| 1 | Leslie Knope | 1,500 | +| 2 | Jeremy Jamm | 2,000 | +| 3 | Douglass Howser | 2,250 | + +Dataframe #2: `paunch_locations` (geospatial) + +| Store | City | Sales_millions | CD | Geometry | +| ----- | ------------ | -------------- | --- | -------- | +| 1 | Pawnee | $5 | 1 | (x1,y1) | +| 2 | Pawnee | $2.5 | 2 | (x2, y2) | +| 3 | Pawnee | $2.5 | 3 | (x3, y3) | +| 4 | Eagleton | $2 | | (x4, y4) | +| 5 | Pawnee | $4 | 1 | (x5, y5) | +| 6 | Pawnee | $6 | 2 | (x6, y6) | +| 7 | Indianapolis | $7 | | (x7, y7) | + +If `paunch_locations` did not come with the council district information, use a spatial join to attach the council district within which the store falls. More on spatial joins [here](geo-intro). + +Dataframe #3: `council_boundaries` (geospatial) + +| District | Geometry | +| -------- | -------- | +| 1 | polygon | +| 2 | polygon | +| 3 | polygon | + +First, merge `paunch_locations` with `council_population` using the `CD` column, which they have in common. + +``` +merge1 = pd.merge(paunch_locations, council_population, on = 'CD', + how = 'inner', validate = 'm:1') + +# m:1 many-to-1 merge means that CD appears multiple times in +# paunch_locations, but only once in council_population. +``` + +Next, merge `merge1` and `council_boundaries`. Columns don't have to have the same names to be matched on, as long as they hold the same values. + +``` +merge2 = pd.merge(merge1, council_boundaries, left_on = 'CD', + right_on = 'District', how = 'left', validate = 'm:1') +``` + +Here are some things to know about `merge2`: + +- `merge2` is a geodataframe (gdf) because the ***base,*** `paunch_locations`, is a gdf. 
+- Pandas allows the merge to take place even if the `Geometry` column appears in both dfs. The resulting df contains 2 renamed `Geometry` columns; `Geometry_x` corresponds to the left df `Geometry` and `Geometry_y` for the right df. +- Geopandas still designates a geometry to use. To see what which geometry column is set, type `merge2.geometry.name`. To change the geometry to a different column, type `merge2 = merge2.set_geometry('new_column')`. + +`merge2` looks like this: + +| Store | City | Sales_millions | CD | Geometry_x | Council_Member | Population | Geometry_y | +| ----- | ------ | -------------- | --- | ---------- | --------------- | ---------- | ---------- | +| 1 | Pawnee | $5 | 1 | (x1,y1) | Leslie Knope | 1,500 | polygon | +| 2 | Pawnee | $2.5 | 2 | (x2, y2) | Jeremy Jamm | 2,000 | polygon | +| 3 | Pawnee | $2.5 | 3 | (x3, y3) | Douglass Howser | 2,250 | polygon | +| 5 | Pawnee | $4 | 1 | (x5, y5) | Leslie Knope | 1,500 | polygon | +| 6 | Pawnee | $6 | 2 | (x6, y6) | Jeremy Jamm | 2,000 | polygon | + +## Functions + +A function is a set of instructions to *do something*. It can be as simple as changing values in a column or as complicated as a series of steps to clean, group, aggregate, and plot the data. + +### **Lambda Functions** + +Lambda functions are quick and dirty. You don't even have to name the function! These are used for one-off functions that you don't need to save for repeated use within the script or notebook. You can use it for any simple function (e.g., if-else statements, etc) you want to apply to all rows of the df. + +`df`: Andy Dwyer's band names and number of songs played under that name + +| Band | Songs | +| -------------------------- | ----- | +| Mouse Rat | 30 | +| Scarecrow Boat | 15 | +| Jet Black Pope | 4 | +| Nothing Rhymes with Orange | 6 | + +### **If-Else Statements** + +``` +# Create column called duration. If Songs > 10, duration is 'long'. +# Otherwise, duration is 'short'. +df['duration'] = df.apply(lambda row: 'long' if row.Songs > 10 + else 'short', axis = 1) + +# Create column called famous. If Band is 'Mouse Rat', famous is 1, +# otherwise 0. +df['famous'] = df.apply(lambda row: 1 if row.Band == 'Mouse Rat' + else 0, axis = 1) + +# An equivalent full function would be: +def tag_famous(row): + if row.Band == 'Mouse Rat': + return 1 + else: + return 0 + +df['famous'] = df.apply(tag_famous, axis = 1) + +df +``` + +| Band | Songs | duration | famous | +| -------------------------- | ----- | -------- | ------ | +| Mouse Rat | 30 | long | 1 | +| Scarecrow Boat | 15 | long | 0 | +| Jet Black Pope | 4 | short | 0 | +| Nothing Rhymes with Orange | 6 | short | 0 | + +### **Other Lambda Functions** + +``` +# Split the band name at the spaces +# [1] means we want to extract the second word +# [0:2] means we want to start at the first character +# and stop at (but not include) the 3rd character +df['word2_start'] = df.apply(lambda x: + x.Band.split(" ")[1][0:2], axis = 1) +df +``` + +| Band | Songs | word2_start | +| -------------------------- | ----- | ----------- | +| Mouse Rat | 30 | Ra | +| Scarecrow Boat | 15 | Bo | +| Jet Black Pope | 4 | Po | +| Nothing Rhymes with Orange | 6 | Or | + +### **Apply over Dataframe** + +You should use a full function when a function is too complicated to be a lambda function. These functions are defined by a name and are called upon to operate on the rows of a dataframe. 
You can also write more complex functions that bundle together all the steps (including nesting more functions) you want to execute over the dataframe. + +`df.apply` is one common usage of a function. + +``` +def years_active(row): + if row.Band == 'Mouse Rat': + return '2009-2014' + elif row.Band == 'Scarecrow Boat': + return '2009' + elif (row.Band == 'Jet Black Pope') or (row.Band == + 'Nothing Rhymes with Orange'): + return '2008' + +df['Active'] = df.apply(years_active, axis = 1) +df +``` + +| Band | Songs | Active | +| -------------------------- | ----- | --------- | +| Mouse Rat | 30 | 2009-2014 | +| Scarecrow Boat | 15 | 2009 | +| Jet Black Pope | 4 | 2008 | +| Nothing Rhymes with Orange | 6 | 2008 | + +## Grouping + +Sometimes it's necessary to create a new column to group together certain values of a column. Here are two ways to accomplish this: + +Method #1: Write a function using if-else statement and apply it using a lambda function. + +``` +# The function is called elected_year, and it operates on every row. +def elected_year(row): + # For each row, if Council_Member says 'Leslie Knope', then return 2012 + # as the value. + if row.Council_Member == 'Leslie Knope': + return 2012 + elif row.Council_Member == 'Jeremy Jamm': + return 2008 + elif row.Council_Member == 'Douglass Howser': + return 2006 + +# Use a lambda function to apply the elected_year function to all rows in the df. +# Don't forget axis = 1 (apply function to all rows)! +council_population['Elected'] = council_population.apply(lambda row: + elected_year(row), axis = 1) + +council_population +``` + +| CD | Council_Member | Population | Elected | +| --- | --------------- | ---------- | ------- | +| 1 | Leslie Knope | 1,500 | 2012 | +| 2 | Jeremy Jamm | 2,000 | 2008 | +| 3 | Douglass Howser | 2,250 | 2006 | + +Method #2: Loop over every value, fill in the new column value, then attach that new column. + +``` +# Create a list to store the new column +sales_group = [] + +for row in paunch_locations['Sales_millions']: + # If sales are more than $3M, but less than $5M, tag as moderate. + if (row >= 3) & (row <= 5) : + sales_group.append('moderate') + # If sales are more than $5M, tag as high. + elif row >=5: + sales_group.append('high') + # Anything else, aka, if sales are less than $3M, tag as low. + else: + sales_group.append('low') + +paunch_locations['sales_group'] = sales_group + +paunch_locations +``` + +| Store | City | Sales_millions | CD | Geometry | sales_group | +| ----- | ------------ | -------------- | --- | -------- | ----------- | +| 1 | Pawnee | $5 | 1 | (x1,y1) | moderate | +| 2 | Pawnee | $2.5 | 2 | (x2, y2) | low | +| 3 | Pawnee | $2.5 | 3 | (x3, y3) | low | +| 4 | Eagleton | $2 | | (x4, y4) | low | +| 5 | Pawnee | $4 | 1 | (x5, y5) | moderate | +| 6 | Pawnee | $6 | 2 | (x6, y6) | high | +| 7 | Indianapolis | $7 | | (x7, y7) | high | + +## Aggregating + +One of the most common form of summary statistics is aggregating by groups. In Excel, it's called a pivot table. In ArcGIS, it's doing a dissolve and calculating summary statistics. There are two ways to do it in Python: `groupby` and `agg` or `pivot_table`. 
+ +To answer the question of how many Paunch Burger locations there are per Council District and the sales generated per resident, + +``` +# Method #1: groupby and agg +pivot = merge2.groupby(['CD', 'Geometry_y']).agg({'Sales_millions': 'sum', + 'Store': 'count', 'Population': 'mean'}).reset_index() + +# Method #2: pivot table +pivot = merge2.pivot_table(index= ['CD', 'Geometry_y'], + values = ['Sales_millions', 'Store', 'Population'], + aggfunc= {'Sales_millions': 'sum', 'Store': 'count', + 'Population': 'mean'}).reset_index() + + # to only find one type of summary statistic, use aggfunc = 'sum' + +# reset_index() will compress the headers of the table, forcing them to appear +# in 1 row rather than 2 separate rows +``` + +`pivot` looks like this: + +| CD | Geometry_y | Sales_millions | Store | Council_Member | Population | +| --- | ---------- | -------------- | ----- | --------------- | ---------- | +| 1 | polygon | $9 | 2 | Leslie Knope | 1,500 | +| 2 | polygon | $8.5 | 2 | Jeremy Jamm | 2,000 | +| 3 | polygon | $2.5 | 1 | Douglass Howser | 2,250 | + +## Export Aggregated Output + +Python can do most of the heavy lifting for data cleaning, transformations, and general wrangling. But, for charts or tables, it might be preferable to finish in Excel so that visualizations conform to the corporate style guide. + +Dataframes can be exported into Excel and written into multiple sheets. + +``` +import xlsxwriter + +# initiate a writer +writer = pd.ExcelWriter('../outputs/filename.xlsx', engine='xlsxwriter') + +council_population.to_excel(writer, sheet_name = 'council_pop') +paunch_locations.to_excel(writer, sheet_name = 'paunch_locations') +merge2.to_excel(writer, sheet_name = 'merged_data') +pivot.to_excel(writer, sheet_name = 'pivot') + +# Close the Pandas Excel writer and output the Excel file. +writer.save() +``` + +Geodataframes can be exported as a shapefile or GeoJSON to visualize in ArcGIS/QGIS. + +``` +gdf.to_file(driver = 'ESRI Shapefile', filename = '../folder/my_shapefile.shp' ) + +gdf.to_file(driver = 'GeoJSON', filename = '../folder/my_geojson.geojson') +``` + +
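A quick, hedged supplement to the merge walkthrough in the new `01-data-analysis-intro.md` above: when joining `paunch_locations` to `council_population`, pandas' `indicator=True` option makes it easy to audit how many rows actually matched before trusting the result. The dataframes below are small stand-ins built from the tutorial's own tables; only the `how = 'outer'` audit step is new, and the expected counts are just what follows from that toy data.

```
import pandas as pd

# Stand-ins built from the tables above; CD is stored as a float so the
# stores without a council district (Eagleton, Indianapolis) can be NaN.
council_population = pd.DataFrame({
    'CD': [1.0, 2.0, 3.0],
    'Council_Member': ['Leslie Knope', 'Jeremy Jamm', 'Douglass Howser'],
    'Population': [1500, 2000, 2250],
})

paunch_locations = pd.DataFrame({
    'Store': [1, 2, 3, 4, 5, 6, 7],
    'City': ['Pawnee', 'Pawnee', 'Pawnee', 'Eagleton',
             'Pawnee', 'Pawnee', 'Indianapolis'],
    'Sales_millions': [5, 2.5, 2.5, 2, 4, 6, 7],
    'CD': [1, 2, 3, None, 1, 2, None],
})

# indicator = True adds a '_merge' column labeling every row 'both',
# 'left_only', or 'right_only'; how = 'outer' keeps the unmatched rows
# so they can be inspected instead of silently dropped.
check = pd.merge(paunch_locations, council_population, on = 'CD',
                 how = 'outer', validate = 'm:1', indicator = True)

print(check['_merge'].value_counts())
# Expect 5 matched rows ('both') and 2 unmatched stores ('left_only').
```

If `validate = 'm:1'` ever raises an error here, it usually means the right-hand table has duplicate keys, which is worth resolving before moving on to aggregation.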
diff --git a/docs/analytics_new_analysts/02-data-analysis-intermediate.md b/docs/analytics_new_analysts/02-data-analysis-intermediate.md index ad5e23877e..736113f44c 100644 --- a/docs/analytics_new_analysts/02-data-analysis-intermediate.md +++ b/docs/analytics_new_analysts/02-data-analysis-intermediate.md @@ -1,196 +1,193 @@ -(pandas-intermediate)= -# Data Analysis: Intermediate - -After polishing off the [intro tutorial](pandas-intro), you're ready to devour some more techniques to simplify your life as a data analyst. - -* [Create a new column using a dictionary to map the values](#create-a-new-column-using-a-dictionary-to-map-the-values) -* [Loop over columns with a dictionary](#loop-over-columns-with-a-dictionary) -* [Loop over dataframes with a dictionary](#loop-over-dataframes-with-a-dictionary) - - -## Getting Started - -``` -import numpy as np -import pandas as pd -import geopandas as gpd -``` - -### Create a New Column Using a Dictionary to Map the Values -Sometimes, you want to create a new column by converting one set of values into a different set of values. We could write a function or we could use the map function to add a new column. For our `df`, we want a new column that shows the state. - -`df`: person and birthplace - -| Person | Birthplace | -| ---| ---- | -| Leslie Knope | Eagleton, Indiana -| Tom Haverford | South Carolina | -| Ann Perkins | Michigan | -| Ben Wyatt | Partridge, Minnesota | - - -### Write a Function -[Quick refresher on functions](pandas-intro) - -``` -# Create a function called state_abbrev. -def state_abbrev(row): - # The find function returns the index of where 'Indiana' is found in - #the column. If it cannot find it, it returns -1. - if row.Birthplace.find('Indiana') != -1: - return 'IN' - elif row.Birthplace.find('South Carolina') != -1: - return 'SC' - # For an exact match, we would write it this way. - elif row.Birthplace == 'Michigan': - return 'MI' - elif row.Birthplace.find('Minnesota') != -1: - return 'MI' - -# Apply this function and create the State column. -df['State'] = df.apply(state_abbrev, axis = 1) -``` - -### Use a Dictionary to Map the Values -But, writing a function could take up a lot of space, especially with all the if-elif-else statements. Alternatively, a dictionary would also work. We could use a dictionary and map the four different city-state values into the state abbreviation. - -``` -state_abbrev1 = {'Eagleton, Indiana': 'IN', 'South Carolina': 'SC', - 'Michigan': 'MI', 'Partridge, Minnesota': 'MN'} - -df['State'] = df.Birthplace.map(state_abbrev1) -``` - -But, if we wanted to avoid writing out all the possible combinations, we would first extract the *state* portion of the city-state text. Then we could map the state's full name with its abbreviation. - -``` -# The split function splits at the comma and expand the columns. -# Everything is stored in a new df called 'fullname'. -fullname = df['Birthplace'].str.split(",", expand = True) - -# Add the City column into our df by extracting the first column (0) from fullname. -df['City'] = fullname[0] - -# Add the State column by extracting the second column (1) from fullname. -df['State_full'] = fullname[1] - - -# Tom Haverford's birthplace is South Carolina. We don't have city information. -# So, the City column would be incorrectly filled in with South Carolina, and -# the State would say None. -# Fix these so the Nones actually display the state information correctly. 
- -df['State_full'] = df.apply(lambda row: row.City if row.State == None else - row.State_full, axis = 1) - -# Now, use a dictionary to map the values. -state_abbrev2 = {'Indiana': 'IN', 'South Carolina': 'SC', - 'Michigan': 'MI', 'Minnesota': 'MN'} - -df['State'] = df.Birthplace.map(state_abbrev2) -``` - -All 3 methods would give us this `df`: - -| Person | Birthplace | State | -| ---| ---- | --- | -| Leslie Knope | Eagleton, Indiana | IN | -| Tom Haverford | South Carolina | SC | -| Ann Perkins | Michigan | MI | -| Ben Wyatt | Partridge, Minnesota | MN | - - - -### Loop over Columns with a Dictionary -If there are operations or data transformations that need to be performed on multiple columns, the best way to do that is with a loop. - -``` -columns = ['colA', 'colB', 'colC'] - -for c in columns: - # Fill in missing values for all columns with zeros - df[c] = df[c].fillna(0) - # Multiply all columns by 0.5 - df[c] = df[c] * 0.5 -``` - -### Loop over Dataframes with a Dictionary -It's easier and more efficient to use a loop to do the same operations over the different dataframes (df). Here, we want to find the number of Pawnee businesses and Tom Haverford businesses located in each Council District. - -This type of question is perfect for a loop. Each df will be spatially joined to the geodataframe `council_district`, followed by some aggregation. - -`business`: list of Pawnee stores - -| Business | longitude | latitude | Sales_millions | geometry -| ---| ---- | --- | ---| ---| -| Paunch Burger | x1 | y1 | 5 | Point(x1, y1) -| Sweetums | x2 | y2 | 30 | Point(x2, y2) -| Jurassic Fork | x3 | y3 | 2 | Point(x3, y3) -| Gryzzl | x4 | y4 | 40 | Point(x4, y4) - - -`tom`: list of Tom Haverford businesses - -| Business | longitude | latitude | Sales_millions | geometry -| ---| ---- | --- | ---| ---| -| Tom's Bistro | x1 | y1 |30 | Point(x1, y1) -| Entertainment 720 | x2 | y2 | 1 | Point(x2, y2) -| Rent-A-Swag | x3 | y3 | 4 | Point(x3, y3) - - -``` -# Save our existing dfs into a dictionary. The business df is named -# 'pawnee"; the tom df is named 'tom'. -dfs = {'pawnee': business, 'tom': tom} - -# Create an empty dictionary called summary_dfs to hold the results -summary_dfs = {} - -# Loop over key-value pairs -## Keys: pawnee, tom (names given to dataframes) -## Values: business, tom (dataframes) - -for key, value in dfs.items(): - # Use f string to define a variable join_df (result of our spatial join) - ## join_{key} would be join_pawnee or join_tom in the loop - join_df = "join_{key}" - # Spatial join - join_df = gpd.sjoin(value, council_district, how = 'inner', op = 'intersects') - # Calculate summary stats with groupby, agg, then save it into summary_dfs, - # naming it 'pawnee' or 'tom'. - summary_dfs[key] = join.groupby('ID').agg( - {'Business': 'count', 'Sales_millions': 'sum'}) -``` - -Now, our `summary_dfs` dictionary contains 2 items, which are the 2 dataframes with everything aggregated. 
- -``` -# To view the contents of this dictionary -for key, value in summary_dfs.items(): - display(key) - display(value) - -# To access the df -summary_dfs["pawnee"] -summary_dfs["tom"] -``` - -`join_tom`: result of spatial join between tom and council_district - -| Business | longitude | latitude | Sales_millions | geometry | ID -| ---| ---- | --- | ---| ---| --- | -| Tom's Bistro | x1 | y1 | 30 | Point(x1, y1) | 1 -| Entertainment 720 | x2 | y2 | 1 | Point(x2, y2) | 3 -| Rent-A-Swag | x3 | y3 | 4 | Point(x3, y3) | 3 - - -`summary_dfs["tom"]`: result of the counting number of Tom's businesses by CD - -| ID | Business | Sales_millions -| ---| ---- | --- | -| 1 | 1 | 30 -| 3 | 2 | 5 - - - -
+(pandas-intermediate)= + +# Data Analysis: Intermediate + +After polishing off the [intro tutorial](pandas-intro), you're ready to devour some more techniques to simplify your life as a data analyst. + +- [Create a new column using a dictionary to map the values](#create-a-new-column-using-a-dictionary-to-map-the-values) +- [Loop over columns with a dictionary](#loop-over-columns-with-a-dictionary) +- [Loop over dataframes with a dictionary](#loop-over-dataframes-with-a-dictionary) + +## Getting Started + +``` +import numpy as np +import pandas as pd +import geopandas as gpd +``` + +### Create a New Column Using a Dictionary to Map the Values + +Sometimes, you want to create a new column by converting one set of values into a different set of values. We could write a function or we could use the map function to add a new column. For our `df`, we want a new column that shows the state. + +`df`: person and birthplace + +| Person | Birthplace | +| ------------- | -------------------- | +| Leslie Knope | Eagleton, Indiana | +| Tom Haverford | South Carolina | +| Ann Perkins | Michigan | +| Ben Wyatt | Partridge, Minnesota | + +### Write a Function + +[Quick refresher on functions](pandas-intro) + +``` +# Create a function called state_abbrev. +def state_abbrev(row): + # The find function returns the index of where 'Indiana' is found in + #the column. If it cannot find it, it returns -1. + if row.Birthplace.find('Indiana') != -1: + return 'IN' + elif row.Birthplace.find('South Carolina') != -1: + return 'SC' + # For an exact match, we would write it this way. + elif row.Birthplace == 'Michigan': + return 'MI' + elif row.Birthplace.find('Minnesota') != -1: + return 'MI' + +# Apply this function and create the State column. +df['State'] = df.apply(state_abbrev, axis = 1) +``` + +### Use a Dictionary to Map the Values + +But, writing a function could take up a lot of space, especially with all the if-elif-else statements. Alternatively, a dictionary would also work. We could use a dictionary and map the four different city-state values into the state abbreviation. + +``` +state_abbrev1 = {'Eagleton, Indiana': 'IN', 'South Carolina': 'SC', + 'Michigan': 'MI', 'Partridge, Minnesota': 'MN'} + +df['State'] = df.Birthplace.map(state_abbrev1) +``` + +But, if we wanted to avoid writing out all the possible combinations, we would first extract the *state* portion of the city-state text. Then we could map the state's full name with its abbreviation. + +``` +# The split function splits at the comma and expand the columns. +# Everything is stored in a new df called 'fullname'. +fullname = df['Birthplace'].str.split(",", expand = True) + +# Add the City column into our df by extracting the first column (0) from fullname. +df['City'] = fullname[0] + +# Add the State column by extracting the second column (1) from fullname. +df['State_full'] = fullname[1] + + +# Tom Haverford's birthplace is South Carolina. We don't have city information. +# So, the City column would be incorrectly filled in with South Carolina, and +# the State would say None. +# Fix these so the Nones actually display the state information correctly. + +df['State_full'] = df.apply(lambda row: row.City if row.State == None else + row.State_full, axis = 1) + +# Now, use a dictionary to map the values. 
+state_abbrev2 = {'Indiana': 'IN', 'South Carolina': 'SC', + 'Michigan': 'MI', 'Minnesota': 'MN'} + +df['State'] = df.Birthplace.map(state_abbrev2) +``` + +All 3 methods would give us this `df`: + +| Person | Birthplace | State | +| ------------- | -------------------- | ----- | +| Leslie Knope | Eagleton, Indiana | IN | +| Tom Haverford | South Carolina | SC | +| Ann Perkins | Michigan | MI | +| Ben Wyatt | Partridge, Minnesota | MN | + +### Loop over Columns with a Dictionary + +If there are operations or data transformations that need to be performed on multiple columns, the best way to do that is with a loop. + +``` +columns = ['colA', 'colB', 'colC'] + +for c in columns: + # Fill in missing values for all columns with zeros + df[c] = df[c].fillna(0) + # Multiply all columns by 0.5 + df[c] = df[c] * 0.5 +``` + +### Loop over Dataframes with a Dictionary + +It's easier and more efficient to use a loop to do the same operations over the different dataframes (df). Here, we want to find the number of Pawnee businesses and Tom Haverford businesses located in each Council District. + +This type of question is perfect for a loop. Each df will be spatially joined to the geodataframe `council_district`, followed by some aggregation. + +`business`: list of Pawnee stores + +| Business | longitude | latitude | Sales_millions | geometry | +| ------------- | --------- | -------- | -------------- | ------------- | +| Paunch Burger | x1 | y1 | 5 | Point(x1, y1) | +| Sweetums | x2 | y2 | 30 | Point(x2, y2) | +| Jurassic Fork | x3 | y3 | 2 | Point(x3, y3) | +| Gryzzl | x4 | y4 | 40 | Point(x4, y4) | + +`tom`: list of Tom Haverford businesses + +| Business | longitude | latitude | Sales_millions | geometry | +| ----------------- | --------- | -------- | -------------- | ------------- | +| Tom's Bistro | x1 | y1 | 30 | Point(x1, y1) | +| Entertainment 720 | x2 | y2 | 1 | Point(x2, y2) | +| Rent-A-Swag | x3 | y3 | 4 | Point(x3, y3) | + +``` +# Save our existing dfs into a dictionary. The business df is named +# 'pawnee"; the tom df is named 'tom'. +dfs = {'pawnee': business, 'tom': tom} + +# Create an empty dictionary called summary_dfs to hold the results +summary_dfs = {} + +# Loop over key-value pairs +## Keys: pawnee, tom (names given to dataframes) +## Values: business, tom (dataframes) + +for key, value in dfs.items(): + # Use f string to define a variable join_df (result of our spatial join) + ## join_{key} would be join_pawnee or join_tom in the loop + join_df = "join_{key}" + # Spatial join + join_df = gpd.sjoin(value, council_district, how = 'inner', op = 'intersects') + # Calculate summary stats with groupby, agg, then save it into summary_dfs, + # naming it 'pawnee' or 'tom'. + summary_dfs[key] = join.groupby('ID').agg( + {'Business': 'count', 'Sales_millions': 'sum'}) +``` + +Now, our `summary_dfs` dictionary contains 2 items, which are the 2 dataframes with everything aggregated. 
+ +``` +# To view the contents of this dictionary +for key, value in summary_dfs.items(): + display(key) + display(value) + +# To access the df +summary_dfs["pawnee"] +summary_dfs["tom"] +``` + +`join_tom`: result of spatial join between tom and council_district + +| Business | longitude | latitude | Sales_millions | geometry | ID | +| ----------------- | --------- | -------- | -------------- | ------------- | --- | +| Tom's Bistro | x1 | y1 | 30 | Point(x1, y1) | 1 | +| Entertainment 720 | x2 | y2 | 1 | Point(x2, y2) | 3 | +| Rent-A-Swag | x3 | y3 | 4 | Point(x3, y3) | 3 | + +`summary_dfs["tom"]`: result of the counting number of Tom's businesses by CD + +| ID | Business | Sales_millions | +| --- | -------- | -------------- | +| 1 | 1 | 30 | +| 3 | 2 | 5 | + +
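One hedged sketch to go with the loop-over-dataframes pattern added in `02-data-analysis-intermediate.md` above: the snippet there builds `join_df = "join_{key}"` as a plain string and then aggregates a variable named `join` that is never defined, so it looks like a couple of variable names drifted. The version below keeps the spatial-join result in a single variable and aggregates that same variable. The geometries are invented purely so the example is self-contained, and note that newer GeoPandas spells the `op=` keyword as `predicate=`.

```
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Toy council districts and businesses; the column names (ID, Business,
# Sales_millions) mirror the tables above, but the coordinates are made up.
council_district = gpd.GeoDataFrame(
    {'ID': [1, 3]},
    geometry = [Polygon([(0, 0), (0, 10), (10, 10), (10, 0)]),
                Polygon([(10, 0), (10, 10), (20, 10), (20, 0)])])

business = gpd.GeoDataFrame(
    {'Business': ['Paunch Burger', 'Sweetums'], 'Sales_millions': [5, 30]},
    geometry = [Point(2, 2), Point(12, 5)])

tom = gpd.GeoDataFrame(
    {'Business': ["Tom's Bistro", 'Entertainment 720', 'Rent-A-Swag'],
     'Sales_millions': [30, 1, 4]},
    geometry = [Point(3, 3), Point(12, 2), Point(15, 5)])

dfs = {'pawnee': business, 'tom': tom}
summary_dfs = {}

for key, value in dfs.items():
    # Keep the joined frame in one variable and aggregate that same variable.
    # (Older GeoPandas releases use op = 'intersects' instead of predicate.)
    join_df = gpd.sjoin(value, council_district, how = 'inner',
                        predicate = 'intersects')
    summary_dfs[key] = join_df.groupby('ID').agg(
        {'Business': 'count', 'Sales_millions': 'sum'})

print(summary_dfs['tom'])
# With this toy data: district 1 has 1 business ($30M) and district 3 has
# 2 businesses ($5M total), matching the summary table above.
```

Storing each per-key result in the `summary_dfs` dictionary (rather than trying to build dynamically named variables) is what keeps the later `summary_dfs["tom"]` lookups shown above working unchanged.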
diff --git a/docs/analytics_new_analysts/03-data-management.md b/docs/analytics_new_analysts/03-data-management.md index fd4ce92993..369f1c4f12 100644 --- a/docs/analytics_new_analysts/03-data-management.md +++ b/docs/analytics_new_analysts/03-data-management.md @@ -1,27 +1,29 @@ (data-management-page)= + # Data Management Data Management is hard, and before you know it, you end up with `final_final_final_project_data-2019.csv.bak` as the source of your project's data. Below is a series of tips, tricks and use-cases for managing data throughout the lifecycle of a projects. -* [Reading and Writing Data](#reading-and-writing-data) - * [GCS](#gcs) - * [Local Folders](#local-folders) -* [Formats and Use-Cases](#formats-and-use-cases) - * [CSVs](#csvs) - * [Excel / XLSX](#excel) - * [Parquet](#parquet) - * [Feather Files](#feather-files) - * [GeoJSON](#geojson) - * [Shapefiles](#shapefiles) - * [PBF (Protocolbuffer Binary Format)](#pbf-protocolbuffer-binary-format) - * [Databases](#databases) - * [Pickles](#pickles) +- [Reading and Writing Data](#reading-and-writing-data) + - [GCS](#gcs) + - [Local Folders](#local-folders) +- [Formats and Use-Cases](#formats-and-use-cases) + - [CSVs](#csvs) + - [Excel / XLSX](#excel) + - [Parquet](#parquet) + - [Feather Files](#feather-files) + - [GeoJSON](#geojson) + - [Shapefiles](#shapefiles) + - [PBF (Protocolbuffer Binary Format)](#pbf-protocolbuffer-binary-format) + - [Databases](#databases) + - [Pickles](#pickles) ## Reading and Writing Data ### GCS + Our team often uses Google Cloud Storage (GCS) for object storage. If you haven't set up your Google authentication, go [here](https://docs.calitp.org/data-infra/analytics_tools/notebooks.html#connecting-to-warehouse) for the instructions. For a walkthrough on how to use GCS buckets, go [here](https://docs.calitp.org/data-infra/analytics_tools/storing_data.html#in-gcs). By putting data on GCS, anybody on the team can use/access/replicate the data without having to transfer data files between machines. @@ -38,6 +40,7 @@ pd.read_csv('gs://calitp-analytics-data/data-analyses/bucket-name/my_csv.csv') ``` ### Local Folders + Sometimes, it is easiest to simply use your local file system to store data. ``` @@ -50,16 +53,17 @@ pd.read_csv('./my_csv.csv') ``` ## Formats and Use-cases + Data Interchange: Where everything can be broken. ### CSVs + CSVs are the lowest common denominator of data files. They are plain text files that contain a list of data. They are best for getting raw data from SQL and storing large blobs on cloud services. For interchange, it is better to use Parquet or even Excel as they preserve datatypes. Benefits to CSVs include their readability and ease of use for users. Unlike Parquet files, they are stored as plain text, making them human readable. The downsides to CSVs are that their sizes can easily get out of hand, making Parquet files a preferable alternative in that regard. CSVs also don't store data types for columns. If there are different data types within a single column, this can lead to numerous issues. For example, if there are strings and integers mixed within a single column, the process of analyzing that CSV becomes extremely difficult and even impossible at times. Finally, another key issue with CSVs is the ability to only store a single sheet in a file without any formatting or formulas. Excel files do a better job of allowing for formulas and different formats. 
- ### Excel Excel/XLSX is a binary file format that holds information about all the worksheets in a file, including both content and formatting. This means Excel files are capable of holding formatting, images, charts, formulas, etc. CSVs are more limited in this respect. A downside to Excel files is that they aren't commonly readable by data analysis platforms. Every data analysis platform is capable of processing CSVs, but Excel files are a proprietary format that often require extensions in order to be processed. The ease of processing CSVs makes it easier to move data between different platforms, compared with Excel files. Excel files are best for sharing with other teams, except for geographic info (use Shapefiles or GeoJSON instead), if the Excel format is the only available and accessible format. @@ -80,6 +84,7 @@ writer.save() ``` ### Parquet + Parquet is an "open source columnar storage format for use in data analysis systems." Columnar storage is more efficient as it is easily compressed and the data is more homogenous. CSV files utilize a row-based storage format which is harder to compress, a reason why Parquets files are preferable for larger datasets. Parquet files are faster to read than CSVs, as they have a higher querying speed and preserve datatypes (i.e. Number, Timestamps, Points). They are best for intermediate data storage and large datasets (1GB+) on most any on-disk storage. This file format is also good for passing dataframes between Python and R. A similar option is [feather](https://blog.rstudio.com/2016/03/29/feather/). One of the downsides to Parquet files is the inability to quickly look at the dataset in GUI based (Excel, QGIS, etc.) programs. Parquet files also lack built-in support for categorical data. @@ -94,6 +99,7 @@ df2 = df.to_parquet('my_parquet_name.parquet') ``` ### Feather Files + Feather provides a lightweight binary columnar serialization format for data frames. It is designed to make reading and writing data frames more efficient, as well as to make sharing data across languages easier. Just like Parquet, Feather is also capable of passing dataframes between Python and R, as well as storing column data types. The Feather format is not compressed, allowing for faster input/output so it works well with solid-state drives. Similarly, Feather doesn't need unpacking in order to load it back into RAM. @@ -110,6 +116,7 @@ df = feather.read_dataframe(path) ``` ### GeoJSON + GeoJSON is an [open-standard format](https://geojson.org/) for encoding a variety of geographic data structures using JavaScript Object Notation (JSON). A GeoJSON object may represent a region of space (a Geometry), a spatially bounded entity (a Feature), or a list of Features (a FeatureCollection). It supports geometry types: Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection. JSON is light and easier to read than most geospatial formats, but GeoJSON files can quickly get too large to handle. The upside is that a GeoJSON file is often easier to work with than a Shapefile. ``` @@ -119,6 +126,7 @@ gdf = gpd.read_file('https://data.cityofnewyork.us/api/geospatial/tqmj-j8zm?meth ``` ### Shapefiles + Shapefiles are a geospatial vector data format for geographic information system software and the original file format for geospatial data. They are capable of spatially describing vector features: points, lines, and polygons. Geopandas has good support for reading / writing shapefiles. 
One weird thing, however, is that a shapefile isn't a _file_, it's a _folder_, containing multiple subfiles (such as .dbf, .shpx, etc). To properly read/write shapefiles, make sure to read the entire folder or write to a folder each time. This can cause issues especially as most shapefiles are compressed into a zip file with isn't always easily decompressed. @@ -126,6 +134,7 @@ One weird thing, however, is that a shapefile isn't a _file_, it's a _folder_, c It is often better to use `geojson` vs `shapefiles` since the former is easier to render on the web. The latter is better when you have a bespoke projection. A few downsides to shapefiles include their inability to store topological information and the file size restriction of 2GB. Similarly, shapefiles can only contain one geometry type per file. Here is a template for one way to read and write shapefiles using pandas: + ``` import geopandas as gpd import os @@ -140,10 +149,13 @@ gdf.to_file('./outputs/my_dir_name') ``` ### PBF (Protocolbuffer Binary Format) + Protocol Buffers is a method of serializing structured data. It is used for storing and interchanging structured information of all types. PBF involves an interface description language that describes the structure of some data and a program that generates source code from that description for generating or parsing a stream of bytes that represents the structured data. As compared to XML, it is designed to be simpler and quicker. A benefit of using PBF is that you can define how you want your data to be structured once and then use special generated source code to easily write and read your structured data to and from a variety of data streams. It is also possible to update the defined data structure without breaking deployed programs that are compiled against the older structure/format. Although PBF was designed as a better medium for communication between systems than XML, it only has some marginal advantages when compared to JSON. ### Databases + A whole field of study, it is often useful to use a DB for analytics and aggregated queries, rather than just your production datastore. ### Pickles + A way of serializing arbitrary python objects into a byte stream with the intent of storing it in a file/database. Danger lives here. diff --git a/docs/analytics_new_analysts/04-notebooks.md b/docs/analytics_new_analysts/04-notebooks.md index 997757bfa7..a329e31452 100644 --- a/docs/analytics_new_analysts/04-notebooks.md +++ b/docs/analytics_new_analysts/04-notebooks.md @@ -1,14 +1,15 @@ (nb-best-practices)= + # Working with Jupyter notebooks Jupyter notebooks are ubiquitous in the fields of data analysis, data science, and education. There are a number of reasons for their popularity, some of which are (in no particular order): -* They are user-friendly. -* They allow for richer outputs than plain code (e.g, images, equations, HTML, and prose). -* They allow for interactive, human-in-the-loop computing. -* They provide an easy route to publishing papers, technical documents, and blog posts that involve computation. -* They can be served over the internet, and can live in the cloud. +- They are user-friendly. +- They allow for richer outputs than plain code (e.g, images, equations, HTML, and prose). +- They allow for interactive, human-in-the-loop computing. +- They provide an easy route to publishing papers, technical documents, and blog posts that involve computation. +- They can be served over the internet, and can live in the cloud. 
However, the popularity of the format has also revealed some of its drawbacks, and prompted criticism of how notebooks are used. @@ -18,11 +19,10 @@ see [this](https://www.youtube.com/watch?v=7jiPeIFXb6U) talk from Joel Grus. This document is meant to outline some recommendations for how to best use notebooks. -* [Notebooks and Reproducibility](#notebooks-and-reproducibility) -* [Notebooks and Version Control](#notebooks-and-version-control) -* [Prose and Documentation](#prose-and-documentation) -* [Data Access](#data-access) - +- [Notebooks and Reproducibility](#notebooks-and-reproducibility) +- [Notebooks and Version Control](#notebooks-and-version-control) +- [Prose and Documentation](#prose-and-documentation) +- [Data Access](#data-access) ## Notebooks and Reproducibility @@ -68,19 +68,19 @@ and using them with most version control tools is painful. There are a few things that can be done to mitigate this: 1. Don't commit changes to a notebook unless you intend to. -Often opening and running a notebook can result in different metadata, -slightly different results, or produce errors. -In general, these differences are not worth committing to a code repository, -and such commits will mostly read as noise in your version history. -1. Use a tool like [`nbdime`](https://nbdime.readthedocs.io/en/latest). -This provides command-line tools for diffing and merging notebooks. -It provides git integration and a small web app for viewing differences between notebooks. -It is also available as a JupyterLab extension. -1. Move some code into scripts. There will often be large code blocks in notebooks. -Sometimes these code blocks are duplicated among many notebooks within a project repository. -Examples include code for cleaning and preprocessing data. -Often such code is best removed from the interactive notebook environment and put in plain-text scripts, -where it can be more easily automated and tracked. + Often opening and running a notebook can result in different metadata, + slightly different results, or produce errors. + In general, these differences are not worth committing to a code repository, + and such commits will mostly read as noise in your version history. +2. Use a tool like [`nbdime`](https://nbdime.readthedocs.io/en/latest). + This provides command-line tools for diffing and merging notebooks. + It provides git integration and a small web app for viewing differences between notebooks. + It is also available as a JupyterLab extension. +3. Move some code into scripts. There will often be large code blocks in notebooks. + Sometimes these code blocks are duplicated among many notebooks within a project repository. + Examples include code for cleaning and preprocessing data. + Often such code is best removed from the interactive notebook environment and put in plain-text scripts, + where it can be more easily automated and tracked. ## Prose and Documentation @@ -113,8 +113,8 @@ A few strategies to mitigate these issues: 1. Small datasets (less than a few megabytes) may be included in the code repositories for analyses. 2. Larger datasets may be stored elsewhere (S3, GCS, data portals, databases). -However, instructions to access them should be given in the repository. -Tools like [intake](https://intake.readthedocs.io/en/latest/) can help here. + However, instructions to access them should be given in the repository. + Tools like [intake](https://intake.readthedocs.io/en/latest/) can help here. 3. 
Credentials to access private data sources should be read from environment variables, -and never stored in code repositories or saved to notebooks. -The environment variables needed to access the data for an analysis should be documented in the project `README`. + and never stored in code repositories or saved to notebooks. + The environment variables needed to access the data for an analysis should be documented in the project `README`. diff --git a/docs/analytics_new_analysts/05-spatial-analysis-basics.md b/docs/analytics_new_analysts/05-spatial-analysis-basics.md index 9cc71c1858..f268045d0d 100644 --- a/docs/analytics_new_analysts/05-spatial-analysis-basics.md +++ b/docs/analytics_new_analysts/05-spatial-analysis-basics.md @@ -1,106 +1,113 @@ -(geo-basics)= -# Working with Geospatial Data: Basics - -Place matters. That's why data analysis often includes a geospatial or geographic component. Before we wrangle with our data, let's go over the basics and make sure we're properly set up. - -Below are short demos for getting started: -* [Import and export data in Python](#import-and-export-data-in-python) -* [Setting and projecting coordinate reference system](#setting-and-projecting-coordinate-reference-system) - -## Getting Started - -``` -# Import Python packages -import pandas as pd -import geopandas as gpd -``` - -## Import and Export Data in Python -### **Local files** -We import a tabular dataframe `my_csv.csv` and a geodataframe `my_geojson.geojson` or `my_shapefile.shp`. -``` -df = pd.read_csv('../folder/my_csv.csv') - -# GeoJSON -gdf = gpd.read_file('../folder/my_geojson.geojson') -gdf.to_file(driver = 'GeoJSON', filename = '../folder/my_geojson.geojson' ) - - -# Shapefile (collection of files: .shx, .shp, .prj, .dbf, etc) -# The collection files must be put into a folder before importing -gdf = gpd.read_file('../folder/my_shapefile/') -gdf.to_file(driver = 'ESRI Shapefile', filename = '../folder/my_shapefile.shp' ) -``` - -### **GCS** -To read in our dataframe (df) and geodataframe (gdf) from GCS: - -``` -df = pd.read_csv('gs://calitp-analytics-data/data-analyses/bucket-name/my-csv.csv') -gdf = gpd.read_file('gs://calitp-analytics-data/data-analyses/bucket-name/my-geojson.geojson') -gdf = gpd.read_parquet('gs://calitp-analytics-data/data-analyses/bucket-name/my-geoparquet.parquet', engine= 'auto') -gdf = gpd.read_file('gs://calitp-analytics-data/data-analyses/bucket-name/my-shapefile.zip') - -# Write a file to GCS -gdf.to_file('gs://calitp-analytics-data/data-analyses/bucket-name/my-geojson.geojson', driver='GeoJSON') - -#Using shared utils -GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/" -FILE_NAME = "test_geoparquet" -utils.geoparquet_gcs_export(gdf, GCS_FILE_PATH, FILE_NAME) - -``` - -Additional general information about various file types can be found in the [Data Management section](data-management-page). - -## Setting and Projecting Coordinate Reference System -A coordinate reference system (CRS) tells geopandas how to plot the coordinates on the Earth. Starting with a shapefile usually means that the CRS is already set. In that case, we are interested in re-projecting the gdf to a different CRS. The CRS is chosen specific to a region (i.e., USA, Southern California, New York, etc) or for its map units (i.e., decimal degrees, US feet, meters, etc). Map units that are US feet or meters are easier to work when it comes to defining distances (100 ft buffer, etc). - -In Python, there are 2 related concepts: -1. 
Setting the CRS <--> corresponds to geographic coordinate system in ArcGIS -2. Re-projecting the CRS <--> corresponds to datum transformation and projected coordinated system in ArcGIS - - - -The ArcGIS equivalent of this is in [3 related concepts](https://pro.arcgis.com/en/pro-app/help/mapping/properties/coordinate-systems-and-projections.htm): -1. geographic coordinate system -2. datum transformation -3. projected coordinate system - -The **geographic coordinate system** is the coordinate system of the latitude and longitude points. Common ones are WGS84, NAD83, and NAD27. - -**Datum transformation** is needed when the geographic coordinate systems of two layers do not match. A datum transformation is needed to convert NAD1983 into WGS84. - -The **projected coordinate system** projects the coordinates onto the map. ArcGIS projects "on the fly", and applies the first layer's projection to all subsequent layers. The projection does not change the coordinates from WGS84, but displays the points from a 3D sphere onto a 2D map. The projection determines how the Earth's sphere is unfolded and flattened. - -In ArcGIS, layers must have the same geographic coordinate system and projected coordinate system before spatial analysis can occur. Since ArcGIS allows you to choose the map units (i.e., feet, miles, meters) for proximity analysis, projections are chosen primarily for the region to be mapped. - -In Python, the `geometry` column holds information about the geographic coordinate system and its projection. All gdfs must be set to the same CRS before performing any spatial operations between them. Changing `geometry` from WGS84 to CA State Plane is a datum transformation (WGS84 to NAD83) and projection to CA State Plane Zone 5. - -``` -# Check to see what the CRS is -gdf.crs - -# If there is a CRS set, you can change the projection -# Here, change to CA State Plane (units = US feet) -gdf = gdf.to_crs('EPSG:2229') -``` - -Sometimes, the gdf does not have a CRS set and you will need to be manually set it. This might occur if you create the `geometry` column from latitude and longitude points. More on this in the [intermediate tutorial](geo-intermediate): - -There are [lots of different CRS available](https://epsg.io). The most common ones used for California are: - -| EPSG | Name | Map Units -| ---| ---- | --- | -| 4326 | WGS84 | decimal degrees -| 2229 | CA State Plane Zone 5 | US feet -| 3310 | CA Albers | meters - -``` -# If the CRS is not set after checking it with gdf.crs - -gdf = gdf.set_crs('EPSG:4326') - -``` - -
+(geo-basics)= + +# Working with Geospatial Data: Basics + +Place matters. That's why data analysis often includes a geospatial or geographic component. Before we wrangle with our data, let's go over the basics and make sure we're properly set up. + +Below are short demos for getting started: + +- [Import and export data in Python](#import-and-export-data-in-python) +- [Setting and projecting coordinate reference system](#setting-and-projecting-coordinate-reference-system) + +## Getting Started + +``` +# Import Python packages +import pandas as pd +import geopandas as gpd +``` + +## Import and Export Data in Python + +### **Local files** + +We import a tabular dataframe `my_csv.csv` and a geodataframe `my_geojson.geojson` or `my_shapefile.shp`. + +``` +df = pd.read_csv('../folder/my_csv.csv') + +# GeoJSON +gdf = gpd.read_file('../folder/my_geojson.geojson') +gdf.to_file(driver = 'GeoJSON', filename = '../folder/my_geojson.geojson' ) + + +# Shapefile (collection of files: .shx, .shp, .prj, .dbf, etc) +# The collection files must be put into a folder before importing +gdf = gpd.read_file('../folder/my_shapefile/') +gdf.to_file(driver = 'ESRI Shapefile', filename = '../folder/my_shapefile.shp' ) +``` + +### **GCS** + +To read in our dataframe (df) and geodataframe (gdf) from GCS: + +``` +df = pd.read_csv('gs://calitp-analytics-data/data-analyses/bucket-name/my-csv.csv') +gdf = gpd.read_file('gs://calitp-analytics-data/data-analyses/bucket-name/my-geojson.geojson') +gdf = gpd.read_parquet('gs://calitp-analytics-data/data-analyses/bucket-name/my-geoparquet.parquet', engine= 'auto') +gdf = gpd.read_file('gs://calitp-analytics-data/data-analyses/bucket-name/my-shapefile.zip') + +# Write a file to GCS +gdf.to_file('gs://calitp-analytics-data/data-analyses/bucket-name/my-geojson.geojson', driver='GeoJSON') + +#Using shared utils +GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/" +FILE_NAME = "test_geoparquet" +utils.geoparquet_gcs_export(gdf, GCS_FILE_PATH, FILE_NAME) + +``` + +Additional general information about various file types can be found in the [Data Management section](data-management-page). + +## Setting and Projecting Coordinate Reference System + +A coordinate reference system (CRS) tells geopandas how to plot the coordinates on the Earth. Starting with a shapefile usually means that the CRS is already set. In that case, we are interested in re-projecting the gdf to a different CRS. The CRS is chosen specific to a region (i.e., USA, Southern California, New York, etc) or for its map units (i.e., decimal degrees, US feet, meters, etc). Map units that are US feet or meters are easier to work when it comes to defining distances (100 ft buffer, etc). + +In Python, there are 2 related concepts: + +1. Setting the CRS \<--> corresponds to geographic coordinate system in ArcGIS +2. Re-projecting the CRS \<--> corresponds to datum transformation and projected coordinated system in ArcGIS + +The ArcGIS equivalent of this is in [3 related concepts](https://pro.arcgis.com/en/pro-app/help/mapping/properties/coordinate-systems-and-projections.htm): + +1. geographic coordinate system +2. datum transformation +3. projected coordinate system + +The **geographic coordinate system** is the coordinate system of the latitude and longitude points. Common ones are WGS84, NAD83, and NAD27. + +**Datum transformation** is needed when the geographic coordinate systems of two layers do not match. A datum transformation is needed to convert NAD1983 into WGS84. 
+
+The **projected coordinate system** projects the coordinates onto the map. ArcGIS projects "on the fly", and applies the first layer's projection to all subsequent layers. The projection does not change the coordinates from WGS84, but displays the points from a 3D sphere onto a 2D map. The projection determines how the Earth's sphere is unfolded and flattened.
+
+In ArcGIS, layers must have the same geographic coordinate system and projected coordinate system before spatial analysis can occur. Since ArcGIS allows you to choose the map units (i.e., feet, miles, meters) for proximity analysis, projections are chosen primarily for the region to be mapped.
+
+In Python, the `geometry` column holds information about the geographic coordinate system and its projection. All gdfs must be set to the same CRS before performing any spatial operations between them. Changing `geometry` from WGS84 to CA State Plane is a datum transformation (WGS84 to NAD83) and a projection to CA State Plane Zone 5.
+
+```
+# Check to see what the CRS is
+gdf.crs
+
+# If there is a CRS set, you can change the projection
+# Here, change to CA State Plane (units = US feet)
+gdf = gdf.to_crs('EPSG:2229')
+```
+
+Sometimes, the gdf does not have a CRS set and you will need to set it manually. This might occur if you create the `geometry` column from latitude and longitude points. More on this in the [intermediate tutorial](geo-intermediate).
+
+There are [lots of different CRS available](https://epsg.io). The most common ones used for California are:
+
+| EPSG | Name                  | Map Units       |
+| ---- | --------------------- | --------------- |
+| 4326 | WGS84                 | decimal degrees |
+| 2229 | CA State Plane Zone 5 | US feet         |
+| 3310 | CA Albers             | meters          |
+
+```
+# If the CRS is not set after checking it with gdf.crs
+
+gdf = gdf.set_crs('EPSG:4326')
+
+```
+
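+Putting these pieces together, here is a minimal sketch (assuming `gdf` holds point data in latitude/longitude) that sets a missing CRS, re-projects to a feet-based CRS, and then measures in feet:
+
+```
+# Set the CRS if it is missing, then project to CA State Plane (US feet)
+if gdf.crs is None:
+    gdf = gdf.set_crs('EPSG:4326')
+
+gdf = gdf.to_crs('EPSG:2229')
+
+# Distances are now in US feet, e.g. a quarter-mile (1,320 ft) buffer
+gdf['quarter_mile_buffer'] = gdf.geometry.buffer(1320)
+```
+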
diff --git a/docs/analytics_new_analysts/06-spatial-analysis-intro.md b/docs/analytics_new_analysts/06-spatial-analysis-intro.md index bbfe9777fa..600abea2d3 100644 --- a/docs/analytics_new_analysts/06-spatial-analysis-intro.md +++ b/docs/analytics_new_analysts/06-spatial-analysis-intro.md @@ -1,233 +1,239 @@ -(geo-intro)= -# Working with Geospatial Data: Intro - -Place matters. That's why data analysis often includes a geospatial or geographic component. Data analysts are called upon to merge tabular and geospatial data, count the number of points within given boundaries, and create a map illustrating the results. - -Below are short demos of common techniques to help get you started with exploring your geospatial data. -* [Merge tabular and geospatial data](#merge-tabular-and-geospatial-data) -* [Attach geographic characteristics to all points or lines that fall within a boundary (spatial join and dissolve)](#attach-geographic-characteristics-to-all-points-or-lines-that-fall-within-a-boundary) -* [Aggregate and calculate summary statistics](#aggregate-and-calculate-summary-statistics) -* [Buffers](#buffers) - - -## Getting Started - -``` -# Import Python packages -import pandas as pd -import geopandas as gpd -``` - -## Merge Tabular and Geospatial Data -We have two files: Council District boundaries (geospatial) and population values (tabular). Through visual inspection, we know that `CD` and `District` are columns that help us make this match. - -`df`: population by council district - -| CD | Council_Member | Population | -| ---| ---- | --- | -| 1 | Leslie Knope | 1,500 | -| 2 | Jeremy Jamm | 2,000 -| 3 | Douglass Howser | 2,250 - -`gdf`: council district boundaries - -| District | Geometry -| ---| ---- | -| 1 | polygon -| 2 | polygon -| 3 | polygon - -We could merge these two dfs using the District and CD columns. If our left df is a geodataframe (gdf), then our merged df will also be a gdf. -``` -merge = pd.merge(gdf, df, left_on = 'District', right_on = 'CD') -merge -``` - -| District | Geometry | CD | Council_Member | Population -| ---| ---- | --- | --- | --- | -| 1 | polygon | 1 | Leslie Knope | 1,500 -| 2 | polygon | 2 | Jeremy Jamm | 2,000 -| 3 | polygon | 3 | Douglass Howser | 2,250 - -## Attach Geographic Characteristics to All Points or Lines That Fall Within a Boundary - -Sometimes with a point shapefile (list of lat/lon points), we want to count how many points fall within the boundary. Unlike the previous example, these points aren't attached with Council District information, so we need to generate that ourselves. - -The ArcGIS equivalent of this is a **spatial join** between the point and polygon shapefiles, then **dissolving** to calculate summary statistics. - -``` -locations = gpd.read_file('../folder/paunch_burger_locations.geojson') -gdf = gpd.read_file('../folder/council_boundaries.geojson') - -# Make sure both our gdfs are projected to the same coordinate reference system -# (EPSG:4326 = WGS84) -locations = locations.to_crs('EPSG:4326') -gdf = gdf.to_crs('EPSG:4326') - -``` - -`locations` lists the Paunch Burgers locations and their annual sales. - -| Store | City | Sales_millions | Geometry | -| ---| ---- | --- | --- | -| 1 | Pawnee | $5 | (x1,y1) -| 2 | Pawnee | $2.5 | (x2, y2) -| 3 | Pawnee | $2.5 | (x3, y3) -| 4 | Eagleton | $2 | (x4, y4) -| 5 | Pawnee | $4 | (x5, y5) -| 6 | Pawnee | $6 | (x6, y6) -| 7 | Indianapolis | $7 | (x7, y7) - -`gdf` is the Council District boundaries. 
- -| District | Geometry -| ---| ---- -| 1 | polygon -| 2 | polygon -| 3 | polygon - -A spatial join finds the Council District the location falls within and attaches that information. - -``` -join = gpd.sjoin(locations, gdf, how = 'inner', predicate = 'intersects') - -# how = 'inner' means that we only want to keep observations that matched, -# i.e locations that were within the council district boundaries. -# predicate = 'intersects' means that we are joining based on whether or not the location -# intersects with the council district. -``` - -The `join` gdf looks like this. We lost Stores 4 (Eagleton) and 7 (Indianapolis) because they were outside of Pawnee City Council boundaries. - -| Store | City | Sales_millions | Geometry_x | District | Geometry_y -| ---| ---- | --- | --- | --- | ---| -| 1 | Pawnee | $5 | (x1,y1) | 1 | polygon -| 2 | Pawnee | $2.5 | (x2, y2) | 2 | polygon -| 3 | Pawnee | $2.5 | (x3, y3) | 3 | polygon -| 5 | Pawnee | $4 | (x5, y5) | 1 | polygon -| 6 | Pawnee | $6 | (x6, y6) | 2 | polygon - - -## Aggregate and Calculate Summary Statistics -We want to count the number of Paunch Burger locations and their total sales within each District. - -``` -summary = join.pivot_table(index = ['District', 'Geometry_y], - values = ['Store', 'Sales_millions'], - aggfunc = {'Store': 'count', 'Sales_millions': 'sum'}).reset_index() - -OR - -summary = join.groupby(['District', 'Geometry_y']).agg({'Store': 'count', - 'Sales_millions': 'sum'}).reset_index() - -summary.rename(column = {'Geometry_y': 'Geometry'}, inplace = True) -summary -``` - -| District | Store | Sales_millions | Geometry -| ---| ---- | --- | --- -| 1 | 2 | $9 | polygon -| 2 | 2 | $8.5 | polygon -| 3 | 1 | $2.5 | polygon - -By keeping the `Geometry` column, we're able to export this as a GeoJSON or shapefile. - -``` -summary.to_file(driver = 'GeoJSON', - filename = '../folder/pawnee_sales_by_district.geojson') - -summary.to_file(driver = 'ESRI Shapefile', - filename = '../folder/pawnee_sales_by_district.shp') -``` - -## Buffers -Buffers are areas of a certain distance around a given point, line, or polygon. Buffers are used to determine proximity . A 5 mile buffer around a point would be a circle of 5 mile radius centered at the point. This [ESRI page](http://desktop.arcgis.com/en/arcmap/10.3/tools/analysis-toolbox/buffer.htm) shows how buffers for points, lines, and polygons look. - -Some examples of questions that buffers help answer are: -* How many stores are within 1 mile of my house? -* Which streets are within 5 miles of the mall? -* Which census tracts or neighborhoods are within a half mile from the rail station? - -Small buffers can also be used to determine whether 2 points are located in the same place. A shopping mall or the park might sit on a large property. If points are geocoded to various areas of the mall/park, they would show up as 2 distinct locations, when in reality, we consider them the same location. - -We start with two point shapefiles: `locations` (Paunch Burger locations) and `homes` (home addresses for my 2 friends). The goal is to find out how many Paunch Burgers are located within a 2 miles of my friends. 
- -`locations`: Paunch Burger locations - -| Store | City | Sales_millions | Geometry -| ---| ---- | --- | --- | -| 1 | Pawnee | $5 | (x1,y1) -| 2 | Pawnee | $2.5 | (x2, y2) -| 3 | Pawnee | $2.5 | (x3, y3) -| 4 | Eagleton | $2 | (x4, y4) -| 5 | Pawnee | $4 | (x5, y5) -| 6 | Pawnee | $6 | (x6, y6) -| 7 | Indianapolis | $7 | (x7, y7) - - -`homes`: friends' addresses - -| Name | Geometry -| ---| ---- | -| Leslie Knope | (x8, y8) -| Ann Perkins | (x9, y9) - -First, prepare our point gdf and change it to the right projection. Pawnee is in Indiana, so we'll use EPSG:2965. - -``` -# Use NAD83/Indiana East projection (units are in feet) -homes = homes.to_crs('EPSG:2965') -locations = locations.to_crs('EPSG:2965') -``` - -Next, draw a 2 mile buffer around `homes`. -``` -# Make a copy of the homes gdf -homes_buffer = homes.copy() - -# Overwrite the existing geometry and change it from point to polygon -miles_to_feet = 5280 -two_miles = 2 * miles_to_feet -homes_buffer['geometry'] = homes.geometry.buffer(two_miles) -``` - -### **Select Points Within a Buffer** - -Do a spatial join between `locations` and `homes_buffer`. Repeat the process of spatial join and aggregation in Python as illustrated in the previous section (spatial join and dissolve in ArcGIS). - -``` -sjoin = gpd.sjoin(locations, homes_buffer, how = 'inner', predicate = 'intersects') -sjoin -``` - -`sjoin` looks like this. -* Geometry_x is the point geometry from our left df `locations`. -* Geometry_y is the polygon geometry from our right df `homes_buffer`. - -| Store | Geometry_x | Name | Geometry_y -| ---| ---- | --- | --- | -| 1 | (x1,y1) | Leslie Knope | polygon -| 3 | (x3, y3) | Ann Perkins | polygon -| 5 | (x5, y5) | Leslie Knope | polygon -| 6 | (x6, y6) | Leslie Knope | polygon - -Count the number of Paunch Burger locations for each friend. - -``` -count = sjoin.pivot_table(index = 'Name', - values = 'Store', aggfunc = 'count').reset_index() - -OR - -count = sjoin.groupby('Name').agg({'Store':'count'}).reset_index() -``` - -The final `count`: - -| Name | Store -| ---| ---- | -| Leslie Knope | 3 -| Ann Perkins | 1 - -
+(geo-intro)= + +# Working with Geospatial Data: Intro + +Place matters. That's why data analysis often includes a geospatial or geographic component. Data analysts are called upon to merge tabular and geospatial data, count the number of points within given boundaries, and create a map illustrating the results. + +Below are short demos of common techniques to help get you started with exploring your geospatial data. + +- [Merge tabular and geospatial data](#merge-tabular-and-geospatial-data) +- [Attach geographic characteristics to all points or lines that fall within a boundary (spatial join and dissolve)](#attach-geographic-characteristics-to-all-points-or-lines-that-fall-within-a-boundary) +- [Aggregate and calculate summary statistics](#aggregate-and-calculate-summary-statistics) +- [Buffers](#buffers) + +## Getting Started + +``` +# Import Python packages +import pandas as pd +import geopandas as gpd +``` + +## Merge Tabular and Geospatial Data + +We have two files: Council District boundaries (geospatial) and population values (tabular). Through visual inspection, we know that `CD` and `District` are columns that help us make this match. + +`df`: population by council district + +| CD | Council_Member | Population | +| --- | --------------- | ---------- | +| 1 | Leslie Knope | 1,500 | +| 2 | Jeremy Jamm | 2,000 | +| 3 | Douglass Howser | 2,250 | + +`gdf`: council district boundaries + +| District | Geometry | +| -------- | -------- | +| 1 | polygon | +| 2 | polygon | +| 3 | polygon | + +We could merge these two dfs using the District and CD columns. If our left df is a geodataframe (gdf), then our merged df will also be a gdf. + +``` +merge = pd.merge(gdf, df, left_on = 'District', right_on = 'CD') +merge +``` + +| District | Geometry | CD | Council_Member | Population | +| -------- | -------- | --- | --------------- | ---------- | +| 1 | polygon | 1 | Leslie Knope | 1,500 | +| 2 | polygon | 2 | Jeremy Jamm | 2,000 | +| 3 | polygon | 3 | Douglass Howser | 2,250 | + +## Attach Geographic Characteristics to All Points or Lines That Fall Within a Boundary + +Sometimes with a point shapefile (list of lat/lon points), we want to count how many points fall within the boundary. Unlike the previous example, these points aren't attached with Council District information, so we need to generate that ourselves. + +The ArcGIS equivalent of this is a **spatial join** between the point and polygon shapefiles, then **dissolving** to calculate summary statistics. + +``` +locations = gpd.read_file('../folder/paunch_burger_locations.geojson') +gdf = gpd.read_file('../folder/council_boundaries.geojson') + +# Make sure both our gdfs are projected to the same coordinate reference system +# (EPSG:4326 = WGS84) +locations = locations.to_crs('EPSG:4326') +gdf = gdf.to_crs('EPSG:4326') + +``` + +`locations` lists the Paunch Burgers locations and their annual sales. + +| Store | City | Sales_millions | Geometry | +| ----- | ------------ | -------------- | -------- | +| 1 | Pawnee | $5 | (x1,y1) | +| 2 | Pawnee | $2.5 | (x2, y2) | +| 3 | Pawnee | $2.5 | (x3, y3) | +| 4 | Eagleton | $2 | (x4, y4) | +| 5 | Pawnee | $4 | (x5, y5) | +| 6 | Pawnee | $6 | (x6, y6) | +| 7 | Indianapolis | $7 | (x7, y7) | + +`gdf` is the Council District boundaries. + +| District | Geometry | +| -------- | -------- | +| 1 | polygon | +| 2 | polygon | +| 3 | polygon | + +A spatial join finds the Council District the location falls within and attaches that information. 
+
+```
+join = gpd.sjoin(locations, gdf, how = 'inner', predicate = 'intersects')
+
+# how = 'inner' means that we only want to keep observations that matched,
+# i.e. locations that were within the council district boundaries.
+# predicate = 'intersects' means that we are joining based on whether or not the location
+# intersects with the council district.
+```
+
+The `join` gdf looks like this. We lost Stores 4 (Eagleton) and 7 (Indianapolis) because they were outside of Pawnee City Council boundaries.
+
+| Store | City   | Sales_millions | Geometry_x | District | Geometry_y |
+| ----- | ------ | -------------- | ---------- | -------- | ---------- |
+| 1     | Pawnee | $5             | (x1,y1)    | 1        | polygon    |
+| 2     | Pawnee | $2.5           | (x2, y2)   | 2        | polygon    |
+| 3     | Pawnee | $2.5           | (x3, y3)   | 3        | polygon    |
+| 5     | Pawnee | $4             | (x5, y5)   | 1        | polygon    |
+| 6     | Pawnee | $6             | (x6, y6)   | 2        | polygon    |
+
+## Aggregate and Calculate Summary Statistics
+
+We want to count the number of Paunch Burger locations and their total sales within each District.
+
+```
+summary = join.pivot_table(index = ['District', 'Geometry_y'],
+                values = ['Store', 'Sales_millions'],
+                aggfunc = {'Store': 'count', 'Sales_millions': 'sum'}).reset_index()
+
+OR
+
+summary = join.groupby(['District', 'Geometry_y']).agg({'Store': 'count',
+                'Sales_millions': 'sum'}).reset_index()
+
+summary.rename(columns = {'Geometry_y': 'Geometry'}, inplace = True)
+summary
+```
+
+| District | Store | Sales_millions | Geometry |
+| -------- | ----- | -------------- | -------- |
+| 1        | 2     | $9             | polygon  |
+| 2        | 2     | $8.5           | polygon  |
+| 3        | 1     | $2.5           | polygon  |
+
+By keeping the `Geometry` column, we're able to export this as a GeoJSON or shapefile.
+
+```
+summary.to_file(driver = 'GeoJSON',
+                filename = '../folder/pawnee_sales_by_district.geojson')
+
+summary.to_file(driver = 'ESRI Shapefile',
+                filename = '../folder/pawnee_sales_by_district.shp')
+```
+
+## Buffers
+
+Buffers are areas of a certain distance around a given point, line, or polygon. Buffers are used to determine proximity. A 5 mile buffer around a point would be a circle of 5 mile radius centered at the point. This [ESRI page](http://desktop.arcgis.com/en/arcmap/10.3/tools/analysis-toolbox/buffer.htm) shows how buffers for points, lines, and polygons look.
+
+Some examples of questions that buffers help answer are:
+
+- How many stores are within 1 mile of my house?
+- Which streets are within 5 miles of the mall?
+- Which census tracts or neighborhoods are within a half mile from the rail station?
+
+Small buffers can also be used to determine whether 2 points are located in the same place. A shopping mall or the park might sit on a large property. If points are geocoded to various areas of the mall/park, they would show up as 2 distinct locations, when in reality, we consider them the same location.
+
+We start with two point shapefiles: `locations` (Paunch Burger locations) and `homes` (home addresses for my 2 friends). The goal is to find out how many Paunch Burgers are located within 2 miles of my friends.
+ +`locations`: Paunch Burger locations + +| Store | City | Sales_millions | Geometry | +| ----- | ------------ | -------------- | -------- | +| 1 | Pawnee | $5 | (x1,y1) | +| 2 | Pawnee | $2.5 | (x2, y2) | +| 3 | Pawnee | $2.5 | (x3, y3) | +| 4 | Eagleton | $2 | (x4, y4) | +| 5 | Pawnee | $4 | (x5, y5) | +| 6 | Pawnee | $6 | (x6, y6) | +| 7 | Indianapolis | $7 | (x7, y7) | + +`homes`: friends' addresses + +| Name | Geometry | +| ------------ | -------- | +| Leslie Knope | (x8, y8) | +| Ann Perkins | (x9, y9) | + +First, prepare our point gdf and change it to the right projection. Pawnee is in Indiana, so we'll use EPSG:2965. + +``` +# Use NAD83/Indiana East projection (units are in feet) +homes = homes.to_crs('EPSG:2965') +locations = locations.to_crs('EPSG:2965') +``` + +Next, draw a 2 mile buffer around `homes`. + +``` +# Make a copy of the homes gdf +homes_buffer = homes.copy() + +# Overwrite the existing geometry and change it from point to polygon +miles_to_feet = 5280 +two_miles = 2 * miles_to_feet +homes_buffer['geometry'] = homes.geometry.buffer(two_miles) +``` + +### **Select Points Within a Buffer** + +Do a spatial join between `locations` and `homes_buffer`. Repeat the process of spatial join and aggregation in Python as illustrated in the previous section (spatial join and dissolve in ArcGIS). + +``` +sjoin = gpd.sjoin(locations, homes_buffer, how = 'inner', predicate = 'intersects') +sjoin +``` + +`sjoin` looks like this. + +- Geometry_x is the point geometry from our left df `locations`. +- Geometry_y is the polygon geometry from our right df `homes_buffer`. + +| Store | Geometry_x | Name | Geometry_y | +| ----- | ---------- | ------------ | ---------- | +| 1 | (x1,y1) | Leslie Knope | polygon | +| 3 | (x3, y3) | Ann Perkins | polygon | +| 5 | (x5, y5) | Leslie Knope | polygon | +| 6 | (x6, y6) | Leslie Knope | polygon | + +Count the number of Paunch Burger locations for each friend. + +``` +count = sjoin.pivot_table(index = 'Name', + values = 'Store', aggfunc = 'count').reset_index() + +OR + +count = sjoin.groupby('Name').agg({'Store':'count'}).reset_index() +``` + +The final `count`: + +| Name | Store | +| ------------ | ----- | +| Leslie Knope | 3 | +| Ann Perkins | 1 | + +
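+The buffers section above also mentioned using small buffers to decide whether two geocoded points are really the same place. Here is a short sketch of that idea; it assumes `locations` is still projected to EPSG:2965, so distances are in US feet.
+
+```
+# Two hypothetical rows from the same gdf
+point_a = locations.geometry.iloc[0]
+point_b = locations.geometry.iloc[1]
+
+# Treat the points as one location if they are within 100 ft of each other
+same_place = point_a.distance(point_b) <= 100
+```
+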
diff --git a/docs/analytics_new_analysts/07-spatial-analysis-intermediate.md b/docs/analytics_new_analysts/07-spatial-analysis-intermediate.md index 7e5e21afa4..33f390e7b8 100644 --- a/docs/analytics_new_analysts/07-spatial-analysis-intermediate.md +++ b/docs/analytics_new_analysts/07-spatial-analysis-intermediate.md @@ -1,213 +1,212 @@ -(geo-intermediate)= -# Working with Geospatial Data: Intermediate - -After breezing through the [intro tutorial](geo-intro), you're ready to take your spatial analysis to the next level. - -Below are short demos of other common manipulations of geospatial data. -* [Create geometry column from latitude and longitude coordinates](#create-geometry-column-from-latitude-and-longitude-coordinates) -* [Create geometry column from text](#create-geometry-column-from-text) -* [Use a loop to do spatial joins and aggregations over different boundaries](#use-a-loop-to-do-spatial-joins-and-aggregations-over-different-boundaries) -* [Multiple geometry columns](#multiple-geometry-columns) - - -## Getting Started -``` -# Import Python packages -import pandas as pd -import geopandas as gpd -from shapely.geometry import Point - -df = pd.read_csv('../folder/pawnee_businesses.csv') -``` - -| Business | X | Y | Sales_millions -| ---| ---- | ---- | ---| -| Paunch Burger | x1 | y1 | 5 -| Sweetums | x2 | y2 | 30 -| Jurassic Fork | x3 | y3 | 2 -| Gryzzl | x4 | y4 | 40 - - -## Create Geometry Column from Latitude and Longitude Coordinates -Sometimes, latitude and longitude coordinates are given in a tabular form. The file is read in as a dataframe (df), but it needs to be converted into a geodataframe (gdf). The `geometry` column contains a Shapely object (point, line, or polygon), and is what makes it a geodataframe. A gdf can be exported as GeoJSON, parquet, or shapefile. - -In ArcGIS/QGIS, this is equivalent to adding XY data, selecting the columns that correspond to latitude and longitude, and exporting the layer as a shapefile. - -First, drop all the points that are potentially problematic (NAs or zeroes). - -``` -# Drop NAs -df = df.dropna(subset=['X', 'Y']) - -# Keep non-zero values for X, Y -df = df[(df.X != 0) & (df.Y != 0)] -``` - -Then, create the `geometry` column. We use a lambda function and apply it to all rows in our df. For every row, take the XY coordinates and make it Point(X,Y). Make sure you set the projection (coordinate reference system)! - -``` -# Rename columns -df.rename(columns = {'X': 'longitude', 'Y':'latitude'}, inplace=True) - -# Create geometry column -gdf = gpd.points_from_xy(df.longitude, df.latitude, crs="EPSG:4326") - -# Project to different CRS. Pawnee is in Indiana, so we'll use EPSG:2965. -# In Southern California, use EPSG:2229. -gdf = gdf.to_crs('EPSG:2965') - -gdf -``` - -| Business | longitude | latitude | Sales_millions | geometry -| ---| ---- | --- | ---| ---| -| Paunch Burger | x1 | y1 | 5 | Point(x1, y1) -| Sweetums | x2 | y2 | 30 | Point(x2, y2) -| Jurassic Fork | x3 | y3 | 2 | Point(x3, y3) -| Gryzzl | x4 | y4 | 40 | Point(x4, y4) - - -## Create Geometry Column from Text -If you are importing your df directly from a CSV or database, the geometry information might be stored as as text. To create our geometry column, we extract the latitude and longitude information and use these components to create a Shapely object. 
- -`df` starts off this way, with column `Coord` stored as text: - -| Business | Coord | Sales_millions | -| ---| ---- | --- | -| Paunch Burger | (x1, y1) | 5 | -| Sweetums | (x2, y2) | 30 | -| Jurassic Fork | (x3, y3) | 2 | -| Gryzzl | (x4, y4) | 40 | - - -First, we split `Coord` at the comma. - -``` -# We want to expand the result into multiple columns. -# Save the result and call it new. -new = df.Coord.str.split(", ", expand = True) -``` - -Then, extract our X, Y components. Put lat, lon into a Shapely object as demonstrated [in the prior section.](#create-geometry-column-from-latitude-and-longitude-coordinates) - -``` -# Make sure only numbers, not parentheses, are captured. Cast it as float. - -# 0 corresponds to the portion before the comma. [1:] means starting from -# the 2nd character, right after the opening parenthesis, to the comma. -df['lat'] = new[0].str[1:].astype(float) - -# 1 corresponds to the portion after the comma. [:-1] means starting from -# right after the comma to the 2nd to last character from the end, which -# is right before the closing parenthesis. -df['lon'] = new[1].str[:-1].astype(float) -``` - - -Or, do it in one swift move: - -``` -df['geometry'] = df.dropna(subset=['Coord']).apply( - lambda x: Point( - float(str(x.Coord).split(",")[0][1:]), - float(str(x.Coord).split(",")[1][:-1]) - ), axis = 1) - - -# Now that you have a geometry column, convert to gdf. -gdf = gpd.GeoDataFrame(df) - -# Set the coordinate reference system. You must set it first before you -# can project. -gdf = df.set_crs('EPSG:4326') -``` - - -## Use a Loop to Do Spatial Joins and Aggregations Over Different Boundaries -Let's say we want to do a spatial join between `df` to 2 different boundaries. Different government departments often use different boundaries for their operations (i.e. city planning districts, water districts, transportation districts, etc). Looping over dictionary items would be an efficient way to do this. - -We want to count the number of stores and total sales within each Council District and Planning District. - -`df`: list of Pawnee stores - -| Business | longitude | latitude | Sales_millions | geometry -| ---| ---- | --- | ---| ---| -| Paunch Burger | x1 | y1 | 5 | Point(x1, y1) -| Sweetums | x2 | y2 | 30 | Point(x2, y2) -| Jurassic Fork | x3 | y3 | 2 | Point(x3, y3) -| Gryzzl | x4 | y4 | 40 | Point(x4, y4) - -`council_district` and `planning_district` are polygon shapefiles while `df` is a point shapefile. For simplicity, `council_district` and `planning_district` both use column `ID` as the unique identifier. - - -``` -# Save the dataframes into dictionaries -boundaries = {'council': council_district, 'planning': planning_district} - -# Create empty dictionaries to hold our results -results = {} - - -# Loop over different boundaries (council, planning) -for key, value in boundaries.items(): - # Define new variables using f string - join_df = f"{key}_join" - agg_df = f"{key}_summary" - # Spatial join, but don't save it into the results dictionary - join_df = gpd.sjoin(df, value, how = 'inner', predicate = 'intersects') - # Aggregate and save results into results dictionary - results[agg_df] = join_df.groupby('ID').agg( - {'Business': 'count', 'Sales_millions': 'sum'}) -``` - -Our results dictionary contains 2 dataframes: `council_summary` and `planning_summary`. 
We can see the contents of the results dictionary using this: -``` -for key, value in results.items(): - display(key) - display(value.head()) - - -# To access the "dataframe", write this: -results["council_summary"].head() -results["planning_summary"].head() -``` - -`council_summary` would look like this, with the total count of Business and sum of Sales_millions within the council district: - -| ID | Business | Sales_millions -| ---| ---- | --- | -| 1 | 2 | 45 -| 2 | 1 | 2 -| 3 | 1 | 30 - - -## Multiple Geometry Columns -Sometimes we want to iterate over different options, and we want to see the results side-by-side. Here, we draw multiple buffers around `df`, specifically, 100 ft and 200 ft buffers. - -``` -# Make sure our projection has US feet as its units -df.to_crs('EPSG:2965') - -# Add other columns for the different buffers -df['geometry100'] = df.geometry.buffer(100) -df['geometry200'] = df.geometry.buffer(200) - -df -``` - -| Business | Sales_millions | geometry | geometry100 | geometry200 -| ---| ---- | --- | ---| ---| -| Paunch Burger | 5 | Point(x1, y1) | polygon | polygon -| Sweetums | 30 | Point(x2, y2) | polygon | polygon -| Jurassic Fork | 2 | Point(x3, y3) | polygon | polygon -| Gryzzl | 40 | Point(x4, y4) | polygon | polygon - - -To create a new gdf with just 100 ft buffers, select the relevant geometry column, `geometry100`, and set it as the geometry of the gdf. - -``` -df100 = df[['Business', 'Sales_millions', - 'geometry100']].set_geometry('geometry100') -``` - -
+(geo-intermediate)= + +# Working with Geospatial Data: Intermediate + +After breezing through the [intro tutorial](geo-intro), you're ready to take your spatial analysis to the next level. + +Below are short demos of other common manipulations of geospatial data. + +- [Create geometry column from latitude and longitude coordinates](#create-geometry-column-from-latitude-and-longitude-coordinates) +- [Create geometry column from text](#create-geometry-column-from-text) +- [Use a loop to do spatial joins and aggregations over different boundaries](#use-a-loop-to-do-spatial-joins-and-aggregations-over-different-boundaries) +- [Multiple geometry columns](#multiple-geometry-columns) + +## Getting Started + +``` +# Import Python packages +import pandas as pd +import geopandas as gpd +from shapely.geometry import Point + +df = pd.read_csv('../folder/pawnee_businesses.csv') +``` + +| Business | X | Y | Sales_millions | +| ------------- | --- | --- | -------------- | +| Paunch Burger | x1 | y1 | 5 | +| Sweetums | x2 | y2 | 30 | +| Jurassic Fork | x3 | y3 | 2 | +| Gryzzl | x4 | y4 | 40 | + +## Create Geometry Column from Latitude and Longitude Coordinates + +Sometimes, latitude and longitude coordinates are given in a tabular form. The file is read in as a dataframe (df), but it needs to be converted into a geodataframe (gdf). The `geometry` column contains a Shapely object (point, line, or polygon), and is what makes it a geodataframe. A gdf can be exported as GeoJSON, parquet, or shapefile. + +In ArcGIS/QGIS, this is equivalent to adding XY data, selecting the columns that correspond to latitude and longitude, and exporting the layer as a shapefile. + +First, drop all the points that are potentially problematic (NAs or zeroes). + +``` +# Drop NAs +df = df.dropna(subset=['X', 'Y']) + +# Keep non-zero values for X, Y +df = df[(df.X != 0) & (df.Y != 0)] +``` + +Then, create the `geometry` column. We use a lambda function and apply it to all rows in our df. For every row, take the XY coordinates and make it Point(X,Y). Make sure you set the projection (coordinate reference system)! + +``` +# Rename columns +df.rename(columns = {'X': 'longitude', 'Y':'latitude'}, inplace=True) + +# Create geometry column +gdf = gpd.points_from_xy(df.longitude, df.latitude, crs="EPSG:4326") + +# Project to different CRS. Pawnee is in Indiana, so we'll use EPSG:2965. +# In Southern California, use EPSG:2229. +gdf = gdf.to_crs('EPSG:2965') + +gdf +``` + +| Business | longitude | latitude | Sales_millions | geometry | +| ------------- | --------- | -------- | -------------- | ------------- | +| Paunch Burger | x1 | y1 | 5 | Point(x1, y1) | +| Sweetums | x2 | y2 | 30 | Point(x2, y2) | +| Jurassic Fork | x3 | y3 | 2 | Point(x3, y3) | +| Gryzzl | x4 | y4 | 40 | Point(x4, y4) | + +## Create Geometry Column from Text + +If you are importing your df directly from a CSV or database, the geometry information might be stored as as text. To create our geometry column, we extract the latitude and longitude information and use these components to create a Shapely object. + +`df` starts off this way, with column `Coord` stored as text: + +| Business | Coord | Sales_millions | +| ------------- | -------- | -------------- | +| Paunch Burger | (x1, y1) | 5 | +| Sweetums | (x2, y2) | 30 | +| Jurassic Fork | (x3, y3) | 2 | +| Gryzzl | (x4, y4) | 40 | + +First, we split `Coord` at the comma. + +``` +# We want to expand the result into multiple columns. +# Save the result and call it new. 
+new = df.Coord.str.split(", ", expand = True) +``` + +Then, extract our X, Y components. Put lat, lon into a Shapely object as demonstrated [in the prior section.](#create-geometry-column-from-latitude-and-longitude-coordinates) + +``` +# Make sure only numbers, not parentheses, are captured. Cast it as float. + +# 0 corresponds to the portion before the comma. [1:] means starting from +# the 2nd character, right after the opening parenthesis, to the comma. +df['lat'] = new[0].str[1:].astype(float) + +# 1 corresponds to the portion after the comma. [:-1] means starting from +# right after the comma to the 2nd to last character from the end, which +# is right before the closing parenthesis. +df['lon'] = new[1].str[:-1].astype(float) +``` + +Or, do it in one swift move: + +``` +df['geometry'] = df.dropna(subset=['Coord']).apply( + lambda x: Point( + float(str(x.Coord).split(",")[0][1:]), + float(str(x.Coord).split(",")[1][:-1]) + ), axis = 1) + + +# Now that you have a geometry column, convert to gdf. +gdf = gpd.GeoDataFrame(df) + +# Set the coordinate reference system. You must set it first before you +# can project. +gdf = df.set_crs('EPSG:4326') +``` + +## Use a Loop to Do Spatial Joins and Aggregations Over Different Boundaries + +Let's say we want to do a spatial join between `df` to 2 different boundaries. Different government departments often use different boundaries for their operations (i.e. city planning districts, water districts, transportation districts, etc). Looping over dictionary items would be an efficient way to do this. + +We want to count the number of stores and total sales within each Council District and Planning District. + +`df`: list of Pawnee stores + +| Business | longitude | latitude | Sales_millions | geometry | +| ------------- | --------- | -------- | -------------- | ------------- | +| Paunch Burger | x1 | y1 | 5 | Point(x1, y1) | +| Sweetums | x2 | y2 | 30 | Point(x2, y2) | +| Jurassic Fork | x3 | y3 | 2 | Point(x3, y3) | +| Gryzzl | x4 | y4 | 40 | Point(x4, y4) | + +`council_district` and `planning_district` are polygon shapefiles while `df` is a point shapefile. For simplicity, `council_district` and `planning_district` both use column `ID` as the unique identifier. + +``` +# Save the dataframes into dictionaries +boundaries = {'council': council_district, 'planning': planning_district} + +# Create empty dictionaries to hold our results +results = {} + + +# Loop over different boundaries (council, planning) +for key, value in boundaries.items(): + # Define new variables using f string + join_df = f"{key}_join" + agg_df = f"{key}_summary" + # Spatial join, but don't save it into the results dictionary + join_df = gpd.sjoin(df, value, how = 'inner', predicate = 'intersects') + # Aggregate and save results into results dictionary + results[agg_df] = join_df.groupby('ID').agg( + {'Business': 'count', 'Sales_millions': 'sum'}) +``` + +Our results dictionary contains 2 dataframes: `council_summary` and `planning_summary`. 
We can see the contents of the results dictionary using this: + +``` +for key, value in results.items(): + display(key) + display(value.head()) + + +# To access the "dataframe", write this: +results["council_summary"].head() +results["planning_summary"].head() +``` + +`council_summary` would look like this, with the total count of Business and sum of Sales_millions within the council district: + +| ID | Business | Sales_millions | +| --- | -------- | -------------- | +| 1 | 2 | 45 | +| 2 | 1 | 2 | +| 3 | 1 | 30 | + +## Multiple Geometry Columns + +Sometimes we want to iterate over different options, and we want to see the results side-by-side. Here, we draw multiple buffers around `df`, specifically, 100 ft and 200 ft buffers. + +``` +# Make sure our projection has US feet as its units +df.to_crs('EPSG:2965') + +# Add other columns for the different buffers +df['geometry100'] = df.geometry.buffer(100) +df['geometry200'] = df.geometry.buffer(200) + +df +``` + +| Business | Sales_millions | geometry | geometry100 | geometry200 | +| ------------- | -------------- | ------------- | ----------- | ----------- | +| Paunch Burger | 5 | Point(x1, y1) | polygon | polygon | +| Sweetums | 30 | Point(x2, y2) | polygon | polygon | +| Jurassic Fork | 2 | Point(x3, y3) | polygon | polygon | +| Gryzzl | 40 | Point(x4, y4) | polygon | polygon | + +To create a new gdf with just 100 ft buffers, select the relevant geometry column, `geometry100`, and set it as the geometry of the gdf. + +``` +df100 = df[['Business', 'Sales_millions', + 'geometry100']].set_geometry('geometry100') +``` + +
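+The loop pattern from earlier in this tutorial can be reused to compare the buffer options. The sketch below assumes `df` is the projected gdf built above and that `council_district` has been re-projected to the same CRS.
+
+```
+buffer_results = {}
+
+for distance in [100, 200]:
+    # Make the buffer column the active geometry
+    buffered = df.set_geometry(f'geometry{distance}')
+    # Spatial join against council districts, then count businesses per district
+    joined = gpd.sjoin(buffered, council_district, how = 'inner', predicate = 'intersects')
+    buffer_results[distance] = joined.groupby('ID').agg({'Business': 'count'})
+```
+
+Comparing `buffer_results[100]` and `buffer_results[200]` side by side shows how sensitive the counts are to the buffer distance.
+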
diff --git a/docs/analytics_new_analysts/08-spatial-analysis-advanced.md b/docs/analytics_new_analysts/08-spatial-analysis-advanced.md index 2c321af901..abeec1dbd8 100644 --- a/docs/analytics_new_analysts/08-spatial-analysis-advanced.md +++ b/docs/analytics_new_analysts/08-spatial-analysis-advanced.md @@ -1,57 +1,60 @@ -(geo-advanced)= -# Working with Geospatial Data: Advanced - -Place matters. After covering the [intermediate tutorial](geo-intermediate), you're ready to cover some advanced spatial analysis topics. - -Below are more detailed explanations for dealing with geometry in Python. -* [Types of geometric shapes](#types-of-geometric-shapes) -* [Geometry in-memory and in databases](#geometry-in-memory-and-in-databases) - - -## Getting Started - -``` -# Import Python packages -import pandas as pd -import geopandas as gpd -from shapely.geometry import Point -from geoalchemy2 import WKTElement -``` - -## Types of Geometric Shapes -There are six possible geometric shapes that are represented in geospatial data. [More description here.](http://postgis.net/workshops/postgis-intro/geometries.html#representing-real-world-objects) -* Point -* MultiPoint: collection of points -* LineString -* MultiLineString: collection of linestrings, which are disconnected from each other -* Polygon -* MultiPolygon: collection of polygons, which can be disconnected or overlapping from each other - -The ArcGIS equivalent of these are just points, lines, and polygons. - - -## Geometry In-Memory and in Databases -If you're loading a GeoDataFrame (gdf), having the `geometry` column is necessary to do spatial operations in your Python session. The `geometry` column is composed of Shapely objects, such as Point or MultiPoint, LineString or MultiLineString, and Polygon or MultiPolygon. - -Databases often store geospatial information as well-known text (WKT) or its binary equivalent, well-known binary (WKB). These are well-specified interchange formats for the importing and exporting of geospatial data. Often, querying a database (PostGIS, SpatiaLite, etc) or writing data to the database requires converting the `geometry` column to/from WKT/WKB. - -The spatial referencing system identifier (SRID) is the **geographic coordinate system** of the latitude and longitude coordinates. As you are writing the coordinates into WKT/WKB, don't forget to set the SRID. WGS84 is a commonly used geographic coordinate system; it provides latitude and longitude in decimal degrees. The SRID for WGS84 is 4326. [Refresher on geographic coordinated system vs projected coordinated system.](geo-basics) - -*Shapely* is the Python package used to create the `geometry` column when you're working with the gdf in-memory. *Geoalchemy* is the Python package used to write the `geometry` column into geospatial databases. Unless you're writing the geospatial data into a database, you're most likely sticking with *shapely* rather than *geoalchemy*. - -To summarize: - -| Data is used / sourced from... | Python Package | Geometry column | SRID/EPSG -| ---| ---- | --- | --- | -| Local Python session, in-memory | shapely | shapely object: Point, LineString, Polygon and Multi equivalents | CRS is usually set, but most likely will still need to re-project your CRS using EPSG -| Database (PostGIS, SpatiaLite, etc) | geoalchemy | WKT or WKB | define the SRID - -``` -# Set the SRID -srid = 4326 -df = df.dropna(subset=['lat', 'lon']) -df['geometry'] = df.apply( - lambda x: WKTElement(Point(x.lon, x.lat).wkt, srid=srid), axis = 1) -``` - -
+(geo-advanced)=
+
+# Working with Geospatial Data: Advanced
+
+Place matters. After covering the [intermediate tutorial](geo-intermediate), you're ready to cover some advanced spatial analysis topics.
+
+Below are more detailed explanations for dealing with geometry in Python.
+
+- [Types of geometric shapes](#types-of-geometric-shapes)
+- [Geometry in-memory and in databases](#geometry-in-memory-and-in-databases)
+
+## Getting Started
+
+```
+# Import Python packages
+import pandas as pd
+import geopandas as gpd
+from shapely.geometry import Point
+from geoalchemy2 import WKTElement
+```
+
+## Types of Geometric Shapes
+
+There are six possible geometric shapes that are represented in geospatial data. [More description here.](http://postgis.net/workshops/postgis-intro/geometries.html#representing-real-world-objects)
+
+- Point
+- MultiPoint: collection of points
+- LineString
+- MultiLineString: collection of linestrings, which are disconnected from each other
+- Polygon
+- MultiPolygon: collection of polygons, which can be disconnected from or overlapping with each other
+
+The ArcGIS equivalents of these are just points, lines, and polygons.
+
+## Geometry In-Memory and in Databases
+
+If you're loading a GeoDataFrame (gdf), having the `geometry` column is necessary to do spatial operations in your Python session. The `geometry` column is composed of Shapely objects, such as Point or MultiPoint, LineString or MultiLineString, and Polygon or MultiPolygon.
+
+Databases often store geospatial information as well-known text (WKT) or its binary equivalent, well-known binary (WKB). These are well-specified interchange formats for importing and exporting geospatial data. Often, querying a database (PostGIS, SpatiaLite, etc.) or writing data to the database requires converting the `geometry` column to/from WKT/WKB.
+
+The spatial reference system identifier (SRID) is the **geographic coordinate system** of the latitude and longitude coordinates. As you write the coordinates into WKT/WKB, don't forget to set the SRID. WGS84 is a commonly used geographic coordinate system; it provides latitude and longitude in decimal degrees. The SRID for WGS84 is 4326. [Refresher on geographic coordinate systems vs projected coordinate systems.](geo-basics)
+
+*Shapely* is the Python package used to create the `geometry` column when you're working with the gdf in-memory. *Geoalchemy* is the Python package used to write the `geometry` column into geospatial databases. Unless you're writing the geospatial data into a database, you're most likely sticking with *shapely* rather than *geoalchemy*.
+
+To summarize:
+
+| Data is used / sourced from...      | Python Package | Geometry column                                                  | SRID/EPSG                                                                                  |
+| ----------------------------------- | -------------- | ---------------------------------------------------------------- | ------------------------------------------------------------------------------------------ |
+| Local Python session, in-memory     | shapely        | shapely object: Point, LineString, Polygon and Multi equivalents | CRS is usually set, but you will most likely still need to re-project using an EPSG code   |
+| Database (PostGIS, SpatiaLite, etc) | geoalchemy     | WKT or WKB                                                        | define the SRID                                                                            |
+
+```
+# Set the SRID
+srid = 4326
+df = df.dropna(subset=['lat', 'lon'])
+df['geometry'] = df.apply(
+    lambda x: WKTElement(Point(x.lon, x.lat).wkt, srid=srid), axis=1)
+```
+
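Going the other direction — pulling WKT out of a database and back into an in-memory gdf — might look roughly like the sketch below. It assumes a plain DataFrame `df` with a `wkt` column of WKT strings stored with SRID 4326; the column names are illustrative.

```python
import geopandas as gpd
from shapely import wkt

# Parse the WKT strings back into shapely objects
df['geometry'] = df['wkt'].apply(wkt.loads)

# Rebuild a GeoDataFrame and declare the CRS that matches the stored SRID
gdf = gpd.GeoDataFrame(df, geometry='geometry', crs='EPSG:4326')
```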
diff --git a/docs/analytics_new_analysts/overview.md b/docs/analytics_new_analysts/overview.md index 6bc6e50da1..778cc1197f 100644 --- a/docs/analytics_new_analysts/overview.md +++ b/docs/analytics_new_analysts/overview.md @@ -1,29 +1,34 @@ (beginner_analysts_tutorials)= + # Tutorials for New Python Users + This section is geared towards data analysts who are new to Python. The following tutorials highlight the most relevant Python skills used at Cal ITP. Use them to guide you through completing [practice exercises #1-9](https://github.com/cal-itp/data-analyses/tree/main/starter_kit). ## Content: -* [Data Analysis: Introduction](pandas-intro) -* [Data Analysis: Intermediate](pandas-intermediate) -* [Data Management](data-management-page) -* [Best Practices for Jupyter Notebooks](nb-best-practices) -* [Working with Geospatial Data: Basics](geo-basics) -* [Working with Geospatial Data: Intro](geo-intro) -* [Working with Geospatial Data: Intermediate](geo-intermediate) -* [Working with Geospatial Data: Advanced](geo-advanced) + +- [Data Analysis: Introduction](pandas-intro) +- [Data Analysis: Intermediate](pandas-intermediate) +- [Data Management](data-management-page) +- [Best Practices for Jupyter Notebooks](nb-best-practices) +- [Working with Geospatial Data: Basics](geo-basics) +- [Working with Geospatial Data: Intro](geo-intro) +- [Working with Geospatial Data: Intermediate](geo-intermediate) +- [Working with Geospatial Data: Advanced](geo-advanced) ## Additional Resources: -* If you are new to Python, take a look at [all the Python tutorials](https://www.linkedin.com/learning/search?keywords=python&u=36029164) available through Caltrans. There are many introductory Python courses [such as this one.](https://www.linkedin.com/learning/python-essential-training-18764650/getting-started-with-python?autoplay=true&u=36029164) -* [Joris van den Bossche's Geopandas Tutorial](https://github.com/jorisvandenbossche/geopandas-tutorial) -* [Practical Python for Data Science by Jill Cates](https://www.practicalpythonfordatascience.com/intro.html) -* [Ben-Gurion University of the Negev - Geometric operations](https://geobgu.xyz/py/geopandas2.html) -* [Geographic Thinking for Data Scientists](https://geographicdata.science/book/notebooks/01_geo_thinking.html) -* [Python Courses, compiled by our team](https://docs.google.com/spreadsheets/d/1Omow8F0SUiMx1jyG7GpbwnnJ5yWqlLeMH7SMtKxwG80/edit?usp=sharing) -* [Why Dask?](https://docs.dask.org/en/stable/why.html) -* [10 Minutes to Dask](https://docs.dask.org/en/stable/10-minutes-to-dask.html) + +- If you are new to Python, take a look at [all the Python tutorials](https://www.linkedin.com/learning/search?keywords=python&u=36029164) available through Caltrans. 
There are many introductory Python courses [such as this one.](https://www.linkedin.com/learning/python-essential-training-18764650/getting-started-with-python?autoplay=true&u=36029164) +- [Joris van den Bossche's Geopandas Tutorial](https://github.com/jorisvandenbossche/geopandas-tutorial) +- [Practical Python for Data Science by Jill Cates](https://www.practicalpythonfordatascience.com/intro.html) +- [Ben-Gurion University of the Negev - Geometric operations](https://geobgu.xyz/py/geopandas2.html) +- [Geographic Thinking for Data Scientists](https://geographicdata.science/book/notebooks/01_geo_thinking.html) +- [Python Courses, compiled by our team](https://docs.google.com/spreadsheets/d/1Omow8F0SUiMx1jyG7GpbwnnJ5yWqlLeMH7SMtKxwG80/edit?usp=sharing) +- [Why Dask?](https://docs.dask.org/en/stable/why.html) +- [10 Minutes to Dask](https://docs.dask.org/en/stable/10-minutes-to-dask.html) ### Books: -* [The Performance Stat Potential](https://www.brookings.edu/book/the-performancestat-potential/) -* [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) -* [Data Wrangling With Python](http://shop.oreilly.com/product/0636920032861.do) -* [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook/tree/master/notebooks) + +- [The Performance Stat Potential](https://www.brookings.edu/book/the-performancestat-potential/) +- [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) +- [Data Wrangling With Python](http://shop.oreilly.com/product/0636920032861.do) +- [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook/tree/master/notebooks) diff --git a/docs/analytics_onboarding/overview.md b/docs/analytics_onboarding/overview.md index 4cb510c2fb..6cfcebbeac 100644 --- a/docs/analytics_onboarding/overview.md +++ b/docs/analytics_onboarding/overview.md @@ -1,52 +1,59 @@ (technical-onboarding)= + # Technical Onboarding + :::{admonition} Bookmark this page for quick access to team resources :class: tip **Analysts** -* See below for list of our collaboration and analysis tools. -* As part of your onboarding, privileges have already been created for you to access the resources below. + +- See below for list of our collaboration and analysis tools. +- As part of your onboarding, privileges have already been created for you to access the resources below. **Non-Analyst Team Members** -* Any of the tools below are available to you as well! + +- Any of the tools below are available to you as well! **If you still need help with access**, use the information at the bottom of this page to [**get help**](get-help). 
::: **Collaboration Tools:** -- [ ] [**Slack**](https://cal-itp.slack.com) | ([Docs](slack-intro)) -- [ ] [**Analytics Repo**](https://github.com/cal-itp/data-analyses) | ([Docs](analytics-repo)) -- [ ] [**Analyst Project Board**](https://github.com/cal-itp/data-analyses/projects/1) | ([Docs](analytics-project-board)) -- [ ] [**Google Cloud Storage**](https://console.cloud.google.com/storage/browser/calitp-analytics-data) | ([Docs](storing-new-data)) +- [ ] [**Slack**](https://cal-itp.slack.com) | ([Docs](slack-intro)) +- [ ] [**Analytics Repo**](https://github.com/cal-itp/data-analyses) | ([Docs](analytics-repo)) +- [ ] [**Analyst Project Board**](https://github.com/cal-itp/data-analyses/projects/1) | ([Docs](analytics-project-board)) +- [ ] [**Google Cloud Storage**](https://console.cloud.google.com/storage/browser/calitp-analytics-data) | ([Docs](storing-new-data)) **Analytics Tools:** -- [ ] **[notebooks.calitp.org](https://notebooks.calitp.org/)** - JupyterHub cloud-based notebooks for querying Python, SQL, R | ([Docs](jupyterhub-intro)) -- [ ] **[dashboards.calitp.org](https://dashboards.calitp.org/)** - Metabase business insights & dashboards | ([Docs](metabase)) -- [ ] **[dbt-docs.calitp.org](https://dbt-docs.calitp.org/)** - Documentation for the Cal-ITP data warehouse. -- [ ] **[analysis.calitp.org](https://analysis.calitp.org/)** - The Cal-ITP analytics portfolio website. | (Docs WIP) -- [ ] [**Google BigQuery**](https://console.cloud.google.com/bigquery) - Viewing the data warehouse and querying SQL +- [ ] **[notebooks.calitp.org](https://notebooks.calitp.org/)** - JupyterHub cloud-based notebooks for querying Python, SQL, R | ([Docs](jupyterhub-intro)) +- [ ] **[dashboards.calitp.org](https://dashboards.calitp.org/)** - Metabase business insights & dashboards | ([Docs](metabase)) +- [ ] **[dbt-docs.calitp.org](https://dbt-docs.calitp.org/)** - Documentation for the Cal-ITP data warehouse. +- [ ] **[analysis.calitp.org](https://analysis.calitp.org/)** - The Cal-ITP analytics portfolio website. 
| (Docs WIP) +- [ ] [**Google BigQuery**](https://console.cloud.google.com/bigquery) - Viewing the data warehouse and querying SQL **Python Libraries:** -- [ ] **calitp-data-analysis** - Cal-ITP's internal Python library for analysis | ([Docs](calitp-data-analysis)) -- [ ] **siuba** - Recommended data analysis library | ([Docs](siuba)) -- [ ] [**shared_utils**](https://github.com/cal-itp/data-analyses/tree/main/_shared_utils) - A shared utilities library for the analytics team | ([Docs](shared-utils)) +- [ ] **calitp-data-analysis** - Cal-ITP's internal Python library for analysis | ([Docs](calitp-data-analysis)) +- [ ] **siuba** - Recommended data analysis library | ([Docs](siuba)) +- [ ] [**shared_utils**](https://github.com/cal-itp/data-analyses/tree/main/_shared_utils) - A shared utilities library for the analytics team | ([Docs](shared-utils)) **Caltrans Employee Resources:** -- [ ] [**OnRamp**](https://onramp.dot.ca.gov/) - Caltrans employee intranet -- [ ] [**Service Now (SNOW)**](https://cdotprod.service-now.com/sp) - Caltrans IT Service Management Portal for IT issues and requesting specific software -- [ ] [**Cal Employee Connect**](https://connect.sco.ca.gov/) - State Controller's Office site for paystubs and tax information -- [ ] [**Geospatial Enterprise Engagement Platform - GIS Account Request Form**](https://sv03tmcpo.ct.dot.ca.gov/portal/apps/sites/#/geep/pages/account-request) (optional) - User request form for ArcGIS Online and ArcGIS Portal accounts +- [ ] [**OnRamp**](https://onramp.dot.ca.gov/) - Caltrans employee intranet +- [ ] [**Service Now (SNOW)**](https://cdotprod.service-now.com/sp) - Caltrans IT Service Management Portal for IT issues and requesting specific software +- [ ] [**Cal Employee Connect**](https://connect.sco.ca.gov/) - State Controller's Office site for paystubs and tax information +- [ ] [**Geospatial Enterprise Engagement Platform - GIS Account Request Form**](https://sv03tmcpo.ct.dot.ca.gov/portal/apps/sites/#/geep/pages/account-request) (optional) - User request form for ArcGIS Online and ArcGIS Portal accounts -  +  (get-help)= + ```{admonition} Still need access to a non-Caltrans tool above? Ask on the `#services-team` channel in the Cal-ITP Slack. ``` ## New Analyst Training Curriculum + This is a rough guide to your first few weeks on our team: + 1. Week 1 -- Introduction to Caltrans, Cal-ITP, and Division of Data & Digital Services: Includes non-technical 1:1 chats with the rest of the analyst team to meet and discuss ongoing and past projects. Also includes peer-guided introduction to [transit data](https://docs.calitp.org/data-infra/warehouse/what_is_gtfs.html) and transportation grants concepts. 2. Weeks 1-2 -- Technical onboarding (GitHub, JupyterHub, Google products): Includes working through an example push/pull/commit workflow with your Personal README. 3. Weeks 2-3 -- Introduction to our data: Includes learning what is in and how to access our data [warehouse](https://docs.calitp.org/data-infra/warehouse/warehouse_starter_kit.html) and Airtable, with Python and in Metabase. Also includes basic data visualization concepts. 
diff --git a/docs/analytics_tools/bi_dashboards.md b/docs/analytics_tools/bi_dashboards.md index 248292b6e3..576c39196e 100644 --- a/docs/analytics_tools/bi_dashboards.md +++ b/docs/analytics_tools/bi_dashboards.md @@ -12,9 +12,13 @@ kernelspec: language: python name: python3 --- + (metabase)= + # Business Insights & Dashboards + ## Introduction to Metabase + Metabase is Cal-ITP's dashboarding tool, and it's where data generated by Cal-ITP's data pipeline can be turned in to graphs, tables, and robust dashboards. @@ -27,9 +31,9 @@ Metabase is best utilized for the Cal-ITP project. If you have never used Metabase before, there are a few important terms: -* **Question:** A single table or graph, created by you with either SQL or Metabase's point-and-click UI. -* **Dashboard:** A group of questions that work together to support holistic data exploration and help inform decisions. -* **Collection:** Essentially, a "folder" for us to put relevant questions and dashboards into. +- **Question:** A single table or graph, created by you with either SQL or Metabase's point-and-click UI. +- **Dashboard:** A group of questions that work together to support holistic data exploration and help inform decisions. +- **Collection:** Essentially, a "folder" for us to put relevant questions and dashboards into. You can also incorporate filters into your questions and dashboards, specify custom click behavior in your dashboards, and more. @@ -41,23 +45,23 @@ which is full of useful tutorials on how to work with Metabase questions, dashbo Below are some helpful articles from Metabase's learn section: -* Getting Started - * [Getting Started with Metabase](https://www.metabase.com/learn/getting-started/getting-started.html) (Video included!) - * [A Tour of Metabase](https://www.metabase.com/learn/getting-started/tour-of-metabase.html) -* Asking Questions - * [Create Charts with Explorable Data](https://www.metabase.com/learn/questions/drill-through.html) - * [Custom Expressions in the Notebook Editor](https://www.metabase.com/learn/questions/custom-expressions.html) -* Working with SQL - * [Best Practices for Writing SQL Queries](https://www.metabase.com/learn/sql-questions/sql-best-practices.html) - * [Create Filter Widgets for Charts Using SQL Variables](https://www.metabase.com/learn/sql-questions/sql-variables.html) (Video included!) - * [Field Filters](https://www.metabase.com/learn/sql-questions/field-filters.html) -* Visualizing Data - * [Which Chart Should You Use?](https://www.metabase.com/learn/visualization/chart-guide.html) - * [Everything You Can Do with the Table Visualization](https://www.metabase.com/learn/visualization/table.html) - * [Visualizing Data with Maps](https://www.metabase.com/learn/visualization/maps.html) (Video included!) -* Building Dashboards - * [Best Practices for BI Dashboards](https://www.metabase.com/learn/dashboards/bi-dashboard-best-practices.html) - * [Linking Filters](https://www.metabase.com/learn/dashboards/linking-filters.html) +- Getting Started + - [Getting Started with Metabase](https://www.metabase.com/learn/getting-started/getting-started.html) (Video included!) 
+ - [A Tour of Metabase](https://www.metabase.com/learn/getting-started/tour-of-metabase.html) +- Asking Questions + - [Create Charts with Explorable Data](https://www.metabase.com/learn/questions/drill-through.html) + - [Custom Expressions in the Notebook Editor](https://www.metabase.com/learn/questions/custom-expressions.html) +- Working with SQL + - [Best Practices for Writing SQL Queries](https://www.metabase.com/learn/sql-questions/sql-best-practices.html) + - [Create Filter Widgets for Charts Using SQL Variables](https://www.metabase.com/learn/sql-questions/sql-variables.html) (Video included!) + - [Field Filters](https://www.metabase.com/learn/sql-questions/field-filters.html) +- Visualizing Data + - [Which Chart Should You Use?](https://www.metabase.com/learn/visualization/chart-guide.html) + - [Everything You Can Do with the Table Visualization](https://www.metabase.com/learn/visualization/table.html) + - [Visualizing Data with Maps](https://www.metabase.com/learn/visualization/maps.html) (Video included!) +- Building Dashboards + - [Best Practices for BI Dashboards](https://www.metabase.com/learn/dashboards/bi-dashboard-best-practices.html) + - [Linking Filters](https://www.metabase.com/learn/dashboards/linking-filters.html) ## Metabase at Cal-ITP @@ -69,9 +73,9 @@ to ensure only certain users are able to edit certain questions. This approach h concerns of question/dashboard ownership and responsibility. As of this writing, there are three primary collections: -* **Cal-ITP Dashboards:** Read-only dashboards, editable only by Cal-ITP analysts -* **Cal-ITP Commons:** Permission-less sharing of questions and dashboards -* **Cal-ITP Development:** Analyst-only collection for works in progress +- **Cal-ITP Dashboards:** Read-only dashboards, editable only by Cal-ITP analysts +- **Cal-ITP Commons:** Permission-less sharing of questions and dashboards +- **Cal-ITP Development:** Analyst-only collection for works in progress In general, the **Cal-ITP Dashboards** collection is used for what we might consider "complete" or "official" sources of truth for the work done by Cal-ITP analysts. diff --git a/docs/analytics_tools/data_catalogs.md b/docs/analytics_tools/data_catalogs.md index e2154c6dc7..5935c0abd3 100644 --- a/docs/analytics_tools/data_catalogs.md +++ b/docs/analytics_tools/data_catalogs.md @@ -12,7 +12,9 @@ kernelspec: language: python name: python3 --- + (data-catalogs)= + # Using Data Catalogs One major difficulty with conducting reproducible analyses is the location of data. If a data analyst downloads a CSV on their local system, but does not document its provenance or access, the analysis becomes very difficult to reproduce. @@ -24,9 +26,9 @@ Each task sub-folder within the `data-analyses` repo should come with its own da ## Table of Contents 1. Data Catalogs with [Intake](#intake) -1. [Open Data Portals](#open-data-portals) -1. [Google Cloud Storage](#google-cloud-storage) (GCS) Buckets -1. [Sample Data Catalog](#sample-data-catalog) +2. [Open Data Portals](#open-data-portals) +3. [Google Cloud Storage](#google-cloud-storage) (GCS) Buckets +4. [Sample Data Catalog](#sample-data-catalog) ### Intake @@ -37,8 +39,9 @@ Data analysts tend to load their data from many heterogeneous sources (Databases Refer to this [sample-catalog.yml](sample-catalog) to see how various data sources and file types are documented. Each dataset is given a human-readable name, with optional metadata associated. 
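Once a catalog exists, its entries and their metadata can be inspected directly from Python. A minimal sketch, assuming a `catalog.yml` in the working directory and a hypothetical entry named `my_dataset`:

```python
import intake

catalog = intake.open_catalog("./catalog.yml")

# List every dataset documented in the catalog
print(list(catalog))

# Look at one entry's metadata, then load it as a dataframe
source = catalog["my_dataset"]  # hypothetical entry name
print(source.describe())
df = source.read()
```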
File types that work within GCS buckets, URLs, or DCATs (open data catalogs): -* Tabular: CSV, parquet -* Geospatial: zipped shapefile, GeoJSON, geoparquet + +- Tabular: CSV, parquet +- Geospatial: zipped shapefile, GeoJSON, geoparquet To open the catalog in a Jupyter Notebook: @@ -55,10 +58,10 @@ catalog = intake.open_catalog("./*.yml") Open data portals (such as the CA Open Data Portal and CA Geoportal) usually provide a DCAT catalog for their datasets, including links for downloading them and metadata describing them. Many civic data analysis projects end up using these open datasets. When they do, it should be clearly documented. -* To input a dataset from an open data portal, find the dataset's identifier for the `catalog.yml`. -* Ex: The URL for CA Open Data Portal is: https://data.ca.gov. -* Navigate to the corresponding `data.json` file at https://data.ca.gov/data.json -* Each dataset has associated metadata, including `accessURL`, `landingPage`, etc. Find the dataset's `identifier`, and input that as the catalog item. +- To input a dataset from an open data portal, find the dataset's identifier for the `catalog.yml`. +- Ex: The URL for CA Open Data Portal is: https://data.ca.gov. +- Navigate to the corresponding `data.json` file at https://data.ca.gov/data.json +- Each dataset has associated metadata, including `accessURL`, `landingPage`, etc. Find the dataset's `identifier`, and input that as the catalog item. ```yaml # Catalog item @@ -75,12 +78,14 @@ To import this dataset as a dataframe within the notebook: ```python df = catalog.ca_open_data.cdcr_population_covid_tracking.read() ``` + (catalogue-cloud-storage)= + ### Google Cloud Storage When putting GCS files into the catalog, note that geospatial datasets (zipped shapefiles, GeoJSONs) require the additional `use_fsspec: true` argument compared to tabular datasets (parquets, CSVs). Geoparquets, the exception, are catalogued like tabular datasets. -Opening geospatial datasets through `intake` is the easiest way to import these datasets within a Jupyter Notebook. Otherwise, `geopandas` can read the geospatial datasets that are locally saved or downloaded first from the bucket, but not directly with a GCS file path. Refer to [storing data](Connecting to the Warehouse) to set up your Google authentication. +Opening geospatial datasets through `intake` is the easiest way to import these datasets within a Jupyter Notebook. Otherwise, `geopandas` can read the geospatial datasets that are locally saved or downloaded first from the bucket, but not directly with a GCS file path. Refer to \[storing data\](Connecting to the Warehouse) to set up your Google authentication. ```yaml lehd_federal_jobs_by_tract: @@ -120,6 +125,7 @@ gdf2 = catalog.test_geoparquet.read() ``` (sample-catalog)= + # Sample Data Catalog ```{literalinclude} sample-catalog.yml diff --git a/docs/analytics_tools/github_setup.md b/docs/analytics_tools/github_setup.md index a333874bf6..68d88de152 100644 --- a/docs/analytics_tools/github_setup.md +++ b/docs/analytics_tools/github_setup.md @@ -1,17 +1,20 @@ (github_setup)= + # GitHub Setup ## Table of Contents + 1. 
[Onboarding Setup](#onboarding-setup) - * [Adding a GitHub SSH Key to Jupyter](authenticating-github-jupyter) - * [Persisting your SSH Key and Enabling Extensions](persisting-ssh-and-extensions) - * [Cloning a Repository](cloning-a-repository) + - [Adding a GitHub SSH Key to Jupyter](authenticating-github-jupyter) + - [Persisting your SSH Key and Enabling Extensions](persisting-ssh-and-extensions) + - [Cloning a Repository](cloning-a-repository) ## Onboarding Setup We'll work through getting set up with SSH and GitHub on JupyterHub and cloning one GitHub repo. This is the first task you'll need to complete before contributing code. Repeat the steps in [Cloning a Repository](cloning-a-repository) for other repos. (authenticating-github-jupyter)= + ### Authenticating to GitHub via the gh CLI > This section describes using the GitHub CLI to set up SSH access, but the generic instructions can be found [here](https://docs.github.com/en/authentication/connecting-to-github-with-ssh). @@ -19,6 +22,7 @@ We'll work through getting set up with SSH and GitHub on JupyterHub and cloning 1. Create a GitHub username if necessary and ensure you're added to the appropriate Cal-ITP teams on GitHub. You'll be committing directly into the Cal-ITP repos! 2. Open a terminal in JupyterHub. All of our commands will be typed in this terminal. 3. `gh auth login` and select the following options + ``` (base) jovyan@f4b18b106c18:~$ gh auth login ? What account do you want to log into? GitHub.com @@ -28,9 +32,11 @@ We'll work through getting set up with SSH and GitHub on JupyterHub and cloning ? Title for your SSH key: GitHub CLI ? How would you like to authenticate GitHub CLI? Login with a web browser ``` + You can press `Enter` to leave the passphrase empty, or you may provide a password; in the future, you will need to enter this password when your server starts. If you've already created an SSH key, you will be prompted to select the existing key rather than creating a new one. 4. You will then be given a one-time code and instructed to press `Enter` to open a web browser, which will fail if you are using JupyterHub. However, you can manually open the link in a browser and enter the code. You will end up with output similar to the following. + ``` ! First copy your one-time code: ABCD-1234 Press Enter to open github.com in your browser... @@ -48,35 +54,40 @@ Press Enter to open github.com in your browser... After completing the steps above be sure to complete the section below to persist your SSH key between sessions and enable extensions. (persisting-ssh-and-extensions)= + ### Persisting your SSH Key and Enabling Extensions + To ensure that your SSH key settings persist between your sessions, run the following command in the Jupyter terminal. 
-* `echo "source .profile" >> .bashrc` +- `echo "source .profile" >> .bashrc` Now, restart your Jupyter server by selecting: -* `File` -> `Hub Control Panel` -> `Stop Server`, then `Start Server` +- `File` -> `Hub Control Panel` -> `Stop Server`, then `Start Server` From here, after opening a new Jupyter terminal you should see the notification: -* `ssh-add: Identities added: /home/jovyan/.ssh/id_ed25519` +- `ssh-add: Identities added: /home/jovyan/.ssh/id_ed25519` If the above doesn't work, try: -* Closing your terminal and opening a new one -* Following the instructions to restart your Jupyter server above -* Substituting the following for the `echo` command above and re-attempting: - * `echo "source .profile" >> .bash_profile` -* Following the steps below to change your .bash_profile: - * In terminal use `cd` to navigate to the home directory (not a repository) - * Type `nano .bash_profile` to open the .bash_profile in a text editor - * Change `source .profile` to `source ~/.profile` - * Exit with Ctrl+X, hit yes, then hit enter at the filename prompt - * Restart your server; you can check your changes with `cat .bash_profile` + +- Closing your terminal and opening a new one +- Following the instructions to restart your Jupyter server above +- Substituting the following for the `echo` command above and re-attempting: + - `echo "source .profile" >> .bash_profile` +- Following the steps below to change your .bash_profile: + - In terminal use `cd` to navigate to the home directory (not a repository) + - Type `nano .bash_profile` to open the .bash_profile in a text editor + - Change `source .profile` to `source ~/.profile` + - Exit with Ctrl+X, hit yes, then hit enter at the filename prompt + - Restart your server; you can check your changes with `cat .bash_profile` After completing this section, you will also enjoy various extensions in Jupyter, such as `black` hotkey auto-formatting with `ctrl+shft+k`, and the ability to see your current git branch in the Jupyter terminal. (cloning-a-repository)= + ### Cloning a Repository + 1. Navigate to the GitHub repository to clone. We'll work our way through the `data-analyses` [repo here](https://github.com/cal-itp/data-analyses). Click on the green `Code` button, select "SSH" and copy the URL. 1. You may be prompted to accept GitHub key's fingerprint if you are cloning a repository for the first time. 2. Clone the Git repo: `git clone git@github.com:cal-itp/data-analyses.git` diff --git a/docs/analytics_tools/jupyterhub.md b/docs/analytics_tools/jupyterhub.md index 601b9c712f..06f7d26169 100644 --- a/docs/analytics_tools/jupyterhub.md +++ b/docs/analytics_tools/jupyterhub.md @@ -1,7 +1,9 @@ (jupyterhub-intro)= + # JupyterHub ## Introduction to JupyterHub + Jupyterhub is a web application that allows users to analyze and create reports on warehouse data (or a number of data sources). Analyses on JupyterHub are accomplished using notebooks, which allow users to mix narrative with analysis code. @@ -9,17 +11,19 @@ Analyses on JupyterHub are accomplished using notebooks, which allow users to mi **You can access JuypterHub using this link: [notebooks.calitp.org](https://notebooks.calitp.org/)**. ## Table of Contents + 1. [Using JupyterHub](#using-jupyterhub) -1. [Logging in to JupyterHub](#logging-in-to-jupyterhub) -1. [Connecting to the Warehouse](#connecting-to-the-warehouse) -1. [Increasing the Query Limit](#increasing-the-query-limit) -1. [Querying with SQL in JupyterHub](querying-sql-jupyterhub) -1. [Saving Code to Github](saving-code-jupyter) -1. 
[Environment Variables](#environment-variables) -1. [Jupyter Notebook Best Practices](notebook-shortcuts) -1. [Developing warehouse models in Jupyter](jupyterhub-warehouse) +2. [Logging in to JupyterHub](#logging-in-to-jupyterhub) +3. [Connecting to the Warehouse](#connecting-to-the-warehouse) +4. [Increasing the Query Limit](#increasing-the-query-limit) +5. [Querying with SQL in JupyterHub](querying-sql-jupyterhub) +6. [Saving Code to Github](saving-code-jupyter) +7. [Environment Variables](#environment-variables) +8. [Jupyter Notebook Best Practices](notebook-shortcuts) +9. [Developing warehouse models in Jupyter](jupyterhub-warehouse) ## Using JupyterHub + For Python users, we have deployed a cloud-based instance of JupyterHub to make creating, using, and sharing notebooks easy. This avoids the need to set up a local environment, provides dedicated storage, and allows you to push to GitHub. @@ -31,6 +35,7 @@ JupyterHub currently lives at [notebooks.calitp.org](https://notebooks.calitp.or Note: you will need to have been added to the Cal-ITP organization on GitHub to obtain access. If you have yet to be added to the organization and need to be, ask in the `#services-team` channel in Slack. (connecting-to-warehouse)= + ### Connecting to the Warehouse Connecting to the warehouse requires a bit of setup after logging in to JupyterHub, but allows users to query data in the warehouse directly. @@ -40,8 +45,8 @@ See the screencast below for a full walkthrough.
- The commands required: + ```bash # init will both authenticate and do basic configuration # You do not have to set a default compute region/zone @@ -69,6 +74,7 @@ tbls._init() ``` (querying-sql-jupyterhub)= + ### Querying with SQL in JupyterHub JupyterHub makes it easy to query SQL in the notebooks. @@ -78,6 +84,7 @@ To query SQL, simply import the below at the top of your notebook: ```python import calitp_data_analysis.magics ``` + And add the following to the top of any cell block that you would like to query SQL in: ```sql @@ -89,6 +96,7 @@ Example: ```python import calitp_data_analysis.magics ``` + ```sql %%sql @@ -99,12 +107,16 @@ WHERE key = "db58891de4281f965b4e7745675415ab" LIMIT 10 ``` + (saving-code-jupyter)= + ### Saving Code to Github + Use [this link](committing-from-jupyterhub) to navigate to the `Saving Code` section of the docs to learn how to commit code to GitHub from the Jupyter terminal. Once there, you will need to complete the instructions in the following sections: -* [Adding a GitHub SSH Key to Jupyter](authenticating-github-jupyter) -* [Persisting your SSH Key and Enabling Extensions](persisting-ssh-and-extensions) -* [Cloning a Repository](cloning-a-repository) + +- [Adding a GitHub SSH Key to Jupyter](authenticating-github-jupyter) +- [Persisting your SSH Key and Enabling Extensions](persisting-ssh-and-extensions) +- [Cloning a Repository](cloning-a-repository) ### Environment Variables @@ -125,6 +137,7 @@ AIRTABLE_API_KEY=ABCDEFG123456789 ``` To pass these credentials in a Jupyter Notebook: + ```python import dotenv import os @@ -137,12 +150,16 @@ GITHUB_API_KEY = os.environ["GITHUB_API_KEY"] ``` (notebook-shortcuts)= + ### Jupyter Notebook Best Practices External resources: -* [Cheat Sheet - Jupyter Notebook ](https://defkey.com/jupyter-notebook-shortcuts?pdf=true&modifiedDate=20200909T053706) -* [Using Markdown in Jupyter Notebook](https://www.datacamp.com/community/tutorials/markdown-in-jupyter-notebook) + +- [Cheat Sheet - Jupyter Notebook ](https://defkey.com/jupyter-notebook-shortcuts?pdf=true&modifiedDate=20200909T053706) +- [Using Markdown in Jupyter Notebook](https://www.datacamp.com/community/tutorials/markdown-in-jupyter-notebook) (jupyterhub-warehouse)= + ### Developing warehouse models in JupyterHub + See the [warehouse README](https://github.com/cal-itp/data-infra/tree/main/warehouse#readme) for warehouse project setup instructions. diff --git a/docs/analytics_tools/knowledge_sharing.md b/docs/analytics_tools/knowledge_sharing.md index 2aefd1b72a..fd28df323f 100644 --- a/docs/analytics_tools/knowledge_sharing.md +++ b/docs/analytics_tools/knowledge_sharing.md @@ -1,69 +1,81 @@ (knowledge-sharing)= + # Helpful Links + Here are some resources data analysts have collected and referenced, that will hopefully help you out in your work. Have something you want to share? 
Create a new markdown file, add it [to the example report folder](https://github.com/cal-itp/data-analyses/tree/main/example_report), and [message Amanda.](https://app.slack.com/client/T014965JTHA/C013N8GELLF/user_profile/U02PCTPSZ8A) -* [Data Analysis](#data-analysis) - * [Python](#python) - * [Pandas](#pandas) - * [Summarizing](#summarizing) - * [Merging](#merging) - * [Dates](#dates) - * [Monetary Values](#monetary-values) -* [Visualizations](#visualization) - * [Charts](#charts) - * [Maps](#maps) - * [DataFrames](#dataframes) - * [Ipywidgets](#ipywidgets) - * [Markdown](#markdown) +- [Data Analysis](#data-analysis) + - [Python](#python) + - [Pandas](#pandas) + - [Summarizing](#summarizing) + - [Merging](#merging) + - [Dates](#dates) + - [Monetary Values](#monetary-values) +- [Visualizations](#visualization) + - [Charts](#charts) + - [Maps](#maps) + - [DataFrames](#dataframes) + - [Ipywidgets](#ipywidgets) + - [Markdown](#markdown) ## Data Analysis + ### Python -* [Composing Programs: comprehensive Python course](https://composingprograms.com/) -* [Intermediate Python: tips for improving your programs](https://book.pythontips.com/en/latest/index.html) -* [Stop Python from executing code when a module is imported.](https://stackoverflow.com/questions/6523791/why-is-python-running-my-module-when-i-import-it-and-how-do-i-stop-it) -* [Loop through 2 lists with zip in parallel.](https://stackoverflow.com/questions/1663807/how-to-iterate-through-two-lists-in-parallel) -* [Find the elements that are in one list, but not in another list.](https://stackoverflow.com/questions/41125909/python-find-elements-in-one-list-that-are-not-in-the-other) -* [What does += do?](https://stackoverflow.com/questions/4841436/what-exactly-does-do) + +- [Composing Programs: comprehensive Python course](https://composingprograms.com/) +- [Intermediate Python: tips for improving your programs](https://book.pythontips.com/en/latest/index.html) +- [Stop Python from executing code when a module is imported.](https://stackoverflow.com/questions/6523791/why-is-python-running-my-module-when-i-import-it-and-how-do-i-stop-it) +- [Loop through 2 lists with zip in parallel.](https://stackoverflow.com/questions/1663807/how-to-iterate-through-two-lists-in-parallel) +- [Find the elements that are in one list, but not in another list.](https://stackoverflow.com/questions/41125909/python-find-elements-in-one-list-that-are-not-in-the-other) +- [What does += do?](https://stackoverflow.com/questions/4841436/what-exactly-does-do) ### Pandas -* [Turn columns into dummy variables.](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) -* [Export multiple dataframes into their own sheets to a single Excel workbook.](https://xlsxwriter.readthedocs.io/example_pandas_multiple.html) -* [Display multiple dataframes side by side.](https://stackoverflow.com/questions/38783027/jupyter-notebook-display-two-pandas-tables-side-by-side) -* [Display all rows or columns of a dataframe in the notebook](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html) +- [Turn columns into dummy variables.](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) +- [Export multiple dataframes into their own sheets to a single Excel workbook.](https://xlsxwriter.readthedocs.io/example_pandas_multiple.html) +- [Display multiple dataframes side by side.](https://stackoverflow.com/questions/38783027/jupyter-notebook-display-two-pandas-tables-side-by-side) +- [Display all rows or columns of a dataframe in the 
notebook](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html) ### Summarizing -* [Groupby and calculate a new value, then use that value within your DataFrame.](https://stackoverflow.com/questions/35640364/python-pandas-max-value-in-a-group-as-a-new-column) -* [Explanation of the split-apply-combine paradigm.](https://stackoverflow.com/questions/30244952/how-do-i-create-a-new-column-from-the-output-of-pandas-groupby-sum) -* [Pandas profiling tool: creates html reports from DataFrames.](https://github.com/ydataai/pandas-profiling) - * [Examples](https://pandas-profiling.ydata.ai/examples/master/census/census_report.html) + +- [Groupby and calculate a new value, then use that value within your DataFrame.](https://stackoverflow.com/questions/35640364/python-pandas-max-value-in-a-group-as-a-new-column) +- [Explanation of the split-apply-combine paradigm.](https://stackoverflow.com/questions/30244952/how-do-i-create-a-new-column-from-the-output-of-pandas-groupby-sum) +- [Pandas profiling tool: creates html reports from DataFrames.](https://github.com/ydataai/pandas-profiling) + - [Examples](https://pandas-profiling.ydata.ai/examples/master/census/census_report.html) ### Merging -* When working with data sets where the "merge on" column is a string data type, it can be difficult to get the DataFrames to join. For example, df1 lists County of Sonoma, Human Services Department, Adult and Aging Division, but df2 references the same department as: County of Sonoma (Human Services Department) . - * Potential Solution #1: [fill in a column in one DataFrame that has a partial match with the string values in another one.](https://stackoverflow.com/questions/61811137/based-on-partial-string-match-fill-one-data-frame-column-from-another-dataframe) - * Potential Solution #2: [use the package fuzzymatcher. This will require you to carefully comb through for any bad matches.](https://pbpython.com/record-linking.html) - * Potential Solution #3: [if you don't have too many values, use a dictionary.](https://github.com/cal-itp/data-analyses/blob/main/drmt_grants/TIRCP_functions.py#:~:text=%23%23%23%20RECIPIENTS%20%23%23%23,%7D) + +- When working with data sets where the "merge on" column is a string data type, it can be difficult to get the DataFrames to join. For example, df1 lists County of Sonoma, Human Services Department, Adult and Aging Division, but df2 references the same department as: County of Sonoma (Human Services Department) . + - Potential Solution #1: [fill in a column in one DataFrame that has a partial match with the string values in another one.](https://stackoverflow.com/questions/61811137/based-on-partial-string-match-fill-one-data-frame-column-from-another-dataframe) + - Potential Solution #2: [use the package fuzzymatcher. 
This will require you to carefully comb through for any bad matches.](https://pbpython.com/record-linking.html) + - Potential Solution #3: [if you don't have too many values, use a dictionary.](https://github.com/cal-itp/data-analyses/blob/main/drmt_grants/TIRCP_functions.py#:~:text=%23%23%23%20RECIPIENTS%20%23%23%23,%7D) ### Dates -* [Use shift to calculate the number of days between two dates.](https://towardsdatascience.com/all-the-pandas-shift-you-should-know-for-data-analysis-791c1692b5e) + +- [Use shift to calculate the number of days between two dates.](https://towardsdatascience.com/all-the-pandas-shift-you-should-know-for-data-analysis-791c1692b5e) + ``` df['n_days_between'] = (df['prepared_date'] - df.shift(1)['prepared_date']).dt.days ``` -* [Assign fiscal year to a date.](https://stackoverflow.com/questions/59181855/get-the-financial-year-from-a-date-in-a-pandas-dataframe-and-add-as-new-column) + +- [Assign fiscal year to a date.](https://stackoverflow.com/questions/59181855/get-the-financial-year-from-a-date-in-a-pandas-dataframe-and-add-as-new-column) + ``` # Make sure your column is a date time object df['financial_year'] = df['base_date'].map(lambda x: x.year if x.month > 3 else x.year-1) ``` ### Monetary Values -* [Reformat values that are in scientific notation into millions or thousands.](https://github.com/d3/d3-format) - * [Example in notebook.](https://github.com/cal-itp/data-analyses/blob/30de5cd2fed3a37e2c9cfb661929252ad76f6514/dla/e76_obligated_funds/_dla_utils.py#L221) + +- [Reformat values that are in scientific notation into millions or thousands.](https://github.com/d3/d3-format) + - [Example in notebook.](https://github.com/cal-itp/data-analyses/blob/30de5cd2fed3a37e2c9cfb661929252ad76f6514/dla/e76_obligated_funds/_dla_utils.py#L221) + ``` x=alt.X("Funding Amount", axis=alt.Axis(format="$.2s", title="Obligated Funding ($2021)")) ``` -* [Reformat values from 19000000 to $19.0M.](https://stackoverflow.com/questions/41271673/format-numbers-in-a-python-pandas-dataframe-as-currency-in-thousands-or-millions) -* Adjust for inflation. +- [Reformat values from 19000000 to $19.0M.](https://stackoverflow.com/questions/41271673/format-numbers-in-a-python-pandas-dataframe-as-currency-in-thousands-or-millions) +- Adjust for inflation. ``` # Must install and import cpi package for the function to work. 
@@ -103,17 +115,21 @@ def adjust_prices(df): ``` ## Visualization + ### Charts + #### Altair -* [Manually concatenate a bar chart and line chart to create a dual axis graph.](https://github.com/altair-viz/altair/issues/1934) -* [Adjust the time units of a datetime column for an axis.](https://altair-viz.github.io/user_guide/transform/timeunit.html) -* [Label the lines on a line chart.](https://stackoverflow.com/questions/61194028/adding-labels-at-end-of-line-chart-in-altair) -* [Layer altair charts, lose color with no encoding, workaround to get different colors to appear on legend.](altair-viz/altair#1099) -* [Add regression line to scatterplot.](https://stackoverflow.com/questions/61447422/quick-way-to-visualise-multiple-columns-in-altair-with-regression-lines) -* [Adjust scales for axes to be the min and max values.](https://stackoverflow.com/questions/62281179/how-to-adjust-scale-ranges-in-altair) -* [Resolving the error 'TypeError: Object of type 'Timestamp' is not JSON serializable'](https://github.com/altair-viz/altair/issues/1355) -* [Manually sort a legend.](https://github.com/cal-itp/data-analyses/blob/460e9fc8f4311e90d9c647e149a23a9e38035394/Agreement_Overlap/Visuals.ipynb) -* Add tooltip to chart functions. + +- [Manually concatenate a bar chart and line chart to create a dual axis graph.](https://github.com/altair-viz/altair/issues/1934) +- [Adjust the time units of a datetime column for an axis.](https://altair-viz.github.io/user_guide/transform/timeunit.html) +- [Label the lines on a line chart.](https://stackoverflow.com/questions/61194028/adding-labels-at-end-of-line-chart-in-altair) +- [Layer altair charts, lose color with no encoding, workaround to get different colors to appear on legend.](altair-viz/altair#1099) +- [Add regression line to scatterplot.](https://stackoverflow.com/questions/61447422/quick-way-to-visualise-multiple-columns-in-altair-with-regression-lines) +- [Adjust scales for axes to be the min and max values.](https://stackoverflow.com/questions/62281179/how-to-adjust-scale-ranges-in-altair) +- [Resolving the error 'TypeError: Object of type 'Timestamp' is not JSON serializable'](https://github.com/altair-viz/altair/issues/1355) +- [Manually sort a legend.](https://github.com/cal-itp/data-analyses/blob/460e9fc8f4311e90d9c647e149a23a9e38035394/Agreement_Overlap/Visuals.ipynb) +- Add tooltip to chart functions. + ``` def add_tooltip(chart, tooltip1, tooltip2): chart = ( @@ -121,24 +137,28 @@ def add_tooltip(chart, tooltip1, tooltip2): return chart ``` - ### Maps -* [Examples of folium, branca, and color maps.](https://nbviewer.org/github/python-visualization/folium/blob/v0.2.0/examples/Colormaps.ipynb) -* [Quick interactive maps with Geopandas.gdf.explore()](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.explore.html) + +- [Examples of folium, branca, and color maps.](https://nbviewer.org/github/python-visualization/folium/blob/v0.2.0/examples/Colormaps.ipynb) +- [Quick interactive maps with Geopandas.gdf.explore()](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.explore.html) ### DataFrames -* [Styling dataframes with HTML.](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html) -* [After styling a DataFrame, you will have to access the underlying data with .data](https://stackoverflow.com/questions/56647813/perform-operations-after-styling-in-a-dataframe). 
+ +- [Styling dataframes with HTML.](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html) +- [After styling a DataFrame, you will have to access the underlying data with .data](https://stackoverflow.com/questions/56647813/perform-operations-after-styling-in-a-dataframe). ### ipywidgets + #### Tabs -* Create tabs to switch between different views. -* [Stack Overflow Help.](https://stackoverflow.com/questions/50842160/how-to-display-matplotlib-plots-in-a-jupyter-tab-widget) - * [Notebook example.](https://github.com/cal-itp/data-analyses/blob/main/dla/e76_obligated_funds/charting_function_work.ipynb?short_path=1c01de9#L302333) - * [Example on Ipywidgets docs page.](https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20List.html#Tabs) + +- Create tabs to switch between different views. +- [Stack Overflow Help.](https://stackoverflow.com/questions/50842160/how-to-display-matplotlib-plots-in-a-jupyter-tab-widget) + - [Notebook example.](https://github.com/cal-itp/data-analyses/blob/main/dla/e76_obligated_funds/charting_function_work.ipynb?short_path=1c01de9#L302333) + - [Example on Ipywidgets docs page.](https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20List.html#Tabs) ### Markdown -* [Create a markdown table.](https://www.pluralsight.com/guides/working-tables-github-markdown) -* [Add a table of content that links to headers throughout a markdown file.](https://stackoverflow.com/questions/2822089/how-to-link-to-part-of-the-same-document-in-markdown) -* [Add links to local files.](https://stackoverflow.com/questions/32563078/how-link-to-any-local-file-with-markdown-syntax?rq=1) -* [Direct embed an image.](https://datascienceparichay.com/article/insert-image-in-a-jupyter-notebook/) + +- [Create a markdown table.](https://www.pluralsight.com/guides/working-tables-github-markdown) +- [Add a table of content that links to headers throughout a markdown file.](https://stackoverflow.com/questions/2822089/how-to-link-to-part-of-the-same-document-in-markdown) +- [Add links to local files.](https://stackoverflow.com/questions/32563078/how-link-to-any-local-file-with-markdown-syntax?rq=1) +- [Direct embed an image.](https://datascienceparichay.com/article/insert-image-in-a-jupyter-notebook/) diff --git a/docs/analytics_tools/local_oracle_db_connections.md b/docs/analytics_tools/local_oracle_db_connections.md index 010d06ca1d..47b4c9dd77 100644 --- a/docs/analytics_tools/local_oracle_db_connections.md +++ b/docs/analytics_tools/local_oracle_db_connections.md @@ -45,13 +45,14 @@ python –m pip install oracledb ``` - - python will be uninstalled when sqlalchemy is uninstalled so we need to reinstall. +- python will be uninstalled when sqlalchemy is uninstalled so we need to reinstall. ![Uninstall sqlalchemy](assets/lodc_step7.png) **Step 8:** + 1. Determine what directory you are in and navigate to your directory of choice within Python Command Prompt. It is highly recommended to go into your S Number folder (i.e. `cd C:\Users\SNUMBER`) - * TIP: [See this Bash cheat sheet with some helpful commands](https://hpc.ua.edu/wp-content/uploads/2022/02/Linux_bash_cheat_sheet.pdf) + - TIP: [See this Bash cheat sheet with some helpful commands](https://hpc.ua.edu/wp-content/uploads/2022/02/Linux_bash_cheat_sheet.pdf) ![Navigate out to base directory](assets/lodc_step8_1.png) @@ -60,15 +61,14 @@ python –m pip install oracledb ![Check Folder](assets/lodc_step8_2.png) 3. Open a new Python Command Prompt Window. 
Navigate to the same directory, only adding the name of your new folder (cell #1 below). Then, open a Jupyter Lab (cell #2 below). - * This should open a new internet tab with Jupyter Lab. - * TIP: you can use `dir` to find out what files and folders you are in your current directory. - - + - This should open a new internet tab with Jupyter Lab. + - TIP: you can use `dir` to find out what files and folders you are in your current directory. ```bash cd C:\Users\SNUMBER/Notebook_Folder ##example of path you can take to get into the directory of your choice. In this case, we are navigating to the Notebooks folder we just created above. ``` + ```bash jupyter lab ``` @@ -84,7 +84,7 @@ jupyter lab ![Open a Notebook](assets/lodc_step10.png) Go to the left sidebar and see your new notebook. Right now, it is named untitled.ipynb. It is recommended to rename the notebook without spaces, using underscores instead. For example, - `my_notebook.ipynb` is easier to access compared to `My Notebook.ipynb`. +`my_notebook.ipynb` is easier to access compared to `My Notebook.ipynb`. In the notebook, copy the following in the notebook cells. It is recommended to copy this in chunks, as delineated by the lines in the table. To run the code, you can press SHIFT + ENTER. @@ -96,6 +96,7 @@ import oracledb oracledb.version = "8.3.0" ``` + ```python sys.modules["cx_Oracle"] = oracledb ``` @@ -105,7 +106,6 @@ sqlalchemy.__version__ ##checks the version of sqlalchemy. Output should read version ``1.4.39` ``` - **Step 11:** Connect to the database. For this step, you will need the database Username, Password, Host Name, Service Name and Port. ```python @@ -123,6 +123,7 @@ ENGINE_PATH_WIN_AUTH = f"{DIALECT}://{USERNAME}:{PASSWORD}@{HOST}:{PORT}/?servic ```python engine = sqlalchemy.create_engine(ENGINE_PATH_WIN_AUTH) ``` + ```python ## test the query diff --git a/docs/analytics_tools/overview.md b/docs/analytics_tools/overview.md index 3fc36da8f1..59a2a24479 100644 --- a/docs/analytics_tools/overview.md +++ b/docs/analytics_tools/overview.md @@ -1,20 +1,23 @@ (intro-analytics-tools)= + # Introduction to Analytics Tools + Welcome to the Analytics Tools section! If you're here, you're ready to begin conducting analyses. **What you should know after reading**: -* [Where to easily access links to team tools](tools-quick-links) -* [How to use our BI & dashboarding tool](metabase) -* [How to use our cloud notebook](jupyterhub-intro) -* [How to query the warehouse with SQL](querying-sql-jupyterhub) -* [What Python libraries are suggested](python-libraries) -* [Where code is kept](saving-code) -* [How to store new data](storing-new-data), and best practices for [data catalogs](data-catalogs). -* [Useful Python resources compiled by our team](knowledge-sharing) + +- [Where to easily access links to team tools](tools-quick-links) +- [How to use our BI & dashboarding tool](metabase) +- [How to use our cloud notebook](jupyterhub-intro) +- [How to query the warehouse with SQL](querying-sql-jupyterhub) +- [What Python libraries are suggested](python-libraries) +- [Where code is kept](saving-code) +- [How to store new data](storing-new-data), and best practices for [data catalogs](data-catalogs). +- [Useful Python resources compiled by our team](knowledge-sharing) When used in combination with the [Introduction to the Warehouse](intro-warehouse) and [How to Publish Analyses](publish-analyses) sections you should be prepared to conduct an analysis from end-to-end. -  +  :::{seealso} Missing something? 
:class: tip Still need to know more about our team and how we work? diff --git a/docs/analytics_tools/python_libraries.md b/docs/analytics_tools/python_libraries.md index f3c4432a12..c1c1d47f0c 100644 --- a/docs/analytics_tools/python_libraries.md +++ b/docs/analytics_tools/python_libraries.md @@ -12,6 +12,7 @@ kernelspec: language: python name: python3 --- + (python-libraries)= # Useful Python Libraries @@ -23,10 +24,10 @@ The following libraries are available and recommended for use by Cal-ITP data an 1. [shared utils](#shared-utils) 2. [calitp-data-analysis](#calitp-data-analysis) 3. [siuba](#siuba) -
- [Basic Query](#basic-query) -
- [Collect Query Results](#collect-query-results) -
- [Show Query SQL](#show-query-sql) -
- [More siuba Resources](more-siuba-resources) +
- [Basic Query](#basic-query) +
- [Collect Query Results](#collect-query-results) +
- [Show Query SQL](#show-query-sql) +
- [More siuba Resources](more-siuba-resources) 4. [pandas](pandas-resources) 5. [Add New Packages](#add-new-packages) 6. [Appendix: calitp-data-infra](appendix) @@ -39,15 +40,15 @@ A set of shared utility functions can also be installed, similarly to any Python ### In terminal -* Navigate to the package folder: `cd data-analyses/_shared_utils` -* Use the make command to run through conda install and pip install: `make setup_env` - * Note: you may need to select Kernel -> Restart Kernel from the top menu after make setup_env in order to successfully import shared_utils -* Alternative: add an `alias` to your `.bash_profile`: - * In terminal use `cd` to navigate to the home directory (not a repository) - * Type `nano .bash_profile` to open the .bash_profile in a text editor - * Add a line at end: `alias go='cd ~/data-analyses/portfolio && pip install -r requirements.txt && cd ../_shared_utils && make setup_env && cd ..'` - * Exit with Ctrl+X, hit yes, then hit enter at the filename prompt - * Restart your server; you can check your changes with `cat .bash_profile` +- Navigate to the package folder: `cd data-analyses/_shared_utils` +- Use the make command to run through conda install and pip install: `make setup_env` + - Note: you may need to select Kernel -> Restart Kernel from the top menu after make setup_env in order to successfully import shared_utils +- Alternative: add an `alias` to your `.bash_profile`: + - In terminal use `cd` to navigate to the home directory (not a repository) + - Type `nano .bash_profile` to open the .bash_profile in a text editor + - Add a line at end: `alias go='cd ~/data-analyses/portfolio && pip install -r requirements.txt && cd ../_shared_utils && make setup_env && cd ..'` + - Exit with Ctrl+X, hit yes, then hit enter at the filename prompt + - Restart your server; you can check your changes with `cat .bash_profile` ### In notebook @@ -160,8 +161,8 @@ Note that here the pandas Series method `str.contains` corresponds to `regexp_co ### More siuba Resources -* [siuba docs](https://siuba.readthedocs.io) -* ['Tidy Tuesday' live analyses with siuba](https://www.youtube.com/playlist?list=PLiQdjX20rXMHc43KqsdIowHI3ouFnP_Sf) +- [siuba docs](https://siuba.readthedocs.io) +- ['Tidy Tuesday' live analyses with siuba](https://www.youtube.com/playlist?list=PLiQdjX20rXMHc43KqsdIowHI3ouFnP_Sf) (pandas-resources)= @@ -169,15 +170,15 @@ Note that here the pandas Series method `str.contains` corresponds to `regexp_co The library pandas is very commonly used in data analysis, and the external resources below provide a brief overview of it's use. -* [Cheat Sheet - pandas](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) +- [Cheat Sheet - pandas](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) ## Add New Packages While most Python packages an analyst uses come in JupyterHub, there may be additional packages you'll want to use in your analysis. 
-* Install [shared utility functions](#shared-utils) -* Change directory into the project task's subfolder and add `requirements.txt` and/or `conda-requirements.txt` -* Run `pip install -r requirements.txt` and/or `conda install --yes -c conda-forge --file conda-requirements.txt` +- Install [shared utility functions](#shared-utils) +- Change directory into the project task's subfolder and add `requirements.txt` and/or `conda-requirements.txt` +- Run `pip install -r requirements.txt` and/or `conda install --yes -c conda-forge --file conda-requirements.txt` (appendix)= diff --git a/docs/analytics_tools/rt_analysis.md b/docs/analytics_tools/rt_analysis.md index aef1bdeeb3..93781c72cf 100644 --- a/docs/analytics_tools/rt_analysis.md +++ b/docs/analytics_tools/rt_analysis.md @@ -8,10 +8,10 @@ It includes its own interface and data model. Some functionality may shift to a ### Which analyses does it currently support? -* [California Transit Speed Maps](https://analysis.calitp.org/rt/README.html) -* Technical Metric Generation for Solutions for Congested Corridors Program, Local Partnership Program -* Various prioritization exercises from intermediate data, such as using aggregated speed data as an input for identifying bus route improvements as part of a broader model -* Various ad-hoc speed and delay analyses, such as highlighting relevant examples for presentations to stakeholders, or providing a shapefile of bus speeds on certain routes to support a district’s grant application +- [California Transit Speed Maps](https://analysis.calitp.org/rt/README.html) +- Technical Metric Generation for Solutions for Congested Corridors Program, Local Partnership Program +- Various prioritization exercises from intermediate data, such as using aggregated speed data as an input for identifying bus route improvements as part of a broader model +- Various ad-hoc speed and delay analyses, such as highlighting relevant examples for presentations to stakeholders, or providing a shapefile of bus speeds on certain routes to support a district’s grant application ## How does it work? @@ -19,12 +19,12 @@ This section includes detailed information about the data model and processing s ### Which data does it require? -* GTFS-RT Vehicle Positions -* GTFS Schedule Trips -* GTFS Schedule Stops -* GTFS Schedule Routes -* GTFS Schedule Stop Times -* GTFS Schedule Shapes +- GTFS-RT Vehicle Positions +- GTFS Schedule Trips +- GTFS Schedule Stops +- GTFS Schedule Routes +- GTFS Schedule Stop Times +- GTFS Schedule Shapes All of the above are sourced from the v2 warehouse. Note that all components must be present and consistently keyed in order to successfully analyze. This module works at the organization level in order to match the reports site and maintain the structure of the speedmap site. @@ -99,28 +99,28 @@ This step uses the generated interpolator objects to estimate and store speed an The results of this step are saved in OperatorDayAnalysis.stop_delay_view, a geodataframe. 
-|||| -|--- |--- |--- | -|Column|Source|Type| -|shape_meters|Projection of GTFS Stop along GTFS Shape (with 0 being start of shape), additionally 1km segments generated where stops are infrequent|float64| -|stop_id|GTFS Schedule|string*| -|stop_name|GTFS Schedule|string*| -|geometry|GTFS Schedule|geometry| -|shape_id|GTFS Schedule|string| -|trip_id|GTFS Schedule|string| -|stop_sequence|GTFS Schedule|float64**| -|arrival_time|GTFS Schedule|np.datetime64[ns]*| -|route_id|GTFS Schedule|string| -|route_short_name|GTFS Schedule|string| -|direction_id|GTFS Schedule|float64| -|actual_time|VehiclePositionInterpolator|np.datetime64[ns]| -|delay_seconds|Calculated here (actual_time-arrival_time)***|float64*| - -*null if location is an added 1km segment - -**integer values from GTFS, but added 1km segments are inserted in between the nearest 2 stops with a decimal - -***early arrivals currently represented as zero delay +| | | | +| ---------------- | -------------------------------------------------------------------------------------------------------------------------------------- | --------------------- | +| Column | Source | Type | +| shape_meters | Projection of GTFS Stop along GTFS Shape (with 0 being start of shape), additionally 1km segments generated where stops are infrequent | float64 | +| stop_id | GTFS Schedule | string\* | +| stop_name | GTFS Schedule | string\* | +| geometry | GTFS Schedule | geometry | +| shape_id | GTFS Schedule | string | +| trip_id | GTFS Schedule | string | +| stop_sequence | GTFS Schedule | float64\*\* | +| arrival_time | GTFS Schedule | np.datetime64\[ns\]\* | +| route_id | GTFS Schedule | string | +| route_short_name | GTFS Schedule | string | +| direction_id | GTFS Schedule | float64 | +| actual_time | VehiclePositionInterpolator | np.datetime64\[ns\] | +| delay_seconds | Calculated here (actual_time-arrival_time)\*\*\* | float64\* | + +\*null if location is an added 1km segment + +\*\*integer values from GTFS, but added 1km segments are inserted in between the nearest 2 stops with a decimal + +\*\*\*early arrivals currently represented as zero delay #### __VehiclePositionsInterpolator: a foundational building block__ @@ -136,9 +136,9 @@ VehiclePositionsInterpolator has simple logging functionality built in through t #### Projection -Vehicle Positions data includes a series of positions for a single trip at different points in time. Since we’re interested in tracking speed and delay along the transit route, we need to project those lat/long positions to a linear reference along the actual transit route (GTFS Shape). This is accomplished by the constructor calling VehiclePositionsInterpolator._attach_shape, which first does a naive projection of each position using shapely.LineString.project. This linearly referenced value is stored in the shape_meters column. +Vehicle Positions data includes a series of positions for a single trip at different points in time. Since we’re interested in tracking speed and delay along the transit route, we need to project those lat/long positions to a linear reference along the actual transit route (GTFS Shape). This is accomplished by the constructor calling VehiclePositionsInterpolator.\_attach_shape, which first does a naive projection of each position using shapely.LineString.project. This linearly referenced value is stored in the shape_meters column. -Since later stages will have to interpolate these times and positions, it’s necessary to undertake some additional data cleaning. 
This happens by calling VehiclePositionsInterpolator._linear_reference, which casts shape_meters to be monotonically increasing with respect to time. This removes multiple position reports at the same location, as well as any positions that suggest the vehicle traveled backwards along the route. While this introduces the assumption that the GPS-derived Vehicle Positions data is fairly accurate, our experience is that this process produces good results in most cases. Future updates will better accommodate [looping and inlining](https://gtfs.org/schedule/best-practices/#shapestxt); these currently get dropped in certain cases, which is undesirable. +Since later stages will have to interpolate these times and positions, it’s necessary to undertake some additional data cleaning. This happens by calling VehiclePositionsInterpolator.\_linear_reference, which casts shape_meters to be monotonically increasing with respect to time. This removes multiple position reports at the same location, as well as any positions that suggest the vehicle traveled backwards along the route. While this introduces the assumption that the GPS-derived Vehicle Positions data is fairly accurate, our experience is that this process produces good results in most cases. Future updates will better accommodate [looping and inlining](https://gtfs.org/schedule/best-practices/#shapestxt); these currently get dropped in certain cases, which is undesirable. #### Interpolating, quickly @@ -154,29 +154,29 @@ This method saves 2 artifacts: a geoparquet of OperatorDayAnalysis.stop_delay_vi rt_trips is a dataframe of trip-level information for every trip for which a VehiclePositionsInterpolator was successfully generated. It supports filtering by various attributes and provides useful contextual information for maps and analyses. 
-|||| -|--- |--- |--- | -|Column|Source|Type| -|feed_key*|v2 warehouse (gtfs mart)|string| -|trip_key|Key from v2 warehouse|string| -|gtfs_dataset_key*|v2 warehouse (gtfs mart)|string| -|activity_date|v2 warehouse|datetime.date| -|trip_id|GTFS Schedule|string| -|route_id|GTFS Schedule|string| -|route_short_name|GTFS Schedule|string| -|shape_id|GTFS Schedule|string| -|direction_id|GTFS Schedule|string| -|route_type|GTFS Schedule|string| -|route_long_name|GTFS Schedule|string| -|route_desc|GTFS Schedule|string| -|route_long_name|GTFS Schedule|string| -|calitp_itp_id|v2 warehouse (transit database)|int64| -|median_time|VehiclePositionsInterpolator|datetime.time| -|direction|VehiclePositionsInterpolator|string| -|mean_speed_mph|VehiclePositionsInterpolator|float64| -|organization_name|v2 warehouse (transit database)|string| - -* keys and IDs in this table refer to GTFS Schedule datasets +| | | | +| ------------------ | ------------------------------- | ------------- | +| Column | Source | Type | +| feed_key\* | v2 warehouse (gtfs mart) | string | +| trip_key | Key from v2 warehouse | string | +| gtfs_dataset_key\* | v2 warehouse (gtfs mart) | string | +| activity_date | v2 warehouse | datetime.date | +| trip_id | GTFS Schedule | string | +| route_id | GTFS Schedule | string | +| route_short_name | GTFS Schedule | string | +| shape_id | GTFS Schedule | string | +| direction_id | GTFS Schedule | string | +| route_type | GTFS Schedule | string | +| route_long_name | GTFS Schedule | string | +| route_desc | GTFS Schedule | string | +| route_long_name | GTFS Schedule | string | +| calitp_itp_id | v2 warehouse (transit database) | int64 | +| median_time | VehiclePositionsInterpolator | datetime.time | +| direction | VehiclePositionsInterpolator | string | +| mean_speed_mph | VehiclePositionsInterpolator | float64 | +| organization_name | v2 warehouse (transit database) | string | + +- keys and IDs in this table refer to GTFS Schedule datasets ## How do I use it? @@ -271,17 +271,17 @@ To load intermediate data, use `rt_filter_map_plot.from_gcs` to create an RtFilt Using the `set_filter` method, RtFilterMapper supports filtering based on at least one of these attributes at a time: -||| -|--- |--- | -|Attribute|Type| -|start_time|str (%H:%M, i.e. 11:00)| -|end_time|str (%H:%M, i.e. 19:00)| -|route_names|list, pd.Series| -|shape_ids|list, pd.Series| -|direction_id|str, '0' or '1'| -|direction|str, "Northbound", etc, _experimental_| -|trip_ids|list, pd.Series| -|route_types|list, pd.Series| +| | | +| ------------ | -------------------------------------- | +| Attribute | Type | +| start_time | str (%H:%M, i.e. 11:00) | +| end_time | str (%H:%M, i.e. 19:00) | +| route_names | list, pd.Series | +| shape_ids | list, pd.Series | +| direction_id | str, '0' or '1' | +| direction | str, "Northbound", etc, _experimental_ | +| trip_ids | list, pd.Series | +| route_types | list, pd.Series | Mapping, charting, and metric generation methods, listed under "dynamic tools" in the chart above, will respect the current filter. After generating your desired output, you can call `set_filter` again to set a new filter, or use `reset_filter` to remove the filter entirely. Then you can continue to analyze, without needing to create a new RtFilterMapper instance. @@ -303,36 +303,36 @@ This method is much more efficient, and we rely on it to maintain the quantity a After generating a speed map, the underlying data is available at RtFilterMapper.stop_segment_speed_view, a geodataframe. 
This data can be easily exported into a geoparquet, geojson, shapefile, or spreadsheet with the appropriate geopandas method. -|||| -|--- |--- |--- | -|Column|Source|Type| -|shape_meters|Projection of GTFS Stop along GTFS Shape (with 0 being start of shape), additionally 1km segments generated where stops are infrequent|float64| -|stop_id|GTFS Schedule|string| -|stop_name|GTFS Schedule|string| -|geometry|GTFS Schedule|geometry| -|shape_id|GTFS Schedule|string| -|trip_id|GTFS Schedule|string| -|stop_sequence|GTFS Schedule|float64| -|route_id|GTFS Schedule|string| -|route_short_name|GTFS Schedule|string| -|direction_id|GTFS Schedule|float64| -|delay_seconds*|`stop_delay_view`|np.datetime64[ns]| -|seconds_from_last*|time for trip to travel to this stop from last stop|float64| -|last_loc*|previous stop `shape_meters`|float64| -|meters_from_last*|`shape_meters` - `last_loc`|float64| -|speed_from_last*|`meters_from_last` / `seconds_from_last`|float64| -|delay_chg_sec*|`delay_seconds` - delay at last stop|float64| -|speed_mph*|`speed_from_last` converted to miles per hour|float64| -|n_trips_shp**|number of unique trips on this GTFS shape in filter|int64| -|avg_mph**|average speed for all trips on this segment|float64| -|_20p_mph**|20th percentile speed for all trips on this segment|float64| -|_80p_mph**|80th percentile speed for all trips on this segment|float64| -|fast_slow_ratio**|ratio between p80 speed and p20 speed for all trips on this segment|float64| -|trips_per_hour|`n_trips_shp` / hours in filter|float64| - -*disaggregate value -- applies to this trip only - -**aggregated value -- based on all trips in filter +| | | | +| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------- | ------------------- | +| Column | Source | Type | +| shape_meters | Projection of GTFS Stop along GTFS Shape (with 0 being start of shape), additionally 1km segments generated where stops are infrequent | float64 | +| stop_id | GTFS Schedule | string | +| stop_name | GTFS Schedule | string | +| geometry | GTFS Schedule | geometry | +| shape_id | GTFS Schedule | string | +| trip_id | GTFS Schedule | string | +| stop_sequence | GTFS Schedule | float64 | +| route_id | GTFS Schedule | string | +| route_short_name | GTFS Schedule | string | +| direction_id | GTFS Schedule | float64 | +| delay_seconds\* | `stop_delay_view` | np.datetime64\[ns\] | +| seconds_from_last\* | time for trip to travel to this stop from last stop | float64 | +| last_loc\* | previous stop `shape_meters` | float64 | +| meters_from_last\* | `shape_meters` - `last_loc` | float64 | +| speed_from_last\* | `meters_from_last` / `seconds_from_last` | float64 | +| delay_chg_sec\* | `delay_seconds` - delay at last stop | float64 | +| speed_mph\* | `speed_from_last` converted to miles per hour | float64 | +| n_trips_shp\*\* | number of unique trips on this GTFS shape in filter | int64 | +| avg_mph\*\* | average speed for all trips on this segment | float64 | +| \_20p_mph\*\* | 20th percentile speed for all trips on this segment | float64 | +| \_80p_mph\*\* | 80th percentile speed for all trips on this segment | float64 | +| fast_slow_ratio\*\* | ratio between p80 speed and p20 speed for all trips on this segment | float64 | +| trips_per_hour | `n_trips_shp` / hours in filter | float64 | + +\*disaggregate value -- applies to this trip only + +\*\*aggregated value -- based on all trips in filter #### Other Charts and Metrics @@ -385,4 +385,4 @@ flowchart TD 
Note that these steps are substantially automated using the `rt_analysis.sccp_tools.sccp_average_metrics` function. -* 2022 SCCP/LPP default timeframe is Apr 30 - May 9 2022. +- 2022 SCCP/LPP default timeframe is Apr 30 - May 9 2022. diff --git a/docs/analytics_tools/saving_code.md b/docs/analytics_tools/saving_code.md index 949e9901c8..3483986546 100644 --- a/docs/analytics_tools/saving_code.md +++ b/docs/analytics_tools/saving_code.md @@ -1,4 +1,5 @@ (saving-code)= + # Saving Code Most Cal-ITP analysts should opt for working and committing code directly from JupyterHub. Leveraging this cloud-based, standardized environment should alleviate many of the pain points associated with creating reproducible, collaborative work. @@ -6,60 +7,64 @@ Most Cal-ITP analysts should opt for working and committing code directly from J Doing work locally and pushing directly from the command line is a similar workflow, but replace the JupyterHub terminal with your local terminal. ## Table of Contents -1. What's a typical [project workflow](#project-workflow)? -1. Someone is collaborating on my branch, how do we [stay in sync](#pulling-and-pushing-changes)? - * The `main` branch is ahead, and I want to [sync my branch with `main`](rebase-and-merge) - * [Rebase](#rebase) or [merge](#merge) - * Options to [Resolve Merge Conflicts](resolve-merge-conflicts) -1. [Other Common GitHub Commands](#other-common-github-commands) - * [External Git Resources](external-git-resources) - * [Committing in the Github User Interface](#pushing-drag-drop) +1. What's a typical [project workflow](#project-workflow)? +2. Someone is collaborating on my branch, how do we [stay in sync](#pulling-and-pushing-changes)? + - The `main` branch is ahead, and I want to [sync my branch with `main`](rebase-and-merge) + - [Rebase](#rebase) or [merge](#merge) + - Options to [Resolve Merge Conflicts](resolve-merge-conflicts) +3. [Other Common GitHub Commands](#other-common-github-commands) + - [External Git Resources](external-git-resources) + - [Committing in the Github User Interface](#pushing-drag-drop) (committing-from-jupyterhub)= + ## Project Workflow It is best practice to do have a dedicated branch for your task. A commit in GitHub is similar to saving your work. It allows the system to capture the changes you have made and offers checkpoints through IDs that both show the progress of your work and can be referenced for particular tasks. In the `data-analyses` repo, separate analysis tasks live in their own directories, such as `data-analyses/gtfs_report_emails`. -1. Start from the `main` branch: `git pull origin main` -1. Check out a new branch to do your work: `git switch -c my-new-branch` -1. Do some work...add, delete, rename files, etc -1. See all the status changes to your files: `git status` -1. When you're ready to save some of that work, stage the files you want to commit with `git add foldername/notebook1.ipynb foldername/script1.py`. To stage all the files, use `git add .`. -1. Once you are ready to commit, add a commit message to associate with all the changes: `git commit -m "exploratory work" ` -1. Push those changes from local to remote branch (note: branch is `my-new-branch` and not `main`): `git push origin my-new-branch`. -1. To review a log of past commits: `git log` -1. When you are ready to merge all the commits into `main`, open a pull request (PR) on the remote repository, and merge it in! -1. Go back to `main` and update your local to match the remote: `git switch main`, `git pull origin main` -1. 
Once you've merged your branch into `main` and deleted it from the remote, you can delete your branch locally: `git branch -d my-new-branch`. You can reuse the branch name later. - +01. Start from the `main` branch: `git pull origin main` +02. Check out a new branch to do your work: `git switch -c my-new-branch` +03. Do some work...add, delete, rename files, etc +04. See all the status changes to your files: `git status` +05. When you're ready to save some of that work, stage the files you want to commit with `git add foldername/notebook1.ipynb foldername/script1.py`. To stage all the files, use `git add .`. +06. Once you are ready to commit, add a commit message to associate with all the changes: `git commit -m "exploratory work" ` +07. Push those changes from local to remote branch (note: branch is `my-new-branch` and not `main`): `git push origin my-new-branch`. +08. To review a log of past commits: `git log` +09. When you are ready to merge all the commits into `main`, open a pull request (PR) on the remote repository, and merge it in! +10. Go back to `main` and update your local to match the remote: `git switch main`, `git pull origin main` +11. Once you've merged your branch into `main` and deleted it from the remote, you can delete your branch locally: `git branch -d my-new-branch`. You can reuse the branch name later. ## Pulling and Pushing Changes Especially when you have a collaborator working on the same branch, you want to regularly sync your work with what's been committed by your collaborator. Doing this frequently allows you to stay in sync, and avoid unnecessary merge conflicts. 1. Stash your changes temporarily: `git stash` -1. Pull from the remote to bring the local branch up-to-date (and pull any changes your collaborator made): `git pull origin my-new-branch` -1. Pop your changes: `git stash pop` -1. Stage and push your commit with `git add` and `git commit` and `git push origin my-new-branch` +2. Pull from the remote to bring the local branch up-to-date (and pull any changes your collaborator made): `git pull origin my-new-branch` +3. Pop your changes: `git stash pop` +4. Stage and push your commit with `git add` and `git commit` and `git push origin my-new-branch` (rebase-and-merge)= + ### Syncing my Branch with Main + If you find that the `main` branch is ahead, and you want to sync your branch with `main` you'll need to use one of the below commands: -* [Rebase](#rebase) -* [Merge](#merge) +- [Rebase](#rebase) +- [Merge](#merge) Read more about the differences between `rebase` and `merge`: -* [Atlassian tutorial](https://www.atlassian.com/git/tutorials/merging-vs-rebasing) -* [GitKraken](https://www.gitkraken.com/learn/git/problems/git-rebase-vs-merge) -* [Hackernoon](https://hackernoon.com/git-merge-vs-rebase-whats-the-diff-76413c117333) -* [Stack Overflow](https://stackoverflow.com/questions/59622140/git-merge-vs-git-rebase-for-merge-conflict-scenarios) -
+ +- [Atlassian tutorial](https://www.atlassian.com/git/tutorials/merging-vs-rebasing) +- [GitKraken](https://www.gitkraken.com/learn/git/problems/git-rebase-vs-merge) +- [Hackernoon](https://hackernoon.com/git-merge-vs-rebase-whats-the-diff-76413c117333) +- [Stack Overflow](https://stackoverflow.com/questions/59622140/git-merge-vs-git-rebase-for-merge-conflict-scenarios) +
#### Rebase + Rebasing is an important tool to be familiar with and introduce into your workflow. The video and instructions below help to provide information on how to begin using it in your collaborations with the team. [Youtube - A Better Git Workflow with Rebase](https://www.youtube.com/watch?v=f1wnYdLEpgI) @@ -67,67 +72,68 @@ Rebasing is an important tool to be familiar with and introduce into your workfl A rebase might be preferred, especially if all your work is contained on your branch, within your task's folder, and lots of activity is happening on `main`. You'd like to plop all your commits onto the most recent `main` branch, and have it appear as if all your work took place *after* those PRs were merged in. 1. At this point, you've either stashed or added commits on `my-new-branch`. -1. Check out the `main` branch: `git switch main` -1. Pull from origin: `git pull origin main` -1. Check out your current branch: `git switch my-new-branch` -1. Rebase and rewrite history so that your commits come *after* everything on main: `git rebase main` -1. At this point, the rebase may be successful, or you will have to address any conflicts! If you want to abort, use `git rebase --abort`. Changes in scripts will be easy to resolve, but notebook conflicts are difficult. If conflicts are easily resolved, open the file, make the changes, then `git add` the file(s), and `git rebase --continue`. -1. Make any commits you want (from step 1) with `git add`, `git commit -m "commit message"` -1. Force-push those changes to complete the rebase and rewrite the commit history: `git push origin my-new-branch -f` +2. Check out the `main` branch: `git switch main` +3. Pull from origin: `git pull origin main` +4. Check out your current branch: `git switch my-new-branch` +5. Rebase and rewrite history so that your commits come *after* everything on main: `git rebase main` +6. At this point, the rebase may be successful, or you will have to address any conflicts! If you want to abort, use `git rebase --abort`. Changes in scripts will be easy to resolve, but notebook conflicts are difficult. If conflicts are easily resolved, open the file, make the changes, then `git add` the file(s), and `git rebase --continue`. +7. Make any commits you want (from step 1) with `git add`, `git commit -m "commit message"` +8. Force-push those changes to complete the rebase and rewrite the commit history: `git push origin my-new-branch -f` #### Merge + Note: Merging with [fast-forward](https://git-scm.com/docs/git-merge#Documentation/git-merge.txt---ff) behaves similarly to a rebase. 1. At this point, you've either stashed or added commits on `my-new-branch`. -1. Pull from origin: `git switch main` and `git pull origin main` -1. Go back to your branch: `git switch my-new-branch` -1. Complete the merge of `my-new-branch` with `main` and create a new commit: `git merge my-new-branch main` -1. A merge commit window opens up. Type `:wq` to exit and complete the merge. -1. Type `git log` to see that the merge commit was created. +2. Pull from origin: `git switch main` and `git pull origin main` +3. Go back to your branch: `git switch my-new-branch` +4. Complete the merge of `my-new-branch` with `main` and create a new commit: `git merge my-new-branch main` +5. A merge commit window opens up. Type `:wq` to exit and complete the merge. +6. Type `git log` to see that the merge commit was created. 
(resolve-merge-conflicts)= + ### Options for Resolving Merge Conflicts + If you discover merge conflicts and they are within a single notebook that only you are working on, it can be relatively easy to resolve them using the Git command line instructions: -* From the command line, run `git merge main`. This should show you the conflict. -* From here, there are two options depending on what version of the notebook you'd like to keep. - * To keep the version on your branch, run:
-`git checkout --ours path/to/notebook.ipynb` - * To keep the remote version, run:
-`git checkout --theirs path/to/notebook.ipynb` -* From here, just add the file and commit with a message as you normally would and the conflict should be fixed in your Pull Request. + +- From the command line, run `git merge main`. This should show you the conflict. +- From here, there are two options depending on what version of the notebook you'd like to keep. + - To keep the version on your branch, run:
+ `git checkout --ours path/to/notebook.ipynb` + - To keep the remote version, run:
+ `git checkout --theirs path/to/notebook.ipynb` +- From here, just add the file and commit with a message as you normally would and the conflict should be fixed in your Pull Request. ## Other Common GitHub Commands These are helpful Git commands an analyst might need, listed in no particular order. -* During collaboration, if another analyst already created a remote branch, and you want to work off of the same branch: `git fetch origin`, `git checkout -b our-project-branch origin/our-project-branch` -* To discard the changes you made to a file, `git checkout my-notebook.ipynb`, and you can revert back to the version that was last committed. -* Temporarily stash changes, move to a different branch, and come back and retain those changes: `git stash`, `git switch some-other-branch`, do stuff on the other branch, `git switch original-branch`, `git stash pop` -* Rename files and retain the version history associated (`mv` is move, and renaming is moving the file path): `git mv old-notebook.ipynb new-notebook.ipynb` -* Set your local `main` branch to be the same as the remote branch: `git fetch origin -git reset --hard origin/main` -* To delete a file that's been added in a previous commit: `git rm notebooks/my-notebook.ipynb` -* Cherry pick a commit and apply it to your branch: `git cherry-pick COMMIT_HASH`. Read more from [Stack Overflow](https://stackoverflow.com/questions/9339429/what-does-cherry-picking-a-commit-with-git-mean) and [Atlassian](https://www.atlassian.com/git/tutorials/cherry-pick). +- During collaboration, if another analyst already created a remote branch, and you want to work off of the same branch: `git fetch origin`, `git checkout -b our-project-branch origin/our-project-branch` +- To discard the changes you made to a file, `git checkout my-notebook.ipynb`, and you can revert back to the version that was last committed. +- Temporarily stash changes, move to a different branch, and come back and retain those changes: `git stash`, `git switch some-other-branch`, do stuff on the other branch, `git switch original-branch`, `git stash pop` +- Rename files and retain the version history associated (`mv` is move, and renaming is moving the file path): `git mv old-notebook.ipynb new-notebook.ipynb` +- Set your local `main` branch to be the same as the remote branch: `git fetch origin git reset --hard origin/main` +- To delete a file that's been added in a previous commit: `git rm notebooks/my-notebook.ipynb` +- Cherry pick a commit and apply it to your branch: `git cherry-pick COMMIT_HASH`. Read more from [Stack Overflow](https://stackoverflow.com/questions/9339429/what-does-cherry-picking-a-commit-with-git-mean) and [Atlassian](https://www.atlassian.com/git/tutorials/cherry-pick). (external-git-resources)= + ### External Resources -* [Git Terminal Cheat Sheet](https://gist.github.com/cferdinandi/ef665330286fd5d7127d) -* [Git Decision Tree - 'So you have a mess on your hands'](http://justinhileman.info/article/git-pretty/full/) + +- [Git Terminal Cheat Sheet](https://gist.github.com/cferdinandi/ef665330286fd5d7127d) +- [Git Decision Tree - 'So you have a mess on your hands'](http://justinhileman.info/article/git-pretty/full/) (pushing-drag-drop)= -### Committing in the Github User Interface + +### Committing in the Github User Interface If you would like to commit directly from the Github User Interface: 1. 
Navigate the Github repository and folder that you would like to add your work, and locate the file on your computer that you would like to commit + ![Collection Matrix](assets/step-1-gh-drag-drop.png) +2. 'Click and Drag' your file from your computer into the Github screen - ![Collection Matrix](assets/step-1-gh-drag-drop.png) - - -1. 'Click and Drag' your file from your computer into the Github screen - - - - ![Collection Matrix](assets/step-2-gh-drag-drop.png) + ![Collection Matrix](assets/step-2-gh-drag-drop.png) diff --git a/docs/analytics_tools/scripts.md b/docs/analytics_tools/scripts.md index 4c5481334c..4d09b1fc4e 100644 --- a/docs/analytics_tools/scripts.md +++ b/docs/analytics_tools/scripts.md @@ -1,4 +1,5 @@ (scripts)= + # Scripts Most Cal-ITP analysts will be using Jupyter Notebooks in our Jupyter Hub for their work. Jupyter Notebooks have numerous benefits, including seeing outputs at the end of each code block, ability to weave narrative with analysis through Markdown cells, and the ability to convert what's written in code directly into an HTML or pdf for making automated reports. They are great for exploratory work. @@ -10,65 +11,78 @@ Larger analytics projects often require substantial data processing, wrangling, ### Modularity **Notebooks** -* Functions and classes defined within a notebook stay within a notebook. -* No portability, hindering reproducibility, resulting in duplicative code for yourself or duplicative work in an organization. + +- Functions and classes defined within a notebook stay within a notebook. +- No portability, hindering reproducibility, resulting in duplicative code for yourself or duplicative work in an organization. **Scripts** -* Functions and classes defined here are importable to be used in notebooks and scripts. + +- Functions and classes defined here are importable to be used in notebooks and scripts. ### Self-Hinting + **Notebooks** -* You need to run a series of notebooks to complete all the data processing needed. The best case scenario is that you've provided the best documentation in a README and intuitive notebook names (neither of which are a given). This best case scenario is still more brittle compared to using a Makefile. + +- You need to run a series of notebooks to complete all the data processing needed. The best case scenario is that you've provided the best documentation in a README and intuitive notebook names (neither of which are a given). This best case scenario is still more brittle compared to using a Makefile. **Scripts** -* Pairing the series of scripts with a Makefile self-hints the order in which scripts should be executed. -* Running a single make command is a simple way to schedule and execute an entire workflow. + +- Pairing the series of scripts with a Makefile self-hints the order in which scripts should be executed. +- Running a single make command is a simple way to schedule and execute an entire workflow. ### Easy Git + **Notebooks** -* Re-running or clearing cells are changes that Git tracks. -* Potential merge conflicts when collaborating with others or even from switching branches. -* Merge conflicts are extremely difficult to resolve. This is due to the fact that Jupyter Notebook outputs are JSON. Even if someone else opened your notebook and didn't change anything, that could lead to changes in the underlying JSON metadata...resulting in a painful merge conflict that may not even be resolved. + +- Re-running or clearing cells are changes that Git tracks. 
+- Potential merge conflicts when collaborating with others or even from switching branches. +- Merge conflicts are extremely difficult to resolve. This is due to the fact that Jupyter Notebook outputs are JSON. Even if someone else opened your notebook and didn't change anything, that could lead to changes in the underlying JSON metadata...resulting in a painful merge conflict that may not even be resolved. **Scripts** -* Python scripts (`.py`) are plain text files. Git tracks plain text changes easily. -* Merge conflicts may arise but are easy to resolve. + +- Python scripts (`.py`) are plain text files. Git tracks plain text changes easily. +- Merge conflicts may arise but are easy to resolve. ### Robust and Scalable + **Notebooks** -* Different versions of notebooks may prevent reproducibility. -* There are issues with scaling notebooks, especially when wanting to test out different parameters, and making copies of notebooks is not wise. If you discovered an error later, would you make that change in the 10 notebook copies? Or make 10 duplicates again? + +- Different versions of notebooks may prevent reproducibility. +- There are issues with scaling notebooks, especially when wanting to test out different parameters, and making copies of notebooks is not wise. If you discovered an error later, would you make that change in the 10 notebook copies? Or make 10 duplicates again? **Scripts** -* Scripts are robust to scaling and reproducing work. -* Injecting various parameters is not an issue, as scripts often hold functions that can take different parameters and arguments. Rerunning a script when you detect an error is fairly straightforward. + +- Scripts are robust to scaling and reproducing work. +- Injecting various parameters is not an issue, as scripts often hold functions that can take different parameters and arguments. Rerunning a script when you detect an error is fairly straightforward. ## Best Practices **At minimum**, all research tasks / projects must include: -* 1 script for importing external data and changing it from shapefile/geojson/csv to parquet/geoparquet -* If only using warehouse data or upstream warehouse data cached in GCS, can skip this first script -* At least 1 script for data processing to produce processed output for visualization -* Break out scripts by concepts / stages -* Include data catalog, README for the project -* All functions used in scripts should have docstrings. Type hints are encouraged! + +- 1 script for importing external data and changing it from shapefile/geojson/csv to parquet/geoparquet +- If only using warehouse data or upstream warehouse data cached in GCS, can skip this first script +- At least 1 script for data processing to produce processed output for visualization +- Break out scripts by concepts / stages +- Include data catalog, README for the project +- All functions used in scripts should have docstrings. Type hints are encouraged! For **larger projects**, introduce more of these principles: -* Distinguish between data processing that is fairly one-off vs data processing that could be part of a pipeline (shared across multiple downstream products) -* Data processing pipeline refactored to scale - * Make it work, make it right, make it fast -* Add logging capability -* Identify shared patterns for functions that could be abstracted more generally. 
-* Replace functions that live in python scripts with top-level functions - * Make these top-level functions “installable” across directories - * Point downstream uses in scripts or notebooks at these top-level / upstream functions -* Batch scripting to create a pipeline for processing data very similarly - * YAML file to hold project configuration variables / top-level parameters +- Distinguish between data processing that is fairly one-off vs data processing that could be part of a pipeline (shared across multiple downstream products) +- Data processing pipeline refactored to scale + - Make it work, make it right, make it fast +- Add logging capability +- Identify shared patterns for functions that could be abstracted more generally. +- Replace functions that live in python scripts with top-level functions + - Make these top-level functions “installable” across directories + - Point downstream uses in scripts or notebooks at these top-level / upstream functions +- Batch scripting to create a pipeline for processing data very similarly + - YAML file to hold project configuration variables / top-level parameters ### References -* [Good Data Scientists Write Code Code](https://towardsdatascience.com/good-data-scientists-write-good-code-28352a826d1f) -* [Does Your Code Smell](https://towardsdatascience.com/does-your-code-smell-acb9f24bbb46) -* [Modularity, Readability, Speed](https://towardsdatascience.com/3-key-components-of-a-well-written-data-model-c426b1c1a293) -* [Batch scripting](https://aaltoscicomp.github.io/python-for-scicomp/scripts/) -* [Start in notebooks, finish in scripts](https://learnpython.com/blog/python-scripts-vs-jupyter-notebooks/) + +- [Good Data Scientists Write Code Code](https://towardsdatascience.com/good-data-scientists-write-good-code-28352a826d1f) +- [Does Your Code Smell](https://towardsdatascience.com/does-your-code-smell-acb9f24bbb46) +- [Modularity, Readability, Speed](https://towardsdatascience.com/3-key-components-of-a-well-written-data-model-c426b1c1a293) +- [Batch scripting](https://aaltoscicomp.github.io/python-for-scicomp/scripts/) +- [Start in notebooks, finish in scripts](https://learnpython.com/blog/python-scripts-vs-jupyter-notebooks/) diff --git a/docs/analytics_tools/storing_data.md b/docs/analytics_tools/storing_data.md index 7faa4dae1c..09178f8524 100644 --- a/docs/analytics_tools/storing_data.md +++ b/docs/analytics_tools/storing_data.md @@ -12,24 +12,26 @@ kernelspec: language: python name: python3 --- + (storing-new-data)= + # Storing Data During Analysis Our team uses Google Cloud Storage (GCS) buckets, specifically the `calitp-analytics-data` bucket, to store other datasets for analyses. GCS can store anything, of arbitrary object size and shape. It’s like a giant folder in the cloud. You can use it to store CSVs, parquets, pickles, videos, etc. **Within the bucket, the `data-analyses` folder with its sub-folders corresponds to the `data-analyses` GitHub repo with its sub-folders. Versioned data for a task should live within the correct folders.** ## Table of Contents -1. [Introduction](#introduction) -1. [Storing New Data - Screencast](storing-new-data-screencast) -1. [Uploading Data from a Notebook](uploading-from-notebook) -
- [Tabular Data](#tabular-data) -
- [Parquet](#parquet) -
- [CSV](#csv) -
- [Geospatial Data](#geospatial-data) -
- [Geoparquet](#geoparquet) -
- [Zipped shapefile](#zipped-shapefile) -
- [GeoJSON](#geojson) -1. [Uploading data in Google Cloud Storage](in-gcs) +1. [Introduction](#introduction) +2. [Storing New Data - Screencast](storing-new-data-screencast) +3. [Uploading Data from a Notebook](uploading-from-notebook) +
- [Tabular Data](#tabular-data) +
- [Parquet](#parquet) +
- [CSV](#csv) +
- [Geospatial Data](#geospatial-data) +
- [Geoparquet](#geoparquet) +
- [Zipped shapefile](#zipped-shapefile) +
- [GeoJSON](#geojson) +4. [Uploading data in Google Cloud Storage](in-gcs) ## Introduction @@ -37,19 +39,21 @@ Currently, report data can be stored in the `calitp-analytics-data` bucket in Go In order to save data being used in a report, you can use two methods: -* Using code in your notebook to upload the data. -* Using the Google Cloud Storage web UI to manually upload. +- Using code in your notebook to upload the data. +- Using the Google Cloud Storage web UI to manually upload. Watch the screencast below and read the additional information to begin. **Note**: To access Google Cloud Storage you will need to have set up your Google authentication. If you have yet to do so, [follow these instructions](connecting-to-warehouse). (storing-new-data-screencast)= + ## Storing New Data - Screencast
(uploading-from-notebook)= + ## Uploading Data from a Notebook In order to begin, import the following libraries in your notebook and set the `fs` variable @@ -62,6 +66,7 @@ import pandas as pd from calitp_data.storage import get_fs fs = get_fs() ``` + ### Tabular Data While GCS can store CSVs, parquets, Excel spreadsheets, etc, parquets are the preferred file type. Interacting with tabular datasets in GCS is fairly straightforward and is handled well by `pandas`. @@ -150,6 +155,7 @@ shared_utils.utils.geojson_gcs_export( ``` (in-gcs)= + ## Uploading data in Google Cloud Storage You can access the cloud bucket from the web from https://console.cloud.google.com/storage/browser/calitp-analytics-data. diff --git a/docs/analytics_tools/tools_quick_links.md b/docs/analytics_tools/tools_quick_links.md index 3c9231b82b..56f59f153f 100644 --- a/docs/analytics_tools/tools_quick_links.md +++ b/docs/analytics_tools/tools_quick_links.md @@ -1,21 +1,22 @@ (tools-quick-links)= + # Tools Quick Links -**Lost a link?** Find quick access to our tools below. -| Tool | Purpose | -| -------- | -------- | -| [**Analytics Repo**](https://github.com/cal-itp/data-analyses) | Analytics team code repository. | -| [**Analytics Project Board**](https://github.com/cal-itp/data-analyses/projects/1) | Analytics team work management. | -| [**notebooks.calitp.org**](https://notebooks.calitp.org/) | JupyterHub cloud-based notebooks | -| [**dashboards.calitp.org**](https://dashboards.calitp.org/) | Metabase dashboards & Business Insights | -| [**dbt-docs.calitp.org**](https://dbt-docs.calitp.org/) | DBT warehouse documentation | -| [**analysis.calitp.org**](https://analysis.calitp.org/) | Analytics portfolio landing page | -| [**Google BigQuery**](https://console.cloud.google.com/bigquery) | Our warehouse and SQL Querying | -| [**Google Cloud Storage**](https://console.cloud.google.com/storage/browser/calitp-analytics-data) | Cloud file storage | +**Lost a link?** Find quick access to our tools below. +| Tool | Purpose | +| -------------------------------------------------------------------------------------------------- | --------------------------------------- | +| [**Analytics Repo**](https://github.com/cal-itp/data-analyses) | Analytics team code repository. | +| [**Analytics Project Board**](https://github.com/cal-itp/data-analyses/projects/1) | Analytics team work management. | +| [**notebooks.calitp.org**](https://notebooks.calitp.org/) | JupyterHub cloud-based notebooks | +| [**dashboards.calitp.org**](https://dashboards.calitp.org/) | Metabase dashboards & Business Insights | +| [**dbt-docs.calitp.org**](https://dbt-docs.calitp.org/) | DBT warehouse documentation | +| [**analysis.calitp.org**](https://analysis.calitp.org/) | Analytics portfolio landing page | +| [**Google BigQuery**](https://console.cloud.google.com/bigquery) | Our warehouse and SQL Querying | +| [**Google Cloud Storage**](https://console.cloud.google.com/storage/browser/calitp-analytics-data) | Cloud file storage | +  -  ```{admonition} Still need access to a tool on this page? Ask in the `#services-team` channel in the Cal-ITP Slack. ``` diff --git a/docs/analytics_welcome/how_we_work.md b/docs/analytics_welcome/how_we_work.md index 70d7883a35..555ef52605 100644 --- a/docs/analytics_welcome/how_we_work.md +++ b/docs/analytics_welcome/how_we_work.md @@ -1,48 +1,59 @@ (how-we-work)= + # How We Work + ## Team Meetings + The section below outlines our team's primary meetings and their purposes, as well as our our team's shared meeting standards. 
(current-meetings)= + ### Current Meetings -**New analysts**, look out for these meetings to be added to your calendar. -| Name | Cadence | Description | -| -------- | -------- | -------- | -| **Technical Onboarding** | Week 1
40 Mins | To ensure access to tools and go over best practices. | -| **Data Office Hours** | Tues
50 Mins | An opportunity to learn about our software tools, ask technical questions and get code debugging help from peers. | -| **Analyst Team Meeting** | Thurs
45 Mins | Branch meeting to share your screen and discuss what you've been working on. | +**New analysts**, look out for these meetings to be added to your calendar. +| Name | Cadence | Description | +| ------------------------ | -------------------- | ----------------------------------------------------------------------------------------------------------------- | +| **Technical Onboarding** | Week 1
40 Mins | To ensure access to tools and go over best practices. | +| **Data Office Hours** | Tues
50 Mins | An opportunity to learn about our software tools, ask technical questions and get code debugging help from peers. | +| **Analyst Team Meeting** | Thurs
45 Mins | Branch meeting to share your screen and discuss what you've been working on. | (slack-intro)= + ## Communication Channels -| Channel | Purpose | Description | -| -------- | -------- | -------- | -| #**ct-bdat-internal** | Discussion | For Caltrans Division of Data and Digital Services employees. | -| #**data-analysis** | Discussion | For sharing and collaborating on Cal-ITP data analyses. | -| #**data-office-hours** | Discussion | A place to bring questions, issues, and observations for team discussion. | +| Channel | Purpose | Description | +| ------------------------ | ---------- | ------------------------------------------------------------------------------------------- | +| #**ct-bdat-internal** | Discussion | For Caltrans Division of Data and Digital Services employees. | +| #**data-analysis** | Discussion | For sharing and collaborating on Cal-ITP data analyses. | +| #**data-office-hours** | Discussion | A place to bring questions, issues, and observations for team discussion. | | #**data-warehouse-devs** | Discussion | For people building dbt models - focused on data warehouse performance considerations, etc. | ## Collaboration Tools (analytics-project-board)= + ### GitHub Analytics Project Board + **You can access The Analytics Project Board [using this link](https://github.com/cal-itp/data-analyses/projects/1)**. #### How We Track Work ##### Screencast - Navigating the Board + The screencast below introduces: -* Creating new GitHub issues to track your work -* Adding your issues to our analytics project board -* Viewing all of your issues on the board (e.g. clicking your avatar to filter) + +- Creating new GitHub issues to track your work +- Adding your issues to our analytics project board +- Viewing all of your issues on the board (e.g. clicking your avatar to filter)
(analytics-repo)= + ### GitHub Analytics Repo #### Using the data-analyses Repo + This is our main data analysis repository, for sharing quick reports and works in progress. Get set up on GitHub and clone the data-analyses repository [using this link](committing-from-jupyterhub). For collaborative short-term tasks, create a new folder and work off a separate branch. diff --git a/docs/analytics_welcome/overview.md b/docs/analytics_welcome/overview.md index e09017a2c4..7091cc083b 100644 --- a/docs/analytics_welcome/overview.md +++ b/docs/analytics_welcome/overview.md @@ -1,4 +1,5 @@ (analysts-welcome)= + # Welcome! Welcome to the Analysts section of our Cal-ITP Data Services documentation! @@ -6,16 +7,18 @@ Welcome to the Analysts section of our Cal-ITP Data Services documentation! Here you will be introduced to the resources and best practices that make our analytics team work. **After reading the 'Welcome' section, you will be more familiar with**: -* [Background on the Cal-ITP project](calitp-background) -* [Meetings, communication channels, and other forms of collaboration](how-we-work) + +- [Background on the Cal-ITP project](calitp-background) +- [Meetings, communication channels, and other forms of collaboration](how-we-work) After you've read through this section, continue reading through the remaining sections of this chapter for further introduction to the various technical elements required to conduct an end-to-end analysis. ---- +______________________________________________________________________ **Other Analytics Sections**: -* [Technical Onboarding](technical-onboarding) -* [Introduction to Analytics Tools](intro-analytics-tools) -* [Tutorials for New Python Users](beginner_analysts_tutorials) -* [Introduction to the Warehouse](intro-warehouse) -* [How to Publish Analyses](publish-analyses) + +- [Technical Onboarding](technical-onboarding) +- [Introduction to Analytics Tools](intro-analytics-tools) +- [Tutorials for New Python Users](beginner_analysts_tutorials) +- [Introduction to the Warehouse](intro-warehouse) +- [How to Publish Analyses](publish-analyses) diff --git a/docs/analytics_welcome/what_is_calitp.md b/docs/analytics_welcome/what_is_calitp.md index 9592769eb5..1948366b75 100644 --- a/docs/analytics_welcome/what_is_calitp.md +++ b/docs/analytics_welcome/what_is_calitp.md @@ -1,12 +1,17 @@ (calitp-background)= + # Cal-ITP Project Information + More information on the Cal-ITP project is available at [calitp.org](https://www.calitp.org/). ## On-boarding Resources and Reading + The following document provides insight into the Cal-ITP project for new team members. + - [Cal-ITP On-boarding Resources and Reading](https://docs.google.com/document/d/1430Yc11j_RISdh4aIjuFlmCCUSPgx1Y5OL-_dzIU70E/edit?usp=sharing) ## Project Background + A selection of resources taken from the above document providing background on the history of Cal-ITP. - [Market Sounding](https://dot.ca.gov/-/media/dot-media/cal-itp/documents/final-cal-itp-market-sounding-market-response-summary-103119b-a11y.pdf): This 2019 market sounding laid the foundation for our current priorities. It was followed up with a business case analysis of the project in a [Feasibility Study](https://dot.ca.gov/-/media/dot-media/cal-itp/documents/calitp-feasibility-study-042420-a11y.pdf). 
diff --git a/docs/architecture/architecture_overview.md b/docs/architecture/architecture_overview.md index 179c96ad8d..c126365618 100644 --- a/docs/architecture/architecture_overview.md +++ b/docs/architecture/architecture_overview.md @@ -1,11 +1,12 @@ (architecture-overview)= + # Architecture Overview The Cal-ITP data infrastructure facilitates several types of data workflows: -* `Ingestion` -* `Modeling/transformation` -* `Analysis` +- `Ingestion` +- `Modeling/transformation` +- `Analysis` In addition, we have `Infrastructure` tools that monitor the health of the system itself or deploy or run other services and do not directly interact with data or support end user data access. @@ -48,25 +49,27 @@ class ingestion_label,modeling_label,analysis_label group_labelstyle ``` This documentation outlines two ways to think of this system and its components from a technical/maintenance perspective: -* [Services](services) that are deployed and maintained (ex. Metabase, JupyterHub, etc.) -* [Data pipelines](data) to ingest specific types of data (ex. GTFS Schedule, Payments, etc.) + +- [Services](services) that are deployed and maintained (ex. Metabase, JupyterHub, etc.) +- [Data pipelines](data) to ingest specific types of data (ex. GTFS Schedule, Payments, etc.) ## Environments Across both data and services, we often have a "production" (live, end-user-facing) environment and some type of testing, staging, or development environment. ### production -* Managed Airflow (i.e. Google Cloud Composer) -* Production gtfs-rt-archiver-v3 -* `cal-itp-data-infra` database (i.e. project) in BigQuery -* Google Cloud Storage buckets _without_ a prefix - * e.g. `gs://calitp-gtfs-schedule-parsed-hourly` +- Managed Airflow (i.e. Google Cloud Composer) +- Production gtfs-rt-archiver-v3 +- `cal-itp-data-infra` database (i.e. project) in BigQuery +- Google Cloud Storage buckets _without_ a prefix + - e.g. `gs://calitp-gtfs-schedule-parsed-hourly` ### testing/staging/dev -* Locally-run Airflow (via docker-compose) -* Test gtfs-rt-archiver-v3 -* `cal-itp-data-infra-staging` database (i.e. project) in BigQuery -* GCS buckets with the `test-` prefix - * e.g. `gs://test-calitp-gtfs-rt-raw-v2` - * Some buckets prefixed with `dev-` also exist; primarily for testing the RT archiver locally + +- Locally-run Airflow (via docker-compose) +- Test gtfs-rt-archiver-v3 +- `cal-itp-data-infra-staging` database (i.e. project) in BigQuery +- GCS buckets with the `test-` prefix + - e.g. `gs://test-calitp-gtfs-rt-raw-v2` + - Some buckets prefixed with `dev-` also exist; primarily for testing the RT archiver locally diff --git a/docs/architecture/data.md b/docs/architecture/data.md index 9b3bcbb924..b2668dc470 100644 --- a/docs/architecture/data.md +++ b/docs/architecture/data.md @@ -1,12 +1,14 @@ (architecture-data)= # Data pipelines + In general, our data ingest follows versions of the pattern diagrammed below. For an example PR that ingests a brand new data source from scratch, see [data infra PR #2376](https://github.com/cal-itp/data-infra/pull/2376). Some of the key attributes of our approach: -* We generate an [`outcomes`](https://github.com/cal-itp/data-infra/blob/main/packages/calitp-data-infra/calitp_data_infra/storage.py#L418) file describing whether scrape, parse, or validate operations were successful. This makes operation outcomes visible in BigQuery, so they can be analyzed (for example: how long has the download operation for X feed been failing?) 
-* We try to limit the amount of manipulation in Airflow tasks to the bare minimum to make the data legible to BigQuery (for example, replace illegal column names that would break the external tables.) We use gzipped JSONL files in GCS as our default parsed data format. -* [External tables](https://cloud.google.com/bigquery/docs/external-data-sources#external_tables) provide the interface between ingested data and BigQuery modeling/transformations. + +- We generate an [`outcomes`](https://github.com/cal-itp/data-infra/blob/main/packages/calitp-data-infra/calitp_data_infra/storage.py#L418) file describing whether scrape, parse, or validate operations were successful. This makes operation outcomes visible in BigQuery, so they can be analyzed (for example: how long has the download operation for X feed been failing?) +- We try to limit the amount of manipulation in Airflow tasks to the bare minimum to make the data legible to BigQuery (for example, replace illegal column names that would break the external tables.) We use gzipped JSONL files in GCS as our default parsed data format. +- [External tables](https://cloud.google.com/bigquery/docs/external-data-sources#external_tables) provide the interface between ingested data and BigQuery modeling/transformations. While many of the key elements of our architecture are common to most of our data sources, each data source has some unique aspects as well. [This spreadsheet](https://docs.google.com/spreadsheets/d/1bv1K5lZMnq1eCSZRy3sPd3MgbdyghrMl4u8HvjNjWPw/edit#gid=0) details overviews by data source, outlining the specific code/resources that correspond to each step in the general data flow shown below. diff --git a/docs/architecture/services.md b/docs/architecture/services.md index a0d09fb4fe..f417c46488 100644 --- a/docs/architecture/services.md +++ b/docs/architecture/services.md @@ -2,15 +2,14 @@ Here is a list of services that are deployed as part of the Cal-ITP project. -| Name | Function | URL | Source code | K8s namespace | Development/test environment? | Type? -|------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|-----------------------------------------------------------------------------------------------------|--------------------|-------------------------------|--------------------| +| Name | Function | URL | Source code | K8s namespace | Development/test environment? | Type? 
| +| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------- | --------------------------------------------------------------------------------------------------- | ------------------ | ----------------------------- | -------------------------- | | Airflow | General orchestation/automation platform; downloads non-GTFS Realtime data and orchestrates data transformations outside of dbt; executes stateless jobs such as dbt and data publishing | https://o1d2fa0877cf3fb10p-tp.appspot.com/home | https://github.com/cal-itp/data-infra/tree/main/airflow | n/a | Yes (local) | Infrastructure / Ingestion | -| GTFS-RT Archiver | Downloads GTFS Realtime data (more rapidly than Airflow can handle) | n/a | https://github.com/cal-itp/data-infra/tree/main/services/gtfs-rt-archiver-v3 | gtfs-rt-v3 | Yes (gtfs-rt-v3-test) | Ingestion | -| Metabase | Web-hosted BI tool | https://dashboards.calitp.org | https://github.com/cal-itp/data-infra/tree/main/kubernetes/apps/charts/metabase | metabase | Yes (metabase-test) | Analysis | -| Grafana | Application observability (i.e. monitoring and alerting on metrics) | https://monitoring.calitp.org | https://github.com/JarvusInnovations/cluster-template/tree/develop/k8s-common/grafana (via hologit) | monitoring-grafana | No | Infrastructure | | -| Sentry | Application error observability (i.e. collecting errors for investigation) | https://sentry.calitp.org | https://github.com/cal-itp/data-infra/tree/main/kubernetes/apps/charts/sentry | sentry | No | Infrastructure | -| JupyterHub | Kubernetes-driven Jupyter workspace provider | https://notebooks.calitp.org | https://github.com/cal-itp/data-infra/tree/main/kubernetes/apps/charts/jupyterhub | jupyterhub | No | Analysis | - +| GTFS-RT Archiver | Downloads GTFS Realtime data (more rapidly than Airflow can handle) | n/a | https://github.com/cal-itp/data-infra/tree/main/services/gtfs-rt-archiver-v3 | gtfs-rt-v3 | Yes (gtfs-rt-v3-test) | Ingestion | +| Metabase | Web-hosted BI tool | https://dashboards.calitp.org | https://github.com/cal-itp/data-infra/tree/main/kubernetes/apps/charts/metabase | metabase | Yes (metabase-test) | Analysis | +| Grafana | Application observability (i.e. monitoring and alerting on metrics) | https://monitoring.calitp.org | https://github.com/JarvusInnovations/cluster-template/tree/develop/k8s-common/grafana (via hologit) | monitoring-grafana | No | Infrastructure | +| Sentry | Application error observability (i.e. collecting errors for investigation) | https://sentry.calitp.org | https://github.com/cal-itp/data-infra/tree/main/kubernetes/apps/charts/sentry | sentry | No | Infrastructure | +| JupyterHub | Kubernetes-driven Jupyter workspace provider | https://notebooks.calitp.org | https://github.com/cal-itp/data-infra/tree/main/kubernetes/apps/charts/jupyterhub | jupyterhub | No | Analysis | ## Code and deployments (unless otherwise specified, deployments occur via GitHub Actions) diff --git a/docs/contribute/content_types.md b/docs/contribute/content_types.md index 24b2bc1038..f531d35f8a 100644 --- a/docs/contribute/content_types.md +++ b/docs/contribute/content_types.md @@ -1,29 +1,38 @@ (content-types)= + # Common Content + On this page you can find some of the common content types used in the Cal-ITP Data Services Documentation. 
Although the ecosystem we use, Jupyter Book, allows flexibility, the pages in our docs are typically generated in the formats below. If you haven't yet, navigate to the [Best Practices](bp-reference) section of the documentation for more context on our docs management, and the [Submitting Changes](submitting-changes) section for how to contribute. ## File Types -* Markdown (`.md`) -* Jupyter Notebooks (`.ipynb`) -* Images less than 500kb (`.png` preferred) + +- Markdown (`.md`) +- Jupyter Notebooks (`.ipynb`) +- Images less than 500kb (`.png` preferred) ## Content Syntax - Resources -* [MyST](https://jupyterbook.org/reference/cheatsheet.html) - a flavor of Markdown used by Jupyter Book for `md` documents -* [Jupyter Notebook Markdown](https://jupyterbook.org/file-types/notebooks.html) - Markdown for use in `.ipynb` documents + +- [MyST](https://jupyterbook.org/reference/cheatsheet.html) - a flavor of Markdown used by Jupyter Book for `md` documents +- [Jupyter Notebook Markdown](https://jupyterbook.org/file-types/notebooks.html) - Markdown for use in `.ipynb` documents ## Common Content - Examples + Below we've provided some examples of commons types of content for quick use. To find more detailed information and extended examples use the links above under `Allowable Syntax - Resources` + 1. [Images](adding-images) 2. [Executing Code](executing-code) - * [Python](executing-code-python) - * [SQL](executing-code-sql) + - [Python](executing-code-python) + - [SQL](executing-code-sql) 3. [Non-executing Code](non-executing-code) 4. [Internal References and Cross References](internal-refs) -(executing-code)= + (executing-code)= + ### Executing Code + Place the following syntax at the top of a `.md` document to include code that will execute. + ``` --- jupytext: @@ -44,48 +53,63 @@ kernelspec: To create the actual code block: (executing-code-python)= **Python** -``` + +```` ```{code-cell} Sample Code ``` -``` +```` + (executing-code-sql)= **SQL** To run SQL within the Jupyter Book we are using an iPython wrapper called `cell Magics` with `%%sql`. + ```python import calitp_data_analysis.magics ``` -``` + +```` ```{code-cell} %%sql Sample SQL Here ``` -``` +```` + You can visit [this page](https://jupyterbook.org/content/code-outputs.html) for more information on how to format code outputs. (non-executing-code)= + ### Non-Executing Code + Non-executing code is formatted similarly to the executing code above, but replaces `{code-cell}` with the name of the language you would like to represent, as seen below, to provide syntax highlighting. -``` + +```` ```python Sample Code ``` -``` -``` +```` + +```` ```sql Sample Code ``` -``` +```` + (adding-images)= + ### Images + Images are currently being stored in an `assets` folder within each `docs` folder. Preference is for `.png` file extension and no larger than `500kb`. Images can be loaded into Jupyter Book by using the following syntax: ``` ![Collection Matrix](assets/your-file-name.png) ``` + (internal-refs)= + ### Internal References and Cross-References + Referencing within the documentation can be accomplished quickly with `labels` and `markdown link syntax`. **Note**: be sure to make reference names unique. If a reference has the same name as a file name, for example, the build process will fail. 
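(For the image guidance on this page — `.png` preferred, under 500kb — a quick local check before committing can save a review round-trip. The snippet below is only an illustrative sketch using the Python standard library; the path is a placeholder for whichever `assets` folder you are adding to.)

```python
import os

MAX_BYTES = 500 * 1024  # ~500kb ceiling suggested above for docs images

# Placeholder path: point this at the image you plan to add to an `assets` folder.
image_path = "docs/analytics_tools/assets/your-file-name.png"

size = os.path.getsize(image_path)
print(f"{image_path}: {size / 1024:.0f} KB")
if size > MAX_BYTES:
    print("Consider compressing or resizing this image before committing.")
```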
diff --git a/docs/contribute/contribute-best-practices.md b/docs/contribute/contribute-best-practices.md index 9aa63913a2..a37448d8e0 100644 --- a/docs/contribute/contribute-best-practices.md +++ b/docs/contribute/contribute-best-practices.md @@ -1,43 +1,58 @@ (bp-reference)= + # Best Practices + This page aggregates best practices and helpful information for use when contributing to our documentation. Our Cal-ITP Data Services Documentation uses the Jupyter Book ecosystem to generate our docs. You can find their full resources at this link: [Jupyter Book Documentation](https://jupyterbook.org/intro.html). + 1. [Universal Rules](universal-rules) 2. [Guidelines by Contribution Type](guidelines-by-contribution) - * [Small Changes](small-changes) - * [New Sections (Headers)](new-sections) - * [New Pages and Chapters](new-pages) + - [Small Changes](small-changes) + - [New Sections (Headers)](new-sections) + - [New Pages and Chapters](new-pages) (universal-rules)= + ## Universal Rules + There are a few things that are true for all files and content types. Here is a short list: -* **Files must have a title.** Generally this means that they must begin with a line that starts with a single # -* **Use only one top-level header.** Because each page must have a clear title, it must also only have one top-level header. You cannot have multiple headers with single # tag in them. -* **Headers should increase linearly.** If you’re inside of a section with one #, then the next nested section should start with ##. Avoid jumping straight from # to ###. + +- **Files must have a title.** Generally this means that they must begin with a line that starts with a single # +- **Use only one top-level header.** Because each page must have a clear title, it must also only have one top-level header. You cannot have multiple headers with single # tag in them. +- **Headers should increase linearly.** If you’re inside of a section with one #, then the next nested section should start with ##. Avoid jumping straight from # to ###. (guidelines-by-contribution)= + ## Guidelines by Contribution Size + Read below for guidance based on the size of your contribution. When you are ready to make changes, visit the [Submitting Changes](submitting-changes) section for how to contribute, and utilize the [Common Content](content-types) section for information on adding specific types of content. (small-changes)= + ### Small Changes + For small changes such as typos, clarification, or changes within existing content you can reference the [Common Content](content-types) section as needed. (new-sections)= + ### New Sections (Headers) + If you feel a new section is warranted, make sure you follow Jupyter Book's guidelines on headers: > **Headers should increase linearly.** If you’re inside of a section with one #, then the next nested section should start with ##. Avoid jumping straight from # to ###. (new-pages)= + ### New Pages and Chapters + Add new pages and chapters only as truly needed. If you are adding new pages or chapters, you will need to also update the `_toc.yml` file. You can find more information at Jupyter Book's resource [Structure and organize content](https://jupyterbook.org/basics/organize.html). You will also need to follow Jupyter Book's guidelines for when adding files: ->**Files must have a title.** Generally this means that they must begin with a line that starts with a single # ->**Use only one top-level header.** Because each page must have a clear title, it must also only have one top-level header. 
You cannot have multiple headers with single # tag in them. +> **Files must have a title.** Generally this means that they must begin with a line that starts with a single # + +> **Use only one top-level header.** Because each page must have a clear title, it must also only have one top-level header. You cannot have multiple headers with single # tag in them. diff --git a/docs/contribute/overview.md b/docs/contribute/overview.md index 755c38f107..09f59f1227 100644 --- a/docs/contribute/overview.md +++ b/docs/contribute/overview.md @@ -1,4 +1,5 @@ (contribute-overview)= + # Getting Started The pages in this section outline the conventions we follow for making changes to our documentation as well as options for including various forms of content. @@ -6,6 +7,6 @@ The pages in this section outline the conventions we follow for making changes t Contributing to our docs is encouraged! If you see content that needs updating or recognize missing information please use the information found in this chapter to contribute. -* [Best Practices](bp-reference) -* [Submitting Changes](submitting-changes) -* [Common Content](content-types) +- [Best Practices](bp-reference) +- [Submitting Changes](submitting-changes) +- [Common Content](content-types) diff --git a/docs/contribute/submitting_changes.md b/docs/contribute/submitting_changes.md index 9e5b84c09d..fcaf9c8b7e 100644 --- a/docs/contribute/submitting_changes.md +++ b/docs/contribute/submitting_changes.md @@ -1,74 +1,93 @@ (submitting-changes)= + # Submitting Changes + 1. [Making Changes and Merging PRs](docs-changes) - * [Using Git (Command Line)](docs-changes-git) - * [Using the GitHub User Interface (Website)](docs-changes-github) + - [Using Git (Command Line)](docs-changes-git) + - [Using the GitHub User Interface (Website)](docs-changes-github) 2. [GitHub Docs Action](docs-gh-action) 3. [How do I preview my docs change?](docs-preview) -(docs-changes)= + (docs-changes)= + ## Making Changes and Merging PRs + There are two common ways to make changes to the docs. For those not used to using Git or the command line use the instructions for the [GitHub website](docs-changes-github). (docs-changes-git)= + ### Using Git (Command Line) -* Follow the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) standard for all commits and PR titles - * Prefix docs commits and PR titles with `docs(subject-area):` -* Installing and Using pre-commit hooks - * Installing: - * `pip install pre-commit` - * `pre-commit install` in the appropriate repo - * Using: - * The hooks will check your markdown for errors when you commit your changes. - * If upon committing your changes you see that the pre-commit checks have failed, the fixes will be made automatically but you will need to **re-add and re-commit the files**. Don't forget to do this! - * If for any reason you would like to bypass the hooks, append the flag `--no-verify` - * If needed, run `pre-commit run --all-files` to run the hooks on all files, not just those staged for changes. -* Open a PR - * Use GitHub's *draft* status to indicate PRs that are not ready for review/merging - * Give your PR a descriptive title that has a prefix of `docs(subject-area):` as according to the Conventional Commits standard **(1)**. - * You will find there is already a template populated in the description area. Scroll to the bottom and use only the portion beneath `Docs changes checklist`. Add description where requested **(2)**. 
- * In the right-hand sidebar add the following **(3)**: - * **Reviewers** This is the person or people who will review and approve your edits to be added to the main codebase. If no one is selected, the docs `CODEOWNER` will be flagged for review. Beyond that, request those who will be affected by changes or those with expertise in relevant subject areas. - * **Assignees** If you're responsible for this work tag yourself here. Also tag any collaborators that you may have. - * **Affix the label** `documentation` to more easily keep track of this work. - * If this work is ready for review, select 'Create pull request'. If more work is required, select 'Create draft pull request' from the dropdown **(4)**. -![Collection Matrix](assets/pr-intro.png) - * Do not use GitHub's "update branch" button or merge the `main` branch back into a PR branch to update it. Instead, rebase PR branches to update them and resolve any merge conflicts. -* Once you have created a PR and it has been reviewed and approved, beyond any requested changes, you will be notified that your work has been merged into the live documentation! -(docs-changes-github)= +- Follow the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) standard for all commits and PR titles + - Prefix docs commits and PR titles with `docs(subject-area):` +- Installing and Using pre-commit hooks + - Installing: + - `pip install pre-commit` + - `pre-commit install` in the appropriate repo + - Using: + - The hooks will check your markdown for errors when you commit your changes. + - If upon committing your changes you see that the pre-commit checks have failed, the fixes will be made automatically but you will need to **re-add and re-commit the files**. Don't forget to do this! + - If for any reason you would like to bypass the hooks, append the flag `--no-verify` + - If needed, run `pre-commit run --all-files` to run the hooks on all files, not just those staged for changes. +- Open a PR + - Use GitHub's *draft* status to indicate PRs that are not ready for review/merging + - Give your PR a descriptive title that has a prefix of `docs(subject-area):` as according to the Conventional Commits standard **(1)**. + - You will find there is already a template populated in the description area. Scroll to the bottom and use only the portion beneath `Docs changes checklist`. Add description where requested **(2)**. + - In the right-hand sidebar add the following **(3)**: + - **Reviewers** This is the person or people who will review and approve your edits to be added to the main codebase. If no one is selected, the docs `CODEOWNER` will be flagged for review. Beyond that, request those who will be affected by changes or those with expertise in relevant subject areas. + - **Assignees** If you're responsible for this work tag yourself here. Also tag any collaborators that you may have. + - **Affix the label** `documentation` to more easily keep track of this work. + - If this work is ready for review, select 'Create pull request'. If more work is required, select 'Create draft pull request' from the dropdown **(4)**. + ![Collection Matrix](assets/pr-intro.png) + - Do not use GitHub's "update branch" button or merge the `main` branch back into a PR branch to update it. Instead, rebase PR branches to update them and resolve any merge conflicts. +- Once you have created a PR and it has been reviewed and approved, beyond any requested changes, you will be notified that your work has been merged into the live documentation! 
+ (docs-changes-github)= + ### Using the GitHub User Interface (Website) + These documents are currently editable on GitHub's website. Read the instructions below to use the GitHub website to make changes. + #### Navigate to GitHub and make changes -* Click the GitHub icon in the top right corner of the page you'd like to edit and choose `Suggest Edit`. -* Make changes on that page with your desired [content types](content-types). -![Collection Matrix](assets/suggest-edit.png) + +- Click the GitHub icon in the top right corner of the page you'd like to edit and choose `Suggest Edit`. +- Make changes on that page with your desired [content types](content-types). + ![Collection Matrix](assets/suggest-edit.png) + #### Commit your changes and create a new branch -* On the page that you've edited, navigate to the bottom and find where it says `Commit changes`. -* Add a short title and description for your changes **(1)**. - * Make sure to prefix the title with `docs(subject-area):` as according to the Conventional Commits standard. -* Select the second option `Create a new branch...` and add a short but descriptive name for this new branch **(2)**. -* Select `Commit Changes`. This will take you to a new page to create a `Pull Request`, the mechanism that will allow your new work to be added to the docs. -![Collection Matrix](assets/commit-screenshot.png) + +- On the page that you've edited, navigate to the bottom and find where it says `Commit changes`. +- Add a short title and description for your changes **(1)**. + - Make sure to prefix the title with `docs(subject-area):` as according to the Conventional Commits standard. +- Select the second option `Create a new branch...` and add a short but descriptive name for this new branch **(2)**. +- Select `Commit Changes`. This will take you to a new page to create a `Pull Request`, the mechanism that will allow your new work to be added to the docs. + ![Collection Matrix](assets/commit-screenshot.png) + #### Create a Pull Request for review and merging -* After committing your changes you will be brought to another page to create a PR. -* Give your PR a descriptive title that has a prefix of `docs(subject-area):` as according to the Conventional Commits standard **(1)**. -* You will find there is already a template populated in the description area. Scroll to the bottom and use only the portion beneath `Docs changes checklist`. Add description where requested **(2)**. -* In the right-hand sidebar add the following **(3)**: - * **Reviewers** This is the person or people who will review and approve your edits to be added to the main codebase. If no one is selected, the docs `CODEOWNER` will be flagged for review. Beyond that, request those who will be affected by changes or those with expertise in relevant subject areas. - * **Assignees** If you're responsible for this work tag yourself here. Also tag any collaborators that you may have. - * **Affix the label** `documentation` to more easily keep track of this work. -* If this work is ready for review, select 'Create pull request'. If more work is required, select 'Create draft pull request' from the dropdown **(4)**. -![Collection Matrix](assets/pr-intro.png) -* Once you have created a PR and it has been reviewed and approved, beyond any requested changes, you will be notified that your work has been merged into the live documentation! -(docs-gh-action)= + +- After committing your changes you will be brought to another page to create a PR. 
+- Give your PR a descriptive title that has a prefix of `docs(subject-area):` as according to the Conventional Commits standard **(1)**. +- You will find there is already a template populated in the description area. Scroll to the bottom and use only the portion beneath `Docs changes checklist`. Add description where requested **(2)**. +- In the right-hand sidebar add the following **(3)**: + - **Reviewers** This is the person or people who will review and approve your edits to be added to the main codebase. If no one is selected, the docs `CODEOWNER` will be flagged for review. Beyond that, request those who will be affected by changes or those with expertise in relevant subject areas. + - **Assignees** If you're responsible for this work tag yourself here. Also tag any collaborators that you may have. + - **Affix the label** `documentation` to more easily keep track of this work. +- If this work is ready for review, select 'Create pull request'. If more work is required, select 'Create draft pull request' from the dropdown **(4)**. + ![Collection Matrix](assets/pr-intro.png) +- Once you have created a PR and it has been reviewed and approved, beyond any requested changes, you will be notified that your work has been merged into the live documentation! + (docs-gh-action)= + ## GitHub Docs Action + ### What is the GitHub Docs Action? + The action is an automated service provided by GitHub that ensures suggested additions are in the proper syntax and facilitates the [preview of your changes](docs-preview). You can see if this action was successful at the bottom of your docs PR. ![Collection Matrix](assets/gh-action.png) ### How is the docs GitHub action triggered? + Our GitHub action is triggered on pushes to the `data-infra` repository related to the `docs` directory. (docs-preview)= + ## How do I preview my docs change? + Once the GitHub action has run and all tests have passed a 'Netlify' preview link will be generated. You can find this link in the comments of your PR. Follow that link to preview your changes. 
![Collection Matrix](assets/netlify-link.png) diff --git a/docs/intro.md b/docs/intro.md index c2dbe4b8ef..458548af3f 100644 --- a/docs/intro.md +++ b/docs/intro.md @@ -6,6 +6,6 @@ This resource serves as an opportunity to learn more about how the data services Use the links below to access dedicated chapters for the following users: -* [Analysts](analysts-welcome) -* [Developers](architecture-overview) -* [Contribute to the Docs!](contribute-overview) +- [Analysts](analysts-welcome) +- [Developers](architecture-overview) +- [Contribute to the Docs!](contribute-overview) diff --git a/docs/kubernetes/JupyterHub.md b/docs/kubernetes/JupyterHub.md index 3ad6295c47..e46bbb5ffb 100644 --- a/docs/kubernetes/JupyterHub.md +++ b/docs/kubernetes/JupyterHub.md @@ -186,9 +186,9 @@ Within the GitHub OAuth application, in Github, the homepage and callback URLs w After the changes have been made to the GitHub OAuth application, the following portions of the JupyterHub chart's `values.yaml` must be changed: - - `hub.config.GitHubOAuthenticator.oauth_callback_url` - - `ingress.hosts` - - `ingress.tls.hosts` +- `hub.config.GitHubOAuthenticator.oauth_callback_url` +- `ingress.hosts` +- `ingress.tls.hosts` Apply these chart changes with: diff --git a/docs/kubernetes/README.md b/docs/kubernetes/README.md index 3a0a9a1a92..c61c565dd5 100644 --- a/docs/kubernetes/README.md +++ b/docs/kubernetes/README.md @@ -14,8 +14,10 @@ kernelspec: --- # Kubernetes -## Cluster Administration ## -### preflight ### + +## Cluster Administration + +### preflight Check logged in user @@ -41,7 +43,7 @@ gcloud config get-value compute/region # gcloud config set compute/region us-west1 ``` -### quick start ### +### quick start ```bash ./kubernetes/gke/cluster-create.sh @@ -50,7 +52,7 @@ export KUBECONFIG=$PWD/kubernetes/gke/kube/admin.yaml kubectl cluster-info ``` -### cluster lifecycle ### +### cluster lifecycle Create the cluster by running `kubernetes/gke/cluster-create.sh`. @@ -67,14 +69,14 @@ environment variable to `kubernetes/gke/kube/admin.yaml`. The cluster can be deleted by running `kubernetes/gke/cluster-delete.sh`. -### nodepool lifecycle ### +### nodepool lifecycle Certain features of node pools are immutable (e.g., machine type); to change such parameters requires creating a new node pool with the desired new values, migrating workloads off of the old node pool, and then deleting the old node pool. The node pool lifecycle scripts help simplify this process. -#### create a new node pool #### +#### create a new node pool Configure a new node pool by adding its name to the `GKE_NODEPOOL_NAMES` array in [`kubernetes/gke/config-nodepool.sh`](https://github.com/cal-itp/data-infra/blob/main/kubernetes/gke/config-nodepool.sh). @@ -85,14 +87,14 @@ Once the new nodepool is configured, it can be stood up by running `kubernetes/g or by simply running `kubernetes/gke/nodepool-up.sh`, which will stand up all configured node pools which do not yet exist. -#### drain and delete an old node pool #### +#### drain and delete an old node pool Once a new nodepool has been created to replace an active node pool, the old node pool must be removed from the `GKE_NODEPOOL_NAMES` array. Once the old node pool is removed from the array, it can be drained and deleted by running `kubernetes/gke/nodepool-down.sh `. -## Deploy Cluster Workloads ## +## Deploy Cluster Workloads Cluster workloads are divided into two classes: @@ -101,7 +103,7 @@ Cluster workloads are divided into two classes: Apps are the workloads that users actually care about. 
-### system workloads ### +### system workloads ```bash kubectl apply -k kubernetes/system @@ -112,7 +114,7 @@ such as an ingress controller, monitoring, logging, etc. The system deploy comma is run at cluster create time, but when new system workloads are added it may need to be run again. -### app: metabase ### +### app: metabase First deploy: diff --git a/docs/publishing/overview.md b/docs/publishing/overview.md index 415322bb03..1046d89c4f 100644 --- a/docs/publishing/overview.md +++ b/docs/publishing/overview.md @@ -1,4 +1,5 @@ (publish-analyses)= + # Where can I publish data? Analysts have a variety of tools available to publish their final @@ -6,19 +7,20 @@ deliverables. With iterative work, analysts can implement certain best practices within these bounds to do as much as is programmatically practical. The workflow will look different depending on these factors: -* Are visualizations static or interactive? -* Does the deliverable need to be updated on a specified frequency or a one-off analysis? -* Is the deliverable format PDF, HTML, interactive dashboard, or a slide deck? +- Are visualizations static or interactive? +- Does the deliverable need to be updated on a specified frequency or a one-off analysis? +- Is the deliverable format PDF, HTML, interactive dashboard, or a slide deck? Analysts can string together a combination of these solutions. These options are listed in increasing order of complexity and therefore capability. -* [Static visualizations](publishing-static-files) can be inserted directly + +- [Static visualizations](publishing-static-files) can be inserted directly into slide decks (e.g. PNG) or emailed to stakeholders (e.g. HTML or PDF) -* HTML visualizations can be rendered in [GitHub Pages](publishing-github-pages) +- HTML visualizations can be rendered in [GitHub Pages](publishing-github-pages) and embedded as a URL into slide deck -* More advanced HTML-based reports can be hosted in the [analytics portfolio](publishing-analytics-portfolio-site) +- More advanced HTML-based reports can be hosted in the [analytics portfolio](publishing-analytics-portfolio-site) which supports interactivity and notebook parameterization. -* Interactive dashboards should be hosted in [Metabase](publishing-metabase) to +- Interactive dashboards should be hosted in [Metabase](publishing-metabase) to share with external stakeholders. -* Structured tabular data may be published to [CKAN](publishing-ckan) to facilitate usage by analysts, researchers, or other stakeholders. These will be hosted at [https://data.ca.gov](https://data.ca.gov). -* Structured geospatial data may be published to the Caltrans [Geoportal](publishing-geoportal). These will be hosted at [https://https://gis.data.ca.gov](https://gis.data.ca.gov). +- Structured tabular data may be published to [CKAN](publishing-ckan) to facilitate usage by analysts, researchers, or other stakeholders. These will be hosted at [https://data.ca.gov](https://data.ca.gov). +- Structured geospatial data may be published to the Caltrans [Geoportal](publishing-geoportal). These will be hosted at [https://https://gis.data.ca.gov](https://gis.data.ca.gov). 
diff --git a/docs/publishing/sections/1_publishing_principles.md b/docs/publishing/sections/1_publishing_principles.md index 3e72be1288..09b3e20a82 100644 --- a/docs/publishing/sections/1_publishing_principles.md +++ b/docs/publishing/sections/1_publishing_principles.md @@ -1,19 +1,23 @@ (publishing-principles=) + # Data Publishing Principles ## Follow prior art + The [California Open Data Publisher's Handbook](https://docs.data.ca.gov/california-open-data-publishers-handbook/) is the inspiration for much of this process. Its sections include a [pre-publishing checklist (including descriptions of ownership roles)](https://docs.data.ca.gov/california-open-data-publishers-handbook/1.-review-the-pre-publishing-checklist) and [best practices for creating metadata](https://docs.data.ca.gov/california-open-data-publishers-handbook/3.-create-metadata-and-data-dictionary). ## Assume the data must stand on its own + Once out in the wild, we don't really have much control over how data will be used or who may rely on it. The documentation should reflect this; we should include as much information as possible while maintaining backreferences to the data's source. ## Publish the right amount of data + Pick an appropriate subset of the data to publish, based on volume, expected usage, and refresh/update frequency. For example, GTFS Schedule is fairly low volume and slow to change, so updating weekly or monthly is more than diff --git a/docs/publishing/sections/2_static_files.md b/docs/publishing/sections/2_static_files.md index acccaca61c..9d4223154f 100644 --- a/docs/publishing/sections/2_static_files.md +++ b/docs/publishing/sections/2_static_files.md @@ -1,4 +1,5 @@ (publishing-static-files)= + # Static Visualizations Static visualizations should be created in a Jupyter Notebook, saved locally @@ -23,6 +24,7 @@ chart.save(filename = '../my-visualization.png') ``` ## Publishing Reports + Reports can be shared as HTML webpages or PDFs. Standalone HTML pages tend to be self-contained and can be sent via email or similar. diff --git a/docs/publishing/sections/3_github_pages.md b/docs/publishing/sections/3_github_pages.md index 890dcb76e5..a9d6493c59 100644 --- a/docs/publishing/sections/3_github_pages.md +++ b/docs/publishing/sections/3_github_pages.md @@ -1,4 +1,5 @@ (publishing-github-pages)= + # HTML Visualizations Visualizations that benefit from limited interactivity, such as displaying tooltips on hover or zooming in / out and scrolling can be rendered within GitHub pages. @@ -36,20 +37,20 @@ fig.save("../my-visualization.html") ``` ## Rendering Jupyter Notebook as HTML -A single notebook can be converted to HTML using `nbconvert`. If it's a quick analysis in a standalone notebook, sometimes an analyst may choose not to go down the [portfolio method](publishing-analytics-portfolio-site). -* In the terminal: `jupyter nbconvert --to html --no-input --no-prompt` - * `--no-input`: hide code cells - * `--no-prompt`: hide prompts to have all cells vertically aligned -* A longer example of [converting multiple notebooks into HTML pages and uploading to GitHub](https://github.com/cal-itp/data-analyses/blob/main/bus_service_increase/publish_single_report.py) +A single notebook can be converted to HTML using `nbconvert`. If it's a quick analysis in a standalone notebook, sometimes an analyst may choose not to go down the [portfolio method](publishing-analytics-portfolio-site). 
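(If you prefer to drive the same conversion from Python — for example inside a publishing script — a minimal sketch using `nbconvert`'s exporter API is below. The notebook filename is a placeholder, and the traits set here roughly mirror the `--no-input`/`--no-prompt` flags described in the list that follows.)

```python
from nbconvert import HTMLExporter

# Roughly mirrors `--no-input --no-prompt`: hide code cells and the In[ ]/Out[ ] prompts.
exporter = HTMLExporter(
    exclude_input=True,
    exclude_input_prompt=True,
    exclude_output_prompt=True,
)

# Placeholder filename for a notebook in the current directory.
body, _resources = exporter.from_filename("my-visualization.ipynb")

with open("my-visualization.html", "w") as f:
    f.write(body)
```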
+- In the terminal: `jupyter nbconvert --to html --no-input --no-prompt` + - `--no-input`: hide code cells + - `--no-prompt`: hide prompts to have all cells vertically aligned +- A longer example of [converting multiple notebooks into HTML pages and uploading to GitHub](https://github.com/cal-itp/data-analyses/blob/main/bus_service_increase/publish_single_report.py) ## Use GitHub pages to display these HTML pages Analysts should use this only in case of emergencies (missing `netlify` credentials or analysts working outside of our `data-analyses` repo). We prefer launching through our portfolio, which can take single, unparameterized notebooks as well. 1. Go to the repo's [settings](https://github.com/cal-itp/data-analyses/settings) -1. Navigate to `Pages` on the left -1. Change the branch GH pages is sourcing from: `main` to `my-current-branch` -1. Embed the URL into the slides. Example URL: https://docs.calitp.org/data-analyses/PROJECT-FOLDER/MY-VISUALIZATION.html -1. Once a PR is ready and merged, the GH pages can be changed back to source from `main`. The URL is preserved within the slide deck. +2. Navigate to `Pages` on the left +3. Change the branch GH pages is sourcing from: `main` to `my-current-branch` +4. Embed the URL into the slides. Example URL: https://docs.calitp.org/data-analyses/PROJECT-FOLDER/MY-VISUALIZATION.html +5. Once a PR is ready and merged, the GH pages can be changed back to source from `main`. The URL is preserved within the slide deck. diff --git a/docs/publishing/sections/4_analytics_portfolio_site.md b/docs/publishing/sections/4_analytics_portfolio_site.md index be541d8dad..80f9e46c26 100644 --- a/docs/publishing/sections/4_analytics_portfolio_site.md +++ b/docs/publishing/sections/4_analytics_portfolio_site.md @@ -1,4 +1,5 @@ (publishing-analytics-portfolio-site)= + # The Cal-ITP Analytics Portfolio Depending on the complexity of your visualizations, you may want to produce @@ -11,134 +12,149 @@ present in the data-analyses repo is your friend. You can find the Cal-ITP Analytics Portfolio at [analysis.calitp.org](https://analysis.calitp.org). ## Setup + Before executing the build, there are a few prior steps you need to do. 1. Set up netlify key: - * Install netlify: `npm install -g netlify-cli` - * Navigate to your main directory - * Edit your bash profile using Nano: - * In your terminal, enter `nano ~/.bash_profile` to edit. - * Navigate using arrows (down, right, etc) to create 2 new lines. Paste (`CTRL` + `V`) your netlify key in the lines in the following format, each line prefixed with "export" - * `export NETLIFY_AUTH_TOKEN= YOURTOKENHERE123` - * `export NETLIFY_SITE_ID=cal-itp-data-analyses` - * To exit, press `CTRL` + `X` - * Nano will ask if you want to save your changes. Type `Y` to save. - * Type `N` to discard your changes and exit - * For the changes to take effect, open a new terminal or run `source ~/.bash_profile` - * Back in your terminal, enter `env | grep NETLIFY` to see that your Netlify token is there + + - Install netlify: `npm install -g netlify-cli` + - Navigate to your main directory + - Edit your bash profile using Nano: + - In your terminal, enter `nano ~/.bash_profile` to edit. + - Navigate using arrows (down, right, etc) to create 2 new lines. 
Paste (`CTRL` + `V`) your netlify key in the lines in the following format, each line prefixed with "export" + - `export NETLIFY_AUTH_TOKEN= YOURTOKENHERE123` + - `export NETLIFY_SITE_ID=cal-itp-data-analyses` + - To exit, press `CTRL` + `X` + - Nano will ask if you want to save your changes. Type `Y` to save. + - Type `N` to discard your changes and exit + - For the changes to take effect, open a new terminal or run `source ~/.bash_profile` + - Back in your terminal, enter `env | grep NETLIFY` to see that your Netlify token is there 2. Create a `.yml` file in [data-analyses/portfolio/sites](https://github.com/cal-itp/data-analyses/tree/main/portfolio/sites). Each `.yml` file is a site, so if you have separate research topics, they should each have their own `.yml` file. - * This `.yml` file will include the directory to the notebook(s) you want to publish. - * Name your `.yml` file. For now we will use `my_report.yml` as an example. - * The structure of your `.yml` file depends on the type of your analysis: - * If you have one parameterized notebook with **one parameter**: - * Example: [dla.yml](https://github.com/cal-itp/data-analyses/blob/main/portfolio/sites/dla.yml) - - ``` - title: My Analyses - directory: ./my-analyses/ - readme: ./my-analyses/README.md - notebook: ./my-analyses/my-notebook.ipynb - parts: - - caption: Introduction - - chapters: - - params: - district_parameter: 1 - district_title: District 1 - ``` - * If you have a parameterized notebook with **multiple parameters**: - * Example: [rt.yml](https://github.com/cal-itp/data-analyses/blob/main/portfolio/sites/rt.yml) - - ``` - title: My Analyses - directory: ./my-analyses/ - readme: ./my-analyses/README.md - notebook: ./my-analyses/my-notebook.ipynb - parts: - - chapters: - - caption: County Name - params: - parameter1_county_name - sections: - - city: parameter2_city_name - - city: parameter2_city_name - ``` - * If you have an individual notebook with **no parameters**: - * Example: [hqta.yml](https://github.com/cal-itp/data-analyses/blob/main/portfolio/sites/hqta.yml) - - ``` - title: My Analyses - directory: ./my-analyses/ - readme: ./my-analyses/README.md - parts: - - caption: Introduction - - chapters: - - notebook: ./my-analyses/notebook_1.ipynb - - notebook: ./my-analyses/notebook_2.ipynb - ``` - - * If you have multiple parameterized notebooks with **the same parameters**: - * Example: [rt_parallel.yml](https://github.com/cal-itp/data-analyses/blob/main/portfolio/rt_parallel.yml) - ``` - title: My Analyses - directory: ./my-analyses/ - readme: ./my-analyses/README.md - parts: - - caption: District Name - - chapters: - - caption: Parameter 1 - params: - itp_id: parameter_1 - sections: §ions - - notebook: ./analysis_1/notebook_1.ipynb - - notebook: ./analysis_2/notebook_2.ipynb - - caption: Parameter 2 - params: - itp_id: parameter_2 - sections: *sections - ``` + - This `.yml` file will include the directory to the notebook(s) you want to publish. + - Name your `.yml` file. For now we will use `my_report.yml` as an example. 
+ - The structure of your `.yml` file depends on the type of your analysis: + - If you have one parameterized notebook with **one parameter**: + + - Example: [dla.yml](https://github.com/cal-itp/data-analyses/blob/main/portfolio/sites/dla.yml) + + ``` + title: My Analyses + directory: ./my-analyses/ + readme: ./my-analyses/README.md + notebook: ./my-analyses/my-notebook.ipynb + parts: + - caption: Introduction + - chapters: + - params: + district_parameter: 1 + district_title: District 1 + ``` + + - If you have a parameterized notebook with **multiple parameters**: + + - Example: [rt.yml](https://github.com/cal-itp/data-analyses/blob/main/portfolio/sites/rt.yml) + + ``` + title: My Analyses + directory: ./my-analyses/ + readme: ./my-analyses/README.md + notebook: ./my-analyses/my-notebook.ipynb + parts: + - chapters: + - caption: County Name + params: + parameter1_county_name + sections: + - city: parameter2_city_name + - city: parameter2_city_name + ``` + + - If you have an individual notebook with **no parameters**: + + - Example: [hqta.yml](https://github.com/cal-itp/data-analyses/blob/main/portfolio/sites/hqta.yml) + + ``` + title: My Analyses + directory: ./my-analyses/ + readme: ./my-analyses/README.md + parts: + - caption: Introduction + - chapters: + - notebook: ./my-analyses/notebook_1.ipynb + - notebook: ./my-analyses/notebook_2.ipynb + ``` + + - If you have multiple parameterized notebooks with **the same parameters**: + + - Example: [rt_parallel.yml](https://github.com/cal-itp/data-analyses/blob/main/portfolio/rt_parallel.yml) + + ``` + title: My Analyses + directory: ./my-analyses/ + readme: ./my-analyses/README.md + parts: + - caption: District Name + - chapters: + - caption: Parameter 1 + params: + itp_id: parameter_1 + sections: §ions + - notebook: ./analysis_1/notebook_1.ipynb + - notebook: ./analysis_2/notebook_2.ipynb + - caption: Parameter 2 + params: + itp_id: parameter_2 + sections: *sections + ``` ## Building and Deploying your Report + ### Build your Report + **Note:** The build command must be run from the root of the repo! + 1. Navigate back to the repo data-analyses and install the portfolio requirements with -`pip install -r portfolio/requirements.txt` + `pip install -r portfolio/requirements.txt` 2. Then run `python portfolio/portfolio.py build my_report` to build your report - * **Note:** `my_report.yml` will be replaced by the name of your `.yml` file in [data-analyses/portfolio/sites](https://github.com/cal-itp/data-analyses/tree/main/portfolio/sites). - * Your build will be located in: `data-analyses/portfolio/my_report/_build/html/index.html` -4. Add the files using `git add` and commit your progress! - + - **Note:** `my_report.yml` will be replaced by the name of your `.yml` file in [data-analyses/portfolio/sites](https://github.com/cal-itp/data-analyses/tree/main/portfolio/sites). + - Your build will be located in: `data-analyses/portfolio/my_report/_build/html/index.html` +3. Add the files using `git add` and commit your progress! ### Deploy your Report 1. Make sure you are in the root of the data-analyses repo: `~/data-analyses` + 2. Run `python portfolio/portfolio.py build my_report --deploy` - * By running `--deploy`, you are deploying the changes to display in the Analytics Portfolio. - * **Note:** The `my_report` will be replaced by the name of your `.yml` file in [data-analyses/portfolio/sites](https://github.com/cal-itp/data-analyses/tree/main/portfolio/sites). 
- * If you have already deployed but want to make changes to the README, run: `python portfolio/portfolio.py build my_report --papermill-no-execute` - * Running this is helpful for larger outputs or if you are updating the README. + + - By running `--deploy`, you are deploying the changes to display in the Analytics Portfolio. + - **Note:** The `my_report` will be replaced by the name of your `.yml` file in [data-analyses/portfolio/sites](https://github.com/cal-itp/data-analyses/tree/main/portfolio/sites). + - If you have already deployed but want to make changes to the README, run: `python portfolio/portfolio.py build my_report --papermill-no-execute` + - Running this is helpful for larger outputs or if you are updating the README. 3. Once this runs, you can check the preview link at the bottom of the output. It should look something like: - * `–no-deploy`: `file:///home/jovyan/data-analyses/portfolio/my_report/_build/html/index.html` - * `–deploy`: `Website Draft URL: https://my-report--cal-itp-data-analyses.netlify.app` + + - `–no-deploy`: `file:///home/jovyan/data-analyses/portfolio/my_report/_build/html/index.html` + - `–deploy`: `Website Draft URL: https://my-report--cal-itp-data-analyses.netlify.app` + 4. Add the files using `git add` and commit! -5. Your notebook should now be displayed in the [Cal-ITP Analytics Portfolio](https://analysis.calitp.org/) +5. Your notebook should now be displayed in the [Cal-ITP Analytics Portfolio](https://analysis.calitp.org/) ### Other Specifications - * You also have the option to specify: run `python portfolio/portfolio.py build --help` to see the following options: - * `--deploy / --no-deploy` - * deploy this component to netlify. - * `--prepare-only / --no-prepare-only` - * Pass-through flag to papermill; if true, papermill will not actually execute cells. - * `--execute-papermill / --no-execute-papermill` - * If false, will skip calls to papermill - * `--no-stderr / --no-no-stderr` - * If true, will clear stderr stream for cell outputs - * `--continue-on-error / --no-continue-on-error` - * Default: no-continue-on-error +- You also have the option to specify: run `python portfolio/portfolio.py build --help` to see the following options: + - `--deploy / --no-deploy` + - deploy this component to netlify. + - `--prepare-only / --no-prepare-only` + - Pass-through flag to papermill; if true, papermill will not actually execute cells. + - `--execute-papermill / --no-execute-papermill` + - If false, will skip calls to papermill + - `--no-stderr / --no-no-stderr` + - If true, will clear stderr stream for cell outputs + - `--continue-on-error / --no-continue-on-error` + - Default: no-continue-on-error ## Adding to the Makefile diff --git a/docs/publishing/sections/5_notebooks_styling.md b/docs/publishing/sections/5_notebooks_styling.md index 87baeb667d..17712216ca 100644 --- a/docs/publishing/sections/5_notebooks_styling.md +++ b/docs/publishing/sections/5_notebooks_styling.md @@ -3,13 +3,14 @@ ## Headers ### Parameterized Titles -* If you're parameterizing the notebook, the first Markdown cell must include parameters to inject. - * Ex: If `district` is one of the parameters in your `sites/my_report.yml`, a header Markdown cell could be `# District {district} Analysis`. - * Note: The site URL is constructed from the original notebook name and the parameter in the JupyterBook build: `0_notebook_name__district_x_analysis.html` + +- If you're parameterizing the notebook, the first Markdown cell must include parameters to inject. 
+ - Ex: If `district` is one of the parameters in your `sites/my_report.yml`, a header Markdown cell could be `# District {district} Analysis`. + - Note: The site URL is constructed from the original notebook name and the parameter in the JupyterBook build: `0_notebook_name__district_x_analysis.html` ### Consecutive Headers -* Headers must move consecutively in Markdown cells. No skipping! +- Headers must move consecutively in Markdown cells. No skipping! ``` # Notebook Title @@ -18,41 +19,45 @@ ### Another subheading ``` -* To get around consecutive headers, you can use `display(HTML())`. +- To get around consecutive headers, you can use `display(HTML())`. - ``` - display(HTML(
"<h3>First Header</h3>")) display(HTML("<h3>Next Header</h3>"
)) - ``` + ``` + display(HTML(
"<h3>First Header</h3>")) display(HTML("<h3>Next Header</h3>"
)) + ``` ### Capturing Parameters -* If you're using a heading, you can either use HTML or capture the parameter and inject. -* HTML - this option works when you run your notebook locally. - ``` - from IPython.display import HTML +- If you're using a heading, you can either use HTML or capture the parameter and inject. - display(HTML(f"
<h3>Header with {variable}</h3>
")) - ``` +- HTML - this option works when you run your notebook locally. -* Capture parameters - this option won't display locally in your notebook (it will still show `{district_number}`), but will be injected with the value when the JupyterBook is built. + ``` + from IPython.display import HTML - In a code cell: - ``` - %%capture_parameters + display(HTML(f"
<h3>Header with {variable}</h3>
")) + ``` - district_number = f"{df.caltrans_district.iloc[0].split('-')[0].strip()}" - ``` +- Capture parameters - this option won't display locally in your notebook (it will still show `{district_number}`), but will be injected with the value when the JupyterBook is built. -
+ In a code cell: - In a Markdown cell: - ``` - ## District {district_number} - ``` + ``` + %%capture_parameters + + district_number = f"{df.caltrans_district.iloc[0].split('-')[0].strip()}" + ``` + +
+ In a Markdown cell: + + ``` + ## District {district_number} + ``` ### Suppress Warnings -* Suppress warnings from displaying in the portfolio site (`shared_utils`). + +- Suppress warnings from displaying in the portfolio site (`shared_utils`). ``` # Include this in the cell where packages are imported @@ -63,12 +68,13 @@ warnings.filterwarnings('ignore') ``` ## Narrative -* Narrative content can be done in Markdown cells or code cells. - * Markdown cells should be used when there are no variables to inject. - * Code cells should be used to write narrative whenever variables constructed from f-strings are used. -* For `papermill`, add a [parameters tag to the code cell](https://papermill.readthedocs.io/en/latest/usage-parameterize.html) - Note: Our portfolio uses a custom `papermill` engine and we can skip this step. -* Markdown cells can inject f-strings if it's plain Markdown (not a heading) using `display(Markdown())` in a code cell. + +- Narrative content can be done in Markdown cells or code cells. + - Markdown cells should be used when there are no variables to inject. + - Code cells should be used to write narrative whenever variables constructed from f-strings are used. +- For `papermill`, add a [parameters tag to the code cell](https://papermill.readthedocs.io/en/latest/usage-parameterize.html) + Note: Our portfolio uses a custom `papermill` engine and we can skip this step. +- Markdown cells can inject f-strings if it's plain Markdown (not a heading) using `display(Markdown())` in a code cell. ``` from IPython.display import Markdown @@ -76,9 +82,9 @@ from IPython.display import Markdown display(Markdown(f"The value of {variable} is {value}.")) ``` -* **Use f-strings to fill in variables and values instead of hard-coding them** - * Turn anything that runs in a loop or relies on a function into a variable. - * Use functions to grab those values for a specific entity (operator, district), rather than hard-coding the values into the narrative. +- **Use f-strings to fill in variables and values instead of hard-coding them** + - Turn anything that runs in a loop or relies on a function into a variable. + - Use functions to grab those values for a specific entity (operator, district), rather than hard-coding the values into the narrative. ``` n_routes = (df[df.calitp_itp_id == itp_id] @@ -101,56 +107,62 @@ display( ) ``` -* Stay away from loops if you need to use headers. - * You will need to create Markdown cells for headers or else JupyterBook will not build correctly. For parameterized notebooks, this is an acceptable trade-off. - * For unparameterized notebooks, you may want use `display(HTML())`. - * Caveat: Using `display(HTML())` means you'll lose the table of contents navigation in the top right corner in the JupyterBook build. +- Stay away from loops if you need to use headers. + - You will need to create Markdown cells for headers or else JupyterBook will not build correctly. For parameterized notebooks, this is an acceptable trade-off. + - For unparameterized notebooks, you may want use `display(HTML())`. + - Caveat: Using `display(HTML())` means you'll lose the table of contents navigation in the top right corner in the JupyterBook build. ## Writing Guide These are a set of principles to adhere to when writing the narrative content in a Jupyter Notebook. Use your best judgment to decide when there are exceptions to these principles. -* Decimals less than 1, always prefix with a 0, for readability. 
- * 0.05, not .05 -* Integers when referencing dates, times, etc - * 2020 for year, not 2020.0 (coerce to int64 or Int64 in `pandas`; Int64 are nullable integers, which allow for NaNs to appear alongside integers) - * 1 hr 20 min, not 1.33 hr (use best judgment to decide what's easier for readers to interpret) -* Round at the end of the analysis. Use best judgment to decide on significant digits. - * Too many decimal places give an air of precision that may not be present. - * Too few decimal places may not give enough detail to distinguish between categories or ranges. - * A good rule of thumb is to start with 1 extra decimal place than what is present in the other columns when deriving statistics (averages, percentiles), and decide from there if you want to round up. - * An average of `$100,000.0` can simply be rounded to `$100,000`. - * An average of 5.2 mi might be left as is. - * National Institutes of Health [Rounding Rules](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4483789/table/ARCHDISCHILD2014) (full [article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4483789/#:~:text=Ideally%20data%20should%20be%20rounded,might%20call%20it%20Goldilocks%20rounding.&text=The%20European%20Association%20of%20Science,2%E2%80%933%20effective%20digits%E2%80%9D.)) - -* Additional references: [American Psychological Association (APA) style](https://apastyle.apa.org/instructional-aids/numbers-statistics-guide.pdf), and [Purdue](https://owl.purdue.edu/owl/research_and_citation/apa_style/apa_formatting_and_style_guide/apa_numbers_statistics.html) +- Decimals less than 1, always prefix with a 0, for readability. + + - 0.05, not .05 + +- Integers when referencing dates, times, etc + + - 2020 for year, not 2020.0 (coerce to int64 or Int64 in `pandas`; Int64 are nullable integers, which allow for NaNs to appear alongside integers) + - 1 hr 20 min, not 1.33 hr (use best judgment to decide what's easier for readers to interpret) + +- Round at the end of the analysis. Use best judgment to decide on significant digits. + + - Too many decimal places give an air of precision that may not be present. + - Too few decimal places may not give enough detail to distinguish between categories or ranges. + - A good rule of thumb is to start with 1 extra decimal place than what is present in the other columns when deriving statistics (averages, percentiles), and decide from there if you want to round up. + - An average of `$100,000.0` can simply be rounded to `$100,000`. + - An average of 5.2 mi might be left as is. + - National Institutes of Health [Rounding Rules](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4483789/table/ARCHDISCHILD2014) (full [article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4483789/#:~:text=Ideally%20data%20should%20be%20rounded,might%20call%20it%20Goldilocks%20rounding.&text=The%20European%20Association%20of%20Science,2%E2%80%933%20effective%20digits%E2%80%9D.)) + +- Additional references: [American Psychological Association (APA) style](https://apastyle.apa.org/instructional-aids/numbers-statistics-guide.pdf), and [Purdue](https://owl.purdue.edu/owl/research_and_citation/apa_style/apa_formatting_and_style_guide/apa_numbers_statistics.html) ## Standard Names -* GTFS data in our warehouse stores information on operators, routes, and stops. -* Analysts should reference the operator name, route name, and Caltrans district the same way across analyses. 
- * ITP ID: 182 is `Metro` (not LA Metro, Los Angeles County Metropolitan Transportation Authority, though those are all correct names for the operator) - * Caltrans District: 7 is `07 - Los Angeles` - * Between `route_short_name`, `route_long_name`, `route_desc`, which one should be used to describe `route_id`? Use `shared_utils.portfolio_utils`, which relies on regular expressions, to select the most human-readable route name. -* Before deploying your portfolio, make sure the operator name you're using is what's used in other analyses in the portfolio. - * Use `shared_utils.portfolio_utils` to help you grab the right names to use. - - ``` - from shared_utils import portfolio_utils - - route_names = portfolio_utils.add_route_name() - - # Merge in the selected route name using route_id - df = pd.merge(df, - route_names, - on = ["calitp_itp_id", "route_id"] - ) +- GTFS data in our warehouse stores information on operators, routes, and stops. +- Analysts should reference the operator name, route name, and Caltrans district the same way across analyses. + - ITP ID: 182 is `Metro` (not LA Metro, Los Angeles County Metropolitan Transportation Authority, though those are all correct names for the operator) + - Caltrans District: 7 is `07 - Los Angeles` + - Between `route_short_name`, `route_long_name`, `route_desc`, which one should be used to describe `route_id`? Use `shared_utils.portfolio_utils`, which relies on regular expressions, to select the most human-readable route name. +- Before deploying your portfolio, make sure the operator name you're using is what's used in other analyses in the portfolio. + - Use `shared_utils.portfolio_utils` to help you grab the right names to use. - agency_names = portfolio_utils.add_agency_name() + ``` + from shared_utils import portfolio_utils - # Merge in the operator's name using calitp_itp_id - df = pd.merge(df, - agency_names, - on = "calitp_itp_id" - ) - ``` + route_names = portfolio_utils.add_route_name() + + # Merge in the selected route name using route_id + df = pd.merge(df, + route_names, + on = ["calitp_itp_id", "route_id"] + ) + + + agency_names = portfolio_utils.add_agency_name() + + # Merge in the operator's name using calitp_itp_id + df = pd.merge(df, + agency_names, + on = "calitp_itp_id" + ) + ``` diff --git a/docs/publishing/sections/6_metabase.md b/docs/publishing/sections/6_metabase.md index 541d20f9ab..840680c243 100644 --- a/docs/publishing/sections/6_metabase.md +++ b/docs/publishing/sections/6_metabase.md @@ -1,4 +1,5 @@ (publishing-metabase)= + # Metabase Interactive charts should be displayed in Metabase. Using Voila on Jupyter Notebooks works locally, but doesn't allow for sharing with external stakeholders. The data cleaning and processing should still be done within Python scripts or Jupyter notebooks. The processed dataset backing the dashboard should be exported to a Google Cloud Storage bucket. diff --git a/docs/publishing/sections/7_gcs.md b/docs/publishing/sections/7_gcs.md index 3f94487f09..20ca914d4b 100644 --- a/docs/publishing/sections/7_gcs.md +++ b/docs/publishing/sections/7_gcs.md @@ -1,4 +1,5 @@ (publishing-gcs)= + # GCS NOTE: If you are planning on publishing to [CKAN](publishing-ckan) and you are diff --git a/docs/publishing/sections/8_ckan.md b/docs/publishing/sections/8_ckan.md index 78ab696e3e..8287b017a5 100644 --- a/docs/publishing/sections/8_ckan.md +++ b/docs/publishing/sections/8_ckan.md @@ -18,11 +18,11 @@ metadata and a data dictionary. 
### Cal-ITP datasets -* [Cal-ITP GTFS-Ingest Pipeline Dataset (schedule data)](https://data.ca.gov/dataset/cal-itp-gtfs-ingest-pipeline-dataset) +- [Cal-ITP GTFS-Ingest Pipeline Dataset (schedule data)](https://data.ca.gov/dataset/cal-itp-gtfs-ingest-pipeline-dataset) ## What is the publication script? -The publication script [publish.py](https://github.com/cal-itp/data-infra/blob/main/warehouse/scripts/publish.py), typically used within the [publish_open_data Airflow workflow](https://o1d2fa0877cf3fb10p-tp.appspot.com/dags/publish_open_data/grid), relies on a [dbt exposure](https://docs.getdbt.com/docs/build/exposures) to determine what to publish - in practice, that exposure is titled `california_open_data`. The tables included in that exposure, their CKAN destinations, and their published descriptions are defined in [_gtfs_schedule_latest.yml](https://github.com/cal-itp/data-infra/blob/main/warehouse/models/mart/gtfs_schedule_latest/_gtfs_schedule_latest.yml) under the `exposures` heading. +The publication script [publish.py](https://github.com/cal-itp/data-infra/blob/main/warehouse/scripts/publish.py), typically used within the [publish_open_data Airflow workflow](https://o1d2fa0877cf3fb10p-tp.appspot.com/dags/publish_open_data/grid), relies on a [dbt exposure](https://docs.getdbt.com/docs/build/exposures) to determine what to publish - in practice, that exposure is titled `california_open_data`. The tables included in that exposure, their CKAN destinations, and their published descriptions are defined in [\_gtfs_schedule_latest.yml](https://github.com/cal-itp/data-infra/blob/main/warehouse/models/mart/gtfs_schedule_latest/_gtfs_schedule_latest.yml) under the `exposures` heading. By default, the columns of a table included in the exposure are _not_ published on the portal. This is to prevent fields that are useful for internal data management but are hard to interpret for public users, like `_is_current`, from being included in the open data portal. Columns meant for publication are explicitly included in publication via the dbt `meta` tag `publish.include: true`, which you can see on various columns of the models in the same YAML file where the exposure itself is defined. diff --git a/docs/publishing/sections/9_geoportal.md b/docs/publishing/sections/9_geoportal.md index cecf3578de..9a1df8f70f 100644 --- a/docs/publishing/sections/9_geoportal.md +++ b/docs/publishing/sections/9_geoportal.md @@ -1,4 +1,5 @@ (publishing-geoportal)= + # Publishing data to California State Geoportal Spatial data cannot be directly published to CKAN. The Geoportal runs on the ESRI ArcGIS Online Hub platform. The Geoportal is synced to the CA open data portal for spatial datasets, and users are able to find the same spatial dataset via [data.ca.gov](https://data.ca.gov) or [gis.data.ca.gov](https://gis.data.ca.gov). @@ -11,16 +12,17 @@ The state of California's ESRI ArcGIS instance is called the [California State G Data is published through the enterprise geodatabase and made available in a variety of formats, including as ArcGIS Hub datasets, geoservices, geojsons, CSVs, KMLs, and shapefiles. ### General Process + 1. Submit the required metadata and data dictionary to the Caltrans GIS team. -1. Set up permissions related to the enterprise geodatabase. Learn more about the ArcGIS Pro [file geodatabase](https://pro.arcgis.com/en/pro-app/latest/help/data/geodatabases/overview/what-is-a-geodatabase-.htm) structure. -1. 
Create the metadata XML to use with each layer of the file geodatabase and update that at the time of each publishing. -1. Sync your local file geodatabase to the enterprise file geodatabase. -1. Open a ticket and the GIS team will sync your latest update to the Geoportal. +2. Set up permissions related to the enterprise geodatabase. Learn more about the ArcGIS Pro [file geodatabase](https://pro.arcgis.com/en/pro-app/latest/help/data/geodatabases/overview/what-is-a-geodatabase-.htm) structure. +3. Create the metadata XML to use with each layer of the file geodatabase and update that at the time of each publishing. +4. Sync your local file geodatabase to the enterprise file geodatabase. +5. Open a ticket and the GIS team will sync your latest update to the Geoportal. ### Cal-ITP data sets 1. CA transit [routes](https://gis.data.ca.gov/datasets/dd7cb74665a14859a59b8c31d3bc5a3e_0) / [stops](https://gis.data.ca.gov/datasets/900992cc94ab49dbbb906d8f147c2a72_0) - simple transformation of GTFS schedule `shapes` and `stops` from tabular to geospatial with minimum data cleaning -1. High quality transit [areas](https://gis.data.ca.gov/datasets/863e61eacbf3463ab239beb3cee4a2c3_0) and [stops](https://gis.data.ca.gov/datasets/f6c30480f0e84be699383192c099a6a4_0) - using GTFS schedule to determine whether corridors are high quality or not according to the California Public Resources Code +2. High quality transit [areas](https://gis.data.ca.gov/datasets/863e61eacbf3463ab239beb3cee4a2c3_0) and [stops](https://gis.data.ca.gov/datasets/f6c30480f0e84be699383192c099a6a4_0) - using GTFS schedule to determine whether corridors are high quality or not according to the California Public Resources Code ### Sample Workflow diff --git a/docs/transit_database/transitdatabase.md b/docs/transit_database/transitdatabase.md index d004461d2f..b62b9dfa57 100644 --- a/docs/transit_database/transitdatabase.md +++ b/docs/transit_database/transitdatabase.md @@ -4,19 +4,20 @@ The Cal-ITP Airtable Transit Database stores key relationships about how transit Important Airtable documentation is maintained elsewhere: -* [Airtable Data Documentation Google Doc](https://docs.google.com/document/d/1KvlYRYB8cnyTOkT1Q0BbBmdQNguK_AMzhSV5ELXiZR4/edit#heading=h.u7y2eosf0i1d) - documentation of specific fields in Airtable -* [California Transit Data - Operating Procedures Google Doc](https://docs.google.com/document/d/1IO8x9-31LjwmlBDH0Jri-uWI7Zygi_IPc9nqd7FPEQM/edit#) - outlines the processes by which Airtable data is maintained +- [Airtable Data Documentation Google Doc](https://docs.google.com/document/d/1KvlYRYB8cnyTOkT1Q0BbBmdQNguK_AMzhSV5ELXiZR4/edit#heading=h.u7y2eosf0i1d) - documentation of specific fields in Airtable +- [California Transit Data - Operating Procedures Google Doc](https://docs.google.com/document/d/1IO8x9-31LjwmlBDH0Jri-uWI7Zygi_IPc9nqd7FPEQM/edit#) - outlines the processes by which Airtable data is maintained In addition, some documentation is available automatically within Airtable (these require Airtable authentication to access): -* Airtable creates an API documentation page for each base (for example, [here is the page for California Transit](https://airtable.com/appPnJWrQ7ui4UmIl/api/docs)). This page provides technical information about field types and relationships. Airtable does not currently have an effective mechanism to programmatically download your data schema (they have paused issuing keys to their metadata API). 
-* When looking at a base, there is an `Extensions` tab at the far upper right corner (below the share, notifications, and user icons). If you click that, an extensions sidebar will open. In that sidebar, there is an extension called `Base schema` (you may have to open it fullscreen to actually see it.) This extension will let you see an auto-generated visualization of the technical relationships among fields in the base. + +- Airtable creates an API documentation page for each base (for example, [here is the page for California Transit](https://airtable.com/appPnJWrQ7ui4UmIl/api/docs)). This page provides technical information about field types and relationships. Airtable does not currently have an effective mechanism to programmatically download your data schema (they have paused issuing keys to their metadata API). +- When looking at a base, there is an `Extensions` tab at the far upper right corner (below the share, notifications, and user icons). If you click that, an extensions sidebar will open. In that sidebar, there is an extension called `Base schema` (you may have to open it fullscreen to actually see it.) This extension will let you see an auto-generated visualization of the technical relationships among fields in the base. Cal-ITP uses two main Airtable bases: -| **Base** | **Description** | -| :------------ | :-------------- | -| [**California Transit**](#california-transit) | Defines key organizational relationships and properties. Organizations, geography, funding programs, transit services, service characteristics, transit datasets such as GTFS, and the intersection between transit datasets and services. -| [**Transit Technology Stacks**](#transit-technology-stacks) | Defines operational setups at transit provider organizations. Defines relationships between vendor organizations, transit provider and operator organizations, products, contracts to provide products, transit stack components, and how they relate to one-another. +| **Base** | **Description** | +| :---------------------------------------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| [**California Transit**](#california-transit) | Defines key organizational relationships and properties. Organizations, geography, funding programs, transit services, service characteristics, transit datasets such as GTFS, and the intersection between transit datasets and services. | +| [**Transit Technology Stacks**](#transit-technology-stacks) | Defines operational setups at transit provider organizations. Defines relationships between vendor organizations, transit provider and operator organizations, products, contracts to provide products, transit stack components, and how they relate to one-another. | The rest of this page outlines stray technical considerations associated with Airtable and its ingestion into the data warehouse. @@ -31,28 +32,32 @@ We ingest data from Airtable into the Cal-ITP data warehouse. For an overview of To ingest a new Airtable table or base and make it available in the warehouse, you need to make updates throughout the data ingest flow, from the Airtable scraper Airflow DAG all the way to dbt mart tables. See [data infra PR #2781](https://github.com/cal-itp/data-infra/pull/2781) for an example of what this can look like. 
Ingesting new columns in an existing table is similar; see [data infra PR #2383](https://github.com/cal-itp/data-infra/pull/2383) for an example. ### Gotchas + Bringing Airtable data into the warehouse can involve a few tricky situations. Here are a few we've encountered so far, with suggested resolutions. #### Foreign keys and bridge tables + Airtable allows users to define links between tables, to create relationships between records of different types. In the Airtable UI, these links display the primary field for the linked record in the relevant column (so, for example, the `Services.provider` column contains an organization's name like `City of Anaheim`.) However, these foreign key links are exported via the Airtable API as an array of the back-end record IDs (so, instead of a single organization name like `City of Anaheim`, that `Services.provider` field will appear as an array containing a record ID, like `[rec0123asdf]`.) It does this even if the given field only ever contains exactly one foreign key (i.e., it turns it into an array even if all the arrays have only one entry.) This means: -* All foreign keys need to be unpacked from arrays in the warehouse to become useful for joins. See below for more on this. -* If a linked field is severed in Airtable (if the foreign key relationship is removed, but the columns that contained the links are not deleted) it can break our data ingest, because these array-type fields will become string-type fields. Ideally, it is best to just delete any associated columns when a foreign key relationship/link is ended. If this is not done and the data ingest does break, the solution is to suppress the broken column from the associated table by removing it from the external table schema. If the external table uses schema auto-detect, you may have to define a schema for the table that does not include the broken column. See [data infra PR #2441](https://github.com/cal-itp/data-infra/pull/2441) for an example of this process (though addressing a different issue.) + +- All foreign keys need to be unpacked from arrays in the warehouse to become useful for joins. See below for more on this. +- If a linked field is severed in Airtable (if the foreign key relationship is removed, but the columns that contained the links are not deleted) it can break our data ingest, because these array-type fields will become string-type fields. Ideally, it is best to just delete any associated columns when a foreign key relationship/link is ended. If this is not done and the data ingest does break, the solution is to suppress the broken column from the associated table by removing it from the external table schema. If the external table uses schema auto-detect, you may have to define a schema for the table that does not include the broken column. See [data infra PR #2441](https://github.com/cal-itp/data-infra/pull/2441) for an example of this process (though addressing a different issue.) Airtable foreign keys in the warehouse also require some special handling because: -* Most Airtable data is treated as dimensions (i.e., entities that we version over time) -* Some Airtable data contains many-to-many relationships +- Most Airtable data is treated as dimensions (i.e., entities that we version over time) +- Some Airtable data contains many-to-many relationships The mechanism that we have used to deal with both of these is the **bridge table**, [described in our dbt docs](https://dbt-docs.calitp.org/#!/overview). 
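As a rough sketch of the idea (the table and column names below are hypothetical stand-ins, not the warehouse's actual models), the pattern is to unpack the array-typed Airtable keys once into a bridge, then traverse the relationship with ordinary joins:

```sql
-- Hypothetical names for illustration only.
-- Step 1: unpack an array-typed Airtable foreign key into one row per key pair.
WITH bridge_organizations_x_services AS (
    SELECT
        organizations.key AS organization_key,
        service_key
    FROM stg_airtable__organizations AS organizations,
        UNNEST(organizations.services) AS service_key
)

-- Step 2: traverse the relationship with plain joins; no arrays involved.
SELECT
    organizations.name AS organization_name,
    services.name AS service_name
FROM dim_organizations AS organizations
INNER JOIN bridge_organizations_x_services AS bridge
    ON organizations.key = bridge.organization_key
INNER JOIN dim_services AS services
    ON bridge.service_key = services.key
```

The real bridge models also carry validity ranges to handle versioning, which this sketch leaves out.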
The bridge table stores the foreign key pairs to allow you to traverse a relationship, instead of trying to store these on each of the tables in the relationship itself. Trying to store the foreign keys on the tables directly opens you up to issues: -* You have to either store the foreign keys as an array or change the cardinality of the table (to account for the fact that one record may need to store multiple foreign keys, either to capture versioning on the foreign table or to capture relationships with multiple records). Metabase does not natively allow unnesting arrays to do joins in the GUI query editor, so we try to have non-array foreign keys in mart tables. -* You risk infinite loops if you try to version a record that includes a versioned foreign key on both sides of the relationship (which is how Airtable stores these relationships). For example, you have an organization and a service that are linked, with both containing a foreign key to the other. An attribute is changed on the service, creating a new versioned key. You need to add that new versioned service key to the organization record. But now that has triggered a change on the organization record, which makes a new versioned key on the organization record. So now you have to update the organization versioned key on the service record. And thus to infinity. Another solution here is to only store the relationship on one side, but then you still have the first problem of arrays and cardinality. +- You have to either store the foreign keys as an array or change the cardinality of the table (to account for the fact that one record may need to store multiple foreign keys, either to capture versioning on the foreign table or to capture relationships with multiple records). Metabase does not natively allow unnesting arrays to do joins in the GUI query editor, so we try to have non-array foreign keys in mart tables. +- You risk infinite loops if you try to version a record that includes a versioned foreign key on both sides of the relationship (which is how Airtable stores these relationships). For example, you have an organization and a service that are linked, with both containing a foreign key to the other. An attribute is changed on the service, creating a new versioned key. You need to add that new versioned service key to the organization record. But now that has triggered a change on the organization record, which makes a new versioned key on the organization record. So now you have to update the organization versioned key on the service record. And thus to infinity. Another solution here is to only store the relationship on one side, but then you still have the first problem of arrays and cardinality. Bridge tables do introduce some complexity in handling fanout from joins, but they remove that complexity from the dimension tables themselves. Another solution would be to only store the unversioned natural key for the foreign key, in which case you would only need bridge tables for true many-to-many relationships (to handle the array/cardinality issue), but that would still create fanout without the explicit artifact of the bridge table to help troubleshoot. #### Synced tables + Airtable allows you to "sync" a table from one base to another, where it appears with all the data from its source location and can be linked to records in the second base. 
An example in our Airtable is the `California Transit.organizations` table is synced to `Transit Technology Stacks.organizations`; you will see a little lightning icon to show that it is a synced table. This requires special handling when importing to the warehouse, because Airtable assigns new back-end record IDs in the synced table, which means that foreign keys to the synced table in the second base will not match record IDs in the source table. We resolve this by mapping all foreign keys to point to the source table in a base layer in dbt. See [data infra PR #2781](https://github.com/cal-itp/data-infra/pull/2781) for an example. diff --git a/docs/warehouse/adding_oneoff_data.md b/docs/warehouse/adding_oneoff_data.md index 3b9c85af11..9e6f065911 100644 --- a/docs/warehouse/adding_oneoff_data.md +++ b/docs/warehouse/adding_oneoff_data.md @@ -1,5 +1,7 @@ (adding-data-to-warehouse)= + # Adding Ad-Hoc Data to the Warehouse + To work with data in our BI tool ([Metabase](https://dashboards.calitp.org/)) we first have to add the data to our warehouse ([BigQuery](https://console.cloud.google.com/bigquery)). This page describes how to do an ad-hoc, one-time import of a dataset (for example, an individual extract from some other system.) ```{warning} @@ -13,22 +15,25 @@ To add one-time data to BigQuery for use in Metabase follow the instructions bel 2. Next, navigate to a [JupyterLab](https://notebooks.calitp.org/) terminal window. 3. Once in the terminal, input the following command with the appropriate structure: + ``` bq --location=us-west2 load --autodetect ``` -* The **``** specifies the type of file you would like to use. An example of this flag's use is `--source_format=CSV`. Other options include `PARQUET` and `NEWLINE_DELIMITED_JSON` +- The **``** specifies the type of file you would like to use. An example of this flag's use is `--source_format=CSV`. Other options include `PARQUET` and `NEWLINE_DELIMITED_JSON` -* The **``** is the table you would like to create, or append to if the table already exists. Your uploaded table destination should always be the `uploaded_data` dataset in BigQuery (e.g. the `destination_table` name should always have the format `uploaded_data.your_new_table_name`). - * If you are looking to **create a new table**: use a new table name - * If you are looking to **append to existing data**: re-use the name of the existing table - * If you are looking to **replace an existing table**: use the `--replace` flag after the `load` command +- The **``** is the table you would like to create, or append to if the table already exists. Your uploaded table destination should always be the `uploaded_data` dataset in BigQuery (e.g. the `destination_table` name should always have the format `uploaded_data.your_new_table_name`). -* The **``** argument is the `gsutil URI` (the path to the Google Cloud Storage bucket you are sourcing from). + - If you are looking to **create a new table**: use a new table name + - If you are looking to **append to existing data**: re-use the name of the existing table + - If you are looking to **replace an existing table**: use the `--replace` flag after the `load` command -* If you run into upload errors related to the source file format, you may need to include the flag `--allow_quoted_newlines`. This may be helpful in resolving errors related to newline-delimited text, which may be present in file conversions from Excel to CSV. +- The **``** argument is the `gsutil URI` (the path to the Google Cloud Storage bucket you are sourcing from). 
+ +- If you run into upload errors related to the source file format, you may need to include the flag `--allow_quoted_newlines`. This may be helpful in resolving errors related to newline-delimited text, which may be present in file conversions from Excel to CSV. Ex. + ``` bq --location=us-west2 load --source_format=CSV --autodetect --allow_quoted_newlines uploaded_data.tircp_with_temporary_expenditure_sol_copy gs://calitp-analytics-data/data-analyses/tircp/tircp.csv ``` diff --git a/docs/warehouse/developing_dbt_models.md b/docs/warehouse/developing_dbt_models.md index ad401bd4bf..1f076822f3 100644 --- a/docs/warehouse/developing_dbt_models.md +++ b/docs/warehouse/developing_dbt_models.md @@ -6,10 +6,10 @@ Information related to contributing to the [Cal-ITP dbt project](https://github. ## Resources -* If you have questions specific to our project or you encounter any issues when developing, please bring those questions to the [`#data-warehouse-devs`](https://cal-itp.slack.com/archives/C050ZNDUL21) or [`#data-office-hours`](https://cal-itp.slack.com/archives/C02KH3DGZL7) Cal-ITP Slack channels. Working through questions "in public" helps build shared knowledge that's searchable later on. -* For Cal-ITP-specific data warehouse documentation, including high-level concepts and naming conventions, see [our Cal-ITP dbt documentation site](https://dbt-docs.calitp.org/#!/overview). This documentation is automatically generated by dbt, and incorporates the table- and column-level documentation that developers enter in YAML files in the dbt project. -* For general dbt concepts (for example, [models](https://docs.getdbt.com/docs/build/models), dbt [Jinja](https://docs.getdbt.com/guides/advanced/using-jinja) or [tests](https://docs.getdbt.com/docs/build/tests)), see the [general dbt documentation site](https://docs.getdbt.com/docs/introduction). -* For general SQL or BigQuery concepts (for example, [tables](https://cloud.google.com/bigquery/docs/tables-intro), [views](https://cloud.google.com/bigquery/docs/views-intro), or [window functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls)), see the [BigQuery docs site](https://cloud.google.com/bigquery/docs). +- If you have questions specific to our project or you encounter any issues when developing, please bring those questions to the [`#data-warehouse-devs`](https://cal-itp.slack.com/archives/C050ZNDUL21) or [`#data-office-hours`](https://cal-itp.slack.com/archives/C02KH3DGZL7) Cal-ITP Slack channels. Working through questions "in public" helps build shared knowledge that's searchable later on. +- For Cal-ITP-specific data warehouse documentation, including high-level concepts and naming conventions, see [our Cal-ITP dbt documentation site](https://dbt-docs.calitp.org/#!/overview). This documentation is automatically generated by dbt, and incorporates the table- and column-level documentation that developers enter in YAML files in the dbt project. +- For general dbt concepts (for example, [models](https://docs.getdbt.com/docs/build/models), dbt [Jinja](https://docs.getdbt.com/guides/advanced/using-jinja) or [tests](https://docs.getdbt.com/docs/build/tests)), see the [general dbt documentation site](https://docs.getdbt.com/docs/introduction). 
+- For general SQL or BigQuery concepts (for example, [tables](https://cloud.google.com/bigquery/docs/tables-intro), [views](https://cloud.google.com/bigquery/docs/views-intro), or [window functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls)), see the [BigQuery docs site](https://cloud.google.com/bigquery/docs). ## How to contribute to the dbt project @@ -40,9 +40,9 @@ Because the warehouse is collectively maintained and changes can affect a variet For an example of working with dbt in JupyterHub, see the recording of the [original onboarding call in April 2023 (requires Cal-ITP Google Drive access).](https://drive.google.com/file/d/1NDh_4u0-ROsH0w8J3Z1ccn_ICLAHtDhX/view?usp=drive_link) A few notes on this video: -* The documentation shown is an older version of this docs page; the information shared verbally is correct but the page has been updated. -* The bug encountered towards the end of the video (that prevented us from running dbt tests) has been fixed. -* The code owners mentioned in the video have changed; consult in Slack for process guidance. +- The documentation shown is an older version of this docs page; the information shared verbally is correct but the page has been updated. +- The bug encountered towards the end of the video (that prevented us from running dbt tests) has been fixed. +- The code owners mentioned in the video have changed; consult in Slack for process guidance. (modeling-considerations)= @@ -100,9 +100,9 @@ Here is a series of recordings showing a workflow for debugging a failing dbt te Usually, bugs are caused by: -* New or historical data issues. For example, an agency may be doing something in their GTFS data that we didn't expect and this may have broken one of our models. This can happen with brand new data that is coming in or in historical data that wasn't included in local testing (this is especially relevant for RT data, where local testing usually includes a very small subset of the full data.) -* GTFS or data schema bugs. Sometimes we may have misinterpreted the GTFS spec (or another incoming data model) and modeled something incorrectly. -* SQL bugs. Sometimes we may have written SQL incorrectly (for example, used the wrong kind of join.) +- New or historical data issues. For example, an agency may be doing something in their GTFS data that we didn't expect and this may have broken one of our models. This can happen with brand new data that is coming in or in historical data that wasn't included in local testing (this is especially relevant for RT data, where local testing usually includes a very small subset of the full data.) +- GTFS or data schema bugs. Sometimes we may have misinterpreted the GTFS spec (or another incoming data model) and modeled something incorrectly. +- SQL bugs. Sometimes we may have written SQL incorrectly (for example, used the wrong kind of join.) How to investigate the bug depends on how the bug was noticed. @@ -122,14 +122,14 @@ In either case, you may need to consider upstream models. To identify your model Changes to dbt models are likely to be appropriate when one or more of the following is true: -* There is a consistent or ongoing need for this data. dbt can ensure that transformations are performed consistently at scale, every day. -* The data is big. Doing transformations in BigQuery can be more performant than doing them in notebooks or any workflow where the large data must be loaded into local memory. -* We want to use the same data across multiple domains or tools. 
The BigQuery data warehouse is the easiest way to provide consistent data throughout the Cal-ITP data ecosystem (in JupyterHub, Metabase, open data publishing, the reports site, etc.) +- There is a consistent or ongoing need for this data. dbt can ensure that transformations are performed consistently at scale, every day. +- The data is big. Doing transformations in BigQuery can be more performant than doing them in notebooks or any workflow where the large data must be loaded into local memory. +- We want to use the same data across multiple domains or tools. The BigQuery data warehouse is the easiest way to provide consistent data throughout the Cal-ITP data ecosystem (in JupyterHub, Metabase, open data publishing, the reports site, etc.) dbt models may not be appropriate when: -* You are doing exploratory data analysis, especially on inconsistently-constructed data. It will almost always be faster to do initial exploration of data via Jupyter/Python than in SQL. If you only plan to use the data for a short period of time, or plan to reshape it many speculatively before you settle on a more long-lived form, you probably don't need to represent it with a dbt model quite yet. -* You want to apply a simple transformation (for example, a grouped summary or filter) to answer a specific question. In this case, it may be more appropriate to create a Metabase dashboard with the desired transformations. +- You are doing exploratory data analysis, especially on inconsistently-constructed data. It will almost always be faster to do initial exploration of data via Jupyter/Python than in SQL. If you only plan to use the data for a short period of time, or plan to reshape it many speculatively before you settle on a more long-lived form, you probably don't need to represent it with a dbt model quite yet. +- You want to apply a simple transformation (for example, a grouped summary or filter) to answer a specific question. In this case, it may be more appropriate to create a Metabase dashboard with the desired transformations. (model-grain)= @@ -145,7 +145,7 @@ This concept of grain can be one of the biggest differences between notebook-bas If there is already a model with the grain you are targeting, you should almost always add new columns to that existing model rather than making a new model with the same grain. -``` {admonition} Example: fct_scheduled_trips +```{admonition} Example: fct_scheduled_trips Consider [`fct_scheduled_trips`](https://dbt-docs.calitp.org/#!/model/model.calitp_warehouse.fct_scheduled_trips). This is our core trip-level table. Every scheduled trip should have a row in this model and attributes that you might want from that trip should be present for easy access. As a result, this table has a lot of columns, because when we need new information about trips, we add it here. For example, when we wanted to fix time zone handling for trips, we [added those columns](https://github.com/cal-itp/data-infra/pull/2457) instead of creating a new model. ``` @@ -165,27 +165,27 @@ If you find yourself making big changes that seem likely to significantly affect Here are a few example `data-infra` PRs that fixed past bugs: -* [PR #2076](https://github.com/cal-itp/data-infra/pull/2076) fixed two bugs: There was a hardcoded incorrect value in our SQL that was causing Sundays to not appear in our scheduled service index (SQL syntax bug), and there was a bug in how we were handling the relationship between `calendar_dates` and `calendar` (GTFS logic bug). 
-* [PR #2623](https://github.com/cal-itp/data-infra/pull/2623) fixed bugs caused by unexpected calendar data from a producer. +- [PR #2076](https://github.com/cal-itp/data-infra/pull/2076) fixed two bugs: There was a hardcoded incorrect value in our SQL that was causing Sundays to not appear in our scheduled service index (SQL syntax bug), and there was a bug in how we were handling the relationship between `calendar_dates` and `calendar` (GTFS logic bug). +- [PR #2623](https://github.com/cal-itp/data-infra/pull/2623) fixed bugs caused by unexpected calendar data from a producer. #### Example new column PRs Here are a few example `data-infra` PRs that added columns to existing models: -* [PR #2778](https://github.com/cal-itp/data-infra/pull/2778) is a simple example of adding a column that already exists in staging to a mart table. -* For intermediate examples of adding a column in a staging table and propagating it through a few different downstream models, see - * [PR #2768](https://github.com/cal-itp/data-infra/pull/2768) - * [PR #2601](https://github.com/cal-itp/data-infra/pull/2686) -* [PR #2383](https://github.com/cal-itp/data-infra/pull/2383) adds a column to Airtable data end-to-end (starting from the raw data/external tables; this involves non-dbt code). +- [PR #2778](https://github.com/cal-itp/data-infra/pull/2778) is a simple example of adding a column that already exists in staging to a mart table. +- For intermediate examples of adding a column in a staging table and propagating it through a few different downstream models, see + - [PR #2768](https://github.com/cal-itp/data-infra/pull/2768) + - [PR #2601](https://github.com/cal-itp/data-infra/pull/2686) +- [PR #2383](https://github.com/cal-itp/data-infra/pull/2383) adds a column to Airtable data end-to-end (starting from the raw data/external tables; this involves non-dbt code). #### Example new model PRs Here are a few `data-infra` PRs that created brand new models: -* [PR #2686](https://github.com/cal-itp/data-infra/pull/2686) created a new model based on existing warehouse data. -* For examples of adding models to dbt end-to-end (starting from raw data/external tables; this involves non-dbt code), see: - * [PR #2509](https://github.com/cal-itp/data-infra/pull/2509) - * [PR #2781](https://github.com/cal-itp/data-infra/pull/2781) +- [PR #2686](https://github.com/cal-itp/data-infra/pull/2686) created a new model based on existing warehouse data. +- For examples of adding models to dbt end-to-end (starting from raw data/external tables; this involves non-dbt code), see: + - [PR #2509](https://github.com/cal-itp/data-infra/pull/2509) + - [PR #2781](https://github.com/cal-itp/data-infra/pull/2781) (test-changes)= @@ -205,102 +205,102 @@ What to test/check will vary based on what you're doing, but below are some exam Are the values in your column/model what you expect? For example, are there nulls? Does the column have all the values you anticipated (for example, if you have a day of the week column, is data from all 7 days present)? If it's numeric, what are the minimum and maximum values; do they make sense (for example, if you have a percentage column, is it always between 0 and 100)? What is the most common value? 
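For instance, a filled-in version of the kinds of checks listed below might look like this; the model and column names are placeholders to swap for your own:

```sql
-- Placeholder model and column names; substitute your own.
-- Nulls, value range, and most common values in a single pass.
SELECT
    day_of_week,
    COUNT(*) AS ct,
    COUNTIF(pct_trips_complete IS NULL) AS n_null_pct,
    MIN(pct_trips_complete) AS min_pct,
    MAX(pct_trips_complete) AS max_pct
FROM my_new_daily_model
GROUP BY 1
ORDER BY ct DESC
```

If `day_of_week` comes back with fewer than seven rows, or the percentages fall outside the 0 to 100 range, the model logic probably needs another look.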
-* To check nulls: +- To check nulls: - ```sql - SELECT * FROM - WHERE IS NULL - ``` + ```sql + SELECT * FROM + WHERE IS NULL + ``` -* To check distinct values in a column: +- To check distinct values in a column: - ```sql - SELECT DISTINCT - FROM - ``` + ```sql + SELECT DISTINCT + FROM + ``` -* To check min/max: +- To check min/max: - ```sql - SELECT - MIN(), - MAX() - FROM - ``` + ```sql + SELECT + MIN(), + MAX() + FROM + ``` -* To check most common values: +- To check most common values: - ```sql - SELECT - , - COUNT(*) AS ct - FROM - GROUP BY 1 - ORDER BY ct DESC - ``` + ```sql + SELECT + , + COUNT(*) AS ct + FROM + GROUP BY 1 + ORDER BY ct DESC + ``` #### Row count and uniqueness To confirm that the grain is what you expect, you should check whether an anticipated unique key is actually unique. For example, if you were making a daily shapes table, you might expect that `date + feed_key + shape_id` would be unique. Similarly, you should have a ballpark idea of the order of magnitude of the number of rows you expect. If you're making a yearly organizations table and your table has a million rows, something is likely off. Some example queries could be: -* To check row count: - - ```sql - SELECT COUNT(*) FROM - ``` - -* To check row count by some attribute (for example, rows per date): - - ```sql - SELECT , COUNT(*) AS ct - FROM - GROUP BY 1 - ORDER BY 1 - ``` - -* To check uniqueness based on a combination of a few columns: - - ```sql - WITH tbl AS ( - SELECT * FROM - ), - - dups AS ( - SELECT - , - , - , - COUNT(*) AS ct - FROM tbl - -- adjust this based on the number of columns that make the composite unique key - GROUP BY 1, 2, 3 - HAVING ct > 1 - ) - - SELECT * - FROM dups - LEFT JOIN tbl USING (, , ) - ORDER BY , , - ``` +- To check row count: + + ```sql + SELECT COUNT(*) FROM + ``` + +- To check row count by some attribute (for example, rows per date): + + ```sql + SELECT , COUNT(*) AS ct + FROM + GROUP BY 1 + ORDER BY 1 + ``` + +- To check uniqueness based on a combination of a few columns: + + ```sql + WITH tbl AS ( + SELECT * FROM + ), + + dups AS ( + SELECT + , + , + , + COUNT(*) AS ct + FROM tbl + -- adjust this based on the number of columns that make the composite unique key + GROUP BY 1, 2, 3 + HAVING ct > 1 + ) + + SELECT * + FROM dups + LEFT JOIN tbl USING (, , ) + ORDER BY , , + ``` #### Performance While testing, you should keep an eye on the performance (cost/data efficiency) of the model: -* When you run the dbt model locally, look at how many bytes are billed to build the model(s). -* Before you run test queries, [check the bytes estimates](https://cloud.google.com/bigquery/docs/best-practices-costs#use-query-validator) (these may not be accurate for queries on [views](https://cloud.google.com/bigquery/docs/views-intro#view_pricing) or [clustered tables](https://cloud.google.com/bigquery/docs/clustered-tables#clustered_table_pricing)) -* After you run test queries, look at the total bytes billed after the fact in the **Job Information** tab in the **Query results** section of the BigQuery console. +- When you run the dbt model locally, look at how many bytes are billed to build the model(s). 
+- Before you run test queries, [check the bytes estimates](https://cloud.google.com/bigquery/docs/best-practices-costs#use-query-validator) (these may not be accurate for queries on [views](https://cloud.google.com/bigquery/docs/views-intro#view_pricing) or [clustered tables](https://cloud.google.com/bigquery/docs/clustered-tables#clustered_table_pricing)) +- After you run test queries, look at the total bytes billed after the fact in the **Job Information** tab in the **Query results** section of the BigQuery console. If the model takes more than 100 GB to build, or if test queries seem to be reading a lot of data (this is subjective; it's ok to build a sense over time), you may want to consider performance optimizations. Below are a few options to improve performance. [Data infra PR #2711](https://github.com/cal-itp/data-infra/pull/2711) has examples of several different types of performance interventions. -* If the model is expensive to **build**: First, try to figure out what specific steps are expensive. You can run individual portions of your model SQL in the BigQuery console to assess the performance of individual [CTEs](https://docs.getdbt.com/terms/cte). - * If the model involves transformations on a lot of data that doesn't need to be reprocessed every day, you may want to make the model [incremental](https://docs.getdbt.com/docs/build/incremental-models). You can run `poetry run dbt ls -s config.materialized:incremental --resource-type model` to see examples of other incremental models in the repo. - * If the model reads data from an expensive parent table, you may want to consider leveraging clustering or partitioning on that parent table to make a join or select more efficient. See [this comment on data infra PR #2743](https://github.com/cal-itp/data-infra/pull/2743#pullrequestreview-1570532320) for an example of a case where changing a join condition was a more appropriate performance intervention than making the table incremental. -* If the model is expensive to **query**: The main interventions to make a model more efficient to query involve changing the data storage. - * Consider storing it as a [table rather than a view](https://docs.getdbt.com/docs/build/materializations). - * If the model is already a table, you can consider [partitioning](https://cloud.google.com/bigquery/docs/partitioned-tables) or [clustering](https://cloud.google.com/bigquery/docs/clustered-tables#when_to_use_clustering) on columns that will commonly be used as filters. +- If the model is expensive to **build**: First, try to figure out what specific steps are expensive. You can run individual portions of your model SQL in the BigQuery console to assess the performance of individual [CTEs](https://docs.getdbt.com/terms/cte). + - If the model involves transformations on a lot of data that doesn't need to be reprocessed every day, you may want to make the model [incremental](https://docs.getdbt.com/docs/build/incremental-models). You can run `poetry run dbt ls -s config.materialized:incremental --resource-type model` to see examples of other incremental models in the repo. + - If the model reads data from an expensive parent table, you may want to consider leveraging clustering or partitioning on that parent table to make a join or select more efficient. 
See [this comment on data infra PR #2743](https://github.com/cal-itp/data-infra/pull/2743#pullrequestreview-1570532320) for an example of a case where changing a join condition was a more appropriate performance intervention than making the table incremental. +- If the model is expensive to **query**: The main interventions to make a model more efficient to query involve changing the data storage. + - Consider storing it as a [table rather than a view](https://docs.getdbt.com/docs/build/materializations). + - If the model is already a table, you can consider [partitioning](https://cloud.google.com/bigquery/docs/partitioned-tables) or [clustering](https://cloud.google.com/bigquery/docs/clustered-tables#when_to_use_clustering) on columns that will commonly be used as filters. ```{warning} Incremental models have two different run modes: **full refreshes** (which re-process all historical data available) and **incremental runs** that load data in batches based on your incremental logic. These two modes run different code. @@ -401,5 +401,5 @@ In 2022, Laurie [gave a lunch and learn](https://cal-itp.slack.com/archives/C02N Some folks from Data Services attended Coalesce (dbt's conference) in 2022 and thought the following talks may be of interest: -* [The accidental analytics engineer by Michael Chow](https://www.youtube.com/watch?v=EYdb1x1cO9U&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=66) - this talk outlines some differences Michael has experienced between R/tidyverse and dbt/MDS (modern data stack) approaches to working with data -* [dbt and MDS in small-batch academic research](https://www.youtube.com/watch?v=0SDp1yTK2zc&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=112) - this talk outlines some benefits this researcher found to using dbt in an academic context; note that he uses DuckDB (instead of BigQuery) +- [The accidental analytics engineer by Michael Chow](https://www.youtube.com/watch?v=EYdb1x1cO9U&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=66) - this talk outlines some differences Michael has experienced between R/tidyverse and dbt/MDS (modern data stack) approaches to working with data +- [dbt and MDS in small-batch academic research](https://www.youtube.com/watch?v=0SDp1yTK2zc&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=112) - this talk outlines some benefits this researcher found to using dbt in an academic context; note that he uses DuckDB (instead of BigQuery) diff --git a/docs/warehouse/navigating_dbt_docs.md b/docs/warehouse/navigating_dbt_docs.md index bfd67d92bf..9df4b9fb28 100644 --- a/docs/warehouse/navigating_dbt_docs.md +++ b/docs/warehouse/navigating_dbt_docs.md @@ -12,6 +12,7 @@ kernelspec: language: python name: python3 --- + # Navigating the dbt Docs `dbt` is the tool that we use to create data transformations in our warehouse, and it is also the tool that generates our dataset and table documentation. @@ -23,28 +24,32 @@ Visit this link to view the [dbt Cal-ITP warehouse documentation](https://dbt-do In the [dbt Cal-ITP warehouse documentation](https://dbt-docs.calitp.org/#!/overview), you can navigate from either the `Database` perspective (table-level) or the `Project` perspective (as the files are configured in the repository). ### The `Database` Perspective + This allows you to view the dbt project as it exists in the warehouse. To examine the documentation from the `Database` perspective: 1. Once at the [dbt docs homepage](https://dbt-docs.calitp.org/#!/overview), make sure that the `Database` tab is selected in the left-side panel -1. 
In the same left-side panel, under the `Tables and Views` heading, click on `cal-itp-data-infra` which will expand -1. Within that list, select the dataset schema of your choice -1. From here, a dropdown list of tables will appear and you can select a table to view its documentation +2. In the same left-side panel, under the `Tables and Views` heading, click on `cal-itp-data-infra` which will expand +3. Within that list, select the dataset schema of your choice +4. From here, a dropdown list of tables will appear and you can select a table to view its documentation ### The `Project` Perspective + This allows you to view the warehouse project as it exists as files in the repository. To examine the documentation for our tables from the `Project` perspective: -* Once at the [dbt docs homepage](https://dbt-docs.calitp.org/#!/overview), make sure that the `Project` tab is selected in the left-side panel. - * To examine our `source` tables: - 1. In the same left-side panel, find the `Sources` heading - 1. From here, select the source that you would like to view - 1. A dropdown list of tables will appear and you can select a table to view its documentation - - * To examine all of our other tables: - 1. In the same left-side panel, under the `Projects`, heading click on `calitp_warehouse` which will expand. - 2. Within that list, select `models` - 3. From here, file directories will appear below. - 4. Select the directory of your choice. A dropdown list of tables will appear and you can select a table to view its documentation +- Once at the [dbt docs homepage](https://dbt-docs.calitp.org/#!/overview), make sure that the `Project` tab is selected in the left-side panel. + - To examine our `source` tables: + + 1. In the same left-side panel, find the `Sources` heading + 2. From here, select the source that you would like to view + 3. A dropdown list of tables will appear and you can select a table to view its documentation + + - To examine all of our other tables: + + 1. In the same left-side panel, under the `Projects`, heading click on `calitp_warehouse` which will expand. + 2. Within that list, select `models` + 3. From here, file directories will appear below. + 4. Select the directory of your choice. A dropdown list of tables will appear and you can select a table to view its documentation diff --git a/docs/warehouse/overview.md b/docs/warehouse/overview.md index 792c59189a..d9c4ee4597 100644 --- a/docs/warehouse/overview.md +++ b/docs/warehouse/overview.md @@ -1,7 +1,10 @@ (intro-warehouse)= + # Introduction to the Warehouse + The section serves as an introduction to the Cal-ITP warehouse through conventions, best practices, and tips and tricks to better understand the Cal-ITP data warehouse. 
-* [Starter Kit](warehouse-starter-kit-page) -* [Adding Data to the Warehouse](adding-data-to-warehouse) -* [Developing dbt Models](developing-dbt-models) -* [What is GTFS, anyway?](what-is-gtfs) + +- [Starter Kit](warehouse-starter-kit-page) +- [Adding Data to the Warehouse](adding-data-to-warehouse) +- [Developing dbt Models](developing-dbt-models) +- [What is GTFS, anyway?](what-is-gtfs) diff --git a/docs/warehouse/warehouse_starter_kit.md b/docs/warehouse/warehouse_starter_kit.md index 1fd9f0f7e6..42c16000f1 100644 --- a/docs/warehouse/warehouse_starter_kit.md +++ b/docs/warehouse/warehouse_starter_kit.md @@ -1,55 +1,68 @@ (warehouse-starter-kit-page)= + # Warehouse: Where to Begin + [There is a large selection of data available in the warehouse.](https://console.cloud.google.com/bigquery?project=cal-itp-data-infra&ws=!1m0) Consider this a short guide to the most commonly used tables in our work. -* [Important Links](#links) -* [Trips](#trips) -* [Shapes](#shapes) -* [Daily](#daily) -* [Other](#other) +- [Important Links](#links) +- [Trips](#trips) +- [Shapes](#shapes) +- [Daily](#daily) +- [Other](#other) ## Important Links -* [DBT Docs Cal-ITP](https://dbt-docs.calitp.org/#!/overview) contains information on all the tables in the warehouse. -* [Example notebook](https://github.com/cal-itp/data-analyses/blob/main/starter_kit/gtfs_utils_v2_examples.ipynb) - uses functions in `shared_utils.gtfs_utils_v2` that query some of the tables below. + +- [DBT Docs Cal-ITP](https://dbt-docs.calitp.org/#!/overview) contains information on all the tables in the warehouse. +- [Example notebook](https://github.com/cal-itp/data-analyses/blob/main/starter_kit/gtfs_utils_v2_examples.ipynb) + uses functions in `shared_utils.gtfs_utils_v2` that query some of the tables below. ## Trips + On a given day: -* [fct_scheduled_trips](https://dbt-docs.calitp.org/#!/model/model.calitp_warehouse.fct_scheduled_trips) - * Use `gtfs_utils_v2.get_trips()`. - * Answer how many trips a provider is scheduled to run and how many trips a particular route may make? -* [fct_observed_trips](https://dbt-docs.calitp.org/#!/model/model.calitp_warehouse.fct_observed_trips) - * Realtime observations of trips to get a full picture of what occurred. - * Find a trip's start time, where it went, and which route it is associated with. + +- [fct_scheduled_trips](https://dbt-docs.calitp.org/#!/model/model.calitp_warehouse.fct_scheduled_trips) + - Use `gtfs_utils_v2.get_trips()`. + - Answer how many trips a provider is scheduled to run and how many trips a particular route may make? +- [fct_observed_trips](https://dbt-docs.calitp.org/#!/model/model.calitp_warehouse.fct_observed_trips) + - Realtime observations of trips to get a full picture of what occurred. + - Find a trip's start time, where it went, and which route it is associated with. ## Shapes -* [fct_daily_scheduled_shapes](https://dbt-docs.calitp.org/#!/model/model.calitp_warehouse.fct_daily_scheduled_shapes) - * Use `gtfs_utils_v2.get_shapes()`. - * Contains `point` geometry, so you can see the length and location of a route a provider can run on a given date. - * Each shape has its own `shape_id` and `shape_array_key`. - * An express version and the regular version of a route are considered two different shapes. + +- [fct_daily_scheduled_shapes](https://dbt-docs.calitp.org/#!/model/model.calitp_warehouse.fct_daily_scheduled_shapes) + - Use `gtfs_utils_v2.get_shapes()`. 
+ - Contains `point` geometry, so you can see the length and location of a route a provider can run on a given date. + - Each shape has its own `shape_id` and `shape_array_key`. + - An express version and the regular version of a route are considered two different shapes. ## Daily + For a given day: -* [fct_daily_scheduled_stops](https://dbt-docs.calitp.org/#!/model/model.calitp_warehouse.fct_daily_scheduled_stops) - * Use `gtfs_utils_v2.get_stops()`. - * Contains `point` geometry. - * How many stops did a provider make? Where did they stop? - * How many stops did a particular transit type (streetcar, rail, ferry...)? - * Detailed information such as how passengers embark/disembark (ex: on a stop/at a station) onto a vehicle. - -* [fct_daily_schedule_feeds](https://dbt-docs.calitp.org/#!/model/model.calitp_warehouse.fct_daily_schedule_feeds) - * Use `gtfs_utils_v2.schedule_daily_feed_to_organization()` to find feed names, regional feed type, and gtfs dataset key. - * Please note,the `name` column returned from the function above refers to a name of the feed, not to a provider. - * Use `gtfs_utils_v2.schedule_daily_feed_to_organization()` to find regional feed type, gtfs dataset key, and feed type for an organization. + +- [fct_daily_scheduled_stops](https://dbt-docs.calitp.org/#!/model/model.calitp_warehouse.fct_daily_scheduled_stops) + + - Use `gtfs_utils_v2.get_stops()`. + - Contains `point` geometry. + - How many stops did a provider make? Where did they stop? + - How many stops did a particular transit type (streetcar, rail, ferry...)? + - Detailed information such as how passengers embark/disembark (ex: on a stop/at a station) onto a vehicle. + +- [fct_daily_schedule_feeds](https://dbt-docs.calitp.org/#!/model/model.calitp_warehouse.fct_daily_schedule_feeds) + + - Use `gtfs_utils_v2.schedule_daily_feed_to_organization()` to find feed names, regional feed type, and gtfs dataset key. + - Please note,the `name` column returned from the function above refers to a name of the feed, not to a provider. + - Use `gtfs_utils_v2.schedule_daily_feed_to_organization()` to find regional feed type, gtfs dataset key, and feed type for an organization. ### Other -* [dim_annual_ntd_agency_information](https://dbt-docs.calitp.org/#!/model/model.calitp_warehouse.dim_annual_database_agency_information) - * View some of the data produced by the [US Department of Transportation](https://www.transit.dot.gov/ntd) for the National Transit Database. - * Information from 2018-2021 are available. - * Includes information such as reporter type, organization type, website, and address. - * Not every provider is required to report their data to the NTD, so this is not a comprehensive dataset. - -* [fct_daily_organization_combined_guideline_checks](https://dbt-docs.calitp.org/#!/model/model.calitp_warehouse.fct_daily_organization_combined_guideline_checks) - * Understand GTFS quality - how well a transit provider's GTFS data conforms to [California's Transit Data Guidelines](https://dot.ca.gov/cal-itp/california-transit-data-guidelines). - * Each provider [has one row per guideline check](https://dbt-docs.calitp.org/#!/model/model.calitp_warehouse.int_gtfs_quality__guideline_checks_long). Each row details how well a provider's GTFS data conforms to a certain guideline (availability on website, accurate accessibility data, etc). 
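If you need to query one of these tables directly rather than through `shared_utils.gtfs_utils_v2`, a minimal BigQuery query might look like the sketch below; the dataset path and column names are assumptions to confirm against the dbt docs before relying on them.

```sql
-- Assumed dataset path and column names; verify them in the dbt docs.
SELECT
    route_id,
    COUNT(*) AS n_scheduled_trips
FROM `cal-itp-data-infra.mart_gtfs.fct_scheduled_trips`
WHERE service_date = DATE '2023-08-01'
GROUP BY 1
ORDER BY n_scheduled_trips DESC
```

For anything you run repeatedly, prefer the `gtfs_utils_v2` helpers above so that filters and names stay consistent across analyses.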
+ +- [dim_annual_ntd_agency_information](https://dbt-docs.calitp.org/#!/model/model.calitp_warehouse.dim_annual_database_agency_information) + + - View some of the data produced by the [US Department of Transportation](https://www.transit.dot.gov/ntd) for the National Transit Database. + - Information from 2018-2021 are available. + - Includes information such as reporter type, organization type, website, and address. + - Not every provider is required to report their data to the NTD, so this is not a comprehensive dataset. + +- [fct_daily_organization_combined_guideline_checks](https://dbt-docs.calitp.org/#!/model/model.calitp_warehouse.fct_daily_organization_combined_guideline_checks) + + - Understand GTFS quality - how well a transit provider's GTFS data conforms to [California's Transit Data Guidelines](https://dot.ca.gov/cal-itp/california-transit-data-guidelines). + - Each provider [has one row per guideline check](https://dbt-docs.calitp.org/#!/model/model.calitp_warehouse.int_gtfs_quality__guideline_checks_long). Each row details how well a provider's GTFS data conforms to a certain guideline (availability on website, accurate accessibility data, etc). diff --git a/docs/warehouse/what_is_agency.md b/docs/warehouse/what_is_agency.md index 09d4e34995..c0f5a26069 100644 --- a/docs/warehouse/what_is_agency.md +++ b/docs/warehouse/what_is_agency.md @@ -1,12 +1,13 @@ # What is an `agency`? + `Agency` is a term that is used often across the Cal-ITP project but depending on the context of its use, it can have varying meanings when conducting an analysis. Inconsistent use of the term `agency` can be confusing, so this section of the documentation seeks to help analysts determine how to translate the use of the word `agency` in research requests depending on the area of focus that the research request falls into. -| Area of Focus | How to Identify an `agency` | -| -------- | -------- | -| **GTFS Datasets** | For both GTFS Static and GTFS Real-Time data, when trying to analyze GTFS datasets it is easiest to think of `agency` as **"unique feed publisher"**, with the exception of the combined regional feed in the Bay Area, as it is a regional reporter that publishes duplicates of other feeds that we also consume.

**To identify "unique feed publishers":**
  • Decide whether customer-facing feeds or agency feeds make sense for the analysis. For data quality analyses, customer-facing is crucial; for transit planning analyses, agency subfeeds is more relevant.
  • Deduplicate feeds
| -| **GTFS-Provider-Service Relationships** | In the warehouse, this is the relationship between `organizations` and the `services` they manage. An agency can be interpreted as both depending on the use case.

This is not an exhaustive list of all services managed by providers, only those that we are targeting to get into GTFS reporting.

Each record defines an organization and one of it's services. For the most part, each service is managed by a single organization with a small number of exceptions (e.g. *Solano Express*, which is jointly managed by Solano and Napa). In all cases, it is best to define how you are using `agency` within your analyses.

**Reference table**: Use this table to identify provider-service relationships
`cal-itp-data-infra.mart_transit_database.dim_provider_gtfs_data`
  • Column: `organization_name`
  • Column: `mobility_service`

  • | -| **Non-GTFS Datasets** | Depending on the data you are using, defining an agency can change. In most cases, an `agency` refers to a public entity. For analyses that include non-public entities, `organization` can be used as a catch-all term to include local government agencies and other entities that may not fall under this definition of `agency`.

    **Examples of `agency` definitions:**
    • [DLA Local Public Agency](https://dot.ca.gov/-/media/dot-media/programs/local-assistance/documents/guide/dla-glossary052022.pdf): "A California City, county, tribal government or other local public agency. In many instances this term is used loosely to include nonprofit organizations." | +| Area of Focus | How to Identify an `agency` | +| ------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **GTFS Datasets** | For both GTFS Static and GTFS Real-Time data, when trying to analyze GTFS datasets it is easiest to think of `agency` as **"unique feed publisher"**, with the exception of the combined regional feed in the Bay Area, as it is a regional reporter that publishes duplicates of other feeds that we also consume.

      **To identify "unique feed publishers":**
      • Decide whether customer-facing feeds or agency feeds make sense for the analysis. For data quality analyses, customer-facing feeds are crucial; for transit planning analyses, agency subfeeds are more relevant.<br>
      • Deduplicate feeds
      | +| **GTFS-Provider-Service Relationships** | In the warehouse, this is the relationship between `organizations` and the `services` they manage. An agency can be interpreted as both depending on the use case.

      This is not an exhaustive list of all services managed by providers, only those that we are targeting to get into GTFS reporting.

      Each record defines an organization and one of its services. For the most part, each service is managed by a single organization with a small number of exceptions (e.g. *Solano Express*, which is jointly managed by Solano and Napa). In all cases, it is best to define how you are using `agency` within your analyses.<br>

      **Reference table**: Use this table to identify provider-service relationships
      `cal-itp-data-infra.mart_transit_database.dim_provider_gtfs_data`
      • Column: `organization_name`
      • Column: `mobility_service`

      • | +| **Non-GTFS Datasets** | Depending on the data you are using, defining an agency can change. In most cases, an `agency` refers to a public entity. For analyses that include non-public entities, `organization` can be used as a catch-all term to include local government agencies and other entities that may not fall under this definition of `agency`.

        **Examples of `agency` definitions:**
        • [DLA Local Public Agency](https://dot.ca.gov/-/media/dot-media/programs/local-assistance/documents/guide/dla-glossary052022.pdf): "A California City, county, tribal government or other local public agency. In many instances this term is used loosely to include nonprofit organizations." | **Note**: Defining your unit of analysis within your analyses — whether it be `agency` or `organization` or another term — can help clarify how you are using the term. diff --git a/docs/warehouse/what_is_gtfs.md b/docs/warehouse/what_is_gtfs.md index 4379bf3160..9cd5b52eee 100644 --- a/docs/warehouse/what_is_gtfs.md +++ b/docs/warehouse/what_is_gtfs.md @@ -1,4 +1,5 @@ (what-is-gtfs)= + # What is GTFS, anyway? Lots of information in the warehouse comes from the General Transit Feed Specification, or GTFS. @@ -9,12 +10,12 @@ Lots of information in the warehouse comes from the General Transit Feed Specifi ### Video Notes -* Laurie mentions downloading an example feed to look at the files. This is a great idea! But when working with our data warehouse remember that you don't have to interact with raw GTFS feeds directly (lots of important work has been done for you!). Still, we recommend taking a look at an example feed to understand what it looks like. Here's one from [Big Blue Bus](http://gtfs.bigbluebus.com/current.zip). -* We don't really use Partridge, but here's a link to their [repo](https://github.com/remix/partridge) in case you want to see where that handy diagram came from. Notice how trips are central to a GTFS feed. -* Ignore references to `calitp_url_number`, `calitp_extracted_at`, or `calitp_deleted_at`. These refer to an older version of our warehouse. Today, we tell feeds apart with keys such as `gtfs_dataset_key` and `feed_key`. Learn about our current data warehouse [here](warehouse-starter-kit-page). -* A "route" is a somewhat ambiguous concept! Transit providers have a lot of flexibility in branding their services. The same route can, and often does, have some trips following one path and some trips following another. GTFS has another concept of a "shape" which describes a path through physical space that one or more trips can follow. +- Laurie mentions downloading an example feed to look at the files. This is a great idea! But when working with our data warehouse remember that you don't have to interact with raw GTFS feeds directly (lots of important work has been done for you!). Still, we recommend taking a look at an example feed to understand what it looks like. Here's one from [Big Blue Bus](http://gtfs.bigbluebus.com/current.zip). +- We don't really use Partridge, but here's a link to their [repo](https://github.com/remix/partridge) in case you want to see where that handy diagram came from. Notice how trips are central to a GTFS feed. +- Ignore references to `calitp_url_number`, `calitp_extracted_at`, or `calitp_deleted_at`. These refer to an older version of our warehouse. Today, we tell feeds apart with keys such as `gtfs_dataset_key` and `feed_key`. Learn about our current data warehouse [here](warehouse-starter-kit-page). +- A "route" is a somewhat ambiguous concept! Transit providers have a lot of flexibility in branding their services. The same route can, and often does, have some trips following one path and some trips following another. GTFS has another concept of a "shape" which describes a path through physical space that one or more trips can follow. 
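If you want to take that look at an example feed from the command line, here is a quick sketch (assuming `curl` and `unzip` are available locally):

```bash
# Download the Big Blue Bus example feed linked above and list the files it contains
curl -L -o bigbluebus.zip http://gtfs.bigbluebus.com/current.zip
unzip -l bigbluebus.zip

# Peek at trips.txt, the file at the center of a GTFS feed's structure
unzip -p bigbluebus.zip trips.txt | head -n 5
```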
## Further GTFS Resources -* [Slides](https://docs.google.com/presentation/d/1fqIeXevb18T5s5k6XPxFbVEMHBPybeV29rFoFXROCw8/) from the video -* [GTFS Specification](https://gtfs.org) +- [Slides](https://docs.google.com/presentation/d/1fqIeXevb18T5s5k6XPxFbVEMHBPybeV29rFoFXROCw8/) from the video +- [GTFS Specification](https://gtfs.org) diff --git a/images/dask/README.md b/images/dask/README.md index fe7a24e00b..ec4e523e85 100644 --- a/images/dask/README.md +++ b/images/dask/README.md @@ -6,6 +6,7 @@ used by analytics code. See [the dask docs](https://docs.dask.org/en/stable/how- for some additional detail. ## Building and pushing manually + ```bash docker build -t ghcr.io/cal-itp/data-infra/dask:2022.10.13 . docker push ghcr.io/cal-itp/data-infra/dask:2022.10.13 diff --git a/images/jupyter-singleuser/README.md b/images/jupyter-singleuser/README.md index 410012f6be..ec11a9b858 100644 --- a/images/jupyter-singleuser/README.md +++ b/images/jupyter-singleuser/README.md @@ -4,6 +4,7 @@ This is the notebook image that individual users are served via JupyterHub. ## Building and pushing manually + ```bash docker build -t ghcr.io/cal-itp/data-infra/jupyter-singleuser:2022.10.13 . docker push ghcr.io/cal-itp/data-infra/jupyter-singleuser:2022.10.13 diff --git a/jobs/gtfs-rt-parser-v2/README.md b/jobs/gtfs-rt-parser-v2/README.md index 2f8f99682d..9bafc0acf9 100644 --- a/jobs/gtfs-rt-parser-v2/README.md +++ b/jobs/gtfs-rt-parser-v2/README.md @@ -14,11 +14,13 @@ and validating in the same codebase, controlled by CLI flags. Much of RT parsing combine them. ## Testing + This image can be built and tested via local Airflow. In addition, there is at least one Python test that can be executed via `poetry run pytest`. ## The GTFS-RT validator + The [validator jar](./rt-validator.jar) is an old snapshot of the GTFS Realtime validator that now lives under [MobilityData](https://github.com/MobilityData/gtfs-realtime-validator). We've temporarily vendored an old version (specifically "1.0.0-SNAPSHOT") to help make our builds less dependent on external services. We should begin diff --git a/packages/calitp-data-infra/README.md b/packages/calitp-data-infra/README.md index a11f44363b..edeffa98ca 100644 --- a/packages/calitp-data-infra/README.md +++ b/packages/calitp-data-infra/README.md @@ -9,6 +9,7 @@ GTFS feeds based on a GTFSDownloadConfig and is used by both the GTFS Schedule download Airflow DAG and the GTFS RT archiver. ## Testing and publishing + This repository should pass mypy and other static checkers, and has a small number of tests. These checks are executed in [GiHub Actions](../../.github/workflows/build-calitp-data-infra.yml) and the package will eventually be published to pypi on merge. diff --git a/runbooks/data/deprecation-stored-files.md b/runbooks/data/deprecation-stored-files.md index 19ea76d5db..926be4ba24 100644 --- a/runbooks/data/deprecation-stored-files.md +++ b/runbooks/data/deprecation-stored-files.md @@ -5,16 +5,19 @@ Occasionally, we want to assess our Google Cloud Storage buckets for outdatednes 1. 
In the Google Cloud Console Metrics Explorer, identify GCS buckets that have not recently had data written to them or read from them - [this query](https://console.cloud.google.com/monitoring/metrics-explorer;duration=P84D?pageState=%7B%22domainObjectDeprecationId%22:%22D20E3E2D-1786-4C36-988D-09C4EB19587E%22,%22title%22:%22Untitled%22,%22xyChart%22:%7B%22constantLines%22:%5B%5D,%22dataSets%22:%5B%7B%22plotType%22:%22LINE%22,%22targetAxis%22:%22Y1%22,%22timeSeriesFilter%22:%7B%22aggregations%22:%5B%7B%22crossSeriesReducer%22:%22REDUCE_SUM%22,%22groupByFields%22:%5B%22metric.label.%5C%22method%5C%22%22,%22resource.label.%5C%22bucket_name%5C%22%22%5D,%22perSeriesAligner%22:%22ALIGN_RATE%22%7D%5D,%22apiSource%22:%22DEFAULT_CLOUD%22,%22crossSeriesReducer%22:%22REDUCE_SUM%22,%22filter%22:%22metric.type%3D%5C%22storage.googleapis.com%2Fapi%2Frequest_count%5C%22%20resource.type%3D%5C%22gcs_bucket%5C%22%20resource.label.%5C%22bucket_name%5C%22%3D%5C%22calitp-gtfs-rt-raw-v2%5C%22%22,%22groupByFields%22:%5B%22metric.label.%5C%22method%5C%22%22,%22resource.label.%5C%22bucket_name%5C%22%22%5D,%22minAlignmentPeriod%22:%2260s%22,%22perSeriesAligner%22:%22ALIGN_RATE%22%7D%7D%5D,%22options%22:%7B%22mode%22:%22COLOR%22%7D,%22y1Axis%22:%7B%22label%22:%22%22,%22scale%22:%22LINEAR%22%7D%7D%7D&project=cal-itp-data-infra) produces a visualization of recent activity for objects within a given bucket (the targeted bucket is set in the "Filters" section, with the calitp-gtfs-rt-raw-v2 bucket provided as an example). For any bucket without recent ReadObject or WriteObject activity, proceed to the next steps. 2. Among buckets not recently modified (more than ~12 weeks since the last update), there are two general categories: - * Buckets used for testing, prefixed with "test-", generally correspond to infrequent tests of Airflow jobs and other scripts that take place during new feature development. The buckets themselves should generally remain in existence (unless the corresponding job/script is no longer actively used in production), but any objects they contain from previous rounds of testing can be deleted. Note that deletion of objects from some test buckets may introduce a later need to place raw artifacts inside those buckets in order to test parsing and validation tasks. - * All other buckets are deprecation candidates, but should be treated with greater care, utilizing the remaining steps of this guide. + + - Buckets used for testing, prefixed with "test-", generally correspond to infrequent tests of Airflow jobs and other scripts that take place during new feature development. The buckets themselves should generally remain in existence (unless the corresponding job/script is no longer actively used in production), but any objects they contain from previous rounds of testing can be deleted. Note that deletion of objects from some test buckets may introduce a later need to place raw artifacts inside those buckets in order to test parsing and validation tasks. + - All other buckets are deprecation candidates, but should be treated with greater care, utilizing the remaining steps of this guide. 3. For the non-test buckets that constitute the deprecation candidate list, the path forward relies on investigation of internal project configuration and conversation with data stakeholders. Some data may need to be retained because it is frequently accessed despite being infrequently updated (NTD data or static website assets, for instance). 
Some data may need to be retained rather than deleted because it represents raw data collected once that can't otherwise be recovered, or to conform with regulatory requirements, or to provide a window for future research access. Each of the following steps should be taken to determine which path to take: - * Search the source code of the [data-infra repository](https://github.com/cal-itp/data-infra), the [data-analyses repository](https://github.com/cal-itp/data-analyses), and the [reports repository](https://github.com/cal-itp/reports) for the name of the bucket, as well as the environment variables [set in Cloud Composer](https://console.cloud.google.com/composer/environments/detail/us-west2/calitp-airflow2-prod/variables?project=cal-itp-data-infra). If you find it referenced anywhere, investigate whether the reference is in active use. For an extra step of safety, you could also search the entire Cal-ITP GitHub organization's source code via GitHub's web user interface. - * Note: [External tables](https://cloud.google.com/bigquery/docs/external-tables) in BigQuery, created from GCS objects via [our `create_external_tables` DAG](https://o1d2fa0877cf3fb10p-tp.appspot.com/dags/create_external_tables/grid) in Airflow, do not produce read or write data that shows up in the GCS request count metric we used in step one. If you find a reference to a deprecation candidate bucket within the [`create_external_tables` subfolder](https://github.com/cal-itp/data-infra/tree/main/airflow/dags/create_external_tables) of the data-infra repository, you should check [BigQuery audit logs](https://cloud.google.com/bigquery/docs/reference/auditlogs/#data_access_data_access) to see whether people are querying the external tables that rely on the deprecation candidate bucket (and if so, eliminate it from the deprecation list). - * Post in `#data-warehouse-devs` and any other relevant channels in Slack (this may vary by domain; for example, if investigating a bucket related to GTFS quality, you may post in `#gtfs-quality`). Ask whether anybody knows of ongoing use of the bucket(s) in question. If there are identifiable stakeholders who aren't active in Slack, like external research partners, reach out to them directly. + + - Search the source code of the [data-infra repository](https://github.com/cal-itp/data-infra), the [data-analyses repository](https://github.com/cal-itp/data-analyses), and the [reports repository](https://github.com/cal-itp/reports) for the name of the bucket, as well as the environment variables [set in Cloud Composer](https://console.cloud.google.com/composer/environments/detail/us-west2/calitp-airflow2-prod/variables?project=cal-itp-data-infra). If you find it referenced anywhere, investigate whether the reference is in active use. For an extra step of safety, you could also search the entire Cal-ITP GitHub organization's source code via GitHub's web user interface. + - Note: [External tables](https://cloud.google.com/bigquery/docs/external-tables) in BigQuery, created from GCS objects via [our `create_external_tables` DAG](https://o1d2fa0877cf3fb10p-tp.appspot.com/dags/create_external_tables/grid) in Airflow, do not produce read or write data that shows up in the GCS request count metric we used in step one. 
If you find a reference to a deprecation candidate bucket within the [`create_external_tables` subfolder](https://github.com/cal-itp/data-infra/tree/main/airflow/dags/create_external_tables) of the data-infra repository, you should check [BigQuery audit logs](https://cloud.google.com/bigquery/docs/reference/auditlogs/#data_access_data_access) to see whether people are querying the external tables that rely on the deprecation candidate bucket (and if so, eliminate it from the deprecation list). + - Post in `#data-warehouse-devs` and any other relevant channels in Slack (this may vary by domain; for example, if investigating a bucket related to GTFS quality, you may post in `#gtfs-quality`). Ask whether anybody knows of ongoing use of the bucket(s) in question. If there are identifiable stakeholders who aren't active in Slack, like external research partners, reach out to them directly. 4. For each bucket that hasn't been removed from the deprecation list via the investigation in the last step, [label the bucket as "deprecated: true"](https://cloud.google.com/storage/docs/using-bucket-labels) and remove access for any relevant automated users or user groups (we want to intentionally break automated access so that errors occur if the bucket's data was still being referenced undetected). Since we use [uniform bucket-level access](https://cloud.google.com/storage/docs/uniform-bucket-level-access) for the vast majority of our buckets, removing access is [a simple operation on the Permissions page of the bucket](https://cloud.google.com/storage/docs/access-control/using-iam-permissions#bucket-remove). For buckets that have fine-grained access control, those permissions changes need to be made for each object in the bucket via those objects' [Access Control Lists](https://cloud.google.com/storage/docs/access-control/lists). In either case, once permissions changes have been made, inform stakeholders about the newly deprecated buckets via `#data-warehouse-devs` and other relevant channels, and monitor for two weeks for any new code or process breakages related to the changed bucket permissions. 5. After two weeks is up, take the most relevant option of the following two: - * For buckets that must be retained because they represent raw data that can't otherwise be recovered in the future, or for regulatory reasons, or because of potential for future research/analysis access, place a small README inside the bucket explaining its deprecation. After that, [convert the bucket and its objects to the Archive storage class](https://cloud.google.com/storage/docs/changing-storage-classes). Note that future access to Archive class objects will incur higher costs than access to objects in the standard storage class - the Archive storage class is intended for data that will not be accessed frequently, and is cost-optimized for lower costs of storage and higher costs of access. Additionally, if Archive class objects are deleted after less than a year of Archive class storage, the Google Cloud project will still be billed for one year of storage, which is the minimum billable storage duration for Achive class objects. - * For buckets that do not need to be kept long term, like out-of-use transformations of raw data that can be recreated if necessary from the corresponding raw data, simply delete the bucket. 
+ + - For buckets that must be retained because they represent raw data that can't otherwise be recovered in the future, or for regulatory reasons, or because of potential for future research/analysis access, place a small README inside the bucket explaining its deprecation. After that, [convert the bucket and its objects to the Archive storage class](https://cloud.google.com/storage/docs/changing-storage-classes). Note that future access to Archive class objects will incur higher costs than access to objects in the standard storage class - the Archive storage class is intended for data that will not be accessed frequently, and is cost-optimized for lower costs of storage and higher costs of access. Additionally, if Archive class objects are deleted after less than a year of Archive class storage, the Google Cloud project will still be billed for one year of storage, which is the minimum billable storage duration for Achive class objects. + - For buckets that do not need to be kept long term, like out-of-use transformations of raw data that can be recreated if necessary from the corresponding raw data, simply delete the bucket. diff --git a/runbooks/data/deprecation-warehouse-models.md b/runbooks/data/deprecation-warehouse-models.md index 6c8aef240a..031f8406a5 100644 --- a/runbooks/data/deprecation-warehouse-models.md +++ b/runbooks/data/deprecation-warehouse-models.md @@ -18,14 +18,16 @@ ORDER BY timestamp DESC ``` 2. Post a warning in `#data-warehouse-devs` and any other relevant channels in Slack (this may vary by domain; for example, if deprecating a model related to GTFS quality, you may post in `#gtfs-quality`). Add the model to the [deprecation tracking spreadsheet](https://docs.google.com/spreadsheets/d/1jRK-hI1t2akEFA_eiUo8WfLFdYV3VJGLWtXczRRG8r0/edit#gid=0). - * If the model had been recently accessed according to the query in step 1, give a 2-week warning and post suggestions for alternative similar models. - * If the model had not been recently accessed, a 1-week warning, without suggested alternatives, is sufficient. + + - If the model had been recently accessed according to the query in step 1, give a 2-week warning and post suggestions for alternative similar models. + - If the model had not been recently accessed, a 1-week warning, without suggested alternatives, is sufficient. 3. After the warning period is up, soft-delete the model(s): - * Double check whether anyone has been accessing it. If so, perform additional outreach to the relevant users and only proceed once those users have confirmed a migration plan. - * Delete the associated dbt code (SQL *and* YAML). - * [Copy the model(s)](https://cloud.google.com/bigquery/docs/managing-tables#copying_a_single_source_table) to `_deprecated`. You can leave the `_deprecated` copy in its current location (same dataset.) [Delete the original (non-`_deprecated`) model](https://cloud.google.com/bigquery/docs/managing-tables#deleting_a_table). - * If you are deprecating an entire dataset/schema, remove it from Metabase at this point. + + - Double check whether anyone has been accessing it. If so, perform additional outreach to the relevant users and only proceed once those users have confirmed a migration plan. + - Delete the associated dbt code (SQL *and* YAML). + - [Copy the model(s)](https://cloud.google.com/bigquery/docs/managing-tables#copying_a_single_source_table) to `_deprecated`. You can leave the `_deprecated` copy in its current location (same dataset.) 
[Delete the original (non-`_deprecated`) model](https://cloud.google.com/bigquery/docs/managing-tables#deleting_a_table). + - If you are deprecating an entire dataset/schema, remove it from Metabase at this point. 4. Wait again. If the model(s) had been recently accessed according to the query in step 1, wait 2 weeks; otherwise wait only 1 week. If during this second waiting period someone objects or identifies a problem, work with them to identify a path forward. This may involve helping them use the `_deprecated` copy while they make a longer-term plan and delaying hard deletion until they have migrated to something else. diff --git a/runbooks/infrastructure/disk-space.md b/runbooks/infrastructure/disk-space.md index dcb6237cf9..c50da7f513 100644 --- a/runbooks/infrastructure/disk-space.md +++ b/runbooks/infrastructure/disk-space.md @@ -1,4 +1,5 @@ # Disk Space Usage + We have an [alert](https://monitoring.calitp.org/alerting/grafana/Geo72Nf4z/view) that will trigger when any Kubernetes volume is more than 80% full. The resolution will depend on the affected service. > After following any of these specific runbooks, check the general [Kubernetes dashboard](https://monitoring.calitp.org/d/oWe9aYxmk/1-kubernetes-deployment-statefulset-daemonset-metrics) in Grafana to verify the disk space consumption decreased to a safe level. The alert should also show as resolved in Slack after a couple minutes. @@ -6,6 +7,7 @@ We have an [alert](https://monitoring.calitp.org/alerting/grafana/Geo72Nf4z/view ## ZooKeeper [ZooKeeper](https://zookeeper.apache.org/) is deployed as part of our Sentry Helm chart. While autopurge should be enabled as part of the deployment values, we've had issues with it not working in the past. The following process will remove old logs and snapshots. + 1. Login to a ZooKeeper pod with `kubectl exec --stdin --tty -n -- bash`; the alert will tell you which volume is more than 80% full. For example, typically `kubectl exec --stdin --tty sentry-zookeeper-clickhouse-0 -n sentry -- bash` for cleaning up the Sentry ZooKeeper disk. **You will need to repeat this process for each pod in the StatefulSet.** (In the default Sentry Helm chart configuration, this means `sentry-zookeeper-clickhouse-1` and `sentry-zookeeper-clickhouse-2` as well). 2. Execute the cleanup script `./opt/bitnami/zookeeper/bin/zkCleanup.sh -n `; `count` must be at least 3. 1. If the executable does not exist in this location, you can find it with `find . -name zkCleanup.sh`. @@ -13,6 +15,7 @@ We have an [alert](https://monitoring.calitp.org/alerting/grafana/Geo72Nf4z/view Additional sections may be added to this runbook over time. ## Kafka (also failing consumers) + The [Kafka](https://kafka.apache.org/) pods themselves can also have unbound disk space usage if they are not properly configured to drop old data quickly enough. This can cascade into a variety of issues, as well as [snuba](https://getsentry.github.io/snuba/architecture/overview.html) workers being unable to actually pull events from Kafka, leading to a scenario that cannot recover without intervention. This list of steps is for resetting one particular consumer group for one particular topic, so it may need to be performed multiple times. > The sensitive values referenced here are stored in Vaultwarden; the Helm chart does not yet support using only Secrets. @@ -20,10 +23,10 @@ The [Kafka](https://kafka.apache.org/) pods themselves can also have unbound dis 0. 
As a temporary measure, you can increase the capacity of the persistent volume of the pod having issues. You can either edit the persistent volume YAML directly, or `helm upgrade sentry apps/charts/sentry -n sentry -f apps/values/sentry_sensitive.yaml -f apps/charts/sentry/values.yaml --debug` after setting a larger volume size in `values.yaml`. Either way, you will likely have to restart the pod to let the change take effect. 1. Check if there are any failing consumer pods in [Workloads](https://console.cloud.google.com/kubernetes/workload?project=cal-itp-data-infra); you can use the logs to identify the topic and potentially the consumer group. 2. Check the consumer groups and/or topics, and reset the offsets to the latest as appropriate. This [GitHub issue](https://github.com/getsentry/self-hosted/issues/478#issuecomment-666254392) contains very helpful information. For example, to reset the `snuba-events-subscriptions-consumers` consumer that is failing to handle the `snuba-commit-log` topic: - * `kubectl exec --stdin --tty sentry-kafka-0 -n sentry -- bash` - * `/opt/bitnami/kafka/bin/kafka-consumer-groups.sh --bootstrap-server 127.0.0.1:9092 --list` - * `/opt/bitnami/kafka/bin/kafka-consumer-groups.sh --bootstrap-server 127.0.0.1:9092 --group snuba-events-subscriptions-consumers -describe` - * `/opt/bitnami/kafka/bin/kafka-consumer-groups.sh --bootstrap-server 127.0.0.1:9092 --group snuba-events-subscriptions-consumers --topic snuba-commit-log --reset-offsets --to-latest --execute` - * If you hit `Error: Assignments can only be reset if the group 'snuba-post-processor' is inactive, but the current state is Stable.`, you need to stop the consumers on the topic (by deleting the pods and/or deployment), resetting the offset, and starting the pods again (via `helm upgrade sentry apps/charts/sentry -n sentry -f apps/values/sentry_sensitive.yaml -f apps/charts/sentry/values.yaml --debug` if you deleted the deployment). + - `kubectl exec --stdin --tty sentry-kafka-0 -n sentry -- bash` + - `/opt/bitnami/kafka/bin/kafka-consumer-groups.sh --bootstrap-server 127.0.0.1:9092 --list` + - `/opt/bitnami/kafka/bin/kafka-consumer-groups.sh --bootstrap-server 127.0.0.1:9092 --group snuba-events-subscriptions-consumers -describe` + - `/opt/bitnami/kafka/bin/kafka-consumer-groups.sh --bootstrap-server 127.0.0.1:9092 --group snuba-events-subscriptions-consumers --topic snuba-commit-log --reset-offsets --to-latest --execute` + - If you hit `Error: Assignments can only be reset if the group 'snuba-post-processor' is inactive, but the current state is Stable.`, you need to stop the consumers on the topic (by deleting the pods and/or deployment), resetting the offset, and starting the pods again (via `helm upgrade sentry apps/charts/sentry -n sentry -f apps/values/sentry_sensitive.yaml -f apps/charts/sentry/values.yaml --debug` if you deleted the deployment). 3. (Optional) If disk space is still maxed out and the consumers fail to recover even after increasing the disk space, stop the failing Kafka pod and delete its underlying PV, then repeat the steps again. **This will lose the in-flight data** but is preferable to the worker continuing to exist in a bad state. 4. (Optional) Check the existing `logRetentionHours` in [values.yaml](../../kubernetes/apps/charts/sentry/values.yaml); it should be set but may need to be shorter. 
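As a consolidated, non-interactive sketch of that reset sequence (assuming the `sentry` namespace, the `sentry-kafka-0` pod, and the consumer group/topic from the example above; the consumer group must be inactive for the reset to succeed):

```bash
NS=sentry
POD=sentry-kafka-0
GROUP=snuba-events-subscriptions-consumers
TOPIC=snuba-commit-log
KCG=/opt/bitnami/kafka/bin/kafka-consumer-groups.sh

# List consumer groups, inspect the failing group, then reset its offsets to latest
kubectl exec "$POD" -n "$NS" -- "$KCG" --bootstrap-server 127.0.0.1:9092 --list
kubectl exec "$POD" -n "$NS" -- "$KCG" --bootstrap-server 127.0.0.1:9092 --group "$GROUP" --describe
kubectl exec "$POD" -n "$NS" -- "$KCG" --bootstrap-server 127.0.0.1:9092 --group "$GROUP" \
  --topic "$TOPIC" --reset-offsets --to-latest --execute
```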
diff --git a/runbooks/infrastructure/rotating-littlepay-aws-keys.md b/runbooks/infrastructure/rotating-littlepay-aws-keys.md index 2d7bad6e82..b92610d9a9 100644 --- a/runbooks/infrastructure/rotating-littlepay-aws-keys.md +++ b/runbooks/infrastructure/rotating-littlepay-aws-keys.md @@ -1,4 +1,5 @@ # Rotating LittlePay AWS Account Keys + > Some of this is taken from the provided Littlepay documentation, with Cal-ITP specific content added. LittlePay requests that clients accessing their raw data feeds through S3 rotate the IAM keys every 90 days. They provide general instructions for doing so with the `aws` CLI tool. The following gives additional context on the Cal-ITP setup, and should be used in conjunction with those instructions. @@ -8,13 +9,15 @@ LittlePay requests that clients accessing their raw data feeds through S3 rotate You'll need to install the `aws` CLI locally, and [configure a profile](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html) for each Littlepay `merchant_id` (aka `account name` or `instance` in Littlepay). Access keys can be found in one of the two following locations. -* [Google Cloud Secret Manager](https://console.cloud.google.com/security/secret-manager?project=cal-itp-data-infra) - Search for secrets of the format `LITTLEPAY_AWS_IAM__ACCESS_KEY`; secret names are uppercase, and we have removed hyphens (or convert to underscore) in the `merchant_id` historically. -* (Deprecated) [VaultWarden](https://vaultwarden.jarv.us/#/vault) - Search for "_aws_" and you should see several entries with names that match the format "_Cal-ITP Littlepay AWS IAM Keys ()_"). + +- [Google Cloud Secret Manager](https://console.cloud.google.com/security/secret-manager?project=cal-itp-data-infra) - Search for secrets of the format `LITTLEPAY_AWS_IAM__ACCESS_KEY`; secret names are uppercase, and we have removed hyphens (or convert to underscore) in the `merchant_id` historically. +- (Deprecated) [VaultWarden](https://vaultwarden.jarv.us/#/vault) - Search for "_aws_" and you should see several entries with names that match the format "_Cal-ITP Littlepay AWS IAM Keys (\)_"). You can confirm you have access by listing the keys for the specific instance using the profile: `aws iam list-access-keys --user-name --profile `. **Note that the username may not be exactly the same as the `merchant_id`; check the key JSON or error output for the username.** For example, `An error occurred (AccessDenied) when calling the ListAccessKeys operation: User: arn:aws:iam::840817857296:user/system/sbmtd-default is not authorized to perform: iam:ListAccessKeys on resource: user sbmtd because no identity-based policy allows the iam:ListAccessKeys action` indicates that the username is `sbmtd-default`. ## Create and store new credentials + Secret Manager allows multiple versions of secrets, so create a new version of each secret (the access key ID and the secret key) and paste the new key as the value. `aws iam create-access-key --user-name --profile ` @@ -24,10 +27,12 @@ You could also use the CLI to create new versions of secrets, but the web UI is `gcloud secrets versions add ...` ## Disable old credentials and test new credentials + Disable (*do not destroy*) the old versions in Secret Manager and test the `sync_littlepay` Airflow DAG, which is the primary integration with Littlepay in the v2 pipeline. (Legacy) You may also need to change the credentials in any Data Transfer jobs that are still configured to read Littlepay data. 
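Putting the create-and-store steps together, a minimal sketch (using the `sbmtd-default` example from above; the secret names follow an assumed pattern, and `jq` is assumed to be installed; substitute the real profile, IAM user, and secret names):

```bash
PROFILE=sbmtd-default                                 # aws CLI profile for this merchant
USER_NAME=sbmtd-default                               # IAM user name (may differ from merchant_id)
ACCESS_KEY_SECRET=LITTLEPAY_AWS_IAM_SBMTD_ACCESS_KEY  # assumed secret name
SECRET_KEY_SECRET=LITTLEPAY_AWS_IAM_SBMTD_SECRET_KEY  # assumed secret name

# Confirm access, then create the replacement key pair
aws iam list-access-keys --user-name "$USER_NAME" --profile "$PROFILE"
aws iam create-access-key --user-name "$USER_NAME" --profile "$PROFILE" > new-key.json

# Store the new values as new versions of the existing secrets in Secret Manager
jq -r '.AccessKey.AccessKeyId' new-key.json | gcloud secrets versions add "$ACCESS_KEY_SECRET" --data-file=-
jq -r '.AccessKey.SecretAccessKey' new-key.json | gcloud secrets versions add "$SECRET_KEY_SECRET" --data-file=-

# Remove the local copy of the key material once it is stored
rm new-key.json
```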
## Delete old credentials from AWS + Assuming all tests succeed, delete the old credentials from AWS. `aws iam delete-access-key --access-key-id --profile ` diff --git a/runbooks/pipeline/sentry-triage.md b/runbooks/pipeline/sentry-triage.md index c58ade8b06..127dd166fd 100644 --- a/runbooks/pipeline/sentry-triage.md +++ b/runbooks/pipeline/sentry-triage.md @@ -1,19 +1,23 @@ # Sentry triage + > Important Sentry concepts: -> * [Issue states](https://docs.sentry.io/product/issues/states-triage/) -> * [Fingerprinting and grouping](https://docs.sentry.io/product/sentry-basics/grouping-and-fingerprints/) -> * [Merging issues](https://docs.sentry.io/product/data-management-settings/event-grouping/merging-issues/) +> +> - [Issue states](https://docs.sentry.io/product/issues/states-triage/) +> - [Fingerprinting and grouping](https://docs.sentry.io/product/sentry-basics/grouping-and-fingerprints/) +> - [Merging issues](https://docs.sentry.io/product/data-management-settings/event-grouping/merging-issues/) Once a day, the person responsible for triage should check Sentry for new and current issues. There are two separate things to check: -* All **new issues** from the past 24 hours. An issue is a top-level error/failure/warning, and a new issue represents something we have't seen before (as opposed to a new event instance of an issue that's been occurring for a while). These should be top priority to investigate since they represent net-new problems. - * To identify: use the ["Daily Triage (new issues)" Issues search](https://sentry.calitp.org/organizations/sentry/issues/searches/5/?environment=cal-itp-data-infra&project=2) - * The search criteria is: `is:unresolved firstSeen:-24h` on the [`Issues` page](https://sentry.calitp.org/organizations/sentry/issues/) - * Note that this search will not identify regressions that are reappearing as "new"; regressions should appear on the observed events search noted below. You can also use the [`For Review` Issue search tab](https://sentry.calitp.org/organizations/sentry/issues/?environment=cal-itp-data-infra&project=2&query=is%3Aunresolved+is%3Afor_review) to check for regressions. +- All **new issues** from the past 24 hours. An issue is a top-level error/failure/warning, and a new issue represents something we have't seen before (as opposed to a new event instance of an issue that's been occurring for a while). These should be top priority to investigate since they represent net-new problems. + + - To identify: use the ["Daily Triage (new issues)" Issues search](https://sentry.calitp.org/organizations/sentry/issues/searches/5/?environment=cal-itp-data-infra&project=2) + - The search criteria is: `is:unresolved firstSeen:-24h` on the [`Issues` page](https://sentry.calitp.org/organizations/sentry/issues/) + - Note that this search will not identify regressions that are reappearing as "new"; regressions should appear on the observed events search noted below. You can also use the [`For Review` Issue search tab](https://sentry.calitp.org/organizations/sentry/issues/?environment=cal-itp-data-infra&project=2&query=is%3Aunresolved+is%3Afor_review) to check for regressions. + +- All **observed events** from the past 24 hours. An event is an instance of an issue, so these may not be *new* but we want to monitor all currently-active errors. For this monitoring we suppress some of the noisy `RTFetchException` and `CalledProcessError` events that tend to happen intermittently most days. -* All **observed events** from the past 24 hours. 
An event is an instance of an issue, so these may not be *new* but we want to monitor all currently-active errors. For this monitoring we suppress some of the noisy `RTFetchException` and `CalledProcessError` events that tend to happen intermittently most days. - * To identify: use the ["Daily triage (all events)" Discover search](https://sentry.calitp.org/organizations/sentry/discover/results/?environment=cal-itp-data-infra&id=1&project=2&statsPeriod=24h) - * The search criteria is: `event.type:error (!message:RTFetchException OR count > 15) (!message:CalledProcessError OR count > 1) (!message:DbtTestWarn)` on the [`Discover` page](https://sentry.calitp.org/organizations/sentry/discover/queries/) + - To identify: use the ["Daily triage (all events)" Discover search](https://sentry.calitp.org/organizations/sentry/discover/results/?environment=cal-itp-data-infra&id=1&project=2&statsPeriod=24h) + - The search criteria is: `event.type:error (!message:RTFetchException OR count > 15) (!message:CalledProcessError OR count > 1) (!message:DbtTestWarn)` on the [`Discover` page](https://sentry.calitp.org/organizations/sentry/discover/queries/) Categorize the issues/events identified and perform relevant steps if the issue is not already assigned (particularly for the second search, existing issues may already be assigned so you may not need to do anything new). @@ -21,21 +25,23 @@ Categorize the issues/events identified and perform relevant steps if the issue When creating GitHub issues from Sentry: -* Verify that no secrets or other sensitive information is contained in the generated issue body. Sentry's data masking is not perfect (and we may make a configuration mistake), so it's good to double-check. +- Verify that no secrets or other sensitive information is contained in the generated issue body. Sentry's data masking is not perfect (and we may make a configuration mistake), so it's good to double-check. -* Clean up the issue so that someone looking at it later will understand what the error actually is. The auto-generated issues will only contain the exception text and a link back to Sentry; making a more human-friendly issue title and description is helpful. +- Clean up the issue so that someone looking at it later will understand what the error actually is. The auto-generated issues will only contain the exception text and a link back to Sentry; making a more human-friendly issue title and description is helpful. ## Issue types Most issues fall into a few broad categories. ### External, and a retry does not handle it (most common) + This category includes external issues that a retry cannot resolve. For example, an RT feed that intermittently throws 500s. For RT fetch issues specifically: -* *If the issue occurred <10 times:* Do nothing (don't "ignore" or "resolve" in Sentry, just leave the issue as-is.) -* *If the issue occurred more than 10 times:* Check the [feed-level Metabase dashboard](https://dashboards.calitp.org/dashboard/112-feed-level-v2?date_filter=past3days~&text=Bay%20Area%20511%20Regional%20Alerts&text=Bay%20Area%20511%20Regional%20TripUpdates&text=Bay%20Area%20511%20Regional%20VehiclePositions) to see whether it seems like the feed has had a sustained outage. If they have had an outage lasting more than 6 hours, plan to check again the following day. Once a feed has been down for a full 24 hours we should notify a relevant customer success manager to monitor and, if the outage lasts long enough, consider contacting the agency. 
+- *If the issue occurred \<10 times:* Do nothing (don't "ignore" or "resolve" in Sentry, just leave the issue as-is.) +- *If the issue occurred more than 10 times:* Check the [feed-level Metabase dashboard](https://dashboards.calitp.org/dashboard/112-feed-level-v2?date_filter=past3days~&text=Bay%20Area%20511%20Regional%20Alerts&text=Bay%20Area%20511%20Regional%20TripUpdates&text=Bay%20Area%20511%20Regional%20VehiclePositions) to see whether it seems like the feed has had a sustained outage. If they have had an outage lasting more than 6 hours, plan to check again the following day. Once a feed has been down for a full 24 hours we should notify a relevant customer success manager to monitor and, if the outage lasts long enough, consider contacting the agency. ### Bug, or external issue handleable by retry + This category includes dbt test/model failures/errors, Python/SQL code bugs, or external API calls that we are not retrying properly. 1. Create a GitHub issue to fix the bug (or add a retry) and assign if there is a clear owner. @@ -45,15 +51,16 @@ This category includes dbt test/model failures/errors, Python/SQL code bugs, or 2. In the eventual PR that should fix the issue, resolving the GitHub issue should also resolve the Sentry issue. You can also reference a Sentry issue to close directly via the PR description, e.g. `fixes CAL-ITP-DATA-INFRA-D5`. ### A fingerprinting error (i.e. too little or too much grouping) + This category primarily includes unhandled data processing exceptions (e.g. RTFetchException, CalledProcessError) whose fingerprint results in issues being improperly grouped together (for example, the same RTFetchException occurring on different feeds) or failing to be grouped together (for example, an exception message containing a Python object hash that is different in every exception instance). -* Too little grouping (i.e. too granular fingerprint) - 1. Merge the issues together. ![](sentry_merging.png) - 2. Create a GitHub issue to update the fingerprint, linking to the now-merged issue. -* Too much grouping (i.e. too vague fingerprint) - 1. Create a GitHub issue to update the fingerprint, usually adding additional values to the fingerprint to distinguish between different errors. - 2. For example, you may want to split up an issue by feed URL, which would mean adding the feed URL to the fingerprint. - 3. When the new fingerprint has been deployed, _resolve_ the existing issue since it should no longer appear. +- Too little grouping (i.e. too granular fingerprint) + 1. Merge the issues together. ![](sentry_merging.png) + 2. Create a GitHub issue to update the fingerprint, linking to the now-merged issue. +- Too much grouping (i.e. too vague fingerprint) + 1. Create a GitHub issue to update the fingerprint, usually adding additional values to the fingerprint to distinguish between different errors. + 2. For example, you may want to split up an issue by feed URL, which would mean adding the feed URL to the fingerprint. + 3. When the new fingerprint has been deployed, _resolve_ the existing issue since it should no longer appear. 
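As a small sketch of filing that fingerprint follow-up from the command line with the `gh` CLI (the repo is real, but the title, body, and linked Sentry ID are illustrative):

```bash
gh issue create \
  --repo cal-itp/data-infra \
  --title "Add feed URL to RTFetchException Sentry fingerprint" \
  --body "Sentry issue CAL-ITP-DATA-INFRA-D5 groups RTFetchException events from different feeds together; add the feed URL to the fingerprint so each feed gets its own issue."
```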
## Additional Triage Task: Friday Performance Check diff --git a/services/gtfs-rt-archiver-v3/README.md b/services/gtfs-rt-archiver-v3/README.md index 4c3450582c..6220e499a8 100644 --- a/services/gtfs-rt-archiver-v3/README.md +++ b/services/gtfs-rt-archiver-v3/README.md @@ -7,41 +7,46 @@ This is the third iteration of our [GTFS Realtime (RT)](https://gtfs.org/realtim > [huey](https://github.com/coleifer/huey) is a minimal/lightweight task queue library that we use to enqueue tasks for asynchronous/parallel execution by workers. The full archiver application is composed of three pieces: + 1. A ticker pod that creates fetch tasks every 20 seconds, based on the latest download configurations - * Configurations are fetched from GCS and cached for 5 minutes; they are generated upstream by [generate_gtfs_download_configs](../../airflow/dags/airtable_loader_v2/generate_gtfs_download_configs.py) - * Fetches are enqueued as Huey tasks + - Configurations are fetched from GCS and cached for 5 minutes; they are generated upstream by [generate_gtfs_download_configs](../../airflow/dags/airtable_loader_v2/generate_gtfs_download_configs.py) + - Fetches are enqueued as Huey tasks 2. A Redis instance holding the Huey queue - * We deploy a single instance per environment namespace -(e.g. `gtfs-rt-v3`, `gtfs-rt-v3-test`) with _no disk space_ and _no horizontal scaling_; we do not care about persistence because only fresh fetch tasks are relevant anyways. - * In addition, the RT archiver relies on having low I/O latency with Redis to -minimize the latency of fetch starts. Due to these considerations, these Redis -instances should **NOT** be used for any other applications. + - We deploy a single instance per environment namespace + (e.g. `gtfs-rt-v3`, `gtfs-rt-v3-test`) with _no disk space_ and _no horizontal scaling_; we do not care about persistence because only fresh fetch tasks are relevant anyways. + - In addition, the RT archiver relies on having low I/O latency with Redis to + minimize the latency of fetch starts. Due to these considerations, these Redis + instances should **NOT** be used for any other applications. 3. Some number (greater than 1) of consumer pods that execute enqueued fetch tasks, making HTTP requests and saving the raw responses (and metadata such as headers) to GCS - * Each consumer pod runs some number of worker threads - * As of 2023-04-10, the production archiver has 6 consumer pods each managing 24 worker threads + - Each consumer pod runs some number of worker threads + - As of 2023-04-10, the production archiver has 6 consumer pods each managing 24 worker threads These deployments are defined in the [relevant kubernetes manifests](../../kubernetes/apps/manifests/gtfs-rt-archiver-v3) and overlaid with kustomize per-environment (e.g. [gtfs-rt-archiver-v3-test](../../kubernetes/apps/overlays/gtfs-rt-archiver-v3-test)). ## Observability ### Metrics + We've created a [Grafana dashboard](https://monitoring.calitp.org/d/AqZT_PA4k/gtfs-rt-archiver) to display the [metrics](./gtfs_rt_archiver_v3/metrics.py) for this application, based on our desired goals of capturing data to the fullest extent possible and being able to track 20-second update frequencies in the feeds. Counts of task successes is our overall sign of health (i.e. we are capturing enough data) while other metrics such as task delay or download time are useful for identifying bottlenecks or the need for increased resources. ### Alerts + There are two important alerts defined in Grafana based on these metrics. 
-* [Minimum task successes](https://monitoring.calitp.org/alerting/grafana/nrbFSw0Vz/view) -* [Expiring tasks](https://monitoring.calitp.org/alerting/grafana/O595SQA4k/view) + +- [Minimum task successes](https://monitoring.calitp.org/alerting/grafana/nrbFSw0Vz/view) +- [Expiring tasks](https://monitoring.calitp.org/alerting/grafana/O595SQA4k/view) Both of these tasks can fire if the archiver is only partially degraded, but the first alert is our best catch-all detection mechanism for any downtime. There are other potential issues (e.g. outdated download configs) that are flagged in the dashboard but do not currently have configured alerts. ### Error reporting + We log errors and exceptions (both caught and uncaught) to our [Sentry instance](https://sentry.calitp.org/) via the [Python SDK for Sentry](https://github.com/getsentry/sentry-python). Common problems include: -* Failure to connect to Redis following a node upgrade; this is typically fixed by [restarting the archiver](#restarting-the-archiver). -* `RTFetchException`, a custom class specific to failures during feed download; these can be provider-side (i.e. the agency/vendor) or consumer-side (i.e. us) and are usually fixed (if possible) by [changing download configurations](#fixing-download-configurations). Common examples (and HTTP error code if relevant) include: - * Missing or invalid authentication (401/403) - * Changed URLs (404) - * Intermittent outages/errors (may be a ConnectionError or a 500 response) +- Failure to connect to Redis following a node upgrade; this is typically fixed by [restarting the archiver](#restarting-the-archiver). +- `RTFetchException`, a custom class specific to failures during feed download; these can be provider-side (i.e. the agency/vendor) or consumer-side (i.e. us) and are usually fixed (if possible) by [changing download configurations](#fixing-download-configurations). Common examples (and HTTP error code if relevant) include: + - Missing or invalid authentication (401/403) + - Changed URLs (404) + - Intermittent outages/errors (may be a ConnectionError or a 500 response) ## Operations/maintenance @@ -50,12 +55,15 @@ We log errors and exceptions (both caught and uncaught) to our [Sentry instance] > These `kubectl` commands assume your shell is in the `kubernetes` directory, but you could run them from root and just prepend `kubernetes/` to the file paths. ### Restarting the archiver + Rolling restarts with `kubectl` use the following syntax. + ```shell kubectl rollout restart deployment.apps/ -n ``` So for example, to restart all 3 deployments in test, you would run the following. + ```shell kubectl rollout restart deployment.apps/redis -n gtfs-rt-v3-test kubectl rollout restart deployment.apps/gtfs-rt-archiver-ticker -n gtfs-rt-v3-test @@ -63,12 +71,15 @@ kubectl rollout restart deployment.apps/gtfs-rt-archiver-consumer -n gtfs-rt-v3- ``` ### Deploying configuration changes + Environment-agnostic configurations live in [app vars](../../kubernetes/apps/manifests/gtfs-rt-archiver-v3/archiver-app-vars.yaml) while environment-specific configurations live in [channel vars](../../kubernetes/apps/overlays/gtfs-rt-archiver-v3-test/archiver-channel-vars.yaml). You can edit these files and deploy the changes with `kubectl`. + ``` kubectl apply -k apps/overlays/gtfs-rt-archiver-v3- ``` For example, you can apply the configmap values in [test](../../kubernetes/apps/overlays/gtfs-rt-archiver-v3-test/archiver-channel-vars.yaml) with the following. 
+ ``` kubectl apply -k apps/overlays/gtfs-rt-archiver-v3-test ``` @@ -76,7 +87,9 @@ kubectl apply -k apps/overlays/gtfs-rt-archiver-v3-test Running `apply` will also deploy the archiver from scratch if it is not deployed yet, as long as the proper namespace exists. ### Deploying code changes + Code changes require building and pushing a new Docker image, as well as applying `kubectl` changes to point the deployment at the new image. + 1. Make code changes and increment version in `pyproject.toml` 1. Ex. `poetry version 2023.4.10` 2. Change image tag version in the environments `kustomization.yaml`. @@ -88,6 +101,7 @@ Code changes require building and pushing a new Docker image, as well as applyin 1. Currently, the image is built/pushed on merges to main but the Kubernetes manifests are not applied. ### Changing download configurations + GTFS download configurations (for both Schedule and RT) are sourced from the [GTFS Dataset table](https://airtable.com/appPnJWrQ7ui4UmIl/tbl5V6Vjs4mNQgYbc) in the California Transit Airtable base, and we have [specific documentation](https://docs.google.com/document/d/1IO8x9-31LjwmlBDH0Jri-uWI7Zygi_IPc9nqd7FPEQM/edit#heading=h.b2yta6yeugar) for modifying the table. (Both of these Airtable links require authentication/access to Airtable.) You may need to make URL or authentication adjustments in this table. This data is downloaded daily into our infrastructure and will propagate to the GTFS Schedule and RT downloads; you may execute the [Airtable download job](https://o1d2fa0877cf3fb10p-tp.appspot.com/dags/airtable_loader_v2/grid) manually after making edits to "deploy" the changes more quickly. Another possible intervention is updating or adding authentication information in [Secret Manager](https://console.cloud.google.com/security/secret-manager). You may create new versions of existing secrets, or add entirely new secrets. Secrets must be tagged with `gtfs_rt: true` to be loaded as secrets in the archiver; secrets are refreshed every 5 minutes by the ticker. diff --git a/warehouse/README.md b/warehouse/README.md index 4fc58e01e2..bfb6034736 100644 --- a/warehouse/README.md +++ b/warehouse/README.md @@ -3,25 +3,36 @@ This dbt project is intended to be the source of truth for the cal-itp-data-infra BigQuery warehouse. ## Setting up the project in your JupyterHub personal server + If you are developing dbt models in JupyterHub, the following pieces are already configured/installed. -* Libraries such as gdal and graphviz -* The `gcloud` CLI -* `poetry` + +- Libraries such as gdal and graphviz +- The `gcloud` CLI +- `poetry` > You may have already authenticated gcloud and the GitHub CLI (gh) if you followed the -[JupyterHub setup docs](https://docs.calitp.org/data-infra/analytics_tools/jupyterhub.html). If not, follow those instructions before proceeding. +> [JupyterHub setup docs](https://docs.calitp.org/data-infra/analytics_tools/jupyterhub.html). If not, follow those instructions before proceeding. ### Clone and install the warehouse project + 1. [Clone](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository) the `data-infra` repo via `git clone git@github.com:cal-itp/data-infra.git` if you haven't already. Use [SSH](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account), not HTTPS. If you haven't made a folder/directory for your git repos yet, you can create one with `mkdir git` (within your home directory, usually). + 1. 
You may be prompted to accept GitHub key's fingerprint if you are cloning a repository for the first time. + 2. The rest of these instructions assume you are in the `warehouse/` directory of the repository. + 1. You will need to `cd` to it via `cd /data-infra/warehouse/` or similar; for example, if you had created your directory with `mkdir git`, you will navigate to the warehouse directory with `cd git/data-infra/warehouse/`. + 3. Execute `poetry install` to create a virtual environment and install requirements. + 4. Execute `poetry run dbt deps` to install the dbt dependencies defined in `packages.yml` (such as `dbt_utils`). + 5. Ensure that `DBT_PROFILES_DIR` is set to something like `~/.dbt/`; in JupyterHub, it should already be set to `/home/jovyan/.dbt/`. You can check with `echo $DBT_PROFILES_DIR`. + 6. Execute `poetry run dbt init` to create the `$DBT_PROFILES_DIR` directory and a pre-built `profiles.yml` file; you will be prompted to enter a personal `schema` which is used as a prefix for your personal development environment schemas. The output should look similar to the following: + ``` ➜ poetry run dbt init 19:14:32 Running with dbt=1.4.5 @@ -30,10 +41,13 @@ are already configured/installed. maximum_bytes_billed (the maximum number of bytes allowed per BigQuery query; default is 2 TB) [2000000000000]: 19:14:35 Profile calitp_warehouse written to /Users/andrewvaccaro/.dbt/profiles.yml using project's profile_template.yml and your supplied values. Run 'dbt debug' to validate the connection. ``` + See [the dbt docs on profiles.yml](https://docs.getdbt.com/dbt-cli/configure-your-profile) for more background on this file. > Note: This default profile template will set a maximum bytes billed of 2 TB; no models should fail with the default lookbacks in our development environment, even with a full refresh. You can override this limit during the init, or change it later by calling init again and choosing to overwrite (or editing the profiles.yml directly). + 7. Check whether `~/.dbt/profiles.yml` was successfully created, e.g. `cat ~/.dbt/profiles.yml`. If you encountered an error, you may create it by hand and fill it with the same content: + ```yaml calitp_warehouse: outputs: @@ -61,7 +75,9 @@ are already configured/installed. spark.dynamicAllocation.maxExecutors: "16" target: dev ``` + 8. Finally, test your connection to our staging BigQuery project with `poetry run dbt debug`. You should see output similar to the following. 
+
```
➜ warehouse git:(jupyterhub-dbt) ✗ poetry run dbt debug
16:50:15 Running with dbt=1.4.5
@@ -115,36 +131,38 @@ Once you have performed the setup above, you are good to go run
Some additional helpful commands:
-* `poetry run dbt test` -- will test all the models (this executes SQL in the warehouse to check tables); for this to work, you first need to `dbt run` to generate all the tables to be tested
-* `poetry run dbt compile` -- will compile all the models (generate SQL, with references resolved) but won't execute anything in the warehouse; useful for visualizing what dbt will actually execute
-* `poetry run dbt docs generate` -- will generate the dbt documentation
-* `poetry run dbt docs serve` -- will "serve" the dbt docs locally so you can access them via `http://localhost:8080`; note that you must `docs generate` before you can `docs serve`
+- `poetry run dbt test` -- will test all the models (this executes SQL in the warehouse to check tables); for this to work, you first need to `dbt run` to generate all the tables to be tested
+- `poetry run dbt compile` -- will compile all the models (generate SQL, with references resolved) but won't execute anything in the warehouse; useful for visualizing what dbt will actually execute
+- `poetry run dbt docs generate` -- will generate the dbt documentation
+- `poetry run dbt docs serve` -- will "serve" the dbt docs locally so you can access them via `http://localhost:8080`; note that you must `docs generate` before you can `docs serve`

### Incremental model considerations
+
We make heavy use of [incremental models](https://docs.getdbt.com/docs/build/incremental-models) in the Cal-ITP warehouse since we have large data volumes, but that data arrives in a relatively consistent pattern (i.e. temporal).

**In development**, there is a maximum lookback defined for incremental runs. The purpose of this is to handle situations where a developer may not have executed a model for a period of time. It's easy to handle full refreshes with a maximum lookback; we simply template in `N days ago` rather than the "true" start of the data for full refreshes. However, we also template in `MAX(N days ago, max DT of existing table)` for developer incremental runs; otherwise, going a month without executing a model would mean that a naive incremental implementation would then read in that full month of data.

This means that your development environment can end up with gaps of data; if you've gone a month without executing a model, and then you execute a regular `run` that reads in the past `N` (7 currently) days of data, you will have a ~23 day gap. If this gap is unacceptable, you can resolve this in one of two ways.

-* If you are able to develop and test with only recent data, execute a `--full-refresh` on your model(s) and all parents. This will drop the existing tables and re-build them with the last 7 days of data.
-* If you need historical data for your analysis, copy the production table into your development dataset with a BigQuery `CREATE TABLE ... COPY ...` statement.

> Note: These instructions assume you are on macOS, but are largely similar for
> other operating systems. Most \*nix OSes will have a package manager that you
> should use instead of Homebrew.
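For example, on a Debian or Ubuntu machine a rough equivalent of the Homebrew-based setup described below would use `apt` instead; this is only a sketch, and the package names and available Python versions are assumptions that vary by distribution.

```
# Rough non-macOS sketch: install the system-level prerequisites (gdal, graphviz,
# Python 3.9) with apt instead of Homebrew. Package names vary by distribution.
sudo apt-get update
sudo apt-get install -y gdal-bin libgdal-dev graphviz python3.9 python3.9-venv
```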
> > Note: if you get `Operation not permitted` when attempting to use the terminal, > you may need to [fix your terminal permissions](https://osxdaily.com/2018/10/09/fix-operation-not-permitted-terminal-error-macos/) > > You can enable [displaying hidden folders/files in macOS Finder](https://www.macworld.com/article/671158/how-to-show-hidden-files-on-a-mac.html) - but generally, we recommend using the terminal when possible for editing - these files. Generally, `nano ~/.dbt/profiles.yml` will be the easiest method - for editing your personal profiles file. `nano` is a simple terminal-based - text editor; you use the arrows keys to navigate and the hotkeys displayed - at the bottom to save and exit. Reading an [online tutorial](https://www.howtogeek.com/howto/42980/the-beginners-guide-to-nano-the-linux-command-line-text-editor/) - may be useful if you haven't used a terminal-based editor before. +> but generally, we recommend using the terminal when possible for editing +> these files. Generally, `nano ~/.dbt/profiles.yml` will be the easiest method +> for editing your personal profiles file. `nano` is a simple terminal-based +> text editor; you use the arrows keys to navigate and the hotkeys displayed +> at the bottom to save and exit. Reading an [online tutorial](https://www.howtogeek.com/howto/42980/the-beginners-guide-to-nano-the-linux-command-line-text-editor/) +> may be useful if you haven't used a terminal-based editor before. ### Install Homebrew (if you haven't) @@ -180,11 +198,11 @@ If you prefer to install dbt locally and use your own development environment, y addition upon its completion. On an OSX device using zshell, for instance, that line should be added to the ~/.zshrc file. 2. Restart your terminal and confirm `poetry --version` works. 3. Ensure you have set the environment variable `DBT_PROFILES_DIR=~/.dbt/` in your `~/.zshrc`. You can either restart your terminal after setting it, or run `source ~/.zshrc`. -4. Follow the [warehouse setup instructions](#Set up the warehouse dbt project) +4. Follow the \[warehouse setup instructions\](#Set up the warehouse dbt project) 5. If this doesn’t work because of an error with Python version, you may need to install Python 3.9 - 2. `brew install python@3.9` - 3. `brew link python@3.9` - 4. After restarting the terminal, confirm with `python3 --version` and retry `poetry install` + 2\. `brew install python@3.9` + 3\. `brew link python@3.9` + 4\. After restarting the terminal, confirm with `python3 --version` and retry `poetry install` ### Dataproc configuration diff --git a/warehouse/scripts/templates/ci_report.md b/warehouse/scripts/templates/ci_report.md index 9afaf19475..bbbab33452 100644 --- a/warehouse/scripts/templates/ci_report.md +++ b/warehouse/scripts/templates/ci_report.md @@ -1,17 +1,18 @@ Warehouse report 📦 {% if new_models or modified_or_downstream_incremental_models %} + ### Checks/potential follow-ups Checks indicate the following action items may be necessary. {% if new_models -%} + - [ ] For new models, do they all have a surrogate primary key that is tested to be not-null and unique? -{%- endif %} -{% if modified_or_downstream_incremental_models -%} + {%- endif %} + {% if modified_or_downstream_incremental_models -%} - [ ] For modified incremental models (or incremental models whose parents are modified), does the PR description identify whether a full refresh is needed for these tables? 
-{%- endif %} -{% endif %} - + {%- endif %} + {% endif %} {% if new_models %} @@ -38,7 +39,7 @@ Checks indicate the following action items may be necessary. Legend (in order of precedence) | Resource type | Indicator | Resolution | -|------------------------------------------------|-------------|---------------------------------------| +| ---------------------------------------------- | ----------- | ------------------------------------- | | Large table-materialized model | Orange | Make the model incremental | | Large model without partitioning or clustering | Orange | Add partitioning and/or clustering | | View with more than one child | Yellow | Materialize as a table or incremental |
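As a concrete follow-up to the checklist and legend above, a developer checking out the PR could fully refresh a flagged incremental model and its parents from the `warehouse/` directory; the model name below is a placeholder, not an actual Cal-ITP model.

```
# Hypothetical follow-up: fully refresh one flagged incremental model plus all of
# its upstream parents (the leading + in the selector includes ancestors).
poetry run dbt run --full-refresh --select +fct_example_incremental
```

In the development environment this still only rebuilds the configured lookback window (currently about 7 days), as described in the incremental model considerations above, so a full refresh should stay within the default `maximum_bytes_billed` limit.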