Upgrade jupyter-book to v1.0.0 (#3265)
SorenSpicknall authored Feb 7, 2024
1 parent 472d9ea commit 4a092b4
Showing 24 changed files with 791 additions and 385 deletions.
46 changes: 29 additions & 17 deletions docs/analytics_new_analysts/01-data-analysis-intro.md
@@ -4,28 +4,30 @@

Below are Python tutorials covering the basics of data cleaning and wrangling. [Chris Albon's guide](https://chrisalbon.com/#python) is particularly helpful. Rather than reinventing the wheel, this tutorial instead highlights specific methods and operations that might make your life easier as a data analyst.

- [Import and export data in Python](#import-and-export-data-in-python)
- [Merge tabular and geospatial data](#merge-tabular-and-geospatial-data)
- [Import and export data in Python](#data-analysis-import-and-export-data-in-python)
- [Merge tabular and geospatial data](#data-analysis-merge-tabular-and-geospatial-data)
- [Functions](#functions)
- [Grouping](#grouping)
- [Aggregating](#aggregating)
- [Export aggregated output](#export-aggregated-output)

## Getting Started

```
```python
import numpy as np
import pandas as pd
import geopandas as gpd
```

## Import and Export Data in Python
(data-analysis-import-and-export-data-in-python)=

## Import and Export Data in Python for Data Analysis

### **Local files**

We import a CSV file `my_csv.csv` and an Excel spreadsheet `my_excel.xlsx` as dataframes.

```
```python
df = pd.read_csv('./folder/my_csv.csv')

df = pd.read_excel('./folder/my_excel.xlsx', sheet_name = 'Sheet1')
@@ -35,7 +37,7 @@ df = pd.read_excel('./folder/my_excel.xlsx', sheet_name = 'Sheet1')

The data we use outside of the warehouse can be stored in GCS buckets.

```
```python
# Read from GCS
df = pd.read_csv('gs://calitp-analytics-data/data-analyses/bucket-name/df_csv.csv')

@@ -45,7 +47,9 @@ df.to_csv('gs://calitp-analytics-data/data-analyses/bucket-name/df_csv.csv')

Refer to the [Data Management best practices](data-management-page) and [Basics of Working with Geospatial Data](geo-intro) for additional information on importing various file types.

## Merge Tabular and Geospatial Data
(data-analysis-merge-tabular-and-geospatial-data)=

## Merge Tabular and Geospatial Data for Data Analysis

Merging data from multiple sources creates one large dataframe (df) on which to perform the analysis. Let's say there are 3 sources of data that need to be merged:

@@ -81,7 +85,7 @@ Dataframe #3: `council_boundaries` (geospatial)

First, merge `paunch_locations` with `council_population` using the `CD` column, which they have in common.

```
```python
merge1 = pd.merge(
paunch_locations,
council_population,
@@ -96,7 +100,7 @@ merge1 = pd.merge(

Next, merge `merge1` and `council_boundaries`. Columns don't have to have the same names to be matched on, as long as they hold the same values.

```
```python
merge2 = pd.merge(
merge1,
council_boundaries,
@@ -123,6 +127,8 @@ Here are some things to know about `merge2`:
| 5 | Pawnee | $4 | 1 | (x5, y5) | Leslie Knope | 1,500 | polygon |
| 6 | Pawnee | $6 | 2 | (x6, y6) | Jeremy Jamm | 2,000 | polygon |
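One way to sanity-check a merge like the ones above is pandas' `indicator` argument, which flags rows that didn't find a match. A minimal sketch, using made-up stand-ins for the Paunch Burger data:

```python
import pandas as pd

# Hypothetical stand-ins for paunch_locations and council_population.
locations = pd.DataFrame({
    'Store': [1, 2, 3],
    'CD': [1, 2, 7],  # CD 7 has no matching council district
})
population = pd.DataFrame({
    'CD': [1, 2, 3],
    'Population': [1500, 2000, 2250],
})

# indicator=True adds a _merge column flagging unmatched rows.
checked = pd.merge(locations, population, on='CD', how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
```

Filtering on `_merge` makes it easy to spot rows that silently dropped out of (or failed to match in) a merge.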

(functions)=

## Functions

A function is a set of instructions to *do something*. It can be as simple as changing values in a column or as complicated as a series of steps to clean, group, aggregate, and plot the data.
@@ -142,7 +148,7 @@ Lambda functions are quick and dirty. You don't even have to name the function!

### **If-Else Statements**

```
```python
# Create column called duration. If Songs > 10, duration is 'long'.
# Otherwise, duration is 'short'.
df['duration'] = df.apply(lambda row: 'long' if row.Songs > 10
@@ -174,7 +180,7 @@ df
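The same if-else labeling can also be done without `apply`: `np.where` is a vectorized alternative. A sketch, with an invented `Songs` column mirroring the example above:

```python
import numpy as np
import pandas as pd

# Hypothetical df with a Songs column, mirroring the lambda example.
df = pd.DataFrame({'Band': ['Mouse Rat', 'Jet Black Pope'], 'Songs': [12, 4]})

# np.where(condition, value_if_true, value_if_false) works on the whole
# column at once, avoiding the row-by-row apply.
df['duration'] = np.where(df.Songs > 10, 'long', 'short')
```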

### **Other Lambda Functions**

```
```python
# Split the band name at the spaces
# [1] means we want to extract the second word
# [0:2] means we want to start at the first character
@@ -197,7 +203,7 @@ You should use a full function when a function is too complicated to be a lambda

`df.apply` is one common usage of a function.

```
```python
def years_active(row):
if row.Band == 'Mouse Rat':
return '2009-2014'
@@ -218,13 +224,15 @@ df
| Jet Black Pope | 4 | 2008 |
| Nothing Rhymes with Orange | 6 | 2008 |

(grouping)=

## Grouping

Sometimes it's necessary to create a new column to group together certain values of a column. Here are two ways to accomplish this:

<b>Method #1</b>: Write a function using if-else statement and apply it using a lambda function.

```
```python
# The function is called elected_year, and it operates on every row.
def elected_year(row):
# For each row, if Council_Member says 'Leslie Knope', then return 2012
@@ -252,7 +260,7 @@ council_population

<b>Method #2</b>: Loop over every value, fill in the new column value, then attach that new column.

```
```python
# Create a list to store the new column
sales_group = []

@@ -283,13 +291,15 @@ paunch_locations
| 6 | Pawnee | $6 | 2 | (x6, y6) | high |
| 7 | Indianapolis | $7 | | (x7, y7) | high |
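Besides the two methods above, `pd.cut` can bin a numeric column into groups in one call. A sketch, with invented sales figures standing in for `paunch_locations`:

```python
import pandas as pd

# Hypothetical sales figures (in millions of dollars).
locations = pd.DataFrame({'Store': [1, 2, 3, 4],
                          'Sales_millions': [2, 2.5, 5, 7]})

# Bin Sales_millions: anything up to 3 is 'low', above 3 is 'high'.
locations['sales_group'] = pd.cut(
    locations['Sales_millions'],
    bins=[0, 3, float('inf')],
    labels=['low', 'high'],
)
```

For a fixed set of numeric thresholds, `pd.cut` replaces both the if-else function and the explicit loop.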

(aggregating)=

## Aggregating

One of the most common forms of summary statistics is aggregating by groups. In Excel, it's called a pivot table. In ArcGIS, it's doing a dissolve and calculating summary statistics. There are two ways to do it in Python: `groupby` with `agg`, or `pivot_table`.

To answer the question of how many Paunch Burger locations there are per Council District and the sales generated per resident,

```
```python
# Method #1: groupby and agg
pivot = (merge2.groupby(['CD'])
.agg({'Sales_millions': 'sum',
@@ -322,13 +332,15 @@ pivot = merge2.pivot_table(
| 2 | $8.5 | 2 | Jeremy Jamm | 2,000 |
| 3 | $2.5 | 1 | Douglass Howser | 2,250 |
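The two aggregation methods should agree. A self-contained sketch (column names assumed from the tables above) showing `groupby`/`agg` and `pivot_table` side by side:

```python
import pandas as pd

# Hypothetical merged data, mirroring merge2's columns.
merge2 = pd.DataFrame({
    'CD': [1, 1, 2, 2, 3],
    'Sales_millions': [2, 4, 2.5, 6, 2.5],
    'Store': [1, 5, 2, 6, 3],
})

# Method 1: groupby and agg.
by_group = merge2.groupby('CD').agg(
    {'Sales_millions': 'sum', 'Store': 'count'})

# Method 2: pivot_table with the same per-column aggregations.
by_pivot = merge2.pivot_table(
    index='CD',
    values=['Sales_millions', 'Store'],
    aggfunc={'Sales_millions': 'sum', 'Store': 'count'},
)
```

Both produce one row per `CD` with total sales and a store count, so the choice mostly comes down to readability.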

(export-aggregated-output)=

## Export Aggregated Output

Python can do most of the heavy lifting for data cleaning, transformations, and general wrangling. But, for charts or tables, it might be preferable to finish in Excel so that visualizations conform to the corporate style guide.

Dataframes can be exported into Excel and written into multiple sheets.

```
```python
import xlsxwriter

# initiate a writer
@@ -345,7 +357,7 @@ writer.save()

Geodataframes can be exported as a shapefile or GeoJSON to visualize in ArcGIS/QGIS.

```
```python
gdf.to_file(driver = 'ESRI Shapefile', filename = '../folder/my_shapefile.shp' )

gdf.to_file(driver = 'GeoJSON', filename = '../folder/my_geojson.geojson')
20 changes: 13 additions & 7 deletions docs/analytics_new_analysts/02-data-analysis-intermediate.md
@@ -10,12 +10,14 @@ After polishing off the [intro tutorial](pandas-intro), you're ready to devour s

## Getting Started

```
```python
import numpy as np
import pandas as pd
import geopandas as gpd
```

(create-a-new-column-using-a-dictionary-to-map-the-values)=

### Create a New Column Using a Dictionary to Map the Values

Sometimes, you want to create a new column by converting one set of values into a different set of values. We could write a function or we could use the map function to add a new column. For our `df`, we want a new column that shows the state.
@@ -33,7 +35,7 @@

[Quick refresher on functions](pandas-intro)

```
```python
# Create a function called state_abbrev.
def state_abbrev(row):
# The find function returns the index of where 'Indiana' is found in
@@ -56,7 +58,7 @@ df['State'] = df.apply(state_abbrev, axis = 1)

But, writing a function could take up a lot of space, especially with all the if-elif-else statements. Alternatively, a dictionary would also work. We could use a dictionary and map the four different city-state values into the state abbreviation.

```
```python
state_abbrev1 = {'Eagleton, Indiana': 'IN', 'South Carolina': 'SC',
'Michigan': 'MI', 'Partridge, Minnesota': 'MN'}

@@ -65,7 +67,7 @@ df['State'] = df.Birthplace.map(state_abbrev1)

But, if we wanted to avoid writing out all the possible combinations, we would first extract the *state* portion of the city-state text. Then we could map the state's full name with its abbreviation.

```
```python
# The split function splits at the comma and expand the columns.
# Everything is stored in a new df called 'fullname'.
fullname = df['Birthplace'].str.split(",", expand = True)
@@ -101,11 +103,13 @@ All 3 methods would give us this `df`:
| Ann Perkins | Michigan | MI |
| Ben Wyatt | Partridge, Minnesota | MN |
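For completeness, here is a runnable sketch of the split-then-map approach, with a couple of invented rows standing in for the full `df`:

```python
import pandas as pd

# Hypothetical birthplace data from the example above.
df = pd.DataFrame({
    'Name': ['Tom Haverford', 'Ann Perkins'],
    'Birthplace': ['South Carolina', 'Eagleton, Indiana'],
})

# Split at the comma into separate columns; rows without a comma
# get None in the second column.
state = df['Birthplace'].str.split(',', expand=True)

# The state is the last non-null piece; forward-fill across the row,
# take the last column, and strip stray spaces.
df['State_full'] = state.ffill(axis=1).iloc[:, -1].str.strip()

# Map the full state name to its abbreviation.
abbrev = {'South Carolina': 'SC', 'Indiana': 'IN'}
df['State'] = df['State_full'].map(abbrev)
```

The payoff is that the `abbrev` dictionary only needs one entry per state, not one per city-state combination.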

(loop-over-columns-with-a-dictionary)=

### Loop over Columns with a Dictionary

If there are operations or data transformations that need to be performed on multiple columns, the best way to do that is with a loop.

```
```python
columns = ['colA', 'colB', 'colC']

for c in columns:
@@ -115,6 +119,8 @@
df[c] = df[c] * 0.5
```
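When different columns need different transformations, a dictionary can drive the same loop (a sketch; the column names and factors are invented):

```python
import pandas as pd

df = pd.DataFrame({'colA': [10, 20], 'colB': [1, 2], 'colC': [100, 200]})

# Map each column to the factor it should be scaled by.
scale = {'colA': 0.5, 'colB': 2.0, 'colC': 0.01}

for col, factor in scale.items():
    df[col] = df[col] * factor
```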

(loop-over-dataframes-with-a-dictionary)=

### Loop over Dataframes with a Dictionary

It's easier and more efficient to use a loop to do the same operations over the different dataframes (df). Here, we want to find the number of Pawnee businesses and Tom Haverford businesses located in each Council District.
@@ -138,7 +144,7 @@ This type of question is perfect for a loop. Each df will be spatially joined to
| Entertainment 720 | x2 | y2 | 1 | Point(x2, y2) |
| Rent-A-Swag | x3 | y3 | 4 | Point(x3, y3) |

```
```python
# Save our existing dfs into a dictionary. The business df is named
# 'pawnee'; the tom df is named 'tom'.
dfs = {'pawnee': business, 'tom': tom}
@@ -166,7 +172,7 @@ for key, value in dfs.items():

Now, our `summary_dfs` dictionary contains 2 items, which are the 2 dataframes with everything aggregated.

```
```python
# To view the contents of this dictionary
for key, value in summary_dfs.items():
display(key)
