Upgrade jupyter-book to v1.0.0 (#3265)
SorenSpicknall authored Feb 7, 2024
1 parent 472d9ea commit 4a092b4
Showing 24 changed files with 791 additions and 385 deletions.
46 changes: 29 additions & 17 deletions docs/analytics_new_analysts/01-data-analysis-intro.md
@@ -4,28 +4,30 @@

Below are Python tutorials covering the basics of data cleaning and wrangling. [Chris Albon's guide](https://chrisalbon.com/#python) is particularly helpful. Rather than reinventing the wheel, this tutorial instead highlights specific methods and operations that might make your life easier as a data analyst.

- [Import and export data in Python](#import-and-export-data-in-python)
- [Merge tabular and geospatial data](#merge-tabular-and-geospatial-data)
- [Import and export data in Python](#data-analysis-import-and-export-data-in-python)
- [Merge tabular and geospatial data](#data-analysis-merge-tabular-and-geospatial-data)
- [Functions](#functions)
- [Grouping](#grouping)
- [Aggregating](#aggregating)
- [Export aggregated output](#export-aggregated-output)

## Getting Started

```
```python
import numpy as np
import pandas as pd
import geopandas as gpd
```

## Import and Export Data in Python
(data-analysis-import-and-export-data-in-python)=

## Import and Export Data in Python for Data Analysis

### **Local files**

We import a CSV file `my_csv.csv` and an Excel spreadsheet `my_excel.xlsx` as dataframes.

```
```python
df = pd.read_csv('./folder/my_csv.csv')

df = pd.read_excel('./folder/my_excel.xlsx', sheet_name = 'Sheet1')
@@ -35,7 +37,7 @@ df = pd.read_excel('./folder/my_excel.xlsx', sheet_name = 'Sheet1')

The data we use outside of the warehouse can be stored in GCS buckets.

```
```python
# Read from GCS
df = pd.read_csv('gs://calitp-analytics-data/data-analyses/bucket-name/df_csv.csv')

@@ -45,7 +47,9 @@ df.to_csv('gs://calitp-analytics-data/data-analyses/bucket-name/df_csv.csv')

Refer to the [Data Management best practices](data-management-page) and [Basics of Working with Geospatial Data](geo-intro) for additional information on importing various file types.

## Merge Tabular and Geospatial Data
(data-analysis-merge-tabular-and-geospatial-data)=

## Merge Tabular and Geospatial Data for Data Analysis

Merging data from multiple sources creates one large dataframe (df) on which to perform the analysis. Let's say there are 3 sources of data that need to be merged:

@@ -81,7 +85,7 @@ Dataframe #3: `council_boundaries` (geospatial)

First, merge `paunch_locations` with `council_population` using the `CD` column, which they have in common.

```
```python
merge1 = pd.merge(
paunch_locations,
council_population,
@@ -96,7 +100,7 @@ merge1 = pd.merge(

Next, merge `merge1` and `council_boundaries`. Columns don't have to have the same names to be matched on, as long as they hold the same values.

```
```python
merge2 = pd.merge(
merge1,
council_boundaries,
@@ -123,6 +127,8 @@ Here are some things to know about `merge2`:
| 5 | Pawnee | $4 | 1 | (x5, y5) | Leslie Knope | 1,500 | polygon |
| 6 | Pawnee | $6 | 2 | (x6, y6) | Jeremy Jamm | 2,000 | polygon |
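One way to sanity-check a merge like the ones above is pandas' `indicator` argument, which flags rows that didn't find a match. A minimal sketch, using made-up stand-ins for the Paunch Burger data:

```python
import pandas as pd

# Hypothetical stand-ins for paunch_locations and council_population.
locations = pd.DataFrame({
    'Store': [1, 2, 3],
    'CD': [1, 2, 7],  # CD 7 has no matching council district
})
population = pd.DataFrame({
    'CD': [1, 2, 3],
    'Population': [1500, 2000, 2250],
})

# indicator=True adds a _merge column flagging unmatched rows.
checked = pd.merge(locations, population, on='CD', how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
```

Filtering on `_merge` makes it easy to spot rows that silently dropped out of (or failed to match in) a merge.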

(functions)=

## Functions

A function is a set of instructions to *do something*. It can be as simple as changing values in a column or as complicated as a series of steps to clean, group, aggregate, and plot the data.
@@ -142,7 +148,7 @@ Lambda functions are quick and dirty. You don't even have to name the function!

### **If-Else Statements**

```
```python
# Create column called duration. If Songs > 10, duration is 'long'.
# Otherwise, duration is 'short'.
df['duration'] = df.apply(lambda row: 'long' if row.Songs > 10
@@ -174,7 +180,7 @@ df
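The same if-else labeling can also be done without `apply`: `np.where` is a vectorized alternative. A sketch, with an invented `Songs` column mirroring the example above:

```python
import numpy as np
import pandas as pd

# Hypothetical df with a Songs column, mirroring the lambda example.
df = pd.DataFrame({'Band': ['Mouse Rat', 'Jet Black Pope'], 'Songs': [12, 4]})

# np.where(condition, value_if_true, value_if_false) works on the whole
# column at once, avoiding the row-by-row apply.
df['duration'] = np.where(df.Songs > 10, 'long', 'short')
```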

### **Other Lambda Functions**

```
```python
# Split the band name at the spaces
# [1] means we want to extract the second word
# [0:2] means we want to start at the first character
@@ -197,7 +203,7 @@ You should use a full function when a function is too complicated to be a lambda

`df.apply` is one common usage of a function.

```
```python
def years_active(row):
if row.Band == 'Mouse Rat':
return '2009-2014'
@@ -218,13 +224,15 @@ df
| Jet Black Pope | 4 | 2008 |
| Nothing Rhymes with Orange | 6 | 2008 |

(grouping)=

## Grouping

Sometimes it's necessary to create a new column to group together certain values of a column. Here are two ways to accomplish this:

<b>Method #1</b>: Write a function using if-else statement and apply it using a lambda function.

```
```python
# The function is called elected_year, and it operates on every row.
def elected_year(row):
# For each row, if Council_Member says 'Leslie Knope', then return 2012
@@ -252,7 +260,7 @@ council_population

<b>Method #2</b>: Loop over every value, fill in the new column value, then attach that new column.

```
```python
# Create a list to store the new column
sales_group = []

@@ -283,13 +291,15 @@ paunch_locations
| 6 | Pawnee | $6 | 2 | (x6, y6) | high |
| 7 | Indianapolis | $7 | | (x7, y7) | high |
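Besides the two methods above, `pd.cut` can bin a numeric column into groups in one call. A sketch, with invented sales figures standing in for `paunch_locations`:

```python
import pandas as pd

# Hypothetical sales figures (in millions of dollars).
locations = pd.DataFrame({'Store': [1, 2, 3, 4],
                          'Sales_millions': [2, 2.5, 5, 7]})

# Bin Sales_millions: anything up to 3 is 'low', above 3 is 'high'.
locations['sales_group'] = pd.cut(
    locations['Sales_millions'],
    bins=[0, 3, float('inf')],
    labels=['low', 'high'],
)
```

For a fixed set of numeric thresholds, `pd.cut` replaces both the if-else function and the explicit loop.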

(aggregating)=

## Aggregating

One of the most common forms of summary statistics is aggregating by groups. In Excel, it's called a pivot table. In ArcGIS, it's doing a dissolve and calculating summary statistics. There are two ways to do it in Python: `groupby` with `agg`, or `pivot_table`.

To answer the question of how many Paunch Burger locations there are per Council District and the sales generated per resident,

```
```python
# Method #1: groupby and agg
pivot = (merge2.groupby(['CD'])
.agg({'Sales_millions': 'sum',
@@ -322,13 +332,15 @@ pivot = merge2.pivot_table(
| 2 | $8.5 | 2 | Jeremy Jamm | 2,000 |
| 3 | $2.5 | 1 | Douglass Howser | 2,250 |
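The two aggregation methods should agree. A self-contained sketch (column names assumed from the tables above) showing `groupby`/`agg` and `pivot_table` side by side:

```python
import pandas as pd

# Hypothetical merged data, mirroring merge2's columns.
merge2 = pd.DataFrame({
    'CD': [1, 1, 2, 2, 3],
    'Sales_millions': [2, 4, 2.5, 6, 2.5],
    'Store': [1, 5, 2, 6, 3],
})

# Method 1: groupby and agg.
by_group = merge2.groupby('CD').agg(
    {'Sales_millions': 'sum', 'Store': 'count'})

# Method 2: pivot_table with the same per-column aggregations.
by_pivot = merge2.pivot_table(
    index='CD',
    values=['Sales_millions', 'Store'],
    aggfunc={'Sales_millions': 'sum', 'Store': 'count'},
)
```

Both produce one row per `CD` with total sales and a store count, so the choice mostly comes down to readability.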

(export-aggregated-output)=

## Export Aggregated Output

Python can do most of the heavy lifting for data cleaning, transformations, and general wrangling. But, for charts or tables, it might be preferable to finish in Excel so that visualizations conform to the corporate style guide.

Dataframes can be exported into Excel and written into multiple sheets.

```
```python
import xlsxwriter

# initiate a writer
@@ -345,7 +357,7 @@ writer.save()

Geodataframes can be exported as a shapefile or GeoJSON to visualize in ArcGIS/QGIS.

```
```python
gdf.to_file(driver = 'ESRI Shapefile', filename = '../folder/my_shapefile.shp' )

gdf.to_file(driver = 'GeoJSON', filename = '../folder/my_geojson.geojson')
20 changes: 13 additions & 7 deletions docs/analytics_new_analysts/02-data-analysis-intermediate.md
@@ -10,12 +10,14 @@ After polishing off the [intro tutorial](pandas-intro), you're ready to devour s

## Getting Started

```
```python
import numpy as np
import pandas as pd
import geopandas as gpd
```

(create-a-new-column-using-a-dictionary-to-map-the-values)=

### Create a New Column Using a Dictionary to Map the Values

Sometimes, you want to create a new column by converting one set of values into a different set of values. We could write a function or we could use the map function to add a new column. For our `df`, we want a new column that shows the state.
@@ -33,7 +35,7 @@

[Quick refresher on functions](pandas-intro)

```
```python
# Create a function called state_abbrev.
def state_abbrev(row):
# The find function returns the index of where 'Indiana' is found in
@@ -56,7 +58,7 @@ df['State'] = df.apply(state_abbrev, axis = 1)

But, writing a function could take up a lot of space, especially with all the if-elif-else statements. Alternatively, a dictionary would also work. We could use a dictionary and map the four different city-state values into the state abbreviation.

```
```python
state_abbrev1 = {'Eagleton, Indiana': 'IN', 'South Carolina': 'SC',
'Michigan': 'MI', 'Partridge, Minnesota': 'MN'}

@@ -65,7 +67,7 @@ df['State'] = df.Birthplace.map(state_abbrev1)

But, if we wanted to avoid writing out all the possible combinations, we would first extract the *state* portion of the city-state text. Then we could map the state's full name with its abbreviation.

```
```python
# The split function splits at the comma and expand the columns.
# Everything is stored in a new df called 'fullname'.
fullname = df['Birthplace'].str.split(",", expand = True)
@@ -101,11 +103,13 @@ All 3 methods would give us this `df`:
| Ann Perkins | Michigan | MI |
| Ben Wyatt | Partridge, Minnesota | MN |
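For completeness, here is a runnable sketch of the split-then-map approach, with a couple of invented rows standing in for the full `df`:

```python
import pandas as pd

# Hypothetical birthplace data from the example above.
df = pd.DataFrame({
    'Name': ['Tom Haverford', 'Ann Perkins'],
    'Birthplace': ['South Carolina', 'Eagleton, Indiana'],
})

# Split at the comma into separate columns; rows without a comma
# get None in the second column.
state = df['Birthplace'].str.split(',', expand=True)

# The state is the last non-null piece; forward-fill across the row,
# take the last column, and strip stray spaces.
df['State_full'] = state.ffill(axis=1).iloc[:, -1].str.strip()

# Map the full state name to its abbreviation.
abbrev = {'South Carolina': 'SC', 'Indiana': 'IN'}
df['State'] = df['State_full'].map(abbrev)
```

The payoff is that the `abbrev` dictionary only needs one entry per state, not one per city-state combination.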

(loop-over-columns-with-a-dictionary)=

### Loop over Columns with a Dictionary

If there are operations or data transformations that need to be performed on multiple columns, the best way to do that is with a loop.

```
```python
columns = ['colA', 'colB', 'colC']

for c in columns:
@@ -115,6 +119,8 @@
df[c] = df[c] * 0.5
```
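When different columns need different transformations, a dictionary can drive the same loop (a sketch; the column names and factors are invented):

```python
import pandas as pd

df = pd.DataFrame({'colA': [10, 20], 'colB': [1, 2], 'colC': [100, 200]})

# Map each column to the factor it should be scaled by.
scale = {'colA': 0.5, 'colB': 2.0, 'colC': 0.01}

for col, factor in scale.items():
    df[col] = df[col] * factor
```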

(loop-over-dataframes-with-a-dictionary)=

### Loop over Dataframes with a Dictionary

It's easier and more efficient to use a loop to do the same operations over the different dataframes (df). Here, we want to find the number of Pawnee businesses and Tom Haverford businesses located in each Council District.
@@ -138,7 +144,7 @@ This type of question is perfect for a loop. Each df will be spatially joined to
| Entertainment 720 | x2 | y2 | 1 | Point(x2, y2) |
| Rent-A-Swag | x3 | y3 | 4 | Point(x3, y3) |

```
```python
# Save our existing dfs into a dictionary. The business df is named
# 'pawnee'; the tom df is named 'tom'.
dfs = {'pawnee': business, 'tom': tom}
@@ -166,7 +172,7 @@ for key, value in dfs.items():

Now, our `summary_dfs` dictionary contains 2 items, which are the 2 dataframes with everything aggregated.

```
```python
# To view the contents of this dictionary
for key, value in summary_dfs.items():
display(key)
