From c0a6661aa8bb0d46ca0ea187453d7827f0a339af Mon Sep 17 00:00:00 2001 From: amandaha8 Date: Wed, 27 Sep 2023 16:28:08 +0000 Subject: [PATCH 1/7] updated shared_utils and switch instead of checkout --- docs/analytics_onboarding/overview.md | 2 +- docs/analytics_tools/python_libraries.md | 6 +++--- docs/analytics_tools/saving_code.md | 2 +- docs/analytics_tools/storing_data.md | 12 ++++++------ 4 files changed, 11 insertions(+), 11 deletions(-) diff --git a/docs/analytics_onboarding/overview.md b/docs/analytics_onboarding/overview.md index 6cfcebbeac..617b2464d6 100644 --- a/docs/analytics_onboarding/overview.md +++ b/docs/analytics_onboarding/overview.md @@ -34,7 +34,7 @@ - [ ] **calitp-data-analysis** - Cal-ITP's internal Python library for analysis | ([Docs](calitp-data-analysis)) - [ ] **siuba** - Recommended data analysis library | ([Docs](siuba)) -- [ ] [**shared_utils**](https://github.com/cal-itp/data-analyses/tree/main/_shared_utils) - A shared utilities library for the analytics team | ([Docs](shared-utils)) +- [ ] [**shared_utils 1**](https://github.com/cal-itp/data-analyses/tree/main/_shared_utils) and [**shared_utils 2**](https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis)- A shared utilities library for the analytics team | ([Docs](shared-utils)) **Caltrans Employee Resources:** diff --git a/docs/analytics_tools/python_libraries.md b/docs/analytics_tools/python_libraries.md index c1c1d47f0c..3bf5a2d332 100644 --- a/docs/analytics_tools/python_libraries.md +++ b/docs/analytics_tools/python_libraries.md @@ -36,7 +36,7 @@ The following libraries are available and recommended for use by Cal-ITP data an ## shared utils -A set of shared utility functions can also be installed, similarly to any Python library. The [shared_utils](https://github.com/cal-itp/data-analyses/shared_utils) are stored here. Generalized functions for analysis are added as collaborative work evolves so we aren't constantly reinventing the wheel. +A set of shared utility functions can also be installed, similarly to any Python library. The shared_utils are stored in two places: [here in `data-analyses`](https://github.com/cal-itp/data-analyses/shared_utils), which houses functions that are more likely to be updated and [here in `data-infra`](https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis), which contains functions that are rarely changed. Generalized functions for analysis are added as collaborative work evolves so we aren't constantly reinventing the wheel. ### In terminal @@ -53,10 +53,10 @@ A set of shared utility functions can also be installed, similarly to any Python ### In notebook ```python -import shared_utils +from calitp_data_analysis import geography_utils #example of using shared_utils -shared_utils.geography_utils.WGS84 +geography_utils.WGS84 ``` See [data-analyses/example_reports](https://github.com/cal-itp/data-analyses/tree/main/example_report) for examples on how to use `shared_utils` for general functions, charts, and maps. diff --git a/docs/analytics_tools/saving_code.md b/docs/analytics_tools/saving_code.md index 3483986546..61889eeeb6 100644 --- a/docs/analytics_tools/saving_code.md +++ b/docs/analytics_tools/saving_code.md @@ -109,7 +109,7 @@ If you discover merge conflicts and they are within a single notebook that only These are helpful Git commands an analyst might need, listed in no particular order. -- During collaboration, if another analyst already created a remote branch, and you want to work off of the same branch: `git fetch origin`, `git checkout -b our-project-branch origin/our-project-branch` +- During collaboration, if another analyst already created a remote branch, and you want to work off of the same branch: `git fetch origin`, `git switch -c our-project-branch origin/our-project-branch` - To discard the changes you made to a file, `git checkout my-notebook.ipynb`, and you can revert back to the version that was last committed. - Temporarily stash changes, move to a different branch, and come back and retain those changes: `git stash`, `git switch some-other-branch`, do stuff on the other branch, `git switch original-branch`, `git stash pop` - Rename files and retain the version history associated (`mv` is move, and renaming is moving the file path): `git mv old-notebook.ipynb new-notebook.ipynb` diff --git a/docs/analytics_tools/storing_data.md b/docs/analytics_tools/storing_data.md index 09178f8524..873fcdf419 100644 --- a/docs/analytics_tools/storing_data.md +++ b/docs/analytics_tools/storing_data.md @@ -112,12 +112,12 @@ gdf.to_parquet("./my-geoparqet.parquet") fs.put("./my-geoparquet.parquet", f"{GCS_FOLDER}my-geoparquet.parquet") ``` -Or, use the `shared_utils` package. +Or, use the `shared_utils` package that lives in [data-infra](https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis) ```python -import shared_utils +from calitp_data_analysis import utils -shared_utils.utils.geoparquet_gcs_export( +utils.geoparquet_gcs_export( gdf, GCS_FOLDER, "my-geoparquet" @@ -135,18 +135,18 @@ Refer to the [data catalogs doc](catalogue-cloud-storage) to list a GeoJSON, and Use the `shared_utils` package to read in or export geojsons. ```python -import shared_utils +from calitp_data_analysis import utils GCS_FOLDER = "gs://calitp-analytics-data/data-analyses/task-subfolder/" -gdf = shared_utils.utils.read_geojson( +gdf = utils.read_geojson( GCS_FOLDER, "my-geojson.geojson", geojson_type = "geojson", save_locally = True ) -shared_utils.utils.geojson_gcs_export( +utils.geojson_gcs_export( gdf, GCS_FOLDER, "my-geojson.geojson", From d4f2c336fd3c0c7d99a3a92666bf3c8953cba894 Mon Sep 17 00:00:00 2001 From: amandaha8 Date: Thu, 28 Sep 2023 15:41:07 +0000 Subject: [PATCH 2/7] double checked all checkout and shared_utils were changed --- docs/analytics_onboarding/overview.md | 2 +- docs/analytics_tools/python_libraries.md | 2 +- docs/analytics_tools/storing_data.md | 4 ++-- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/analytics_onboarding/overview.md b/docs/analytics_onboarding/overview.md index 617b2464d6..767319c1c0 100644 --- a/docs/analytics_onboarding/overview.md +++ b/docs/analytics_onboarding/overview.md @@ -34,7 +34,7 @@ - [ ] **calitp-data-analysis** - Cal-ITP's internal Python library for analysis | ([Docs](calitp-data-analysis)) - [ ] **siuba** - Recommended data analysis library | ([Docs](siuba)) -- [ ] [**shared_utils 1**](https://github.com/cal-itp/data-analyses/tree/main/_shared_utils) and [**shared_utils 2**](https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis)- A shared utilities library for the analytics team | ([Docs](shared-utils)) +- [ ] [**shared_utils**](https://github.com/cal-itp/data-analyses/tree/main/_shared_utils) and [**here**](https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis) - A shared utilities library for the analytics team | ([Docs](shared-utils)) **Caltrans Employee Resources:** diff --git a/docs/analytics_tools/python_libraries.md b/docs/analytics_tools/python_libraries.md index 3bf5a2d332..8c0e5ccd59 100644 --- a/docs/analytics_tools/python_libraries.md +++ b/docs/analytics_tools/python_libraries.md @@ -36,7 +36,7 @@ The following libraries are available and recommended for use by Cal-ITP data an ## shared utils -A set of shared utility functions can also be installed, similarly to any Python library. The shared_utils are stored in two places: [here in `data-analyses`](https://github.com/cal-itp/data-analyses/shared_utils), which houses functions that are more likely to be updated and [here in `data-infra`](https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis), which contains functions that are rarely changed. Generalized functions for analysis are added as collaborative work evolves so we aren't constantly reinventing the wheel. +A set of shared utility functions can also be installed, similarly to any Python library. The shared_utils are stored in two places: \[here\] in `data-analyses` (https://github.com/cal-itp/data-analyses/shared_utils), which houses functions that are more likely to be updated. Shared functions that are updated less frequently are housed \[here\] in the `calitp_data_analysis` pacakage in `data-infra`(https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis). Generalized functions for analysis are added as collaborative work evolves so we aren't constantly reinventing the wheel. ### In terminal diff --git a/docs/analytics_tools/storing_data.md b/docs/analytics_tools/storing_data.md index 873fcdf419..ba1e59406a 100644 --- a/docs/analytics_tools/storing_data.md +++ b/docs/analytics_tools/storing_data.md @@ -112,7 +112,7 @@ gdf.to_parquet("./my-geoparqet.parquet") fs.put("./my-geoparquet.parquet", f"{GCS_FOLDER}my-geoparquet.parquet") ``` -Or, use the `shared_utils` package that lives in [data-infra](https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis) +Or, use the `calitp_data_analysis` package that lives in [data-infra](https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis) ```python from calitp_data_analysis import utils @@ -132,7 +132,7 @@ Refer to the [data catalogs doc](catalogue-cloud-storage) to list a zipped shape Refer to the [data catalogs doc](catalogue-cloud-storage) to list a GeoJSON, and read in the GeoJSON with the `intake` method. GeoJSONs saved in GCS cannot be read in directly using `geopandas`. -Use the `shared_utils` package to read in or export geojsons. +Use the `calitp_data_analysis` package to read in or export geojsons. ```python from calitp_data_analysis import utils From 0e7d15f7e1cf67fe7efbd80b651a25219dca16a1 Mon Sep 17 00:00:00 2001 From: amandaha8 Date: Thu, 28 Sep 2023 18:07:58 +0000 Subject: [PATCH 3/7] fixed typo --- docs/analytics_tools/python_libraries.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/analytics_tools/python_libraries.md b/docs/analytics_tools/python_libraries.md index 8c0e5ccd59..2fae2c4824 100644 --- a/docs/analytics_tools/python_libraries.md +++ b/docs/analytics_tools/python_libraries.md @@ -36,7 +36,7 @@ The following libraries are available and recommended for use by Cal-ITP data an ## shared utils -A set of shared utility functions can also be installed, similarly to any Python library. The shared_utils are stored in two places: \[here\] in `data-analyses` (https://github.com/cal-itp/data-analyses/shared_utils), which houses functions that are more likely to be updated. Shared functions that are updated less frequently are housed \[here\] in the `calitp_data_analysis` pacakage in `data-infra`(https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis). Generalized functions for analysis are added as collaborative work evolves so we aren't constantly reinventing the wheel. +A set of shared utility functions can also be installed, similarly to any Python library. The shared_utils are stored in two places: \[here\] in `data-analyses` (https://github.com/cal-itp/data-analyses/shared_utils), which houses functions that are more likely to be updated. Shared functions that are updated less frequently are housed \[here\] in the `calitp_data_analysis` package in `data-infra`(https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis). Generalized functions for analysis are added as collaborative work evolves so we aren't constantly reinventing the wheel. ### In terminal From b1cf968437aecd0395d49708afc51f0bad6c90e1 Mon Sep 17 00:00:00 2001 From: amandaha8 Date: Fri, 29 Sep 2023 17:02:41 +0000 Subject: [PATCH 4/7] rm colons, added pygis --- docs/analytics_new_analysts/overview.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/analytics_new_analysts/overview.md b/docs/analytics_new_analysts/overview.md index 778cc1197f..b38f32e488 100644 --- a/docs/analytics_new_analysts/overview.md +++ b/docs/analytics_new_analysts/overview.md @@ -4,7 +4,7 @@ This section is geared towards data analysts who are new to Python. The following tutorials highlight the most relevant Python skills used at Cal ITP. Use them to guide you through completing [practice exercises #1-9](https://github.com/cal-itp/data-analyses/tree/main/starter_kit). -## Content: +## Content - [Data Analysis: Introduction](pandas-intro) - [Data Analysis: Intermediate](pandas-intermediate) @@ -15,7 +15,7 @@ This section is geared towards data analysts who are new to Python. The followin - [Working with Geospatial Data: Intermediate](geo-intermediate) - [Working with Geospatial Data: Advanced](geo-advanced) -## Additional Resources: +## Additional Resources - If you are new to Python, take a look at [all the Python tutorials](https://www.linkedin.com/learning/search?keywords=python&u=36029164) available through Caltrans. There are many introductory Python courses [such as this one.](https://www.linkedin.com/learning/python-essential-training-18764650/getting-started-with-python?autoplay=true&u=36029164) - [Joris van den Bossche's Geopandas Tutorial](https://github.com/jorisvandenbossche/geopandas-tutorial) @@ -25,8 +25,9 @@ This section is geared towards data analysts who are new to Python. The followin - [Python Courses, compiled by our team](https://docs.google.com/spreadsheets/d/1Omow8F0SUiMx1jyG7GpbwnnJ5yWqlLeMH7SMtKxwG80/edit?usp=sharing) - [Why Dask?](https://docs.dask.org/en/stable/why.html) - [10 Minutes to Dask](https://docs.dask.org/en/stable/10-minutes-to-dask.html) +- [PyGIS](https://pygis.io/docs/a_intro.html) -### Books: +### Books - [The Performance Stat Potential](https://www.brookings.edu/book/the-performancestat-potential/) - [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) From d4d240cb4fac376e47ccae515f5c1c50483b6f27 Mon Sep 17 00:00:00 2001 From: tiffanychu90 Date: Fri, 29 Sep 2023 17:16:24 +0000 Subject: [PATCH 5/7] (docs): clearer code blocks, update analytics info --- .../01-data-analysis-intro.md | 42 ++++++++++++++----- .../02-data-analysis-intermediate.md | 2 + .../05-spatial-analysis-basics.md | 14 ++++--- .../06-spatial-analysis-intro.md | 24 ++++++++--- .../07-spatial-analysis-intermediate.md | 5 ++- .../08-spatial-analysis-advanced.md | 12 +++--- docs/analytics_onboarding/overview.md | 2 +- docs/analytics_tools/jupyterhub.md | 2 + docs/analytics_tools/python_libraries.md | 5 +-- docs/analytics_tools/storing_data.md | 2 +- docs/analytics_welcome/how_we_work.md | 13 +++--- 11 files changed, 81 insertions(+), 42 deletions(-) diff --git a/docs/analytics_new_analysts/01-data-analysis-intro.md b/docs/analytics_new_analysts/01-data-analysis-intro.md index 301d1c97c9..9d5a1613bb 100644 --- a/docs/analytics_new_analysts/01-data-analysis-intro.md +++ b/docs/analytics_new_analysts/01-data-analysis-intro.md @@ -82,8 +82,13 @@ Dataframe #3: `council_boundaries` (geospatial) First, merge `paunch_locations` with `council_population` using the `CD` column, which they have in common. ``` -merge1 = pd.merge(paunch_locations, council_population, on = 'CD', - how = 'inner', validate = 'm:1') +merge1 = pd.merge( + paunch_locations, + council_population, + on = 'CD', + how = 'inner', + validate = 'm:1' +) # m:1 many-to-1 merge means that CD appears multiple times in # paunch_locations, but only once in council_population. @@ -92,8 +97,14 @@ merge1 = pd.merge(paunch_locations, council_population, on = 'CD', Next, merge `merge1` and `council_boundaries`. Columns don't have to have the same names to be matched on, as long as they hold the same values. ``` -merge2 = pd.merge(merge1, council_boundaries, left_on = 'CD', - right_on = 'District', how = 'left', validate = 'm:1') +merge2 = pd.merge( + merge1, + council_boundaries, + left_on = 'CD', + right_on = 'District', + how = 'left', + validate = 'm:1' +) ``` Here are some things to know about `merge2`: @@ -247,7 +258,7 @@ sales_group = [] for row in paunch_locations['Sales_millions']: # If sales are more than $3M, but less than $5M, tag as moderate. - if (row >= 3) & (row <= 5) : + if (row >= 3) and (row <= 5): sales_group.append('moderate') # If sales are more than $5M, tag as high. elif row >=5: @@ -256,6 +267,7 @@ for row in paunch_locations['Sales_millions']: else: sales_group.append('low') + paunch_locations['sales_group'] = sales_group paunch_locations @@ -279,14 +291,22 @@ To answer the question of how many Paunch Burger locations there are per Council ``` # Method #1: groupby and agg -pivot = merge2.groupby(['CD', 'Geometry_y']).agg({'Sales_millions': 'sum', - 'Store': 'count', 'Population': 'mean'}).reset_index() +pivot = (merge2.groupby(['CD', 'Geometry_y']) + .agg({'Sales_millions': 'sum', + 'Store': 'count', + 'Population': 'mean'} + ).reset_index() + ) # Method #2: pivot table -pivot = merge2.pivot_table(index= ['CD', 'Geometry_y'], - values = ['Sales_millions', 'Store', 'Population'], - aggfunc= {'Sales_millions': 'sum', 'Store': 'count', - 'Population': 'mean'}).reset_index() +pivot = merge2.pivot_table( + index= ['CD', 'Geometry_y'], + values = ['Sales_millions', 'Store', 'Population'], + aggfunc= { + 'Sales_millions': 'sum', + 'Store': 'count', + 'Population': 'mean'} + ).reset_index() # to only find one type of summary statistic, use aggfunc = 'sum' diff --git a/docs/analytics_new_analysts/02-data-analysis-intermediate.md b/docs/analytics_new_analysts/02-data-analysis-intermediate.md index 736113f44c..a5e1c0baeb 100644 --- a/docs/analytics_new_analysts/02-data-analysis-intermediate.md +++ b/docs/analytics_new_analysts/02-data-analysis-intermediate.md @@ -154,8 +154,10 @@ for key, value in dfs.items(): # Use f string to define a variable join_df (result of our spatial join) ## join_{key} would be join_pawnee or join_tom in the loop join_df = "join_{key}" + # Spatial join join_df = gpd.sjoin(value, council_district, how = 'inner', op = 'intersects') + # Calculate summary stats with groupby, agg, then save it into summary_dfs, # naming it 'pawnee' or 'tom'. summary_dfs[key] = join.groupby('ID').agg( diff --git a/docs/analytics_new_analysts/05-spatial-analysis-basics.md b/docs/analytics_new_analysts/05-spatial-analysis-basics.md index f268045d0d..ef292c0a81 100644 --- a/docs/analytics_new_analysts/05-spatial-analysis-basics.md +++ b/docs/analytics_new_analysts/05-spatial-analysis-basics.md @@ -42,15 +42,17 @@ gdf.to_file(driver = 'ESRI Shapefile', filename = '../folder/my_shapefile.shp' ) To read in our dataframe (df) and geodataframe (gdf) from GCS: ``` -df = pd.read_csv('gs://calitp-analytics-data/data-analyses/bucket-name/my-csv.csv') -gdf = gpd.read_file('gs://calitp-analytics-data/data-analyses/bucket-name/my-geojson.geojson') -gdf = gpd.read_parquet('gs://calitp-analytics-data/data-analyses/bucket-name/my-geoparquet.parquet', engine= 'auto') -gdf = gpd.read_file('gs://calitp-analytics-data/data-analyses/bucket-name/my-shapefile.zip') +GCS_BUCKET = 'gs://calitp-analytics-data/data-analyses/bucket_name/' + +df = pd.read_csv(f'{GCS_BUCKET}my-csv.csv') +gdf = gpd.read_file(f'{GCS_BUCKET}my-geojson.geojson') +gdf = gpd.read_parquet(f'{GCS_BUCKET}my-geoparquet.parquet') +gdf = gpd.read_file(f'{GCS_BUCKET}my-shapefile.zip') # Write a file to GCS -gdf.to_file('gs://calitp-analytics-data/data-analyses/bucket-name/my-geojson.geojson', driver='GeoJSON') +gdf.to_file(f'{GCS_BUCKET}my-geojson.geojson', driver='GeoJSON') -#Using shared utils +# Using calitp_data_analysis GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/" FILE_NAME = "test_geoparquet" utils.geoparquet_gcs_export(gdf, GCS_FILE_PATH, FILE_NAME) diff --git a/docs/analytics_new_analysts/06-spatial-analysis-intro.md b/docs/analytics_new_analysts/06-spatial-analysis-intro.md index 600abea2d3..79aafcb97f 100644 --- a/docs/analytics_new_analysts/06-spatial-analysis-intro.md +++ b/docs/analytics_new_analysts/06-spatial-analysis-intro.md @@ -115,16 +115,30 @@ The `join` gdf looks like this. We lost Stores 4 (Eagleton) and 7 (Indianapolis) We want to count the number of Paunch Burger locations and their total sales within each District. ``` -summary = join.pivot_table(index = ['District', 'Geometry_y], + +summary = join.pivot_table( + index = ['District'], values = ['Store', 'Sales_millions'], - aggfunc = {'Store': 'count', 'Sales_millions': 'sum'}).reset_index() + aggfunc = {'Store': 'count', 'Sales_millions': 'sum'} +).reset_index() OR -summary = join.groupby(['District', 'Geometry_y']).agg({'Store': 'count', - 'Sales_millions': 'sum'}).reset_index() +summary = (join.groupby(['District']) + .agg({ + 'Store': 'count', + 'Sales_millions': 'sum'} + ).reset_index() + ) + +# Make sure to merge in district geometry again +summary = pd.merge( + gdf, + summary, + on = 'District', + how = 'inner' +) -summary.rename(column = {'Geometry_y': 'Geometry'}, inplace = True) summary ``` diff --git a/docs/analytics_new_analysts/07-spatial-analysis-intermediate.md b/docs/analytics_new_analysts/07-spatial-analysis-intermediate.md index 33f390e7b8..b9569caaca 100644 --- a/docs/analytics_new_analysts/07-spatial-analysis-intermediate.md +++ b/docs/analytics_new_analysts/07-spatial-analysis-intermediate.md @@ -52,7 +52,8 @@ Then, create the `geometry` column. We use a lambda function and apply it to al df.rename(columns = {'X': 'longitude', 'Y':'latitude'}, inplace=True) # Create geometry column -gdf = gpd.points_from_xy(df.longitude, df.latitude, crs="EPSG:4326") +geom_col = gpd.points_from_xy(df.longitude, df.latitude, crs="EPSG:4326") +gdf = gpd.GeoDataFrame(df, geometry=geom_col, crs = "EPSG:4326") # Project to different CRS. Pawnee is in Indiana, so we'll use EPSG:2965. # In Southern California, use EPSG:2229. @@ -152,8 +153,10 @@ for key, value in boundaries.items(): # Define new variables using f string join_df = f"{key}_join" agg_df = f"{key}_summary" + # Spatial join, but don't save it into the results dictionary join_df = gpd.sjoin(df, value, how = 'inner', predicate = 'intersects') + # Aggregate and save results into results dictionary results[agg_df] = join_df.groupby('ID').agg( {'Business': 'count', 'Sales_millions': 'sum'}) diff --git a/docs/analytics_new_analysts/08-spatial-analysis-advanced.md b/docs/analytics_new_analysts/08-spatial-analysis-advanced.md index abeec1dbd8..f4eb6ae9a0 100644 --- a/docs/analytics_new_analysts/08-spatial-analysis-advanced.md +++ b/docs/analytics_new_analysts/08-spatial-analysis-advanced.md @@ -23,12 +23,12 @@ from geoalchemy2 import WKTElement There are six possible geometric shapes that are represented in geospatial data. [More description here.](http://postgis.net/workshops/postgis-intro/geometries.html#representing-real-world-objects) -- Point -- MultiPoint: collection of points -- LineString -- MultiLineString: collection of linestrings, which are disconnected from each other -- Polygon -- MultiPolygon: collection of polygons, which can be disconnected or overlapping from each other +- `Point` +- `MultiPoint`: collection of points +- `LineString` +- `MultiLineString`: collection of linestrings, which are disconnected from each other +- `Polygon` +- `MultiPolygon`: collection of polygons, which can be disconnected or overlapping from each other The ArcGIS equivalent of these are just points, lines, and polygons. diff --git a/docs/analytics_onboarding/overview.md b/docs/analytics_onboarding/overview.md index 767319c1c0..7c797b3b8e 100644 --- a/docs/analytics_onboarding/overview.md +++ b/docs/analytics_onboarding/overview.md @@ -27,7 +27,7 @@ - [ ] **[notebooks.calitp.org](https://notebooks.calitp.org/)** - JupyterHub cloud-based notebooks for querying Python, SQL, R | ([Docs](jupyterhub-intro)) - [ ] **[dashboards.calitp.org](https://dashboards.calitp.org/)** - Metabase business insights & dashboards | ([Docs](metabase)) - [ ] **[dbt-docs.calitp.org](https://dbt-docs.calitp.org/)** - Documentation for the Cal-ITP data warehouse. -- [ ] **[analysis.calitp.org](https://analysis.calitp.org/)** - The Cal-ITP analytics portfolio website. | (Docs WIP) +- [ ] **[analysis.calitp.org](https://analysis.calitp.org/)** - The Cal-ITP analytics portfolio website. - [ ] [**Google BigQuery**](https://console.cloud.google.com/bigquery) - Viewing the data warehouse and querying SQL **Python Libraries:** diff --git a/docs/analytics_tools/jupyterhub.md b/docs/analytics_tools/jupyterhub.md index 06f7d26169..87a087efc9 100644 --- a/docs/analytics_tools/jupyterhub.md +++ b/docs/analytics_tools/jupyterhub.md @@ -60,6 +60,8 @@ gcloud config set project cal-itp-data-infra gcloud auth application-default login ``` +If you are still not able to connect, make sure you have the suite of permissions associated with other analysts. + ### Increasing the Query Limit By default, there is a query limit set within the Jupyter Notebook. Most queries should be within that limit, and running into `DatabaseError: 500 Query exceeded limit for bytes billed` should be a red flag to investigate whether such a large query is needed for the analysis. To increase the query limit, add and execute the following in your notebook: diff --git a/docs/analytics_tools/python_libraries.md b/docs/analytics_tools/python_libraries.md index 2fae2c4824..57524d0cb8 100644 --- a/docs/analytics_tools/python_libraries.md +++ b/docs/analytics_tools/python_libraries.md @@ -36,7 +36,7 @@ The following libraries are available and recommended for use by Cal-ITP data an ## shared utils -A set of shared utility functions can also be installed, similarly to any Python library. The shared_utils are stored in two places: \[here\] in `data-analyses` (https://github.com/cal-itp/data-analyses/shared_utils), which houses functions that are more likely to be updated. Shared functions that are updated less frequently are housed \[here\] in the `calitp_data_analysis` package in `data-infra`(https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis). Generalized functions for analysis are added as collaborative work evolves so we aren't constantly reinventing the wheel. +A set of shared utility functions can also be installed, similarly to any Python library. The `shared_utils` are stored in two places: [here](https://github.com/cal-itp/data-analyses/shared_utils) in `data-analyses`, which houses functions that are more likely to be updated. Shared functions that are updated less frequently are housed [here](https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis) in the `calitp_data_analysis` package in `data-infra`. Generalized functions for analysis are added as collaborative work evolves so we aren't constantly reinventing the wheel. ### In terminal @@ -55,11 +55,10 @@ A set of shared utility functions can also be installed, similarly to any Python ```python from calitp_data_analysis import geography_utils -#example of using shared_utils geography_utils.WGS84 ``` -See [data-analyses/example_reports](https://github.com/cal-itp/data-analyses/tree/main/example_report) for examples on how to use `shared_utils` for general functions, charts, and maps. +See [data-analyses/starter_kit](https://github.com/cal-itp/data-analyses/tree/main/starter_kit) for examples on how to use `shared_utils` for general functions, charts, and maps. (calitp-data-analysis)= diff --git a/docs/analytics_tools/storing_data.md b/docs/analytics_tools/storing_data.md index ba1e59406a..8f50c07a89 100644 --- a/docs/analytics_tools/storing_data.md +++ b/docs/analytics_tools/storing_data.md @@ -63,7 +63,7 @@ import geopandas as gpd import gcsfs import pandas as pd -from calitp_data.storage import get_fs +from calitp_data_analysis import get_fs fs = get_fs() ``` diff --git a/docs/analytics_welcome/how_we_work.md b/docs/analytics_welcome/how_we_work.md index 555ef52605..c99aeea4b8 100644 --- a/docs/analytics_welcome/how_we_work.md +++ b/docs/analytics_welcome/how_we_work.md @@ -11,11 +11,10 @@ The section below outlines our team's primary meetings and their purposes, as we **New analysts**, look out for these meetings to be added to your calendar. -| Name | Cadence | Description | -| ------------------------ | -------------------- | ----------------------------------------------------------------------------------------------------------------- | -| **Technical Onboarding** | Week 1
40 Mins | To ensure access to tools and go over best practices. | -| **Data Office Hours** | Tues
50 Mins | An opportunity to learn about our software tools, ask technical questions and get code debugging help from peers. | -| **Analyst Team Meeting** | Thurs
45 Mins | Branch meeting to share your screen and discuss what you've been working on. | +| Name | Cadence | Description | +| ------------------------ | -------------------- | ---------------------------------------------------------------------------- | +| **Technical Onboarding** | Week 1
40 Mins | To ensure access to tools and go over best practices. | +| **Analyst Team Meeting** | Thurs
45 Mins | Branch meeting to share your screen and discuss what you've been working on. | (slack-intro)= @@ -52,8 +51,6 @@ The screencast below introduces: ### GitHub Analytics Repo -#### Using the data-analyses Repo - -This is our main data analysis repository, for sharing quick reports and works in progress. Get set up on GitHub and clone the data-analyses repository [using this link](committing-from-jupyterhub). +The `data-analyses` repo is our main data analysis repository, for sharing quick reports and works in progress. Get set up on GitHub and clone the data-analyses repository [using this link](committing-from-jupyterhub). For collaborative short-term tasks, create a new folder and work off a separate branch. From b2b531d07ed5a7aae9ae0e9edc8d80bda5f2ccc2 Mon Sep 17 00:00:00 2001 From: tiffanychu90 Date: Fri, 29 Sep 2023 17:37:16 +0000 Subject: [PATCH 6/7] (docs): clean up code blocks --- .../01-data-analysis-intro.md | 14 +++++++------- .../05-spatial-analysis-basics.md | 2 +- .../06-spatial-analysis-intro.md | 12 +++++++++--- 3 files changed, 17 insertions(+), 11 deletions(-) diff --git a/docs/analytics_new_analysts/01-data-analysis-intro.md b/docs/analytics_new_analysts/01-data-analysis-intro.md index 9d5a1613bb..755ecc6ccd 100644 --- a/docs/analytics_new_analysts/01-data-analysis-intro.md +++ b/docs/analytics_new_analysts/01-data-analysis-intro.md @@ -291,7 +291,7 @@ To answer the question of how many Paunch Burger locations there are per Council ``` # Method #1: groupby and agg -pivot = (merge2.groupby(['CD', 'Geometry_y']) +pivot = (merge2.groupby(['CD']) .agg({'Sales_millions': 'sum', 'Store': 'count', 'Population': 'mean'} @@ -300,7 +300,7 @@ pivot = (merge2.groupby(['CD', 'Geometry_y']) # Method #2: pivot table pivot = merge2.pivot_table( - index= ['CD', 'Geometry_y'], + index= ['CD'], values = ['Sales_millions', 'Store', 'Population'], aggfunc= { 'Sales_millions': 'sum', @@ -316,11 +316,11 @@ pivot = merge2.pivot_table( `pivot` looks like this: -| CD | Geometry_y | Sales_millions | Store | Council_Member | Population | -| --- | ---------- | -------------- | ----- | --------------- | ---------- | -| 1 | polygon | $9 | 2 | Leslie Knope | 1,500 | -| 2 | polygon | $8.5 | 2 | Jeremy Jamm | 2,000 | -| 3 | polygon | $2.5 | 1 | Douglass Howser | 2,250 | +| CD | Sales_millions | Store | Council_Member | Population | +| --- | -------------- | ----- | --------------- | ---------- | +| 1 | $9 | 2 | Leslie Knope | 1,500 | +| 2 | $8.5 | 2 | Jeremy Jamm | 2,000 | +| 3 | $2.5 | 1 | Douglass Howser | 2,250 | ## Export Aggregated Output diff --git a/docs/analytics_new_analysts/05-spatial-analysis-basics.md b/docs/analytics_new_analysts/05-spatial-analysis-basics.md index ef292c0a81..8f9b533c87 100644 --- a/docs/analytics_new_analysts/05-spatial-analysis-basics.md +++ b/docs/analytics_new_analysts/05-spatial-analysis-basics.md @@ -95,7 +95,7 @@ gdf.crs gdf = gdf.to_crs('EPSG:2229') ``` -Sometimes, the gdf does not have a CRS set and you will need to be manually set it. This might occur if you create the `geometry` column from latitude and longitude points. More on this in the [intermediate tutorial](geo-intermediate): +Sometimes, the gdf does not have a CRS set and you will need to be manually set it. This might occur if you create the `geometry` column from latitude and longitude points. More on this in the [intermediate tutorial](geo-intermediate). There are [lots of different CRS available](https://epsg.io). The most common ones used for California are: diff --git a/docs/analytics_new_analysts/06-spatial-analysis-intro.md b/docs/analytics_new_analysts/06-spatial-analysis-intro.md index 79aafcb97f..8e7e3310e9 100644 --- a/docs/analytics_new_analysts/06-spatial-analysis-intro.md +++ b/docs/analytics_new_analysts/06-spatial-analysis-intro.md @@ -160,7 +160,7 @@ summary.to_file(driver = 'ESRI Shapefile', ## Buffers -Buffers are areas of a certain distance around a given point, line, or polygon. Buffers are used to determine proximity . A 5 mile buffer around a point would be a circle of 5 mile radius centered at the point. This [ESRI page](http://desktop.arcgis.com/en/arcmap/10.3/tools/analysis-toolbox/buffer.htm) shows how buffers for points, lines, and polygons look. +Buffers are areas of a certain distance around a given point, line, or polygon. Buffers are used to determine proximity. A 5 mile buffer around a point would be a circle of 5 mile radius centered at the point. This [ESRI page](http://desktop.arcgis.com/en/arcmap/10.3/tools/analysis-toolbox/buffer.htm) shows how buffers for points, lines, and polygons look. Some examples of questions that buffers help answer are: @@ -216,11 +216,17 @@ homes_buffer['geometry'] = homes.geometry.buffer(two_miles) Do a spatial join between `locations` and `homes_buffer`. Repeat the process of spatial join and aggregation in Python as illustrated in the previous section (spatial join and dissolve in ArcGIS). ``` -sjoin = gpd.sjoin(locations, homes_buffer, how = 'inner', predicate = 'intersects') +sjoin = gpd.sjoin( + locations, + homes_buffer, + how = 'inner', + predicate = 'intersects' +) + sjoin ``` -`sjoin` looks like this. +`sjoin` looks like this (without `Geometry_y` column). Using `how='left' or how='inner'` will show `Geometry_x` as the resulting geometry column, while using `how = 'right'` would show `Geometry_y` as the resulting geometry column. Only one geometry column is returned in a spatial join, but you have the flexibility in determining which one it is by changing `how=`. - Geometry_x is the point geometry from our left df `locations`. - Geometry_y is the polygon geometry from our right df `homes_buffer`. From c0f6dad0c81f24f21f19166b2888d185e5097520 Mon Sep 17 00:00:00 2001 From: tiffanychu90 Date: Fri, 29 Sep 2023 17:40:36 +0000 Subject: [PATCH 7/7] move pygis link --- docs/analytics_new_analysts/overview.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/analytics_new_analysts/overview.md b/docs/analytics_new_analysts/overview.md index b38f32e488..0d455a4a66 100644 --- a/docs/analytics_new_analysts/overview.md +++ b/docs/analytics_new_analysts/overview.md @@ -22,10 +22,10 @@ This section is geared towards data analysts who are new to Python. The followin - [Practical Python for Data Science by Jill Cates](https://www.practicalpythonfordatascience.com/intro.html) - [Ben-Gurion University of the Negev - Geometric operations](https://geobgu.xyz/py/geopandas2.html) - [Geographic Thinking for Data Scientists](https://geographicdata.science/book/notebooks/01_geo_thinking.html) +- [PyGIS Geospatial Tutorials](https://pygis.io/docs/a_intro.html) - [Python Courses, compiled by our team](https://docs.google.com/spreadsheets/d/1Omow8F0SUiMx1jyG7GpbwnnJ5yWqlLeMH7SMtKxwG80/edit?usp=sharing) - [Why Dask?](https://docs.dask.org/en/stable/why.html) - [10 Minutes to Dask](https://docs.dask.org/en/stable/10-minutes-to-dask.html) -- [PyGIS](https://pygis.io/docs/a_intro.html) ### Books