Merge pull request #28 from UI-Research/web-scraping-workshop
Web scraping workshop
judah-axelrod authored Apr 26, 2024
2 parents 69e70e7 + 31adc62 commit 8bf1a71
Showing 6 changed files with 232 additions and 48 deletions.
@@ -438,6 +438,14 @@ <h2>What are some drawbacks of web scraping?</h2>
<li>Web scraping code can be brittle as websites change over time</li>
</ul>
</section>
<section id="why-is-this-web-scraping-bootcamp-being-taught-in-python" class="slide level2">
<h2>Why is this web scraping bootcamp being taught in Python?</h2>
<ul>
<li>Python ecosystem more mature, flexible, and better-suited for dynamic web pages</li>
<li>Functionality in R is growing and evolving (e.g.&nbsp;the <code>rvest</code> package)</li>
<li>We may consider R tools for future versions of this workshop</li>
</ul>
</section>
<section id="what-questions-should-i-be-asking-at-the-outset" class="slide level2">
<h2>What questions should I be asking at the outset?</h2>
<ul>
@@ -34,6 +34,11 @@ format:
- Depending on the task and site layout, complexity can vary widely
- Web scraping code can be brittle as websites change over time

## Why is this web scraping bootcamp being taught in Python?
- Python ecosystem more mature, flexible, and better-suited for dynamic web pages
- Functionality in R is growing and evolving (e.g. the `rvest` package)
- We may consider R tools for future versions of this workshop

## What questions should I be asking at the outset?
- Can I get the data without web scraping? (e.g. Is there an API or download option? Can you contact the site owner to request access?)
- Am I legally allowed to scrape the website? Are there any site/rate limits or responsible web scraping considerations?
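For the permissions and rate-limit bullet above, one lightweight first check is the site's `robots.txt`. The sketch below uses Python's built-in `urllib.robotparser`; the KFF URL is borrowed from the workshop notebooks later in this commit, and passing this check does not by itself settle the legal or terms-of-service questions.

```python
from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt and download it
robots = RobotFileParser("https://www.kff.org/robots.txt")
robots.read()

# can_fetch() reports whether robots.txt allows a given user agent
# ("*" = any generic crawler) to request a given URL
allowed = robots.can_fetch("*", "https://www.kff.org/interactive/subsidy-calculator/")
print(allowed)
```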
6 changes: 3 additions & 3 deletions misc-resources/web-scraping/workshop_2024/session4.ipynb
@@ -165,7 +165,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 1,
"id": "b58719e7",
"metadata": {},
"outputs": [],
@@ -186,7 +186,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"id": "f09491bc",
"metadata": {},
"outputs": [],
@@ -726,7 +726,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
"version": "3.10.8"
}
},
"nbformat": 4,
257 changes: 213 additions & 44 deletions misc-resources/web-scraping/workshop_2024/session5.ipynb
@@ -18,30 +18,17 @@
"\n",
"Okay, but what happens when its not a nice clean training example? In today's lesson, we'll go over some of the ways that web scraping can get messy and work around solutions. \n",
"\n",
"- Thinking through how any one pass through your loop might be different than others\n",
" - Does the webpage layout look different for certain options?\n",
" - Show example for Philadelphia County UJS - there are no evictions available for just that county: https://ujsportal.pacourts.us/\n",
"- Error handling to deal with slow websites or edge cases\n",
" - try/except logic\n",
" - time.sleep()\n",
"- Picking up where you leave off by adding function arguments\n",
"- Workshop\n",
" - Build upon session 4 example by adding error handling and pickup-where-you-left-off functionality"
"**TO ADD**\n",
" - time.sleep()"
]
},
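As a minimal illustration of the try/except and `time.sleep()` items in the agenda above, the sketch below retries a Selenium lookup a few times before giving up. The helper name `get_text_with_retry` and its parameters are illustrative and not part of the workshop's `utils.py`.

```python
import time

from selenium.common.exceptions import NoSuchElementException, TimeoutException

def get_text_with_retry(driver, locator, attempts=3, wait_seconds=5):
    """Try to read an element's text, pausing and retrying if the page is slow."""
    for _ in range(attempts):
        try:
            # locator is a (By.<strategy>, value) tuple, e.g. (By.ID, "some-element-id")
            return driver.find_element(*locator).text
        except (NoSuchElementException, TimeoutException):
            # The element may simply not have rendered yet; pause and try again
            time.sleep(wait_seconds)
    return None  # give up gracefully instead of crashing the whole scrape
```

Selenium's own `WebDriverWait` with expected conditions serves a similar purpose; the point here is just the pattern of catching a specific exception and sleeping before retrying.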
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"from utils import *\n",
"# Setup\n",
"# Launch driver\n",
"url = \"https://www.kff.org/interactive/subsidy-calculator/\"\n",
"service = Service(executable_path=ChromeDriverManager().install())\n",
"driver = webdriver.Chrome(service=service)\n",
"driver.get(url)"
"from utils import *"
]
},
{
@@ -161,51 +148,233 @@
"\n",
"How can we create safety nets within the code so that when something stops running (say you're computer went to sleep) the code automatically keeps running from where we left of? \n",
"\n",
"First, we need to think through how to get back to the state/county/zip code that we were on. To do this, it'll be helpful to have a counter running along with our code to tell us what number we're on. "
"A few important things that we need to handle are: \n",
"\n",
"1) keeping track of how many iterations we've already done to know where to start if the code gets interrupted\n",
"2) skipping to the correct spot in the list that we're iterating over when we start again (in this case the right county)\n",
"3) continuing to add values to the dictionary on top of what we've already scraped\n",
"\n",
"A useful tool for this is to include a counter variable in the `for` loops we wrote in session 4 to tell us which number loop we're on. We want it increase by one every time we move to a new county. \n"
]
},
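Before diving into the full workshop code, here is a stripped-down sketch of that counter idea. The tiny dictionary literal is only a stand-in for the real `state_counties_zipcodes` loaded from `data/zip_data_small.json`, and the zip codes are placeholders.

```python
# Stand-in for the nested {state: {county: zipcode}} structure used in the workshop
state_counties_zipcodes = {
    "pa": {"allegheny": "15201", "philadelphia": "19102"},
    "oh": {"franklin": "43004"},
}

counter = 0  # number of counties fully scraped so far
for state, counties in state_counties_zipcodes.items():
    for county, zip_code in counties.items():
        # ... the actual scraping for this county would happen here ...
        counter += 1
        print(f"Finished county #{counter}: {county}, {state} ({zip_code})")

# If a run is interrupted, the last printed counter tells us how many counties
# to skip next time and how many results should already be saved to output.json.
```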
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Now that we added a counter variable, we need to think through how to get back to the state/county/zip code that we were on. To do this, we'll write a function that skips ahead `n` rows in our state/county/zip code JSON file to start where we left off if the counter is > 0, otherwise it'll just read in the whole file. \n",
"\n",
"**NOTE: NOT SURE HOW TO TURN THIS INTO A TASK**"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 15,
"metadata": {},
"outputs": [],
"outputs": [
{
"ename": "NameError",
"evalue": "name 'json' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn [15], line 22\u001b[0m\n\u001b[1;32m 18\u001b[0m state_counties_zipcodes \u001b[38;5;241m=\u001b[39m json\u001b[38;5;241m.\u001b[39mload(file)\n\u001b[1;32m 20\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m(state_counties_zipcodes)\n\u001b[0;32m---> 22\u001b[0m test \u001b[38;5;241m=\u001b[39m \u001b[43mskip_counties\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m5\u001b[39;49m\u001b[43m)\u001b[49m\n",
"Cell \u001b[0;32mIn [15], line 13\u001b[0m, in \u001b[0;36mskip_counties\u001b[0;34m(counter)\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[38;5;28mnext\u001b[39m(file) \u001b[38;5;66;03m# Skip n rows\u001b[39;00m\n\u001b[1;32m 12\u001b[0m \u001b[38;5;66;03m# Read and parse JSON from the current position\u001b[39;00m\n\u001b[0;32m---> 13\u001b[0m state_counties_zipcodes \u001b[38;5;241m=\u001b[39m \u001b[43mjson\u001b[49m\u001b[38;5;241m.\u001b[39mload(file)\n\u001b[1;32m 14\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m: \n\u001b[1;32m 15\u001b[0m \u001b[38;5;66;03m# open JSON with state, county, and zip data\u001b[39;00m\n\u001b[1;32m 16\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[38;5;28mopen\u001b[39m(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mdata/zip_data_small.json\u001b[39m\u001b[38;5;124m'\u001b[39m) \u001b[38;5;28;01mas\u001b[39;00m file:\n\u001b[1;32m 17\u001b[0m \u001b[38;5;66;03m# Read and parse JSON from the current position\u001b[39;00m\n",
"\u001b[0;31mNameError\u001b[0m: name 'json' is not defined"
]
}
],
"source": [
"# Helper function to access the nth county in the state_counties_zipcodes\n",
"# dictionary if the counter is not 0\n",
"def skip_counties(n):\n",
"def skip_counties(counter):\n",
" ###--- Get list of all state and zip codes ---###\n",
"\n",
" # Assuming you have a CSV file with columns 'State' and 'ZIP Code'\n",
" csv_file_path = 'data/tate.csv'\n",
"\n",
" # read into a dataframe\n",
" if n != 0: \n",
" raw_csv = pd.read_csv(csv_file_path, skiprows=lambda x: x > 0 and x <= n, dtype={'zipcode': str})\n",
" # filter out new york because the page is different\n",
" raw_csv = raw_csv[~raw_csv['state_abbr'].isin(['ny', 'vt'])]\n",
" # skip to relevant county\n",
" if counter != 0: \n",
" # open JSON with state, county, and zip data\n",
" with open('data/zip_data_small.json') as file:\n",
" for n in range(counter):\n",
" next(file) # Skip n rows\n",
" # Read and parse JSON from the current position\n",
" state_counties_zipcodes = json.load(file)\n",
" else: \n",
" raw_csv = pd.read_csv(csv_file_path, dtype={'zipcode': str})\n",
" # filter out new york because the page is different \n",
" raw_csv = raw_csv[~raw_csv['state_abbr'].isin(['ny', 'vt'])]\n",
" # Create a nested dictionary\n",
" state_counties_zipcodes = {}\n",
" # open JSON with state, county, and zip data\n",
" with open('data/zip_data_small.json') as file:\n",
" # Read and parse JSON from the current position\n",
" state_counties_zipcodes = json.load(file)\n",
" \n",
" return(state_counties_zipcodes)\n",
"\n",
" for index, row in raw_csv.iterrows():\n",
" state = row['state_abbr']\n",
" county = row['county']\n",
" zipcode = row['zipcode']\n",
"test = skip_counties(5)"
]
},
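Skipping raw lines of a JSON file before calling `json.load`, as the cell above does, is fragile: once lines are consumed, what remains is generally no longer a complete JSON document. One alternative, sketched below under the assumption that `data/zip_data_small.json` holds the same nested `{state: {county: zipcode}}` structure the loops iterate over, is to load the whole file and then drop the first `counter` county entries. The function name `skip_counties_loaded` is illustrative.

```python
import itertools
import json

def skip_counties_loaded(counter, path='data/zip_data_small.json'):
    """Load the full state/county/zip lookup, then drop the first `counter`
    (state, county) entries so a restarted run resumes where it left off."""
    with open(path) as f:
        state_counties_zipcodes = json.load(f)  # {state: {county: zipcode}}

    # Flatten to (state, county, zipcode) triples; dicts preserve insertion
    # order, so this matches the order the scraping loop visits counties.
    triples = (
        (state, county, zipcode)
        for state, counties in state_counties_zipcodes.items()
        for county, zipcode in counties.items()
    )

    # Rebuild the nested dictionary from the entries not yet scraped
    remaining = {}
    for state, county, zipcode in itertools.islice(triples, counter, None):
        remaining.setdefault(state, {})[county] = zipcode
    return remaining
```

Whatever skipping strategy is used, it has to count in the same units as `counter` (one increment per county) for the resume logic to line up.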
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Now we're going to take the nested loop code that we wrote in session 4 and turn it into a function. The value add of turning it into a function is that we can have a counter value as the input. This way, if the \n",
"\n",
" if state not in state_counties_zipcodes:\n",
" state_counties_zipcodes[state] = {}\n",
"### TASK 2\n",
"\n",
" state_counties_zipcodes[state][county] = zipcode\n",
" return(state_counties_zipcodes)\n"
"1) Turn the code below into a function called `run_entire_loop` that takes a counter value as the input and returns a counter and the output file name\n",
"\n",
"2) Within the function, if the counter isn't at 0 (meaning that we're not at the beginning of the loop), we'll want to read in the `output.json` file and assign it to `premium_val_dict`. This lets us keep adding to the list of values that we've already scraped. Write code (or pseudo code) where you think this belongs\n",
"\n",
"3) Call the `skip_counties` function that we defined above to skip to the correct row from where we left off\n",
"\n",
"4) Increase the value of counter when we loop through each county and print the value that we're on to the console"
]
},
{
"cell_type": "raw",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
" \n",
" url = \"https://www.kff.org/interactive/subsidy-calculator/\"\n",
" service = Service(executable_path=ChromeDriverManager().install())\n",
" \n",
" driver = webdriver.Chrome(service=service)\n",
" driver.get(url)\n",
" \n",
" # NOTE: STILL NEED TO SAVE VALUES TO DICT\n",
" premium_val_dict = {} # initialize empty dictionary to capture the scraped values\n",
" # Set THRESHOLD at number of values you already have scraped + 1\n",
" age_values = [14, 17, 20, 19, 39] # indexes for 14, 20, 40, and 60 years\n",
"\n",
" \n",
"\n",
" \n",
" for state, counties in state_counties_zipcodes.items():\n",
" # set the state as the top key in the dictionary\n",
" state_dict = premium_val_dict.setdefault(state, {})\n",
" # loop through county, zip pairs\n",
" for county, zip_code in counties.items():\n",
" # initialize empty list to store premium plan values \n",
" premium_val_list = []\n",
"\n",
" \n",
" # set up top half of page\n",
" setup_page(state=state, driver = driver, county = county, zipcode=zip_code)\n",
" \n",
" for age in age_values:\n",
" \n",
" # scrape plan value\n",
" number = scrape_data(age = age, driver = driver)\n",
"\n",
" # for each zipcode, create a list of all of the premium plan costs for each age\n",
" # this will be saved with the zipcode key in the dictionary\n",
" premium_val_list.append(number)\n",
" \n",
" # at the end of looping through all ages in the zip code add premium values to dictionary\n",
" state_dict[county] = premium_val_list\n",
"\n",
" # Save the dictionary as a JSON file at the end of each loop\n",
" output_filename = f'output.json'\n",
" with open(output_filename, 'w') as json_file:\n",
" json.dump(premium_val_dict, json_file, indent=2) # 'indent' for pretty formatting (optional)\n",
"\n",
"\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def run_entire_loop(counter):\n",
" url = \"https://www.kff.org/interactive/subsidy-calculator/\"\n",
" service = Service(executable_path=ChromeDriverManager().install())\n",
" \n",
" driver = webdriver.Chrome(service=service)\n",
" driver.get(url)\n",
" \n",
" # NOTE: STILL NEED TO SAVE VALUES TO DICT\n",
" premium_val_dict = {} # initialize empty dictionary to capture the scraped values\n",
" # Set THRESHOLD at number of values you already have scraped + 1\n",
" age_values = [14, 17, 20, 19, 39] # indexes for 14, 20, 40, and 60 years\n",
"\n",
" # Read the JSON file\n",
" if counter != 0:\n",
" with open(f'output.json', 'r') as file:\n",
" premium_val_dict = json.load(file)\n",
"\n",
" # if counter is not 0, skip to correct spot in \n",
" # state_counties_zipcodes dictionary\n",
" state_counties_zipcodes = skip_counties(counter)\n",
" \n",
" for state, counties in state_counties_zipcodes.items():\n",
" # set the state as the top key in the dictionary\n",
" state_dict = premium_val_dict.setdefault(state, {})\n",
" # loop through county, zip pairs\n",
" for county, zip_code in counties.items():\n",
" # initialize empty list to store premium plan values \n",
" premium_val_list = []\n",
"\n",
" counter += 1\n",
"\n",
" print(counter)\n",
" \n",
" # set up top half of page\n",
" setup_page(state=state, driver = driver, county = county, zipcode=zip_code)\n",
" \n",
" for age in age_values:\n",
" \n",
" # scrape plan value\n",
" number = scrape_data(age = age, driver = driver)\n",
"\n",
" # for each zipcode, create a list of all of the premium plan costs for each age\n",
" # this will be saved with the zipcode key in the dictionary\n",
" premium_val_list.append(number)\n",
" \n",
" # at the end of looping through all ages in the zip code add premium values to dictionary\n",
" state_dict[county] = premium_val_list\n",
"\n",
" # Save the dictionary as a JSON file at the end of each loop\n",
" output_filename = f'output.json'\n",
" with open(output_filename, 'w') as json_file:\n",
" json.dump(premium_val_dict, json_file, indent=2) # 'indent' for pretty formatting (optional)\n",
"\n",
"\n",
" return(counter, output_filename)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Putting it all together "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if __name__ == '__main__':\n",
" \n",
" TOTAL_FILES = 10\n",
" counter = 0\n",
" while counter < TOTAL_FILES:\n",
" try: \n",
" counter, file = run_entire_loop(counter)\n",
" except: \n",
" time.sleep(60)"
]
}
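A slightly more defensive variant of the driver loop above is sketched below. It keeps the same `run_entire_loop` and `counter` logic, but catches `Exception` instead of using a bare `except:` (which would also swallow `KeyboardInterrupt`), prints what went wrong, and caps the number of retries. `MAX_RETRIES` and the messages are illustrative additions, not part of the workshop code.

```python
import time

if __name__ == '__main__':

    TOTAL_FILES = 10   # as above: compared against the county counter
    MAX_RETRIES = 20   # illustrative cap so a persistent failure cannot loop forever
    counter = 0
    retries = 0

    while counter < TOTAL_FILES and retries < MAX_RETRIES:
        try:
            # run_entire_loop is the function defined earlier in this notebook
            counter, file = run_entire_loop(counter)
        except Exception as err:
            retries += 1
            print(f"Interrupted after county {counter} ({err!r}); retrying in 60 seconds")
            time.sleep(60)  # give the site (or the connection) a moment before retrying
```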
],
"metadata": {
@@ -224,7 +393,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
"version": "3.10.8"
}
},
"nbformat": 4,
1 change: 1 addition & 0 deletions misc-resources/web-scraping/workshop_2024/utils.py
@@ -1,5 +1,6 @@
from bs4 import BeautifulSoup
import os
import json
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
3 changes: 2 additions & 1 deletion requirements.txt
@@ -2,4 +2,5 @@ jupyter-book==0.13.0
pandas
numpy
boto3
openpyxl
openpyxl
lxml_html_clean
