diff --git a/site/_toc.yml b/site/_toc.yml index 2f48335..2f0f119 100644 --- a/site/_toc.yml +++ b/site/_toc.yml @@ -8,6 +8,7 @@ chapters: - file: content/installation - file: content/intro-to-python - file: content/intro-to-pandas +- file: content/web-scraping-static - file: content/python-aws - file: content/technical-support - file: content/resources diff --git a/site/content/web-scraping-static.ipynb b/site/content/web-scraping-static.ipynb new file mode 100644 index 0000000..71c0c34 --- /dev/null +++ b/site/content/web-scraping-static.ipynb @@ -0,0 +1,627 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "a0bb74c8", + "metadata": {}, + "source": [ + "# Web Scraping with Static Pages\n", + "Today we will be learning how to scrape text from a static webpage. By static, we mean that the webpage does not change its content based on user input (e.g. clicks, textboxes, etc.). We will cover the following concepts today:\n", + "- Inspecting a webpage\n", + "- What are HTML tags and why are they important?\n", + "- How to use the `requests` library to get the HTML content of a webpage\n", + "- How to use the `BeautifulSoup` library to parse the HTML content and extract just the parts we want\n", + "- Getting the final output into a workable format using `Pandas`" + ] + }, + { + "cell_type": "markdown", + "id": "cc3c1e50", + "metadata": {}, + "source": [ + "## Example\n", + "\n", + "Our example web page will be the Wikipedia page listing National Parks in the United States: https://en.wikipedia.org/wiki/List_of_national_parks_of_the_United_States. We'll use this example to showcase a few different approaches to scraping text from a static webpage.\n", + "Let's say we wanted to generate a list of National Parks and their state/territory, which would look something like this:\n", + "\n", + "| Park | State/Territory |\n", + "|-----|-----------------|\n", + "| Acadia | Maine |\n", + "| American Samoa | American Samoa |\n", + "| Arches | Utah |\n", + "| ... | ... |" + ] + }, + { + "cell_type": "markdown", + "id": "7a508082", + "metadata": {}, + "source": [ + "## Inspecting a Web Page to learn more about it\n", + "By right clicking on a web page and selecting \"Inspect\" or \"Inspect Element\" you can see the HTML and CSS that makes up the page. \n", + "You can also right click on the specific text or data elements that you want to extract and select \"Inspect\" to see the HTML and CSS that makes up that specific element.\n", + "Let's start by right clicking on \"Acadia\" and clicking \"Inspect\". We will see several HTML tags (indicated with <> symbols), including the one corresponding to where the name appears:\n", + "\n", + "```html\n", + "
+ "<th scope=\"row\"><a href=\"/wiki/Acadia_National_Park\" title=\"Acadia National Park\">Acadia</a></th>\n",
+ "```\n",
+ "\n",
+ "## What are HTML tags and why are they important?\n",
+ "HTML tags tell the browser how each piece of a page's content should be structured and displayed. Some common tags include:\n",
+ "- `<p>`: Paragraph\n",
+ "- `<a>`: Link\n",
+ "- `<table>`, `<tr>`, `<th>`, `<td>`: Table, table row, table header, table cell\n",
+ "- `<img>`: Image\n",
+ "\n",
+ "\n",
+ "So turning back to our example above, we see that national park names are enclosed in a `<th>` tag and the corresponding state/territory names seem to be enclosed in `<td>` tags. We can also see both of these tags are nested within a `<tr>` tag (table row), which is itself nested within a `<table>` tag. This is a common structure for tables in HTML and helpful to know for when we start extracting data."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bfff03dd",
+ "metadata": {},
+ "source": [
+ "## Making a request to a webpage\n",
+ "To get the HTML content from this page, we can use the `requests` library in Python. The `requests.get()` function will return a `Response` object, which contains the HTML content of the webpage. \n",
+ "\n",
+ "When we print the response object, the number will tell us if the request was successful. See [this link](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) more detailed information on the possible numbers, but generally any response in the 200s means the request was successful."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "bd9383aa",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "
`, we can use the `find_all()` function to extract all instances of these tags. Let's start here to see what happens:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "db876bb3",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[ Name\n",
+ " , Image\n",
+ " , Location\n",
+ " , Date established as park[12]\n",
+ " , Area (2023)[8]\n",
+ " , Recreation visitors (2022)[11]\n",
+ " , Description\n",
+ " , Acadia\n",
+ " , American Samoa\n",
+ " , Arches\n",
+ " , Badlands\n",
+ " , Biscayne\n",
+ " , Black Canyon of the Gunnison\n",
+ " , Bryce Canyon\n",
+ " , Canyonlands\n",
+ " , Capitol Reef\n",
+ " , Crater Lake\n",
+ " , Cuyahoga Valley\n",
+ " , Gates of the Arctic\n",
+ " , Gateway Arch\n",
+ " , Great Basin\n",
+ " , Great Sand Dunes\n",
+ " , Guadalupe Mountains\n",
+ " , Hot Springs\n",
+ " , Indiana Dunes\n",
+ " , Katmai\n",
+ " , Kenai Fjords\n",
+ " , Kobuk Valley\n",
+ " , Lake Clark\n",
+ " , Lassen Volcanic\n",
+ " , Mount Rainier\n",
+ " , New River Gorge\n",
+ " , North Cascades\n",
+ " , Petrified Forest\n",
+ " , Pinnacles\n",
+ " , Saguaro\n",
+ " , Shenandoah\n",
+ " , Theodore Roosevelt\n",
+ " , Virgin Islands\n",
+ " , Voyageurs\n",
+ " , White Sands\n",
+ " , Wind Cave\n",
+ " , Zion\n",
+ " , State , Total parks , Exclusive parks , Shared parks\n",
+ " , , ]\n"
+ ]
+ }
+ ],
+ "source": [
+ "from bs4 import BeautifulSoup\n",
+ "soup = BeautifulSoup(response.content, 'lxml')\n",
+ "table_headers = soup.find_all('th')\n",
+ "print(table_headers)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b347eeb0",
+ "metadata": {},
+ "source": [
+ "We can see that the output definitely seems to contain the national park names! But also probably some other stuff we don't want. One useful trick is to try to narrow your search to *just* the table or object that contains the info you want; in this case, note that the HTML of the table we want actually contains a caption. We can search for that caption and then use the `find_parent` function to find the table that contains it. Then, within just that table, we can once again search for the table headers. Let's try this now:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "efa5e01a",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[ Name\n",
+ " , Image\n",
+ " , Location\n",
+ " , Date established as park[12]\n",
+ " , Area (2023)[8]\n",
+ " , Recreation visitors (2022)[11]\n",
+ " , Description\n",
+ " , Acadia\n",
+ " , American Samoa\n",
+ " , Arches\n",
+ " , Badlands\n",
+ " , Biscayne\n",
+ " , Black Canyon of the Gunnison\n",
+ " , Bryce Canyon\n",
+ " , Canyonlands\n",
+ " , Capitol Reef\n",
+ " , Crater Lake\n",
+ " , Cuyahoga Valley\n",
+ " , Gates of the Arctic\n",
+ " , Gateway Arch\n",
+ " , Great Basin\n",
+ " , Great Sand Dunes\n",
+ " , Guadalupe Mountains\n",
+ " , Hot Springs\n",
+ " , Indiana Dunes\n",
+ " , Katmai\n",
+ " , Kenai Fjords\n",
+ " , Kobuk Valley\n",
+ " , Lake Clark\n",
+ " , Lassen Volcanic\n",
+ " , Mount Rainier\n",
+ " , New River Gorge\n",
+ " , North Cascades\n",
+ " , Petrified Forest\n",
+ " , Pinnacles\n",
+ " , Saguaro\n",
+ " , Shenandoah\n",
+ " , Theodore Roosevelt\n",
+ " , Virgin Islands\n",
+ " , Voyageurs\n",
+ " , White Sands\n",
+ " , Wind Cave\n",
+ " , Zion\n",
+ " ]\n"
+ ]
+ }
+ ],
+ "source": [
+ "from bs4 import BeautifulSoup\n",
+ "soup = BeautifulSoup(response.content, 'lxml')\n",
+ "# Find the caption\n",
+ "caption = soup.find('caption', text=\"List of U.S. national parks\\n\")\n",
+ "# Find its parent table\n",
+ " = caption.find_parent('table')\n",
+ "# Find the table headers that make up that table\n",
+ "table_headers = table.find_all('th')\n",
+ "print(table_headers)\n"
+ ]
+ },
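+ {
+ "cell_type": "markdown",
+ "id": "c3d5a9f2",
+ "metadata": {},
+ "source": [
+ "As an aside, BeautifulSoup can also search with CSS selectors via the `select()` function. A minimal sketch of the same caption-then-parent idea (assuming the caption text is unique on the page):\n",
+ "```python\n",
+ "for caption in soup.select('table > caption'):\n",
+ "    if caption.get_text().strip() == 'List of U.S. national parks':\n",
+ "        table = caption.find_parent('table')\n",
+ "```"
+ ]
+ },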
+ {
+ "cell_type": "markdown",
+ "id": "ac484917",
+ "metadata": {},
+ "source": [
+ "Okay, now we've narrowed things down nicely. But how do we actually pull the text from this word salad? We can do this using the `get_text()` function and a simple Python `for` loop:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "f9745494",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "['Name\\n', 'Image\\n', 'Location\\n', 'Date established as park[12]\\n', 'Area (2023)[8]\\n', 'Recreation visitors (2022)[11]\\n', 'Description\\n', 'Acadia\\n', 'American Samoa\\n', 'Arches\\n', 'Badlands\\n', 'Biscayne\\n', 'Black Canyon of the Gunnison\\n', 'Bryce Canyon\\n', 'Canyonlands\\n', 'Capitol Reef\\n', 'Crater Lake\\n', 'Cuyahoga Valley\\n', 'Gates of the Arctic\\n', 'Gateway Arch\\n', 'Great Basin\\n', 'Great Sand Dunes\\n', 'Guadalupe Mountains\\n', 'Hot Springs\\n', 'Indiana Dunes\\n', 'Katmai\\n', 'Kenai Fjords\\n', 'Kobuk Valley\\n', 'Lake Clark\\n', 'Lassen Volcanic\\n', 'Mount Rainier\\n', 'New River Gorge\\n', 'North Cascades\\n', 'Petrified Forest\\n', 'Pinnacles\\n', 'Saguaro\\n', 'Shenandoah\\n', 'Theodore Roosevelt\\n', 'Virgin Islands\\n', 'Voyageurs\\n', 'White Sands\\n', 'Wind Cave\\n', 'Zion\\n']\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Create an empty list that we will store text in\n",
+ "table_header_text = []\n",
+ "# Loop over all of the table headers BeautifulSoup has found for us\n",
+ "for header in table_headers:\n",
+ " # Add the text of the header to our list using get_text()\n",
+ " table_header_text.append(header.get_text())\n",
+ "\n",
+ "print(table_header_text)"
+ ]
+ },
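+ {
+ "cell_type": "markdown",
+ "id": "7d1e5f3b",
+ "metadata": {},
+ "source": [
+ "As a quick aside, a loop like this is often written as a list comprehension. The one-liner below is equivalent to the loop above, and we'll use this pattern again shortly:\n",
+ "```python\n",
+ "table_header_text = [header.get_text() for header in table_headers]\n",
+ "```"
+ ]
+ },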
+ {
+ "cell_type": "markdown",
+ "id": "ba4ff78e",
+ "metadata": {},
+ "source": [
+ "Nearly done! Two more quick things to clean this up. First, we remove all of the extraneous matches at the beginning of our list. Second, we remove the newline character from the end of each string. \n",
+ "\n",
+ "(Remember, Python starts indexing at 0, and we want to exclude the first 7 items.)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "6eaf04f5",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "['Acadia', 'American Samoa', 'Arches', 'Badlands', 'Biscayne', 'Black Canyon of the Gunnison', 'Bryce Canyon', 'Canyonlands', 'Capitol Reef', 'Crater Lake', 'Cuyahoga Valley', 'Gates of the Arctic', 'Gateway Arch', 'Great Basin', 'Great Sand Dunes', 'Guadalupe Mountains', 'Hot Springs', 'Indiana Dunes', 'Katmai', 'Kenai Fjords', 'Kobuk Valley', 'Lake Clark', 'Lassen Volcanic', 'Mount Rainier', 'New River Gorge', 'North Cascades', 'Petrified Forest', 'Pinnacles', 'Saguaro', 'Shenandoah', 'Theodore Roosevelt', 'Virgin Islands', 'Voyageurs', 'White Sands', 'Wind Cave', 'Zion']\n",
+ "36 entries\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Get items in our list starting from the 8th item\n",
+ "national_park_names = table_header_text[7:]\n",
+ "# Strip whitespace from the beginning and end of each item\n",
+ "national_park_names = [park.strip() for park in national_park_names]\n",
+ "print(national_park_names)\n",
+ "print(f'{len(national_park_names)} entries')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1a5fcf9f",
+ "metadata": {},
+ "source": [
+ "Woohoo! Now let's move on to the state/territory names, which we recall live inside `` tags. Conveniently, we don't have to start from the `soup` object that contains *all* the webpage's HTML, but can start again from the `table` object we created above which contains just the table we want. Let's try this now:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "6958a7c2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "table_cells = table.find_all('td')\n",
+ "# print(table_cells)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "184fdada",
+ "metadata": {},
+ "source": [
+ "The print statement is commented out because the output returned is so long, but try it yourself to see! This is not uncommon for very HTML-rich pages.\n",
+ "\n",
+ "We can see that the returned HTML definitely contains the state/territory names, but also has a lot of other extraneous text. Again, this is a useful real-life example, because there are in practice lots of other elements that might share the type of tag with the data you want. How can we get more specific?\n",
+ "\n",
+ "Yet again, we can use the `find_parent` trick. Note that beneath each state/territory name is a set of coordinates that have a unique `` tag, unlike the other text in ` ` tags. So we can:\n",
+ "1. Search for the small tags\n",
+ "2. Find all td tags that are parents of these small tags\n",
+ "3. Extract just the location from each td tag"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "adb910ac",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "['Maine', 'American Samoa', 'Utah', 'South Dakota', 'Texas', 'Florida', 'Colorado', 'Utah', 'Utah', 'Utah', 'New Mexico', 'California', 'South Carolina', 'Oregon', 'Ohio', 'California', 'Alaska', 'Florida', 'Florida', 'Alaska', 'Missouri', 'Montana', 'Alaska', 'Arizona', 'Wyoming', 'Nevada', 'Colorado', 'North Carolina', 'Texas', 'Hawaii', 'Hawaii', 'Arkansas', 'Indiana', 'Michigan', 'California', 'Alaska', 'Alaska', 'California', 'Alaska', 'Alaska', 'California', 'Kentucky', 'Colorado', 'Washington (state)', 'West Virginia', 'Washington (state)', 'Washington (state)', 'Arizona', 'California', 'California', 'Colorado', 'Arizona', 'California', 'Virginia', 'North Dakota', 'United States Virgin Islands', 'Minnesota', 'New Mexico', 'South Dakota', 'Alaska', 'Wyoming', 'California', 'Utah']\n",
+ "63 entries\n"
+ ]
+ }
+ ],
+ "source": [
+ "#1. Get small tags\n",
+ "small_tags = table.find_all('small')\n",
+ "td_tags = []\n",
+ "#2. Loop over small tags and get their parent td tags\n",
+ "for small_tag in small_tags:\n",
+ " td_tags.append(small_tag.find_parent('td'))\n",
+ "\n",
+ "#3. Extract just the title text from the td tags\n",
+ "states_and_territories = []\n",
+ "for tag in td_tags:\n",
+ " states_and_territories.append(tag.a['title']) # The .a['title'] is needed to get just the text under the 'title' attribute; otherwise this would include the coordinates too!\n",
+ "\n",
+ "print(states_and_territories)\n",
+ "print(f'{len(states_and_territories)} entries')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1422a81c",
+ "metadata": {},
+ "source": [
+ "And we're 100% fini...OH NO! We have a problem! Why do we have 63 locations but only 37 parks? Well, it looks like some of the national parks are missing, specifically those that have symbols next to them on Wikipedia. Upon inspection, these are actually in ` ` tags, not ` ` tags. However, it appears they all contain `scope=\"row\"`. This is one more nifty feature of Beautiful Soup - the ability to feed in custom attributes that fit the quirks of our use case. Here's the syntax for how we do it:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "0c714c01",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "['Big Bend †', 'Carlsbad Caverns *', 'Channel Islands †', 'Congaree †', 'Death Valley †', 'Denali †', 'Dry Tortugas †', 'Everglades ‡', 'Glacier ‡', 'Glacier Bay ‡', 'Grand Canyon *', 'Grand Teton †', 'Great Smoky Mountains ‡', 'Haleakalā †', 'Hawaiʻi Volcanoes ‡', 'Isle Royale †', 'Joshua Tree †', 'Kings Canyon †', 'Mammoth Cave ‡', 'Mesa Verde *', 'Olympic ‡', 'Redwood *', 'Rocky Mountain †', 'Sequoia †', 'Wrangell–St.\\xa0Elias *', 'Yellowstone ‡', 'Yosemite *']\n"
+ ]
+ }
+ ],
+ "source": [
+ "national_parks_extra_tags = table.find_all('td', attrs={'scope': 'row'})\n",
+ "national_parks_extra = []\n",
+ "for park in national_parks_extra_tags:\n",
+ " national_parks_extra.append(park.get_text().strip())\n",
+ "print(national_parks_extra)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b2329e16",
+ "metadata": {},
+ "source": [
+ "Now, let's append the two lists of national parks together and re-sort them in alphabetical order. Finally, we've got both of our lists of 63 entries."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "74ccc614",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "['Acadia', 'American Samoa', 'Arches', 'Badlands', 'Big Bend †', 'Biscayne', 'Black Canyon of the Gunnison', 'Bryce Canyon', 'Canyonlands', 'Capitol Reef', 'Carlsbad Caverns *', 'Channel Islands †', 'Congaree †', 'Crater Lake', 'Cuyahoga Valley', 'Death Valley †', 'Denali †', 'Dry Tortugas †', 'Everglades ‡', 'Gates of the Arctic', 'Gateway Arch', 'Glacier Bay ‡', 'Glacier ‡', 'Grand Canyon *', 'Grand Teton †', 'Great Basin', 'Great Sand Dunes', 'Great Smoky Mountains ‡', 'Guadalupe Mountains', 'Haleakalā †', 'Hawaiʻi Volcanoes ‡', 'Hot Springs', 'Indiana Dunes', 'Isle Royale †', 'Joshua Tree †', 'Katmai', 'Kenai Fjords', 'Kings Canyon †', 'Kobuk Valley', 'Lake Clark', 'Lassen Volcanic', 'Mammoth Cave ‡', 'Mesa Verde *', 'Mount Rainier', 'New River Gorge', 'North Cascades', 'Olympic ‡', 'Petrified Forest', 'Pinnacles', 'Redwood *', 'Rocky Mountain †', 'Saguaro', 'Sequoia †', 'Shenandoah', 'Theodore Roosevelt', 'Virgin Islands', 'Voyageurs', 'White Sands', 'Wind Cave', 'Wrangell–St.\\xa0Elias *', 'Yellowstone ‡', 'Yosemite *', 'Zion']\n",
+ "63 entries\n"
+ ]
+ }
+ ],
+ "source": [
+ "national_parks_combined = sorted(national_park_names + national_parks_extra)\n",
+ "print(national_parks_combined)\n",
+ "print(f'{len(national_parks_combined)} entries')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bc9d3d28",
+ "metadata": {},
+ "source": [
+ "## Manipulating the text into a useful format\n",
+ "\n",
+ "As a final step, we will convert these two lists we've scraped into a pandas dataframe (for R users, analogous to a tibble). This example will be light on the pandas-specific code, but this is often an important part of web scraping. BeautifulSoup is a powerful tool, but it sometimes outputs data in a format that is not immediately useful, even with some of the tricks we applied above. Pandas can help us clean and manipulate this data into a more useful format.\n",
+ "\n",
+ "Note: Pandas even has a built-in function to read HTML tables directly from a webpage, which can be a nice starting point for certain examples (like this one believe it or not). BeautifulSoup is your workhorse for static webpage scraping, but this is worth knowing about if you're a pandas user. You'd be surprised how far the two lines below will get you:\n",
+ "```python\n",
+ "df = pd.read_html('https://en.wikipedia.org/wiki/List_of_national_parks_of_the_United_States')\n",
+ "df[0][['Name', 'Location']]\n",
+ "```\n",
+ "Web scraping with pandas is outside the scope of our lesson here, but worth exploring. Back to our example:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "0ed87934",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " National Park State\n",
+ "0 Acadia Maine\n",
+ "1 American Samoa American Samoa\n",
+ "2 Arches Utah\n",
+ "3 Badlands South Dakota\n",
+ "4 Big Bend † Texas\n",
+ ".. ... ...\n",
+ "58 Wind Cave South Dakota\n",
+ "59 Wrangell–St. Elias * Alaska\n",
+ "60 Yellowstone ‡ Wyoming\n",
+ "61 Yosemite * California\n",
+ "62 Zion Utah\n",
+ "\n",
+ "[63 rows x 2 columns]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Convert this into a pandas dataframe\n",
+ "import pandas as pd\n",
+ "df = pd.DataFrame({\n",
+ " 'National Park': national_parks_combined,\n",
+ " 'State': states_and_territories\n",
+ "})\n",
+ "print(df)"
+ ]
+ },
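+ {
+ "cell_type": "markdown",
+ "id": "8a4c2e9d",
+ "metadata": {},
+ "source": [
+ "At this point you could also save the table for later use. A one-line sketch (the filename here is just an example):\n",
+ "```python\n",
+ "df.to_csv('national_parks.csv', index=False)\n",
+ "```"
+ ]
+ },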
+ {
+ "cell_type": "markdown",
+ "id": "a53053fb",
+ "metadata": {},
+ "source": [
+ "## BONUS: \n",
+ "Some of our more adventurous participants may have caught early on that national parks can be in multiple states and territories. \n",
+ "\n",
+ "For instance, the Great Smoky Mountains are in both North Carolina and Tennessee. Let's tweak the above code to handle that, starting at #3."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "8fd731d9",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "National Park Great Smoky Mountains ‡\n",
+ "State North Carolina\n",
+ "Name: 27, dtype: object"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df.iloc[27]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "29034453",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'Acadia': ['Maine'], 'American Samoa': ['American Samoa'], 'Arches': ['Utah'], 'Badlands': ['South Dakota'], 'Big Bend †': ['Texas'], 'Biscayne': ['Florida'], 'Black Canyon of the Gunnison': ['Colorado'], 'Bryce Canyon': ['Utah'], 'Canyonlands': ['Utah'], 'Capitol Reef': ['Utah'], 'Carlsbad Caverns *': ['New Mexico'], 'Channel Islands †': ['California'], 'Congaree †': ['South Carolina'], 'Crater Lake': ['Oregon'], 'Cuyahoga Valley': ['Ohio'], 'Death Valley †': ['California', 'Nevada'], 'Denali †': ['Alaska'], 'Dry Tortugas †': ['Florida'], 'Everglades ‡': ['Florida'], 'Gates of the Arctic': ['Alaska'], 'Gateway Arch': ['Missouri'], 'Glacier Bay ‡': ['Montana'], 'Glacier ‡': ['Alaska'], 'Grand Canyon *': ['Arizona'], 'Grand Teton †': ['Wyoming'], 'Great Basin': ['Nevada'], 'Great Sand Dunes': ['Colorado'], 'Great Smoky Mountains ‡': ['North Carolina', 'Tennessee'], 'Guadalupe Mountains': ['Texas'], 'Haleakalā †': ['Hawaii'], 'Hawaiʻi Volcanoes ‡': ['Hawaii'], 'Hot Springs': ['Arkansas'], 'Indiana Dunes': ['Indiana'], 'Isle Royale †': ['Michigan'], 'Joshua Tree †': ['California'], 'Katmai': ['Alaska'], 'Kenai Fjords': ['Alaska'], 'Kings Canyon †': ['California'], 'Kobuk Valley': ['Alaska'], 'Lake Clark': ['Alaska'], 'Lassen Volcanic': ['California'], 'Mammoth Cave ‡': ['Kentucky'], 'Mesa Verde *': ['Colorado'], 'Mount Rainier': ['Washington'], 'New River Gorge': ['West Virginia'], 'North Cascades': ['Washington'], 'Olympic ‡': ['Washington'], 'Petrified Forest': ['Arizona'], 'Pinnacles': ['California'], 'Redwood *': ['California'], 'Rocky Mountain †': ['Colorado'], 'Saguaro': ['Arizona'], 'Sequoia †': ['California'], 'Shenandoah': ['Virginia'], 'Theodore Roosevelt': ['North Dakota'], 'Virgin Islands': ['U.S. Virgin Islands'], 'Voyageurs': ['Minnesota'], 'White Sands': ['New Mexico'], 'Wind Cave': ['South Dakota'], 'Wrangell–St.\\xa0Elias *': ['Alaska'], 'Yellowstone ‡': ['Wyoming', 'Montana', 'Idaho'], 'Yosemite *': ['California'], 'Zion': ['Utah']}\n",
+ " 0 1 2\n",
+ "Acadia Maine None None\n",
+ "American Samoa American Samoa None None\n",
+ "Arches Utah None None\n",
+ "Badlands South Dakota None None\n",
+ "Big Bend † Texas None None\n",
+ "... ... ... ...\n",
+ "Wind Cave South Dakota None None\n",
+ "Wrangell–St. Elias * Alaska None None\n",
+ "Yellowstone ‡ Wyoming Montana Idaho\n",
+ "Yosemite * California None None\n",
+ "Zion Utah None None\n",
+ "\n",
+ "[63 rows x 3 columns]\n"
+ ]
+ }
+ ],
+ "source": [
+ "states_and_territories = {}\n",
+ "# Enumerate allows us to loop over a list and get the index (in this case, \"park_number\") of the item as well\n",
+ "for park_number, tag in enumerate(td_tags):\n",
+ " # Before, we were just getting the first 'a' tag, now let's get all of them for a given table cell\n",
+ " a_tags = tag.find_all('a')\n",
+ " # Get the titles of each location in a list, except for the last one which is extraneous info\n",
+ " locations = [a.text for a in a_tags][:-1]\n",
+ " # Create a dictionary where the key is the national park name and the value is the list of locations\n",
+ " dict_key = national_parks_combined[park_number]\n",
+ " states_and_territories[dict_key] = locations\n",
+ "\n",
+ "print(states_and_territories)\n",
+ "print(pd.DataFrame.from_dict(states_and_territories, \n",
+ " orient='index'))"
+ ]
+ },
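+ {
+ "cell_type": "markdown",
+ "id": "6b9d1c4e",
+ "metadata": {},
+ "source": [
+ "If you'd prefer one row per park/state pair instead of the `None`-padded wide format above, pandas can reshape the dictionary for us. A minimal sketch (`Series.explode()` assumes pandas 0.25 or newer):\n",
+ "```python\n",
+ "long_df = (pd.Series(states_and_territories, name='State')\n",
+ "           .explode()                   # one row per state in each park's list\n",
+ "           .rename_axis('National Park')\n",
+ "           .reset_index())\n",
+ "```"
+ ]
+ },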
+ {
+ "cell_type": "markdown",
+ "id": "82b18f93",
+ "metadata": {},
+ "source": [
+ "### Looking Ahead\n",
+ "Next week we'll introduce Selenium for scraping dynamic content. We'll be scraping this website, so a quick perusal to familiarize yourself could be helpful: https://www.kff.org/interactive/subsidy-calculator/"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.7"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}