diff --git a/site/_toc.yml b/site/_toc.yml index 2f48335..2f0f119 100644 --- a/site/_toc.yml +++ b/site/_toc.yml @@ -8,6 +8,7 @@ chapters: - file: content/installation - file: content/intro-to-python - file: content/intro-to-pandas +- file: content/web-scraping-static - file: content/python-aws - file: content/technical-support - file: content/resources diff --git a/site/content/web-scraping-static.ipynb b/site/content/web-scraping-static.ipynb new file mode 100644 index 0000000..71c0c34 --- /dev/null +++ b/site/content/web-scraping-static.ipynb @@ -0,0 +1,627 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "a0bb74c8", + "metadata": {}, + "source": [ + "# Web Scraping with Static Pages\n", + "Today we will be learning how to scrape text from a static webpage. By static, we mean that the webpage does not change its content based on user input (such as clicks or text boxes). We will cover the following concepts today:\n", + "- Inspecting a webpage\n", + "- What are HTML tags and why are they important?\n", + "- How to use the `requests` library to get the HTML content of a webpage\n", + "- How to use the `BeautifulSoup` library to parse the HTML content and extract just the parts we want\n", + "- Getting the final output into a workable format using `pandas`" + ] + }, + { + "cell_type": "markdown", + "id": "cc3c1e50", + "metadata": {}, + "source": [ + "## Example\n", + "\n", + "Our example webpage will be the Wikipedia page listing National Parks in the United States: https://en.wikipedia.org/wiki/List_of_national_parks_of_the_United_States. We'll use this example to showcase a few different approaches to scraping text from a static webpage.\n", + "Suppose we want to generate a list of National Parks and their state/territory, which would look something like this:\n", + "\n", + "| Park | State/Territory |\n", + "|-----|-----------------|\n", + "| Acadia | Maine |\n", + "| American Samoa | American Samoa |\n", + "| Arches | Utah |\n", + "| ... | ... 
|" + ] + }, + { + "cell_type": "markdown", + "id": "7a508082", + "metadata": {}, + "source": [ + "## Inspecting a Webpage to Learn More About It\n", + "By right-clicking on a webpage and selecting \"Inspect\" or \"Inspect Element\", you can see the HTML and CSS that make up the page. \n", + "You can also right-click on the specific text or data elements that you want to extract and select \"Inspect\" to see the HTML and CSS that make up that specific element.\n", + "Let's start by right-clicking on \"Acadia\" and clicking \"Inspect\". We will see several HTML tags (indicated with <> symbols), including the one corresponding to where the name appears:\n", + "\n", + "```html\n", + " == $0\n", + " Acadia\n", + "\n", + "```\n", + "\n", + "Similarly, for its state, we have:\n", + "\n", + "```html\n", + "\n", + " Maine\n", + "
\n", + "[...]\n", + "```\n", + "So what does any of this mean?" + ] + }, + { + "cell_type": "markdown", + "id": "9ce013c6", + "metadata": {}, + "source": [ + "## HTML Tags\n", + "Websites are built using HTML tags. Tags are used to create the structure of a website and indicate different headings, paragraphs, lists, links, images, and more. \n", + "\n", + "Tags are signified with an opening tag, like `<p>`, and a closing tag, like `</p>`. The closing tag is the same as the opening tag, but with a forward slash `/` before the tag name. The text between the opening and closing tags is the content of the tag. A few common HTML tags are listed below:\n", + "- `<h1>`, `<h2>`, `<h3>`, `<h4>`, `<h5>`, `<h6>`: Headings in decreasing order of size\n", + "- `<p>`: Paragraph\n", + "- `<a>`: Link\n", + "- `