Commit 5bb3567 ("deploy: 7ebd233")
judah-axelrod committed Apr 29, 2024 (1 parent: b7263ff)
Showing 17 changed files with 888 additions and 19 deletions.
105 changes: 105 additions & 0 deletions _sources/content/web-scraping-intro.md
@@ -0,0 +1,105 @@
# Intro to Web Scraping

## What is web scraping?

- Web scraping is the use of programming to extract structured text or data from a website
- It is generally used to automate tasks that would take too long (or be too error-prone) to feasibly do manually
- There are two main categories of web scraping tasks: (1) collecting text data from one or more web pages, and (2) automating the download of many files from a website

## How does Urban use web scraping?

- Collecting thousands of community college course descriptions from the [FLDOE website](https://www.fldoe.org/)
- Downloading hundreds of CSV files from the [Centers for Medicare & Medicaid Services website](https://data.cms.gov/tools/mapping-medicare-disparities-by-population) that all required clicking different dropdowns from a menu of options
- Collecting the contact info for all notaries in Mississippi by clicking through thousands of pages on the [Secretary of State website](https://www.sos.ms.gov/notarysearch/notarysearch.aspx)
- Pulling voting history information from the [North Carolina State Board of Election website](https://vt.ncsbe.gov/RegLkup/) by searching for thousands of registered voters

## What are some drawbacks of web scraping?
- Not all sites can be legally or responsibly scraped
- Repeated requests to a website can lead to rate limiting (i.e. capping the number of requests over a certain period of time)
- Depending on the task and site layout, complexity can vary widely
- Web scraping code can be brittle as websites change over time
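
Rate limiting is usually survivable if the scraper waits and retries. As a sketch (the helpers `backoff_delay` and `get_with_backoff` are our own names, not from any particular library), an exponential backoff loop might look like:

```python
import time

def backoff_delay(attempt, base=1.0, cap=60.0):
    # Seconds to wait before retry number `attempt` (0-indexed):
    # doubles each time, but never exceeds `cap`
    return min(cap, base * (2 ** attempt))

def get_with_backoff(fetch, url, max_retries=4, base=1.0):
    # `fetch` is any callable like requests.get; retry while the site
    # answers 429 (Too Many Requests), waiting longer each time
    for attempt in range(max_retries):
        response = fetch(url)
        if response.status_code != 429:
            return response
        time.sleep(backoff_delay(attempt, base=base))
    return response
```

In practice you would pass `requests.get` as `fetch`; the pattern matters more than the exact helper names.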

## Why is this web scraping bootcamp being taught in Python?
- The Python ecosystem is more mature, more flexible, and better suited for dynamic web pages
- Functionality in R is growing and evolving (e.g. the `rvest` package)
- We may consider R tools for future versions of this workshop

## What questions should I be asking at the outset?
- Can I get the data without web scraping? (e.g. Is there an API or a download option? Can I contact the site owner to request access?)
- Am I legally allowed to scrape the website? Are there any site/rate limits or responsible web scraping considerations?
- How many datasets or pieces of text need to be scraped?
- Is the webpage layout consistent across pages, or unstandardized?
- Are there Captchas, pop-ups, or ads blocking the content I want?
- Does the webpage have slow or inconsistent load times?
- What tools/packages are needed for the job? (We will learn this throughout the workshop!)

## What are the variables that affect how difficult a web scraping task is?
1. How many different websites or pages are involved in the web scraping process?
2. Does the website have dynamic content or only static content?
3. Is it straightforward to extract the info we want once we reach the desired webpage?

## 1. Different Webpages
- Intuitively, scraping information from one website is simpler than doing so from many websites
- If the layouts of the sites are different, difficulty vastly increases
- **Rule of thumb**: Think of this as a unique web scraping task for each uniquely structured website
- Web crawlers such as `scrapy` exist to traverse many websites and grab all relevant information, but without easy ways to filter the resulting metadata, this can quickly become infeasible
- For jobs that take a long time to run (e.g. more than a few hours), gracefully logging and handling issues can add complexity
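
For long-running jobs, one common pattern (sketched here with a hypothetical `scrape_all` helper; `fetch` stands in for whatever function actually retrieves a page) is to log each failure and keep going rather than crash hours in:

```python
import logging

def scrape_all(urls, fetch):
    # Try every URL; record failures instead of raising, so one
    # bad page cannot sink an hours-long job
    results, failures = {}, []
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:
            logging.warning("failed on %s: %s", url, exc)
            failures.append(url)
    return results, failures
```

The returned `failures` list lets you re-run just the pages that broke instead of starting over.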

## 2. Static vs Dynamic Content
- For a static page like a [Wikipedia article](https://en.wikipedia.org/wiki/Urban_Institute), packages like `BeautifulSoup` or `pandas` can grab HTML text without too much complexity by parsing HTML tags
- For pages with dynamic content like clickable buttons or dropdown menus, the `Selenium` package is needed and the complexity goes up
- **Rule of thumb**: Would a human user need to take any actions (besides scrolling up or down) to navigate to the desired info, or is it immediately available on the webpage?
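
To make the static case concrete, here is a minimal sketch: the HTML string below stands in for a page you would normally fetch with `requests.get(url).text`, and `BeautifulSoup` pulls text out by tag:

```python
from bs4 import BeautifulSoup

# Stand-in for a downloaded page (normally requests.get(url).text)
html = """
<html><body>
  <h1>Urban Institute</h1>
  <ul id="topics">
    <li>Housing</li>
    <li>Education</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1").get_text()
topics = [li.get_text() for li in soup.find("ul", id="topics").find_all("li")]
print(title)   # Urban Institute
print(topics)  # ['Housing', 'Education']
```

We will cover `BeautifulSoup` properly next session; the point here is just that for static pages, parsing tags is all there is to it.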

## 3. Identifying Desired Information
- Possible future task: scraping area median income from [HUD website](https://www.huduser.gov/portal/datasets/il.html)
- Upside: Only one webpage, can use `Selenium` to navigate dropdowns
- Downside: Numbers we want to grab can be in different places within each webpage
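
One way around that downside is to match patterns rather than positions. The figures below are made up, but a regex that grabs dollar amounts works wherever they sit on the page:

```python
import re

def find_dollar_amounts(text):
    # Pull every $-figure out of free-form page text, so extraction
    # does not depend on where the number sits in the layout
    return [int(m.replace(",", "")) for m in re.findall(r"\$([\d,]+)", text)]

page_text = "FY 2024 median family income: $91,900 (up from $85,000)."
print(find_dollar_amounts(page_text))  # [91900, 85000]
```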


## Responsible Web Scraping Guidelines
1. Check the robots.txt file - let's look at an example: [https://www.urban.org/robots.txt](https://www.urban.org/robots.txt)
2. Consult Urban's [Automated Data Collection Guidelines](https://urbanorg.account.box.com/login?redirect_url=https%3A%2F%2Furbanorg.app.box.com%2Fs%2Fmam9kpf48mu92f4ktpyuw218yf8j45w0).
3. Use Headers (we'll see this in action next week)

`headers = {'user-agent': 'Urban Institute Research Data Collector ([your_e-mail]@urban.org)'}`

4. Use Site Monitor to ensure web scraping does not strain the website
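
The robots.txt check from step 1 can be automated with Python's standard library. In this sketch the rules are a made-up example parsed from a string; against a live site you would use `rp.set_url(...)` and `rp.read()` instead:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; a real one lives at https://<site>/robots.txt
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

agent = "Urban Institute Research Data Collector"
print(rp.can_fetch(agent, "https://example.org/data.html"))  # True
print(rp.can_fetch(agent, "https://example.org/private/x"))  # False
```

The `headers` dictionary from step 3 would then be passed along with each request, e.g. `requests.get(url, headers=headers)`.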

## Site Monitor
- A tool created by Urban to ensure responsible web scraping practices
- The actual code for Site Monitor lives [here](https://github.com/UrbanInstitute/SiteMonitor/blob/master/site_monitor.py) in this GitHub repository
- Example code to test strain on a website
```python
from site_monitor import *
import requests

sm = SiteMonitor(burn_in=20)
url = "https://flscns.fldoe.org/PbInstituteCourseSearch.aspx"
for i in range(100):
    print(i)
    response = requests.get(url)
    delay = sm.track_request(response)

# Display the report of response times in graph format
sm.report('display')
```
## Example Output from Site Monitor
![](images/site_monitor_output.png){fig-align='center'}

## A note on \~AI\~
- We don't expect you to understand 100% of the code throughout this bootcamp.
- We want to emphasize the idea of concepts > syntax.
- Urban is still testing its guidelines for the use of AI; while things are murky, we want to focus on building Python and web scraping intuition.
- This workshop will not use Copilot or ChatGPT, though we acknowledge the utility of those tools *if you are asking from a place of conceptual understanding*.

## Homework: Installations for Next Time
- Install Python via Anaconda - see guidance from PUG's Python Installation training [here](https://ui-research.github.io/python-at-urban/content/installation.html)
- Install the following Python packages: `requests`, `beautifulsoup4`, `lxml`, and `selenium`
- Launch a new Jupyter Notebook if you've never done so before - see guidance from PUG's Intro to Python training [here](https://ui-research.github.io/python-at-urban/content/intro-to-python.html)
- If you have any issues, please ask in the #python-users channel; we'd love to help. Someone else probably has the same question!
- Sign up for GitHub using [this guide](https://ui-research.github.io/reproducibility-at-urban/git-installation.html) if you haven't already so that you can access these workshop materials!

## Next Session
- How to scrape text from static webpages using BeautifulSoup
- Diving into some Python code!
5 changes: 5 additions & 0 deletions content/index.html
@@ -135,6 +135,11 @@ <h1 class="site-logo" id="site-title">Python Users Group</h1>
Intro to Pandas
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="web-scraping-intro.html">
Intro to Web Scraping
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="web-scraping-static.html">
Web Scraping with Static Pages
5 changes: 5 additions & 0 deletions content/installation.html
@@ -136,6 +136,11 @@ <h1 class="site-logo" id="site-title">Python Users Group</h1>
Intro to Pandas
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="web-scraping-intro.html">
Intro to Web Scraping
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="web-scraping-static.html">
Web Scraping with Static Pages
11 changes: 8 additions & 3 deletions content/intro-to-pandas.html
@@ -54,7 +54,7 @@
<script async="async" src="../_static/sphinx-thebe.js"></script>
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="Web Scraping with Static Pages" href="web-scraping-static.html" />
<link rel="next" title="Intro to Web Scraping" href="web-scraping-intro.html" />
<link rel="prev" title="Intro to Python" href="intro-to-python.html" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="docsearch:language" content="None">
@@ -136,6 +136,11 @@ <h1 class="site-logo" id="site-title">Python Users Group</h1>
Intro to Pandas
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="web-scraping-intro.html">
Intro to Web Scraping
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="web-scraping-static.html">
Web Scraping with Static Pages
@@ -3848,10 +3853,10 @@ <h3>Collapsing Data<a class="headerlink" href="#collapsing-data" title="Permalin
<p class="prev-next-title">Intro to Python</p>
</div>
</a>
<a class='right-next' id="next-link" href="web-scraping-static.html" title="next page">
<a class='right-next' id="next-link" href="web-scraping-intro.html" title="next page">
<div class="prev-next-info">
<p class="prev-next-subtitle">next</p>
<p class="prev-next-title">Web Scraping with Static Pages</p>
<p class="prev-next-title">Intro to Web Scraping</p>
</div>
<i class="fas fa-angle-right"></i>
</a>
5 changes: 5 additions & 0 deletions content/intro-to-python.html
@@ -138,6 +138,11 @@ <h1 class="site-logo" id="site-title">Python Users Group</h1>
Intro to Pandas
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="web-scraping-intro.html">
Intro to Web Scraping
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="web-scraping-static.html">
Web Scraping with Static Pages
5 changes: 5 additions & 0 deletions content/python-aws.html
@@ -136,6 +136,11 @@ <h1 class="site-logo" id="site-title">Python Users Group</h1>
Intro to Pandas
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="web-scraping-intro.html">
Intro to Web Scraping
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="web-scraping-static.html">
Web Scraping with Static Pages
5 changes: 5 additions & 0 deletions content/resources.html
@@ -135,6 +135,11 @@ <h1 class="site-logo" id="site-title">Python Users Group</h1>
Intro to Pandas
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="web-scraping-intro.html">
Intro to Web Scraping
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="web-scraping-static.html">
Web Scraping with Static Pages
5 changes: 5 additions & 0 deletions content/technical-support.html
@@ -136,6 +136,11 @@ <h1 class="site-logo" id="site-title">Python Users Group</h1>
Intro to Pandas
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="web-scraping-intro.html">
Intro to Web Scraping
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="web-scraping-static.html">
Web Scraping with Static Pages
20 changes: 14 additions & 6 deletions content/web-scraping-dynamic.html
@@ -136,6 +136,11 @@ <h1 class="site-logo" id="site-title">Python Users Group</h1>
Intro to Pandas
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="web-scraping-intro.html">
Intro to Web Scraping
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="web-scraping-static.html">
Web Scraping with Static Pages
@@ -613,14 +618,17 @@ <h2>Getting Started<a class="headerlink" href="#getting-started" title="Permalin
<div class="cell_output docutils container">
<div class="output traceback highlight-ipythontb notranslate"><div class="highlight"><pre><span></span><span class="gt">---------------------------------------------------------------------------</span>
<span class="ne">ModuleNotFoundError</span><span class="g g-Whitespace"> </span>Traceback (most recent call last)
<span class="n">Cell</span> <span class="n">In</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">line</span> <span class="mi">3</span>
<span class="g g-Whitespace"> </span><span class="mi">1</span> <span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="g g-Whitespace"> </span><span class="mi">2</span> <span class="kn">import</span> <span class="nn">os</span>
<span class="ne">----&gt; </span><span class="mi">3</span> <span class="kn">from</span> <span class="nn">selenium</span> <span class="kn">import</span> <span class="n">webdriver</span>
<span class="n">Cell</span> <span class="n">In</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">line</span> <span class="mi">6</span>
<span class="g g-Whitespace"> </span><span class="mi">4</span> <span class="kn">from</span> <span class="nn">selenium.webdriver.common.by</span> <span class="kn">import</span> <span class="n">By</span>
<span class="g g-Whitespace"> </span><span class="mi">5</span> <span class="kn">from</span> <span class="nn">selenium.webdriver.chrome.service</span> <span class="kn">import</span> <span class="n">Service</span>

<span class="ne">ModuleNotFoundError</span>: No module named &#39;selenium&#39;
<span class="ne">----&gt; </span><span class="mi">6</span> <span class="kn">from</span> <span class="nn">webdriver_manager.chrome</span> <span class="kn">import</span> <span class="n">ChromeDriverManager</span>
<span class="g g-Whitespace"> </span><span class="mi">7</span> <span class="c1">## NOTE: Some users may want to try a Firefox Driver instead;</span>
<span class="g g-Whitespace"> </span><span class="mi">8</span> <span class="c1">## Can comment above two lines and uncomment the below two lines</span>
<span class="g g-Whitespace"> </span><span class="mi">9</span> <span class="c1"># from selenium.webdriver.firefox.service import Service</span>
<span class="g g-Whitespace"> </span><span class="mi">10</span> <span class="c1"># from webdriver_manager.firefox import GeckoDriverManager</span>
<span class="g g-Whitespace"> </span><span class="mi">11</span> <span class="kn">from</span> <span class="nn">selenium.webdriver.support</span> <span class="kn">import</span> <span class="n">expected_conditions</span> <span class="k">as</span> <span class="n">EC</span>

<span class="ne">ModuleNotFoundError</span>: No module named &#39;webdriver_manager&#39;
</pre></div>
</div>
</div>