+ +
+ +
+

Intro to Web Scraping#

+
+

What is web scraping?#

+
    +
  • Web scraping is the use of programming to extract structured text or data from a website

  • +
  • It is generally used to automate tasks that would take too long (or be too error-prone) to feasibly do manually

  • +
  • There are two main categories of web scraping tasks: (1) Collecting text data from one or more web pages and (2) automating the download of a number of files from a website

  • +
+
+
+

How does Urban use web scraping?#

+ +
+
+

What are some drawbacks of web scraping?#

+
    +
  • Not all sites can be legally or responsibly scraped

  • +
  • Repeated requests to a website can lead to rate limiting (i.e. capping the number of requests over a certain period of time)

  • +
  • Depending on the task and site layout, complexity can vary widely

  • +
  • Web scraping code can be brittle as websites change over time

  • +
+
+
+

Why is this web scraping bootcamp being taught in Python?#

+
    +
  • Python ecosystem more mature, flexible, and better-suited for dynamic web pages

  • +
  • Functionality in R is growing and evolving (e.g. the rvest package)

  • +
  • We may consider R tools for future versions of this workshop

  • +
+
+
+

What questions should I be asking at the outset?#

+
    +
  • Can I get the data without web scraping? (e.g. Is there an API or download option? Can you contact the site owner to request access?)

  • +
  • Am I legally allowed to scrape the website? Are there any site/rate limits or responsible web scraping considerations?

  • +
  • How many datasets or pieces of text need to be scraped?

  • +
  • Is webpage layout consistent or unstandardized?

  • +
  • Are there Captchas, pop-ups, or ads blocking the content you want?

  • +
  • Does the webpage have slow or inconsistent load times?

  • +
  • What tools/packages are needed for the job? (We will learn this throughout the workshop!)

  • +
+
+
+

What are the variables that affect how difficult a web scraping task is?#

+
    +
  1. How many different websites or pages are involved in the web scraping process?

  2. +
  3. Does the website have dynamic content or only static content?

  4. +
  5. Is it straightforward to extract the info we want once we reach the desired webpage?

  6. +
+
+
+

1. Different Webpages#

+
    +
  • Intuitively, scraping information from one website is simpler than doing so from many websites

  • +
  • If the layouts of the sites are different, difficulty vastly increases

    +
      +
    • Rule of thumb: Think of this as a unique web scraping task for each uniquely structured website

    • +
    +
  • +
  • Web crawlers such as scrapy exist to traverse many websites and grab all relevant information, but without easy ways to filter through that metadata, this can quickly become infeasible

  • +
  • For jobs that take a long time to run (e.g. more than a few hours), gracefully logging and handling issues can add complexity

  • +
+
+
+

2. Static vs Dynamic Content#

+
    +
  • For a static page like a Wikipedia article, packages like BeautifulSoup or pandas can grab HTML text without too much complexity by parsing HTML tags

  • +
  • For pages with dynamic content like clickable buttons or dropdown menus, the Selenium package is needed and the complexity goes up

    +
      +
    • Rule of thumb: Would a human user need to take any actions (besides scrolling up or down) to navigate to the desired info, or is it immediately available on the webpage?

    • +
    +
  • +
+
+
+

3. Identifying Desired Information#

+
    +
  • Possible future task: scraping area median income from HUD website

  • +
  • Upside: Only one webpage, can use Selenium to navigate dropdowns

  • +
  • Downside: Numbers we want to grab can be in different places within each webpage

  • +
+
+
+

Responsible Web Scraping Guidelines#

+
    +
  1. Check the robots.txt file - let’s look at an example: https://www.urban.org/robots.txt

  2. +
  3. Consult Urban’s Automated Data Collection Guidelines.

  4. +
  5. Use Headers (we’ll see this in action next week)

  6. +
+

headers = {'user-agent': 'Urban Institute Research Data Collector ([your_e-mail]@urban.org)'}

+
    +
  1. Use Site Monitor to ensure web scraping does not strain the website

  2. +
+
+
+

Site Monitor#

+
    +
  • A tool created by Urban to ensure responsible web scraping practices

  • +
  • The actual code for Site Monitor lives here in this GitHub repository

  • +
  • Example code to test strain on a website

  • +
+
from site_monitor import *
+import requests
+
+sm = SiteMonitor(burn_in=20)
+
+for i in range(100):
+	print(i)
+	url = "https://flscns.fldoe.org/PbInstituteCourseSearch.aspx"
+	response = requests.get(url)
+	delay = sm.track_request(response)
+
+# Display the report of response times in graph format
+sm.report('display')
+
+
+
+
+

Example Output from Site Monitor#

+

{fig-align=’center’}

+
+
+

A note on ~AI~#

+
    +
  • We don’t expect you to understand 100% of the code throughout this bootcamp.

  • +
  • We want to emphasize the idea of concepts > syntax.

  • +
  • Urban is still working through testing of its guidelines for use of AI, and while things are murky, we want to focus on building Python and web scraping intuition.

  • +
  • This workshop will not use Copilot or ChatGPT, though we acknowledge the utility of those tools if you are asking from a place of conceptual understanding.

  • +
+
+
+

Homework: Installations for Next Time#

+
    +
  • Install Python via Anaconda - see guidance from PUG’s Python Installation training here

  • +
  • Install the following Python packages: requests, beautifulsoup4, lxml, and selenium

  • +
  • Launch a new Jupyter Notebook if you’ve never done so before - see guidance from PUG’s Intro to Python training here

  • +
  • If you have any issues, please use the #python-users channel and we’d love to help. Someone else probably has the same question!

  • +
  • Sign up for GitHub using this guide if you haven’t so that you can access these workshop materials!

  • +
+
+
+

Next Session#

+
    +
  • How to scrape text from static webpages using BeautifulSoup

  • +
  • Diving into some Python code!

  • +
+
+
+ + + + +
+ +