Intro to Web Scraping
What is web scraping?

- Web scraping is the use of programming to extract structured text or data from a website
- It is generally used to automate tasks that would take too long (or be too error-prone) to feasibly do manually
- There are two main categories of web scraping tasks: (1) collecting text data from one or more web pages and (2) automating the download of a number of files from a website

How does Urban use web scraping?

- Collecting thousands of community college course descriptions from the FLDOE website
- Downloading hundreds of CSV files from the Centers for Medicare & Medicaid Services website, each of which required clicking different dropdowns from a menu of options
- Collecting the contact info for all notaries in Mississippi by clicking through thousands of pages on the Secretary of State website
- Pulling voting history information from the North Carolina State Board of Elections website by searching for thousands of registered voters

What are some drawbacks of web scraping?

- Not all sites can be legally or responsibly scraped
- Repeated requests to a website can lead to rate limiting (i.e., capping the number of requests over a certain period of time); see the pacing sketch after this list
- Depending on the task and site layout, complexity can vary widely
- Web scraping code can be brittle as websites change over time

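As a minimal sketch, one way to avoid rate limiting is simply to pause between requests; the URLs below are placeholders:

import time

import requests

# Hypothetical list of pages to visit; replace with real URLs
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(2)  # pause between requests so we don't overwhelm the site
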
Why is this web scraping bootcamp being taught in Python?

- The Python ecosystem is more mature, more flexible, and better suited for dynamic web pages
- Functionality in R is growing and evolving (e.g., the rvest package)
- We may consider R tools for future versions of this workshop

What questions should I be asking at the outset?

- Can I get the data without web scraping? (e.g., is there an API or download option? Can I contact the site owner to request access?)
- Am I legally allowed to scrape the website? Are there any site/rate limits or responsible web scraping considerations?
- How many datasets or pieces of text need to be scraped?
- Is the webpage layout consistent or unstandardized?
- Are there CAPTCHAs, pop-ups, or ads blocking the content I want?
- Does the webpage have slow or inconsistent load times?
- What tools/packages are needed for the job? (We will learn this throughout the workshop!)

What are the variables that affect how difficult a web scraping task is?

- How many different websites or pages are involved in the web scraping process?
- Does the website have dynamic content or only static content?
- Is it straightforward to extract the info we want once we reach the desired webpage?

1. Different Webpages

- Intuitively, scraping information from one website is simpler than doing so from many websites
- If the layouts of the sites are different, difficulty vastly increases
  - Rule of thumb: think of each uniquely structured website as its own web scraping task

- Web crawlers such as scrapy exist to traverse many websites and grab all relevant information, but without easy ways to filter through that metadata, this can quickly become infeasible
- For jobs that take a long time to run (e.g., more than a few hours), gracefully logging and handling issues can add complexity (see the sketch after this list)

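As a minimal sketch, a long-running scraping loop can log failures and keep going instead of crashing partway through; the log file name and URLs below are hypothetical:

import logging

import requests

# Hypothetical log file; each success or failure gets a line here
logging.basicConfig(filename="scrape_log.txt", level=logging.INFO)

urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        logging.info("Success: %s", url)
    except requests.RequestException as err:
        # Log the problem and move on rather than stopping the whole job
        logging.warning("Failed: %s (%s)", url, err)
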
2. Static vs Dynamic Content

- For a static page like a Wikipedia article, packages like BeautifulSoup or pandas can grab HTML text without too much complexity by parsing HTML tags (see the sketch after this list)
- For pages with dynamic content like clickable buttons or dropdown menus, the Selenium package is needed and the complexity goes up
  - Rule of thumb: would a human user need to take any actions (besides scrolling up or down) to navigate to the desired info, or is it immediately available on the webpage?

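As a minimal sketch of the static case (we will walk through BeautifulSoup properly next session), the snippet below pulls the title and section headings from a Wikipedia article:

import requests
from bs4 import BeautifulSoup

# Any static Wikipedia article works similarly
url = "https://en.wikipedia.org/wiki/Web_scraping"
response = requests.get(url)

# Parse the raw HTML and pull out text by tag
soup = BeautifulSoup(response.text, "lxml")
print(soup.find("h1").get_text())      # article title
for heading in soup.find_all("h2"):
    print(heading.get_text())          # section headings
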
3. Identifying Desired Information

- Possible future task: scraping area median income from the HUD website
- Upside: only one webpage; we can use Selenium to navigate the dropdowns (a generic sketch follows this list)
- Downside: the numbers we want to grab can be in different places within each webpage

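Here is a generic sketch of navigating a dropdown with Selenium; the URL and element ids are hypothetical placeholders, not the actual HUD page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

# Hypothetical page and element ids, for illustration only
driver = webdriver.Chrome()
driver.get("https://example.com/income-limits")

# Choose an option from a dropdown menu by its visible text
dropdown = Select(driver.find_element(By.ID, "state-select"))
dropdown.select_by_visible_text("Florida")

# Grab the text of the element that holds the result
print(driver.find_element(By.ID, "results").text)
driver.quit()
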
Responsible Web Scraping Guidelines

- Check the robots.txt file; let’s look at an example: https://www.urban.org/robots.txt (a programmatic check is sketched after this list)
- Consult Urban’s Automated Data Collection Guidelines
- Use headers (we’ll see this in action next week):

headers = {'user-agent': 'Urban Institute Research Data Collector ([your_e-mail]@urban.org)'}

- Use Site Monitor to ensure web scraping does not strain the website

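As a quick sketch of the robots.txt and headers points above, Python’s built-in urllib.robotparser can check whether a path may be fetched, and the headers dictionary can be passed directly to requests.get; the path checked below is just an example:

import urllib.robotparser

import requests

# Ask robots.txt whether a given path may be fetched by any crawler ("*")
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.urban.org/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.urban.org/research"))

# Identify yourself to the site owner with a descriptive user-agent
headers = {'user-agent': 'Urban Institute Research Data Collector ([your_e-mail]@urban.org)'}
response = requests.get("https://www.urban.org", headers=headers)
print(response.status_code)
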
Site Monitor

- A tool created by Urban to ensure responsible web scraping practices
- The actual code for Site Monitor lives in this GitHub repository
- Example code to test strain on a website:

from site_monitor import *
import requests

# Initialize Site Monitor with a 20-request burn-in period
sm = SiteMonitor(burn_in=20)

for i in range(100):
    print(i)
    url = "https://flscns.fldoe.org/PbInstituteCourseSearch.aspx"
    response = requests.get(url)
    # Track this request so Site Monitor can monitor response times
    delay = sm.track_request(response)

# Display the report of response times in graph format
sm.report('display')

Example Output from Site Monitor

[Figure: graph of response times produced by sm.report('display')]

A note on AI

- We don’t expect you to understand 100% of the code throughout this bootcamp.
- We want to emphasize the idea of concepts > syntax.
- Urban is still working through testing of its guidelines for use of AI, and while things are murky, we want to focus on building Python and web scraping intuition.
- This workshop will not use Copilot or ChatGPT, though we acknowledge the utility of those tools if you are asking from a place of conceptual understanding.

Homework: Installations for Next Time

- Install Python via Anaconda; see guidance from PUG’s Python Installation training here
- Install the following Python packages: requests, beautifulsoup4, lxml, and selenium (e.g., pip install requests beautifulsoup4 lxml selenium from the Anaconda Prompt)
- Launch a new Jupyter Notebook if you’ve never done so before; see guidance from PUG’s Intro to Python training here
- If you have any issues, please use the #python-users channel; we’d love to help, and someone else probably has the same question!
- Sign up for GitHub using this guide if you haven’t already so that you can access these workshop materials!

Next Session

- How to scrape text from static webpages using BeautifulSoup
- Diving into some Python code!