# Intro to Web Scraping

## What is web scraping?

- Web scraping is the use of programming to extract structured text or data from a website
- It is generally used to automate tasks that would take too long (or be too error-prone) to do manually
- There are two main categories of web scraping tasks: (1) collecting text data from one or more web pages and (2) automating the download of many files from a website

## How does Urban use web scraping?

- Collecting thousands of community college course descriptions from the [FLDOE website](https://www.fldoe.org/)
- Downloading hundreds of CSV files from the [Centers for Medicare & Medicaid Services website](https://data.cms.gov/tools/mapping-medicare-disparities-by-population), each of which required clicking through a different set of dropdown menus
- Collecting the contact info for all notaries in Mississippi by clicking through thousands of pages on the [Secretary of State website](https://www.sos.ms.gov/notarysearch/notarysearch.aspx)
- Pulling voting history information from the [North Carolina State Board of Elections website](https://vt.ncsbe.gov/RegLkup/) by searching for thousands of registered voters

## What are some drawbacks of web scraping?

- Not all sites can be legally or responsibly scraped
- Repeated requests to a website can lead to rate limiting (i.e., capping the number of requests allowed over a certain period of time)
- Complexity can vary widely depending on the task and site layout
- Web scraping code can be brittle because websites change over time

## Why is this web scraping bootcamp being taught in Python?

- The Python ecosystem is more mature, more flexible, and better suited for dynamic web pages
- Functionality in R is growing and evolving (e.g., the `rvest` package)
- We may consider R tools for future versions of this workshop

## What questions should I be asking at the outset?

- Can I get the data without web scraping? (e.g., is there an API or a download option? Can I contact the site owner to request access?)
- Am I legally allowed to scrape the website? Are there any site or rate limits, or other responsible web scraping considerations?
- How many datasets or pieces of text need to be scraped?
- Is the webpage layout consistent or unstandardized?
- Are there CAPTCHAs, pop-ups, or ads blocking the content I want?
- Does the webpage have slow or inconsistent load times?
- What tools/packages are needed for the job? (We will learn this throughout the workshop!)

## What are the variables that affect how difficult a web scraping task is?

1. How many different websites or pages are involved in the web scraping process?
2. Does the website have dynamic content or only static content?
3. Is it straightforward to extract the info we want once we reach the desired webpage?

## 1. Different Webpages

- Intuitively, scraping information from one website is simpler than doing so from many websites
- If the layouts of the sites are different, the difficulty increases dramatically
- **Rule of thumb**: Treat each uniquely structured website as its own web scraping task
- Web-crawling frameworks such as `scrapy` can traverse many websites and grab all relevant information, but without easy ways to filter through that metadata, this can quickly become infeasible
- For jobs that take a long time to run (e.g., more than a few hours), gracefully logging and handling issues can add complexity

## 2. Static vs Dynamic Content

- For a static page like a [Wikipedia article](https://en.wikipedia.org/wiki/Urban_Institute), packages like `BeautifulSoup` or `pandas` can grab HTML text without too much complexity by parsing HTML tags (see the sketch after this list)
- For pages with dynamic content like clickable buttons or dropdown menus, the `Selenium` package is needed and the complexity goes up
- **Rule of thumb**: Would a human user need to take any actions (besides scrolling up or down) to navigate to the desired info, or is it immediately available on the webpage?
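
To make the static case concrete, here is a minimal sketch (not part of the original slides) of pulling text out of a static Wikipedia page with `requests` and `BeautifulSoup`; the specific tags extracted are assumptions about that page's layout:

```
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Urban_Institute"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

# The <h1> tag holds the article title; <p> tags hold the body paragraphs
print(soup.find("h1").get_text())
for paragraph in soup.find_all("p")[:3]:
    print(paragraph.get_text())
```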

## 3. Identifying Desired Information

- Possible future task: scraping area median income from the [HUD website](https://www.huduser.gov/portal/datasets/il.html)
- Upside: Only one webpage, and we can use `Selenium` to navigate the dropdowns (see the sketch after this list)
- Downside: The numbers we want to grab can be in different places within each webpage
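
Since `Selenium` comes up here (and later in the workshop), here is a minimal sketch of what navigating a dropdown might look like; it assumes Chrome is installed, and the element ID and option text are hypothetical placeholders, since the HUD page's actual layout would need to be inspected first:

```
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()
driver.get("https://www.huduser.gov/portal/datasets/il.html")

# "year-select" and "2023" are hypothetical placeholders for a real dropdown
dropdown = Select(driver.find_element(By.ID, "year-select"))
dropdown.select_by_visible_text("2023")

driver.quit()
```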

## Responsible Web Scraping Guidelines

1. Check the robots.txt file - let's look at an example: [https://www.urban.org/robots.txt](https://www.urban.org/robots.txt)
2. Consult Urban's [Automated Data Collection Guidelines](https://urbanorg.account.box.com/login?redirect_url=https%3A%2F%2Furbanorg.app.box.com%2Fs%2Fmam9kpf48mu92f4ktpyuw218yf8j45w0).
3. Use headers (we'll see this in action next week; a minimal usage sketch follows this list)

`headers = {'user-agent': 'Urban Institute Research Data Collector ([your_e-mail]@urban.org)'}`

4. Use Site Monitor to ensure web scraping does not strain the website
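
To show where those headers fit, here is a minimal usage sketch (not part of the original slides); the target URL is just an example, and the email placeholder should be replaced with your own address:

```
import requests

# Identify who is making the requests so the site owner can contact you
headers = {'user-agent': 'Urban Institute Research Data Collector ([your_e-mail]@urban.org)'}
response = requests.get('https://www.urban.org/', headers=headers)
print(response.status_code)
```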

## Site Monitor

- A tool created by Urban to ensure responsible web scraping practices
- The actual code for Site Monitor lives [here](https://github.com/UrbanInstitute/SiteMonitor/blob/master/site_monitor.py) in its GitHub repository
- Example code to test strain on a website

```
from site_monitor import *
import requests

# Initialize Site Monitor with a 20-request burn-in period
sm = SiteMonitor(burn_in=20)

url = "https://flscns.fldoe.org/PbInstituteCourseSearch.aspx"
for i in range(100):
    print(i)
    response = requests.get(url)
    # Record each response so Site Monitor can track response times
    delay = sm.track_request(response)

# Display the report of response times in graph format
sm.report('display')
```

## Example Output from Site Monitor

![](images/site_monitor_output.png){fig-align='center'}

## A note on \~AI\~

- We don't expect you to understand 100% of the code throughout this bootcamp.
- We want to emphasize the idea of concepts > syntax.
- Urban is still testing its guidelines for the use of AI, and while things are murky, we want to focus on building Python and web scraping intuition.
- This workshop will not use Copilot or ChatGPT, though we acknowledge the utility of those tools *if you are asking from a place of conceptual understanding*.

## Homework: Installations for Next Time

- Install Python via Anaconda - see guidance from PUG's Python Installation training [here](https://ui-research.github.io/python-at-urban/content/installation.html)
- Install the following Python packages: `requests`, `beautifulsoup4`, `lxml`, and `selenium`
- Launch a new Jupyter Notebook if you've never done so before - see guidance from PUG's Intro to Python training [here](https://ui-research.github.io/python-at-urban/content/intro-to-python.html)
- If you have any issues, please post in the #python-users channel - we'd love to help, and someone else probably has the same question!
- Sign up for GitHub using [this guide](https://ui-research.github.io/reproducibility-at-urban/git-installation.html) if you haven't already so that you can access these workshop materials!

## Next Session

- How to scrape text from static webpages using BeautifulSoup
- Diving into some Python code!