Skip to content

michelleminkoff/ji2014scraping

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 

Repository files navigation

NOTE FOR LATER: Run the script in Terminal: python ~/Downloads/ji2014scraping-master/umdmajors.py

CONCEPTUAL OVERVIEW OF SCRAPING

What is scraping?
Getting structured information from the unstructured
Why we like structured info
Best if we can end up with a spreadsheet or database (csv most common)

Why do it?
Exclusive storytelling
More control over info you can't get other way
More control over your information (sorting)
Get and save information

Be careful of...
Ethics
http://lethain.com/an-introduction-to-compassionate-screenscraping/
http://blog.screen-scraper.com/2008/04/21/screening-scraping-ethics/
Paid options
Overwhelming a site
Site changes under you

HOW DOES THE WEB WORK?

HTML/CSS/JS
Browser downloads files to your computer
How do these three languages play together?

Importance of patterns

How to identify a specific area on the page you're looking for
Element types
Ids
Classes
Hierarchy
Role of web inspector in seeing how this works on actual web page

What we need to scrape - content
Introduce sample website we scrape today
http://www.admissions.umd.edu/academics/Majors.php?m=1
A web address
Way to select overall block of content we want
Way to select specific items we want

What do we need to scrape - prog lang?
What is a scripting language?
Python, Ruby, Rails, Django, JavaScript, what does it all mean?
We'll use Python

What are modules/libraries?
Beef up what your language can do
http://xkcd.com/353/
We'll use:
Urllib2
Csv
Beautiful Soup

Interplay between typing and running code
What is command line?
What is text editor?
How do I use the command line to run a file I create in the text editor?
What is the shell?

Build that script with all of our pieces! (Finally!)

Run the script in Terminal: python ~/Downloads/ji2014scraping-master/umdmajors.py

Import our libraries
Start a list for all of our info
Add headers
Get and open URL
Get overall content
Get specific lines
Filter bad lines
Get specific bits of info
Put specific bits in empty list
Add row list to overall list
Load a new file
Write to that new file
Open a spreadsheet that's sortable

Where do I go from here?
Adjust our example
Understand how to gather parts you need
There are many more things you can do with code
Why PDFs are different -- what to do
Use shortcut tools - See NICAR handout w/some examples - https://github.com/kleinmatic/datashow
Role of a tool like Selenium - emulate a Web browser (what does that mean?)
Auto-clicking
Handling JS pages
Understand what's possible - ask community for help if you're having trouble

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages