= Scrapes
Scrapes is a framework for crawling and scraping multi-page web sites.
Unlike other scraping frameworks, Scrapes is designed to work with "dirty"
web sites, that is, web sites that were not designed to have their data
extracted programmatically.
It includes features for both the initial development of a scraper and the
continued maintenance of that scraper. These features include:
* Rule-based selection and extraction of data that can use CSS selectors or
  pseudo-XPath expressions
* Caching system so that during development you don't have to continuously
  download pages from a web server while you experiment with your selectors
  and extractors
* Validation system that helps detect web site changes that would
  otherwise invalidate your extraction rules
* Support for initiating a session with the web server, and passing session
  cookies back to the web server
* When all else fails, you can run a web page through the xsltproc XSLT
  processor to generate an XML document that can then be run through your
  rule-based parser
* Useful set of post-processing methods such as normalize_name
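The caching idea from the list above can be pictured with a small plain-Ruby sketch. This is not how Scrapes implements its cache; the PageCache class, its fetch method, and the on-disk layout are all assumptions made up for illustration. The point is only that a page body is downloaded once and then read back from disk on later runs.

```ruby
require 'digest/md5'
require 'fileutils'
require 'tmpdir'

# Hypothetical helper, not part of Scrapes: stores page bodies on disk,
# keyed by an MD5 digest of the URL.
class PageCache
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(@dir)
  end

  # Returns the cached body for url if there is one; otherwise runs the
  # block to fetch it and stores the result for next time.
  def fetch(url)
    path = File.join(@dir, Digest::MD5.hexdigest(url))
    return File.read(path) if File.exist?(path)
    body = yield(url)
    File.write(path, body)
    body
  end
end

downloads = 0
cache = PageCache.new(Dir.mktmpdir('page-cache'))

# The block stands in for a real HTTP download.
2.times do
  cache.fetch('http://example.com/') do |url|
    downloads += 1
    "<html>stub for #{url}</html>"
  end
end

puts downloads # the block ran only once; the second call hit the cache
```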
== Installing Scrapes
  gem install scrapes --include-dependencies
== Dependencies
* Hpricot: http://code.whytheluckystiff.net/hpricot/wiki/AnHpricotShowcase
* Rextra: http://rubyforge.org/projects/rextra2/
== Quick Start
You start by writing a class for parsing a single page:
  # process the Google.com index.html page
  class GoogleMain < Scrapes::Page
    # make sure that the :about_link rule matched the web page
    validates_presence_of(:about_link)

    # extract the link to the about page
    rule(:about_link, 'a[@href*="about"]', '@href', 1)
  end

  # process the Google.com about page
  class GoogleAbout < Scrapes::Page
    # ensure the :title rule below matches the web page
    validates_presence_of(:title)

    # extract the text inside the <title></title> tag
    rule(:title, 'title', 'text()', 1)
  end
Then you start a scraping session and use those classes to process the web
site:
  Scrapes::Session.start do |session|
    session.page(GoogleMain, 'http://google.com') do |main_page|
      session.page(GoogleAbout, main_page.about_link) do |about_page|
        puts about_page.title + ': ' + session.absolute_uri(main_page.about_link)
      end
    end
  end
On my machine, this code produces:
  About Google: http://www.google.com/intl/en/about.html
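The session code prints an absolute URI even though a link extracted from a page is often relative. How session.absolute_uri works internally is not shown here, so the following is only a sketch of the general idea using Ruby's standard URI library. The base URL and the relative href are assumed values chosen to match the output shown.

```ruby
require 'uri'

# Assumed values: the page the link was found on, and a relative href
# such as a rule might extract from it.
base = 'http://www.google.com/'
href = '/intl/en/about.html'

# URI.join resolves the href against the base, as a browser would.
absolute = URI.join(base, href).to_s
puts absolute # prints http://www.google.com/intl/en/about.html
```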
For more information, please review the following classes:
* Scrapes::Session
* Scrapes::Page
* Scrapes::RuleParser
* Scrapes::Hpricot::Extractors
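To see why validates_presence_of is useful for maintenance, here is a stripped-down imitation in plain Ruby. It is not the Scrapes API: MiniPage, its regular-expression rules, and the missing and valid? methods are hypothetical stand-ins. The sketch only shows how a required rule that stops matching can flag a changed web site.

```ruby
# Hypothetical stand-in for the rule/validation idea; not the Scrapes API.
# Rules here are regular expressions with one capture group, not selectors.
class MiniPage
  class << self
    def rules
      @rules ||= {}
    end

    def required
      @required ||= []
    end

    # Declare a named rule and a reader for its extracted value.
    def rule(name, pattern)
      rules[name] = pattern
      attr_reader name
    end

    # Mark rules whose values must be present for the page to be valid.
    def validates_presence_of(*names)
      required.concat(names)
    end
  end

  def initialize(html)
    self.class.rules.each do |name, pattern|
      match = pattern.match(html)
      instance_variable_set("@#{name}", match && match[1])
    end
  end

  # Names of required rules that did not match the page.
  def missing
    self.class.required.select { |name| send(name).nil? }
  end

  def valid?
    missing.empty?
  end
end

class AboutPage < MiniPage
  validates_presence_of :title
  rule :title, %r{<title>(.*?)</title>}m
end

ok      = AboutPage.new('<html><title>About Google</title></html>')
changed = AboutPage.new('<html><h1>About Google</h1></html>')

puts ok.title                # prints About Google
puts ok.valid?               # prints true
puts changed.missing.inspect # prints [:title]
```

When the markup changes and the title rule no longer matches, the page reports the missing rule instead of silently returning nil.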
== Development Tips
=== Add something like this to your .irbrc:
  require 'rubygems'
  require 'yaml'
  require 'open-uri'
  require 'hpricot'
  require 'scrapes'

  def h(url)
    Hpricot(open(url))
  end
Then use it like this in irb to understand how Hpricot selectors work:

  doc = h 'http://www.foobar.com/'
  links = doc.search('table/a[@href]') # for example

To understand the text extractors:

  texts(links)
  word(links.first) # etc.
=== Converting normal XPath to Hpricot XPath, sort of:
There are various add-ons for Firefox, for example, that display the XPath to a
selected node. Hpricot, however, uses a different syntax
(http://code.whytheluckystiff.net/hpricot/wiki/SupportedXpathExpressions). The
following method is a first try at the conversion:
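  def xpath_to_hpricot(path)
    # drop empty segments and the html/tbody steps that browsers report
    path.split('/').reject { |e| e.empty? || e =~ /^(html|tbody)$/ }.map do |e|
      # XPath positions are 1-based, Hpricot's :eq() is 0-based; only the
      # bracketed index is rewritten, so a tag name like h1 is left alone
      e.sub(/\[(\d+)\]/) { ":eq(#{$1.to_i - 1})" }
    end.join('//')
  end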
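As a quick check of what the conversion produces, the snippet below runs it on a path of the kind a browser add-on might report. The sample path is made up. The method is restated with core Ruby only so the snippet runs on its own, and it rewrites just the bracketed index so that a digit in a tag name such as h1 is not touched.

```ruby
# Self-contained restatement of the conversion, core Ruby only.
def xpath_to_hpricot(path)
  path.split('/').reject { |e| e.empty? || e =~ /^(html|tbody)$/ }.map do |e|
    # shift the 1-based XPath position to a 0-based :eq() index
    e.sub(/\[(\d+)\]/) { ":eq(#{$1.to_i - 1})" }
  end.join('//')
end

# Made-up example path, as a browser add-on might display it.
puts xpath_to_hpricot('/html/body/div[2]/table/tbody/tr[1]/td[3]/a')
# prints body//div:eq(1)//table//tr:eq(0)//td:eq(2)//a

puts xpath_to_hpricot('/html/body/h1[2]')
# prints body//h1:eq(1)
```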
=== Hpricot bugs
* The selector 'a[href="this"]' will hang, while 'a[@href="this"]' won't.
  Always put the '@' in front of the attribute name.
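One way to avoid tripping over this is to normalize selectors before handing them to Hpricot. The helper below is a hypothetical sketch, not part of Hpricot or Scrapes. It only handles simple attribute tests and may be confused by quoted values that themselves contain square brackets.

```ruby
# Hypothetical helper, not part of Hpricot or Scrapes: adds the '@' prefix
# to bare attribute names inside [...] and leaves everything else alone.
def add_at_prefix(selector)
  selector.gsub(/\[(?!@)([A-Za-z_][\w-]*)(?=[~|^$*!]?=|\])/) { "[@#{$1}" }
end

puts add_at_prefix('a[href="this"]')    # prints a[@href="this"]
puts add_at_prefix('a[@href*="about"]') # already prefixed, printed unchanged
puts add_at_prefix('td:eq(2)')          # no attribute test, printed unchanged
```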
== Credits
* Peter Jones, author and maintainer
* Michael Garriss, author and maintainer
* Bob Showalter, continuous improvements and maintenance
* Assaf Arkin, rule inspiration from http://trac.labnotes.org/cgi-bin/trac.cgi/wiki/Ruby/MicroformatParser