Scraping HTML
Tim L edited this page Jul 17, 2015 · 46 revisions
::sigh::
First, a nice article about just using the web as an API.
Others' work:
- https://scraperwiki.com
- Python-based parser: BeautifulSoup
- http://web-xslt.googlecode.com/svn/trunk/htmlparse/htmlparse.xsl parses an HTML string into a DOM object.
  - Its functions are in the namespace xmlns:d="data:,dpc".
- See also XSL Crib Sheet, xargs Cheat Sheet
- Scraping can benefit from SDV organization.
This page lists some XSL utility functions that we've developed to scrape HTML:
- html:text - get all of the displayable text within a DOM element's hierarchy.
- html:anchor-labels - get all of the anchors' displayable text, delimited by ||.
- html:anchor-hrefs - get all of the anchors' hrefs, delimited by ||.
- html:parse-value
The following functions help scrape HTML elements into useful strings. They use the following namespace.
xmlns:html="http://www.w3.org/1999/xhtml"
We prefer to just produce a CSV from the HTML, instead of trying to model it in RDF directly. There are much nicer mechanisms in csv2rdf4lod to handle URI creation within the SDV paradigm. We write a row of CSV using the following.
<xsl:value-of select="concat($DQ,string-join((
    $perigee,$apogee,$inclination,$period,$semi-major-axis
  ),
  concat($DQ,',',$DQ)),$DQ,$NL)"/>
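The row-writing expression relies on two variables, $DQ and $NL, that this page does not define. The following is a minimal sketch of how they might be declared, assuming $DQ holds a double quote and $NL a newline; the field values are taken from the DARPA sample row and are only for illustration.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumed definitions: the page uses $DQ and $NL without showing them. -->
  <xsl:variable name="DQ" select="'&quot;'"/>
  <xsl:variable name="NL" select="'&#10;'"/>

  <xsl:template match="/">
    <!-- Illustrative field values (from the DARPA sample row). -->
    <xsl:variable name="org"      select="'Aptima Inc.'"/>
    <xsl:variable name="category" select="'Analytics'"/>
    <xsl:variable name="date"     select="'2014-07'"/>
    <!-- Open the first quote, join the fields with ",", close the last quote, end the line. -->
    <xsl:value-of select="concat($DQ,string-join((
        $org,$category,$date
      ),
      concat($DQ,',',$DQ)),$DQ,$NL)"/>
  </xsl:template>
</xsl:stylesheet>
```

Run with an XSLT 2.0 processor against any well-formed XML input, this writes the line "Aptima Inc.","Analytics","2014-07". The sketch does not escape double quotes inside a field; if the scraped text can contain them, each value would first need its quotes doubled, for example with replace($value,$DQ,concat($DQ,$DQ)).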
http://www.darpa.mil/OpenCatalog/index.html circa Feb 2014
<tr>
  <td>Aptima Inc.</td>
  <td>
    <a href='http://www.darpa.mil/External_Link.aspx?url=https://github.com/Aptima/pattern-matching'>Network
    Query by Example</a>
  </td>
  <td>Analytics</td>
  <td>2014-07</td>
  <td>https://github.com/Aptima/pattern-matching.git</td>
  <td>
    <a href='stats/pattern-matching/index.html'>stats</a>
  </td>
  <td>Hadoop MapReduce-over-Hive based implementation of network
  query by example utilizing attributed network pattern
  matching.</td>
  <td>ALv2</td>
</tr>
http://hcil2.cs.umd.edu/newvarepository/benchmarks.php
html:text
Definition:
<!-- https://github.com/timrdf/csv2rdf4lod-automation/wiki/Scraping-HTML#htmltext -->
<xsl:function name="html:text">
  <xsl:param name="node"/>
  <xsl:variable name="together">
    <xsl:for-each select="$node//text()">
      <xsl:value-of select="normalize-space(.)"/>
    </xsl:for-each>
  </xsl:variable>
  <xsl:value-of select="normalize-space($together)"/>
</xsl:function>
Usage:
<xsl:template match="html:tr">
  <xsl:value-of select="concat(html:text(html:td[1]),$NL)"/>
</xsl:template>
Adding a parameter for a delimiter:
<xsl:function name="html:text">
  <xsl:param name="node"/>
  <xsl:param name="delim"/>
  <xsl:variable name="together">
    <xsl:for-each select="$node//text()">
      <xsl:value-of select="concat(normalize-space(.),$delim)"/>
    </xsl:for-each>
  </xsl:variable>
  <xsl:value-of select="normalize-space($together)"/>
</xsl:function>
Usage:
<xsl:template match="html:tr">
  <xsl:value-of select="concat(html:text(html:td[1],' '),$NL)"/>
</xsl:template>
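The difference between the two versions shows up when a cell mixes text with inline markup. The sketch below copies both definitions from this page into one stylesheet (XSLT 2.0 allows the same function name with different arities) and applies them to a hypothetical cell held in a variable; the cell content and the inline newline literal are assumptions made for the illustration.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:html="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>

  <!-- One-argument html:text, as defined on this page. -->
  <xsl:function name="html:text">
    <xsl:param name="node"/>
    <xsl:variable name="together">
      <xsl:for-each select="$node//text()">
        <xsl:value-of select="normalize-space(.)"/>
      </xsl:for-each>
    </xsl:variable>
    <xsl:value-of select="normalize-space($together)"/>
  </xsl:function>

  <!-- Two-argument html:text, as defined on this page. -->
  <xsl:function name="html:text">
    <xsl:param name="node"/>
    <xsl:param name="delim"/>
    <xsl:variable name="together">
      <xsl:for-each select="$node//text()">
        <xsl:value-of select="concat(normalize-space(.),$delim)"/>
      </xsl:for-each>
    </xsl:variable>
    <xsl:value-of select="normalize-space($together)"/>
  </xsl:function>

  <!-- Hypothetical cell with inline markup: three text nodes in total. -->
  <xsl:variable name="cell">
    <html:td>Hadoop <html:b>MapReduce</html:b> based</html:td>
  </xsl:variable>

  <xsl:template match="/">
    <!-- Each text node is trimmed and then glued to the next: HadoopMapReducebased -->
    <xsl:value-of select="concat(html:text($cell/html:td),'&#10;')"/>
    <!-- A space delimiter keeps the words apart: Hadoop MapReduce based -->
    <xsl:value-of select="concat(html:text($cell/html:td,' '),'&#10;')"/>
  </xsl:template>
</xsl:stylesheet>
```

With a space as the delimiter, the trailing delimiter is removed by the final normalize-space; any other delimiter would remain at the end of the result. An XSLT 2.0 processor is required, since xsl:function is not available in XSLT 1.0.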
Uses:
- July 2014 GRDDL svg (added the delimiter)
- May 27 2014 hcil-cs-umd-edu/visual-analytics-benchmark-repository (same as shown)
- Feb 12 18:14 2014 darpa-mil/open-catalog/src/pubs.xsl (same as shown)
- Feb 12 18:14 2014 darpa-mil/open-catalog/src/software.xsl (same as shown)
- Dec 5 09:26 2013 n2yo-com/satellites/src/html2csv.xsl (shown above)
- Dec 4 13:12 2013 n2yo-com/satellite-categories/src/category2csv.xsl (same as shown)
- Dec 3 16:45 2013 n2yo-com/satellite-categories/src/index2csv.xsl (same as shown)
- Dec 1 19:06 2013 n2yo-com/browse/src/html2csv.xsl (same as shown)
html:anchor-labels
Definition:
<!-- https://github.com/timrdf/csv2rdf4lod-automation/wiki/Scraping-HTML#htmlanchor-labels -->
<xsl:function name="html:anchor-labels">
  <xsl:param name="anchors"/>
  <xsl:variable name="together">
    <xsl:for-each select="$anchors">
      <xsl:if test="position() gt 1">
        <xsl:value-of select="'||'"/>
      </xsl:if>
      <xsl:value-of select="normalize-space(.)"/>
    </xsl:for-each>
  </xsl:variable>
  <xsl:value-of select="normalize-space($together)"/>
</xsl:function>
Uses:
- Feb 12 18:14 2014 darpa-mil/open-catalog/src/software.xsl (same as shown)
- Dec 5 09:26 2013 n2yo-com/satellites/src/html2csv.xsl (shown above)
html:anchor-hrefs
Definition:
<!-- https://github.com/timrdf/csv2rdf4lod-automation/wiki/Scraping-HTML#htmlanchor-hrefs -->
<xsl:function name="html:anchor-hrefs">
  <xsl:param name="anchors"/>
  <xsl:param name="base"/>
  <xsl:variable name="together">
    <xsl:for-each select="$anchors">
      <xsl:if test="position() gt 1">
        <xsl:value-of select="'||'"/>
      </xsl:if>
      <xsl:value-of select="concat($base,normalize-space(@href))"/>
    </xsl:for-each>
  </xsl:variable>
  <xsl:value-of select="normalize-space($together)"/>
</xsl:function>
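No usage is shown for the two anchor functions, so here is a hedged sketch of how they might be called. It copies both definitions from this page and applies them to a trimmed copy of the DARPA row held in a variable; the $row variable, the inline newline literal, and the choice of base URL (the directory of the catalog page) are assumptions made for the illustration.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:html="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>

  <!-- html:anchor-labels, as defined on this page. -->
  <xsl:function name="html:anchor-labels">
    <xsl:param name="anchors"/>
    <xsl:variable name="together">
      <xsl:for-each select="$anchors">
        <xsl:if test="position() gt 1">
          <xsl:value-of select="'||'"/>
        </xsl:if>
        <xsl:value-of select="normalize-space(.)"/>
      </xsl:for-each>
    </xsl:variable>
    <xsl:value-of select="normalize-space($together)"/>
  </xsl:function>

  <!-- html:anchor-hrefs, as defined on this page. -->
  <xsl:function name="html:anchor-hrefs">
    <xsl:param name="anchors"/>
    <xsl:param name="base"/>
    <xsl:variable name="together">
      <xsl:for-each select="$anchors">
        <xsl:if test="position() gt 1">
          <xsl:value-of select="'||'"/>
        </xsl:if>
        <xsl:value-of select="concat($base,normalize-space(@href))"/>
      </xsl:for-each>
    </xsl:variable>
    <xsl:value-of select="normalize-space($together)"/>
  </xsl:function>

  <!-- Trimmed copy of the DARPA sample row; normally this comes from the source document. -->
  <xsl:variable name="row">
    <html:tr>
      <html:td><html:a href="http://www.darpa.mil/External_Link.aspx?url=https://github.com/Aptima/pattern-matching">Network
        Query by Example</html:a></html:td>
      <html:td><html:a href="stats/pattern-matching/index.html">stats</html:a></html:td>
    </html:tr>
  </xsl:variable>

  <xsl:template match="/">
    <!-- Labels of both anchors: Network Query by Example||stats -->
    <xsl:value-of select="concat(html:anchor-labels($row//html:a),'&#10;')"/>
    <!-- An empty base leaves every href exactly as written in the page. -->
    <xsl:value-of select="concat(html:anchor-hrefs($row//html:a,''),'&#10;')"/>
    <!-- A base is only safe for relative hrefs, so select those explicitly. -->
    <xsl:value-of select="concat(html:anchor-hrefs($row//html:a[not(contains(@href,'://'))],
                                 'http://www.darpa.mil/OpenCatalog/'),'&#10;')"/>
  </xsl:template>
</xsl:stylesheet>
```

Because html:anchor-hrefs prepends $base by plain concatenation, passing a base together with an absolute href would produce a malformed URL. When a page mixes relative and absolute links, the standard XPath 2.0 function resolve-uri(@href, $base) is one possible alternative to the concat call.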
Uses:
- Aug 24 2014 freeformatter-com/mime-types-list/src/html2csv.xsl
- May 27 2014 hcil-cs-umd-edu/visual-analytics-benchmark-repository (same as shown)
- Feb 12 18:14 2014 darpa-mil/open-catalog/src/pubs.xsl (same as shown)
- Feb 12 18:14 2014 darpa-mil/open-catalog/src/software.xsl (same as shown)
- Dec 5 09:26 2013 n2yo-com/satellites/src/html2csv.xsl (shown above)
- Dec 4 13:12 2013 n2yo-com/satellite-categories/src/category2csv.xsl (same as shown)
- Dec 3 16:45 2013 n2yo-com/satellite-categories/src/index2csv.xsl (same as shown)
html:parse-value
Uses:
- n2yo-com/satellites/src/html2csv.xsl
html:capitalize
Definition:
<xsl:function name="html:capitalize">
  <xsl:param name="string"/>
  <xsl:value-of select="concat(upper-case(substring($string,1,1)),
                               substring($string, 2))"/>
</xsl:function>
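A small sketch of html:capitalize in use, with the definition copied from this page and literal input strings chosen only for illustration:

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:html="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>

  <!-- html:capitalize, as defined on this page. -->
  <xsl:function name="html:capitalize">
    <xsl:param name="string"/>
    <xsl:value-of select="concat(upper-case(substring($string,1,1)),
                                 substring($string, 2))"/>
  </xsl:function>

  <xsl:template match="/">
    <!-- Only the first character is upper-cased: Analytics -->
    <xsl:value-of select="concat(html:capitalize('analytics'),'&#10;')"/>
    <!-- The rest of the string is left untouched: Query by example -->
    <xsl:value-of select="concat(html:capitalize('query by example'),'&#10;')"/>
  </xsl:template>
</xsl:stylesheet>
```

An empty string comes back as an empty string, since both substring calls return nothing to upper-case or append.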