modelscraper

A webscraper which allows re-usage of components from other scrapers

By creating a model of the website you want to scrape, parts of that model can be reused for other websites, or the whole model can be adapted to them. The advantage is that scrapers do not have to be written from scratch for each website, and that data from different sources that needs to be grouped together arrives in the same format.
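To illustrate the idea of reusable components (this is a conceptual sketch with hypothetical plain-Python classes, not modelscraper's actual API), an attribute defined once can be shared between the models of two different sites, so both emit records with the same fields:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins for reusable scraper components; the real
# modelscraper classes (Attr, Template) follow the same pattern.
@dataclass
class Attr:
    name: str
    extract: Callable[[dict], str]

@dataclass
class Model:
    site: str
    attrs: list

# A 'title' attribute defined once...
title = Attr(name='title', extract=lambda page: page['title'])

# ...is reused by the models of two different websites, so both
# produce records with identical field names.
blog = Model(site='blog.example.com', attrs=[title])
shop = Model(site='shop.example.com', attrs=[title])

def scrape(model, page):
    # Apply every attribute's extractor to the (already fetched) page.
    return {a.name: a.extract(page) for a in model.attrs}

print(scrape(blog, {'title': 'Hello'}))  # {'title': 'Hello'}
```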

An example that scrapes the search results from DuckDuckGo for the query "example":

from modelscraper.components import Template, Attr, Scraper
from modelscraper.sources import WebSource
from modelscraper.databases import CSV
from modelscraper.parsers import HTMLParser

# Store the scraped results in a CSV database.
db = CSV(db='duckduckgo', table='search_results')
htmlp = HTMLParser()

# Source for the first results page; the next-page source reuses its
# session and POSTs the form data emitted by the next_page template below.
results_source = WebSource(name='result', urls=['https://duckduckgo.com/html?q=example'])
next_pages_source = WebSource(name='next_page', session=results_source.session,
                              func='post', duplicate=True)

# Attributes to extract from each search result.
url = Attr(name='url', func=htmlp.url(selector='a'))
title = Attr(name='title', func=htmlp.text(selector='h2'))
snippet = Attr(name='snippet', func=htmlp.text(selector='.result__snippet'))

# One search result per '.result' element on every fetched page.
search_result = Template(
    name='search_result',
    source=[results_source, next_pages_source],
    database=db,
    selector=htmlp.select('.result'),
    attrs=[url, title, snippet])

# Hidden form fields needed to request the next page of results.
input_fields = ['q', 's', 'nextParams', 'v', 'o', 'dc', 'api', 'kl']

# Reads the "Next" form and emits its values to the next-page source,
# driving the pagination.
next_page = Template(
    name='next_page',
    source=[results_source, next_pages_source],
    selector=htmlp.select('//input[@value="Next"]/..'),
    emits=next_pages_source,
    attrs=[
        Attr(name='url', value='https://duckduckgo.com/html'),
        *[Attr(name=field, func=htmlp.attr(selector='input[name="'+field+'"]',
                                            attr='value'))
          for field in input_fields]]
)

scraper = Scraper(templates=[search_result, next_page])
scraper.start()

To explain what is going on here, some concepts need to be introduced.
