modelscraper

A webscraper which allows re-usage of components from other scrapers

By creating a model of the website you want to scrape, parts of that model can be reused for other websites, or the whole model can be adapted to them. The advantage is that scrapers do not have to be written from scratch for each website, and that data from different sources that needs to be grouped together arrives in the same format.
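To illustrate the idea of reusable components (this is a conceptual sketch with hypothetical plain-Python classes, not modelscraper's actual API), an attribute defined once can be shared between the models of two different sites, so both emit records with the same fields:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins for reusable scraper components; the real
# modelscraper classes (Attr, Template) follow the same pattern.
@dataclass
class Attr:
    name: str
    extract: Callable[[dict], str]

@dataclass
class Model:
    site: str
    attrs: list

# A 'title' attribute defined once...
title = Attr(name='title', extract=lambda page: page['title'])

# ...is reused by the models of two different websites, so both
# produce records with identical field names.
blog = Model(site='blog.example.com', attrs=[title])
shop = Model(site='shop.example.com', attrs=[title])

def scrape(model, page):
    # Apply every attribute's extractor to the (already fetched) page.
    return {a.name: a.extract(page) for a in model.attrs}

print(scrape(blog, {'title': 'Hello'}))  # {'title': 'Hello'}
```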

An example that scrapes the search results from DuckDuckGo for the query "example":

from modelscraper.components import Template, Attr, Scraper
from modelscraper.sources import WebSource
from modelscraper.databases import CSV
from modelscraper.parsers import HTMLParser

# Store the scraped results in a CSV database.
db = CSV(db='duckduckgo', table='search_results')
htmlp = HTMLParser()

# Source for the first results page; the next-page source reuses its
# session and POSTs the form data emitted by the next_page template below.
results_source = WebSource(name='result', urls=['https://duckduckgo.com/html?q=example'])
next_pages_source = WebSource(name='next_page', session=results_source.session,
                              func='post', duplicate=True)

# Attributes to extract from each search result.
url = Attr(name='url', func=htmlp.url(selector='a'))
title = Attr(name='title', func=htmlp.text(selector='h2'))
snippet = Attr(name='snippet', func=htmlp.text(selector='.result__snippet'))

# One search result per '.result' element on every fetched page.
search_result = Template(
    name='search_result',
    source=[results_source, next_pages_source],
    database=db,
    selector=htmlp.select('.result'),
    attrs=[url, title, snippet])

# Hidden form fields needed to request the next page of results.
input_fields = ['q', 's', 'nextParams', 'v', 'o', 'dc', 'api', 'kl']

# Reads the "Next" form and emits its values to the next-page source,
# driving the pagination.
next_page = Template(
    name='next_page',
    source=[results_source, next_pages_source],
    selector=htmlp.select('//input[@value="Next"]/..'),
    emits=next_pages_source,
    attrs=[
        Attr(name='url', value='https://duckduckgo.com/html'),
        *[Attr(name=field, func=htmlp.attr(selector='input[name="'+field+'"]',
                                            attr='value'))
          for field in input_fields]]
)

scraper = Scraper(templates=[search_result, next_page])
scraper.start()

To explain what is going on here, some concepts need to be introduced.
