Rules definitions of scraping intervals priorities
Krzysiek Madejski edited this page May 4, 2018
·
1 revision
Used by the Continuous Archiving feature.
Use cases of scraping priorities:
- scrape one portal more often because it is more important
- scrape the homepage often because it probably changes often
- scrape more often the pages that change often (self-adaptive)
- scrape HTML more often because it can point to new content
- the bigger the resource, the lower the scraping frequency
- ...
Attributes a rule can take into account:
- portal (each portal might have a priority)
- resource groups/tags (e.g. homepages)
- media type (HTML, PNG, etc.) [dictionary]
- size [kB]
- change frequency [1/hour] (this indicator needs to be calculated)
Scheduling with priority queues:
- Divide objects into queues by priority (integer priorities starting from 1)
- Objects that have not been scraped yet should go into a separate queue sorted by priority
- Within each priority queue, take first the objects that were scraped the longest ago
- Take items from the queues with a frequency proportional to priority (e.g. priority 10 should be scraped 10 times more often than priority 1)
- Refill queues when empty
Output: queues of items to be processed
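The queue rules above can be sketched in Python. This is only an illustration under stated assumptions, not the project's implementation: the `Item` class, its field names, and the `schedule` function are hypothetical, never-scraped objects are assumed to be handled before anything else, and "priority p is scraped p times more often" is approximated by giving the priority-p queue p slots per round.

```python
from collections import deque
from dataclasses import dataclass
from itertools import islice
from typing import Iterable, Iterator, Optional


@dataclass
class Item:  # hypothetical record, not taken from the project
    url: str
    priority: int                  # integer priority, 1 = lowest
    last_scraped: Optional[float]  # Unix timestamp, None = never scraped


def schedule(items: Iterable[Item]) -> Iterator[Item]:
    """Yield items in scraping order (endless once any item exists)."""
    items = list(items)
    # Separate queue for never-scraped objects, highest priority first.
    new_items = sorted((i for i in items if i.last_scraped is None),
                       key=lambda i: -i.priority)
    # Members of each priority queue, earliest-scraped first.
    members = {}
    for item in sorted((i for i in items if i.last_scraped is not None),
                       key=lambda i: i.last_scraped):
        members.setdefault(item.priority, []).append(item)

    # Assumption: new objects are scraped before the regular rounds start.
    for item in new_items:
        yield item
        members.setdefault(item.priority, []).append(item)  # now most recent

    if not members:
        return
    queues = {p: deque() for p in members}
    while True:
        for priority in sorted(queues, reverse=True):
            for _ in range(priority):  # priority p gets p slots per round
                if not queues[priority]:
                    queues[priority].extend(members[priority])  # refill
                yield queues[priority].popleft()


if __name__ == "__main__":
    demo = [Item("a", 2, 100.0), Item("b", 2, 50.0),
            Item("c", 1, 10.0), Item("d", 2, None)]
    print([i.url for i in islice(schedule(demo), 7)])
    # ['d', 'b', 'a', 'c', 'd', 'b', 'c']
```

In the demo, the never-scraped item `d` goes first, then each round takes two items from the priority-2 queue for every one item from the priority-1 queue. Priorities are assumed to be at least 1, as stated in the list above.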
Calculating the priority of an object:
- take a basic default priority, e.g. 100
- multiply by the portal priority (0-5?) (note: the interface should have a log scale)
- multiply by the media type priority (0-10?)
- take size into account by multiplying by a size factor:
  - < 500 kB => 1
  - 500 kB - 5 MB => 1/4
  - 5 - 50 MB => 1/8
  - > 50 MB => 1/16
- multiply by the individual object priority (based on group/tag)
- cast to integer
Output: a simple efficient SQL statement
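As a rough sketch, the calculation above could look like this in Python, together with an equivalent SQL statement run through the standard `sqlite3` module. The `objects` table, its column names, and the function names are hypothetical; the sketch also assumes 1 MB = 1000 kB and that each size boundary belongs to the larger bracket, neither of which the list specifies.

```python
import sqlite3

BASE_PRIORITY = 100  # the "basic default priority" from the list above


def size_factor(size_kb: float) -> float:
    """Size multiplier; boundary handling is an assumption of this sketch."""
    if size_kb < 500:
        return 1.0
    if size_kb < 5_000:      # 500 kB - 5 MB
        return 1 / 4
    if size_kb < 50_000:     # 5 - 50 MB
        return 1 / 8
    return 1 / 16            # > 50 MB


def compute_priority(portal_priority: float, media_priority: float,
                     size_kb: float, object_priority: float = 1.0) -> int:
    """Multiply all factors and cast the result to an integer."""
    return int(BASE_PRIORITY * portal_priority * media_priority
               * size_factor(size_kb) * object_priority)


# The same rule as one SQL statement over a hypothetical `objects` table.
PRIORITY_SQL = """
SELECT url,
       CAST(100 * portal_priority * media_priority * object_priority *
            CASE WHEN size_kb < 500   THEN 1.0
                 WHEN size_kb < 5000  THEN 0.25
                 WHEN size_kb < 50000 THEN 0.125
                 ELSE 0.0625 END AS INTEGER) AS priority
FROM objects
ORDER BY priority DESC, url
"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE objects (url TEXT, portal_priority REAL, "
                 "media_priority REAL, object_priority REAL, size_kb REAL)")
    conn.executemany("INSERT INTO objects VALUES (?, ?, ?, ?, ?)",
                     [("a", 2, 10, 1, 1200), ("b", 1, 1, 1, 60000)])
    print(conn.execute(PRIORITY_SQL).fetchall())  # [('a', 500), ('b', 6)]
    print(compute_priority(2, 10, 1200))          # 500
```

Note that a portal or media type priority of 0 makes the whole product 0, which would effectively exclude the object from scraping.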
TODO Feature: scrape more often the pages that change often
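One possible way to calculate the change-frequency indicator (in 1/hours) is to count how often the content differed between consecutive scrapes and divide by the observed time span. The sketch below is an assumption about how this could be done, not a design taken from the project; the `change_frequency` function and the `(timestamp, content_hash)` observation format are hypothetical.

```python
from datetime import datetime, timedelta


def change_frequency(observations):
    """Estimate changes per hour.

    `observations` is a list of (datetime, content_hash) pairs sorted
    oldest first, one pair per scrape of the same object.
    """
    if len(observations) < 2:
        return 0.0  # not enough data to estimate anything
    # Count consecutive scrapes whose content hash differs.
    changes = sum(1 for (_, old), (_, new)
                  in zip(observations, observations[1:]) if old != new)
    span = observations[-1][0] - observations[0][0]
    hours = span.total_seconds() / 3600
    return changes / hours if hours > 0 else 0.0


if __name__ == "__main__":
    t0 = datetime(2018, 5, 4, 12, 0)
    history = [(t0, "h1"),
               (t0 + timedelta(hours=2), "h2"),   # content changed
               (t0 + timedelta(hours=4), "h2")]   # unchanged
    print(change_frequency(history))  # 0.25
```

This estimate is a lower bound: several changes between two scrapes are counted as one, so rarely scraped pages will appear to change less often than they really do.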