
Rules for defining scraping interval priorities


Used by the Continuous Archiving feature.

Use cases

Use cases of scraping priorities:

  • scrape one portal more often because it is more important
  • scrape homepages often because they probably change often
  • scrape pages that change often more often (self-adaptive)
  • scrape HTML more often because it can point to new content
  • the bigger the resource, the lower the scraping frequency
  • ...

Metadata influencing priorities (based on use cases)

  • portal (each portal may have a priority)
  • resource groups/tags (e.g. homepages)
  • media type (html, png, etc.) [dictionary]
  • size [kB]
  • change frequency [1/hour] (this indicator needs to be calculated)
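
A minimal sketch of what a per-resource metadata record holding these fields could look like; the class and field names are illustrative assumptions, not the project's actual schema:

```python
# Illustrative per-resource metadata record; names are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ResourceMetadata:
    url: str
    portal_priority: int                              # per-portal weight, e.g. 0-5
    tags: List[str] = field(default_factory=list)     # resource groups/tags, e.g. ["homepage"]
    media_type: str = "text/html"                     # dictionary value (html, png, ...)
    size_kb: Optional[int] = None                     # last known size in kB
    changes_per_hour: Optional[float] = None          # calculated change-frequency indicator
```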

Proposed Algorithm

Queuing items

  1. Divide objects into queues by priority (integer values starting from 1)
    • objects that have not been scraped yet should have a separate queue sorted by priority
  2. Within each priority queue, put first the objects that were scraped the longest ago
  3. Take items from the queues with a frequency proportional to their priority (e.g. priority 10 should be scraped 10 times more often than priority 1)
  4. Refill the queues when they are empty

Output: queues of items to be processed
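
A rough sketch of this queuing step, assuming items are dicts carrying a `priority` (integer >= 1) and a `last_scraped` timestamp; all names are illustrative. The separate queue for never-scraped items is folded in here by sorting them to the front of their priority queue:

```python
from collections import defaultdict


def build_queues(items):
    """Group items by priority; inside each queue, never-scraped items
    come first, then items scraped the longest ago."""
    queues = defaultdict(list)
    for item in items:
        queues[item["priority"]].append(item)
    for queue in queues.values():
        queue.sort(key=lambda i: i.get("last_scraped") or 0)
    return queues


def next_batch(queues, batch_size):
    """Take items from each queue proportionally to its priority,
    e.g. priority 10 contributes 10 items for every 1 of priority 1."""
    total = sum(queues) or 1          # sum of the priority values present
    batch = []
    for priority in sorted(queues, reverse=True):
        share = max(1, round(batch_size * priority / total))
        batch.extend(queues[priority][:share])
        del queues[priority][:share]  # queues are refilled separately when empty
    return batch[:batch_size]
```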

Counting priority

  • take a basic default priority, e.g. 100
  • multiply by the portal priority (0-5?) (note: the interface should use a log scale)
  • multiply by the media type priority (0-10?)
  • take size into account:
    • < 500 kB => 1
    • 500 kB - 5 MB => 1/4
    • 5 MB - 50 MB => 1/8
    • > 50 MB => 1/16

  • multiply by the individual object priority (based on group/tag)
  • cast to integer

Output: a simple, efficient SQL statement
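
The page envisions this calculation as a SQL statement; the sketch below only shows the multiplication chain in Python, with the base value, ranges and factors taken from the list above and everything else assumed:

```python
def size_factor(size_kb):
    """Map resource size to the damping factor from the table above."""
    if size_kb < 500:
        return 1.0
    if size_kb < 5_000:       # 500 kB - 5 MB
        return 1 / 4
    if size_kb < 50_000:      # 5 MB - 50 MB
        return 1 / 8
    return 1 / 16             # > 50 MB


def count_priority(portal_priority, media_type_priority, object_priority,
                   size_kb, base=100):
    """base * portal * media type * size factor * per-object priority, cast to integer."""
    return int(base
               * portal_priority
               * media_type_priority
               * size_factor(size_kb)
               * object_priority)
```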

Notes

TODO Feature: scrape pages that change often more often