
Rules for defining scraping interval priorities


Used by the Continuous Archiving feature.

Use cases

Use cases of scraping priorities:

  • scrape one portal more often because it is more important
  • scrape homepages often because they probably change often
  • scrape pages that change often more often (self-adaptive)
  • scrape HTML more often because it can point to new content
  • the bigger the resource, the lower the scraping frequency
  • ...

Metadata influencing priorities (based on use cases)

  • portal (each portal may have a priority)
  • resource groups/tags (e.g. homepages)
  • media type (html, png, etc.) [dictionary]
  • size [kB]
  • change frequency [1/hour] (this indicator needs to be calculated)
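
A minimal sketch of what a per-resource metadata record holding these fields could look like; the class and field names are illustrative assumptions, not the project's actual schema:

```python
# Illustrative per-resource metadata record; names are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ResourceMetadata:
    url: str
    portal_priority: int                              # per-portal weight, e.g. 0-5
    tags: List[str] = field(default_factory=list)     # resource groups/tags, e.g. ["homepage"]
    media_type: str = "text/html"                     # dictionary value (html, png, ...)
    size_kb: Optional[int] = None                     # last known size in kB
    changes_per_hour: Optional[float] = None          # calculated change-frequency indicator
```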

Proposed Algorithm

Queuing items

  1. Divide objects into queues by priority (integer values starting from 1)
    • objects that have not been scraped yet should have a separate queue sorted by priority
  2. Within each priority queue, put first the objects that were scraped the longest ago
  3. Take items from the queues with a frequency proportional to their priority (e.g. priority 10 should be scraped 10 times more often than priority 1)
  4. Refill the queues when they are empty

Output: queues of items to be processed
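
A rough sketch of this queuing step, assuming items are dicts carrying a `priority` (integer >= 1) and a `last_scraped` timestamp; all names are illustrative. The separate queue for never-scraped items is folded in here by sorting them to the front of their priority queue:

```python
from collections import defaultdict


def build_queues(items):
    """Group items by priority; inside each queue, never-scraped items
    come first, then items scraped the longest ago."""
    queues = defaultdict(list)
    for item in items:
        queues[item["priority"]].append(item)
    for queue in queues.values():
        queue.sort(key=lambda i: i.get("last_scraped") or 0)
    return queues


def next_batch(queues, batch_size):
    """Take items from each queue proportionally to its priority,
    e.g. priority 10 contributes 10 items for every 1 of priority 1."""
    total = sum(queues) or 1          # sum of the priority values present
    batch = []
    for priority in sorted(queues, reverse=True):
        share = max(1, round(batch_size * priority / total))
        batch.extend(queues[priority][:share])
        del queues[priority][:share]  # queues are refilled separately when empty
    return batch[:batch_size]
```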

Counting priority

  • take a basic default priority, e.g. 100
  • multiply by the portal priority (0-5?) (note: the interface should use a log scale)
  • multiply by the media type priority (0-10?)
  • take size into account:
    • < 500 kB => 1
    • 500 kB - 5 MB => 1/4
    • 5 MB - 50 MB => 1/8
    • > 50 MB => 1/16

  • multiply by the individual object priority (based on group/tag)
  • cast to integer

Output: a simple, efficient SQL statement
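
The page envisions this calculation as a SQL statement; the sketch below only shows the multiplication chain in Python, with the base value, ranges and factors taken from the list above and everything else assumed:

```python
def size_factor(size_kb):
    """Map resource size to the damping factor from the table above."""
    if size_kb < 500:
        return 1.0
    if size_kb < 5_000:       # 500 kB - 5 MB
        return 1 / 4
    if size_kb < 50_000:      # 5 MB - 50 MB
        return 1 / 8
    return 1 / 16             # > 50 MB


def count_priority(portal_priority, media_type_priority, object_priority,
                   size_kb, base=100):
    """base * portal * media type * size factor * per-object priority, cast to integer."""
    return int(base
               * portal_priority
               * media_type_priority
               * size_factor(size_kb)
               * object_priority)
```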

Notes

TODO Feature: scrape pages that change often more often