Simplify logic to list what needs to be retrieved #33

benoit74 · 2022-04-15T12:01:48Z

To keep the logic simple, for every type of item (category, guide, wiki, item, user, team, ...) we will always:

retrieve the list of items to scrape from the API, not from already fetched items (i.e. the list of guides to retrieve will come from the API, not from the guides found while scraping categories); this ensures that we grab all known items, even if an error occurs while processing another item
allow a by-pass for every item type by specifying on the command line the exact items to be processed
not care about non-browsability (i.e. if you ask to retrieve a sub-category while its parent is not scraped, we don't care); browsability is ensured anyway by the full-text search and the future sitemap (see "Generate a sitemap, accessible from the homepage" #12)
handle "unexpected" items separately, i.e. items we discover while scraping other items: we retrieve them too, and list them for reporting at the end of the scrape
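
To make the four rules above concrete, here is a minimal sketch of how they could fit together. It is not the code from the linked pull request: `plan_items`, `scrape_all`, the `api_lists` / `cli_lists` dictionaries and the `scrape_one` callback are all hypothetical names chosen for illustration, and items are assumed to be identified by a `(type, id)` pair.

```python
from collections import deque


def plan_items(api_ids, cli_ids=None):
    """Return the IDs to scrape for one item type.

    Hypothetical helper: the command-line by-pass wins when given,
    otherwise the full list reported by the API is used.
    """
    source = cli_ids if cli_ids else api_ids
    return list(dict.fromkeys(source))  # de-duplicate, keep order


def scrape_all(api_lists, scrape_one, cli_lists=None):
    """Scrape every planned item, then any item discovered on the way.

    api_lists / cli_lists: {item_type: [ids]} (assumed shapes).
    scrape_one(item_type, item_id): assumed to return an iterable of
    (item_type, item_id) references found inside the scraped item.
    """
    cli_lists = cli_lists or {}
    planned = {
        item_type: plan_items(ids, cli_lists.get(item_type))
        for item_type, ids in api_lists.items()
    }
    queue = deque((t, i) for t, ids in planned.items() for i in ids)
    seen = set(queue)
    unexpected, failed = [], []

    while queue:
        key = queue.popleft()
        try:
            refs = scrape_one(*key)
        except Exception:
            # An error on one item must not prevent the others from
            # being scraped: record it and move on.
            failed.append(key)
            continue
        for ref in refs:
            if ref not in seen:
                # Not in any planned list: an "unexpected" item.
                # It is still retrieved, and kept for the final report.
                seen.add(ref)
                unexpected.append(ref)
                queue.append(ref)

    return {"planned": planned, "unexpected": unexpected, "failed": failed}
```

Because the planned lists never depend on what other items link to, a failure while scraping a category cannot hide a guide; links only matter for the "unexpected" report. Note that nothing here checks whether a sub-category's parent is scraped, in line with the browsability rule.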

benoit74 mentioned this issue Apr 15, 2022

Simplify logic #34

Merged

benoit74 closed this as completed in #34 Apr 15, 2022
