Reload existing state from DynamoDB on subsequent crawl #17

Open
ashley-evans opened this issue Sep 14, 2021 · 0 comments

Description

Currently, the crawl-urls lambda function relies on Apify local storage to know which pages have been visited and which are next in the queue. However, the lambda execution environment is not permanent, so we cannot rely on this storage persisting between messages.

  • If a message fails to finish processing in time and the lambda execution environment is re-initialised, we cannot continue from where we left off.

Therefore, the crawl-urls lambda function should be able to restart from a non-initialised environment.
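
One way to picture the restart is a function that rebuilds the crawler's in-memory state from the entries already persisted. The sketch below is illustrative only: the `CrawlRecord` shape, its field names, and `rebuildState` are hypothetical and not taken from the crawl-urls code, and the records are passed in as a plain array rather than read through a DynamoDB client.

```typescript
// Hypothetical shape of one persisted entry; the real table schema may differ.
interface CrawlRecord {
  baseUrl: string;  // base URL of the crawl the page belongs to
  url: string;      // page URL
  depth: number;    // depth at which the page was discovered
  visited: boolean; // true once the page has been fully processed
}

interface CrawlState {
  visited: Set<string>;
  pending: CrawlRecord[];
}

// Rebuild in-memory crawl state from persisted records for a single base URL.
function rebuildState(records: CrawlRecord[], baseUrl: string): CrawlState {
  const visited = new Set<string>();
  const pending: CrawlRecord[] = [];
  for (const record of records) {
    if (record.baseUrl !== baseUrl) continue; // ignore entries from other crawls
    if (record.visited) {
      visited.add(record.url);
    } else {
      pending.push(record);
    }
  }
  // Assumption: shallowest pages are processed first after a restart.
  pending.sort((a, b) => a.depth - b.depth);
  return { visited, pending };
}
```

Keeping the base URL on every record is what would let a restarted crawl write new entries under the same base URL as before.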

Acceptance Criteria

AC01

  • Update crawl-urls so that it can restart from the middle of a crawl operation
  • The base URL for each new entry added to DynamoDB should be the same as for the entries added before the restart

AC02

  • The restart should respect the maximum depth environment variable with respect to the depths from the original crawl operation
  • e.g. If the last page crawled was at depth 12, then on restart that depth should be retained and used for future crawling
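
The depth rule in AC02 could be sketched as below. This assumes each persisted entry stores the depth at which it was found, that the limit arrives as a string from an environment variable, and that the limit is inclusive. `parseMaxDepth` and `childDepthIfAllowed` are hypothetical names, not functions from crawl-urls.

```typescript
// Parse the maximum depth from an environment-style string value.
// Returns undefined when the value is missing or not a non-negative integer.
function parseMaxDepth(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed >= 0 ? parsed : undefined;
}

// Given the stored depth of a parent page, return the depth for its links,
// or undefined if following them would exceed the (inclusive) limit.
function childDepthIfAllowed(
  parentDepth: number,
  maxDepth: number | undefined
): number | undefined {
  const childDepth = parentDepth + 1;
  if (maxDepth !== undefined && childDepth > maxDepth) return undefined;
  return childDepth;
}
```

Under these assumptions, a page restored at depth 12 with a limit of 13 would have its links crawled at depth 13, and links found on those pages would be skipped.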

AC03

  • The restart should respect the maximum number of crawl operations
  • e.g. If the crawl operation had accessed 10 pages before the restart, then the max page counter should start at 10 rather than 0
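
A minimal sketch of the counter behaviour in AC03, assuming the number of pages already accessed can be derived from the persisted entries. The `PageBudget` class and its members are hypothetical and only illustrate seeding the counter from existing state.

```typescript
// Tracks how many pages a crawl has accessed against an optional maximum.
class PageBudget {
  private readonly maxPages: number | undefined;
  private crawled: number;

  // alreadyCrawled is the number of pages accessed before the restart.
  constructor(maxPages: number | undefined, alreadyCrawled: number = 0) {
    this.maxPages = maxPages;
    this.crawled = alreadyCrawled;
  }

  // Returns true and counts the page if the budget allows another crawl.
  tryConsume(): boolean {
    if (this.maxPages !== undefined && this.crawled >= this.maxPages) {
      return false;
    }
    this.crawled += 1;
    return true;
  }

  // Total pages counted so far, including those from before the restart.
  get count(): number {
    return this.crawled;
  }
}
```

For example, `new PageBudget(12, 10)` would allow two more pages before `tryConsume()` starts returning `false`.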
ashley-evans added 5 8 and removed 5 labels Sep 14, 2021