Currently, if an error occurs while accessing a web page during the crawling operation in crawl-urls, we simply retry 2 additional times.
Many sites implement strategies to prevent web scraping or have DDOS protection, so the crawler is likely to be blocked from a page after requesting too many pages in quick succession.
This is currently mitigated by requesting only 50 pages from a single domain in one crawl.
Any failed page should be marked as failed in the DynamoDB URLsTable, and the message related to that base URL should not be removed from the SQS queue. Once the message is retried, the errored pages should be retried. (The visibility timeout is currently set to 720 seconds, which might be adequate time before a retry.)
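The behaviour described above could be sketched roughly as follows. This is only an illustration, not the crawl-urls implementation: `crawlBaseUrl`, `TryFetch`, and the `PageStatus` values are hypothetical names, the URLsTable schema is not shown in this issue, and fetching is modelled as a synchronous callback purely to keep the sketch short (a real crawler would await network requests).

```typescript
// Hypothetical status values; the real URLsTable schema is not shown in this issue.
type PageStatus = "COMPLETE" | "FAILED";

interface CrawlOutcome {
    // Status to persist for each page that was attempted.
    statuses: Record<string, PageStatus>;
    // True when the crawl stopped early because a page could not be fetched.
    blocked: boolean;
}

// Hypothetical fetcher: returns true when the page was fetched, false on error.
type TryFetch = (url: string) => boolean;

function crawlBaseUrl(
    pages: string[],
    tryFetch: TryFetch,
    maxAttempts: number = 3, // first attempt plus 2 additional retries
    maxPages: number = 50 // per-domain cap mentioned in the description
): CrawlOutcome {
    const statuses: Record<string, PageStatus> = {};
    for (const url of pages.slice(0, maxPages)) {
        let fetched = false;
        for (let attempt = 0; attempt < maxAttempts && !fetched; attempt++) {
            fetched = tryFetch(url);
        }
        if (!fetched) {
            // Mark the page as failed and stop touching this base URL.
            statuses[url] = "FAILED";
            return { statuses, blocked: true };
        }
        statuses[url] = "COMPLETE";
    }
    return { statuses, blocked: false };
}
```

When `blocked` is true, the caller would persist the statuses and leave the SQS message on the queue so that the failed and unvisited pages are picked up on redelivery.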
Acceptance Criteria
AC01
Any web page that errors (for example, due to DDOS protection) should be marked as failed in the URLsTable.
The related message should not be removed from the SQS queue, so that it is retried at a later stage.
Upon retry, the errored web pages should be retried.
This repeats as necessary.
AC02
If a page errors as per AC01, no further pages from that base URL should be accessed.
AC03
Other records within the event should be processed as appropriate.
If the other records process successfully, those messages should be marked as complete and removed from the SQS queue.
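One possible way to meet AC01 and AC03 together, assuming the queue triggers a Lambda function and that the event source mapping has `ReportBatchItemFailures` enabled, is to return a partial batch response: messages listed in `batchItemFailures` stay on the queue and become visible again after the visibility timeout, while the rest are deleted. The sketch below uses minimal local types instead of a typings package, and `processRecord` is a hypothetical callback standing in for the crawl of one base URL.

```typescript
// Minimal local shapes mirroring the SQS Lambda event and partial batch response.
interface SqsRecord {
    messageId: string;
    body: string;
}

interface SqsEvent {
    Records: SqsRecord[];
}

interface BatchResponse {
    batchItemFailures: { itemIdentifier: string }[];
}

// Hypothetical: returns true only if every page for the record's base URL was crawled.
type ProcessRecord = (record: SqsRecord) => boolean;

function handleBatch(event: SqsEvent, processRecord: ProcessRecord): BatchResponse {
    const batchItemFailures: { itemIdentifier: string }[] = [];
    for (const record of event.Records) {
        let succeeded = false;
        try {
            succeeded = processRecord(record);
        } catch {
            // Treat an unexpected exception like a blocked crawl: retry later.
            succeeded = false;
        }
        if (!succeeded) {
            // Reporting the message ID keeps this message on the queue.
            batchItemFailures.push({ itemIdentifier: record.messageId });
        }
    }
    return { batchItemFailures };
}
```

With this shape, one blocked base URL does not force the successfully processed records in the same event to be redelivered.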
ashley-evans changed the title from "Failed Request Handling" to "Failed Request Handling - DDOS protection" on Sep 15, 2021