Failed Request Handling - DDOS protection #19

Open
ashley-evans opened this issue Sep 15, 2021 · 0 comments

ashley-evans commented Sep 15, 2021

Description

Currently, if an error occurs while accessing a web page during the crawl operation in crawl-urls, we simply retry up to 2 additional times.

Many sites implement strategies to prevent web scraping or have DDoS protection, so the crawler is likely to be blocked from accessing a page after requesting too many pages in quick succession.

  • This is currently mitigated by requesting only 50 pages from a single domain in one crawl.

Any failed page should be marked as failed in the DynamoDB URLsTable, and the message related to that base URL should not be removed from the SQS queue. When the message is retried, the pages that errored should be retried. (The visibility timeout is currently set to 720 seconds, which might be adequate time before a retry.)

Acceptance Criteria

AC01

  • Any web page that errors due to DDoS protection etc. should be marked as failed in the URLsTable.
  • The related message should not be removed from the SQS queue, so that it is retried at a later stage.
  • When the message is retried, the web pages that errored should be retried.
  • This repeats as necessary.
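
One way to picture the bookkeeping AC01 asks for is sketched below. It is only an illustration: the in-memory dictionary stands in for the DynamoDB URLsTable, and the names `PageStatus`, `record_result` and `pages_to_retry` are hypothetical and do not come from crawl-urls.

```python
from enum import Enum
from typing import Dict, List, Tuple


class PageStatus(str, Enum):
    """Hypothetical status values for a crawled page."""
    COMPLETE = "complete"
    FAILED = "failed"


# Hypothetical in-memory stand-in for the URLsTable: (base URL, path) -> status.
UrlsTable = Dict[Tuple[str, str], PageStatus]


def record_result(table: UrlsTable, base_url: str, path: str, succeeded: bool) -> None:
    """Store the outcome of a page visit, overwriting any earlier status."""
    table[(base_url, path)] = PageStatus.COMPLETE if succeeded else PageStatus.FAILED


def pages_to_retry(table: UrlsTable, base_url: str) -> List[str]:
    """Return the paths for this base URL that are still marked as failed."""
    return sorted(
        path
        for (base, path), status in table.items()
        if base == base_url and status is PageStatus.FAILED
    )
```

When the SQS message for a base URL becomes visible again, the crawl could start from `pages_to_retry(...)`; a page that then succeeds is overwritten as complete and drops out of the next retry.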

AC02

  • If a page errors as per AC01, then no further pages from that base URL should be accessed.
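
The stop-on-first-failure behaviour in AC02 could look roughly like the sketch below. It assumes a hypothetical `PageBlockedError` raised by whatever fetches the page, and an injected `fetch` callable; neither is taken from the existing crawler.

```python
from typing import Callable, Iterable, List, Tuple


class PageBlockedError(Exception):
    """Hypothetical error for a page the site refused to serve."""


def crawl_base_url(
    paths: Iterable[str], fetch: Callable[[str], None]
) -> Tuple[List[str], List[str]]:
    """Visit paths in order and stop at the first blocked page.

    Returns (crawled, remaining). `remaining` starts with the blocked page
    and includes every page that was never attempted.
    """
    crawled: List[str] = []
    pending = list(paths)
    while pending:
        try:
            fetch(pending[0])
        except PageBlockedError:
            # Stop here: no further pages from this base URL are accessed.
            break
        crawled.append(pending.pop(0))
    return crawled, pending
```

A non-empty `remaining` list is the signal that the message for this base URL should stay on the queue.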

AC03

  • Other records within the event should be processed as appropriate.
  • If the other records are processed successfully, then those messages should be marked as complete and removed from the SQS queue.
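
One possible way to meet AC03 is Lambda's partial batch response for SQS, where the handler returns the IDs of the messages that failed and only those stay on the queue. The sketch below assumes the event source mapping has `ReportBatchItemFailures` enabled; `handle_batch` and the injected `process_record` callable are hypothetical names, not part of the current code.

```python
from typing import Any, Callable, Dict, List


def handle_batch(
    event: Dict[str, Any], process_record: Callable[[Dict[str, Any]], None]
) -> Dict[str, List[Dict[str, str]]]:
    """Process every SQS record and report only the ones that failed."""
    failures: List[Dict[str, str]] = []
    for record in event["Records"]:
        try:
            process_record(record)
        except Exception:
            # Leave this message on the queue so it is retried once
            # the visibility timeout expires.
            failures.append({"itemIdentifier": record["messageId"]})
    # Messages not listed here are treated as successful and deleted.
    return {"batchItemFailures": failures}
```

Without that setting, the usual alternative is to delete successful messages explicitly and then raise, so that the rest of the batch becomes visible again.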

ashley-evans changed the title from "Failed Request Handling" to "Failed Request Handling - DDOS protection" on Sep 15, 2021
ashley-evans added the "5" label on Sep 15, 2021