Update crawl service to map subdomains to same domain in results #381

Open
ashley-evans opened this issue Sep 23, 2022 · 1 comment

ashley-evans commented Sep 23, 2022

Value Added

Consolidates results for different subdomains under the same overall domain, and enables crawling of links that point to different subdomains.

Description

Currently, the crawl service only crawls pages that are on exactly the same hostname as the provided base URL. As a result, any links on a site that reference a different subdomain (www. etc.) are not crawled.

  • This has a significant impact when all of a site's URLs use the www. subdomain but the base URL does not: only the first page is crawled

The crawl service should be updated to crawl any page that is on the same domain as the base URL. Crawls against different subdomains should update the known URLs for the overall domain in DynamoDB.

  • Results from crawls of different subdomains should be stored under the same partition key (the domain name)
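
A minimal sketch of the storage idea above, assuming the registrable domain has already been resolved from the hostname. `KnownUrlTable` and `recordCrawl` are hypothetical names, and the plain object is only an in-memory stand-in for the DynamoDB table, not the service's actual schema:

```typescript
// Hypothetical in-memory stand-in for the DynamoDB table:
// partition key (domain name) -> known URLs for that domain.
type KnownUrlTable = Record<string, string[]>;

// Merge newly crawled URLs into the entry for `domain`, skipping duplicates.
function recordCrawl(table: KnownUrlTable, domain: string, urls: string[]): void {
    const known = table[domain] || [];
    for (const url of urls) {
        if (known.indexOf(url) === -1) {
            known.push(url);
        }
    }
    table[domain] = known;
}

const table: KnownUrlTable = {};

// Crawl started from the www. subdomain.
recordCrawl(table, "example.com", [
    "https://www.example.com/",
    "https://www.example.com/about",
]);

// Crawl started from another subdomain of the same domain.
recordCrawl(table, "example.com", [
    "https://blog.example.com/post",
    "https://www.example.com/about",
]);

// Both crawls land under the single key "example.com".
console.log(Object.keys(table), table["example.com"].length);
```

Keying on the domain means a later crawl from any subdomain extends the same entry instead of creating a parallel one.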

Acceptance Criteria

AC01

  • Update the crawl service to enable the crawling of pages that are on the same domain (regardless of subdomain)
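
One way AC01 could look in code, as a sketch rather than the service's actual implementation. `isSameDomain` and `naiveDomain` are hypothetical names. `naiveDomain` simply keeps the last two hostname labels, which is wrong for multi-part suffixes such as `co.uk`, so the resolver is passed in as a parameter and a public-suffix-aware lookup can be swapped in:

```typescript
// Illustration-only resolver: keeps the last two labels of the hostname.
// Wrong for multi-part suffixes such as "co.uk"; a public-suffix-aware
// lookup would be needed in practice.
function naiveDomain(hostname: string): string {
    return hostname.split(".").slice(-2).join(".");
}

// True when `link` (absolute, or relative to `baseUrl`) is on the same
// registrable domain as `baseUrl`.
function isSameDomain(
    baseUrl: string,
    link: string,
    resolveDomain: (hostname: string) => string = naiveDomain
): boolean {
    const base = resolveDomain(new URL(baseUrl).hostname);
    const target = resolveDomain(new URL(link, baseUrl).hostname);
    return base === target;
}

const baseUrl = "https://example.com";
const links = [
    "https://www.example.com/about",
    "https://blog.example.com/post",
    "https://other.org/",
];

// Exact hostname match (current behaviour): no link qualifies.
const byHostname = links.filter(
    (link) => new URL(link).hostname === new URL(baseUrl).hostname
);

// Domain match (proposed behaviour): both example.com subdomains qualify.
const byDomain = links.filter((link) => isSameDomain(baseUrl, link));

console.log(byHostname.length, byDomain.length);
```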

AC02

  • The crawl service should allow only one crawl every 2 days on any given domain
  • i.e. only one crawl should be performed if multiple crawls are initiated on different subdomains of the same domain
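
The check behind AC02 might be sketched as below. `canStartCrawl` and `MIN_CRAWL_INTERVAL_MS` are hypothetical names, and treating a request made exactly 2 days after the previous crawl as allowed is an assumption. The important part is that the last crawl time is looked up by domain, not by hostname:

```typescript
// 2 days in milliseconds, per AC02.
const MIN_CRAWL_INTERVAL_MS = 2 * 24 * 60 * 60 * 1000;

// `lastCrawledAt` is the last crawl time recorded for the whole domain,
// or undefined if the domain has never been crawled.
function canStartCrawl(lastCrawledAt: Date | undefined, now: Date): boolean {
    if (lastCrawledAt === undefined) {
        return true;
    }
    // Assumption: a request exactly 2 days later is allowed.
    return now.getTime() - lastCrawledAt.getTime() >= MIN_CRAWL_INTERVAL_MS;
}

// Last crawl times keyed by domain, so www.example.com and
// blog.example.com share one entry.
const lastCrawlByDomain: Record<string, Date> = {
    "example.com": new Date("2022-09-23T00:00:00Z"),
};

// One day later, requested via any subdomain of example.com: rejected.
const oneDayLater = canStartCrawl(
    lastCrawlByDomain["example.com"],
    new Date("2022-09-24T00:00:00Z")
);

// Two days later: allowed.
const twoDaysLater = canStartCrawl(
    lastCrawlByDomain["example.com"],
    new Date("2022-09-25T00:00:00Z")
);

console.log(oneDayLater, twoDaysLater);
```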

AC03

  • The crawl service should be updated so that URLs and cached page content are stored against the domain name rather than the specific hostname provided by the user
@ashley-evans ashley-evans added the 3 label Sep 23, 2022
@ashley-evans

Can use https://www.npmjs.com/package/tldts to obtain the domain name from a URL

@ashley-evans ashley-evans self-assigned this Sep 24, 2022
@ashley-evans ashley-evans removed their assignment Dec 19, 2022