New endpoint to accept a GOV URL and crawl it to find the links? #473

Open
guy-roberts opened this issue Jun 21, 2021 · 2 comments

@guy-roberts

For our largely static site, it would be useful to let the API code find the URLs by crawling from our home page.

This must have been considered before, so I bet I am missing something. If I did a PR to do this, am I likely to hit any showstoppers?

Also, are there any running instances that our DfE project could use, rather than hosting it ourselves?

@thomasleese
Contributor

thomasleese commented Jun 21, 2021

👋 This has been considered before, but the way we use the API on GOV.UK means that it's never been a requirement. Our publishing tools use this API to check the links of an individual document before it's published on GOV.UK. This means the API can't see the new page, so the publishing tool needs to extract the links itself and send them to the Link Checker API.

For your needs, it does sound like it would be useful to get Link Checker API to do the crawling. Working out how it would fit into the app could be a little tricky, as the expectation is that the API receives a set of links. However, I don't see any major technical blockers to getting the API to build a Batch itself by crawling from an initial link.
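As a rough illustration of "build a Batch itself by crawling from an initial link", here is a minimal sketch. It is written in Python with the standard library only, purely for illustration; it is not code from Link Checker API (which is a Rails app), and the names `crawl_links`, `fetch_html`, `LinkExtractor` and `max_pages` are all hypothetical. It assumes a breadth-first crawl that only follows pages on the starting host, while still collecting links to other hosts so they can be checked.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch_html(url):
    """Default fetcher (hypothetical): no retries, no robots.txt handling."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_links(start_url, fetch=fetch_html, max_pages=50):
    """Breadth-first crawl from start_url; return links in discovery order.

    Only pages on the same host as start_url are fetched. Links to other
    hosts are collected but never followed.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    visited = set()   # pages already fetched
    seen = set()      # links already recorded
    found = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        if page in visited:
            continue
        visited.add(page)
        try:
            html = fetch(page)
        except OSError:
            continue  # unreachable page: skip it, keep crawling
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop any #fragment.
            link, _fragment = urldefrag(urljoin(page, href))
            if urlparse(link).scheme not in ("http", "https"):
                continue  # ignore mailto:, tel:, javascript: and similar
            if link in seen:
                continue
            seen.add(link)
            found.append(link)
            if urlparse(link).netloc == host:
                queue.append(link)
    return found
```

The returned list is the "set of links" the API already expects, so in this sketch the crawl would only replace the step where the caller extracts links itself. The `fetch` parameter is injectable so the crawl can be exercised without network access.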

I think the biggest problem is that it would be a feature that we wouldn't use on GOV.UK, so there would be an added maintenance cost to us which I'm not sure we'd be able to support. I don't see any reason not to raise a PR though; you could always run a fork of our API containing the feature you need.

In terms of a live API, unfortunately there isn't one available at the moment, as we run it privately within our infrastructure. It shouldn't be too difficult to run yourself, as it's a standard Rails app; the difficult part might be getting it to work without our API key authentication.

@guy-roberts
Author

guy-roberts commented Jun 21, 2021

Thanks for your quick response. We might well do a PR then, because we need such a thing. It could be a new API endpoint that accepts a URL, checks that it's a GOV URL, then crawls to find all of the links under it. From then on it would just use the existing code.

Our project is for the DfE, thanks.
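The "checks that it's gov" step is not defined anywhere in this thread. One possible reading is sketched below in Python, for illustration only: it assumes "gov" means a host under `gov.uk`, and the names `is_gov_url` and `ALLOWED_SUFFIX` are hypothetical, not part of Link Checker API.

```python
from urllib.parse import urlparse

# Assumption: "gov" means the host is gov.uk or a subdomain of it.
# The thread does not say which domains should be accepted.
ALLOWED_SUFFIX = "gov.uk"


def is_gov_url(url):
    """Return True if url is http(s) and its host is under ALLOWED_SUFFIX."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return False  # malformed URL, e.g. a broken IPv6 literal
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower().rstrip(".")
    # Compare on a label boundary so "notgov.uk" and
    # "gov.uk.example.com" are rejected.
    return host == ALLOWED_SUFFIX or host.endswith("." + ALLOWED_SUFFIX)
```

Matching on the parsed hostname rather than on the raw string is deliberate: a plain substring test would accept URLs such as `https://www.gov.uk@evil.example/`, where `www.gov.uk` is only the userinfo part.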
