Road to v2 #68
🔔 @danshearer, @bartman081523, @xplosionmind, @shoeper, @fgrehm, @jhcloos, @milliken, @jeanpauldejong If any of you have any input or ideas I would love to hear them! ⭐
The good thing about retries is that they put off, to some extent, the need to think about tracking state. The state issue is this: if there are 2000 URLs to submit, and wayback_archiver has submitted 300 of them before aborting with an exception, it seems a bit silly to restart at URL 1. A first implementation of state tracking might only cover sites that have a Sitemap, because then it is trivial.
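For illustration, here is a minimal sketch of that Sitemap-based state tracking; the state-file name and the resume helper are hypothetical, not part of wayback_archiver today:

```ruby
require "set"

# Hypothetical resume support: append each successfully submitted URL to a
# state file, and skip anything already recorded on the next run.
STATE_FILE = ".wayback_archiver_state".freeze

def already_submitted
  return Set.new unless File.exist?(STATE_FILE)
  Set.new(File.readlines(STATE_FILE, chomp: true))
end

def submit_with_resume(urls)
  done = already_submitted
  urls.each do |url|
    next if done.include?(url)
    yield url # the caller performs the actual Wayback submission
    File.open(STATE_FILE, "a") { |f| f.puts(url) }
  end
end
```

If the run aborts at URL 300, the next invocation re-reads the file and picks up at URL 301 instead of URL 1.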
Here is a possible feature to consider: update-only pushing. This might be a better alternative to keeping state, even in the case of a Sitemap walk, as described below. At the moment wayback_archiver blindly pushes URLs. I didn't find any reference to the Wayback Availability JSON API being rate-limited. Presumably it is at some level, but it would seem unlikely to be as strict as the submission API. That means it is a cheap operation to query whether the URL we are about to push already exists, using Wayback's idea of the "closest URL". If, for example, there appears to be an identical URL with a snapshot time of 5 minutes ago, then we might decide to skip it and move on to the next. Using the Availability API means wayback_archiver can still be stateless and yet not keep repeating existing work. And for smaller-scale sites (say a few thousand URLs) we don't need any kind of sophisticated tree-walk algorithm, because the API is cheap.
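As a sketch of that check, assuming the public Availability endpoint at https://archive.org/wayback/available and an arbitrary one-day freshness threshold:

```ruby
require "net/http"
require "json"
require "time"
require "uri"

MAX_SNAPSHOT_AGE = 24 * 60 * 60 # seconds; example threshold, tune as needed

# Ask the Availability API for the closest snapshot and decide whether the
# URL can be skipped. Timestamps from the API are UTC in YYYYMMDDhhmmss form.
def recently_archived?(url)
  uri = URI("https://archive.org/wayback/available")
  uri.query = URI.encode_www_form(url: url)
  closest = JSON.parse(Net::HTTP.get(uri)).dig("archived_snapshots", "closest")
  return false unless closest && closest["available"]

  taken_at = Time.strptime("#{closest["timestamp"]} +0000", "%Y%m%d%H%M%S %z")
  Time.now - taken_at < MAX_SNAPSHOT_AGE
end

# urls.reject { |url| recently_archived?(url) } would then feed the submitter.
```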
Easy features to implement would be --order-reverse and --order-random. This is a first step toward not submitting URL 1 again and again. They would start from the bottom of the Sitemap, or walk it in random order. Still no state is kept, but it gives a modest improvement with almost no development effort; a sketch follows below.
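This could be as small as reordering the URL list before submission; a sketch, with hypothetical option values mirroring the proposed flags:

```ruby
# Reorder the Sitemap URLs according to a hypothetical --order flag.
def ordered(urls, order)
  case order
  when :reverse then urls.reverse
  when :random  then urls.shuffle
  else               urls
  end
end
```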
@buren Thank you for inviting me, I will happily and naively suggest:
Interesting project, I was going to implement this myself until I found you. Some info which may or may not be helpful:

The (draft) Save Page Now 2 (SPN2) API docs are here. AFAICT this is the API the Wayback Machine uses for saving URLs as an authenticated user. The spec uses cookie or API key authentication (I can't get the former to work). An authenticated page save results in a JSON response, so:

```json
{"url":"github.com","job_id":"spn2-8674ce5a6bb3aa7e67c394bdc97a9fa1f6802f6b"}
```

You can then do a status update request on that job_id (see the sketch after this comment).

One other thing I've noticed: the job_id is simply "spn2-" followed by a SHA-1 hash of the URL*.

*In the form `http://<url>/`:

```console
$ echo "http://github.com/" | tr -d "\n" | shasum
8674ce5a6bb3aa7e67c394bdc97a9fa1f6802f6b  -
```

I've found at least one exception: when, for example, you save a page with a fragment (https://url/foo#bar), the SHA-1 hash is calculated on https://url/foo, i.e. with the https scheme retained and no trailing /. Also, beware frequently saved URLs like example.com, as you'll just get the status of the most recent save by anyone. The API includes rate limit parameters etc.

If you don't wish to include authenticated SPN2 REST API calls in your project, I may create a gem just for that purpose, as I am considering building a server-based archiving solution for long-running jobs.

Other thoughts: adding archive.today would be great; I'm not sure they have an official API.
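For the record, a rough sketch of those authenticated SPN2 calls, following the draft docs linked above. It assumes S3-style access/secret keys from your archive.org account settings (exported here as environment variables), and the endpoint paths are per the draft spec, so they may change:

```ruby
require "net/http"
require "json"
require "uri"

# S3-style API keys from archive.org account settings (assumed env vars).
AUTH_HEADER = "LOW #{ENV.fetch("IA_ACCESS_KEY")}:#{ENV.fetch("IA_SECRET_KEY")}"

def spn2_request(req)
  Net::HTTP.start(req.uri.host, req.uri.port, use_ssl: true) do |http|
    JSON.parse(http.request(req).body)
  end
end

# Submit a capture job; the response carries the job_id discussed above.
def spn2_save(url)
  req = Net::HTTP::Post.new(URI("https://web.archive.org/save"),
                            "Accept" => "application/json",
                            "Authorization" => AUTH_HEADER)
  req.set_form_data(url: url)
  spn2_request(req) # => {"url" => "...", "job_id" => "spn2-..."}
end

# Poll the status of a previously submitted job.
def spn2_status(job_id)
  req = Net::HTTP::Get.new(URI("https://web.archive.org/save/status/#{job_id}"),
                           "Accept" => "application/json",
                           "Authorization" => AUTH_HEADER)
  spn2_request(req) # "status" is e.g. "pending", "success" or "error"
end
```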
So FWIW, gem
Configure the gem with `WaybackArchiver.configure { |c| ... }` instead of using top-level functions like `WaybackArchiver.user_agent=` (the current configuration). Happy for any input!
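A minimal sketch of that configure-block pattern; only user_agent appears in the thread, and the default value here is illustrative:

```ruby
module WaybackArchiver
  # Holds all settings in one place instead of top-level writers.
  class Configuration
    attr_accessor :user_agent

    def initialize
      @user_agent = "WaybackArchiver" # illustrative default
    end
  end

  def self.configuration
    @configuration ||= Configuration.new
  end

  def self.configure
    yield(configuration)
  end
end

# Usage:
WaybackArchiver.configure do |c|
  c.user_agent = "MyArchiver/1.0"
end
```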