Road to v2 #68

Open · 10 tasks · buren opened this issue Jun 24, 2022 · 7 comments

@buren (Owner) commented Jun 24, 2022

Happy for any input!

@buren buren added this to the v2.0 milestone Jun 24, 2022
@buren buren self-assigned this Jun 24, 2022
@buren (Owner, Author) commented Jun 24, 2022

🔔 @danshearer, @bartman081523, @xplosionmind, @shoeper, @fgrehm, @jhcloos, @milliken, @jeanpauldejong

If any of you have any input or ideas I would love to hear them! ⭐

@danshearer

The good thing about retries is that they put off to some extent the need to think about tracking state.

The state issue is this: if there are 2000 URLs to submit, and wayback_archiver has submitted 300 of them before aborting with an exception, it seems a bit silly to restart at URL 1. A first implementation of state tracking might only cover sites that have a Sitemap, because then it is trivial.
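
A minimal sketch of what that could look like, assuming a simple newline-delimited state file (the file name and the `submit` call it wraps are illustrative, not existing wayback_archiver API):

```ruby
# Sketch only: remember which URLs were already submitted so an aborted run
# can resume instead of starting again at URL 1.
require "set"

STATE_FILE = ".wayback_archiver_submitted".freeze # illustrative location

def already_submitted
  return Set.new unless File.exist?(STATE_FILE)
  Set.new(File.readlines(STATE_FILE, chomp: true))
end

def mark_submitted(url)
  File.open(STATE_FILE, "a") { |f| f.puts(url) }
end

def submit_all(urls)
  done = already_submitted
  urls.each do |url|
    next if done.include?(url) # work finished before the abort
    submit(url)                # placeholder for the real submission call
    mark_submitted(url)
  end
end
```

Appending one line per completed URL keeps the state file usable even if the process dies mid-run; deleting the file resets the walk.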

@danshearer

Here is a possible feature to consider: update-only pushing. This might be a better alternative to keeping state, even in the case of a Sitemap walk, as described below. At the moment wayback_archiver blindly pushes URLs.

I didn't find any reference to the Wayback Availability JSON API being rate limited. Presumably it is at some level, but it seems unlikely to be as strict as the submission API. That means it is a cheap operation to query whether the URL we are about to push already exists, using Wayback's idea of the "closest URL". But if, for example, there appears to be an identical URL with a snapshot time of 5 minutes ago, then we might decide to skip it and move on to the next.

Using the Availability API means wayback_archiver can still be stateless and yet not keep repeating existing work. And for smaller-scale sites (say a few thousand URLs) we don't need any kind of sophisticated tree walk algorithm because the API is cheap.
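
As a rough illustration, a pre-flight check against the Availability API could look something like this (a sketch only; `urls` and `submit` are placeholders, and the threshold is configurable):

```ruby
# Skip URLs that already have a recent snapshot, according to the
# Wayback Availability JSON API (https://archive.org/wayback/available).
require "json"
require "net/http"
require "time"
require "uri"

# true if the closest snapshot is newer than max_age seconds (default: 24h)
def recently_archived?(url, max_age: 24 * 60 * 60)
  query   = URI.encode_www_form(url: url)
  body    = Net::HTTP.get(URI("https://archive.org/wayback/available?#{query}"))
  closest = JSON.parse(body).dig("archived_snapshots", "closest")
  return false unless closest && closest["available"]

  # Timestamps are "YYYYMMDDhhmmss"; the UTC offset is ignored for brevity.
  Time.now - Time.strptime(closest["timestamp"], "%Y%m%d%H%M%S") < max_age
rescue JSON::ParserError, SocketError, Net::OpenTimeout
  false # if the lookup fails, fall back to submitting the URL anyway
end

urls.reject { |url| recently_archived?(url) }.each { |url| submit(url) }
```

With a threshold of a few minutes this matches the "snapshot from 5 minutes ago" case above; a longer window trades freshness for fewer submissions.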

@danshearer

Easy features to implement would be --order-reverse and --order-random. This is a very first step toward not submitting URL 1 again and again: start from the bottom of the Sitemap, or do a random walk through it. It still doesn't keep any state, but it gives a modest improvement with almost no development effort.
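
Either option is just a reorder of the URL list before submission; a tiny sketch (the option names and the `sitemap_urls`/`submit` helpers are illustrative):

```ruby
# Reorder the Sitemap URLs before submitting; no state is kept.
urls = sitemap_urls                                # URLs extracted from the Sitemap
urls = urls.reverse if options[:order] == :reverse # --order-reverse
urls = urls.shuffle if options[:order] == :random  # --order-random
urls.each { |url| submit(url) }
```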

@bartman081523 (Contributor) commented Jun 25, 2022

@buren Thank you for inviting me, I will happily and naively suggest:

  1. Load and save state during the crawling process: a map of all crawl targets and their results would have to be created and saved first. This also makes it easier to track errors and to re-crawl failed or not-yet-archived targets. (Maybe gzip the state-tracking sitemap in a temp dir.)
  2. Maybe split the crawling and uploading functionality and only chain them together in auto mode. This also makes archiving easier: you can crawl many addresses with a high thread count and then archive them one after another with a low thread count (as archive.org has required lately). (Just a naive suggestion.)
  3. Adding at least one more archive service would give better resilience against errors or breaking changes on archive.org's side. Maybe allow custom archive targets (--custom-target="https://archive.ph/submit/?&url=%%url%%"). (Not easy in terms of result tracking, but I think most archive pages forward to the archived result.) (Just a naive suggestion.)
  4. Specify filetypes to crawl (--filetype= all | txt | txt,pdf | all,-pdf | [etc]); see the sketch after this list.
  5. JSON input and output. I recently read how much it helps portability when programs can be chained together via JSON input and output; it might not be far off from CSV output. (Just a naive suggestion.)
  6. Don't overcomplicate things in auto mode; keep an easy-to-access auto mode (but you'll most likely do that anyway :-D).
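
For point 4, a sketch of how a --filetype spec could be interpreted (the flag and the helper are only the suggestion above, not an existing option):

```ruby
require "uri"

# Keep URLs whose file extension matches the spec; "-ext" entries are exclusions.
# Example specs: "all", "txt", "txt,pdf", "all,-pdf"
def filter_by_filetype(urls, spec)
  parts     = spec.split(",").map(&:strip)
  excluded  = parts.select { |p| p.start_with?("-") }.map { |p| p.delete_prefix("-") }
  included  = parts.reject { |p| p.start_with?("-") || p == "all" }
  allow_all = parts.include?("all")

  urls.select do |url|
    ext = File.extname(URI(url).path).delete_prefix(".").downcase
    next false if excluded.include?(ext)
    allow_all || included.include?(ext)
  end
end

urls = ["https://example.com/", "https://example.com/report.pdf", "https://example.com/notes.txt"]
filter_by_filetype(urls, "all,-pdf")
# => ["https://example.com/", "https://example.com/notes.txt"]
```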

@MatzFan commented Jun 29, 2022

Interesting project; I was going to implement this myself until I found you. Some info which may or may not be helpful:

The (draft) Save Page Now 2 (SPN2) API docs are here. AFAICT this is the API the Wayback Machine uses for saving URLs as an authenticated user. The spec allows cookie or API key authentication (I can't get the former to work). An authenticated page save results in a JSON response, so:
curl web.archive.org/save -d "url=github.com" -H "Accept: application/json" -H "Authorization: LOW myaccesskey:mysecret" gives a 200 response and:

{"url":"github.com","job_id":"spn2-8674ce5a6bb3aa7e67c394bdc97a9fa1f6802f6b"}

You can then do a status update request on that job_id like this:
curl web.archive.org/save/status/spn2-8674ce5a6bb3aa7e67c394bdc97a9fa1f6802f6b -H "Accept: application/json" -H "Authorization: LOW myaccesskey:mysecret"
The JSON response includes lots of information, including a status key whose value may be "error", "pending" or "success". This could be used to retry failed jobs.
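
For context, the two curl calls plus the retry idea translate roughly to the following Ruby (a sketch against the draft SPN2 endpoints; credentials handling, polling interval, and error handling are illustrative):

```ruby
# Submit a URL to Save Page Now 2 and poll the resulting job status.
require "json"
require "net/http"
require "uri"

ACCESS_KEY = ENV.fetch("IA_ACCESS_KEY") # archive.org S3-style keys
SECRET_KEY = ENV.fetch("IA_SECRET_KEY")
HEADERS = {
  "Accept"        => "application/json",
  "Authorization" => "LOW #{ACCESS_KEY}:#{SECRET_KEY}",
}.freeze

def save_page(url)
  res = Net::HTTP.post(URI("https://web.archive.org/save"),
                       URI.encode_www_form(url: url), HEADERS)
  JSON.parse(res.body).fetch("job_id")
end

def save_status(job_id)
  uri = URI("https://web.archive.org/save/status/#{job_id}")
  req = Net::HTTP::Get.new(uri, HEADERS)
  res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
  JSON.parse(res.body)["status"] # "pending", "success" or "error"
end

job_id = save_page("https://example.com/")
status = "pending"
while status == "pending"
  sleep 5
  status = save_status(job_id)
end
puts status # an "error" result is a candidate for a retry
```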

One other thing I've noticed: the job_id is simply "spn2-" followed by a SHA-1 hash of the URL*.

*In the form http://<url>/
So any of the following parameter data used in this example will yield the job_id above: github.com, http://github.com, https://github.com, github.com/, etc. Proof:

$ echo "http://github.com/"|tr -d "\n"|shasum
=> 8674ce5a6bb3aa7e67c394bdc97a9fa1f6802f6b  -

I've found at least one exception: when, for example, you save a page with a fragment (https://url/foo#bar), the SHA-1 hash is calculated on https://url/foo, i.e. with the https scheme retained and no trailing /.
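
That normalization is easy to reproduce; a sketch of predicting job_ids using the basic rule (the fragment caveat just mentioned is deliberately not handled):

```ruby
require "digest"

# Predict the SPN2 job_id for a URL using the "http://<url>/" form above.
def predicted_job_id(url)
  bare = url.sub(%r{\Ahttps?://}, "").chomp("/")
  "spn2-" + Digest::SHA1.hexdigest("http://#{bare}/")
end

predicted_job_id("https://github.com")
# => "spn2-8674ce5a6bb3aa7e67c394bdc97a9fa1f6802f6b"
```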

Also, beware frequently saved URLs like example.com, as you'll just get the status of the most recent save by anyone.

The API includes rate limit parameters etc.

If you don't wish to include authenticated SPN2 REST API calls in your project, I may create a gem just for that purpose, as I am considering building a server-based archiving solution for long-running jobs.

Other thoughts: adding archive.today would be great; I'm not sure they have an official API.

@MatzFan commented Jun 29, 2022

> I may create a gem just for that purpose

So FWIW gem spn2 is now a thing. Bare bones but I'll add the rest of the SPN2 API functionality ASAP.
