
Archive images in IA #31

Open
jc86035 opened this issue Jan 5, 2019 · 9 comments


@jc86035

jc86035 commented Jan 5, 2019

It would be nice if the tool would also archive embedded content for Internet Archive requests. This could be done by downloading the archived page and searching the page source for any URLs matching `/save/_embed/[^"'<>\(\)]*`.

(It would also be nice if the tool could download lazy-loaded files and/or any linked media files, although even the Wayback Machine can't really do that in many cases.)
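The extraction step described above can be sketched in shell. The file name `page.html` and the `example.com` links are stand-ins for an actual `/save/` response body:

```shell
# Simulate a /save/ response body that still contains embed links
# (stand-in content; a real body would come from curl'ing /save/<URL>).
printf '%s\n' \
  '<link href="/save/_embed/https://example.com/a.css">' \
  '<img src="/save/_embed/https://example.com/b.png">' > page.html

# Extract the /save/_embed/ URLs and make them absolute. Each resulting
# URL could then be fetched (e.g. piped to `xargs -n1 -P5 curl -s -o /dev/null`)
# to trigger archiving of the embedded resources.
grep -o '/save/_embed/[^"<>()]*' page.html |
  sed 's|^|https://web.archive.org|'
```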

@maturban
Member

maturban commented Jan 7, 2019

I have seen those links to embedded resources before. The Internet Archive will archive those resources when the archived page is reloaded. In other cases, even though you see those links in the returned HTML, the Internet Archive has already placed them in a queue for crawling, so the next time the page is reloaded/revisited those resources will already be captured and served from the archive. So I think there is no need to add this feature to ArchiveNow right now.

The other issue is that some links to web resources (e.g., images) are generated by JavaScript and are completely different each time you reload the archived page. For example, each time you reload http://web.archive.org/web/20190107013706/http://ws-dl.blogspot.com in the browser, you will get unique links that have not been archived yet, because those links contain unique random values (e.g., the link that ends with slideshare.net/fizzy/admin?...)

@maturban maturban closed this as completed Jan 9, 2019
@jc86035
Author

jc86035 commented Jan 10, 2019

@maturban, why did you close this?

The issue with not downloading the images is that by the time someone actually opens the page (which may never happen if the user just assumes the archive is fine), the embedded content may already have been modified or have disappeared. Some images may change daily or even more frequently.

I have personally archived more than 107 pages to IA using similar (but less sophisticated) methods, but have had to either download every page and then separately archive all the images, or not archive the images at all. I've written a script that saves embedded content, but it's basically just `cd tmp; cat $1 | xargs -P 5 wget --spider --retry-connrefused ; grep -hro [...] tmp | awk '!seen[$0]++' | xargs -P 5 wget [...]`. I also wrote a different script for YouTube because of its lazy loading; in some cases (e.g. BuzzFeed) it could be beneficial to save images that the Wayback Machine doesn't know how to archive or how to display.
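The `awk '!seen[$0]++'` stage in that pipeline is a standard first-occurrence dedupe idiom; a minimal self-contained demonstration (the `*.png` names are placeholders for extracted URLs):

```shell
# awk keeps a line only the first time it is seen: the associative array
# `seen` counts occurrences, and `!seen[$0]++` is true only when the
# count was still zero. Duplicate URLs are dropped before wget runs.
printf '%s\n' a.png b.png a.png c.png | awk '!seen[$0]++'
# prints: a.png b.png c.png (one per line)
```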

@maturban maturban reopened this Jan 10, 2019
@maturban
Member

Thanks for the information. Could you please give one or more examples of pages with images that, when submitted to the Internet Archive, return a response that may contain URIs with .../save/_embed/...?

@ibnesayeed
Member

One solution to this issue would be to load the returned URI-M in a headless web browser, which will automatically trigger requests to archive all the embedded resources.
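A rough sketch of that idea in shell, assuming a Chrome/Chromium binary is available on `PATH` (the `BROWSER` variable and the guard are illustrative; `--dump-dom` simply loads the page headlessly, which lets its subresource requests fire):

```shell
# Load a memento in a headless browser so the page's embedded
# /save/_embed/ requests are triggered as subresource loads.
URIM="http://web.archive.org/web/20190107013706/http://ws-dl.blogspot.com"
BROWSER="${BROWSER:-google-chrome}"
if command -v "$BROWSER" >/dev/null 2>&1; then
  "$BROWSER" --headless --disable-gpu --dump-dom "$URIM" > /dev/null
  echo "loaded $URIM"
else
  echo "no headless browser found; skipping"
fi
```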

@jc86035
Author

jc86035 commented Jan 10, 2019

the returned response may have URIs with .../save/_embed/...

I had thought this was always the case for pages being saved (.../save/, but not .../web/) that embed content from another address via the src HTML attribute.

@jc86035
Author

jc86035 commented Jan 10, 2019

$ curl -s "https://web.archive.org/save/https://en.wikipedia.org/wiki/Main_Page" | grep -o '/save/_embed/[^"<>()]*'

/save/_embed/https://en.wikipedia.org/w/load.php?debug=false&amp;lang=en&amp;modules=ext.3d.styles%7Cext.uls.interlanguage%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.skinning.interface%7Cskins.vector.styles&amp;only=styles&amp;skin=vector
/save/_embed/https://en.wikipedia.org/w/load.php?debug=false&amp;lang=en&amp;modules=startup&amp;only=scripts&amp;skin=vector
/save/_embed/https://en.wikipedia.org/w/load.php?debug=false&amp;lang=en&amp;modules=ext.gadget.charinsert-styles&amp;only=styles&amp;skin=vector
/save/_embed/https://en.wikipedia.org/w/load.php?debug=false&amp;lang=en&amp;modules=site.styles&amp;only=styles&amp;skin=vector
/save/_embed/https://en.wikipedia.org/static/favicon/wikipedia.ico
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Corporation_of_London_Records_Office%2C_Plea_and_Memoranda_Roll_A34%2C_m.2_%281395%29.jpg/120px-Corporation_of_London_Records_Office%2C_Plea_and_Memoranda_Roll_A34%2C_m.2_%281395%29.jpg
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/7/79/Sunset_Parade_-_US_Marin_Corps.jpg/180px-Sunset_Parade_-_US_Marin_Corps.jpg
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Batholomew_handing_tomos_to_Epiphanius.jpg/162px-Batholomew_handing_tomos_to_Epiphanius.jpg
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Constructing_the_Metropolitan_Railway.png/174px-Constructing_the_Metropolitan_Railway.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/b/b8/John_Quidor_-_The_Headless_Horseman_Pursuing_Ichabod_Crane_-_Google_Art_Project.jpg/550px-John_Quidor_-_The_Headless_Horseman_Pursuing_Ichabod_Crane_-_Google_Art_Project.jpg
/save/_embed/https://upload.wikimedia.org/wikipedia/en/thumb/4/4a/Commons-logo.svg/31px-Commons-logo.svg.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/3/3d/Mediawiki-logo.png/35px-Mediawiki-logo.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/7/75/Wikimedia_Community_Logo.svg/35px-Wikimedia_Community_Logo.svg.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Wikibooks-logo.svg/35px-Wikibooks-logo.svg.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/f/ff/Wikidata-logo.svg/47px-Wikidata-logo.svg.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/2/24/Wikinews-logo.svg/51px-Wikinews-logo.svg.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Wikiquote-logo.svg/35px-Wikiquote-logo.svg.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Wikisource-logo.svg/35px-Wikisource-logo.svg.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Wikispecies-logo.svg/35px-Wikispecies-logo.svg.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Wikiversity_logo_2017.svg/41px-Wikiversity_logo_2017.svg.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Wikivoyage-Logo-v3-icon.svg/35px-Wikivoyage-Logo-v3-icon.svg.png
/save/_embed/https://upload.wikimedia.org/wikipedia/en/thumb/0/06/Wiktionary-logo-v2.svg/35px-Wiktionary-logo-v2.svg.png
/save/_embed/https://en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1
/save/_embed/https://en.wikipedia.org/static/images/wikimedia-button.png
/save/_embed/https://en.wikipedia.org/static/images/poweredby_mediawiki_88x31.png

@ibnesayeed
Member

I think this is not an issue. These /save/_embed URIs are temporary and go away within the next few seconds or minutes. I archived a Wikipedia page and saved the immediate response to a file locally. This file contained a handful of embed URIs. Then, within a minute, I downloaded the recently archived memento using cURL (not a web browser, to avoid any implicit save requests) and found no embed URIs in it. This means that while the other resources are still in the frontier queue, the server rewrites those links in the immediate response differently, but within the next few minutes those queued resources should be archived.
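The check described above can be sketched on stand-in files. The contents of `immediate.html` and `memento.html` below simulate the immediate `/save/` response and the memento fetched a minute later (both file names and link targets are hypothetical):

```shell
# Simulated immediate /save/ response: still contains an embed link.
printf '<img src="/save/_embed/https://example.com/a.png">\n' > immediate.html
# Simulated memento fetched shortly after: link rewritten to /web/...im_/.
printf '<img src="/web/20190107im_/https://example.com/a.png">\n' > memento.html

# Count /save/_embed/ occurrences in each; the memento should have none.
grep -c '/save/_embed/' immediate.html          # prints 1
grep -c '/save/_embed/' memento.html || true    # prints 0
```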

@jc86035
Author

jc86035 commented Jan 10, 2019

but in the next few minutes those queued resources should be archived

Does it work like that? I don't think the server does that. I thought it just did redirect magic with /web/ URLs so that all of the links work.

curl -s "https://web.archive.org/web/20190110133207/https://en.wikipedia.org/wiki/Main_Page" | grep -o '/web/[^"<>()]*\.jpg'

/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/c/cf/Corporation_of_London_Records_Office%2C_Plea_and_Memoranda_Roll_A34%2C_m.2_%281395%29.jpg
/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Corporation_of_London_Records_Office%2C_Plea_and_Memoranda_Roll_A34%2C_m.2_%281395%29.jpg/120px-Corporation_of_London_Records_Office%2C_Plea_and_Memoranda_Roll_A34%2C_m.2_%281395%29.jpg
/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Corporation_of_London_Records_Office%2C_Plea_and_Memoranda_Roll_A34%2C_m.2_%281395%29.jpg/180px-Corporation_of_London_Records_Office%2C_Plea_and_Memoranda_Roll_A34%2C_m.2_%281395%29.jpg 1.5x, //web.archive.org/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Corporation_of_London_Records_Office%2C_Plea_and_Memoranda_Roll_A34%2C_m.2_%281395%29.jpg/240px-Corporation_of_London_Records_Office%2C_Plea_and_Memoranda_Roll_A34%2C_m.2_%281395%29.jpg
/web/20190110133207/https://en.wikipedia.org/wiki/File:Sunset_Parade_-_US_Marin_Corps.jpg
/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/7/79/Sunset_Parade_-_US_Marin_Corps.jpg/180px-Sunset_Parade_-_US_Marin_Corps.jpg
/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/7/79/Sunset_Parade_-_US_Marin_Corps.jpg/270px-Sunset_Parade_-_US_Marin_Corps.jpg 1.5x, //web.archive.org/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/7/79/Sunset_Parade_-_US_Marin_Corps.jpg/360px-Sunset_Parade_-_US_Marin_Corps.jpg
/web/20190110133207/https://en.wikipedia.org/wiki/File:Batholomew_handing_tomos_to_Epiphanius.jpg
/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Batholomew_handing_tomos_to_Epiphanius.jpg/162px-Batholomew_handing_tomos_to_Epiphanius.jpg
/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Batholomew_handing_tomos_to_Epiphanius.jpg/243px-Batholomew_handing_tomos_to_Epiphanius.jpg 1.5x, //web.archive.org/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Batholomew_handing_tomos_to_Epiphanius.jpg/324px-Batholomew_handing_tomos_to_Epiphanius.jpg
/web/20190110133207/https://en.wikipedia.org/wiki/File:John_Quidor_-_The_Headless_Horseman_Pursuing_Ichabod_Crane_-_Google_Art_Project.jpg
/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/b/b8/John_Quidor_-_The_Headless_Horseman_Pursuing_Ichabod_Crane_-_Google_Art_Project.jpg/550px-John_Quidor_-_The_Headless_Horseman_Pursuing_Ichabod_Crane_-_Google_Art_Project.jpg
/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/b/b8/John_Quidor_-_The_Headless_Horseman_Pursuing_Ichabod_Crane_-_Google_Art_Project.jpg/825px-John_Quidor_-_The_Headless_Horseman_Pursuing_Ichabod_Crane_-_Google_Art_Project.jpg 1.5x, //web.archive.org/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/b/b8/John_Quidor_-_The_Headless_Horseman_Pursuing_Ichabod_Crane_-_Google_Art_Project.jpg/1100px-John_Quidor_-_The_Headless_Horseman_Pursuing_Ichabod_Crane_-_Google_Art_Project.jpg

Most of the images shown on the page were archived 35 minutes before that capture.

I've tried using an odd image size for an example image in my Wikipedia sandbox. MediaWiki generates scaled thumbnails on demand from user-uploaded originals, so it's very likely that this thumbnail size had never been rendered until a few minutes ago.

curl -s "https://web.archive.org/save/https://en.wikipedia.org/wiki/User:Jc86035/sandbox3"
curl -s "https://web.archive.org/web/20190110140927/https://en.wikipedia.org/wiki/User:Jc86035/sandbox3" | grep -o '/web/[^"<>()]*\.png'

/web/20190110140927im_/https://en.wikipedia.org/static/apple-touch/wikipedia.png
/web/20190110140927im_/https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Example.svg/277px-Example.svg.png
/web/20190110140927im_/https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Example.svg/416px-Example.svg.png 1.5x, //web.archive.org/web/2/https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Example.svg/554px-Example.svg.png
/web/20190110140927im_/https://en.wikipedia.org/static/images/wikimedia-button.png
/web/20190110140927im_/https://en.wikipedia.org/static/images/wikimedia-button-1.5x.png 1.5x, /web/20190110140927im_/https://en.wikipedia.org/static/images/wikimedia-button-2x.png
/web/20190110140927im_/https://en.wikipedia.org/static/images/poweredby_mediawiki_88x31.png
/web/20190110140927im_/https://en.wikipedia.org/static/images/poweredby_mediawiki_132x47.png 1.5x, /web/20190110140927im_/https://en.wikipedia.org/static/images/poweredby_mediawiki_176x62.png

The Example.svg.png image links have not been saved yet (277px · 416px · 554px); thus the absence of _embed URLs does not indicate that the Internet Archive has saved the linked embedded content.

@ibnesayeed
Member

Fair enough! If that's the approach they are taking, then a headless browser seems to be the way to go.
