Archive images in IA #31
Comments
I have seen those links to embedded resources before. The Internet Archive archives those resources when the archived page is reloaded. In other cases, even though the links appear in the returned HTML, the Internet Archive has already queued them for crawling, so the next time the page is reloaded or revisited, those resources will already be captured and served from the archive. So I don't think we need to add this feature to ArchiveNow right now. The other issue is that some links to web resources (e.g., images) are generated by JavaScript and are completely different each time you reload the archived page. For example, each time you reload http://web.archive.org/web/20190107013706/http://ws-dl.blogspot.com in the browser, you get unique links that have not been archived yet, because those links contain unique random values (e.g., the link that ends with slideshare.net/fizzy/admin?...)
@maturban, why did you close this? The issue with not downloading the images is that by the time someone actually opens the page (which may not happen if the user just assumes the archive is fine), the embedded content may already have been modified or have disappeared. Some images may change daily or even more frequently. I have personally archived more than 107 pages to IA using similar (but less sophisticated) methods, but have had to either download every page and then separately archive all the images, or not archive the images at all. I've written a script which saves embedded content, but it's basically just
Thanks for the information. Could you please give one or more examples of pages with images for which the response returned by the Internet Archive contains URIs with .../save/_embed/...?
One solution to this issue would be to load the returned URI-M in a headless web browser, which will automatically trigger requests to archive all the embedded resources.
I had thought this was always the case for pages being saved (.../save/, but not .../web/) with embedded content from another address used through the
I think this is not an issue. These
Does it work like that? I don't think the server does that. I thought it just did redirect magic with /web/ URLs so that all of the links work.
Most of the images shown on the page were archived 35 minutes before that capture. I've tried using an odd image size for an example image in my Wikipedia sandbox. MediaWiki generates scaled thumbnails from images originally uploaded to the server by users, so it's very likely that the image was never rendered until a few minutes ago.
The Example.svg.png image links have not been saved yet (277px · 416px · 554px); thus the absence of _embed URLs does not indicate that the Internet Archive has saved the linked embedded content.
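A hedged sketch of how one might check whether an embedded resource has a capture, instead of inferring it from the absence of _embed URLs. It assumes the Wayback Machine availability endpoint (https://archive.org/wayback/available) and its usual JSON shape with an "archived_snapshots" / "closest" entry. The function names are hypothetical, and the network request itself is left out.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint: the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(resource_url):
    """Build the availability lookup URL for one embedded resource."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": resource_url})


def is_archived(response_text):
    """Return True if the JSON response reports an available closest snapshot.

    Assumes the response looks like
    {"archived_snapshots": {"closest": {"available": true, ...}}}
    and that "archived_snapshots" is empty when nothing has been captured.
    """
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    return bool(closest and closest.get("available"))
```

Fetching `availability_query(...)` with any HTTP client and passing the body to `is_archived` would give a per-image answer. The index may lag behind very recent captures, so a negative result shortly after saving is not conclusive.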
Fair enough! If that's the approach they are taking, then a headless browser seems to be the way to go.
It would be nice if the tool would also archive embedded content for Internet Archive requests. This could be done by downloading the archived page and searching for any
/save/_embed/[^"'<>\(\)]*
URLs in the page source. (It would also be nice if the tool could download lazy-loaded files and/or any linked media files, although even the Wayback Machine can't really do that in many cases.)
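The search step proposed above could look roughly like this: a minimal sketch that applies the same pattern to page source that has already been downloaded and returns the matching URLs in absolute form. The function name, the `base` default, and the de-duplication are assumptions for illustration, not part of ArchiveNow. Requesting each returned URL (which is presumably what triggers the save) is left out.

```python
import re

# Same pattern as in the proposal: stop at quotes, angle brackets, or parentheses.
EMBED_RE = re.compile(r"""/save/_embed/[^"'<>()]*""")


def find_embed_urls(page_source, base="https://web.archive.org"):
    """Return unique /save/_embed/ URLs from the page source, in order of appearance.

    `base` is an assumed host prefix; the pattern only captures the path part,
    so it is prepended to every match.
    """
    found = []
    for path in EMBED_RE.findall(page_source):
        url = base + path
        if url not in found:  # keep the first occurrence only
            found.append(url)
    return found


if __name__ == "__main__":
    sample = (
        '<img src="/save/_embed/http://example.com/a.png">'
        '<div style="background:url(/save/_embed/http://example.com/b.png)"></div>'
    )
    for url in find_embed_urls(sample):
        print(url)
```

Because the character class excludes parentheses, matches inside CSS `url(...)` values end at the closing parenthesis rather than running on into the rest of the style attribute.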