Post URLs to the Wayback Machine (Internet Archive), using a crawler, from Sitemap(s), or from a list of URLs.
> The Wayback Machine is a digital archive of the World Wide Web [...] The service enables users to see archived versions of web pages across time ...
>
> - Wikipedia
Install the gem:

```
$ gem install wayback_archiver
```

Or add this line to your application's Gemfile:

```ruby
gem 'wayback_archiver'
```

And then execute:

```
$ bundle
```
Strategies:

- `auto` (the default) - will try to:
  1. Find Sitemap(s) defined in `/robots.txt`
  2. Then check common sitemap locations: `/sitemap-index.xml`, `/sitemap.xml`, etc.
  3. Fall back to crawling (using the excellent spidr gem)
- `crawl` - Crawl the site
- `sitemap` - Parse Sitemap(s), supports index files (and gzip)
- `urls` - Post URL(s)
First, require the gem:

```ruby
require 'wayback_archiver'
```
Examples:

Auto:

```ruby
# auto is the default
WaybackArchiver.archive('example.com')
# or explicitly
WaybackArchiver.archive('example.com', strategy: :auto)
```
Crawl:

```ruby
WaybackArchiver.archive('example.com', strategy: :crawl)
```
Send a single URL:

```ruby
WaybackArchiver.archive('example.com', strategy: :url)
```
Send multiple URLs:

```ruby
WaybackArchiver.archive(%w[example.com www.example.com], strategy: :urls)
```
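If your URLs live in a file, one per line, you can read them in and pass the list to the `urls` strategy. A minimal sketch; the file name is illustrative:

```ruby
require 'wayback_archiver'

# Hypothetical input file containing one URL per line
urls = File.readlines('urls.txt', chomp: true)

WaybackArchiver.archive(urls, strategy: :urls)
```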
Send all URL(s) found in a Sitemap:

```ruby
WaybackArchiver.archive('example.com/sitemap.xml', strategy: :sitemap)
# works with Sitemap index files too
WaybackArchiver.archive('example.com/sitemap-index.xml.gz', strategy: :sitemap)
```
Specify concurrency:

```ruby
WaybackArchiver.archive('example.com', strategy: :auto, concurrency: 10)
```
Specify the maximum number of URLs to be archived:

```ruby
WaybackArchiver.archive('example.com', strategy: :auto, limit: 10)
```
Each archive strategy can receive a block that will be called for each URL:

```ruby
WaybackArchiver.archive('example.com', strategy: :auto) do |result|
  if result.success?
    puts "Successfully archived: #{result.archived_url}"
  else
    puts "Error (HTTP #{result.code}) when archiving: #{result.archived_url}"
  end
end
```
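The block makes it easy to collect failures for a retry pass. A minimal sketch, assuming only the `success?`, `code`, and `archived_url` methods used above:

```ruby
require 'wayback_archiver'

failed_urls = []

WaybackArchiver.archive('example.com', strategy: :auto) do |result|
  # Remember every URL that did not archive successfully
  failed_urls << result.archived_url unless result.success?
end

# Retry the failed ones once, as a plain list of URLs
WaybackArchiver.archive(failed_urls, strategy: :urls) unless failed_urls.empty?
```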
Use your own adapter for posting found URLs:

```ruby
WaybackArchiver.adapter = ->(url) { puts url } # anything that responds to #call
```
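The adapter doesn't have to be a lambda; any object responding to `#call(url)` should do. A minimal sketch of a class-based adapter; the class name and file destination are illustrative, not part of the gem:

```ruby
# Hypothetical adapter that appends each found URL to a file
# instead of posting it to the Wayback Machine
class FileAdapter
  def initialize(path)
    @path = path
  end

  # The only contract assumed here is #call(url)
  def call(url)
    File.open(@path, 'a') { |file| file.puts(url) }
  end
end

WaybackArchiver.adapter = FileAdapter.new('found_urls.txt')
```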
CLI usage:

```
wayback_archiver [<url>] [options]
```

Print full usage instructions:

```
wayback_archiver --help
```
Examples:

Auto:

```
# auto is the default
wayback_archiver example.com
# or explicitly
wayback_archiver example.com --auto
```

Crawl:

```
wayback_archiver example.com --crawl
```

Send a single URL:

```
wayback_archiver example.com --url
```

Send multiple URLs:

```
wayback_archiver example.com www.example.com --urls
```

Crawl multiple URLs:

```
wayback_archiver example.com www.example.com --crawl
```

Send all URL(s) found in a Sitemap:

```
wayback_archiver example.com/sitemap.xml
# works with Sitemap index files too
wayback_archiver example.com/sitemap-index.xml.gz
```

Most options:

```
wayback_archiver example.com www.example.com --auto --concurrency=10 --limit=100 --log=output.log --verbose
```
View the archive at https://web.archive.org/web/*/http://example.com (replace `http://example.com` with your desired domain).
ℹ️ By default `wayback_archiver` doesn't respect robots.txt files; see this Internet Archive blog post for more information.
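If you do want robots.txt to be respected, the configuration setter listed in the next section can presumably be switched on before archiving. A minimal sketch:

```ruby
require 'wayback_archiver'

# Opt in to honoring robots.txt rules (off by default, as noted above)
WaybackArchiver.respect_robots_txt = true

WaybackArchiver.archive('example.com')
```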
Configuration (the below values are the defaults):

```ruby
WaybackArchiver.concurrency = 1
WaybackArchiver.user_agent = WaybackArchiver::USER_AGENT
WaybackArchiver.respect_robots_txt = WaybackArchiver::DEFAULT_RESPECT_ROBOTS_TXT
WaybackArchiver.logger = Logger.new(STDOUT)
WaybackArchiver.max_limit = -1 # unlimited
WaybackArchiver.adapter = WaybackArchiver::WaybackMachine # must implement #call(url)
```
For a more verbose log you can configure `WaybackArchiver` as such:

```ruby
WaybackArchiver.logger = Logger.new(STDOUT).tap do |logger|
  logger.progname = 'WaybackArchiver'
  logger.level = Logger::DEBUG
end
```
Pro tip: If you're using the gem in a Rails app you can set `WaybackArchiver.logger = Rails.logger`.
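In a Rails app that line would typically live in an initializer. A minimal sketch; the file path is just a Rails convention, not something the gem requires:

```ruby
# config/initializers/wayback_archiver.rb (hypothetical location)
WaybackArchiver.logger = Rails.logger
```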
You can find the docs online on RubyDoc.

This gem is documented using yard. To generate the documentation, run from the root of this repository:

```
yard # Generates documentation to doc/
```
Contributions, feedback and suggestions are very welcome.

1. Fork it
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create a new Pull Request
- Don't know what the Wayback Machine (Internet Archive) is? Wayback Machine
- Don't know what a Sitemap is? sitemaps.org
- Don't know what robots.txt is? www.robotstxt.org