- Strip URLs found in Sitemaps
- Inline `robots` dependency, closes #51
- Update Sitemap XML parsing to work better with newer versions of REXML
- Fix issue calling `Spidr` with an options hash (i.e. use the double splat operator, see the sketch after this list)
- Don't respect robots.txt file by default, PR#41
- Add `WaybackArchiver::respect_robots_txt=` configuration option, to control whether to respect the robots.txt file or not (see the configuration sketch after this list)
- Update `spidr` gem, resolves issue#25
- Set default concurrency to `1` due to harsher rate limiting on the Wayback Machine
- Support for crawling multiple hosts, for example www.example.com, example.com and app.example.com, PR#27
- Archive every page found, not only HTML pages - #24 thanks @chlorophyll-zz.
- Track which URLs have been visited in the sitemapper and don't visit them twice
- Protect against sitemap index duplicates
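
The double splat fix above concerns how an options hash is forwarded under Ruby 3's keyword-argument semantics. The snippet below is a minimal, generic sketch of the pattern; the `crawl` method and its options are made up for illustration and are not WaybackArchiver's or Spidr's actual API:

```ruby
# Illustration only: in Ruby 3 a Hash is no longer implicitly converted
# into keyword arguments, so an options hash must be forwarded with **.
def crawl(url, delay: 0, limit: nil)
  # ... start the crawler with the given keyword options ...
  [url, delay, limit]
end

options = { delay: 1, limit: 100 }

crawl('https://example.com', **options) # works on Ruby 3
# crawl('https://example.com', options) # ArgumentError on Ruby 3
```

For the robots.txt and concurrency entries, here is a minimal configuration sketch. Only `respect_robots_txt=` is named in this changelog; the `concurrency=` setter and the `archive` call are assumed from the gem's public API:

```ruby
require 'wayback_archiver'

# Opt back in to honoring robots.txt (disabled by default as of PR#41).
WaybackArchiver.respect_robots_txt = true

# Assumed module-level setting; the changelog only states that the default
# was lowered to 1 because of Wayback Machine rate limiting.
WaybackArchiver.concurrency = 1

WaybackArchiver.archive('example.com')
```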
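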
Is history...