A web crawling app that seeks out media files and stores their location, context and MD5 hash in easy-to-query Redis sets. Data can be pulled out ‘by checksum’, giving URI locations and context pages for specific files. Data can also be viewed by size, by URI, and by site. Files that appear on multiple pages, or across multiple sites, can also be easily singled out.
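For orientation, pulling data back out of Redis might look something like the sketch below. The key names are hypothetical (the real scheme is defined by the app, and prefixed by `$options.global_prefix`), so check the code for the actual layout.

```ruby
require "redis"

redis  = Redis.new(db: 0)
prefix = "audit:"                                # whatever $options.global_prefix is set to
md5    = "d41d8cd98f00b204e9800998ecf8427e"      # checksum of the file you care about

# Hypothetical key names -- illustrative only.
file_uris     = redis.smembers("#{prefix}md5:#{md5}:uris")    # where the file itself lives
context_pages = redis.smembers("#{prefix}md5:#{md5}:pages")   # pages that reference it
```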
1. Sort out your environment
$ gem install bundler
$ bundle install
2. Make sure it passes its tests
$ rake
3. Clean out your redis…
$ bundle exec ruby bin/FLUSH
4. CRAWL
$ bundle exec ruby bin/run http://site.com
Options are located in ./options.rb.
$options.datastore
: Takes an integer representing the desired Redis db.
$options.global_prefix
: A string placed at the start of all keys, to avoid collisions if your app shares a db with others.
$options.page_delay
: Delay between pages when crawling sites.
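A minimal sketch of what ./options.rb might contain, assuming `$options` is a plain OpenStruct; the real file may be structured differently and the values here are only illustrative:

```ruby
require "ostruct"

$options = OpenStruct.new(
  datastore:     0,         # Redis db number to use
  global_prefix: "audit:",  # prepended to every key the app writes
  page_delay:    1          # pause between page fetches while crawling
)
```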
Additional configuration may be necessary for appropriate slurping out of file paths. This can be altered in ./lib/audit_tentacles/slurp.rb. In this class you can see exclusions for certain URL paths, filetypes, etc. Adjust to suit your requirements. This is a messy class.
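As a rough illustration of the kind of exclusion logic that lives there (not the actual Slurp code), a blocklist of paths and filetypes checked before a URL is followed might look like this:

```ruby
# Illustrative only -- the real exclusions live in ./lib/audit_tentacles/slurp.rb.
EXCLUDED_PATHS      = [%r{/login}, %r{/logout}].freeze
EXCLUDED_EXTENSIONS = %w[.css .js .ico].freeze

def skip_url?(uri)
  url = uri.to_s.downcase
  EXCLUDED_PATHS.any? { |pattern| pattern.match?(url) } ||
    EXCLUDED_EXTENSIONS.any? { |ext| url.end_with?(ext) }
end
```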
A TKI-specific cookie-grabbing function (SSOAuth.get_cookies) passes a hash of cookies in to the crawler via a Mechanize action that clicks the first form submission button it sees. This won't be desirable behaviour in many cases :D. Disable it by removing :cookies => SSOAuth.get_cookies(site) from the options hash in ./lib/audit_tentacles/crawler.rb.
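For reference, the pair to delete sits in an options hash roughly like the one below. The other keys are illustrative; only the :cookies entry is taken from the text above.

```ruby
# ./lib/audit_tentacles/crawler.rb -- sketch, not the real hash
crawl_options = {
  :delay   => $options.page_delay,
  :cookies => SSOAuth.get_cookies(site)  # remove this pair to disable the SSO login click
}
```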