Home
DarkSpider is a Python script to crawl and extract (regular or onion) webpages through the Tor network.
Warning
Crawling is not illegal, but violating copyright is. It's always best to double-check a website's T&C before crawling it. Some websites set up what's called robots.txt to tell crawlers not to visit certain pages. This crawler allows you to go around that, but we always recommend respecting robots.txt.
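If you do want to respect robots.txt, Python's standard library can check a URL before you crawl it. A minimal sketch, independent of DarkSpider itself:

```python
# Minimal sketch: check robots.txt before crawling a URL,
# using only the Python standard library.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("http://github.com/robots.txt")
parser.read()  # fetch and parse the robots.txt file

url = "http://github.com/"
if parser.can_fetch("*", url):  # "*" matches any user agent
    print(f"Allowed to crawl {url}")
else:
    print(f"robots.txt disallows crawling {url}")
```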
Keep in mind
Extracting and crawling through the Tor network takes some time. That's normal behaviour; you can find more information here.
With a single argument you can read an .onion webpage (or a regular one) through the Tor network, and using pipes you can pass the output to any other tool you prefer.
$ python darkspider.py -u http://github.com/ | grep 'google-site-verification'
<meta name="google-site-verification" content="xxxx">
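For context, this is roughly what fetching a page over Tor looks like in plain Python. A minimal sketch, assuming Tor is running locally on its default SOCKS port 9050 and requests is installed with SOCKS support (pip install requests[socks]); DarkSpider's own internals may differ:

```python
# Minimal sketch: fetch a page through Tor's local SOCKS proxy.
import requests

# socks5h:// makes the proxy resolve hostnames itself,
# which is required for .onion addresses.
proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

response = requests.get("https://check.torproject.org/", proxies=proxies, timeout=60)
# check.torproject.org says "Congratulations" when the request came through Tor.
print("Connected through Tor:", "Congratulations" in response.text)
```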
If you want to crawl the links of a webpage, use -c
and you will get a folder with all the extracted links. You can even use -d
to set the depth of the crawl, and so on. There is also the -p
argument to wait some seconds before the next crawl.
$ python darkspider.py -v -u http://github.com/ -c -d 2 -p 2
[ DEBUG ] TOR is ready!
[ DEBUG ] Your IP: XXX.XXX.XXX.XXX :: Tor Connection: True
[ DEBUG ] URL :: http://github.com
[ DEBUG ] Folder created :: github.com
[ INFO ] Crawler started from http://github.com with 2 depth, 2.0 seconds delay and using 16 Threads. Excluding 'None' links.
[ INFO ] Step 1 completed :: 87 result(s)
[ INFO ] Step 2 completed :: 4228 result(s)
[ INFO ] Network Structure created :: github.com/network_structure.json
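The exact schema of network_structure.json isn't documented on this page, but since it is plain JSON you can inspect it with the standard library. A minimal, schema-agnostic sketch; the file path comes from the log output above:

```python
# Minimal sketch: peek at the crawl's network structure file.
import json

with open("github.com/network_structure.json") as f:
    network = json.load(f)

# Print the top-level shape without assuming a particular schema.
if isinstance(network, dict):
    for key, value in network.items():
        print(key, "->", type(value).__name__)
else:
    print(type(network).__name__, "with", len(network), "entries")
```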
Note
Output in the Readme is trimmed for better readability; the actual verbose output is much more detailed.