Shallots

"Shining the light on the dark web, analysis of what is beneath the surface of the internet"
Analysis dashboard

Summary

The dark web holds a lot of secrets. My goal is to give some insight into what is going on there. I describe the main topics that are discussed there and what they mean. I divide the topic clusters into two groups (legal and illegal) and find out whether they behave like two separate islands or are actually connected. Finally, I plot the countries that are mentioned on a map, in relation to the clusters. The project starts with crawled dark web HTML content and ends with a website with visualizations.

Description

Tor enables anonymous online communication. It is meant to provide safety for vulnerable internet users such as political activists. The downside, however, is that it also helps criminals, who run servers that only accept connections through Tor so that they are hard to identify. Those servers are called hidden services and can be accessed through a .onion address.

Not much research has been done on what is going on in this "dark web". Some content clustering has been done, which showed that both legal and illegal content is available on these websites. It is not clear how connected those two groups are.

Motivation

In 2011 I first encountered the illegal side of the dark web. Since then it has kept surprising me that tools and analysts that focus on the internet normally don't take the dark web into account. In my opinion they should, because this is exactly the place where things can come to the surface, since users feel safe thanks to the anonymity Tor provides.

This project is an ideal way to combine my interest in the dark web with my preference for NLP, social network analysis (SNA) and visualization. It can also grow along the way: if there is time, I can look further into insights I get during the analysis.

Numbers
The current setup works with 3350 crawled onion websites, 2408 of which are classified as English. There are 1117 distinct domains within the data, 743 of which are in English.

Data Sources
Crawled dark web data stored in MongoDB, crawled by the builder of Ahmia and OnionBot.

Details

Process
-Get the scraped HTML content stored in MongoDB
-Check the scraped data for correctness and completeness (+EDA)
-Detect the language of the content and continue only with the English content
-Find .onion links in the content (regex) and fill the relations table in SQL with (id, id) pairs
-Strip the HTML from the content
-Remove stopwords, lemmatize and vectorize. Do topic modeling with varying k (somewhere around k=10)
-Read the top x words of each cluster to decide on the best descriptive word; if it is not clear, change k
-Store the manually chosen name and the legal/illegal label in a table with the cluster
-NER on country names for visualization
-Create concept graph data of similar words with word2vec
-Create JSON files with relevant data for the viz
-Create website with data viz dashboard
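
The link extraction and HTML stripping steps above could look roughly like the sketch below. It is not the code from shallots.py: the function names, the example domains and the assumption that onion addresses are 16 (v2) or 56 (v3) base32 characters are mine, and it returns the (id, id) pairs as a Python list instead of writing them to PostgreSQL.

```python
import re
from html.parser import HTMLParser

# Assumption: v2 onion addresses have 16 base32 characters, v3 addresses have 56.
ONION_RE = re.compile(r"\b(?:[a-z2-7]{56}|[a-z2-7]{16})\.onion\b")


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def strip_html(html):
    """Return the visible text of an HTML page as one space-joined string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def onion_links(html):
    """Return the set of onion domains mentioned anywhere in the raw HTML."""
    return {match.group(0) for match in ONION_RE.finditer(html.lower())}


def build_relations(pages):
    """pages maps a crawled onion domain to its raw HTML.

    Returns the sorted (source_id, target_id) pairs and the domain -> id mapping.
    """
    ids = {domain: i for i, domain in enumerate(sorted(pages), start=1)}
    relations = set()
    for source, html in pages.items():
        for target in onion_links(html):
            # Keep only links to other crawled domains; drop self-references.
            if target in ids and target != source:
                relations.add((ids[source], ids[target]))
    return sorted(relations), ids
```

Links are searched in the raw HTML, before stripping, so that addresses inside href attributes are found as well as addresses in the visible text.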

-Visualize

Bar chart of the clusters, with a word cloud on click
Network of the grouped clusters, with relations between them based on URL references
Map of the world, colored on a red-green spectrum based on legal/illegal
Pie chart on mouseover on the map
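
As an illustration of the data behind the map, the sketch below turns country mentions per cluster into a red-green color and a pie chart breakdown per country. The row layout, the function names and the example cluster names are assumptions for this example, not the format the dashboard actually reads.

```python
import json
from collections import defaultdict


def share_to_hex(legal_share):
    """Map 0.0 (all illegal) to red '#ff0000' and 1.0 (all legal) to green '#00ff00'."""
    legal_share = min(1.0, max(0.0, legal_share))
    red = round(255 * (1.0 - legal_share))
    green = round(255 * legal_share)
    return "#{:02x}{:02x}00".format(red, green)


def country_summary(mentions, cluster_is_legal):
    """Aggregate country mentions into map colors and pie chart slices.

    mentions: iterable of (country, cluster_name, count) rows (hypothetical layout).
    cluster_is_legal: dict cluster_name -> bool, the manually assigned label.
    """
    per_country = defaultdict(lambda: defaultdict(int))
    for country, cluster, count in mentions:
        per_country[country][cluster] += count

    summary = {}
    for country, clusters in per_country.items():
        total = sum(clusters.values())
        if total == 0:
            continue  # nothing to draw for this country
        legal = sum(n for c, n in clusters.items() if cluster_is_legal[c])
        summary[country] = {
            "color": share_to_hex(legal / total),
            "pie": [{"cluster": c, "count": n} for c, n in sorted(clusters.items())],
        }
    return summary


if __name__ == "__main__":
    rows = [("NL", "drugs", 3), ("NL", "hosting", 1), ("US", "hosting", 2)]
    labels = {"drugs": False, "hosting": True}
    print(json.dumps(country_summary(rows, labels), indent=2))
```

The resulting JSON can be loaded by the d3.js map: the color fills the country and the pie list feeds the mouseover chart.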

Architecture & implementation
-Python
-MongoDB
-PostgreSQL
-d3.js
-Stanford NER
-Gensim
-sklearn NMF

Chart for data viz

Challenges I ran into:
-Getting it to work on Amazon and dealing with storage: a lot of crashes

References
https://blog.torproject.org/category/tags/crawling
http://arxiv.org/pdf/1308.6768v2.pdf
http://www.dis.uniroma1.it/~dasec/DASec_Pustogarov.pdf
https://www.gwern.net/docs/sr/2014-spitters.pdf
https://github.com/juhanurmi/ahmia/tree/master/onionbot

Future work
See future_work.md for my plans for future improvements.

Dependencies
pip install pymongo
conda install psycopg2
pip install pycountry
pip install fuzzywuzzy
Geograpy2 -> install from GitHub, but comment out the reference to geograpy-nltk in the install script.
pip install python-Levenshtein
conda install gensim

Run it
Make sure the crawled data is already in MongoDB
python shallots.py to prepare the data
python index.py to start the dashboard
