diff --git a/whitepaper/darknet-body.tex b/whitepaper/darknet-body.tex
index 1cf6a94..3ec0a7b 100644
--- a/whitepaper/darknet-body.tex
+++ b/whitepaper/darknet-body.tex
@@ -415,7 +415,7 @@ \subsection{Content Preprocessing}
 For the following steps we need an array of terms instead of a string; thus we tokenize the content string into words. Now that the language of the content is known, we remove stop words~\cite{McDowall} depending on the language. As it turns out, some of the content only contains stop words, resulting in an empty content after stop-words removal. This content is immediately marked as \emph{empty}.
-After stop word removal, we normalize the content further and apply stemming \cite{WormerStemmer} based on the porter stemmer \cite{porter1980algorithm} when appropriate. Previous work \cite{Harman1991, hull1996stemming, krovetz1996word,SaltonTextProcessing} suggests that stemming does not work equally well in all languages, hence we only apply stemming for English content. Lemmatization (determining the canonical form of a set of words depending on the context) is not considered since previous work showed that it can even degrade performance when applied to English content.
+After stop word removal, we normalize the content further and apply stemming \cite{WormerStemmer} based on the Porter stemmer \cite{porter1980algorithm} when appropriate. Previous work \cite{Harman1991,hull1996stemming,krovetz1996word,SaltonTextProcessing} suggests that stemming does not work equally well in all languages, hence we only apply stemming for English content. Lemmatization (determining the canonical form of a set of words depending on the context) is not considered since previous work showed that it can even degrade performance when applied to English content.
 Next, we build a positional index, which allows us to choose whether the classification should happen on a list of words, a bag of words or a set of words approach.
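The preprocessing pipeline in this hunk (tokenize, drop language-specific stop words, stem English content, build word frequencies) can be sketched as follows. This is not the paper's implementation: the stop-word list is a tiny hypothetical placeholder for the real per-language lists, and `crude_stem` is a simplified stand-in for the Porter stemmer.

```python
from collections import Counter

# Hypothetical, tiny stop-word list -- the real pipeline loads a full
# list for the language detected earlier.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "we"}

def crude_stem(word):
    # Simplified stand-in for the Porter stemmer: strip a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(content, language="en"):
    """Tokenize, remove stop words, stem (English only), return a bag of words."""
    tokens = [t for t in content.lower().split() if t not in STOP_WORDS]
    if not tokens:
        return None  # content consisted only of stop words: mark as "empty"
    if language == "en":  # stemming is applied to English content only
        tokens = [crude_stem(t) for t in tokens]
    return Counter(tokens)  # word -> frequency, the bag-of-words representation
```

Returning `None` mirrors the \emph{empty} marking described above, and the `Counter` holds exactly the per-word frequencies that the bag-of-words model needs.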
 Since we want to keep the most information possible and employ an SVM classifier, we decided to use a bag of words approach for the classification of labels. However, the software provides the functionality for all three approaches, which makes it easier to experiment with different machine learning techniques on the collected data set. For the bag of words model, we need to store the frequency of each word; the frequency can be extracted from the positional index, which is discussed in the following subsection. In addition, we keep the extracted clean content string to facilitate manual classification. Thereby, the clean content string can be directly presented to a human classifier, in order to get the labeled training data set.
@@ -433,7 +433,7 @@ \subsection{Storing Content for Classification}
 %---------------Content classification------------------
 %
 \subsection{Content Classification}
-As discussed in \cite{Nabki2017, Verma2013}, a Support Vector Machine (SVM) is well suited for text classification, mainly because the SVM efficiently copes with high dimensional features that are typical for text classification. Moreover, an SVM is easy to train and apply, compared to different types of neural networks. We use $k$-fold cross-validation and compare the output scores to decide which parameters and kernels are performing better. The parameters that deliver the best results are then used to train the model.
+As discussed in \cite{Nabki2017,Verma2013}, a Support Vector Machine (SVM) is well suited for text classification, mainly because the SVM efficiently copes with the high-dimensional features that are typical for text classification. Moreover, an SVM is easy to train and apply, compared to different types of neural networks. We use $k$-fold cross-validation and compare the output scores to decide which parameters and kernels perform better. The parameters that deliver the best results are then used to train the model.
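The $k$-fold cross-validation used for parameter and kernel selection rests on splitting the labeled data into $k$ disjoint, roughly equal folds. A minimal sketch of the fold generation (function and variable names are illustrative, not taken from the paper's codebase):

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # Spread the remainder over the first folds so fold sizes differ by at most 1.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop
```

Each parameter/kernel combination is trained on the train part and scored on the held-out fold; the combination with the best average score across folds is then used to train the final model on all data.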
 We begin the classification process with a very small training set of about $200$ entries, in an attempt to minimize the size of the training set, and then apply active learning~\cite{Xu2009}. Active learning requires knowledge on the uncertainty level of the classifier for each sample, since in every iteration we manually label the samples with the highest uncertainty. The idea is to manually label the entries with the highest entropy to convey the most information to the classifier.
@@ -564,11 +564,11 @@ \section{Darknet Content}
 \section{Related Work}\label{sec:relatedwork}
 %crawling the clearnet
-Starting with the rise of the web, crawling has been a topic of interest for the academic community as well as society in general. Some of the earliest crawlers \cite{Gray1993,McBryan1994, Eichmann1994, Pinkerton1994} were presented just shortly after the web became public. However, since the early web was smaller than the visible darknet today, clearnet crawlers did not face the same issues as they do today. Castillo~\cite{Castillo2005} and Najork and Heydon~\cite{Najork2002} addressed a wide variety of issues, e.g. scheduling. Castillo found that applying different scheduling strategies works best. Both acknowledged the need to limit the number of requests sent to the targets, and described several approaches similar to ours.
+Starting with the rise of the web, crawling has been a topic of interest for the academic community as well as society in general. Some of the earliest crawlers \cite{Gray1993,McBryan1994,Eichmann1994,Pinkerton1994} were presented just shortly after the web became public. However, since the early web was smaller than the visible darknet today, clearnet crawlers did not face the same issues as they do today. Castillo~\cite{Castillo2005} and Najork and Heydon~\cite{Najork2002} addressed a wide variety of issues, e.g. scheduling. Castillo found that applying different scheduling strategies works best. Both acknowledged the need to limit the number of requests sent to the targets, and described several approaches similar to ours.
 %crawling the darknet
 Contrary to aforementioned work, we are interested in exploring the \emph{darknet}.
-Previous work~\cite{GeorgeKadianakis2015, TorMetricsOnion} has estimated the size of the Tor darknet using non-invasive methods that allow us to measure what portion of the darknet we have explored.
+Previous work~\cite{GeorgeKadianakis2015,TorMetricsOnion} has estimated the size of the Tor darknet using non-invasive methods that allow us to measure what portion of the darknet we have explored.
 Similarly to our work, Al Nabki et al.~\cite{Nabki2017} want to understand the Tor network; they created a dataset of active Tor hidden services and explored different methods for content classification. However, instead of using a real crawler, they browsed a big initial dataset as seed, and then followed all links they found exactly twice. In turn, Moore and Rid~\cite{Moore2016} went to depth five, which seems to be good enough to catch most of the darknet, as their results are similar to ours. Our crawler does a real recursive search and as such understands the darknet topology in more detail, in contrast to both of these previous works.
 %classifying the darknet with different methods
diff --git a/whitepaper/report.tex b/whitepaper/report.tex
index dd283c9..151c020 100644
--- a/whitepaper/report.tex
+++ b/whitepaper/report.tex
@@ -2,8 +2,8 @@
 \newif\ifdgruyter
 \newif\ifanonymous
-\dgruytertrue
-% \lncstrue
+% \dgruytertrue
+\lncstrue
 \newif\ifonecolumn
 \newif\iftwocolumn
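The active-learning step described in the content-classification hunk above labels, in each iteration, the samples on which the classifier is most uncertain, i.e. those with the highest prediction entropy. A minimal sketch, assuming a `predict_proba` callable that returns per-class probabilities (a hypothetical stand-in for the trained SVM's probability output; the paper does not specify this interface):

```python
import math

def entropy(probabilities):
    """Shannon entropy (in nats) of a class-probability distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

def select_for_labeling(samples, predict_proba, batch_size):
    """Pick the samples with the highest prediction entropy for manual labeling."""
    ranked = sorted(samples, key=lambda s: entropy(predict_proba(s)),
                    reverse=True)
    return ranked[:batch_size]
```

A uniform distribution (maximum uncertainty) has the highest entropy and is selected first, so each manually labeled batch conveys the most information to the classifier, matching the strategy described above.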