Commit
References now working, still unknown command
It appears that some styles and some versions of LaTeX do not tolerate
spaces inside the citation command. The document now compiles with BibTeX
and includes the necessary references. However, one issue still remains
in the compilation with pdflatex, which I have not yet been able to resolve.
This fixes the first part of issue #27.
jogli5er committed Sep 24, 2018
1 parent c86e70b commit 332d8b6
Showing 2 changed files with 6 additions and 6 deletions.
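
For context, the substantive change in the darknet-body.tex hunks below is the same small fix applied four times: according to the commit message, some styles and LaTeX versions do not tolerate spaces inside a multi-key citation command, so the spaces after the commas are removed. A minimal before/after sketch, using citation keys that appear in the diff:

% Before: spaces after the commas, which some styles reportedly reject
\cite{Harman1991, hull1996stemming, krovetz1996word, SaltonTextProcessing}
% After: the same keys with the spaces inside the key list removed
\cite{Harman1991,hull1996stemming,krovetz1996word,SaltonTextProcessing}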
8 changes: 4 additions & 4 deletions whitepaper/darknet-body.tex
@@ -415,7 +415,7 @@ \subsection{Content Preprocessing}
For the following steps we need an array of terms instead of a string; thus we tokenize the content string into words.
Now that the language of the content is known, we remove stop words~\cite{McDowall} depending on the language. As it turns out, some of the content contains only stop words, resulting in empty content after stop-word removal. Such content is immediately marked as \emph{empty}.

- After stop word removal, we normalize the content further and apply stemming \cite{WormerStemmer} based on the porter stemmer \cite{porter1980algorithm} when appropriate. Previous work \cite{Harman1991, hull1996stemming, krovetz1996word,SaltonTextProcessing} suggests that stemming does not work equally well in all languages, hence we only apply stemming for English content. Lemmatization (determining the canonical form of a set of words depending on the context) is not considered since previous work showed that it can even degrade performance when applied to English content.
+ After stop word removal, we normalize the content further and apply stemming \cite{WormerStemmer} based on the porter stemmer \cite{porter1980algorithm} when appropriate. Previous work \cite{Harman1991,hull1996stemming,krovetz1996word,SaltonTextProcessing} suggests that stemming does not work equally well in all languages, hence we only apply stemming for English content. Lemmatization (determining the canonical form of a set of words depending on the context) is not considered since previous work showed that it can even degrade performance when applied to English content.

Next, we build a positional index, which allows us to choose whether the classification should operate on a list of words, a bag of words, or a set of words. Since we want to retain as much information as possible and employ an SVM classifier, we decided to use a bag of words approach for the classification of labels. However, the software provides the functionality for all three approaches, which makes it easier to experiment with different machine learning techniques on the collected data set. For the bag of words model, we need to store the frequency of each word; the frequency can be extracted from the positional index, which is discussed in the following subsection.
In addition, we keep the extracted clean content string to facilitate manual classification. This way, the clean content string can be presented directly to a human classifier in order to obtain the labeled training data set.
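
(Not part of the diff: a minimal sketch of the bag-of-words representation described in the paragraph above, with notation assumed rather than taken from the paper. The positional index stores, for each term $t$ and document $d$, the list of positions at which $t$ occurs; the bag-of-words feature vector keeps only the counts.)

\begin{align*}
  \mathrm{tf}(t, d) &= \bigl|\,\mathrm{positions}(t, d)\,\bigr|, &
  \mathbf{x}_d &= \bigl(\mathrm{tf}(t_1, d), \dots, \mathrm{tf}(t_{|V|}, d)\bigr),
\end{align*}

where $V$ denotes the vocabulary after stop-word removal and stemming.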
@@ -433,7 +433,7 @@ \subsection{Storing Content for Classification}
%---------------Content classification------------------
%
\subsection{Content Classification}
- As discussed in \cite{Nabki2017, Verma2013}, a Support Vector Machine (SVM) is well suited for text classification, mainly because the SVM efficiently copes with high dimensional features that are typical for text classification. Moreover, an SVM is easy to train and apply, compared to different types of neural networks. We use $k$-fold cross-validation and compare the output scores to decide which parameters and kernels are performing better. The parameters that deliver the best results are then used to train the model.
+ As discussed in \cite{Nabki2017,Verma2013}, a Support Vector Machine (SVM) is well suited for text classification, mainly because the SVM efficiently copes with high dimensional features that are typical for text classification. Moreover, an SVM is easy to train and apply, compared to different types of neural networks. We use $k$-fold cross-validation and compare the output scores to decide which parameters and kernels are performing better. The parameters that deliver the best results are then used to train the model.

We begin the classification process with a very small training set of about $200$ entries, in an attempt to minimize the size of the training set, and then apply active learning~\cite{Xu2009}.
Active learning requires knowledge of the classifier's uncertainty for each sample, since in every iteration we manually label the samples with the highest uncertainty. The idea is to manually label the entries with the highest entropy to convey the most information to the classifier.
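
(Also not part of the diff: a hedged sketch of the highest-entropy selection rule mentioned above, assuming the classifier exposes per-class probability estimates $\hat{P}(c \mid x)$, e.g. obtained via Platt scaling for an SVM; the paper's exact formulation is not shown in this commit.)

\begin{align*}
  H(x)  &= -\sum_{c \in \mathcal{C}} \hat{P}(c \mid x)\,\log \hat{P}(c \mid x), \\
  x^{*} &= \operatorname*{arg\,max}_{x \in \mathcal{U}} H(x),
\end{align*}

where $\mathcal{C}$ is the set of labels and $\mathcal{U}$ the pool of unlabeled samples; $x^{*}$ is labeled manually and added to the training set in each active-learning iteration.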
@@ -564,11 +564,11 @@ \section{Darknet Content}
\section{Related Work}\label{sec:relatedwork}

%crawling the clearnet
- Starting with the rise of the web, crawling has been a topic of interest for the academic community as well as society in general. Some of the earliest crawlers \cite{Gray1993,McBryan1994, Eichmann1994, Pinkerton1994} were presented just shortly after the web became public. However, since the early web was smaller than the visible darknet today, clearnet crawlers did not face the same issues as they do today. Castillo~\cite{Castillo2005} and Najork and Heydon~\cite{Najork2002} addressed a wide variety of issues, e.g. scheduling. Castillo found that applying different scheduling strategies works best. Both acknowledged the need to limit the number of requests sent to the targets, and described several approaches similar to ours.
+ Starting with the rise of the web, crawling has been a topic of interest for the academic community as well as society in general. Some of the earliest crawlers \cite{Gray1993,McBryan1994,Eichmann1994,Pinkerton1994} were presented just shortly after the web became public. However, since the early web was smaller than the visible darknet today, clearnet crawlers did not face the same issues as they do today. Castillo~\cite{Castillo2005} and Najork and Heydon~\cite{Najork2002} addressed a wide variety of issues, e.g. scheduling. Castillo found that applying different scheduling strategies works best. Both acknowledged the need to limit the number of requests sent to the targets, and described several approaches similar to ours.

%crawling the darknet
Contrary to the aforementioned work, we are interested in exploring the \emph{darknet}.
- Previous work~\cite{GeorgeKadianakis2015, TorMetricsOnion} has estimated the size of the Tor darknet using non-invasive methods that allow us to measure what portion of the darknet we have explored.
+ Previous work~\cite{GeorgeKadianakis2015,TorMetricsOnion} has estimated the size of the Tor darknet using non-invasive methods that allow us to measure what portion of the darknet we have explored.
Similarly to our work, Al Nabki et al.~\cite{Nabki2017} aim to understand the Tor network; they created a dataset of active Tor hidden services and explored different methods for content classification. However, instead of using a real crawler, they used a large initial dataset as a seed and then followed all links they found exactly twice. In turn, Moore and Rid~\cite{Moore2016} crawled to depth five, which appears to be sufficient to capture most of the darknet, as their results are similar to ours. Our crawler performs a true recursive search and as such captures the darknet topology in more detail, in contrast to both of these previous works.

%classifying the darknet with different methods
4 changes: 2 additions & 2 deletions whitepaper/report.tex
@@ -2,8 +2,8 @@
\newif\ifdgruyter
\newif\ifanonymous

- \dgruytertrue
- % \lncstrue
+ % \dgruytertrue
+ \lncstrue

\newif\ifonecolumn
\newif\iftwocolumn
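
(For readers unfamiliar with the \newif pattern: a hedged sketch of how report.tex presumably consumes these switches later in the preamble. The \newif\iflncs declaration and the concrete class names are assumptions and are not visible in this diff; the visible change merely flips the build target from the De Gruyter style to LNCS by swapping which of the two switches is commented out.)

\newif\ifdgruyter
\newif\iflncs              % assumed: declared in the elided lines, since \lncstrue is used
% \dgruytertrue
\lncstrue

\ifdgruyter
  \documentclass{dgruyter} % assumed name of the De Gruyter class
\fi
\iflncs
  \documentclass{llncs}    % Springer LNCS class
\fi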
