This is a small and reasonably performant implementation of TF-IDF written in Clojure.
There is only a single namespace, `dk.cst.tf-idf`. This namespace contains the core TF-IDF functions:
```clojure
(tf documents)     ; => seq of normalized term frequency maps
(idf documents)    ; => inverse document frequency map
(tf-idf documents) ; => seq of term->tf-idf maps
(vocab documents)  ; => set containing the vocabulary
```
These core functions all take a sequence of documents (usually just strings, although this depends on what `*tokenizer-xf*` is bound to) and return regular Clojure collections. To avoid redundant computation, the results of intermediate calculations can usually also be fed into the next step of the algorithm, as sketched below.
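For example, the TF-IDF scores can be computed once and shared between the utility functions shown further down (a sketch reusing only calls from this README; `documents` is any sequence of strings):

```clojure
;; Compute the TF-IDF maps once...
(let [scores (tf-idf documents)]
  ;; ...then reuse the result for different term selections.
  {:top-3-per-doc (top-n-terms 3 scores)
   :top-50-by-max (take 50 (order-terms max scores))})
```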
The `dk.cst.tf-idf` namespace also contains a few extra utility functions, e.g. functions for picking terms from TF-IDF results:
```clojure
;; Top 3 terms for every document.
(top-n-terms 3 (tf-idf documents))

;; Top 50 terms based on the highest recorded TF-IDF score.
(take 50 (order-terms max (tf-idf documents)))

;; Top 50 terms based on TF-IDF score sums.
(take 50 (order-terms + (tf-idf documents)))
```
The `*tokenizer-xf*` dynamic var holds the default transducer used to tokenize input documents. To perform other kinds of text normalization, this var can be rebound to an alternative implementation. The simplest way to create a new tokenizer transducer is to use the included `->tokenizer-xf` function:
```clojure
;; Assumes clojure.string has been required as str.
(binding [*tokenizer-xf* (->tokenizer-xf :tokenize #(str/split % #"\s"))]
  (tf-idf documents))
```
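Since the tokenizer is supplied as a plain function, other normalization steps can be folded into it. For instance, a sketch that also lower-cases the input before splitting on whitespace (only the `:tokenize` option shown above is assumed to exist):

```clojure
(binding [*tokenizer-xf* (->tokenizer-xf
                           :tokenize #(str/split (str/lower-case %) #"\s+"))]
  (vocab documents))
```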
This is a very brief explanation of the different terms used in TF-IDF.
vocab

- The list of all words considered in the corpus.
tf(d,t) = count(t in d) / count(x in d)
- How many times does the word/lemma appear in the document?
- Each frequency score is normalised by dividing by the total number of words in the text (count(x in d)).
- Only the frequency of terms in the vocab is considered!
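As a worked example (illustrative only; the exact output shape and number formatting of the library may differ):

```clojure
;; "the" appears 2 times out of 6 tokens, so tf("the") = 2/6 ≈ 0.33;
;; each remaining token appears once, giving tf = 1/6 ≈ 0.17.
(tf ["the cat sat on the mat"])
;; => e.g. ({"the" 0.33, "cat" 0.17, "sat" 0.17, "on" 0.17, "mat" 0.17})
```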
df(d,t) = count(d containing t)
- How many documents does the word/lemma appear in?
- Not normalised by default in this implementation, although you can always run `(normalize-frequencies df-result)` to achieve this.
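For instance, given two documents (a sketch; the `df-result` mentioned above suggests a `df` function producing a term->count map, but the exact name and output shape should be checked against the namespace):

```clojure
;; "the" occurs in both documents; "cat" and "dog" in one each.
(df ["the cat" "the dog"])
;; => e.g. {"the" 2, "cat" 1, "dog" 1}
```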
idf(d,t) = log(count(d) / (count(d containing t) + 1))
- The total number of documents, count(d), normalised by dividing it by the document frequency, count(d containing t).
- This has the opposite effect of df(d,t), as rarer words will have a higher inverse document frequency than common words.
- To avoid dividing by zero, 1 is added to count(d containing t).
- To keep very rare terms from having gigantic scores, the final value returned is actually the logarithm of this expression.
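Plugging numbers into the formula (a worked illustration, not actual library output):

```clojure
;; With 10 documents where the term occurs in 4 of them:
;; idf = log(10 / (4 + 1)) = log(2) ≈ 0.693
(Math/log (/ 10 (+ 4 1)))
;; => 0.6931471805599453
```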
tfidf(d,t) = tf(d,t) * idf(d,t)
- The product of the term frequency and the inverse document frequency.
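Combining the two worked examples above, a term with tf = 1/3 and idf = log(2) would score:

```clojure
;; tf(d,t) * idf(d,t) = 1/3 * log(2)
(* 1/3 (Math/log 2))
;; => ~0.231
```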
- https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089
- https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/
- https://stackoverflow.com/questions/42269313/interpreting-the-sum-of-tf-idf-scores-of-words-across-documents