title | toc | date |
---|---|---|
Free text and natural language processing | false | 2017-10-30 |
Bag of words (词袋): represent each document as a vector of word frequencies.
!!! example "bag of words"
    * “The goal of this lecture is to explain the basics of free text processing”
    * “The bag of words model is one such approach”
    * “Text processing via bag of words”
![bag_of_words_example](figures/bag_of_words_example.png)
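
As a rough illustration, the three example sentences can be turned into word-count vectors with the standard library alone. This is only a sketch: the lowercasing, the whitespace tokenizer, and the names `bag_of_words`, `vocabulary`, and `X` are assumptions made for the example, not something the notes prescribe.

```python
from collections import Counter

documents = [
    "The goal of this lecture is to explain the basics of free text processing",
    "The bag of words model is one such approach",
    "Text processing via bag of words",
]

# Lowercase and split on whitespace (a deliberately naive tokenizer).
tokenized = [doc.lower().split() for doc in documents]

# The sorted vocabulary fixes which word each vector position refers to.
vocabulary = sorted({word for tokens in tokenized for word in tokens})


def bag_of_words(tokens, vocabulary):
    """Return the word-count vector of one document over the vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# One row per document, one column per vocabulary word.
X = [bag_of_words(tokens, vocabulary) for tokens in tokenized]
```

Each row of `X` has one entry per vocabulary word, so most entries are zero; practical implementations usually store such matrices in a sparse format.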
Term frequency (词频) just refers to the count of each word in a document, denoted $\text{tf}_{i,j}$: the frequency of word $j$ in document $i$.
Often (as in the example above), this just means the raw count, but there are other possibilities:
- $\text{tf}_{i,j} \in \{0, 1\}$: does the word occur in the document or not
- $\log(1 + \text{tf}_{i,j})$: log scaling of counts
- $\text{tf}_{i,j} / \max_j \text{tf}_{i,j}$: scale by the document’s most frequent word
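
The three term-frequency variants listed above can be sketched on a small count matrix. The counts below are made up for illustration, and the sketch assumes every document contains at least one word (otherwise the max-scaling step would divide by zero).

```python
import math

# Hypothetical raw counts: rows are documents, columns are words.
tf = [
    [2, 1, 0, 4],
    [0, 3, 1, 0],
]

# Binary: 1 if the word occurs in the document, 0 otherwise.
binary_tf = [[1 if count > 0 else 0 for count in row] for row in tf]

# Log scaling: dampens the effect of very frequent words.
log_tf = [[math.log(1 + count) for count in row] for row in tf]

# Scale each document by the count of its most frequent word.
max_scaled_tf = [[count / max(row) for count in row] for row in tf]
```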
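Term frequencies tend to be "overloaded" with very common words (“the”, “is”, “of”, etc.). The idea of inverse document frequency (逆文档频率) is to weight words negatively in proportion to how often they occur in the entire set of documents. A common definition is
$$\text{idf}_j = \log\left(\frac{\#\text{documents}}{\#\text{documents with word } j}\right)$$
Term frequency inverse document frequency (TFIDF) = $\text{tf}_{i,j} \times \text{idf}_j$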
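
A minimal sketch of the TFIDF weighting, assuming the common definition $\text{idf}_j = \log(N / n_j)$, where $N$ is the number of documents and $n_j$ is the number of documents containing word $j$. The count matrix is invented for the example, and the code assumes every word occurs in at least one document.

```python
import math

# Hypothetical term-frequency matrix: rows are documents, columns are words.
tf = [
    [2, 1, 0],
    [1, 0, 3],
    [1, 0, 0],
]
num_docs = len(tf)
num_words = len(tf[0])

# n_j: number of documents in which word j appears at least once.
doc_freq = [sum(1 for row in tf if row[j] > 0) for j in range(num_words)]

# idf_j = log(N / n_j); a word that occurs in every document gets weight 0.
idf = [math.log(num_docs / df) for df in doc_freq]

# TFIDF is the elementwise product of term frequency and idf.
tfidf = [[row[j] * idf[j] for j in range(num_words)] for row in tf]
```

Here the first word appears in all three documents, so its TFIDF weight is zero everywhere, which is the intended effect for words like “the”.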
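Given two documents $x$ and $y$ represented as TFIDF (or plain term frequency) vectors, their cosine similarity is
$$\text{sim}(x, y) = \frac{x^T y}{\|x\|_2 \, \|y\|_2}$$
The result is between zero and one; higher numbers mean the documents are more similar.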
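
A similarity with this range for nonnegative vectors is the cosine similarity, the dot product of the two document vectors divided by the product of their Euclidean norms. The sketch below uses made-up vectors and assumes neither vector is all zeros.

```python
import math


def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Hypothetical document vectors (for example, rows of a TFIDF matrix).
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # same direction as doc_a
doc_c = [0.0, 0.0, 5.0]  # shares no words with doc_a

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
```

Because the measure depends only on direction, `doc_a` and `doc_b` come out as (almost exactly) 1 even though one is twice as long, while documents with no words in common score 0.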
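We can create a “bag of words” from word embedding vectors instead of term frequency vectors (see also the doc2vec model), for example by summing the embedding vectors of the words in the document:
$$x_{\text{doc}} = \sum_{\text{word} \in \text{doc}} w_{\text{word}}$$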
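
One simple way to do this is to add up the embedding vectors of the words in a document. The sketch below assumes that approach; the tiny 3-dimensional `embeddings` table and the `document_vector` helper are hypothetical stand-ins for a real pretrained model.

```python
# Hypothetical 3-dimensional embeddings; real models use hundreds of dimensions.
embeddings = {
    "text": [0.2, 0.1, 0.7],
    "processing": [0.4, 0.3, 0.1],
    "bag": [0.9, 0.0, 0.2],
}


def document_vector(tokens, embeddings):
    """Sum the embeddings of the known words in a document."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue  # skip words that have no embedding
        total = [t + v for t, v in zip(total, vector)]
    return total


doc_vec = document_vector(["text", "processing", "via", "bag"], embeddings)
```

Dividing the sum by the number of matched words gives an average instead, which keeps long and short documents on a comparable scale.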
The good news is that you don’t need to create these models yourself: there are publicly available "pretrained" models that simply store the embeddings for a large number of words.
The "original" word2vec model, trained on 100 billion words from Google News documents, with a vocabulary of 3 million unique words: https://code.google.com/archive/p/word2vec/
A (probabilistic) language model (语言模型) aims to provide a probability distribution over every word, given all the words before it:
$$P(\text{word}_i \mid \text{word}_1, \ldots, \text{word}_{i-1})$$
An $n$-gram model approximates this by conditioning only on the previous $n-1$ words:
$$P(\text{word}_i \mid \text{word}_1, \ldots, \text{word}_{i-1}) \approx P(\text{word}_i \mid \text{word}_{i-n+1}, \ldots, \text{word}_{i-1})$$
A simple way (but not the only way) to estimate the conditional probabilities is by counting:
$$P(\text{word}_i \mid \text{word}_{i-n+1}, \ldots, \text{word}_{i-1}) = \frac{\#(\text{word}_{i-n+1}, \ldots, \text{word}_i)}{\#(\text{word}_{i-n+1}, \ldots, \text{word}_{i-1})}$$
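
The counting estimate can be sketched directly with `collections.Counter`. The toy corpus and the helper names `ngram_counts` and `ngram_probability` are made up for this example, and the code assumes a non-empty context (that is, $n \ge 2$).

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every window of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_probability(word, context, tokens):
    """Estimate P(word | context) as #(context + word) / #(context)."""
    context = tuple(context)
    n = len(context) + 1
    numerator = ngram_counts(tokens, n)[context + (word,)]
    denominator = ngram_counts(tokens, n - 1)[context]
    # An unseen context has no estimate; return 0.0 rather than divide by zero.
    return numerator / denominator if denominator else 0.0


corpus = "the cat sat on the mat the cat ran".split()

# "the cat" occurs twice and is followed by "sat" once.
p = ngram_probability("sat", ["the", "cat"], corpus)
```

Any continuation that never appeared after "the cat" in the corpus, such as "mat", gets probability zero under this estimate.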
Estimating language models with raw counts tends to produce a lot of zero probabilities (especially when estimating the probability of new text that was not used to build the model).
Simple solution (Laplace smoothing): allow any word to appear with some small probability
$$P(\text{word}_i \mid \text{word}_{i-n+1}, \ldots, \text{word}_{i-1}) = \frac{\#(\text{word}_{i-n+1}, \ldots, \text{word}_i) + \alpha}{\#(\text{word}_{i-n+1}, \ldots, \text{word}_{i-1}) + \alpha D}$$
where $\alpha$ is a small positive constant and $D$ is the total number of unique words.
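
A small sketch of the smoothed estimate, taking $D$ to be the number of unique words in the corpus (the value that makes the smoothed probabilities sum to one). The toy corpus, the function name `smoothed_probability`, and the default `alpha=1.0` are assumptions made for the example.

```python
from collections import Counter


def smoothed_probability(word, context, tokens, alpha=1.0):
    """(#(context + word) + alpha) / (#(context) + alpha * D)."""
    context = tuple(context)
    n = len(context) + 1
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(
        tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2)
    )
    vocab_size = len(set(tokens))  # D: number of unique words
    numerator = grams[context + (word,)] + alpha
    denominator = contexts[context] + alpha * vocab_size
    return numerator / denominator


corpus = "the cat sat on the mat the cat ran".split()

p_seen = smoothed_probability("sat", ["the", "cat"], corpus)    # (1 + 1) / (2 + 6)
p_unseen = smoothed_probability("mat", ["the", "cat"], corpus)  # (0 + 1) / (2 + 6)
```

The unseen continuation now gets a small nonzero probability, and the probabilities over all six vocabulary words still sum to one for this context.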
How do we pick $n$?
- Lower $n$: less context, but more samples of each possible $n$-gram
- Higher $n$: more context, but fewer samples
- The “correct” choice is to use some measure of held-out cross-validation
- In practice: use $n = 3$ for large datasets (i.e., triplets), $n = 2$ for small ones
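Evaluating language models: a common strategy is to estimate the probability of some held-out portion of the data and evaluate perplexity:
$$\text{Perplexity} = 2^{-\frac{\log_2 P(\text{word}_1, \ldots, \text{word}_N)}{N}} = \left(\frac{1}{P(\text{word}_1, \ldots, \text{word}_N)}\right)^{1/N}$$
where we can evaluate the probability using
$$P(\text{word}_1, \ldots, \text{word}_N) = \prod_{i=n}^{N} P(\text{word}_i \mid \text{word}_{i-n+1}, \ldots, \text{word}_{i-1})$$
Perplexity on the corpus used to build the model will always decrease with higher $n$ (fewer choices of what comes next means a higher probability for the observed data).
Note: this is only strictly true when $\alpha = 0$.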
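
To make the evaluation concrete, here is a rough sketch that computes perplexity for a bigram model with additive smoothing on held-out text. Several choices are assumptions of this example rather than part of the notes: the toy data, the function name `bigram_perplexity`, counting $D$ over both training and held-out words, and normalizing by the number of predicted words.

```python
import math
from collections import Counter


def bigram_perplexity(train, test, alpha=1.0):
    """Perplexity of the test tokens under a smoothed bigram model."""
    bigrams = Counter(zip(train, train[1:]))
    unigrams = Counter(train)
    vocab_size = len(set(train) | set(test))  # D, including unseen test words
    log2_prob = 0.0
    for prev, word in zip(test, test[1:]):
        p = (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)
        log2_prob += math.log2(p)
    num_predictions = len(test) - 1  # the first word has no context
    return 2 ** (-log2_prob / num_predictions)


train = "the cat sat on the mat the cat ran".split()

# Text whose bigrams appear in the training data.
ppl_seen = bigram_perplexity(train, "the cat sat".split())
# Text made of known words in an order never seen in training.
ppl_unseen = bigram_perplexity(train, "mat ran the".split())
```

Lower perplexity means the model found the text less surprising, so the familiar word order scores better than the unfamiliar one. Comparing held-out perplexity across values of $n$ is one way to carry out the validation-based choice of $n$.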
The NLTK (natural language toolkit) library (http://www.nltk.org) is a standard Python library for handling text and natural language data.

    # Load nltk and download the necessary files
    import nltk
    import nltk.corpus
    # nltk.download()  # just run this once

    # Tokenize a document
    sentence = "The goal of this lecture isn't to explain complex free text processing"
    tokens = nltk.word_tokenize(sentence)
    # > ['The', 'goal', 'of', 'this', 'lecture', 'is', "n't", 'to',
    #    'explain', 'complex', 'free', 'text', 'processing']

    # Tag parts of speech
    pos = nltk.pos_tag(tokens)
    # > [('The', 'DT'), ('goal', 'NN'), ('of', 'IN'), ('this', 'DT'), ('lecture', 'NN'),
    #    ('is', 'VBZ'), ("n't", 'RB'), ('to', 'TO'), ('explain', 'VB'), ('complex', 'JJ'),
    #    ('free', 'JJ'), ('text', 'NN'), ('processing', 'NN')]

    # Get the list of English stop words (common words);
    # the corpus file name is lowercase
    stopwords = nltk.corpus.stopwords.words("english")

    # Generate n-grams from the document
    list(nltk.ngrams(tokens, 3))
    # > [('The', 'goal', 'of'), ('goal', 'of', 'this'), ('of', 'this', 'lecture'),
    #    ('this', 'lecture', 'is'), ('lecture', 'is', "n't"), ('is', "n't", 'to'),
    #    ("n't", 'to', 'explain'), ('to', 'explain', 'complex'),
    #    ('explain', 'complex', 'free'), ('complex', 'free', 'text'),
    #    ('free', 'text', 'processing')]