
NLP

Word Embedding

A word embedding is an approach that provides a dense vector representation of words, capturing something about their meaning.

Word embeddings are an improvement over simpler bag-of-words encoding schemes, such as raw word counts and frequencies, which produce large, sparse vectors (mostly 0 values) that describe documents but not the meaning of the words they contain.

Word embeddings work by using an algorithm to train a set of fixed-length dense and continuous-valued vectors based on a large corpus of text. Each word is represented by a point in the embedding space and these points are learned and moved around based on the words that surround the target word.

Defining a word by the company it keeps is what allows a word embedding to learn something about the meaning of words. The vector space representation of the words provides a projection where words with similar meanings are locally clustered within the space.

The use of word embeddings over other text representations is one of the key methods that has led to breakthrough performance with deep neural networks on problems like machine translation.

Word2vec is one algorithm for learning a word embedding from a text corpus.

There are two main training algorithms that can be used to learn the embedding from text: continuous bag-of-words (CBOW) and skip-gram.

We will not get into the algorithms other than to say that they generally look at a window of words for each target word to provide context and in turn meaning for words. The approach was developed by Tomas Mikolov, formerly at Google and currently at Facebook.
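As a minimal sketch, here is roughly what training a word2vec embedding looks like, assuming the gensim library (which implements both CBOW and skip-gram); the toy corpus and parameter values are made up for illustration, and the keyword names follow gensim 4.x (`vector_size` was `size` in older releases):

```python
# Minimal word2vec sketch using gensim (assumed installed: pip install gensim).
# The toy corpus is illustrative only; useful embeddings need a large corpus.
from gensim.models import Word2Vec

# Each "sentence" is a pre-tokenized list of words.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=1 selects skip-gram; sg=0 (the default) selects CBOW.
# window controls how many surrounding words provide context for each target word.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)

print(model.wv["cat"])               # the learned 50-dimensional dense vector for "cat"
print(model.wv.most_similar("cat"))  # nearest neighbours of "cat" in the embedding space
```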

According to Wikipedia:

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with much lower dimension.

Further readings:

  • https://www.quora.com/What-does-the-word-embedding-mean-in-the-context-of-Machine-Learning/answer/Julien-Despois
  • https://www.tensorflow.org/tutorials/word2vec#motivation_why_learn_word_embeddings
  • https://www.zhihu.com/question/32275069

Awesome-Chinese-NLP: A curated list of resources for Chinese NLP 中文自然語言處理相關資料

Natural Language Processing Key Terms, Explained

Sentiment Analysis

Text Classification

Analyzing tf-idf results in scikit-learn - datawerk

Tf-idf stands for term frequency-inverse document frequency

Typically, the tf-idf weight is composed by two terms: the first computes the normalized Term Frequency (TF), aka. the number of times a word appears in a document, divided by the total number of words in that document; the second term is the Inverse Document Frequency (IDF), computed as the logarithm of the number of the documents in the corpus divided by the number of documents where the specific term appears.

  • TF: Term Frequency, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (i.e., the total number of terms in the document) as a way of normalization: TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
  • IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following: IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

See below for a simple example.

Example:

Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4 (a base-10 logarithm is used here; the choice of base only rescales all weights by a constant factor). Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
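A few lines of Python (a sketch using only the numbers from the example above) reproduce that calculation:

```python
import math

# Worked tf-idf example from above (base-10 logarithm for the idf term).
n_words_in_doc = 100          # total terms in the document
n_occurrences = 3             # times "cat" appears in the document
n_docs = 10_000_000           # documents in the corpus
n_docs_with_term = 1_000      # documents containing "cat"

tf = n_occurrences / n_words_in_doc            # 3 / 100 = 0.03
idf = math.log10(n_docs / n_docs_with_term)    # log10(10,000) = 4.0
tfidf = tf * idf                               # 0.03 * 4 = 0.12

print(tf, idf, tfidf)  # 0.03 4.0 0.12
```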

Chinese

Text Analysis using Machine Learning

Most of the algorithms accept only numerical feature vectors (a vector is a one-dimensional array in computer science). So we need to convert the text documents into numerical feature vectors of a fixed size in order to make use of machine learning algorithms for text analysis.

This can be done by the following steps:

  1. Assign each of the words in the text documents an integer ID. Each of the words is called a token. This step is called tokenization.
  2. Count the occurrences of tokens in each document. This step is called counting. The count of each token becomes a feature.
  3. Normalize and weight the counts so that tokens occurring in most documents carry diminishing importance. This step is called normalization (it is the tf and tf-idf re-weighting described below).

A simple worked example of the first two steps is shown below.
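This is a sketch assuming scikit-learn, whose CountVectorizer performs tokenization and counting in one call; the two tiny documents are made up for illustration, and the method names follow recent scikit-learn releases:

```python
# Bag-of-words sketch with scikit-learn's CountVectorizer (assumed installed).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog ate the cat food",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)    # tokenization + counting in one step

print(vectorizer.get_feature_names_out())  # the tokens; each column index acts as the integer ID
print(counts.toarray())                    # one row of token counts per document
```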

This process is called vectorization. The resulting numerical feature vectors are called a bag-of-words representation.

One issue with vectorization is that longer documents will have higher average count values than shorter documents, even when they talk about the same topics. The solution is to divide the number of occurrences of each word in a document by the total number of words in the document. These features are called term frequencies, or tf.

Another issue with vectorization is that in a large text corpus common words like "the", "a", and "is" will overshadow the rarer but more informative words during model induction. The solution is to downscale the weight of words that appear in many documents. This downscaling is called term frequency times inverse document frequency, or tf-idf.

I learnt the above from a scikit-learn tutorial.
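A minimal sketch of that tf-idf re-weighting, again assuming scikit-learn (TfidfVectorizer combines counting, tf scaling, and idf downweighting; the toy documents are illustrative only):

```python
# tf-idf sketch with scikit-learn's TfidfVectorizer (assumed installed).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog ate the cat food",
    "the mat is on the floor",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # counting, tf scaling and idf downweighting in one step

# "the" appears in every document, so it receives the lowest possible idf;
# words that appear in only one document, like "dog" or "floor", receive the highest idf.
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```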

According to Kaggle, pre-trained word embeddings are an example of pre-trained models. The following are the embeddings mentioned by Kaggle competitors:

Kaggle requires competitors to share the pre-trained models and word embeddings used to "keep the competition fair by making sure that everyone has access to the same data and pretrained models."

What are pre-trained models?

What is word embedding?

Some other tools:

Google AI Blog: Text summarization with TensorFlow

Transfer Learning

To be categorized