A word embedding is an approach that provides a dense vector representation of words, one that captures something about their meaning.
Word embeddings are an improvement over simpler bag-of-words encoding schemes, such as word counts and frequencies, which produce large, sparse vectors (mostly 0 values) that describe documents but not the meaning of the words.
Word embeddings work by using an algorithm to train a set of fixed-length, dense, continuous-valued vectors on a large corpus of text. Each word is represented by a point in the embedding space, and these points are learned and moved around based on the words that surround the target word.
It is this defining of a word by the company it keeps that allows a word embedding to learn something about the meaning of words. The vector space representation of the words provides a projection in which words with similar meanings are clustered together locally within the space.
The use of word embeddings over other text representations is one of the key methods that has led to breakthrough performance with deep neural networks on problems like machine translation.
Word2vec is one algorithm for learning a word embedding from a text corpus.
There are two main training algorithms that can be used to learn the embedding from text: continuous bag-of-words (CBOW) and skip-gram.
We will not get into the algorithms other than to say that they generally look at a window of words around each target word to provide context and, in turn, meaning for words. The approach was developed by Tomas Mikolov, formerly at Google and currently at Facebook.
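To make the CBOW/skip-gram distinction concrete, here is a minimal training sketch using gensim (a library choice of mine, not mentioned above; the toy corpus and parameter values are illustrative, not tuned):

```python
# Minimal word2vec training sketch with gensim (assumed version >= 4.0).
from gensim.models import Word2Vec

# Each document is a list of already-tokenized words.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=0 selects CBOW (predict the target word from its context window);
# sg=1 would select skip-gram (predict the context from the target word).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# Every word is now a fixed-length dense vector.
print(model.wv["cat"].shape)          # (50,)
print(model.wv.most_similar("cat"))   # neighbours in the embedding space
```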
- fast.ai NLP · Practical NLP
- Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) – Jay Alammar – Visualizing machine learning one concept at a time
- Intuitive Understanding of Word Embeddings: Count Vectors to Word2Vec
- How to get started in NLP – Towards Data Science
- Data Analysis & XGBoost Starter (0.35460 LB) | Kaggle
- Bag of Words Meets Bags of Popcorn | Kaggle
- Working With Text Data — scikit-learn 0.19.1 documentation
- sloria/TextBlob: Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.
- Getting Started with spaCy for Natural Language Processing
- How I lost a silver medal in Kaggle’s Mercari Price Suggestion Challenge using CNNs and Tensorflow
- Understanding Feature Engineering (Part 4) — Deep Learning Methods for Text Data
- fastText/pretrained-vectors.md at master · facebookresearch/fastText
- Kyubyong/nlp_tasks: Natural Language Processing Tasks and References
- xiamx/awesome-sentiment-analysis: 😀😄😂😭 A curated list of Sentiment Analysis methods, implementations and misc. 😥😟😱😤
- The Essential NLP Guide for data scientists (codes for top 10 NLP tasks)
- What is TF-IDF? The 10 minute guide
- NLP: Any libraries/dictionaries out there for fixing common spelling errors? - Part 2 & Alumni - Deep Learning Course Forums
- How To Create a ChatBot With tf-seq2seq For Free! – Deep Learning as I See It
- How to easily do Topic Modeling with LSA, PSLA, LDA & lda2Vec
- Facebook Open Sources Dataset on NLP and Navigation Every Data Scientist should Download
According to Wikipedia:
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with much lower dimension.
Further reading:
- https://www.quora.com/What-does-the-word-embedding-mean-in-the-context-of-Machine-Learning/answer/Julien-Despois
- https://www.tensorflow.org/tutorials/word2vec#motivation_why_learn_word_embeddings
- https://www.zhihu.com/question/32275069
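As a toy illustration of that definition (my own sketch, not taken from the readings above): an embedding is just a matrix whose rows are the dense word vectors, and the mapping from the one-dimension-per-word space is a row lookup:

```python
# One-hot (one dimension per word) -> dense, lower-dimensional vector.
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]   # |V| = 5
embedding_dim = 3                            # much smaller than |V| in practice
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), embedding_dim))  # learned in a real model

one_hot = np.zeros(len(vocab))
one_hot[vocab.index("cat")] = 1.0

# Multiplying a one-hot vector by E just selects a row of E:
assert np.allclose(one_hot @ E, E[vocab.index("cat")])
print(E[vocab.index("cat")])  # the dense embedding for "cat"
```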
- Awesome-Chinese-NLP: A curated list of resources for Chinese NLP (Chinese natural language processing resources)
- Natural Language Processing Key Terms, Explained
- How can I tokenize a sentence with Python?
- A roundup of NLP code repositories and resources, from beginner to advanced (continuously updated)
- Allen Institute for AI releases AllenNLP: an NLP toolkit based on PyTorch
- Deep Learning for NLP Best Practices
- Topic Modelling Financial News with Natural Language Processing
- Best Practices for Document Classification with Deep Learning
- Natural Language Processing in Artificial Intelligence is almost human-level accurate. Worse yet, it gets smart!
- Word vectors for non-NLP data and research people
- A beginner's guide: applications of neural networks in natural language processing
- A gentle introduction to Doc2Vec
- Word embeddings in 2017: Trends and future directions
- Word Embedding in Deep Learning
- Using pre-trained word embeddings in a Keras model
- Deep Learning for Natural Language Processing: 2016-2017
- An experiment in NLP with CNNs on Spark/TensorFlow
- Understanding Convolutional Neural Networks for NLP
- Embedding projector - visualization of high-dimensional data
- Pytorch implementations of various Deep NLP models in cs-224n(Stanford Univ)
- Stop Using word2vec
- Making machines converse like humans: Jiwei Li's Stanford PhD dissertation
- Gentle Introduction to Statistical Language Modeling and Neural Language Models
- Dan Jurafsky & Chris Manning: Natural Language Processing (great intro video series)
- A simple spell checker built from word vectors – Ed Rushton – Medium
- Data Science 101 (Getting started in NLP): Tokenization tutorial | No Free Hunch
- Vector Representations of Words | TensorFlow (highly recommended by Jeremy)
- NLP — Building a Question Answering model – Towards Data Science
- Entity extraction using Deep Learning based on Guillaume Genthial work on NER
- Text Classification using machine learning – Nitin Panwar – Technical Lead (Data Science), Naukri.com
- Unsupervised sentence representation with deep learning
- How to solve 90% of NLP problems: a step-by-step guide (11.3k claps!)
- Building a FAQ Chatbot in Python – The Future of Information Searching
- Sentiment analysis on Trump's tweets using Python
- Improving Airbnb Yield Prediction with Text Mining – Towards Data Science
- Machine Learning with Text in scikit-learn (PyCon 2016) - YouTube
- Natural Language Processing Nuggets: Getting Started with NLP
- Machine Learning as a Service: Part 1 – Towards Data Science
- Text Generation using a RNN
- Text Classification Using CNN, LSTM and Pre-trained Glove Word Embeddings: Part-3
- Ahmed BESBES - Data Science Portfolio – Overview and benchmark of traditional and deep learning models in text classification
- The 7 NLP Techniques That Will Change How You Communicate in the Future (Part I)
- Natural Language Processing: What are algorithms for auto summarize text? - Quora
- A Practitioner's Guide to Natural Language Processing (Part I) — Processing & Understanding Text
- Salesforce has Developed One Single Model to Deal with 10 Different NLP Tasks
- Samsung's ConZNet Algorithm just won Two Popular NLP Challenges (Dataset Links Inside)
- Detecting Sarcasm with Deep Convolutional Neural Networks
- Detecting Emotions with CNN Fusion Models – DAIR – Medium
- What is the best tool to summarize a text document? - Quora
- Text Classification: Applications and Use Cases - ParallelDots
- Major translation update: Oxford x DeepMind Natural Language Processing, Lecture 10: Text-to-Speech (2) | 機器之心
- Technical deep dive: practical word sense disambiguation with fastText and RNNs | 機器之心
- Ye Zhihao: an introduction to reinforcement learning and its applications in NLP | talk summary | 雷鋒網
- Convolutional neural networks for language tasks - O'Reilly Media
- A Comprehensive Guide to Understand and Implement Text Classification in Python
- Jeremy Howard on Twitter: "Very interesting - combining recent work from @lnsmith613 & @GuggerSylvain on super-convergence with a transformer language model (@AlecRad) shows dramatic improvements in perplexity, speed, and size over @Smerity's very strong AWD LSTM! https://t.co/t6LbAKap3M https://t.co/xI9E8zHZP8"
- kororo/excelcy: Excel Integration with SpaCy. Includes entity training and an entity matcher pipe.
- dongjun-Lee/text-classification-models-tf: Tensorflow implementations of Text Classification Models.
- dongjun-Lee/transfer-learning-text-tf: Tensorflow implementation of Semi-supervised Sequence Learning (https://arxiv.org/abs/1511.01432)
- IndicoDataSolutions/finetune: Scikit-learn style model finetuning for NLP
- [1807.00914] Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing
- Introducing state of the art text classification with universal language models · fast.ai NLP
- Shannon.AI's exclusive conversation with Stanford professor and MacArthur Fellow Dan Jurafsky | 機器之心
- An overview of NLP and a detailed guide to automatic text classification algorithms | 機器之心
- Multi-Class Text Classification with Scikit-Learn – Towards Data Science
- Comparison of Top 6 Python NLP Libraries – ActiveWizards: machine learning company – Medium
- Professor Hang Li on the field of natural language dialogue: the present and the future | 機器之心
- The unreasonable effectiveness of one neuron | Rakesh Chada's Blog
- Which companies are far ahead in the NLP field? (report attached) - CSDN blog
- ACL 2018: attention mechanisms dominate, and the Chinese grammatical error detection task draws notice | ACL 2018 | 雷鋒網
- harvardnlp/var-attn
- Highlights from Professor Min Zhang's (Soochow University) two-hour lecture: NLP methods and applications | 雷锋网
- Named Entity Recognition and Classification with Scikit-Learn
- A study guide and related materials for NLTK, the Python natural language processing toolkit
- Perform sentiment analysis with LSTMs, using TensorFlow - O'Reilly Media
- Data Science 101: Sentiment Analysis in R Tutorial | No Free Hunch
- Sentiment Analysis through LSTMs – Towards Data Science
- A Beginner’s Guide on Sentiment Analysis with RNN – Towards Data Science
- Twitter Sentiment Analysis using combined LSTM-CNN Models – B-sides
- Comprehensive Hands on Guide to Twitter Sentiment Analysis with dataset & code
- Basic examples of text mining and sentiment analysis | ATYUN
- Big Picture Machine Learning: Classifying Text with Neural Networks and TensorFlow
- Step 2.5: Choose a Model | ML Universal Guides | Google Developers
Analyzing tf-idf results in scikit-learn - datawerk
Tf-idf stands for term frequency–inverse document frequency.
Typically, the tf-idf weight is composed of two terms: the first computes the normalized term frequency (TF), i.e. the number of times a word appears in a document divided by the total number of words in that document; the second is the inverse document frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.
- TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term will often appear many more times in a long document than in a short one. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization: TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
- IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times yet carry little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing: IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
See below for a simple example.
Example:
Consider a document containing 100 words in which the word cat appears 3 times. The term frequency (i.e., tf) for cat is then 3 / 100 = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. The inverse document frequency (i.e., idf) is then calculated as log(10,000,000 / 1,000) = 4 (this example uses the base-10 logarithm; the choice of base only rescales every weight by the same constant). Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
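The worked example can be checked in a few lines of Python (my own sketch, using the base-10 logarithm to match the idf of 4 above):

```python
# Reproduce the cat example: tf = 3/100, idf = log10(10M / 1k), tf-idf = 0.12.
import math

def tf(term_count: int, doc_length: int) -> float:
    # Term frequency, normalized by document length.
    return term_count / doc_length

def idf(num_docs: int, docs_with_term: int) -> float:
    # Inverse document frequency (base-10 log, as in the example above).
    return math.log10(num_docs / docs_with_term)

tf_cat = tf(3, 100)               # 0.03
idf_cat = idf(10_000_000, 1_000)  # 4.0
print(tf_cat * idf_cat)           # 0.12
```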
- Chinese Word Vectors: currently the most complete collection of pre-trained Chinese word vectors | 機器之心
- Embedding/Chinese-Word-Vectors: 100+ pre-trained Chinese word vectors
Most machine learning algorithms accept only numerical feature vectors (a vector is a one-dimensional array in computer science). So we need to convert the text documents into fixed-size numerical feature vectors in order to apply machine learning algorithms to text analysis.
This can be done by the following steps:
- Assign each of the words in the text documents an integer ID. Each word is called a *token*. This step is called *tokenization*.
- Count the occurrences of tokens in each document. This step is called *counting*. The count of each token becomes a feature.
- Reweight the raw counts so that neither document length nor very common words dominate, for example with the tf and tf-idf transformations described below. This step is called *normalization*.
This process is called *vectorization*. The resulting numerical feature vectors are called a *bag-of-words* representation.
One issue with vectorization is that longer documents will have higher average count values than shorter documents, even when they talk about the same topics. The solution is to divide the number of occurrences of each word in a document by the total number of words in that document. These new features are called *term frequencies* or *tf*.
Another issue with vectorization is that in a large text corpus common words like "the", "a", and "is" will overshadow the rare words during model induction. The solution is to downscale the weight of words that appear in many documents. This downscaling is called *term frequency times inverse document frequency*, or *tf-idf*.
I learnt the above from a scikit-learn tutorial; a short worked sketch follows below.
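Here is a minimal sketch of that tokenize, count, and normalize pipeline using scikit-learn's vectorizer classes (the two example documents are my own):

```python
# Vectorization: tokenization + counting, then tf-idf normalization.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [
    "The cat sat on the mat.",
    "The dog sat on the rug.",
]

# Tokenization + counting: each column is a token, each row a document.
counts = CountVectorizer().fit_transform(docs)

# Normalization: reweight the raw counts into tf-idf values.
tfidf = TfidfTransformer().fit_transform(counts)

print(counts.toarray())  # sparse bag-of-words counts
print(tfidf.toarray())   # common words like "the" are downscaled
```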
According to Kaggle, a word embedding is an example of a *pre-trained model*. The following are the embeddings mentioned by Kaggle competitors:
Kaggle requires competitors to share the pre-trained models and word embeddings used to "keep the competition fair by making sure that everyone has access to the same data and pretrained models."
What is a *pre-trained model*?
What is a *word embedding*?
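For illustration, here is a minimal sketch of loading a pre-trained embedding with gensim (assumptions on my part: gensim is installed, and the word2vec-format file GoogleNews-vectors-negative300.bin has already been downloaded):

```python
# Load pre-trained word vectors instead of training them yourself.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True  # assumed local file
)
print(vectors["king"].shape)                 # (300,)
print(vectors.most_similar("king", topn=3))  # nearest neighbours
```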
Some other tools:
- Google AI Blog: Text summarization with TensorFlow
- NervanaSystems/nlp-architect: NLP Architect by Intel AI Lab: Python library for exploring the state-of-the-art deep learning topologies and techniques for natural language processing and natural language understanding
- NLTK Book
- NLP in Online Courses: an Overview – sciforce – Medium
- Home: AAN
- A simple spell checker built from word vectors – Noteworthy - The Journal Blog
- facebookresearch/fastText: Library for fast text representation and classification.
- Revisiting the minimum entropy principle: sentence templates and language structure | with an open-source NLP library | 机器之心
- The "minimum entropy principle" seen through unsupervised lexicon construction: how the techniques take shape
- gt-nlp-class/notes at master · jacobeisenstein/gt-nlp-class
- How to improve abstractive sentence summarization from both the encoder and the decoder sides | 机器之心
- Deep Learning for Conversational AI
- 📚The Current Best of Universal Word Embeddings and Sentence Embeddings
- ryanjgallagher.github.io/2018-SICSS-InfoTheoryTextAnalysis-Gallagher.pdf at master · ryanjgallagher/ryanjgallagher.github.io
- nmt_with_attention.ipynb - Colaboratory
- minimaxir/textgenrnn: Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.
- plasticityai/magnitude: A fast, efficient universal vector embedding utility package.
- sebastianruder/NLP-progress: Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
- HUBOT | Hubot is your friendly robot sidekick. Install him in your company to dramatically improve employee efficiency.
- lanwuwei/SPM_toolkit: Neural network toolkit for sentence pair modeling.
- Holy NLP! Understanding Part of Speech Tags, Dependency Parsing, and Named Entity Recognition • Peter Baumgartner
- bfelbo/DeepMoji: State-of-the-art deep learning model for analyzing sentiment, emotion, sarcasm etc.
- Dong Yu, deputy director of Tencent AI Lab: the state of the art and progress in speech recognition | 机器之心
- Word2Vec — a baby step in Deep Learning but a giant leap towards Natural Language Processing
- IBM Unveils System That ‘Debates’ With Humans - The New York Times
- nateraw/Lda2vec-Tensorflow: Tensorflow 1.5 implementation of Chris Moody's Lda2vec, adapted from @meereeum
- LDA2vec: Word Embeddings in Topic Models (article) - DataCamp
- cemoody/lda2vec
- The Natural Language Decathlon
- 🚀 100 Times Faster Natural Language Processing in Python
- Deep-learning-free Text and Sentence Embedding, Part 2 – Off the convex path
- Answering English questions using knowledge graphs and sequence translation
- LDA2vec: Word Embeddings in Topic Models – Towards Data Science
- Rasa: Open source conversational AI
- Deep Learning for Natural Language Processing: Tutorials with Jupyter Notebooks
- Keras LSTM tutorial - How to easily build a powerful deep learning language model - Adventures in Machine Learning
- CS224n: Natural Language Processing with Deep Learning
- Transfer Learning for Text using Deep Learning Virtual Machine (DLVM) | Machine Learning Blog
- NLP's ImageNet moment has arrived
- Generating Text with RNNs in 4 Lines of Code
- The world's first multilingual intelligent customer service goes live: what is this mysterious AI company's secret weapon? | 機器之心
- Transfer Learning in Natural Language Processing | Intel® Software
- dmlc/gluon-nlp: NLP made easy
- Introduction | ML Universal Guides | Google Developers
- AutoML Natural Language Beginner's guide | AutoML Natural Language | Google Cloud
- The 7 NLP Techniques That Will Change How You Communicate in the Future (Part II)
- Has AI surpassed humans at translation? Not even close! – Skynet Today
- ACL 2018 Highlights: Understanding Representations and Evaluation in More Challenging Settings - AYLIEN
- Award ceremony of the 2018 Machine Reading Comprehension Competition: champion Naturali shares new ideas for question answering systems | 雷锋网
- salesforce/decaNLP: The Natural Language Decathlon: A Multitask Challenge for NLP
- Interview with Tencent's Zhong Li: the Zhiwen team's explorations in intelligent question answering | 雷鋒網
- The Tsinghua-Chinese Academy of Engineering joint lab on knowledge intelligence releases its "2018 Natural Language Processing Research Report"
- Natural Language Processing is Fun! – Adam Geitgey – Medium
- kororo/excelcy: Excel Integration with spaCy. Training NER using Excel/XLSX from PDF, DOCX, PPT, PNG or JPG.
- brightmart/ai_law: all kinds of baseline models for long text classification (text categorization)
- Qin Bing (Harbin Institute of Technology): text sentiment computing in machine intelligence | CCF-GAIR 2018 | 雷锋网
- The Real Problems with Neural Machine Translation | Delip Rao
- Quicksilver - A Natural Language Processing System that Writes Wikipedia Entries
- Drake — Using Natural Language Processing to understand his lyrics
- VerbiAge: Using NLP to help writers craft age-specific writing
- faneshion/MatchZoo: MatchZoo is a toolkit for text matching. It was developed to facilitate the designing, comparing, and sharing of deep text matching models.
- Named Entity Recognition Tagging
- After a computer learns that 「天天」 means "every day", how do you keep it from deciding that 「爸爸」 means "every dad"? | 雷鋒網
- On word embeddings - Part 1
- The Annotated Transformer
- Text Analytics - Azure Machine Learning Studio | Microsoft Docs
- Embrace the noise: A case study of text annotation for medical imaging | LightTag - The easy way to annotate text
- Generating Natural-Language Text with Neural Networks
- Using Artificial Intelligence to Fix Wikipedia's Gender Problem | WIRED
- Introduction to NLP – machinelearning-blog.com
- A Word is Worth a Thousand Vectors | Stitch Fix Technology – Multithreaded
- dedupeio/dedupe: A python library for accurate and scaleable fuzzy matching, record deduplication and entity-resolution.
- Basics of Entity Resolution — District Data Labs: Data Science Consulting and Training
- Text Analytics with Yellowbrick — District Data Labs: Data Science Consulting and Training
- ml-meetup-feb2017
- Convolutional Methods for Text – Tal Perry – Medium
- Machine-Generated Knowledge Bases
- 2018 Machine Reading Comprehension Competition
- NLP and knowledge graph reference resources - CSDN blog
- WTF is TF-IDF?
- Breakfast with AI – Fireflies.ai Blog
- An overview of automatic question answering systems based on deep neural networks
- Introduction to NLP – Towards Data Science
- Multi-Task Learning Objectives for Natural Language Processing
- Fully-parallel text generation for neural machine translation
- ⛵ Learning Meaning in Natural Language Processing - The Semantics Mega-Thread
- A NLP Guide to Text Classification using Conditional Random Fields
- Neural Tagger Implementations
- Google AI Blog: Moving Beyond Translation with the Universal Transformer
- Named Entity Recognition with NLTK and SpaCy – Towards Data Science
- zalandoresearch/flair: A very simple framework for state-of-the-art NLP
- chakki-works/doccano: Open source text annotation tool for machine learning practitioner.
- A detailed look at deep learning for named entity recognition (NER) | 机器之心