
NLP

Word Embedding

A word embedding is an approach that provides a dense vector representation of words, capturing something about their meaning.

Word embeddings are an improvement over simpler bag-of-words encoding schemes, such as raw word counts and frequencies, which produce large, sparse vectors (mostly 0 values) that describe documents but not the meaning of the words they contain.

Word embeddings work by using an algorithm to train a set of fixed-length dense and continuous-valued vectors based on a large corpus of text. Each word is represented by a point in the embedding space and these points are learned and moved around based on the words that surround the target word.

Defining a word by the company it keeps is what allows a word embedding to learn something about the meaning of words. The vector space representation of the words provides a projection where words with similar meanings are locally clustered within the space.

The use of word embeddings over other text representations is one of the key methods that has led to breakthrough performance with deep neural networks on problems like machine translation.

Word2vec is one algorithm for learning a word embedding from a text corpus.

There are two main training algorithms that can be used to learn the embedding from text: continuous bag-of-words (CBOW) and skip-gram.

We will not get into the algorithms other than to say that they generally look at a window of words for each target word to provide context and in turn meaning for words. The approach was developed by Tomas Mikolov, formerly at Google and currently at Facebook.
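As a minimal sketch, here is roughly what training a word2vec embedding looks like, assuming the gensim library (which implements both CBOW and skip-gram); the toy corpus and parameter values are made up for illustration, and the keyword names follow gensim 4.x (`vector_size` was `size` in older releases):

```python
# Minimal word2vec sketch using gensim (assumed installed: pip install gensim).
# The toy corpus is illustrative only; useful embeddings need a large corpus.
from gensim.models import Word2Vec

# Each "sentence" is a pre-tokenized list of words.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=1 selects skip-gram; sg=0 (the default) selects CBOW.
# window controls how many surrounding words provide context for each target word.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)

print(model.wv["cat"])               # the learned 50-dimensional dense vector for "cat"
print(model.wv.most_similar("cat"))  # nearest neighbours of "cat" in the embedding space
```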

According to Wikipedia:

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with much lower dimension.

Further readings:

  • https://www.quora.com/What-does-the-word-embedding-mean-in-the-context-of-Machine-Learning/answer/Julien-Despois
  • https://www.tensorflow.org/tutorials/word2vec#motivation_why_learn_word_embeddings
  • https://www.zhihu.com/question/32275069

Awesome-Chinese-NLP: A curated list of resources for Chinese NLP 中文自然語言處理相關資料

Natural Language Processing Key Terms, Explained

Sentiment Analysis

Text Classification

Analyzing tf-idf results in scikit-learn - datawerk

Tf-idf stands for term frequency-inverse document frequency

Typically, the tf-idf weight is composed by two terms: the first computes the normalized Term Frequency (TF), aka. the number of times a word appears in a document, divided by the total number of words in that document; the second term is the Inverse Document Frequency (IDF), computed as the logarithm of the number of the documents in the corpus divided by the number of documents where the specific term appears.

  • TF: Term Frequency, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (i.e., the total number of terms in the document) as a way of normalization: TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
  • IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following: IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

See below for a simple example.

Example:

Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4 (a base-10 logarithm is used here; the choice of base only rescales all weights by a constant factor). Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
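A few lines of Python (a sketch using only the numbers from the example above) reproduce that calculation:

```python
import math

# Worked tf-idf example from above (base-10 logarithm for the idf term).
n_words_in_doc = 100          # total terms in the document
n_occurrences = 3             # times "cat" appears in the document
n_docs = 10_000_000           # documents in the corpus
n_docs_with_term = 1_000      # documents containing "cat"

tf = n_occurrences / n_words_in_doc            # 3 / 100 = 0.03
idf = math.log10(n_docs / n_docs_with_term)    # log10(10,000) = 4.0
tfidf = tf * idf                               # 0.03 * 4 = 0.12

print(tf, idf, tfidf)  # 0.03 4.0 0.12
```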

Chinese

Text Analysis using Machine Learning

Most of the algorithms accept only numerical feature vectors (a vector is a one-dimensional array in computer science). So we need to convert the text documents into numerical feature vectors of a fixed size in order to make use of machine learning algorithms for text analysis.

This can be done by the following steps:

  1. Assign each of the words in the text documents an integer ID. Each of the words is called a token. This step is called tokenization.
  2. Count the occurrences of tokens in each document. This step is called counting. The count of each token becomes a feature.
  3. Normalize and weight the counts so that tokens occurring in most documents carry diminishing importance. This step is called normalization (it is the tf and tf-idf re-weighting described below).

A simple worked example of the first two steps is shown below.
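This is a sketch assuming scikit-learn, whose CountVectorizer performs tokenization and counting in one call; the two tiny documents are made up for illustration, and the method names follow recent scikit-learn releases:

```python
# Bag-of-words sketch with scikit-learn's CountVectorizer (assumed installed).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog ate the cat food",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)    # tokenization + counting in one step

print(vectorizer.get_feature_names_out())  # the tokens; each column index acts as the integer ID
print(counts.toarray())                    # one row of token counts per document
```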

This process is called vectorization. The resulting numerical feature vectors are called a bag-of-words representation.

One issue with vectorization is that longer documents will have higher average count values than shorter documents, even when they talk about the same topics. The solution is to divide the number of occurrences of each word in a document by the total number of words in the document. These features are called term frequencies, or tf.

Another issue with vectorization is that in a large text corpus common words like "the", "a", and "is" will overshadow the rarer but more informative words during model induction. The solution is to downscale the weight of words that appear in many documents. This downscaling is called term frequency times inverse document frequency, or tf-idf.

I learnt the above from a scikit-learn tutorial.
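A minimal sketch of that tf-idf re-weighting, again assuming scikit-learn (TfidfVectorizer combines counting, tf scaling, and idf downweighting; the toy documents are illustrative only):

```python
# tf-idf sketch with scikit-learn's TfidfVectorizer (assumed installed).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog ate the cat food",
    "the mat is on the floor",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # counting, tf scaling and idf downweighting in one step

# "the" appears in every document, so it receives the lowest possible idf;
# words that appear in only one document, like "dog" or "floor", receive the highest idf.
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```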

According to Kaggle, pre-trained word embeddings are an example of pre-trained models. The following are the embeddings mentioned by Kaggle competitors:

Kaggle requires competitors to share the pre-trained models and word embeddings used to "keep the competition fair by making sure that everyone has access to the same data and pretrained models."

What are pre-trained models?

What is word embedding?

Some other tools:

Google AI Blog: Text summarization with TensorFlow

Transfer Learning

To be categorized