- Part-of-Speech Tagging
- Purpose: Tag verbs, nouns, etc. in sentences.
- Library: Pomegranate.
- Algorithm: HMM (Hidden Markov Model) with supervised learning (see the sketch after this block).
- Main Program
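
A minimal sketch of the HMM tagger above, assuming Pomegranate's pre-1.0 API; the two tags, the emission, and the transition probabilities are all hypothetical (in practice they are relative frequencies counted from a tagged corpus):

```python
from pomegranate import HiddenMarkovModel, DiscreteDistribution, State

# Hypothetical emission probabilities per tag.
noun = State(DiscreteDistribution({'time': 0.7, 'flies': 0.3}), name='NOUN')
verb = State(DiscreteDistribution({'time': 0.1, 'flies': 0.9}), name='VERB')

model = HiddenMarkovModel(name='pos-tagger')
model.add_states(noun, verb)

# Hypothetical transition probabilities.
model.add_transition(model.start, noun, 0.8)
model.add_transition(model.start, verb, 0.2)
model.add_transition(noun, noun, 0.2)
model.add_transition(noun, verb, 0.6)
model.add_transition(noun, model.end, 0.2)
model.add_transition(verb, noun, 0.7)
model.add_transition(verb, model.end, 0.3)
model.bake()

# Viterbi decoding returns the most likely tag sequence.
logp, path = model.viterbi(['time', 'flies'])
print([state.name for _, state in path[1:-1]])  # e.g. ['NOUN', 'VERB']
```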
- Machine Translation
- Purpose: Translate English text to French text.
- Framework: Keras.
- Algorithm: Recurrent Encoder-Decoder RNN (see the sketch after this block).
- Dataset: Subset of WMT.
- Report
- Main Program
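
A minimal sketch of the encoder-decoder above in Keras; every size (vocabularies, lengths, layer widths) is a hypothetical placeholder, and the project's actual architecture may differ:

```python
from keras.models import Model
from keras.layers import Input, Embedding, GRU, RepeatVector, TimeDistributed, Dense

src_vocab, tgt_vocab = 200, 350   # hypothetical vocabulary sizes
src_len, tgt_len = 15, 21         # hypothetical padded sequence lengths

inputs = Input(shape=(src_len,))
x = Embedding(src_vocab, 64)(inputs)      # word ids -> dense vectors
state = GRU(128)(x)                       # encoder: final state summarizes the source
x = RepeatVector(tgt_len)(state)          # feed that summary at every decoder step
x = GRU(128, return_sequences=True)(x)    # decoder RNN
outputs = TimeDistributed(Dense(tgt_vocab, activation='softmax'))(x)

model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()
```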
- Speech Recognition
- Purpose: Implement a VUI (Voice User Interface).
- Framework: Keras.
- Algorithm: 2-Dimensional CNN + RNN + Dense Layer (see the sketch after this block).
- Dataset: LibriSpeech.
- Report
- Main Program
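
A minimal sketch of the CNN + RNN + Dense stack above over spectrogram input; the shapes and layer sizes are assumptions, and the real project also handles alignment between audio frames and characters (e.g. with a CTC-style loss), which is omitted here:

```python
from keras.models import Sequential
from keras.layers import Conv2D, Reshape, GRU, TimeDistributed, Dense

time_steps, n_freq, n_chars = 200, 128, 29   # hypothetical spectrogram shape and charset size

model = Sequential([
    # 2-D convolution over (time, frequency); stride 2 on frequency only,
    # so the time axis is preserved for the recurrent layer.
    Conv2D(32, kernel_size=(11, 11), strides=(1, 2), padding='same',
           activation='relu', input_shape=(time_steps, n_freq, 1)),
    # Collapse (frequency, channels) into one feature vector per time step.
    Reshape((time_steps, (n_freq // 2) * 32)),
    GRU(128, return_sequences=True),
    # One character distribution per time step.
    TimeDistributed(Dense(n_chars, activation='softmax')),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()
```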
Part 1: Introduction to Natural Language Processing
- Text Processing
- Purpose: Tokenize articles.
- Libraries: Pandas and NLTK
- Key APIs (combined in the sketch after this list):
- Tokenize: nltk.tokenize.word_tokenize(text)
- Stopwords: nltk.corpus.stopwords.words('english')
- Stem/Lemmatize:
- Stem: nltk.stem.PorterStemmer().stem(word)
- Lemmatize: nltk.stem.WordNetLemmatizer().lemmatize(word, pos='v')
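
The calls above chain into a short pipeline; a sketch, where the example sentence and the one-time resource downloads are the only additions:

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the resources used below.
for resource in ('punkt', 'stopwords', 'wordnet'):
    nltk.download(resource)

text = "The bulls were running quickly through the streets."
tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]

# Drop common function words.
stop = set(stopwords.words('english'))
tokens = [t for t in tokens if t not in stop]

print([PorterStemmer().stem(t) for t in tokens])                    # crude suffix stripping
print([WordNetLemmatizer().lemmatize(t, pos='v') for t in tokens])  # dictionary-based, as verbs
```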
- Spam Classifier
- Purpose: Classify spam email.
- Libraries: Pandas and Scikit-Learn.
- Algorithm: Apply naive Bayes to BOW (Bag of Words).
- Key Concept:
- Bag of Words: a statistic of the corpus that ignores word order. For example, "Chicago Bulls" might be treated as a city and an animal rather than as the basketball team.
- Key APIs (combined in the sketch after this list):
- Pre-process + Vectorize + BOW: sklearn.feature_extraction.text.CountVectorizer().fit_transform(text)
- Split train/test set: sklearn.model_selection.train_test_split()
- Naive Bayes: sklearn.naive_bayes.MultinomialNB().fit()
- F1 score, recall score, ...:
- sklearn.metrics.f1_score()
- sklearn.metrics.accuracy_score()
- sklearn.metrics.precision_score()
- sklearn.metrics.recall_score()
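
End to end, the APIs above fit together as below; the toy emails and labels are made up (note that train_test_split now lives in sklearn.model_selection, not the old sklearn.cross_validation):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy corpus; a real run would load a labeled spam dataset.
emails = ["win money now", "meeting at noon", "free prize claim now",
          "lunch tomorrow?", "claim your free money", "notes from the meeting"]
labels = [1, 0, 1, 0, 1, 0]   # 1 = spam, 0 = ham

X = CountVectorizer().fit_transform(emails)   # bag-of-words count matrix
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=1)

pred = MultinomialNB().fit(X_train, y_train).predict(X_test)

print(accuracy_score(y_test, pred), precision_score(y_test, pred),
      recall_score(y_test, pred), f1_score(y_test, pred))
```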
- Question Answering
- Purpose: A simple question-answering system built using IBM Watson's NLP services.
Part 2: Computing with Natural Language
- Topic Modeling
- Purpose: Classify texts into topics.
- Libraries: Gensim and Pandas.
- Algorithm: LDA (Latent Dirichlet Allocation) using TF-IDF (Term Frequency-Inverse Document Frequency).
- Key concept:
- TF-IDF: Consider a document containing 100 words wherein the word 'tiger' appears 3 times.
- TF:
- The term frequency (i.e., tf) for 'tiger' is then: TF = (3 / 100) = 0.03.
- IDF:
- Now, assume we have 10 million documents and the word 'tiger' appears in 1,000 of these. Then, the inverse document frequency (i.e., idf) is calculated as: IDF = log10(10,000,000 / 1,000) = 4.
- TF-IDF:
- Thus, the TF-IDF weight is the product of these quantities: TF-IDF = 0.03 * 4 = 0.12.
- Key APIs (combined in the pipeline sketch after this list):
- Normalize and Tokenize: gensim.utils.simple_preprocess(text)
- Stopwords: gensim.parsing.preprocessing.STOPWORDS
- Lemmatize/Stem:
- Lemmatize: nltk.stem.WordNetLemmatizer().lemmatize(word)
- Stem: nltk.stem.SnowballStemmer().stem(word)
- Create Dictionary: gensim.corpora.Dictionary(docs)
- Filter rare/common words: gensim.corpora.Dictionary(docs).filter_extremes()
- BOW/TF-IDF:
- BOW: bow_corpus = gensim.corpora.Dictionary(docs).doc2bow(text)
- TF-IDF: tfidf_corpus = gensim.models.TfidfModel(bow_corpus)[bow_corpus] (fit on the BOW corpus, then apply the fitted model to it)
- LDA:
- gensim.models.LdaMulticore(bow_corpus, num_topics)
- gensim.models.LdaMulticore(tfidf_corpus, num_topics)
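
Chained together, the calls above form the pipeline below; the two toy documents are made up, and filter_extremes is commented out because its thresholds only make sense on a real corpus:

```python
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer

raw_docs = ["The tiger hunts deer in the forest.",
            "Stock markets fell as investors sold shares."]

stemmer, lemmatizer = SnowballStemmer('english'), WordNetLemmatizer()

def preprocess(text):
    # Normalize + tokenize, drop stopwords, then lemmatize and stem.
    return [stemmer.stem(lemmatizer.lemmatize(tok))
            for tok in simple_preprocess(text) if tok not in STOPWORDS]

docs = [preprocess(d) for d in raw_docs]

dictionary = gensim.corpora.Dictionary(docs)
# dictionary.filter_extremes(no_below=5, no_above=0.5)  # prune rare/common words on real data

bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
tfidf_corpus = gensim.models.TfidfModel(bow_corpus)[bow_corpus]

lda = gensim.models.LdaMulticore(tfidf_corpus, num_topics=2, id2word=dictionary)
print(lda.print_topics())
```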
- Sentiment Analysis
- Purpose: Predict positive or negative sentiment of a comment.
- Library: Scikit-Learn.
- Algorithm: Naive Bayes and Gradient-Boosted Decision Tree classifier (see the sketch after this block).
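
A minimal sketch of both classifiers over bag-of-words features; the toy comments are made up, and sklearn's GradientBoostingClassifier stands in for whichever gradient-boosted tree the project actually uses:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import GradientBoostingClassifier

comments = ["great movie, loved it", "terrible plot, awful acting",
            "wonderful and moving", "boring and bad",
            "really enjoyed it", "would not recommend"]
labels = [1, 0, 1, 0, 1, 0]   # 1 = positive, 0 = negative

vec = CountVectorizer()
X = vec.fit_transform(comments)

nb = MultinomialNB().fit(X, labels)
gbdt = GradientBoostingClassifier().fit(X.toarray(), labels)

test = vec.transform(["loved the acting"])
print(nb.predict(test), gbdt.predict(test.toarray()))
```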
-
- Purpose: Implement the basic building block of the attention mechanism.
- Algorithm: Attention (see the sketch after this block).
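
The basic block amounts to scoring a query against keys, normalizing with softmax, and taking a weighted sum of values; a NumPy sketch of the scaled dot-product variant (all dimensions are arbitrary):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, keys, values):
    # Score each key against the query, normalize to weights,
    # and return the weighted sum of the values.
    scores = query @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores)
    return weights @ values, weights

d = 4                                   # arbitrary embedding size
query = np.random.randn(1, d)           # e.g. one decoder state
keys = values = np.random.randn(6, d)   # e.g. six encoder states
context, weights = attention(query, keys, values)
print(weights.round(2))                 # the six weights sum to 1
```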
- Deciphering Text
- Purpose: Decipher strings encrypted with a certain cipher.
- Framework: Keras.
- Algorithm: Char-level RNN using GRU.
- Key APIs (combined in the sketch after this list):
- Char-level Tokenize: keras.preprocessing.text.Tokenizer(char_level=True); call fit_on_texts(texts), then texts_to_sequences(texts) (fit_on_texts returns None, so the two calls cannot be chained)
- Padding: keras.preprocessing.sequence.pad_sequences(tokens, maxlen, padding='post')
- Keras:
- keras.models.Model
- keras.layers
- keras.layers.Input
- keras.layers.GRU
- keras.layers.Dense
- keras.layers.TimeDistributed
- keras.layers.Activation
- keras.optimizers.Adam
- keras.losses.sparse_categorical_crossentropy
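
Assembling the pieces above into a model; the character-set size and padded length are assumptions, and the Embedding layer is an addition (the exercise may feed one-hot vectors instead):

```python
from keras.models import Model
from keras.layers import Input, Embedding, GRU, TimeDistributed, Dense, Activation
from keras.optimizers import Adam

vocab_size, seq_len = 32, 100   # hypothetical charset size and padded length

inputs = Input(shape=(seq_len,))
x = Embedding(vocab_size, 16)(inputs)        # cipher-character ids -> vectors
x = GRU(64, return_sequences=True)(x)        # one hidden state per position
x = TimeDistributed(Dense(vocab_size))(x)    # per-position scores over plain characters
outputs = Activation('softmax')(x)

model = Model(inputs, outputs)
model.compile(optimizer=Adam(), loss='sparse_categorical_crossentropy')
model.summary()
```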
Part 3: Communicating with Natural Language
- Voice Data
- Purpose: Explore the LibriSpeech dataset and its format.