Natural Language Processing (NLP) Nanodegree

Projects

  1. Part of Speech Tagging

    • Purpose: Tag verb, noun, etc. in sentences.
    • Library: Pomegranate.
    • Algorithm: HMM and Supervised Learning.
    • Main Program
  2. Machine Translation

    • Purpose: Translate English texts to French texts.
    • Framework: Keras.
    • Algorithm: Recurrent Encoder-Decoder (RNN).
    • Dataset: Subset of WMT
    • Report
    • Main Program
  3. DNN Speech Recognizer

    • Purpose: Implement a VUI (Voice User Interface).
    • Framework: Keras
    • Algorithm: 2-Dimensional CNN + RNN + Dense Layer.
    • Dataset: LibriSpeech
    • Report
    • Main Program

Labs

  • Part 1: Introduction to Natural Language

    • Text Processing

      • Purpose: Tokenize articles
      • Libraries: Pandas and NLTK
      • Key APIs:
        • Tokenize: nltk.tokenize.word_tokenize(text)
        • Stopwords: nltk.corpus.stopwords.words('english')
        • Stem/Lemmatize:
          • Stem: nltk.stem.PorterStemmer().stem(word)
          • Lemmatize: nltk.stem.WordNetLemmatizer().lemmatize(word, pos='v')
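The Text Processing steps listed above (tokenize, drop stopwords, reduce words to a base form) can be sketched with the standard library alone. This is only an illustration of the pipeline shape: the regex tokenizer, the tiny `STOPWORDS` set, and the `naive_stem` suffix rule are hypothetical stand-ins, not equivalents of `word_tokenize`, `stopwords.words('english')`, or `PorterStemmer`.

```python
import re

# Hypothetical, deliberately tiny stopword list; NLTK's English list is much longer.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, remove stopwords, then stem each remaining token."""
    return [naive_stem(t) for t in remove_stopwords(tokenize(text))]


if __name__ == "__main__":
    print(preprocess("The cats are chasing the mice in the garden."))
```

Note that the crude rule turns "chasing" into "chas" and leaves "mice" untouched, which is the kind of case where a lemmatizer with a part-of-speech hint does better than suffix stripping.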
    • Spam Classifier

      • Purpose: Classify spam email.
      • Libraries: Pandas and Scikit-Learn.
      • Algorithm: Apply naive Bayes to BOW (Bag of Words).
      • Key Concept:
        • Bag of Words: A per-document count of the words in the corpus that ignores word order. For example, "chicago bulls" might be treated as a city and an animal, rather than the basketball team.
      • Key APIs:
        • Pre-process + Vectorize + BOW: sklearn.feature_extraction.text.CountVectorizer().fit_transform(text)
        • Split train/test set: sklearn.model_selection.train_test_split() (sklearn.cross_validation in older Scikit-Learn releases)
        • Naive Bayes: sklearn.naive_bayes.MultinomialNB().fit()
        • F1 score, recall score, ...:
          • sklearn.metrics.f1_score()
          • sklearn.metrics.accuracy_score()
          • sklearn.metrics.precision_score()
          • sklearn.metrics.recall_score()
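To make the Bag of Words and naive Bayes combination above concrete, here is a minimal standard-library sketch. It is not the lab's Scikit-Learn code: `bag_of_words`, `TinyMultinomialNB`, and the four training messages are made up for illustration, and add-one (Laplace) smoothing is an assumed choice.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Count lowercase whitespace-separated tokens; word order is discarded."""
    return Counter(text.lower().split())


class TinyMultinomialNB:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(bag_of_words(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for word, n in bag_of_words(text).items():
                if word in self.vocab:  # ignore words never seen in training
                    score += n * math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data.
train_texts = [
    "win free money now",
    "free prize claim now",
    "lunch at noon tomorrow",
    "meeting notes attached",
]
train_labels = ["spam", "spam", "ham", "ham"]
model = TinyMultinomialNB().fit(train_texts, train_labels)

if __name__ == "__main__":
    # Same bag regardless of order, which is why "chicago bulls" loses its meaning.
    print(bag_of_words("chicago bulls") == bag_of_words("bulls chicago"))
    print(model.predict("free money"), model.predict("meeting at noon"))
```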
    • IBM Bookworm

      • Purpose: A simple question-answering system built using IBM Watson's NLP services.
  • Part 2: Computing with Natural Language

    • Topic Modeling

      • Purpose: Classify text into a particular topic
      • Libraries: Gensim and Pandas.
      • Algorithm: LDA (Latent Dirichlet Allocation) using TF-IDF (Term Frequency-Inverse Document Frequency).
      • Key concept:
        • TF-IDF: Consider a document containing 100 words wherein the word 'tiger' appears 3 times.
          • TF:
            • The term frequency (i.e., tf) for 'tiger' is then: TF = (3 / 100) = 0.03.
          • IDF:
            • Now, assume we have 10 million documents and the word 'tiger' appears in 1000 of these. Then, the inverse document frequency (i.e., idf), using a base-10 logarithm, is: IDF = log10(10,000,000 / 1,000) = 4.
          • TF-IDF:
            • Thus, the TF-IDF weight is the product of these quantities: TF-IDF = 0.03 * 4 = 0.12.
      • Key APIs:
        • Normalize and Tokenize: gensim.utils.simple_preprocess(text)
        • Stopwords: gensim.parsing.preprocessing.STOPWORDS
        • Lemmatize/Stem:
          • Lemmatize: nltk.stem.WordNetLemmatizer().lemmatize(word)
          • Stem: nltk.stem.SnowballStemmer('english').stem(word)
        • Create Dictionary: gensim.corpora.Dictionary(docs)
        • Filter rare/common words: gensim.corpora.Dictionary(docs).filter_extremes()
        • BOW/TF-IDF:
          • BOW: bow_corpus = gensim.corpora.Dictionary(docs).doc2bow(text)
          • TF-IDF: tfidf_corpus = gensim.models.TfidfModel(bow_corpus)[bow_corpus]
        • LDA:
          • gensim.models.LdaMulticore(bow_corpus, num_topics)
          • gensim.models.LdaMulticore(tfidf_corpus, num_topics)
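The 'tiger' arithmetic from the TF-IDF key concept above can be checked in a few lines. This sketch assumes a base-10 logarithm, which is what yields IDF = 4; the helper names and the two-document toy corpus are hypothetical, and Gensim's `TfidfModel` applies its own weighting and normalization defaults, so its numbers may differ.

```python
import math


def term_frequency(term_count, doc_length):
    """Fraction of the document's words that are the term."""
    return term_count / doc_length


def inverse_document_frequency(n_docs, n_docs_with_term):
    """Base-10 log of (total documents / documents containing the term)."""
    return math.log10(n_docs / n_docs_with_term)


def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return [
        {
            t: term_frequency(doc.count(t), len(doc))
            * inverse_document_frequency(n_docs, doc_freq[t])
            for t in set(doc)
        }
        for doc in docs
    ]


# Worked example from the text: 3 occurrences in 100 words,
# 1,000 of 10,000,000 documents contain the word.
tf = term_frequency(3, 100)
idf = inverse_document_frequency(10_000_000, 1_000)
tfidf = tf * idf

if __name__ == "__main__":
    print(tf, idf, tfidf)
    print(tfidf_weights([["tiger", "zoo"], ["zoo", "park"]]))
```

In the toy corpus, "zoo" appears in every document, so its weight is zero, while "tiger" and "park" keep a positive weight. That is the effect TF-IDF is meant to have on common words.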
    • Sentiment Analysis

      • Purpose: Predict whether the sentiment of a comment is positive or negative.
      • Libraries: Scikit-Learn.
      • Algorithm: Naive Bayes and Gradient-Boosted Decision Tree classifier.
    • Attention Basic

      • Purpose: Implement the basic building block of the attention mechanism.
      • Algorithm: Attention
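As a rough sketch of what a basic attention block computes: score each encoder annotation against a query vector, turn the scores into weights with a softmax, and take the weighted sum as the context vector. The dot-product scoring function and all names below are assumptions for illustration; the lab may use a different scoring function.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention(query, annotations):
    """Return (weights, context) for one query over a list of annotation vectors."""
    scores = [dot(query, h) for h in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    # Context vector: weighted sum of the annotation vectors.
    context = [
        sum(w * h[i] for w, h in zip(weights, annotations)) for i in range(dim)
    ]
    return weights, context


if __name__ == "__main__":
    weights, context = attention([1.0, 0.0], [[10.0, 0.0], [0.0, 10.0]])
    print(weights, context)
```

With the query aligned to the first annotation, almost all of the weight goes to it, so the context vector ends up close to that annotation.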
    • RNN Keras Lab

      • Purpose: Decipher strings encrypted with a certain cipher.
      • Framework: Keras.
      • Algorithm: Char-level RNN using GRU.
      • Key APIs:
        • Char-level Tokenize: tokenizer = keras.preprocessing.text.Tokenizer(char_level=True); tokenizer.fit_on_texts(text); tokenizer.texts_to_sequences(text) (fit_on_texts returns None, so the calls cannot be chained)
        • Padding: keras.preprocessing.sequence.pad_sequences(tokens, maxlen, padding='post')
        • Keras:
          • keras.models.Model
          • keras.layers
            • keras.layers.Input
            • keras.layers.GRU
            • keras.layers.Dense
            • keras.layers.TimeDistributed
            • keras.layers.Activation
          • keras.optimizers.Adam
          • keras.losses.sparse_categorical_crossentropy
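The char-level tokenize and pad steps above can be illustrated without Keras. The sketch below only imitates the idea behind `Tokenizer(char_level=True)` and `pad_sequences(..., padding='post')`: the function names are hypothetical, id 0 is reserved for padding, and tie-breaking and truncation here are assumed choices that need not match Keras.

```python
from collections import Counter


def build_char_index(texts):
    """Map each character to an integer id starting at 1 (0 is kept for padding)."""
    counts = Counter(ch for text in texts for ch in text)
    # Most frequent characters get the smallest ids; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i for i, ch in enumerate(ordered, start=1)}


def texts_to_sequences(texts, char_index):
    """Replace every character with its integer id."""
    return [[char_index[ch] for ch in text] for text in texts]


def pad_post(sequences, maxlen=None, value=0):
    """Right-pad each sequence with `value` up to maxlen (default: longest sequence).

    Sequences longer than maxlen are cut at the end in this sketch.
    """
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    return [list(s[:maxlen]) + [value] * (maxlen - len(s)) for s in sequences]


if __name__ == "__main__":
    texts = ["abba", "ab"]
    index = build_char_index(texts)
    print(pad_post(texts_to_sequences(texts, index)))
```

Padding at the end keeps each input character aligned with its output position, which is convenient when a `TimeDistributed` dense layer predicts one character per time step.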
  • Part 3: Communicating with Natural Language

    • Voice Data
      • Purpose: Explore the LibriSpeech data set and format