This repository contains my personal NLP project and assignments. The following files are the main components of the project:
- albert.py
- bidaf.py
- qanet.py
- roberta.py
- copy_of_nlpprojectanalysis.py
The remaining files cover smaller tasks, which are explained later.
The Stanford Question Answering Dataset (SQuAD 1.1) is a reading comprehension dataset consisting of 100,000+ questions posed by crowd workers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. The Wikipedia articles were selected using Project Nayuki's Wikipedia internal PageRanks to obtain the top 10,000 articles of English Wikipedia, from which 536 articles were sampled uniformly at random. Paragraphs with fewer than 500 characters, tables, and images were discarded. The result was 23,215 paragraphs for the 536 articles, covering a wide range of topics from musical celebrities to abstract concepts. On each paragraph, crowd workers were tasked with asking and answering up to 5 questions on the content of that paragraph. Among the answers, the shortest span in the paragraph that answered the question was selected. On average, 4.8 answers were collected per question. A major issue with SQuAD 1.1 was that models only needed to select the span that seemed most related to the question, instead of checking that the answer is actually entailed by the text. To fix this issue, the SQuAD 2.0 dataset was released, which combines existing SQuAD data with over 50,000 unanswerable questions written adversarially by crowd workers to look similar to answerable ones. This ensures that models must not only answer questions when possible but also determine when no answer is supported by the paragraph and abstain from answering.
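As a quick orientation, the following is a minimal sketch of loading and inspecting SQuAD 2.0. It assumes the Hugging Face `datasets` package and the public `squad_v2` dataset identifier, which are not part of the project code; the raw JSON files from the SQuAD website expose the same fields.

```python
# Minimal sketch (not project code): load SQuAD 2.0 and inspect one example.
from datasets import load_dataset

squad = load_dataset("squad_v2")       # splits: "train" and "validation"
example = squad["train"][0]

print(example["title"])                # Wikipedia article title
print(example["context"][:200])        # the supporting paragraph
print(example["question"])
print(example["answers"])              # {'text': [...], 'answer_start': [...]}
# Unanswerable questions (added in SQuAD 2.0) have an empty answers["text"] list.
```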
In the SQuAD 2.0 dataset, we explored 75 percent of the training data. We wanted to characterize the "type" of questions, answers, and contexts, along with some counts. Upon further analysis, we considered LDA (Latent Dirichlet Allocation), but LDA would require training, which for our purposes would mean having labeled data. We have unlabeled data, and clustering unlabeled datasets could be a project in itself. We also came across different methods for dealing with unlabeled data; some of them were BERT-based models that can handle partially labeled data, but going that way would mean deviating from our current project title. For data exploration, we needed something pretrained and general enough, roughly equivalent to having an optimal number N of clusters over the unlabeled data. spaCy named entity recognition seemed a perfect choice for this task, as it is pretrained and sophisticated enough for data exploration. We used spaCy's "en_core_web_sm" model, as it is smaller than the other two available models yet has similar F1 rates. It has 18 entity categories, namely 'PERSON', 'NORP', 'FAC', 'ORG', 'GPE', 'LOC', 'PRODUCT', 'EVENT', 'WORK_OF_ART', 'LAW', 'LANGUAGE', 'DATE', 'TIME', 'PERCENT', 'MONEY', 'QUANTITY', 'ORDINAL', 'CARDINAL'. For the task of identifying the type of question asked, we passed the context, the title, and the combined question and answer to the spaCy NER model (see the sketch below). We found that spaCy does not work with single entities, i.e., if we pass just a person's name it returns nothing, as it relies on the grammar of the language and a single word could be a noun, an adjective, or anything else. We then generated title labels from the context labels whenever the context contained the title; that is how we came up with title labels. Then we passed the combined question and answer, hoping this would return much better labels for each question-answer pair, and this was successful. Next, we counted the number of each label type among the titles and, for each of those title types, the number of question types. We found from the following plot that the majority of titles were of three types: Person, Organization, and GPE (countries, states, cities).
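The sketch below illustrates the labelling step described above, assuming spaCy and its `en_core_web_sm` pipeline are installed; the example QA pairs and the counting helper are hypothetical and not the exact project code.

```python
# Minimal sketch: labelling combined question+answer strings with spaCy's
# pretrained NER model and counting which entity types appear.
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")   # small English pipeline with an 18-label NER

def qa_labels(question, answer):
    """Return the set of entity labels spaCy finds in the combined question+answer."""
    doc = nlp(f"{question} {answer}")
    return {ent.label_ for ent in doc.ents}

# Hypothetical QA pairs, just to show the shape of the output.
pairs = [("Who founded SpaceX?", "Elon Musk"),
         ("When did World War II end?", "1945")]
counts = Counter(label for q, a in pairs for label in qa_labels(q, a))
print(counts)   # e.g. Counter({'PERSON': 1, 'DATE': 1, ...})
```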
After this, we plotted the question types for each of these title categories and came up with the following plot. It shows that for each label type, most questions ask about persons, organizations, places, dates, and numbers that do not fall under other categories. Some relationships observed were: NORP (nationalities, religious and political groups) links with GPE (places, countries, cities, etc.) and dates; Organization links with Person, Organization, places, dates, ranks, and other kinds of numbers.
After this, we analyzed our data from the perspective of context, question, and answer lengths, and found the following figures, suggesting that most answers have a length of around 6, most contexts are 500-600 characters long, and most questions are approximately 110 characters long. One point to note is that we normalized this data by considering only distinct values of contexts, questions, and answers. A sketch of this analysis is shown below.
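The following is a rough sketch of the length analysis; it assumes a pandas DataFrame `train_df` with `context`, `question`, and `answer` columns built from the SQuAD JSON, and measures lengths in characters. It is illustrative rather than the exact project code.

```python
# Sketch: character-length distributions of *distinct* contexts, questions,
# and answers (de-duplication is the "normalization" mentioned above).
import matplotlib.pyplot as plt
import pandas as pd

def length_distribution(series: pd.Series) -> pd.Series:
    """Character lengths of the distinct strings in a column."""
    return series.drop_duplicates().str.len()

# `train_df` is an assumed DataFrame with columns: context, question, answer.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ["context", "question", "answer"]):
    length_distribution(train_df[col]).hist(bins=50, ax=ax)
    ax.set_title(f"{col} length (characters)")
plt.tight_layout()
plt.show()
```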
The paper Bidirectional Attention Flow for Machine Comprehension implements a multi-stage hierarchical architecture that represents the context and query at multiple levels of granularity. It also involves recurrence, as it extensively uses LSTMs, together with a memory-less attention mechanism that is bidirectional in nature. Below are some important points regarding the BiDAF model.

The key issue this paper addresses is early summarization in earlier approaches that use attention mechanisms. Until then, attention mechanisms were used to obtain a fixed-size summarization of the given values and query. This, according to the authors, leads to early summarization and loss of information. Moreover, attention was previously calculated in only one direction. To improve upon these issues, the authors propose a hierarchical, multi-stage network.

The word embedding layer maps each word to a high-dimensional vector space. We use pre-trained GloVe word vectors to obtain a fixed word embedding for each word. A character embedding is also calculated for each context and query word; this is done with character-level CNNs, which map each word to a vector space. Here, we trained a simple CNN with one layer of convolution on top of pretrained word vectors and hypothesized that these pretrained word vectors could work as universal feature extractors for various classification tasks. This is analogous to the earlier layers of vision models like VGG and Inception working as generic feature extractors. The intuition is simple: just as convolutional filters learn various features in an image by operating on its pixels, here they do so by operating on the characters of words.

Highway networks were originally introduced to ease the training of deep neural networks. The purpose of this layer is to learn to pass relevant information from the input. A highway network is a series of feed-forward (linear) layers with a gating mechanism. The gating is implemented using a sigmoid function, which decides how much of the information should be transformed and how much should be passed through as is. The input to this layer is the concatenation of the word and character embeddings of each word. The idea is that adding highway layers enables the network to make more efficient use of character embeddings: if a particular word is not found in the pretrained word vector vocabulary (an OOV word), it will most likely be initialized with a zero vector, and it then makes much more sense to look at the character embedding of that word rather than the word embedding. The soft gating mechanism in the highway layers helps the model achieve this (see the sketch after the BiDAF description).

The Attention Flow Layer is responsible for fusing and linking the context and query representations. This layer calculates attention in two directions: from context to query and from query to context. The attention vectors for these calculations are derived from a common matrix called the similarity matrix.

The Modeling Layer is responsible for capturing temporal interactions among the context words. This is done using a bidirectional LSTM. The difference between this layer and the contextual layer, both of which involve an LSTM, is that here we have a query-aware representation of the context, while in the contextual layer the encoding of the context and query was independent.

The Contextual Embedding layer is the final embedding layer in the model.
The output of the highway layers is passed to a bidirectional LSTM to model the temporal features of the text. This is done for both the context and the query. The diagram of the BiDAF model is shown below.
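Alongside the architecture diagram, the following is an illustrative PyTorch sketch of the highway gating described above. It is a simplified stand-in for the idea, not the exact layer implemented in bidaf.py.

```python
# Illustrative sketch of a highway network: a sigmoid gate g decides how much
# of the transformed input to keep versus how much of the input to pass
# through unchanged.
import torch
import torch.nn as nn

class Highway(nn.Module):
    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.transforms = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
        self.gates = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, x):                      # x: (batch, seq_len, dim)
        for transform, gate in zip(self.transforms, self.gates):
            g = torch.sigmoid(gate(x))         # how much to transform
            h = torch.relu(transform(x))       # candidate transformation
            x = g * h + (1.0 - g) * x          # gated mix of new and old
        return x

# Usage: x = torch.cat([word_emb, char_emb], dim=-1); out = Highway(x.size(-1))(x)
```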
The paper QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension (arXiv:1804.09541) draws inspiration from "Attention Is All You Need". The key motivation behind the design of the model is that convolution captures the local structure of the text, while self-attention learns the global interaction between each pair of words. Below are some important points regarding the QANet model.

Earlier models have been heavily based on recurrent neural networks and attention. However, RNNs are slow to train given their sequential nature and are also slow at inference. QANet, proposed in early 2018, does away with recurrence and is based only on self-attention and convolutions.

Depthwise separable convolutions serve the same purpose as normal convolutions, with the difference that they are faster because they reduce the number of multiplication operations. This is done by breaking the convolution operation into two parts: a depthwise convolution and a pointwise convolution. Depthwise separable convolutions require fewer computations than traditional convolutions, as illustrated in the sketch below.

The highway network used here is the same as in the BiDAF model. The Embedding Layer converts word-level tokens into 300-dimensional pre-trained GloVe embedding vectors, creates trainable character embeddings using 2-D convolutions, then concatenates the character and word embeddings and passes them through a highway network. Self-attention is the same as discussed for the BiDAF model. The attention layer is the core building block of the network, where the fusion between context and query occurs; QANet uses the trilinear attention function from the BiDAF model.

In the Encoder Layer, a positional embedding is injected into the input, which is then passed through a series of convolutional layers. The number of these layers depends on which part of the network the encoder blocks belong to. The output is then passed to a multi-headed self-attention layer and finally to a feed-forward network, which is simply a linear layer. The model also involves residual connections, layer normalization, and dropout. The encoder layer is shown below.
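The following is an illustrative PyTorch sketch of a depthwise separable 1-D convolution of the kind used in the QANet encoder blocks. The channel sizes and kernel size are assumptions for illustration, not the exact implementation in qanet.py.

```python
# Sketch of a depthwise separable 1-D convolution: a per-channel (depthwise)
# convolution followed by a 1x1 (pointwise) convolution, which needs far fewer
# multiplications than a standard convolution with the same receptive field.
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=7):
        super().__init__()
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):          # x: (batch, channels, seq_len)
        return self.pointwise(self.depthwise(x))

# Rough weight count with in_ch = out_ch = 128 and kernel_size = 7:
# standard Conv1d: 128 * 128 * 7 ≈ 115k weights
# separable:       128 * 7 + 128 * 128 ≈ 17k weights
```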
BERT: The field of NLP has been transformed by the recent advent of pre-trained language models. These language models are first trained on a large corpus of text and then fine-tuned on a downstream task. BERT (Bidirectional Encoder Representations from Transformers) is one such model; it employs a stacked transformer-based architecture and bidirectional training. BERT's model architecture is a multi-layer bidirectional Transformer consisting of only the encoder part. To make BERT handle a variety of downstream tasks, the input representation can unambiguously represent both a single sentence and a pair of sentences in one token sequence. To represent a word/token, it uses WordPiece embeddings (from Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation) with a 30,000-token vocabulary. The first token of every sequence is always a special classification token ([CLS]); the final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence and differentiated in two ways: first, they are separated with a special token ([SEP]); additionally, a learned embedding is added to every token indicating whether it belongs to sentence A or sentence B. For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. BERT is trained on a large corpus with two objectives: masked language modeling (MLM) and next sentence prediction (NSP). Although BERT has given state-of-the-art results on many NLP tasks, it is still cumbersome to train and fine-tune due to its large number of parameters; as such, training takes a long time and is compute-intensive.
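As an illustration of this input packing, here is a small sketch assuming the Hugging Face `transformers` package; the example question/context pair is hypothetical.

```python
# Sketch: packing a question (sentence A) and a context (sentence B) into a
# single BERT input sequence with [CLS]/[SEP] tokens and segment ids.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Who wrote Hamlet?",                               # sentence A
                    "Hamlet is a tragedy written by William Shakespeare.")  # sentence B

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'who', 'wrote', 'hamlet', '?', '[SEP]', 'hamlet', 'is', ..., '[SEP]']
print(encoded["token_type_ids"])   # 0s for sentence A tokens, 1s for sentence B tokens
```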
ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations is an update on BERT that addresses these problems through parameter reduction. The parameter-reduction techniques give the model fewer parameters, making it easier to train and requiring less compute. The parameter reduction is done using the following approaches. Factorization of the embedding parametrization: input-level embeddings (words, sub-tokens, etc.) need to learn only context-independent representations. To exploit this, the embedding matrix is split so that input-level embeddings use a relatively low dimension (e.g., 128), while the hidden-layer embeddings use higher dimensionalities (768 as in the BERT case, or more). This allows for about an 80% reduction in the number of embedding parameters (see the back-of-the-envelope calculation below). Parameter sharing across the layers: transformer-based architectures rely on independent layers stacked on top of each other, but the network often learns to perform similar operations at various layers using different parameters. In ALBERT, parameters are shared across the layers, i.e., the same layer is applied repeatedly. Implementing these two design changes together yields an ALBERT-base model with only 12M parameters, an 89% parameter reduction compared to the BERT-base model. We employed the Hugging Face ALBERT-base model and fine-tuned it on the SQuAD dataset.
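The following back-of-the-envelope calculation illustrates the embedding factorization, assuming a 30K vocabulary, a hidden size of 768, and an embedding size of 128 (roughly the ALBERT-base configuration).

```python
# Embedding factorization: instead of a single V x H embedding matrix,
# ALBERT uses V x E plus E x H projections with E << H.
V, H, E = 30_000, 768, 128          # vocabulary, hidden size, embedding size (assumed)

bert_embedding_params = V * H             # 23,040,000 parameters
albert_embedding_params = V * E + E * H   #  3,938,304 parameters

reduction = 1 - albert_embedding_params / bert_embedding_params
print(f"{reduction:.0%} fewer embedding parameters")   # roughly 83%, i.e. "about 80%"
```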
RoBERTa: A Robustly Optimized BERT Pretraining Approach builds on BERT's language-masking strategy; it is a retraining of BERT with an improved training methodology, more data, and more compute. Essentially, it makes the following changes to the original BERT implementation. It uses more data: 160GB of text instead of the 16GB originally used to train BERT. It also trains for longer: the number of iterations increases from 100K to 300K and then further to 500K. RoBERTa is trained with larger batches, 8K instead of 256 in the original BERT-base model, and employs a byte-level BPE vocabulary with 50K subword units instead of the character-level BPE vocabulary of size 30K. Finally, it removes the next sentence prediction objective from the training procedure and dynamically changes the masking pattern applied to the training data. We employed the Hugging Face RoBERTa-base model and fine-tuned it on the SQuAD dataset (a usage sketch is shown below).
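For reference, here is a minimal extractive-QA sketch assuming the Hugging Face `transformers` package. The public checkpoint `deepset/roberta-base-squad2` is used purely for illustration and is not the model fine-tuned in this project.

```python
# Sketch: extractive question answering with a RoBERTa model fine-tuned on SQuAD 2.0.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

result = qa(question="What does RoBERTa remove from BERT's training procedure?",
            context="RoBERTa removes the next sentence prediction objective and "
                    "trains with larger batches on more data.")
print(result)   # {'score': ..., 'start': ..., 'end': ..., 'answer': '...'}
```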
| Model | Epochs | F1 score | EM score |
|---|---|---|---|
| BiDAF (Hidden Dim: 100) | 10 | 56.46 | 68.32 |
| QANet (Hidden Dim: 128, Attention Heads: 8) | 10 | 73.41 | 62.52 |
| ALBERT | 9 | 84.36 | 73.41 |
| RoBERTa | 3 | 83.83 | 74.93 |
From our experiments, we observe that ALBERT performed best on the F1 metric with an average score of 84.36, whereas RoBERTa performed marginally better on the exact match metric with a value of 74.93 (although, due to compute constraints, it was trained for fewer epochs). Our empirical results are in line with expectations: the pretrained language models outperformed the end-to-end trained models by a huge margin, achieving better results in the first epoch itself. These results show the benefit these large models enjoy due to their large number of parameters and the large datasets they are pretrained on. Another interesting observation is that the larger models almost plateaued after a couple of epochs, with no major improvement in later epochs (in fact, they performed slightly worse in later epochs), whereas QANet continued to improve with subsequent epochs. Among the non-pretrained models, QANet performed better than BiDAF.

Learning: This project allowed us to explore the domain of Question Answering, an important task in Natural Language Processing.