Bayesian-Evaluation-of-Text-Classification-Models

When evaluating text classification models, we want to be confident both about a model's performance and about its superiority over another model. In text classification it has become the norm to apply Null Hypothesis Significance Testing (NHST) to state and compare classifier performance statistically. However, the frequentist approach has its own limitations and fallacies. In this report, we reflect on the limitations of NHST and implement a novel Bayesian approach for evaluating text classification models. We use a benchmark dataset and build several shallow models on sparse and dense features, as well as an attention-based model, for comparison. We empirically demonstrate the difference between the two evaluation approaches.

Project


Notes


  • The outputs of the PyTorch model are indexed differently from the scikit-learn outputs. To compare the PyTorch model's output with scikit-learn's output, the true labels must be re-indexed to match the PyTorch predictions:
# Example
sklearn.metrics.f1_score(ytest[ytest_bert_idx,:], ytest_pred_bert, average='micro', sample_weight=None, zero_division='warn')
  • For NHST, the bootstrap sampling is not optimized, so creating 10,000 bootstrap samples for each case can take a while.
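
The two notes above can be illustrated with a minimal sketch. It is not the code used in this repository: the helper names `per_sample_counts` and `bootstrap_micro_f1`, the `batch` parameter, and the toy data are all hypothetical, and it assumes the labels and predictions are binary multilabel indicator arrays of shape `(n_samples, n_labels)`. The idea is to compute true-positive, false-positive, and false-negative counts once per test row, then draw multinomial resampling counts so that each bootstrap micro-F1 becomes a matrix product rather than a fresh pass over the data.

```python
import numpy as np


def per_sample_counts(y_true, y_pred):
    """Per-row TP, FP and FN counts for binary multilabel indicator arrays."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = (y_true & y_pred).sum(axis=1)
    fp = (~y_true & y_pred).sum(axis=1)
    fn = (y_true & ~y_pred).sum(axis=1)
    return tp, fp, fn


def bootstrap_micro_f1(y_true, y_pred, n_boot=10000, batch=500, seed=0):
    """Return n_boot bootstrap replicates of micro-averaged F1 (hypothetical helper)."""
    tp, fp, fn = per_sample_counts(y_true, y_pred)
    n = tp.shape[0]
    rng = np.random.default_rng(seed)
    scores = np.empty(n_boot)
    for start in range(0, n_boot, batch):
        size = min(batch, n_boot - start)
        # Each row of w counts how often every test row appears in one resample;
        # this is equivalent to drawing n rows with replacement.
        w = rng.multinomial(n, np.full(n, 1.0 / n), size=size)
        tp_b, fp_b, fn_b = w @ tp, w @ fp, w @ fn
        denom = 2 * tp_b + fp_b + fn_b
        # Micro-F1 = 2TP / (2TP + FP + FN); defined here as 0 when the denominator is 0.
        scores[start:start + size] = np.where(
            denom > 0, 2 * tp_b / np.maximum(denom, 1), 0.0
        )
    return scores


# Toy data (made up for illustration): 6 test rows, 3 labels.
ytest = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
    [0, 1, 1],
])
# Assumed order in which the PyTorch model emitted its predictions.
ytest_bert_idx = np.array([4, 0, 5, 2, 1, 3])

# Pretend predictions: correct everywhere except one flipped entry.
ytest_pred_bert = ytest[ytest_bert_idx, :].copy()
ytest_pred_bert[0, 0] ^= 1

# Re-index the true labels so each row lines up with its prediction.
scores = bootstrap_micro_f1(ytest[ytest_bert_idx, :], ytest_pred_bert, n_boot=2000)
print(scores.mean(), np.percentile(scores, [2.5, 97.5]))
```

Processing the replicates in batches keeps the weight matrix small; a larger `batch` trades memory for fewer loop iterations.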

Datasets


All model outputs are provided in the Data folder.

To obtain the feature matrix from the DistilBERT model, please refer to the section "Creating BERT based features" in Shallow_Models.ipynb.

Report