Commit 9e7e277: Added initial version
plonerma committed Apr 3, 2024 (1 parent: 6108664)
Showing 4 changed files with 2,909 additions and 9 deletions.
index.html: 282 changes (273 additions, 9 deletions)
@@ -1,12 +1,276 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>LM Pub Quiz</title>

<link rel="stylesheet" href="https://fonts.xz.style/serve/inter.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/@exampledev/[email protected]/new.min.css">
<link rel="stylesheet" href="style.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/font/bootstrap-icons.min.css">
</head>
<body>
<header>
<h1>LM Pub Quiz</h1>
<h2 class="subtitle">Evaluating language models using multiple choice items</h2>
<nav>
<a href=""><i class="bi bi-git"></i> Library</a> /
<a href=""><i class="bi bi-database"></i> BEAR Dataset</a> /
<a href=""><i class="bi bi-file-earmark"></i> Paper</a> /
<a href=""><i class="bi bi-file-zip"></i> Raw Results</a>
</nav>
</header>

<section>
<figure class="shadow-box">
<img src="./media/bear_evaluation_final.svg" width="100%" alt="Illustration of how LM Pub Quiz evaluates LMs.">
<figcaption>Illustration of how LM Pub Quiz evaluates LMs: Answers are ranked by the (pseudo) log-likelihoods of the textual statements derived from all of the answer options.</figcaption>
</figure>
</section>
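<section class="shadow-box">
<h2>How the ranking works</h2>
<p>A minimal sketch of the scoring idea illustrated above (this is not the lm-pub-quiz API; the model, template, and answer options are placeholders for illustration): each answer option is inserted into a textual template, every resulting statement is scored by the causal LM's total log-likelihood, and the option whose statement scores highest is taken as the model's answer.</p>
<pre>import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def log_likelihood(statement: str) -> float:
    """Total log-probability of the statement under the causal LM."""
    ids = tokenizer(statement, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids the model returns the mean cross-entropy over
        # the predicted tokens; multiply by their count to undo the mean.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

# One relational fact, rendered as one statement per answer option.
options = ["Paris", "Berlin", "Rome"]
statements = [f"The capital of France is {option}." for option in options]
scores = [log_likelihood(s) for s in statements]
print(options[scores.index(max(scores))])  # expected: "Paris"</pre>
</section>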

<section class="shadow-box">
<div style="text-align: right;"><span class="badge acl">Accepted at NAACL 2024</span></div>
<h2>BEAR: A Unified Framework for Evaluating Relational Knowledge in Causal and Masked Language Models</h2>
<h3>Abstract</h3>
<p>
Knowledge probing assesses the degree to which a language model (LM) has successfully learned relational knowledge during pre-training. Probing is an inexpensive way to compare LMs of different sizes and training configurations. However, previous approaches rely on the objective function used in pre-training LMs and are thus applicable only to masked or causal LMs. As a result, comparing different types of LMs becomes impossible. To address this, we propose an approach that uses an LM's inherent ability to estimate the log-likelihood of any given textual statement. We carefully design an evaluation dataset of 7,731 instances (40,916 in a larger variant) from which we produce alternative statements for each relational fact, one of which is correct. We then evaluate whether an LM correctly assigns the highest log-likelihood to the correct statement. Our experimental evaluation of 22 common LMs shows that our proposed framework, BEAR, can effectively probe for knowledge across different LM types. We release the BEAR datasets and an open-source framework that implements the probing approach to the research community to facilitate the evaluation and development of LMs.
</p>
<p>
<a href="">
<button><i class="bi bi-file-earmark"></i> Read the Paper</button>
</a>
</p>
</section>
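<section class="shadow-box">
<h2>Scoring masked LMs</h2>
<p>For masked LMs, a statement cannot be scored left-to-right, so a pseudo log-likelihood is used instead. The sketch below assumes the standard recipe (mask each token in turn and sum the log-probabilities the model assigns to the true tokens); the model name is a placeholder, and this is not the lm-pub-quiz implementation itself.</p>
<pre>import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

def pseudo_log_likelihood(statement: str) -> float:
    """Sum of log-probabilities of each token, masked one at a time."""
    ids = tokenizer(statement, return_tensors="pt").input_ids[0]
    total = 0.0
    for i in range(1, ids.size(0) - 1):  # skip the [CLS] and [SEP] tokens
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total</pre>
</section>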
<section>
<figure class="shadow-box">
<img src="./media/accuracy_by_model_size_bear.svg" width="100%" alt="Illustration of how LM Pub Quiz evaluates LMs." style="max-width: 600px;">
<figcaption>Accuracy of various models on the BEAR dataset.</figcaption>
</figure>
</section>
<section class="shadow-box">
<h2>Model Results</h2>
<p>
We evaluated 22 language models (of various sizes, trained using different pre-training objectives, and of both causal and masked LM types) on the BEAR dataset.
</p>

<table border="1" class="dataframe">
<thead>
<tr style="text-align: left;">
<th>Model</th>
<th>Type</th>
<th>Num Params</th>
<th>BEAR</th>
<th>BEAR<sub>1:1</sub></th>
<th>BEAR<sub>N:1</sub></th>
</tr>
</thead>
<tbody>
<tr>
<th>Llama-2-13b-hf</th>
<td>CLM</td>
<td>13b</td>
<td>&#8199;66.9%&#8199;&plusmn;<small>&#8199;1.0%</small></td>
<td>&#8199;66.5%&#8199;&plusmn;<small>&#8199;1.6%</small></td>
<td>&#8199;67.0%&#8199;&plusmn;<small>&#8199;1.1%</small></td>
</tr>
<tr>
<th>Mistral-7B-v0.1</th>
<td>CLM</td>
<td>7.0b</td>
<td>&#8199;65.4%&#8199;&plusmn;<small>&#8199;1.1%</small></td>
<td>&#8199;64.5%&#8199;&plusmn;<small>&#8199;1.2%</small></td>
<td>&#8199;65.5%&#8199;&plusmn;<small>&#8199;1.1%</small></td>
</tr>
<tr>
<th>gemma-7b</th>
<td>CLM</td>
<td>7.0b</td>
<td>&#8199;63.7%&#8199;&plusmn;<small>&#8199;1.3%</small></td>
<td>&#8199;63.5%&#8199;&plusmn;<small>&#8199;0.7%</small></td>
<td>&#8199;63.8%&#8199;&plusmn;<small>&#8199;1.4%</small></td>
</tr>
<tr>
<th>Llama-2-7b-hf</th>
<td>CLM</td>
<td>7.0b</td>
<td>&#8199;62.4%&#8199;&plusmn;<small>&#8199;1.3%</small></td>
<td>&#8199;62.2%&#8199;&plusmn;<small>&#8199;1.1%</small></td>
<td>&#8199;62.4%&#8199;&plusmn;<small>&#8199;1.3%</small></td>
</tr>
<tr>
<th>gemma-2b</th>
<td>CLM</td>
<td>2.0b</td>
<td>&#8199;51.5%&#8199;&plusmn;<small>&#8199;1.0%</small></td>
<td>&#8199;53.1%&#8199;&plusmn;<small>&#8199;1.3%</small></td>
<td>&#8199;51.3%&#8199;&plusmn;<small>&#8199;1.0%</small></td>
</tr>
<tr>
<th>opt-30b</th>
<td>CLM</td>
<td>30b</td>
<td>&#8199;47.9%&#8199;&plusmn;<small>&#8199;0.5%</small></td>
<td>&#8199;45.8%&#8199;&plusmn;<small>&#8199;1.0%</small></td>
<td>&#8199;48.2%&#8199;&plusmn;<small>&#8199;0.6%</small></td>
</tr>
<tr>
<th>opt-13b</th>
<td>CLM</td>
<td>13b</td>
<td>&#8199;45.4%&#8199;&plusmn;<small>&#8199;0.8%</small></td>
<td>&#8199;43.5%&#8199;&plusmn;<small>&#8199;2.1%</small></td>
<td>&#8199;45.7%&#8199;&plusmn;<small>&#8199;0.6%</small></td>
</tr>
<tr>
<th>opt-6.7b</th>
<td>CLM</td>
<td>6.7b</td>
<td>&#8199;43.8%&#8199;&plusmn;<small>&#8199;1.1%</small></td>
<td>&#8199;42.5%&#8199;&plusmn;<small>&#8199;1.0%</small></td>
<td>&#8199;43.9%&#8199;&plusmn;<small>&#8199;1.2%</small></td>
</tr>
<tr>
<th>opt-2.7b</th>
<td>CLM</td>
<td>2.7b</td>
<td>&#8199;37.3%&#8199;&plusmn;<small>&#8199;0.9%</small></td>
<td>&#8199;35.6%&#8199;&plusmn;<small>&#8199;0.7%</small></td>
<td>&#8199;37.5%&#8199;&plusmn;<small>&#8199;1.0%</small></td>
</tr>
<tr>
<th>opt-1.3b</th>
<td>CLM</td>
<td>1.3b</td>
<td>&#8199;31.5%&#8199;&plusmn;<small>&#8199;0.8%</small></td>
<td>&#8199;31.3%&#8199;&plusmn;<small>&#8199;0.6%</small></td>
<td>&#8199;31.5%&#8199;&plusmn;<small>&#8199;0.9%</small></td>
</tr>
<tr>
<th>gpt2-xl</th>
<td>CLM</td>
<td>1.6b</td>
<td>&#8199;26.2%&#8199;&plusmn;<small>&#8199;0.7%</small></td>
<td>&#8199;24.1%&#8199;&plusmn;<small>&#8199;1.6%</small></td>
<td>&#8199;26.5%&#8199;&plusmn;<small>&#8199;0.6%</small></td>
</tr>
<tr>
<th>gpt2-large</th>
<td>CLM</td>
<td>812M</td>
<td>&#8199;22.2%&#8199;&plusmn;<small>&#8199;0.6%</small></td>
<td>&#8199;20.1%&#8199;&plusmn;<small>&#8199;1.8%</small></td>
<td>&#8199;22.5%&#8199;&plusmn;<small>&#8199;0.5%</small></td>
</tr>
<tr>
<th>roberta-large</th>
<td>MLM</td>
<td>355M</td>
<td>&#8199;21.5%&#8199;&plusmn;<small>&#8199;0.8%</small></td>
<td>&#8199;22.0%&#8199;&plusmn;<small>&#8199;1.1%</small></td>
<td>&#8199;21.5%&#8199;&plusmn;<small>&#8199;0.8%</small></td>
</tr>
<tr>
<th>bert-large-cased</th>
<td>MLM</td>
<td>335M</td>
<td>&#8199;19.9%&#8199;&plusmn;<small>&#8199;0.5%</small></td>
<td>&#8199;16.6%&#8199;&plusmn;<small>&#8199;1.0%</small></td>
<td>&#8199;20.3%&#8199;&plusmn;<small>&#8199;0.5%</small></td>
</tr>
<tr>
<th>opt-350m</th>
<td>CLM</td>
<td>350M</td>
<td>&#8199;19.6%&#8199;&plusmn;<small>&#8199;0.6%</small></td>
<td>&#8199;18.6%&#8199;&plusmn;<small>&#8199;1.2%</small></td>
<td>&#8199;19.7%&#8199;&plusmn;<small>&#8199;0.6%</small></td>
</tr>
<tr>
<th>gpt2-medium</th>
<td>CLM</td>
<td>355M</td>
<td>&#8199;19.0%&#8199;&plusmn;<small>&#8199;0.8%</small></td>
<td>&#8199;16.0%&#8199;&plusmn;<small>&#8199;2.6%</small></td>
<td>&#8199;19.4%&#8199;&plusmn;<small>&#8199;0.6%</small></td>
</tr>
<tr>
<th>bert-base-cased</th>
<td>MLM</td>
<td>109M</td>
<td>&#8199;18.4%&#8199;&plusmn;<small>&#8199;0.4%</small></td>
<td>&#8199;15.0%&#8199;&plusmn;<small>&#8199;1.1%</small></td>
<td>&#8199;18.8%&#8199;&plusmn;<small>&#8199;0.4%</small></td>
</tr>
<tr>
<th>roberta-base</th>
<td>MLM</td>
<td>125M</td>
<td>&#8199;16.4%&#8199;&plusmn;<small>&#8199;0.7%</small></td>
<td>&#8199;15.8%&#8199;&plusmn;<small>&#8199;1.8%</small></td>
<td>&#8199;16.5%&#8199;&plusmn;<small>&#8199;0.8%</small></td>
</tr>
<tr>
<th>opt-125m</th>
<td>CLM</td>
<td>125M</td>
<td>&#8199;16.4%&#8199;&plusmn;<small>&#8199;0.5%</small></td>
<td>&#8199;14.0%&#8199;&plusmn;<small>&#8199;1.3%</small></td>
<td>&#8199;16.7%&#8199;&plusmn;<small>&#8199;0.4%</small></td>
</tr>
<tr>
<th>xlm-roberta-large</th>
<td>MLM</td>
<td>561M</td>
<td>&#8199;14.3%&#8199;&plusmn;<small>&#8199;0.3%</small></td>
<td>&#8199;14.9%&#8199;&plusmn;<small>&#8199;1.7%</small></td>
<td>&#8199;14.3%&#8199;&plusmn;<small>&#8199;0.5%</small></td>
</tr>
<tr>
<th>gpt2</th>
<td>CLM</td>
<td>137M</td>
<td>&#8199;13.5%&#8199;&plusmn;<small>&#8199;0.8%</small></td>
<td>&#8199;&#8199;9.4%&#8199;&plusmn;<small>&#8199;2.1%</small></td>
<td>&#8199;14.0%&#8199;&plusmn;<small>&#8199;0.7%</small></td>
</tr>
<tr>
<th>xlm-roberta-base</th>
<td>MLM</td>
<td>279M</td>
<td>&#8199;11.4%&#8199;&plusmn;<small>&#8199;0.2%</small></td>
<td>&#8199;11.4%&#8199;&plusmn;<small>&#8199;1.1%</small></td>
<td>&#8199;11.4%&#8199;&plusmn;<small>&#8199;0.2%</small></td>
</tr>
<tr>
<th>Random Baseline</th>
<td>-</td>
<td>-</td>
<td>&#8199;&#8199;4.7%</td>
<td>&#8199;&#8199;1.7%</td>
<td>&#8199;&#8199;5.1%</td>
</tr>
</tbody>
</table>
</section>
<section class="shadow-box">
<h2>Citation</h2>
<p>When using the dataset or library, please cite the following paper:</p>
<pre>@inproceedings{wiland-ploner-akbik-2024-bear,
  title     = "BEAR: A Unified Framework for Evaluating Relational Knowledge in Causal and Masked Language Models",
  author    = "Wiland, Jacek and Ploner, Max and Akbik, Alan",
  booktitle = "Findings of the Association for Computational Linguistics: NAACL 2024",
  year      = "2024",
  publisher = "Association for Computational Linguistics",
}</pre>
</section>
</body>
</html>