From b95dfec4f40cf5e03d56501a2421b92bcc396325 Mon Sep 17 00:00:00 2001
From: Frederik Arnold <frederik.arnold@hu-berlin.de>
Date: Wed, 29 Nov 2023 10:57:05 +0100
Subject: [PATCH] Improvements to key passages

---
 _drafts/2023-11-28-Key-Passages.markdown | 127 ++++++++++++-----------
 1 file changed, 65 insertions(+), 62 deletions(-)
diff --git a/_drafts/2023-11-28-Key-Passages.markdown b/_drafts/2023-11-28-Key-Passages.markdown
index 122e7c5..bd0ce82 100644
--- a/_drafts/2023-11-28-Key-Passages.markdown
+++ b/_drafts/2023-11-28-Key-Passages.markdown
@@ -1,9 +1,9 @@
 ---
-title: "TBD"
+title: "Working Title: Key Passages in Literary Works"
 layout: post
 author: [robert, frederik]
 comments: true
-date: 2023-12-24
+date: 2023-11-28
 ---
 
 # Context
@@ -16,7 +16,7 @@ TBD
 
 # Automatic Identification of Quotations
 
-Scholarly texts contain a number of different types of references. For example, verbatim quotes from short lengths of single words to longer quotations spanning multiple sentences, and indirect quotations in the form of summarizations or re-narrations. In the first phase of the project, we focused on the automatic identification of linking of direct quotations starting with quotations of a length of five or more words. In [Lotte and Annette: And Framwork for Finding and Exploring Key Passages in Literary Works](https://aclanthology.org/2021.nlp4dh-1.7.pdf)[^1], we outline the current landscape for text reuse detection and the development of our tool [Quid](https://hu.berlin/quid). Although there are a number of existing tools, we found that all had limitations for our specific use case. We evaluate Quid and compare it to the  existing tools.
+Scholarly texts contain a number of different types of quotations: verbatim quotations, ranging from single words to passages spanning multiple sentences, and indirect quotations in the form of summaries or re-narrations. In the first phase of the project, we focused on the automatic identification and linking of direct quotations, beginning with quotations of five or more words. In [Lotte and Annette: A Framework for Finding and Exploring Key Passages in Literary Works](https://aclanthology.org/2021.nlp4dh-1.7.pdf)[^1], we outline the current landscape for text reuse detection and describe the development of our tool [Quid](https://hu.berlin/quid). Although there are a number of existing tools, we found that all had limitations for our specific use case. We evaluated Quid and compared it to the existing tools.
 
 [^1]: Lotte and Annette have since been renamed to Quid and QuidEx, respectively.
 
@@ -30,64 +30,65 @@ Scholarly texts contain a number of different types of references. For example,
     <tr>
       <th align="left">Precision</th>
       <th align="left">Recall</th>
-      <th align="left">F1</th>
+      <th align="left">F<sub>1</sub></th>
       <th align="left">Precision</th>
       <th align="left">Recall</th>
-      <th align="left">F1</th>
+      <th align="left">F<sub>1</sub></th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td align="left">BLAST</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
+      <td align="right">0.59</td>
+      <td align="right">0.61</td>
+      <td align="right">0.60</td>
+      <td align="right">0.37</td>
+      <td align="right">0.59</td>
+      <td align="right">0.45</td>
     </tr>
     <tr>
       <td align="left">Copyfind</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
+      <td align="right">0.85</td>
+      <td align="right">0.75</td>
+      <td align="right">0.79</td>
+      <td align="right">0.76</td>
+      <td align="right">0.79</td>
+      <td align="right">0.78</td>
     </tr>
     <tr>
       <td align="left">SimT</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
+      <td align="right"><strong>0.91</strong></td>
+      <td align="right">0.64</td>
+      <td align="right">0.76</td>
+      <td align="right"><strong>0.83</strong></td>
+      <td align="right">0.74</td>
+      <td align="right"><strong>0.79</strong></td>
     </tr>
     <tr>
       <td align="left">Textmatcher</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
+      <td align="right">0.69</td>
+      <td align="right">0.37</td>
+      <td align="right">0.48</td>
+      <td align="right">0.68</td>
+      <td align="right">0.42</td>
+      <td align="right">0.52</td>
     </tr>
     <tr>
       <td align="left">Quid</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
+      <td align="right">0.82</td>
+      <td align="right"><strong>0.90</strong></td>
+      <td align="right"><strong>0.86</strong></td>
+      <td align="right">0.70</td>
+      <td align="right"><strong>0.90</strong></td>
+      <td align="right">0.78</td>
     </tr>
   </tbody>
 </table>
 
-Considerably more difficult to identify are quotations which are shorter than 5 words. In A Novel Approach for Identification and Linking of Short Quotations in Scholarly Texts and Literary Works, we develop and compare two approaches to tackle this challenge, ProQuo and ProQuoLM. ProQuo is a pipeline consisting of three steps.
+Considerably more difficult to identify are quotations shorter than five words. In "A Novel Approach for Identification and Linking of Short Quotations in Scholarly Texts and Literary Works", we develop and compare two approaches to tackle this challenge, _ProQuo_ and _ProQuoLM_.
 >Our main idea behind ProQuo is to use the references corresponding to long quotations as examples to distinguish references corresponding to short quotations from other text in parentheses and other references, for example, Bible references or references to other literary works. We then extract relations between short quotations and references and use that information and the position of long quotations as anchors to link short quotations to the literary work.
-The second approach is a more general, language model based approach where we fine-tune a German Bert for classification. For this second approach, we first extract candidates for short quotations and then use a fine-tuned language model to filter the candidates.
+>
+>The second approach is a more general, language model based approach where we fine-tune a German Bert for classification. For this second approach, we first extract candidates for short quotations and then use a fine-tuned language model to filter the candidates.
 
 <table>
   <thead>
@@ -98,40 +99,40 @@ The second approach is a more general, language model based approach where we fi
     </tr>
     <tr>
       <th align="left">Precision</th>
-      <th align="left">Recall</th^>
-      <th align="left">F1</th>
+      <th align="left">Recall</th>
+      <th align="left">F<sub>1</sub></th>
       <th align="left">Precision</th>
       <th align="left">Recall</th>
-      <th align="left">F1</th>
+      <th align="left">F<sub>1</sub></th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td align="left">Baseline</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
+      <td align="right">0.65</td>
+      <td align="right"><strong>0.78</strong></td>
+      <td align="right">0.71</td>
+      <td align="right">0.59</td>
+      <td align="right"><strong>0.75</strong></td>
+      <td align="right">0.66</td>
     </tr>
     <tr>
       <td align="left">ProQuo</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
+      <td align="right">0.87</td>
+      <td align="right">0.72</td>
+      <td align="right">0.79</td>
+      <td align="right"><strong>0.87</strong></td>
+      <td align="right">0.66</td>
+      <td align="right">0.75</td>
     </tr>
     <tr>
       <td align="left">ProQuoML</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
-      <td align="right">0</td>
+      <td align="right"><strong>0.88</strong></td>
+      <td align="right">0.75</td>
+      <td align="right"><strong>0.81</strong></td>
+      <td align="right"><strong>0.87</strong></td>
+      <td align="right">0.69</td>
+      <td align="right"><strong>0.77</strong></td>
     </tr>
   </tbody>
 </table>
@@ -140,7 +141,9 @@ The second approach is a more general, language model based approach where we fi
 
 To allow for exploration of the results, we created [QuidEx](https://hu.berlin/quidex), a visualization and exploration website.
 >On the left, a heatmap of the complete literary text shows the distribution of quoted passages. The darker the text, the more often it has been quoted and thus the more important it is assumed to be. Next to the heatmap, the literary work is shown. The grayscale is determined by how many scholarly works quote some part of a key passage. That is, the color is always the same for the whole key passage. The font size is determined by how often a minimal segment is quoted. At the bottom, next to the literary text, a list of all scholarly works is shown.
->
-><figure style="text-align:center;">
-  <img src="/images/key-passages-website.jpg" alt="Key passages, website" style="width:300px; border: 1px solid transparent; border-color: black;" />
-</figure>
\ No newline at end of file
+
+<figure style="text-align:center;">
+  <img src="/images/key-passages-website.jpg" alt="Key passages, website" style="width:900px; border: 1px solid transparent; border-color: black;" />
+</figure>
+
+In summary, TBD
\ No newline at end of file
