Skip to content

Commit

Permalink
Add first draft for key passages
Browse files Browse the repository at this point in the history
  • Loading branch information
Frederik Arnold committed Nov 28, 2023
1 parent e042184 commit f93596c
Show file tree
Hide file tree
Showing 2 changed files with 151 additions and 0 deletions.
9 changes: 9 additions & 0 deletions _config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -214,6 +214,15 @@ authors:
orcid:
twitter:
description: Yiddish and digital humanities at the FU Berlin.
frederik:
name: Frederik Arnold
display_name: Frederik
email:
web: https://hu.berlin/fa
github:
orcid: 0000-0002-0417-4054
twitter:
description: Research fellow at Humboldt University of Berlin.

paginate: 10

Expand Down
142 changes: 142 additions & 0 deletions _drafts/yyyy-mm-dd-Key-Passages.markdown
Original file line number Diff line number Diff line change
@@ -0,0 +1,142 @@
---
title: "TBD"
layout: post
author: [robert, frederik]
comments: true
date: 2023-12-24
---

# Context

In the DFG project [What matters? Key Passages in Literary Works](https://www.projekte.hu-berlin.de/en/schluesselstellen/what-matters-key-passages-in-literary-works) which started with its first phase in 2020 and has been in the second phase, titled [Is Expert Knowledge Key? Scholarly Interpretations as Resource for the Analysis of Literary Texts in Computational Literary Studies](https://www.projekte.hu-berlin.de/en/schluesselstellen/index.html), since August 2023, we set out to identify and characterize key passages in literary works. We understand key passages as passages that are particularly important to expert readers when interpreting texts. In a mixed-methods approach, we primarily want to investigate empirically which textual characteristics of literary genres can be revealed through patterns of citation and quotation.

# Corpus

TBD

# Automatic Identification of Quotations

Scholarly texts contain a number of different types of references. For example, verbatim quotes from short lengths of single words to longer quotations spanning multiple sentences, and indirect quotations in the form of summarizations or re-narrations. In the first phase of the project, we focused on the automatic identification of linking of direct quotations starting with quotations of a length of five or more words. In [Lotte and Annette: And Framwork for Finding and Exploring Key Passages in Literary Works](https://aclanthology.org/2021.nlp4dh-1.7.pdf)[^1], we outline the current landscape for text reuse detection and the development of our tool [Quid](https://hu.berlin/quid). Although there are a number of existing tools, we found that all had limitations for our specific use case. We evaluate Quid and compare it to the existing tools.

[^1]: Lotte and Annette have since been renamed to Quid and QuidEx, respectively.

<table>
<thead>
<tr>
<th align="center" rowspan="3">Approach</th>
<th align="center" colspan="3">Die Judenbuche</th>
<th align="center" colspan="3">Michael Kohlhaas</th>
</tr>
<tr>
<th align="left">Precision</th>
<th align="left">Recall</th>
<th align="left">F1</th>
<th align="left">Precision</th>
<th align="left">Recall</th>
<th align="left">F1</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">BLAST</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left">Copyfind</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left">SimT</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left">Textmatcher</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left">Quid</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
</tbody>
</table>

Considerably more difficult to identify are quotations which are shorter than 5 words. In A Novel Approach for Identification and Linking of Short Quotations in Scholarly Texts and Literary Works, we develop and compare two approaches to tackle this challenge, ProQuo and ProQuoLM. ProQuo is a pipeline consisting of three steps.
>Our main idea behind ProQuo is to use the references corresponding to long quotations as examples to distinguish references corresponding to short quotations from other text in parentheses and other references, for example, Bible references or references to other literary works. We then extract relations between short quotations and references and use that information and the position of long quotations as anchors to link short quotations to the literary work.
The second approach is a more general, language model based approach where we fine-tune a German Bert for classification. For this second approach, we first extract candidates for short quotations and then use a fine-tuned language model to filter the candidates.

<table>
<thead>
<tr>
<th align="center" rowspan="2">Approach</th>
<th align="center" colspan="3">Die Judenbuche</th>
<th align="center" colspan="3">Michael Kohlhaas</th>
</tr>
<tr>
<th align="left">Precision</th>
<th align="left">Recall</th^>
<th align="left">F1</th>
<th align="left">Precision</th>
<th align="left">Recall</th>
<th align="left">F1</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Baseline</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left">ProQuo</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left">ProQuoML</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
</tbody>
</table>

# QuidEx - Visualization and Exploration

To allow for exploration of the results, we created [QuidEx](https://hu.berlin/quidex), a visualization and exploration website.
>On the left, a heatmap of the complete literary text shows the distribution of quoted passages. The darker the text, the more often it has been quoted and thus the more important it is assumed to be. Next to the heatmap, the literary work is shown. The grayscale is determined by how many scholarly works quote some part of a key passage. That is, the color is always the same for the whole key passage. The font size is determined by how often a minimal segment is quoted. At the bottom, next to the literary text, a list of all scholarly works is shown.

0 comments on commit f93596c

Please sign in to comment.