OCR and Annotations in "Basic Newspaper": How to tell they're the same text? #437

jbaiter · 2023-10-06T13:33:53Z

I'm currently encountering a minor issue with the way the OCR is referenced in the "Basic Newspaper" recipe.

For one, it's provided as an ALTO XML resource referenced in the rendering property. But additionally, it's provided as individual line annotations in the Canvas' annotations at https://iiif.io/api/cookbook/recipe/0068-newspaper/newspaper_issue_1-anno_p1.json.

The issue arises when a generic Content Search API indexer that supports both OCR and annotations tries to index this canvas. Since the annotations in no way make clear that they contain the same text as the OCR, both will be indexed, and users will get duplicate results when searching within the canvas.
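To make the duplication concrete, here is a minimal sketch of what a naive indexer would see. The sample canvas, the example.org URLs, the ALTO profile check, and the `text_sources` helper are all illustrative assumptions, not taken from the recipe or from any real indexer.

```python
# Hypothetical canvas, trimmed to the two properties under discussion.
canvas = {
    "id": "https://example.org/canvas/p1",
    "type": "Canvas",
    "rendering": [
        {
            "id": "https://example.org/alto_p1.xml",
            "type": "Text",
            "format": "application/xml",
            # Assumption: the ALTO file is flagged with this profile URI.
            "profile": "http://www.loc.gov/standards/alto/",
        }
    ],
    "annotations": [
        {"id": "https://example.org/anno_p1.json", "type": "AnnotationPage"}
    ],
}


def text_sources(canvas):
    """List every text source a naive indexer would pick up for a canvas."""
    sources = []
    for rendering in canvas.get("rendering", []):
        if "alto" in rendering.get("profile", "").lower():
            sources.append(("ocr", rendering["id"]))
    for page in canvas.get("annotations", []):
        sources.append(("annotations", page["id"]))
    return sources


sources = text_sources(canvas)
print(sources)
```

Nothing in either entry says the two sources carry the same text, so both end up in the index.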

Is there a way to make it clearer that the annotations are "the page content as text" (iirc there was a cnt:ContentAsText in IIIFv2?) so indexers can check for it?


glenrobson · 2023-10-06T15:20:37Z

Can you use the fact that the annotations have "motivation": "supplementing", or is that not specific enough? A new motivation TSG is being formed that might coin a transcription motivation. Would that solve the issue?

Do we need some link between the annotations and the ALTO to say they are different formats of the same text?

We could add a label to the annotation page to say it's OCR data, and then could your interface let the user choose which one they want?

jbaiter · 2023-10-06T19:51:09Z

> Can you use the fact that the annotations have "motivation": "supplementing", or is that not specific enough? A new motivation TSG is being formed that might coin a transcription motivation. Would that solve the issue?

I'm afraid supplementing is not specific enough, since a supplementing annotation could also be e.g. a translation of the text on the canvas (if I understood the spec correctly). A transcription motivation would indeed solve the issue, since I could simply ignore these annotations in the presence of an OCR rendering 👍🏾
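A rough sketch of how an indexer could apply that rule. The motivation value `"transcribing"` is purely hypothetical, since no such motivation has been coined yet, and the helper names are illustrative.

```python
# Hypothetical motivation value; no transcription motivation exists yet.
TRANSCRIPTION_MOTIVATION = "transcribing"


def motivations(annotation):
    """Return the annotation's motivation(s) as a list (string or list input)."""
    value = annotation.get("motivation", [])
    return [value] if isinstance(value, str) else list(value)


def annotations_to_index(annotations, has_ocr_rendering):
    """Skip transcription annotations when the OCR rendering is indexed instead."""
    if not has_ocr_rendering:
        return list(annotations)
    return [
        a for a in annotations
        if TRANSCRIPTION_MOTIVATION not in motivations(a)
    ]


annos = [
    {"id": "a1", "motivation": "transcribing"},
    {"id": "a2", "motivation": "supplementing"},  # e.g. a translation
]
kept = annotations_to_index(annos, has_ocr_rendering=True)
print([a["id"] for a in kept])
```

Annotations that only carry supplementing (a translation, say) are still indexed, which is the distinction supplementing alone cannot express.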

> Do we need some link between the annotations and the ALTO to say they are different formats of the same text?

I think the transcription motivation would probably be enough. An explicit link like this sounds like it could cause a lot more headaches than a simple motivation 😅

> We could add a label to the annotation page to say it's OCR data, and then could your interface let the user choose which one they want?

In my use case no, since the indexing is a fully automatic process without user interaction. And selecting between different indices at query time is afaik not supported by the Content Search API (except for the motivation query parameter).
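For reference, a small sketch of what filtering by the motivation query parameter looks like from the client side. The service URL and the `search_url` helper are made up for illustration, and which motivation values a given search service actually accepts is an assumption here.

```python
from urllib.parse import urlencode


def search_url(service_id, query, motivation=None):
    """Build a Content Search request URL, optionally filtered by motivation."""
    params = {"q": query}
    if motivation:
        params["motivation"] = motivation
    return f"{service_id}?{urlencode(params)}"


# Hypothetical search service endpoint.
url = search_url(
    "https://example.org/services/manifest/search",
    "bird",
    motivation="supplementing",
)
print(url)
```

This only narrows results by motivation; it cannot pick between an OCR-derived index and an annotation-derived one when both use the same motivation.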

glenrobson mentioned this issue Jan 19, 2024

I want to declare Annotations as transcriptions and/or translations. IIIF/iiif-stories#140

Open
