OCR and Annotations in "Basic Newspaper": How to tell they're the same text? #437

Open
jbaiter opened this issue Oct 6, 2023 · 2 comments

Comments

jbaiter commented Oct 6, 2023

I'm currently encountering a minor issue with the way the OCR is referenced in the "Basic Newspaper" recipe.

For one, it's provided as an ALTO XML resource referenced in the rendering property. But additionally, it's provided as individual line annotations in the Canvas' annotations at https://iiif.io/api/cookbook/recipe/0068-newspaper/newspaper_issue_1-anno_p1.json.

The issue arises when a generic "Content Search API" indexer that supports both OCR and annotations tries to index this canvas. Since nothing in the annotations indicates that they contain the same text as the OCR, both will be indexed, and users will get duplicate results when searching within the canvas.

Is there a way to make it clearer that the annotations are "the page content as text" (iirc there was a cnt:ContentAsText in IIIFv2?) so indexers can check for it?

@glenrobson
Member

Can you use the fact that the annotations have "motivation": "supplementing", or is that not specific enough? A new motivation TSG is being formed that might coin a transcription motivation. Would that solve the issue?

Do we need some link between the annotations and the ALTO to say they are different formats of the same text?

We could add a label to the annotation page to say it's OCR data, and then your interface could let the user choose which one they want?

Author

jbaiter commented Oct 6, 2023

> Can you use the fact that the annotations have "motivation": "supplementing", or is that not specific enough? A new motivation TSG is being formed that might coin a transcription motivation. Would that solve the issue?

I'm afraid supplementing is not specific enough, since a supplementing annotation could also be e.g. a translation of the text on the canvas (if I understood the spec correctly). A transcription motivation would indeed solve the issue, since I could simply ignore these annotations in the presence of an OCR rendering 👍🏾
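To make the rule concrete, here is a minimal sketch of how an indexer might apply it. It rests on assumptions: the motivation name `transcribing` is purely hypothetical (no such motivation has been coined yet), the ALTO rendering is assumed to be recognisable by a `profile` value starting with the ALTO namespace URL, and the helper names are made up for illustration.

```python
# Hypothetical name: no transcription motivation has been defined yet.
TRANSCRIPTION_MOTIVATION = "transcribing"
# Assumption: the OCR rendering identifies itself via an ALTO profile URL.
ALTO_PROFILE_PREFIX = "http://www.loc.gov/standards/alto/"


def as_list(value):
    """Normalise a JSON value that may be absent, a scalar, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def has_ocr_rendering(canvas):
    """True if the canvas links an OCR file in its `rendering` property."""
    return any(
        str(r.get("profile", "")).startswith(ALTO_PROFILE_PREFIX)
        for r in as_list(canvas.get("rendering"))
    )


def annotations_to_index(canvas, annotations):
    """Drop transcription annotations when the OCR file will be indexed anyway."""
    skip_transcriptions = has_ocr_rendering(canvas)
    return [
        anno
        for anno in annotations
        if not (
            skip_transcriptions
            and TRANSCRIPTION_MOTIVATION in as_list(anno.get("motivation"))
        )
    ]
```

Annotations with other motivations (a translation tagged only as `supplementing`, say) would still be indexed, so only the text that duplicates the OCR is dropped.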

> Do we need some link between the annotations and the ALTO to say they are different formats of the same text?

I think the transcription motivation would probably be enough, something more advanced like this sounds like it could cause a lot more headaches than a simple motivation 😅

> We could add a label to the annotation page to say it's OCR data, and then your interface could let the user choose which one they want?

In my use case no, since the indexing is a fully automatic process without user interaction. And selecting between different indices at query time is afaik not supported by the Content Search API (except for the motivation query parameter).
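For reference, a small sketch of what filtering by motivation at query time looks like. The `q` and `motivation` parameters come from the Content Search API; the service URL and the `search_url` helper are made up for illustration.

```python
from urllib.parse import urlencode


def search_url(service_id, query, motivations=()):
    """Build a Content Search request URL, optionally filtered by motivation."""
    params = {"q": query}
    if motivations:
        # The API expects a space-separated list of motivation terms.
        params["motivation"] = " ".join(motivations)
    return f"{service_id}?{urlencode(params)}"


# Hypothetical service endpoint, for illustration only.
url = search_url("https://example.org/services/search", "berlin", ["supplementing"])
```

This only narrows results by motivation; it does not let a client pick between an OCR-derived index and an annotation-derived one, which is why the duplicate has to be avoided at indexing time.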
