I'm currently encountering a minor issue with the way the OCR is referenced in the "Basic Newspapers" recipe.

For one, it's provided as an ALTO XML resource referenced in the `rendering` property. But additionally, it's provided as individual line annotations in the Canvas' `annotations` at https://iiif.io/api/cookbook/recipe/0068-newspaper/newspaper_issue_1-anno_p1.json.

Now the issue arises when a generic "Content Search API" indexer that supports both OCR and annotations tries to index this canvas. Since the annotations in no way make clear that they contain the same text as the OCR, both will be indexed, and users will get duplicate search results for a content search on the canvas.

Is there a way to make it clearer that the annotations are "the page content as text" (iirc there was a `cnt:ContentAsText` in IIIFv2?) so indexers can check for it?
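For reference, the relevant part of the canvas looks roughly like this (the annotation page URL is the one from the recipe; the other ids, the label and the format of the ALTO entry are placeholders, not copied from the recipe):

```json
{
  "id": "https://example.org/newspaper/canvas/p1",
  "type": "Canvas",
  "rendering": [
    {
      "id": "https://example.org/newspaper/newspaper_issue_1-alto_p1.xml",
      "type": "Text",
      "label": { "en": [ "ALTO XML of page 1" ] },
      "format": "application/xml"
    }
  ],
  "annotations": [
    {
      "id": "https://iiif.io/api/cookbook/recipe/0068-newspaper/newspaper_issue_1-anno_p1.json",
      "type": "AnnotationPage"
    }
  ]
}
```

Both entries ultimately carry the same OCR text, which is exactly what a generic indexer cannot tell from this structure.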
Can you use the fact that the annotations have `"motivation": "supplementing"`, or is that not specific enough? There is a new motivation TSG being formed that might coin a transcription motivation. Would that solve the issue?

Do we need some link between the annotations and the ALTO to say they are different formats of the same text?

We could add a label to the annotation page to say it's OCR data; could your interface then let the user choose which one they want?
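For illustration, a labelled annotation page along the lines of the third suggestion might look roughly like this (the label wording and the example line annotation are made up; only the page URL comes from the recipe):

```json
{
  "id": "https://iiif.io/api/cookbook/recipe/0068-newspaper/newspaper_issue_1-anno_p1.json",
  "type": "AnnotationPage",
  "label": { "en": [ "OCR text of page 1 (same content as the ALTO rendering)" ] },
  "items": [
    {
      "id": "https://example.org/newspaper/annotation/p1-line1",
      "type": "Annotation",
      "motivation": "supplementing",
      "body": {
        "type": "TextualBody",
        "value": "First line of OCR text",
        "format": "text/plain"
      },
      "target": "https://example.org/newspaper/canvas/p1#xywh=100,100,900,60"
    }
  ]
}
```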
> Can you use the fact that the annotations have `"motivation": "supplementing"`, or is that not specific enough? There is a new motivation TSG being formed that might coin a transcription motivation. Would that solve the issue?

I'm afraid `supplementing` is not specific enough, since a supplementing annotation could also be e.g. a translation of the text on the canvas (if I understood the spec correctly). A transcription motivation would indeed solve the issue, since I could simply ignore these in the presence of an OCR rendering 👍🏾
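If such a motivation were coined, the line annotations could signal their nature directly, and an indexer that already has the ALTO rendering could skip them. A purely hypothetical sketch (no transcription motivation exists in the spec yet; the value below is invented, and multiple motivations are expressed as an array):

```json
{
  "id": "https://example.org/newspaper/annotation/p1-line1",
  "type": "Annotation",
  "motivation": [ "supplementing", "transcribing" ],
  "body": {
    "type": "TextualBody",
    "value": "First line of OCR text",
    "format": "text/plain"
  },
  "target": "https://example.org/newspaper/canvas/p1#xywh=100,100,900,60"
}
```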
> Do we need some link between the annotations and the ALTO to say they are different formats of the same text?

I think the transcription motivation would probably be enough; something more advanced like this sounds like it could cause a lot more headaches than a simple motivation 😅
> We could add a label to the annotation page to say it's OCR data; could your interface then let the user choose which one they want?

In my use case, no, since the indexing is a fully automatic process without user interaction. And selecting between different indices at query time is afaik not supported by the Content Search API (except for the `motivation` query parameter).
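(For context, this is roughly what a motivation-filtered search request looks like; the endpoint is hypothetical, and which motivation values a given server supports is implementation-specific:)

```
https://example.org/newspaper/search?q=bread&motivation=supplementing
```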