Schema documentation about use of <SP> needs clarification #54

Open
cneud opened this issue Nov 23, 2018 · 1 comment

cneud commented Nov 23, 2018

See the discussion at UB-Mannheim/ocr-fileformat#78

artunit commented Dec 3, 2018

This issue generated considerable discussion at the 2018-11-29 Board Meeting. Although it was agreed that the schema does not strictly require the SP element, it remains ambiguous whether SP is expected. ABBYY FineReader exports SP by default, and docWorks, which makes use of FineReader, also produces SP elements. However, there are many ALTO documents without SP, and the XML used for ALTO can balloon when SP is included.

One of the concerns identified with the current implementation of SP is the handling of different Unicode whitespace characters, such as the ideographic space (U+3000) used in Chinese. A related concern arises when OCR is used as the basis of transcriptions. If a Fortran listing were OCRed, ALTO could not directly encode the 6 spaces before each statement. The SP element includes a width attribute, but a human reader would be expected to infer what it denotes in this case. In some non-Latin scripts the meaning of a word changes depending on spacing, and some languages use distinctive space variants, for example the Zero-Width Non-Joiner (ZWNJ, U+200C) in Persian; see this link, which outlines the variations in spaces supported by Unicode. There was general agreement that adding an optional CONTENT attribute to the SP element would be useful and, by extension, would open up the use of the Glyph element.
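
To make the proposal concrete, here is a minimal Python sketch of how a consumer might rebuild the text of a line if SP carried an optional CONTENT attribute. It rests on assumptions: CONTENT on SP is only the idea discussed at the meeting, not part of the published schema; the sample fragment, the WIDTH values and the `line_text` helper are made up for illustration; and the ALTO namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: CONTENT on SP is the proposed attribute, not an
# attribute defined by the published ALTO schema. Namespace omitted.
SAMPLE = (
    '<TextLine>'
    '<SP WIDTH="60" CONTENT="      "/>'       # six leading spaces (Fortran-style)
    '<String CONTENT="PRINT"/>'
    '<SP WIDTH="10"/>'                        # no CONTENT: whitespace unspecified
    '<String CONTENT="*"/>'
    '<SP WIDTH="20" CONTENT="&#x3000;"/>'     # ideographic space, U+3000
    '<String CONTENT="X"/>'
    '</TextLine>'
)


def line_text(text_line, default_space=" "):
    """Join String and SP children of a TextLine element into one string.

    An SP without CONTENT falls back to default_space, which is what a
    reader has to guess today from the width alone.
    """
    parts = []
    for child in text_line:
        if child.tag == "String":
            parts.append(child.get("CONTENT", ""))
        elif child.tag == "SP":
            parts.append(child.get("CONTENT", default_space))
    return "".join(parts)


text = line_text(ET.fromstring(SAMPLE))
print(repr(text))
```

With an explicit CONTENT, the six leading spaces and the ideographic space survive the round trip; the SP without CONTENT shows the current situation, where the consumer has to pick a default.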
