This issue generated considerable discussion at the 2018-11-29 Board Meeting. Although it is agreed that the ALTO schema does not strictly require the SP element, there is ambiguity about whether it is expected. ABBYY FineReader exports SP by default, and docWorks, which makes use of FineReader, also produces SP elements. However, many ALTO documents omit SP, and the XML used for ALTO can balloon when SP is included.
One of the concerns identified in the current implementation of SP is the handling of different Unicode sequences for whitespace, such as the Chinese ideographic space character. A related concern arises when OCR is used as the basis of transcriptions. If a Fortran listing were OCRed, ALTO could not directly encode the 6 spaces before each statement. The SP element includes a WIDTH attribute, but a human reader would have to infer what it denotes in this case. In some non-Latin scripts a word changes meaning depending on spacing, and some languages use distinctive space variants, for example the Zero-Width Non-Joiner (ZWNJ) in Persian; see this link, which outlines the variations in spaces supported by Unicode. There was general agreement that adding an optional CONTENT attribute to the SP element would be useful and would, by extension, open up the use of the Glyph element.
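To make the proposal concrete, here is a minimal Python sketch of how an optional CONTENT attribute on SP might be written and read back. It is an illustration only: CONTENT on SP is the proposed attribute and is not assumed to be part of the current schema, the helper names `make_sp` and `line_text` are hypothetical, the coordinate values are made up, and namespaces are omitted for brevity.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Space-like characters mentioned in the discussion.
SPACES = ["\u0020", "\u3000", "\u200C"]


def make_sp(content, hpos, vpos, width):
    """Build an SP element carrying the PROPOSED (hypothetical) CONTENT attribute."""
    return ET.Element("SP", {
        "HPOS": str(hpos),
        "VPOS": str(vpos),
        "WIDTH": str(width),
        "CONTENT": content,  # assumption: not in the current ALTO schema
    })


def line_text(text_line):
    """Rebuild the text of a TextLine.

    String contributes its CONTENT; SP contributes its CONTENT if present,
    otherwise a single U+0020, which is what a reader must guess today.
    """
    parts = []
    for child in text_line:
        if child.tag == "String":
            parts.append(child.get("CONTENT", ""))
        elif child.tag == "SP":
            parts.append(child.get("CONTENT", " "))
    return "".join(parts)


# Show the code point and Unicode name of each space variant.
for ch in SPACES:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# A fixed-form Fortran statement: six leading spaces, then the statement.
line = ET.Element("TextLine")
line.append(make_sp(" " * 6, 100, 200, 60))
ET.SubElement(line, "String", {"CONTENT": "PRINT"})
line.append(make_sp(" ", 210, 200, 10))
ET.SubElement(line, "String", {"CONTENT": "*,"})

print(repr(line_text(line)))
```

With an explicit CONTENT value, the six leading spaces survive a round trip; with only WIDTH, a consumer would have to guess how many characters, and which space character, the gap represents.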
See the discussion at UB-Mannheim/ocr-fileformat#78