ocr-tests

Test 1:

The diff between the actual text (left) and the OCR result (json) of this image (right)...

<iframe src="http://prose.io/#ngawangtrinley/starter"></iframe>

https://github.com/ngawangtrinley/ocr-tests/compare/f0035c4...3baffd5

...highlights several types of issues:

  • '༥' (U+0F25) at the end of the header wasn't detected, and an extra '།' (U+0F0D) appeared at the end of the text.
  • '࿒' (U+0FD2) is replaced by ':' (U+003A) at the start of lines and by '་' (U+0F0B) in the middle of lines.
  • 'ཿ' (U+0F7F) is ignored.
  • Tibetan enclosed alphanumerics (replaced here by ①...) aren't detected at all, most probably because they aren't part of the Tibetan Unicode block.
  • A '་' (U+0F0B) was added between two sentences in line 18, most probably picked up from the text on the back of the page.
  • The remaining issues are letter combinations used to transliterate Sanskrit (very common in Buddhist literature) that might not have featured in the training data.
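
Issues like the ones listed above can be found programmatically by comparing the reference text with the OCR output character by character and printing the code point of every mismatch. The following is only a minimal sketch using Python's standard `difflib` and `unicodedata` modules; the helper names `describe` and `char_diff` and the sample strings are hypothetical and are not taken from the test data.

```python
import difflib
import unicodedata


def describe(text):
    """Return 'U+XXXX NAME' for each character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]


def char_diff(expected, actual):
    """List (operation, expected_chars, actual_chars) for every mismatch."""
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    return [
        (tag, describe(expected[i1:i2]), describe(actual[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical sample: U+0FD2 at the start of a line read as a colon.
expected = "\u0fd2\u0f40\u0f0b"
actual = ":\u0f40\u0f0b"
for op in char_diff(expected, actual):
    print(op)
```

Grouping the resulting tuples by code point would give a rough count of how often each character is dropped, inserted, or substituted.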