-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Synthetical comparison with Abbyy #108
Comments
Please test using script/Latin which supports all languages written in
Latin script. That may give better results than eng.
I am assuming that you used tesseract 4.0.0-beta. It is possible that
legacy tesseract gives better results than LSTM based.
…On Fri, Aug 3, 2018 at 2:58 PM jbarth-ubhd ***@***.***> wrote:
Dear Reader,
I've did some comparison with random text.
- Random text, to test the raw engine performance, not dictionaries
- because foreign, perhaps transcripted (foreign) names sometimes look
like "Dhagax", "Hlabisa", "Pniv", ...
here is the original random text:
original text <https://digi.ub.uni-heidelberg.de/diglitData/v/orig.txt>
here is the generated image (font: GaramondNo8):
[image: Image of "Scan"]
<https://camo.githubusercontent.com/3aa4d17c2d9486c47bc4f9c6e19cf2893d9f7c9d/68747470733a2f2f646967692e75622e756e692d68656964656c626572672e64652f6469676c6974446174612f762f6f7269673030312e746966>
Result:
Filename Levenshtein distance
abbyy11r8u3-English.txt 5
abbyy11r8u3-GermanLuxembourg.txt 2
orig.txt 0
tess-eng.txt 221
tess-engWithoutDict.txt 214
Abbyy language "GermanLuxembourg" has no "full dictionary", don't know,
what this exactly means, but results are better than "English", because
"itsan" would (using English) be recognized as "its an".
engWithoutDict has been made using combine_tessdata & rm *-dawg &
combine_tessdata.
Kind regards,
Jochen
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#108>, or mute the thread
<https://github.com/notifications/unsubscribe-auth/AE2_o2o4YIv-nergpBLmHEFcqnQAy-hWks5uNBfGgaJpZM4VtquT>
.
--
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
|
I've updated the table above. Thanks for the hint with script/Latin.traineddata. Seems that tesseract3 has relatively Much better. |
Here the wdiff -3 from tess4-scriptLatinWithoutDict-aq.txt (ą replaced by q):
(statistics not including [-chwin-] {+thwtn+} ... )
|
If there is an actual use case for this, I would suggest to finetune Latin traineddata with similarly generated random training text - using the Garamond font being tested and finetune for IMPACT - 300-400 iterations only. |
Thanks for sharing! Testing with random characters can make the lstm-based recognizer less accurate than real world text sample, due to the fact that during training the network learns not just letters shapes, but also builds a language model. |
Results from ABBYY (GermanLuxembourg):
Execution time was 6.6 s. |
Result from Tesseract 4.0.0-beta.4 (tessdata/eng, --oem 0):
Execution time was 43.8 s. With
Using the
|
Result from Tesseract 4.0.0-beta.4 (tessdata/eng, --oem 1):
Execution time was 85.7 s. |
Result from Tesseract 4.0.0-beta.4 (tessdata/script/Latin):
Execution time was 121 s. Surprisingly the result becomes better with
All test runs with the default page segmentation mode report diacritics (although there are none) which might be related to bad recognition rates:
|
@stweil What about tessdata_fast and tessdata_best? |
Apart from Shree question, |
As ABBYY used a single thread. The Tesseract timings where also single threaded results. But my first focus is not execution time: quality of the OCR results is much more important for our application on old books and journals. We have other reports that Tesseract beats ABBYY when the text is already split in single lines. That would imply that Tesseract is less good than ABBYY for layout recognition (line separation), maybe also for binarization. |
Result from Tesseract 4.0.0-beta.4 (tessdata_best/script/Latin, --psm 6):
Execution time was 318 s. Result from Tesseract 4.0.0-beta.4 (tessdata_fast/script/Latin, --psm 6):
Execution time was 71 s. |
|
https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00#integration-with-tesseract
The Tesseract 4.00 neural network subsystem is integrated into Tesseract as
a line recognizer. It can be used with the existing layout analysis to
recognize text within a large document, or it can be used in conjunction
with an external text detector to recognize text from an image of a single
textline.
The neural network engine is the default for 4.00. To recognize text from
an image of a single text line, use SetPageSegMode(PSM_RAW_LINE). This can
be used from the command-line with -psm 13
@
stweil Would --psm 13 give better results?
|
Maybe – what would you suggest for the line separation? |
Oh, thanks for pointing this out, that would need to be done externally.
…On Wed, Aug 15, 2018 at 1:58 AM Stefan Weil ***@***.***> wrote:
Would --psm 13 give better results?
Maybe – what would you suggest for the line separation?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#108 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AE2_oyyqRoRAoDLDORyvt7GsCnHVm4tmks5uQzL-gaJpZM4VtquT>
.
--
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
|
@stweil In case, you are looking at improving line detection/page
segmentation in tesseract, leptonica has some 'newer' functions which gave
good results with test of Arabic and Devanagari.
DanBloomberg/leptonica#236
On Wed, Aug 15, 2018 at 9:40 AM, Shree Devi Kumar <[email protected]>
wrote:
… Oh, thanks for pointing this out, that would need to be done externally.
On Wed, Aug 15, 2018 at 1:58 AM Stefan Weil ***@***.***>
wrote:
> Would --psm 13 give better results?
>
> Maybe – what would you suggest for the line separation?
>
> —
> You are receiving this because you commented.
> Reply to this email directly, view it on GitHub
> <#108 (comment)>,
> or mute the thread
> <https://github.com/notifications/unsubscribe-auth/AE2_oyyqRoRAoDLDORyvt7GsCnHVm4tmks5uQzL-gaJpZM4VtquT>
> .
>
--
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
--
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
|
Dear Reader,
I've did some comparison with random text.
here is the original random text:
original text
here is the generated image (font: GaramondNo8):
Result:
Abbyy language "GermanLuxembourg" has no "full dictionary", don't know, what this exactly means, but results are better than "English", because "itsan" would (using English) be recognized as "its an".
engWithoutDict has been made using
Kind regards,
Jochen
The text was updated successfully, but these errors were encountered: