Word list in eng.traineddata #179
all traineddata files inspected:
The script mentioned above is specialized for English, German, French, Italian, Spanish and Portuguese. In many languages the strange "word" |
Therefore dictionaries for non-Latin scripts like Chinese, Hebrew, Arabic, ... show a high percentage of "strange" characters (which are not strange at all, but simply not based on Latin script).
The reason might be simple: as far as I know, most models were trained not only to recognize a specific language or script, but also to recognize a minimum of English. I assume that the same basic English dictionary was then added to each language-specific dictionary.
Whether strings consisting of all-uppercase characters, or containing non-alphabetic characters such as punctuation or digits, make sense in the "dictionary" depends on the implementation. Tesseract's output treats any sequence of non-space characters as a word. Many "words" can therefore carry trailing punctuation, and words like "a=accepted" or "IBM" are perfectly valid. In addition, any lowercase word that might occur at the beginning of a sentence might require a twin that starts with an uppercase character. There can also be partial "words" that result from hyphenation. Of course, one could argue that certain "word" variants are rare and not worth including in those special "dictionaries".
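To make the point about whitespace-delimited "words" concrete, here is a minimal Python sketch. It does not call Tesseract; `split_words` and `sentence_start_twin` are hypothetical helper names that only mimic the behaviour described above, under the assumption that a "word" is any run of non-space characters.

```python
from typing import List, Optional


def split_words(text: str) -> List[str]:
    """Split on whitespace only, so punctuation stays attached to each token."""
    return text.split()


def sentence_start_twin(word: str) -> Optional[str]:
    """Return the capitalized twin of a purely alphabetic lowercase word.

    Returns None for anything else (uppercase words, tokens with punctuation).
    """
    if word.isalpha() and word.islower():
        return word[0].upper() + word[1:]
    return None


tokens = split_words("IBM said a=accepted is fine, really.")
# "fine," and "really." keep their trailing punctuation as part of the word.
twins = {t: sentence_start_twin(t) for t in tokens}
```

With this definition of a word, a dictionary that should match such output needs entries like "fine," or "Said" in addition to the plain lowercase forms.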
Compared with fra, deu, ita and spa, the word list in eng.traineddata contains relatively many ambiguous words (checked with https://gist.github.com/jbarth-ubhd/8d5ceb4035bf2d89700117a311209f20 ):
AMBIGIOUS (EXCERPT): Abstract;In addRole Alberta.ca AngMarTV AppSight aXe BarCap Betting| BioTalent BOX/VPOWER B|S|T BTsites CafeMom CATEGORY:NONE ChemGrout classi®cation CMDs CyberCoders d’Alzon Disc™ DomainTools EARTHWEBNEWS.COM ebizQ EBV-infected Elly_Brown ESPN.com Fire).gba FishBowlDC GEO's getFieldType GFP-Fes GOV/PGC/A GreatSeats.com HKFlix HMSHost icon.gif IconLover image/file JobList KCAL/MOL kgw.com KrF LFTs liveCD load_five MbePoint McBurney McGrady MESSAGE Metz® MOVIES/HDTV NCN-pincer NetFlix ~NEW NotesViewColumn NowBuy NowVisit om/fresh PollDaddy <POSSIBLE <<PREVIOUS PRICES|TIPS ProGrad QCard Quotes.net RakionSEA Re:finlay RTDs SciencesLocation Security| >see SEOs ServerBeach Services/Armed Solution™ <STDIO.H> TheBlackElf T/L UNjobs.org usawallpaper.com Ventolin® ViewVC VivirLatino vWD WebCopier www.ask.com <?xml
PS: fra, deu, ita and spa also contain ~30% all-UPPERCASE words. Is this intended?
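A rough way to reproduce figures like the all-uppercase share is sketched below in Python. This is not the linked gist's logic; the categories and the names `classify` and `shares` are assumptions made for illustration, and the input is assumed to be a word list already extracted into one word per entry.

```python
from collections import Counter
from typing import Dict, Iterable


def classify(word: str) -> str:
    """Assign a word to one coarse category (illustrative heuristic only)."""
    if not word.isalpha():
        return "non_alpha"    # contains digits, punctuation, symbols, ...
    if word.isupper():
        return "all_upper"    # e.g. IBM, MESSAGE
    if any(c.isupper() for c in word[1:]) and any(c.islower() for c in word):
        return "mixed_case"   # e.g. CafeMom, aXe
    return "plain"            # e.g. house, Abstract


def shares(words: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of words in each category."""
    counts = Counter(classify(w) for w in words)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}


sample = ["IBM", "CafeMom", "ESPN.com", "house",
          "MESSAGE", "aXe", "Abstract", "kgw.com"]
result = shares(sample)
```

Because `str.isalpha()` accepts letters of any script, words in Chinese, Hebrew or Arabic land in "plain" rather than being counted as strange.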