-
Notifications
You must be signed in to change notification settings - Fork 0
begemotv2718/cuneiform-dict
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
This is dictionary generating code for cuneiform. The main parts of the dictionary are the following: header tree endings list writeheader.c generates header speltree.hs generates tree endings.hs generates endings list Each of these three file can be compiled in a default way (gcc -o writeheader, ghc ) Steps to produce dictionary: 1) prepare table of word stems in the format: <stem> <frequency> <index in ending table> 2) prepare table of word endings (each line should contain a list of the separate nest for endings). 3) run "speltree" on stem table. Collect tree length from the output 4) run "endings" on the endings table. Note the length of the components reported by this program. 5) run "writeheader" feeding it table length from speltree and the three lengthes of the ending sections from the output of endings 6) concatenate output files endings usage: cat endingfile | endings <outputfile1> speltree usage: cat stemfile | speltree <outputfile1> example input files (note that for english language this is slightly overcomplicated, since it is not a flective language like russian) We have two word nests calm calming calmed create created creating creation creative We should assign some kind of frequencies to these word nests. The frequencies for the most frequent words should be close to 7 and the frequencies of the least used words should be close to 1. I think that the simple logarithmic measure of the word frequency is ok, something like 1+6*log(freq)/log(maxfreq) can do, but this is not yet settled for me. Let's assume that we assigned frequency 5.2 to the word nest (calm) and 6.1 to the word nest (create). The resulting endings list ---------------- e ed ing ion ive ed ing ---------------- The resulting stem file ----------------- calm 5.2 calm 1 5.2 creat 0 6.1 --------------------- Note that we repeated the word calm twice, this indicates that the word stem without ending is a valid word by itself. For this type of word we do not write ending file line. The line 'calm 1 5.2' means 'use stem calm and suffix table from line 1 of suffix file, and assign frequency 5.2 to it'. Correspondingly 'creat 0 6.1' means 'use stem creat with suffix table from line 0 of suffix file, assign frequency 6.1 to it'.
About
Dictionary generating for cuneiform
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published