GSOC 2019 - Development of a Greek open source Morphological dictionary and application of it to Greek spelling tools
- An SQL database containing the following data:
  - A morphological dictionary containing about 900,000 entries, with 518,000 distinct surface forms, with information described according to Universal Dependencies.
  - Definitions for most lemmas
  - Etymologies for most lemmas
  - 18,500 synonyms, 12,500 of which are for Greek
  - 5,500 antonyms, 4,300 of which are for Greek
  - 3,310 normalizations of words
  - Almost 150,000 translations
- A spelling dictionary with 1,047,200 words, up from the 828,807 of the previous dictionary used in open source programs. The dictionary also includes frequencies for all words. It will be integrated into the spelling dictionaries of Firefox and Thunderbird.
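To make the shape of the morphological data more concrete, here is a minimal sketch of how surface forms with Universal Dependencies tags could be stored and queried with Python's built-in `sqlite3` module. The table name `forms`, its columns, and the sample rows are hypothetical and chosen for illustration only; the actual database schema may differ.

```python
import sqlite3

# Hypothetical schema: the real database's table and column names may differ.
SCHEMA = """
CREATE TABLE forms (
    form  TEXT NOT NULL,
    lemma TEXT NOT NULL,
    upos  TEXT NOT NULL,
    feats TEXT NOT NULL
);
CREATE INDEX idx_forms_form ON forms(form);
"""


def build_demo_db():
    """Create an in-memory database with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    rows = [
        # (surface form, lemma, UD POS tag, UD features)
        ("λόγος", "λόγος", "NOUN", "Case=Nom|Gender=Masc|Number=Sing"),
        ("λόγου", "λόγος", "NOUN", "Case=Gen|Gender=Masc|Number=Sing"),
        ("λόγοι", "λόγος", "NOUN", "Case=Nom|Gender=Masc|Number=Plur"),
    ]
    conn.executemany("INSERT INTO forms VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def analyses(conn, form):
    """Return (lemma, upos, features dict) for every analysis of a surface form."""
    cur = conn.execute(
        "SELECT lemma, upos, feats FROM forms WHERE form = ?", (form,)
    )
    results = []
    for lemma, upos, feats in cur:
        # UD features are stored as "Name=Value" pairs separated by "|".
        feat_dict = dict(pair.split("=", 1) for pair in feats.split("|"))
        results.append((lemma, upos, feat_dict))
    return results
```

Calling `analyses(build_demo_db(), "λόγου")` looks up one surface form and returns its lemma, POS tag, and features. Keeping one row per (form, analysis) pair lets an ambiguous form carry several analyses.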
Documentation can be found in the data directory.
Information about running the script is found here
You can find the final report in the following gist.
During the summer, a morphological dictionary in sqlite3 format will be created. Information will be extracted automatically from Greek Wiktionary with a Python script that uses the pymediawiki library. In addition, words and morphological information will be added to the spelling tool dictionaries.
Creation of a parsing tool for Greek Wiktionary that parses nouns, adjectives, and verbs using Universal Dependencies POS tags.
Addition of the remaining parts of speech to the morphological dictionary, and addition of further information tags, such as toponyms and terminology, extracted from page categories.
Addition of the extracted surface forms to Greek spelling dictionaries, including words from reliable sources such as European Parliament translations.
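The parsing step can be illustrated with a small sketch that maps part-of-speech section headings in a page's wikitext to Universal Dependencies POS tags. It assumes that Greek Wiktionary marks such sections with headings of the form `==={{ουσιαστικό|el}}===`. The mapping table, the regular expression, and the sample page are illustrative assumptions, not the project's actual parser.

```python
import re

# Assumed mapping from Greek Wiktionary part-of-speech template names
# to Universal Dependencies POS tags (only the three classes named above).
POS_TO_UPOS = {
    "ουσιαστικό": "NOUN",
    "επίθετο": "ADJ",
    "ρήμα": "VERB",
}

# Matches headings such as ==={{ουσιαστικό|el}}=== (assumed heading convention).
HEADING = re.compile(r"^=+\s*\{\{([^|}]+)\|el\}\}\s*=+\s*$", re.MULTILINE)


def upos_tags(wikitext):
    """Return the UD POS tags found in a page, in order, without duplicates."""
    tags = []
    for name in HEADING.findall(wikitext):
        tag = POS_TO_UPOS.get(name.strip())
        if tag and tag not in tags:
            tags.append(tag)
    return tags


# Hypothetical, simplified page text used only to exercise the function.
sample = """=={{-el-}}==
==={{ετυμολογία}}===
==={{ουσιαστικό|el}}===
'''λόγος''' ''αρσενικό''
"""
```

With this sample, `upos_tags(sample)` picks up only the noun section, since the language and etymology headings do not match the assumed pattern. Headings whose template name is not in the mapping are skipped rather than guessed.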
- Google Summer of Code participant: Konstantinos Agiannis
- Mentor: Kostas Papadimas
- Mentor: Theodoros Karounos
- Mentor: Alexios Zavras
The source code is under GPLv3.
The produced database with the morphological dictionary is under CC BY-SA 3.0.