Word vector representations have been studied extensively on large text datasets. However, only a few studies analyze semantic representations of low-resource languages, particularly when only a small corpus is available. In most cases, low-resource languages lack traditional natural language processing tools such as lemmatizers and stemmers. In this study, we introduce a methodology for building word embeddings for low-resource languages. The methodology consists of defining accurate preprocessing steps, applying a language-independent stemmer, and introducing techniques for building word vector representations. In addition, we propose a simple word embedding evaluation scheme that can be easily adapted to any language. Using this methodology, we trained word embeddings for the Buryat language. We have made the source code and the resulting word embeddings corpus publicly available to promote further research.
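Two of the methods listed in the tables below, GloVe and SVD, are count-based: they start from how often words co-occur within a context window. The following is a minimal sketch of that counting step only, not the code used in this project. The function name `cooccurrence_counts`, the symmetric unweighted window, and the toy tokens are assumptions made for illustration.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count symmetric co-occurrences of tokens at most `window` positions apart.

    Hypothetical helper for illustration; real pipelines often weight
    counts by distance, which this sketch does not do.
    """
    counts = Counter()
    for tokens in sentences:
        for i, target in enumerate(tokens):
            # Look only to the right and record both directions,
            # so every pair is visited once and counts stay symmetric.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                context = tokens[j]
                counts[(target, context)] += 1
                counts[(context, target)] += 1
    return counts


# Toy, made-up tokens; a real corpus would be preprocessed and stemmed first.
corpus = [["a", "b", "c", "d"], ["a", "b"]]
counts = cooccurrence_counts(corpus, window=2)
```

A larger `window` lets more distant words count as context, which tends to capture more topical and less syntactic similarity.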
Buryat Language Embeddings:
| | 2 | 5 | 10 |
---|---|---|---|
50 | CBOW SG GloVe SVD | CBOW SG GloVe SVD | CBOW SG GloVe SVD |
100 | CBOW SG GloVe SVD | CBOW SG GloVe SVD | CBOW SG GloVe SVD |
500 | CBOW SG GloVe SVD | CBOW SG GloVe SVD | CBOW SG GloVe SVD |
Erzya Language Embeddings:
| | 2 | 5 | 10 |
---|---|---|---|
50 | CBOW SG GloVe SVD | CBOW SG GloVe SVD | CBOW SG GloVe SVD |
100 | CBOW SG GloVe SVD | CBOW SG GloVe SVD | CBOW SG GloVe SVD |
500 | CBOW SG GloVe SVD | CBOW SG GloVe SVD | CBOW SG GloVe SVD |
Komi Language Embeddings:
| | 2 | 5 | 10 |
---|---|---|---|
50 | CBOW SG GloVe SVD | CBOW SG GloVe SVD | CBOW SG GloVe SVD |
100 | CBOW SG GloVe SVD | CBOW SG GloVe SVD | CBOW SG GloVe SVD |
500 | CBOW SG GloVe SVD | CBOW SG GloVe SVD | CBOW SG GloVe SVD |
Files for evaluation: bxr myv kv
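Once an embedding file has been downloaded, a quick sanity check is to look up nearest neighbours by cosine similarity. The sketch below assumes the file is in the plain-text word2vec layout (an optional `count dim` header line, then one `word v1 v2 ...` line per word); the actual format of the files linked above is not stated here and may differ. The helper names `load_text_vectors`, `cosine`, and `nearest` are hypothetical, and the in-memory sample stands in for a real file.

```python
import io
import math


def load_text_vectors(handle):
    """Read 'word v1 v2 ...' lines into a dict (assumed text format)."""
    vectors = {}
    for line_no, line in enumerate(handle):
        parts = line.rstrip().split(" ")
        # Skip an optional word2vec-style header such as "3 2".
        if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if len(parts) < 2:
            continue  # blank or malformed line
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(vectors, word, k=3):
    """Return the k words most similar to `word`, best first."""
    query = vectors[word]
    scored = [(cosine(query, vec), other)
              for other, vec in vectors.items() if other != word]
    return [other for _, other in sorted(scored, reverse=True)[:k]]


# Toy stand-in for a downloaded file; the words and values are made up.
sample = io.StringIO("3 2\nx 1.0 0.0\ny 0.9 0.1\nz 0.0 1.0\n")
vectors = load_text_vectors(sample)
neighbours = nearest(vectors, "x", k=2)
```

For a real file, replace `sample` with `open(path, encoding="utf-8")`, where `path` points to the downloaded embeddings.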
For any question, please contact [email protected]
@inproceedings{konovalov2018learning,
title={Learning word embeddings for low resource languages: the case of Buryat},
author={Konovalov, VP and Tumunbayarova, ZB},
booktitle={Komp'juternaja Lingvistika i Intellektual'nye Tehnologii},
pages={331--341},
year={2018}
}