Linear algebra course project
In this project we applied three topic modeling methods (LSA, pLSA and LDA) to Ukrainian text corpora and evaluated them with topic coherence measures. The best result was a 63% coherence score, obtained with the LSA algorithm on TF-IDF features.
- Preprocessing: tokenization and removal of stopwords, punctuation, hyperlinks and numbers. We also extracted the lemma of each word using the stanza pretrained model for Ukrainian and created two models for further comparison: with and without lemmatization.
- To feed the text data into the algorithms, we vectorized it with the Bag of Words model and TF-IDF.
- Used the gensim implementations of LSA, pLSA and LDA.
- Evaluated their performance using the topic coherence score.
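
The preprocessing and vectorization steps listed above can be illustrated with a minimal, standard-library-only sketch. It is not the project's actual code: the function names, the regular expressions and the tiny stopword set are assumptions made for illustration, the stanza lemmatization step is left out, and the weighting shown is the plain `tf * log2(N / df)` variant of TF-IDF, which may differ from what a library implementation computes (for example, in normalization).

```python
import math
import re
from collections import Counter

# Illustrative subset only (assumption); a real run would use a full
# Ukrainian stopword list.
STOPWORDS = {"і", "й", "та", "в", "у", "на", "що", "це", "з", "до", "не"}

# Hyperlinks are removed before tokenization.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# A token is a run of letters, optionally joined by an apostrophe or hyphen
# (e.g. "комп'ютер"). Digits and punctuation never match, so they are dropped.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’ʼ-][^\W\d_]+)*")


def preprocess(text):
    """Lowercase the text, then drop hyperlinks, numbers, punctuation and stopwords."""
    text = URL_RE.sub(" ", text.lower())
    return [tok for tok in TOKEN_RE.findall(text) if tok not in STOPWORDS]


def bag_of_words(tokens):
    """Bag of Words: map each term to its raw count in one document."""
    return Counter(tokens)


def tf_idf(docs):
    """Weight each term count by log2(N / document frequency).

    `docs` is a list of token lists; the result is a list of
    {term: weight} dictionaries, one per document.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = bag_of_words(doc)
        weighted.append(
            {term: count * math.log2(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return weighted
```

A term that occurs in every document gets weight 0 under this formula, which is why TF-IDF features emphasize terms that distinguish one document from another, while Bag of Words keeps the raw counts.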
git clone https://github.com/romanyshyn-natalia/Topic-modeling-on-Ukrainian-text.git
cd Topic-modeling-on-Ukrainian-text
pip3 install -r requirements.txt --no-cache-dir
The dataset we used is the UA-GEC corpus, a collection of texts written by ordinary people: essays, blog and social network posts, reviews, letters, etc. The corpus is split into two parts, train and test; we used both to obtain a larger corpus. Every document comes in three representations: annotated (annotated), original (source) and corrected (target). We used the corrected (target) versions.
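
Combining both splits of the corrected documents could look roughly like the sketch below. The directory layout (`<root>/<split>/<version>/*.txt`) and the function name `load_documents` are assumptions made for illustration, not something this README specifies, so the paths may need adjusting to match the actual corpus checkout.

```python
from pathlib import Path


def load_documents(root, splits=("train", "test"), version="target"):
    """Read every .txt file under <root>/<split>/<version>/ into a list of strings.

    The layout is an assumption; `version` would be one of
    "annotated", "source" or "target".
    """
    documents = []
    for split in splits:
        folder = Path(root) / split / version
        # Sort for a reproducible document order across runs.
        for path in sorted(folder.glob("*.txt")):
            documents.append(path.read_text(encoding="utf-8"))
    return documents
```

Calling `load_documents(corpus_root)` with the default arguments returns the train documents followed by the test documents, all in their corrected form.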
From Plot 1 we see that the best model is LSA, with a 63% coherence score on TF-IDF features without lemmatization. From Plot 2 we see that lemmatized features work better for LDA, which reaches a 62% coherence score.
- Natalia Romanyshyn
- Daria Omelkina
- Anna Korabliova
Ukrainian Catholic University, 2021