Topic-modeling-on-Ukrainian-text

Linear algebra course project

Description:

In this project we applied three topic modeling methods (LSA, pLSA, and LDA) to Ukrainian text, evaluated the resulting models with topic coherence measures, and obtained the best coherence score, 63%, with the LSA algorithm on TF-IDF features.
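As a reminder of what LSA computes, here is a minimal sketch, not the project's code. LSA factorizes the term-document matrix with a truncated SVD, and each left singular vector is read as a topic. The sketch approximates only the leading singular vector by power iteration in plain Python. The matrix, the vocabulary, and the helper names (`matvec`, `transpose`, `top_topic`) are hypothetical and chosen for illustration; a library implementation would be used on real data.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def top_topic(term_doc, n_iter=100):
    """Approximate the leading left singular vector of a term-document matrix.

    Power iteration on A A^T; assumes the matrix is not all zeros.
    """
    a_t = transpose(term_doc)
    u = [1.0] * len(term_doc)  # deterministic starting vector
    for _ in range(n_iter):
        u = matvec(term_doc, matvec(a_t, u))  # u <- A A^T u
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
    # The singular value is the length of A^T u once u has converged.
    sigma = math.sqrt(sum(x * x for x in matvec(a_t, u)))
    return u, sigma


# Toy term-document counts (rows are terms, columns are documents).
vocab = ["кіт", "пес", "банк", "гроші"]
term_doc = [
    [1, 1, 0],
    [1, 2, 0],
    [0, 0, 1],
    [0, 0, 1],
]

u, sigma = top_topic(term_doc)
# Rank terms by their weight in the leading topic.
ranked = sorted(zip(vocab, u), key=lambda pair: -abs(pair[1]))
```

In this toy matrix the animal terms co-occur across two documents, so they dominate the leading topic while the finance terms get a weight close to zero.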

Pipeline:

Preprocessing: tokenization and removal of stopwords, punctuation, hyperlinks, and numbers. We also extracted the lemma of each word using the stanza pretrained model for Ukrainian and built two sets of models for comparison: with and without lemmatization.
To feed the text into the algorithms, we vectorized it with the Bag of Words model and TF-IDF.
Used the gensim implementations of LSA, pLSA and LDA.
Evaluated their performance using the topic coherence score.
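The first two steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's code: the stopword list is a tiny hypothetical sample, lemmatization is left out because it needs the stanza model, and the TF-IDF formula is a textbook variant whose weights may differ from those of a library implementation. The function names `preprocess`, `bag_of_words`, and `tf_idf` are made up for this example.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stopword sample for illustration only; a real run
# would need a full Ukrainian stopword list.
STOPWORDS = {"і", "та", "в", "на", "що", "це", "до", "з"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# [^\W\d_] matches Unicode letters only (Cyrillic included), so digits and
# punctuation are dropped; apostrophes inside a word are kept.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’ʼ][^\W\d_]+)*")


def preprocess(text):
    """Lowercase, strip hyperlinks, and keep only non-stopword word tokens."""
    text = URL_RE.sub(" ", text.lower())
    return [tok for tok in WORD_RE.findall(text) if tok not in STOPWORDS]


def bag_of_words(tokens):
    """Bag of Words: map each token to its count in the document."""
    return Counter(tokens)


def tf_idf(docs_tokens):
    """Weight each term by (relative term frequency) * ln(N / document frequency)."""
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    weighted = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        weighted.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weighted


docs = [
    "Я читаю книгу 2 рази на день: https://example.com/a",
    "Я читаю газету!",
]
tokens = [preprocess(doc) for doc in docs]
bows = [bag_of_words(doc_tokens) for doc_tokens in tokens]
weights = tf_idf(tokens)
```

Terms that occur in every document get a TF-IDF weight of zero under this formula, which is why TF-IDF features emphasize the words that distinguish documents from one another.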

Quick Start

git clone https://github.com/romanyshyn-natalia/Topic-modeling-on-Ukrainian-text.git
cd Topic-modeling-on-Ukrainian-text
pip3 install -r requirements.txt --no-cache-dir

Dataset:

The dataset we used is the UA-GEC corpus, a collection of texts written by ordinary people: essays, blog and social network posts, reviews, letters, etc. It is split into two parts, train and test; we used both to get a larger corpus. Every document comes in three representations: annotated (annotated), original (source), and corrected (target). We used the corrected (target) version.

Results:

Plot 1 shows that the best model is LSA, with a 63% coherence score on non-lemmatized TF-IDF features. Plot 2 shows that lemmatized features work better for LDA, which reaches a 62% coherence score.
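To make the notion of a coherence score concrete, here is a minimal sketch of one simple measure, UMass coherence, in plain Python. This README does not say which coherence measure produced the numbers above, so UMass is an assumption chosen for its simplicity. Its values are on a log scale and are not comparable to the percentages reported here. The function name `umass_coherence` and the toy documents are hypothetical.

```python
import math


def umass_coherence(topic_words, docs_tokens):
    """Mean over word pairs of log((D(w_later, w_earlier) + 1) / D(w_earlier)).

    D counts the documents containing all the given words. topic_words is
    assumed to be sorted from most to least probable, to contain at least
    two words, and to use only words that occur in the corpus.
    """
    doc_sets = [set(tokens) for tokens in docs_tokens]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    for m in range(1, len(topic_words)):
        for l in range(m):
            later, earlier = topic_words[m], topic_words[l]
            scores.append(
                math.log((doc_freq(later, earlier) + 1) / doc_freq(earlier))
            )
    return sum(scores) / len(scores)


toy_docs = [
    ["кіт", "пес"],
    ["кіт", "пес", "миша"],
    ["банк", "гроші"],
    ["банк", "кредит"],
]

coherent = umass_coherence(["кіт", "пес"], toy_docs)
mixed = umass_coherence(["кіт", "банк"], toy_docs)
```

A topic whose top words tend to appear in the same documents scores higher than one that mixes unrelated words, which is the intuition behind comparing models by coherence.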

Credits:

Natalia Romanyshyn
Daria Omelkina
Anna Korabliova

Ukrainian Catholic University, 2021
