This repository contains a novel implementation for Topic Extraction.
In short:
In recent years, the need to automate information retrieval from texts has grown considerably as part of media digitization. The emerging flow of data requires efficient Natural Language Processing (NLP) techniques, such as Keyword Extraction (KE) algorithms, among many others. Unfortunately, languages with a small number of speakers, such as the Nordic languages, face a lack of resources that limits the potential benefits of NLP applications.
This thesis introduces a novel KE model for topic extraction that follows an unsupervised hybrid approach, making use of statistical and linguistic text features such as syntactic analysis within sentences, entity recognition, word frequency and position in the text, and semantic similarity. The model can be easily configured to predict keyphrases for different KE scenarios, and it can make predictions for English, Danish, Swedish and Norwegian. The novel algorithm has shown competitive performance compared to other state-of-the-art unsupervised KE methods when evaluated on seven annotated datasets with texts in four different languages.
The report surveys the current state of the art in KE for Scandinavian languages and suggests considering KE not as a final step but as an initial or complementary phase for other NLP tasks. The proposed implementation can be used for document retrieval in many NLP applications, such as topic clustering, summarization or data visualization.
For the moment, the Python package supports two operations:
- Predict keywords and save the results in a CSV file.
- Evaluate predicted keywords against human-annotated keywords.
See the scripts folder to run these operations.
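
To give a rough idea of what comparing predicted keywords with human-annotated keywords can involve, here is a minimal sketch of exact-match precision, recall and F1. It is not the evaluation code shipped in this repository: the function names `normalize` and `evaluate_keywords`, the lowercase/whitespace normalization, and the exact-match criterion are all assumptions made for illustration.

```python
from typing import Dict, Iterable


def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace so matching ignores case and spacing."""
    return " ".join(phrase.lower().split())


def evaluate_keywords(predicted: Iterable[str], annotated: Iterable[str]) -> Dict[str, float]:
    """Hypothetical helper: exact-match precision, recall and F1 for one document."""
    pred = {normalize(p) for p in predicted}
    gold = {normalize(a) for a in annotated}
    matched = pred & gold  # keyphrases found in both sets

    # Guard against empty sets to avoid division by zero.
    precision = len(matched) / len(pred) if pred else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up example data, not taken from any of the thesis datasets.
    predicted = ["Keyword Extraction", "nordic languages", "media"]
    annotated = ["keyword extraction", "Nordic  languages", "NLP", "topic extraction"]
    print(evaluate_keywords(predicted, annotated))
```

Exact matching is the strictest choice; an actual evaluation may also apply stemming or partial matching, which this sketch does not attempt.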