Manually tokenized and POS-tagged proverb in Northern Kurdish: ‘As every tree stands over its roots, so does every human blossom with their mother tongue.’
This repository provides access to all trained and ready-to-use POS models and the training and test data discussed in our MWE-UD 2024 paper.
This folder contains:
- gold_data.tsv: the gold-standard, manually annotated and tokenized dataset, which consists of 136 sentences (2,937 tokens).
- kmr_mg-ud-complete-augmented.conllu: the augmented UD Kurmanji dataset.
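The .conllu file can be read without extra dependencies, because CoNLL-U is a plain-text format: one token per line with ten tab-separated fields, comment lines starting with #, and a blank line between sentences. The following is a minimal sketch, not part of the repository. The function name read_conllu_pos and the two-token sample are made up for illustration, and the sample is not taken from the dataset. The column layout of gold_data.tsv is not described here, so the sketch does not cover it.

```python
from typing import Iterable, Iterator, List, Tuple

# Illustrative two-token sample in CoNLL-U layout (NOT taken from the dataset).
SAMPLE = (
    "# text = Leyla Qasim\n"
    "1\tLeyla\t_\tPROPN\t_\t_\t_\t_\t_\t_\n"
    "2\tQasim\t_\tPROPN\t_\t_\t_\t_\t_\t_\n"
    "\n"
)


def read_conllu_pos(lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Yield each sentence as a list of (FORM, UPOS) pairs."""
    sentence: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # skip lines that do not have the ten CoNLL-U fields
        token_id, form, upos = cols[0], cols[1], cols[3]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        sentence.append((form, upos))
    if sentence:
        yield sentence  # the file may not end with a blank line


sentences = list(read_conllu_pos(SAMPLE.splitlines()))

# To read the real file, adjust the path to wherever the data folder lives:
# with open("kmr_mg-ud-complete-augmented.conllu", encoding="utf-8") as f:
#     sentences = list(read_conllu_pos(f))
```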
This folder contains seven subfolders holding all our trained POS models. Every subfolder contains two model variants: original and augmented.
- Baseline
- HMM
- ExtraTrees
- AveragedPerceptron
- BiLSTM
- CRF
- NK-XLMR
The BiLSTM and NK-XLMR folders are empty because their models are very large and are hosted on Google Drive instead.
- BiLSTM model variants: Google Drive (5.02 GB). Download the zip file and extract its contents into the BiLSTM subfolder.
- NK-XLMR model variants: Google Drive (2.46 GB). Download the zip file and extract its contents into the NK-XLMR subfolder.
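The extraction step can also be scripted. The following is a rough sketch using only the Python standard library. The helper name extract_models, the archive file name, and the models/ parent folder in the commented usage are assumptions, and the internal layout of the archives is not documented here, so check the extracted contents afterwards.

```python
import zipfile
from pathlib import Path


def extract_models(zip_path, target_dir):
    """Extract a downloaded model archive into target_dir and list what landed there."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # Return the top-level entries so the result can be inspected.
    return sorted(p.name for p in target.iterdir())


# Hypothetical usage; both paths are placeholders to adjust:
# print(extract_models("BiLSTM.zip", "models/BiLSTM"))
```

If an archive turns out to wrap its files in an extra top-level folder, move the contents up one level so that the model variants sit directly inside the subfolder.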
This file provides a command-line interface (CLI) to the models. Call it from the command line with the following arguments:
python pos_cli --pos_model CRF --training_data_type augmented --sentence "Leyla Qasim dixwest dengê kurdan li cîhanê bide bihîstin." --tokenization_method KLPT
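To tag several sentences from a script, the same command can be launched through subprocess. This is a minimal sketch under stated assumptions: the script name and the four flags are copied verbatim from the command above, the helper names build_command and tag_sentence are hypothetical, and the sketch simply returns whatever the script prints, since the output format is not described here.

```python
import subprocess
import sys


def build_command(sentence, pos_model="CRF", training_data_type="augmented",
                  tokenization_method="KLPT"):
    """Build the argument list for the CLI call shown above."""
    return [
        sys.executable, "pos_cli",  # script name as written in the command above
        "--pos_model", pos_model,
        "--training_data_type", training_data_type,
        "--sentence", sentence,
        "--tokenization_method", tokenization_method,
    ]


def tag_sentence(sentence, **options):
    """Run the CLI once and return its raw standard output."""
    result = subprocess.run(
        build_command(sentence, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script exits with an error
    )
    return result.stdout
```

Run from the repository root, tag_sentence("Leyla Qasim dixwest dengê kurdan li cîhanê bide bihîstin.") would issue the same call as the command above. Passing the arguments as a list avoids shell quoting problems with sentences that contain quotes or other special characters.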
An alternative to the CLI for interacting with all POS models. Run this file and then navigate to http://127.0.0.1:5000/
The web interface to interact with all POS models.
- Operating system: macOS / OS X · Linux · Windows
- Python version: Python 3.9.19
Clone the repo and run pip install -r requirements.txt to install all dependencies.
If you use any part of the data or the POS models, please consider citing this paper:
@inproceedings{morad-etal-2024-part,
title = "Part-of-Speech Tagging for {N}orthern {K}urdish",
author = "Morad, Peshmerge and
Ahmadi, Sina and
Gatti, Lorenzo",
editor = "Bhatia, Archna and
Bouma, Gosse and
Dogruoz, A. Seza and
Evang, Kilian and
Garcia, Marcos and
Giouli, Voula and
Han, Lifeng and
Nivre, Joakim and
Rademaker, Alexandre",
booktitle = "Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.mwe-1.11",
pages = "70--80",
abstract = "In the growing domain of natural language processing, low-resourced languages like Northern Kurdish remain largely unexplored due to the lack of resources needed to be part of this growth. In particular, the tasks of part-of-speech tagging and tokenization for Northern Kurdish are still insufficiently addressed. In this study, we aim to bridge this gap by evaluating a range of statistical, neural, and fine-tuned-based models specifically tailored for Northern Kurdish. Leveraging limited but valuable datasets, including the Universal Dependency Kurmanji treebank and a novel manually annotated and tokenized gold-standard dataset consisting of 136 sentences (2,937 tokens). We evaluate several POS tagging models and report that the fine-tuned transformer-based model outperforms others, achieving an accuracy of 0.87 and a macro-averaged F1 score of 0.77. Data and models are publicly available under an open license at https://github.com/peshmerge/northern-kurdish-pos-tagging",
}
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.