We trained a Malayalam language model on the Wikipedia article dump from October 2018, which contained more than 55k articles. The main difficulty in training a Malayalam language model is text tokenization, since Malayalam is a highly inflectional and agglutinative language. The current model uses the nltk tokenizer (we will try better alternatives in the future) with a vocabulary size of 30k. The language model was then used to train a classifier that sorts news articles into 5 categories (India, Kerala, Sports, Business, Entertainment). The classifier reached a whopping 92% accuracy on this classification task.
- Processed Wikipedia dump of articles, split into test and train.
- Script and weights for Malayalam Language model.
- Malayalam text classifier with pretrained weights.
- Inference code for text classifier.
- Pretrained Malayalam News Classifier - to run only the prediction, use this.
- Raw data dump of Malayalam Wikipedia articles: Malayalam Articles
Python >= 3.6
If you are using virtualenvwrapper use the following steps:
git clone https://github.com/adamshamsudeen/Vaaku2Vec.git
mkvirtualenv -p python3.6 venv
workon venv
cd Vaaku2Vec
pip install -r requirements.txt
- Download the pretrained language model folder; it contains the preprocessed test and train CSV files. If you would like to preprocess and retrain the LM on the latest article dump, use the scripts provided here.
- Create tokens:
python lm/create_toks.py <path_to_processed_wiki_dump>
e.g. python lm/create_toks.py /home/adamshamsudeen/mal/Vaaku2Vec/wiki/ml/
- Create a token to id mapping:
python lm/tok2id.py <path_to_processed_wiki_dump>
e.g. python lm/tok2id.py /home/adamshamsudeen/mal/Vaaku2Vec/wiki/ml/
- Train language model:
python lm/pretrain_lm.py <path_to_processed_wiki_dump> 0 --lr 1e-3 --cl 40
e.g. python lm/pretrain_lm.py /home/adamshamsudeen/mal/Vaaku2Vec/wiki/ml/ 0 --lr 1e-3 --cl 40
Here, lr is the learning rate and cl is the number of epochs.
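The internals of lm/tok2id.py are not reproduced here, so the following is only a minimal sketch of what a token-to-id mapping step generally looks like: count token frequencies, keep the most frequent ones up to a cap, and map everything else to an unknown token. The function names, the `_unk_` and `_pad_` special tokens, and the `min_freq` parameter are assumptions made for illustration; only the 30k vocabulary cap comes from the description above.

```python
from collections import Counter


def build_vocab(token_lists, max_vocab=30000, min_freq=1):
    """Return (itos, stoi) covering the most frequent tokens.

    max_vocab=30000 mirrors the vocabulary size quoted in this README;
    the special token names and min_freq are hypothetical.
    """
    freq = Counter(tok for toks in token_lists for tok in toks)
    # Special tokens go first so their ids stay fixed (0 = unknown, 1 = padding).
    itos = ["_unk_", "_pad_"]
    itos += [tok for tok, count in freq.most_common(max_vocab) if count >= min_freq]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


def numericalize(tokens, stoi):
    """Map tokens to ids; tokens outside the vocabulary get id 0 (_unk_)."""
    return [stoi.get(tok, 0) for tok in tokens]


# Placeholder tokens stand in for tokenized Malayalam articles.
docs = [["a", "b", "a"], ["a", "c", "b"]]
itos, stoi = build_vocab(docs, max_vocab=2)
print(itos)                                  # ['_unk_', '_pad_', 'a', 'b']
print(numericalize(["a", "c", "b"], stoi))   # [2, 0, 3]
```

With a capped vocabulary, rare inflected forms fall back to the unknown id, which is one reason tokenization quality matters so much for an agglutinative language.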
- Use train_classifier.ipynb to train a Malayalam text classifier.
- We have not released the news dataset; raise a request if you want to experiment with it.
- To test the classifier trained on Manorama news, download the Pretrained Malayalam Text Classifier mentioned in the downloads.
- Use prediction.ipynb to test it on your own input.
We manually tested the model on news from other leading newspapers, and it performed pretty well.
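To make the five-way prediction concrete, here is a minimal sketch of turning a classifier's raw scores into a category name and a probability. It is not taken from prediction.ipynb: the function names are hypothetical, and the index order of the categories is an assumption (only the category names themselves come from this README).

```python
import math

# Category names are from this README; their index order is an assumption.
CATEGORIES = ["India", "Kerala", "Sports", "Business", "Entertainment"]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_category(scores, categories=CATEGORIES):
    """Return (label, probability) for the highest-scoring category."""
    if len(scores) != len(categories):
        raise ValueError("expected one score per category")
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return categories[best], probs[best]


# Made-up scores, for illustration only.
label, prob = top_category([0.1, 0.2, 3.0, -1.0, 0.5])
print(label)  # Sports
```

Reporting the probability alongside the label makes it easier to spot low-confidence predictions when trying articles from other sources.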
- We also trained a word2vec model using gensim with the Wikipedia dump.
- You can also use the word2vec model to train a text classifier. News Classifier
- You can see the word2vec demo in the link below.
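A word2vec model maps each token to a dense vector, and one common way to explore such a model is to rank other tokens by cosine similarity to a query token. The sketch below shows that ranking in plain Python. The 3-dimensional vectors and the placeholder tokens `w1` to `w4` are made up for illustration and are not taken from the trained model; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, topn=3):
    """Return the topn (token, similarity) pairs closest to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors: placeholders, not real embeddings.
toy_vectors = {
    "w1": [1.0, 0.0, 0.0],
    "w2": [0.9, 0.1, 0.0],
    "w3": [0.0, 1.0, 0.0],
    "w4": [-1.0, 0.0, 0.0],
}
print([tok for tok, _ in most_similar("w1", toy_vectors, topn=2)])  # ['w2', 'w3']
```

With a model trained through gensim, the equivalent query would typically go through `model.wv.most_similar(word)` rather than a hand-written loop.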
- Malayalam language modeling based on Wikipedia articles.
- Release trained language model weights.
- Malayalam Text classifier script.
- Benchmark with mlmorph for tokenization.
- Benchmark with Byte pair encoding for tokenization
- UI to train and test classifier.
- Basic Chatbot using this implementation.
- Special thanks to Sebastian Ruder, Jeremy Howard, and the other contributors to fastai and ULMFiT.
- Logo base design
- Raeesa for designing the logo.
- Kamal K Raj
- Adam Shamsudeen