OPUS - the open parallel corpus
A Dravidian Etymological Dictionary
Byte Pair Encoding - Pretrained for 275 languages
FastText word vectors for 157 languages
Indian Language Technology Proliferation and Deployment Center
Center For Indian Language Technology - CFILT FB page
Indian Institute of Language Studies (IILS)
Central Institute of Indian Languages
Survey: Natural Language Parsing for Indian Languages
mlmorph - Malayalam Morphological Analyzer using Finite State Transducer
Open Tamil - Suite of tools for operating on Tamil text
Tokenizer, language model and classifier for the Tamil language by Ravi Annaswamy
Text classification model in PyTorch: can be easily applied to other datasets; in fact, the linked repository also contains a dataset of film reviews in Tamil.
NLP for Bengali
- Contains a Wikipedia articles dataset (72,374 articles) and the scripts used to scrape Wikipedia and clean that dataset
- Contains a language model with perplexity ~41
- Contains a Bengali news classification model with 94% accuracy
Telugu-NLP - Contains NLP tools developed for Telugu
Research Papers in Bengali NLP
Language | Repository | Perplexity of Language model | Wikipedia Articles Dataset | Classification accuracy | Classification Kappa score |
---|---|---|---|---|---|
Hindi | NLP for Hindi | ~36 | 55,000 articles | ~79 (News Classification) | ~30 (Movie Review Classification) |
Punjabi | NLP for Punjabi | ~13 | 44,000 articles | ~89 (News Classification) | ~60 (News Classification) |
Sanskrit | NLP for Sanskrit | ~6 | 22,273 articles | ~70 (Shloka Classification) | ~56 (Shloka Classification) |
Gujarati | NLP for Gujarati | ~34 | 31,913 articles | ~91 (News Classification) | ~85 (News Classification) |
Kannada | NLP for Kannada | ~70 | 32,997 articles | ~94 (News Classification) | ~90 (News Classification) |
Malayalam | NLP for Malyalam | ~26 | 12,388 articles | ~94 (News Classification) | ~91 (News Classification) |
Nepali | NLP for Nepali | ~32 | 38,757 articles | ~97 (News Classification) | ~96 (News Classification) |
Odia | NLP for Odia | ~27 | 17,781 articles | ~95 (News Classification) | ~92 (News Classification) |
Marathi | NLP for Marathi | ~18 | 85,537 articles | ~91 (News Classification) | ~84 (News Classification) |
Bengali | NLP for Bengali | ~41 | 72,374 articles | ~94 (News Classification) | ~92 (News Classification) |
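
For readers unfamiliar with the two less common metrics in the table, here is a minimal sketch of how language model perplexity and Cohen's kappa are conventionally defined. It is not taken from any of the listed repositories. The function names `perplexity` and `cohen_kappa` are hypothetical, the inputs are toy values, and the sketch assumes the kappa column above is the usual Cohen's kappa reported on a 0-100 scale.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """Perplexity: exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between labels and predictions, corrected for chance."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of items where prediction matches the label.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Expected agreement if labels and predictions were independent.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        # Only one class on both sides: kappa is undefined.
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy usage: a model that assigns probability 0.25 to every token,
# and a 4-item classification run with one mistake.
toy_perplexity = perplexity([math.log(0.25)] * 4)
toy_kappa = cohen_kappa(["a", "a", "b", "b"], ["a", "a", "b", "a"])
```

Lower perplexity means the language model is less surprised by held-out text. Kappa discounts agreement that would occur by chance, which is why it can sit well below accuracy when classes are imbalanced.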