This repo houses a Manipuri multimodal dataset for Natural Language Processing (NLP) applications.
Each entry in the dataset is a tuple consisting of:
- Image
- Caption in English
- Translation of the English text in Manipuri, Bengali, Hindi and German.
- Audio recording of the Manipuri text by native speakers.
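
As a rough illustration of the tuple structure listed above, one entry could be modelled as follows. This is only a sketch: the class name `Sample`, the field names, the language-code keys (`mni`, `bn`, `hi`, `de`), and the file paths are assumptions made for illustration and do not describe the actual layout of the released files.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sample:
    """Hypothetical container for one dataset entry (names are assumptions)."""

    image_path: str                 # path to the image file
    caption_en: str                 # English caption
    # Translations keyed by language code, e.g. "mni", "bn", "hi", "de"
    translations: Dict[str, str] = field(default_factory=dict)
    audio_path: str = ""            # recording of the Manipuri text

    def languages(self) -> List[str]:
        """Return English plus every language that has a translation."""
        return ["en"] + sorted(self.translations)


# Placeholder values only; real entries come from the dataset files.
example = Sample(
    image_path="images/0001.jpg",
    caption_en="An English caption.",
    translations={"mni": "...", "bn": "...", "hi": "...", "de": "..."},
    audio_path="audio/0001.wav",
)
```

Keeping the translations in a dictionary keyed by language code means an entry can be handled the same way whether or not every target language is present.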
Version 1:
The process begins by collecting English text and images from the local newspaper Imphal Free Press using our in-house web scraper. Subsequently, the English text undergoes manual translation into Manipuri, followed by machine translation into Bengali, Hindi, and German.
Version 2:
Version 2 extends Version 1 with Manipuri text and images collected from the local newspaper Huiyen Lanpao. The Manipuri text is manually translated into English, and the resulting English text is then machine-translated into Bengali, Hindi, and German.
This comprehensive approach allows us to leverage both human expertise and automated translation technologies to facilitate multilingual access to the content.
Translation approach:
English to Manipuri and Manipuri to English: Manual Translation + Manual Post-editing
English to Bengali and Hindi: Indic-Trans
English to German: DeepL
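
The mapping above can be written down as a small lookup table, for example when tagging each translation with the method that produced it. This is a minimal sketch: the language codes and the names `TRANSLATION_METHOD` and `method_for` are assumptions for illustration, and the code does not call Indic-Trans or DeepL.

```python
# (source, target) language pair -> how the translation was produced.
# Language codes are illustrative: "mni" = Manipuri, "bn" = Bengali,
# "hi" = Hindi, "de" = German.
MANUAL = "Manual Translation + Manual Post-editing"

TRANSLATION_METHOD = {
    ("en", "mni"): MANUAL,
    ("mni", "en"): MANUAL,
    ("en", "bn"): "Indic-Trans",
    ("en", "hi"): "Indic-Trans",
    ("en", "de"): "DeepL",
}


def method_for(source: str, target: str) -> str:
    """Look up the translation method for a language pair."""
    try:
        return TRANSLATION_METHOD[(source, target)]
    except KeyError:
        raise ValueError(f"No translation method listed for {source} -> {target}")


print(method_for("en", "hi"))
```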
A sample dataset is provided for reference.
Please fill out this form to request access to the data and other supplementary files.