Generated product reviews dataset for machine translation quality prediction, part of scb-mt-en-th-2020
- Dataset Description
- Dataset Structure
- Dataset Creation
- Considerations for Using the Data
- Additional Information
- Homepage: http://airesearch.in.th/
- Repository: https://github.com/vistec-ai/generated_reviews_enth
- Paper: https://arxiv.org/pdf/2007.03541.pdf
- Leaderboard:
- Point of Contact: AIResearch
generated_reviews_enth was created as part of scb-mt-en-th-2020 for the machine translation task. This dataset (referred to as generated_reviews_yn in scb-mt-en-th-2020) consists of English product reviews generated by CTRL, translated by the Google Translate API, and annotated as accepted or rejected (correct) by human annotators based on the fluency and adequacy of the translation. This allows it to be used for English-to-Thai translation quality estimation (binary label), machine translation, and sentiment analysis.
English-to-Thai translation quality estimation (binary label) is the intended use. Other uses include machine translation and sentiment analysis.
English, Thai
{'en_segment': 'Best value in a hard to obtain item. The item was new with the box. I have purchased many times, and will continue to purchase as needed. Thank you very much!', 'th_segment': 'คุ้มค่าที่สุดในการรับไอเท็มยาก ๆ สินค้าเป็นของใหม่พร้อมกล่อง ฉันซื้อหลายครั้งและจะซื้อต่อไปตามความจำเป็น ขอบคุณมาก!', 'review_star': 5, 'correct': 0}
{'en_segment': 'I am an avid Amazon reader since my first Kindle and I love to purchase books, movies, etc. from Amazon. This book was not the typical style of a John Grisham book. It came as a surprise from all the great reviews. The book starts out with no explanation on how these victims are able to tell each other apart. Not very believable. But, like most books by John, this one held my interest.', 'th_segment': 'ฉันเป็นนักอ่านตัวยงของ Amazon ตั้งแต่ Kindle ตัวแรกและฉันชอบซื้อหนังสือภาพยนตร์และอื่น ๆ จาก Amazon หนังสือเล่มนี้ไม่ได้เป็นแบบฉบับของหนังสือ John Grisham มันมาจากความคิดเห็นที่ยอดเยี่ยมทั้งหมด
{'en_segment': 'These things rip so easily.', 'th_segment': 'สิ่งเหล่านี้ฉีกง่ายมาก', 'review_star': 1, 'correct': 1}
- en_segment: English product reviews generated by CTRL
- th_segment: Thai product reviews translated from en_segment by Google Translate API
- review_star: Stars of the generated reviews, put as condition for CTRL
- correct: 1 if the English-to-Thai translation is accepted (correct) based on fluency and adequacy of the translation by human annotators, else 0
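As a minimal sketch of how the four fields fit together, the snippet below checks that a record has the documented fields and value ranges. The helper name `validate_record` is hypothetical and not part of the dataset's tooling; the sample record is copied from the instances shown earlier.

```python
from typing import Any, Dict

# Field names and types follow the field descriptions in this card.
EXPECTED_FIELDS = {
    "en_segment": str,
    "th_segment": str,
    "review_star": int,
    "correct": int,
}

def validate_record(record: Dict[str, Any]) -> bool:
    """Return True if the record matches the documented schema (hypothetical helper)."""
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record or not isinstance(record[name], expected_type):
            return False
    # review_star is the 1-5 star condition given to CTRL; correct is a binary label.
    return 1 <= record["review_star"] <= 5 and record["correct"] in (0, 1)

sample = {
    "en_segment": "These things rip so easily.",
    "th_segment": "สิ่งเหล่านี้ฉีกง่ายมาก",
    "review_star": 1,
    "correct": 1,
}
print(validate_record(sample))
```

A check like this can be run over each split before training to catch rows that fall outside the documented ranges.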
| | train | valid | test |
|---|---|---|---|
| # samples | 141369 | 15708 | 17453 |
| # correct:0 | 99296 | 10936 | 12208 |
| # correct:1 | 42073 | 4772 | 5245 |
| # review_star:1 | 50418 | 5628 | 6225 |
| # review_star:2 | 22876 | 2596 | 2852 |
| # review_star:3 | 22825 | 2521 | 2831 |
| # review_star:4 | 22671 | 2517 | 2778 |
| # review_star:5 | 22579 | 2446 | 2767 |
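The label counts in the statistics table can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers from the table; the dictionary layout and the function name `accepted_share` are illustrative choices. It shows that roughly 30% of pairs in each split are labelled as accepted, so a classifier that always predicts 0 would already reach about 70% accuracy.

```python
# Counts copied from the statistics table.
splits = {
    "train": {"samples": 141369, "correct_0": 99296, "correct_1": 42073},
    "valid": {"samples": 15708, "correct_0": 10936, "correct_1": 4772},
    "test": {"samples": 17453, "correct_0": 12208, "correct_1": 5245},
}

def accepted_share(counts):
    """Fraction of pairs labelled correct=1 in a split."""
    return counts["correct_1"] / counts["samples"]

for name, counts in splits.items():
    # The two label counts should add up to the split size.
    assert counts["correct_0"] + counts["correct_1"] == counts["samples"]
    majority = max(counts["correct_0"], counts["correct_1"]) / counts["samples"]
    print(f"{name}: accepted={accepted_share(counts):.3f}, majority baseline={majority:.3f}")
```

When evaluating a quality estimation model on this data, the majority baseline is a useful reference point alongside plain accuracy.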
The data generation process is as follows:
- en_segment is generated using conditional generation of CTRL, stating a star review for each generated product review.
- th_segment is translated from en_segment using Google Translate API.
- correct is annotated as accepted or rejected (1 or 0) based on fluency and adequacy of the translation by human annotators.
For this dataset, which targets the translation quality estimation task, we apply the following preprocessing:
- Drop duplicates on en_segment, th_segment, review_star, correct; duplicates might exist because the translation checking is done by annotators.
- Remove reviews that are not between 1-5 stars.
- Remove reviews whose correct are not 0 or 1.
- Deduplicate on en_segment, which contains the source sentences.
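The preprocessing steps above can be sketched in plain Python as follows. This is an illustration, not the authors' actual script: the function name `preprocess` and the toy rows are hypothetical, and keeping the first occurrence when deduplicating is an assumption, since the source does not say which duplicate is retained.

```python
from typing import Dict, List

def preprocess(rows: List[Dict]) -> List[Dict]:
    """Apply the four filtering steps in order (illustrative only)."""
    keys = ("en_segment", "th_segment", "review_star", "correct")
    seen_rows = set()
    seen_sources = set()
    kept = []
    for row in rows:
        signature = tuple(row[k] for k in keys)
        # Step 1: drop exact duplicates across all four fields.
        if signature in seen_rows:
            continue
        seen_rows.add(signature)
        # Step 2: keep only reviews with 1-5 stars.
        if row["review_star"] not in (1, 2, 3, 4, 5):
            continue
        # Step 3: keep only rows with a binary label.
        if row["correct"] not in (0, 1):
            continue
        # Step 4: keep the first remaining row per English source segment
        # (assumption: first occurrence wins).
        if row["en_segment"] in seen_sources:
            continue
        seen_sources.add(row["en_segment"])
        kept.append(row)
    return kept

# Toy rows invented for this example.
rows = [
    {"en_segment": "Good", "th_segment": "ดี", "review_star": 5, "correct": 1},
    {"en_segment": "Good", "th_segment": "ดี", "review_star": 5, "correct": 1},
    {"en_segment": "Good", "th_segment": "ดีมาก", "review_star": 5, "correct": 0},
    {"en_segment": "Bad", "th_segment": "แย่", "review_star": 0, "correct": 1},
    {"en_segment": "Okay", "th_segment": "โอเค", "review_star": 3, "correct": 2},
    {"en_segment": "Fine", "th_segment": "ดี", "review_star": 4, "correct": 0},
]
cleaned = preprocess(rows)
print([r["en_segment"] for r in cleaned])
```

In the toy input, the exact duplicate, the second translation of the same source, the zero-star row, and the row with label 2 are all dropped, leaving the "Good" and "Fine" rows.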
Annotators are given English and Thai product review pairs. They are asked to label each pair as an acceptable translation or not, based on the fluency and adequacy of the translation.
Human annotators of Hope Data Annotations hired by AIResearch.in.th
The authors do not expect any personal or sensitive information to be in the generated product reviews, but they could slip through from pretraining of CTRL.
- English-Thai translation quality estimation for machine translation
- Product review classification for Thai
[More Information Needed]
Due to annotation process constraints, the number of one-star reviews is notably higher than that of reviews with other star ratings. This makes the dataset slightly imbalanced.
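To make the imbalance concrete, the sketch below computes each star rating's share of the train split and a set of inverse-frequency class weights. The counts are read from the statistics table in order from one to five stars. Inverse-frequency weighting is a common general-purpose remedy and is shown here only as an option; the source does not state that the authors used it, and the function name is hypothetical.

```python
# Train-split review counts per star rating, from the statistics table.
train_star_counts = {1: 50418, 2: 22876, 3: 22825, 4: 22671, 5: 22579}

def inverse_frequency_weights(counts):
    """Weight each class by total / (num_classes * class_count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}

total = sum(train_star_counts.values())
weights = inverse_frequency_weights(train_star_counts)
for star, weight in weights.items():
    share = train_star_counts[star] / total
    print(f"{star} star: share={share:.3f}, weight={weight:.3f}")
```

One-star reviews make up roughly 36% of the train split, while each of the other ratings accounts for about 16%, so the one-star class receives a weight below 1 and the others a weight above 1.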
The dataset was created by AIResearch.in.th
CC BY-SA 4.0
@article{lowphansirikul2020scb,
title={scb-mt-en-th-2020: A Large English-Thai Parallel Corpus},
author={Lowphansirikul, Lalita and Polpanumas, Charin and Rutherford, Attapol T and Nutanong, Sarana},
journal={arXiv preprint arXiv:2007.03541},
year={2020}
}