Generated product reviews dataset for machine translation quality estimation, part of [scb-mt-en-th-2020](https://arxiv.org/pdf/2007.03541.pdf)

vistec-AI/generated_reviews_enth

generated_reviews_enth

Generated product reviews dataset for machine translation quality prediction, part of scb-mt-en-th-2020

Dataset Description

Dataset Summary

generated_reviews_enth was created as part of scb-mt-en-th-2020 for machine translation tasks. The dataset (referred to as generated_reviews_yn in scb-mt-en-th-2020) consists of English product reviews generated by CTRL, translated into Thai by the Google Translate API, and annotated by human annotators as accepted or rejected (the correct field) based on the fluency and adequacy of the translation. It can therefore be used for English-to-Thai translation quality estimation (binary label), machine translation, and sentiment analysis.

Supported Tasks and Leaderboards

English-to-Thai translation quality estimation (binary label) is the intended use. Other uses include machine translation and sentiment analysis.

Languages

English, Thai

Dataset Structure

Data Instances

{'en_segment': 'Best value in a hard to obtain item. The item was new with the box. I have purchased many times, and will continue to purchase as needed. Thank you very much!', 'th_segment': 'คุ้มค่าที่สุดในการรับไอเท็มยาก ๆ สินค้าเป็นของใหม่พร้อมกล่อง ฉันซื้อหลายครั้งและจะซื้อต่อไปตามความจำเป็น ขอบคุณมาก!', 'review_star': 5, 'correct': 0}

{'en_segment': 'I am an avid Amazon reader since my first Kindle and I love to purchase books, movies, etc. from Amazon. This book was not the typical style of a John Grisham book. It came as a surprise from all the great reviews. The book starts out with no explanation on how these victims are able to tell each other apart. Not very believable. But, like most books by John, this one held my interest.', 'th_segment': 'ฉันเป็นนักอ่านตัวยงของ Amazon ตั้งแต่ Kindle ตัวแรกและฉันชอบซื้อหนังสือภาพยนตร์และอื่น ๆ จาก Amazon หนังสือเล่มนี้ไม่ได้เป็นแบบฉบับของหนังสือ John Grisham มันมาจากความคิดเห็นที่ยอดเยี่ยมทั้งหมด 

{'en_segment': 'These things rip so easily.', 'th_segment': 'สิ่งเหล่านี้ฉีกง่ายมาก', 'review_star': 1, 'correct': 1}

Data Fields

  • en_segment: English product reviews generated by CTRL
  • th_segment: Thai product reviews translated from en_segment by Google Translate API
  • review_star: Star rating of the generated review, used as the conditioning input for CTRL
  • correct: 1 if human annotators accepted the English-to-Thai translation as correct based on its fluency and adequacy, otherwise 0
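As a minimal sketch of how these fields might be checked in practice, the function below validates a single record against the field descriptions. The name `validate_record` is hypothetical and is not part of the dataset's tooling; the 1 to 5 star range and the 0/1 label values are taken from the preprocessing rules described in this README.

```python
def validate_record(record):
    """Return a list of problems found in one record (empty if it looks valid)."""
    problems = []
    # Both segments are expected to be non-empty strings.
    for key in ("en_segment", "th_segment"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key} must be a non-empty string")
    # review_star is the star rating used as the CTRL condition (1 to 5).
    star = record.get("review_star")
    if not isinstance(star, int) or isinstance(star, bool) or not 1 <= star <= 5:
        problems.append("review_star must be an integer from 1 to 5")
    # correct is the binary human judgement: 1 = accepted, 0 = rejected.
    correct = record.get("correct")
    if isinstance(correct, bool) or correct not in (0, 1):
        problems.append("correct must be 0 or 1")
    return problems


# Example record copied from the Data Instances section.
example = {
    "en_segment": "These things rip so easily.",
    "th_segment": "สิ่งเหล่านี้ฉีกง่ายมาก",
    "review_star": 1,
    "correct": 1,
}
print(validate_record(example))
```

A record that fails a check is reported rather than raised on, so a caller can decide whether to drop or inspect it.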

Data Splits

                   train     valid    test
# samples          141369    15708    17453
# correct:0        99296     10936    12208
# correct:1        42073     4772     5245
# review_star:1    50418     5628     6225
# review_star:2    22876     2596     2852
# review_star:3    22825     2521     2831
# review_star:4    22671     2517     2778
# review_star:5    22579     2446     2767
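To make the label balance concrete, the following sketch computes the share of accepted translations in each split and the accuracy of a trivial majority-class predictor. The counts are copied from the table; `SPLITS` and `label_shares` are illustrative names, and the majority-class baseline is an added reference point, not a result reported by the authors.

```python
# Counts copied from the Data Splits table.
SPLITS = {
    "train": {"correct_0": 99296, "correct_1": 42073},
    "valid": {"correct_0": 10936, "correct_1": 4772},
    "test": {"correct_0": 12208, "correct_1": 5245},
}


def label_shares(counts):
    """Return the fraction of samples carrying each label."""
    total = counts["correct_0"] + counts["correct_1"]
    return {label: n / total for label, n in counts.items()}


for name, counts in SPLITS.items():
    shares = label_shares(counts)
    # Always predicting the more frequent label gives this accuracy.
    majority = max(shares.values())
    print(f"{name}: accepted={shares['correct_1']:.3f}, "
          f"majority-class accuracy={majority:.3f}")
```

Roughly 30% of pairs are accepted in every split, so a model that always predicts 0 already reaches about 70% accuracy. Any quality estimation model should be compared against that floor.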

Dataset Creation

Curation Rationale

generated_reviews_enth was created as part of scb-mt-en-th-2020 for machine translation tasks. The dataset (referred to as generated_reviews_yn in scb-mt-en-th-2020) consists of English product reviews generated by CTRL, translated into Thai by the Google Translate API, and annotated by human annotators as accepted or rejected (the correct field) based on the fluency and adequacy of the translation. It can therefore be used for English-to-Thai translation quality estimation (binary label), machine translation, and sentiment analysis.

Source Data

Initial Data Collection and Normalization

The data generation process is as follows:

  • en_segment is generated with CTRL's conditional generation, conditioned on a star rating for each generated product review.
  • th_segment is translated from en_segment using the Google Translate API.
  • correct is annotated by human annotators as accepted or rejected (1 or 0) based on the fluency and adequacy of the translation.

For this specific dataset, intended for the translation quality estimation task, we apply the following preprocessing:

  • Drop duplicates on en_segment, th_segment, review_star, and correct; duplicates might exist because the translation checking is done by annotators.
  • Remove reviews that are not between 1 and 5 stars.
  • Remove reviews whose correct value is not 0 or 1.
  • Deduplicate on en_segment, which contains the source sentences.
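The preprocessing steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' actual pipeline: `preprocess` is a hypothetical name, rows are assumed to be dictionaries with the four documented fields, and keeping the first occurrence when deduplicating is an assumption, since the README does not say which duplicate is retained.

```python
def preprocess(rows):
    """Apply the four filtering steps to a list of record dicts."""
    fields = ("en_segment", "th_segment", "review_star", "correct")
    seen_rows, seen_sources, kept = set(), set(), []
    for row in rows:
        key = tuple(row[f] for f in fields)
        # Step 1: drop exact duplicates across all four fields.
        if key in seen_rows:
            continue
        seen_rows.add(key)
        # Step 2: keep only reviews with 1 to 5 stars.
        if row["review_star"] not in (1, 2, 3, 4, 5):
            continue
        # Step 3: keep only labels that are 0 or 1.
        if row["correct"] not in (0, 1):
            continue
        # Step 4: deduplicate on the source sentence.
        # Assumption: the first surviving occurrence is the one kept.
        if row["en_segment"] in seen_sources:
            continue
        seen_sources.add(row["en_segment"])
        kept.append(row)
    return kept


# Hypothetical toy rows, not taken from the dataset.
rows = [
    {"en_segment": "a", "th_segment": "x", "review_star": 1, "correct": 1},
    {"en_segment": "a", "th_segment": "x", "review_star": 1, "correct": 1},
    {"en_segment": "a", "th_segment": "x", "review_star": 1, "correct": 0},
    {"en_segment": "b", "th_segment": "y", "review_star": 7, "correct": 1},
    {"en_segment": "c", "th_segment": "z", "review_star": 3, "correct": 2},
    {"en_segment": "d", "th_segment": "w", "review_star": 5, "correct": 0},
]
print(len(preprocess(rows)))
```

Because a source sentence is only recorded once it has passed the star and label filters, the single pass gives the same result as running the four steps one after another.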

Who are the source language producers?

CTRL

Annotations

Annotation process

Annotators are given English and Thai product review pairs. They are asked to label the pair as acceptable translation or not based on fluency and adequacy of the translation.

Who are the annotators?

Human annotators of Hope Data Annotations hired by AIResearch.in.th

Personal and Sensitive Information

The authors do not expect any personal or sensitive information to be in the generated product reviews, but such information could slip through from CTRL's pretraining data.

Considerations for Using the Data

Social Impact of Dataset

  • English-Thai translation quality estimation for machine translation
  • Product review classification for Thai

Discussion of Biases

[More Information Needed]

Other Known Limitations

Due to annotation process constraints, the number of one-star reviews is notably higher than that of reviews with any other star rating. This makes the dataset slightly imbalanced.
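To quantify this imbalance, the sketch below computes the share of each star rating in the train split and a set of inverse-frequency class weights. The counts come from the train column of the Data Splits table, reading the review_star rows in order from one to five stars. The weighting formula (total / (number of classes × class count)) is a common heuristic offered here as one option for sentiment analysis; it is not something the authors describe using.

```python
# Train-split counts per star rating, from the Data Splits table.
TRAIN_STAR_COUNTS = {1: 50418, 2: 22876, 3: 22825, 4: 22671, 5: 22579}


def star_shares(counts):
    """Fraction of reviews carrying each star rating."""
    total = sum(counts.values())
    return {star: n / total for star, n in counts.items()}


def inverse_frequency_weights(counts):
    """Weights that are larger for rarer classes and equal 1.0 when balanced."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {star: total / (n_classes * n) for star, n in counts.items()}


shares = star_shares(TRAIN_STAR_COUNTS)
weights = inverse_frequency_weights(TRAIN_STAR_COUNTS)
for star in sorted(TRAIN_STAR_COUNTS):
    print(f"{star} star: share={shares[star]:.3f}, weight={weights[star]:.3f}")
```

One-star reviews make up about 36% of the train split, while each other rating accounts for roughly 16%, so the one-star class receives the smallest weight.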

Additional Information

Dataset Curators

The dataset was created by AIResearch.in.th

Licensing Information

CC BY-SA 4.0

Citation Information

@article{lowphansirikul2020scb,
  title={scb-mt-en-th-2020: A Large English-Thai Parallel Corpus},
  author={Lowphansirikul, Lalita and Polpanumas, Charin and Rutherford, Attapol T and Nutanong, Sarana},
  journal={arXiv preprint arXiv:2007.03541},
  year={2020}
}
