`generated_reviews_enth`

Generated product reviews dataset for machine translation quality prediction, part of scb-mt-en-th-2020

Dataset Description

Homepage: http://airesearch.in.th/
Repository: https://github.com/vistec-ai/generated_reviews_enth
Paper: https://arxiv.org/pdf/2007.03541.pdf
Leaderboard:
Point of Contact: AIResearch

Dataset Summary

generated_reviews_enth was created as part of scb-mt-en-th-2020 for the machine translation task. The dataset (referred to as generated_reviews_yn in scb-mt-en-th-2020) consists of English product reviews generated by CTRL, translated into Thai by the Google Translate API, and annotated by human annotators as accepted or rejected (the correct field) based on the fluency and adequacy of the translation. It can therefore be used for English-to-Thai translation quality estimation (binary label), machine translation, and sentiment analysis.

Supported Tasks and Leaderboards

English-to-Thai translation quality estimation (binary label) is the intended use. Other uses include machine translation and sentiment analysis.

Languages

English, Thai

Dataset Structure

Data Instances

{'en_segment': 'Best value in a hard to obtain item. The item was new with the box. I have purchased many times, and will continue to purchase as needed. Thank you very much!', 'th_segment': 'คุ้มค่าที่สุดในการรับไอเท็มยาก ๆ สินค้าเป็นของใหม่พร้อมกล่อง ฉันซื้อหลายครั้งและจะซื้อต่อไปตามความจำเป็น ขอบคุณมาก!', 'review_star': 5, 'correct': 0}

{'en_segment': 'I am an avid Amazon reader since my first Kindle and I love to purchase books, movies, etc. from Amazon. This book was not the typical style of a John Grisham book. It came as a surprise from all the great reviews. The book starts out with no explanation on how these victims are able to tell each other apart. Not very believable. But, like most books by John, this one held my interest.', 'th_segment': 'ฉันเป็นนักอ่านตัวยงของ Amazon ตั้งแต่ Kindle ตัวแรกและฉันชอบซื้อหนังสือภาพยนตร์และอื่น ๆ จาก Amazon หนังสือเล่มนี้ไม่ได้เป็นแบบฉบับของหนังสือ John Grisham มันมาจากความคิดเห็นที่ยอดเยี่ยมทั้งหมด 

{'en_segment': 'These things rip so easily.', 'th_segment': 'สิ่งเหล่านี้ฉีกง่ายมาก', 'review_star': 1, 'correct': 1}

Data Fields

en_segment: English product reviews generated by CTRL
th_segment: Thai product reviews translated from en_segment by Google Translate API
review_star: Star rating of the generated review, used as the condition for CTRL
correct: 1 if the English-to-Thai translation is accepted (correct) based on fluency and adequacy of the translation by human annotators else 0
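
As a minimal sketch of how these four fields could be represented and checked in Python, the snippet below uses a hypothetical `ReviewPair` dataclass and `parse_record` helper (neither name comes from the dataset or its repository). The ranges it enforces follow the field descriptions above and the 1-5 star and 0/1 label constraints listed in the preprocessing steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewPair:
    """One record; the class name is illustrative, not part of the dataset."""
    en_segment: str   # English review generated by CTRL
    th_segment: str   # Thai translation from Google Translate API
    review_star: int  # star rating used as the CTRL condition (1-5)
    correct: int      # 1 = translation accepted by annotators, 0 = rejected


def parse_record(raw: dict) -> ReviewPair:
    """Build a ReviewPair from a raw dict, rejecting out-of-range values."""
    record = ReviewPair(
        en_segment=str(raw["en_segment"]),
        th_segment=str(raw["th_segment"]),
        review_star=int(raw["review_star"]),
        correct=int(raw["correct"]),
    )
    if not 1 <= record.review_star <= 5:
        raise ValueError(f"review_star out of range: {record.review_star}")
    if record.correct not in (0, 1):
        raise ValueError(f"correct must be 0 or 1: {record.correct}")
    return record


# The short instance shown under Data Instances.
example = parse_record({
    "en_segment": "These things rip so easily.",
    "th_segment": "สิ่งเหล่านี้ฉีกง่ายมาก",
    "review_star": 1,
    "correct": 1,
})
print(example.review_star, example.correct)
```

Freezing the dataclass keeps a parsed record immutable, which is convenient when records are later used as dictionary keys or set members.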

Data Splits

	train	valid	test
# samples	141369	15708	17453
# correct:0	99296	10936	12208
# correct:1	42073	4772	5245
# review_star:1	50418	5628	6225
# review_star:2	22876	2596	2852
# review_star:3	22825	2521	2831
# review_star:4	22671	2517	2778
# review_star:5	22579	2446	2767
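
The counts in the table can be cross-checked with a few lines of arithmetic. The sketch below copies the numbers from the table into a hypothetical `SPLITS` dictionary, assuming the five star rows correspond to stars 1 through 5 in order, and verifies that the label counts and the star counts each add up to the split size. It also computes the share of accepted translations, which comes out at roughly 30% in every split.

```python
# Counts copied from the Data Splits table. The five star rows are assumed
# to be stars 1 through 5 in order. "correct" is (rejected, accepted).
SPLITS = {
    "train": {"samples": 141369, "correct": (99296, 42073),
              "stars": (50418, 22876, 22825, 22671, 22579)},
    "valid": {"samples": 15708, "correct": (10936, 4772),
              "stars": (5628, 2596, 2521, 2517, 2446)},
    "test": {"samples": 17453, "correct": (12208, 5245),
             "stars": (6225, 2852, 2831, 2778, 2767)},
}


def is_consistent(split: str) -> bool:
    """True if label counts and star counts both sum to the split size."""
    info = SPLITS[split]
    return (sum(info["correct"]) == info["samples"]
            and sum(info["stars"]) == info["samples"])


def accepted_share(split: str) -> float:
    """Fraction of pairs labeled correct == 1 in the given split."""
    info = SPLITS[split]
    return info["correct"][1] / info["samples"]


for name in SPLITS:
    print(name, is_consistent(name), round(accepted_share(name), 3))
```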

Dataset Creation

Curation Rationale

generated_reviews_enth was created as part of scb-mt-en-th-2020 for the machine translation task. The dataset (referred to as generated_reviews_yn in scb-mt-en-th-2020) consists of English product reviews generated by CTRL, translated into Thai by the Google Translate API, and annotated by human annotators as accepted or rejected (the correct field) based on the fluency and adequacy of the translation. It can therefore be used for English-to-Thai translation quality estimation (binary label), machine translation, and sentiment analysis.

Source Data

Initial Data Collection and Normalization

The data generation process is as follows:

en_segment is generated with the conditional generation of CTRL, with a star rating specified as the condition for each generated product review.
th_segment is translated from en_segment using the Google Translate API.
correct is annotated as accepted or rejected (1 or 0) by human annotators, based on the fluency and adequacy of the translation.

For this dataset, which targets the translation quality estimation task, we apply the following preprocessing:

Drop duplicates on en_segment, th_segment, review_star, correct; duplicates might exist because the translation checking is done by annotators.
Remove reviews whose review_star is not between 1 and 5.
Remove reviews whose correct value is not 0 or 1.
Deduplicate on en_segment, which contains the source sentences.
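
The four preprocessing steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' actual script: the `preprocess` function name and the sample rows (including the placeholder Thai strings) are made up, and keeping the first occurrence when deduplicating is an assumption, since the text does not say which duplicate is retained.

```python
def preprocess(rows: list[dict]) -> list[dict]:
    """Apply the four cleaning steps in order; keeps first occurrences (assumed)."""
    # Step 1: drop exact duplicates on all four fields.
    seen_rows = set()
    unique = []
    for row in rows:
        key = (row["en_segment"], row["th_segment"],
               row["review_star"], row["correct"])
        if key not in seen_rows:
            seen_rows.add(key)
            unique.append(row)

    # Step 2: keep only reviews rated 1 to 5 stars.
    unique = [r for r in unique if r["review_star"] in {1, 2, 3, 4, 5}]

    # Step 3: keep only reviews whose label is 0 or 1.
    unique = [r for r in unique if r["correct"] in {0, 1}]

    # Step 4: deduplicate on the source sentence alone.
    seen_sources = set()
    cleaned = []
    for row in unique:
        if row["en_segment"] not in seen_sources:
            seen_sources.add(row["en_segment"])
            cleaned.append(row)
    return cleaned


# Made-up rows; the th_segment values are placeholders, not real translations.
raw = [
    {"en_segment": "Great value.", "th_segment": "<thai A>", "review_star": 5, "correct": 1},
    {"en_segment": "Great value.", "th_segment": "<thai A>", "review_star": 5, "correct": 1},
    {"en_segment": "Great value.", "th_segment": "<thai B>", "review_star": 5, "correct": 0},
    {"en_segment": "Broke in a day.", "th_segment": "<thai C>", "review_star": 0, "correct": 0},
    {"en_segment": "Too small.", "th_segment": "<thai D>", "review_star": 2, "correct": 2},
    {"en_segment": "Arrived late.", "th_segment": "<thai E>", "review_star": 1, "correct": 1},
]
cleaned = preprocess(raw)
print([r["en_segment"] for r in cleaned])
```

Running the validity filters before the final deduplication means a source sentence is not discarded just because an earlier copy of it carried an invalid star rating or label.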

Who are the source language producers?

CTRL

Annotations

Annotation process

Annotators are given pairs of English and Thai product reviews. They are asked to label each pair as an acceptable translation or not, based on the fluency and adequacy of the translation.

Who are the annotators?

Human annotators of Hope Data Annotations hired by AIResearch.in.th

Personal and Sensitive Information

The authors do not expect any personal or sensitive information to be in the generated product reviews, but such information could slip through from the pretraining data of CTRL.

Considerations for Using the Data

Social Impact of Dataset

English-Thai translation quality estimation for machine translation
Product review classification for Thai

Discussion of Biases

[More Information Needed]

Other Known Limitations

Due to annotation process constraints, the number of one-star reviews is notably higher than that of any other star rating. This makes the dataset slightly imbalanced with respect to review_star.
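
If review_star is used as a classification target, one common way to compensate for this imbalance is inverse-frequency class weighting. The sketch below is not something the authors describe; it applies the usual "balanced" heuristic, weight = n_samples / (n_classes * class_count), to the training counts from the Data Splits table, assuming the five star rows there are stars 1 through 5 in order. The name `balanced_weights` is hypothetical.

```python
# Training-split star counts from the Data Splits table (stars 1-5 in order,
# which is an assumption about the row labels).
TRAIN_STAR_COUNTS = {1: 50418, 2: 22876, 3: 22825, 4: 22671, 5: 22579}


def balanced_weights(counts: dict[int, int]) -> dict[int, float]:
    """Inverse-frequency weights: total / (number of classes * class count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count)
            for label, count in counts.items()}


weights = balanced_weights(TRAIN_STAR_COUNTS)
for star, weight in sorted(weights.items()):
    print(star, round(weight, 3))
```

With these weights the over-represented one-star class receives a weight below 1 and the other four classes receive weights above 1, so each class contributes equally to a weighted loss in aggregate.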

Additional Information

Dataset Curators

The dataset was created by AIResearch.in.th

Licensing Information

CC BY-SA 4.0

Citation Information

@article{lowphansirikul2020scb,
  title={scb-mt-en-th-2020: A Large English-Thai Parallel Corpus},
  author={Lowphansirikul, Lalita and Polpanumas, Charin and Rutherford, Attapol T and Nutanong, Sarana},
  journal={arXiv preprint arXiv:2007.03541},
  year={2020}
}
