AttrPrompt

This repo contains the code and datasets used in the paper "Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias", which will appear at NeurIPS 2023 (Datasets and Benchmarks Track). It also provides a framework for developing and evaluating your own training data generation pipelines with large language models.

Framework

Dataset

Generated Datasets

The datasets, including the original train/validation/test data, the generated training data, and the label names, are available on the Hugging Face Dataset Hub:

Dataset	# Train	# Test	# Class	Task	Domain	Link
NYT	9k	1.15k	26	Multiclass	News	nyt-attrprompt
Amazon	13.8k	1.1k	23	Multiclass	Review	amazon-attrprompt
Reddit	27k	2.3k	45	Multiclass	Social Media	reddit-attrprompt
StackExchange	27k	2.5k	50	Multiclass	Web Forum	stackexchange-attrprompt
arXiv	26.1k	27.8k	98	Multilabel	Paper	arxiv-attrprompt

We also provide the generated datasets for AG News, SST-2/IMDB, and Yelp, which are studied in the Appendix. Details are listed below:

Dataset	# Train	# Test	# Class	Task	Domain	Link
AG News	6k	7.6k	4	Multiclass	News	agnews-attrprompt
SST-2	6k	0.8k	2	Multiclass	Movie Review	SST-2-attrprompt
Yelp	6k	38k	2	Multiclass	Restaurant Review	yelp-attrprompt

Load Datasets

To load the original train/valid/test sets from the Hugging Face Hub, use the following commands (the NYT dataset serves as the example here and in the rest of this section):

from datasets import load_dataset

train = load_dataset("yyu/nyt-attrprompt", split="train")
valid = load_dataset("yyu/nyt-attrprompt", split="valid")
test = load_dataset("yyu/nyt-attrprompt", split="test")
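The same pattern can be wrapped in a small helper so that all three splits of any dataset in the tables above are loaded in one call. This is a minimal sketch, not part of the repo: `hub_repo_id` and `load_original_splits` are hypothetical names, and the code assumes that every link in the tables lives under the same `yyu/` namespace as the NYT example and exposes the same `train`, `valid`, and `test` splits.

```python
# Link names copied from the two dataset tables above.
LINK_NAMES = [
    "nyt-attrprompt",
    "amazon-attrprompt",
    "reddit-attrprompt",
    "stackexchange-attrprompt",
    "arxiv-attrprompt",
    "agnews-attrprompt",
    "SST-2-attrprompt",
    "yelp-attrprompt",
]


def hub_repo_id(link_name, namespace="yyu"):
    """Build a Hub repository ID (hypothetical helper).

    Assumption: all datasets share the namespace used in the NYT example.
    """
    if link_name not in LINK_NAMES:
        raise ValueError(f"unknown dataset link: {link_name!r}")
    return f"{namespace}/{link_name}"


def load_original_splits(link_name):
    """Return a dict with the train/valid/test splits of one dataset.

    Assumption: every dataset uses the same split names as NYT.
    """
    # Imported lazily so the ID helper works without the library installed.
    from datasets import load_dataset

    repo = hub_repo_id(link_name)
    return {
        split: load_dataset(repo, split=split)
        for split in ("train", "valid", "test")
    }
```

For example, `load_original_splits("nyt-attrprompt")["train"]` should correspond to the `train` object loaded above, provided the assumptions hold.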

To load the training data generated by attrprompt, simprompt, progen, regen, and regen_llm_augmented, use the following commands:

from datasets import load_dataset

attrprompt = load_dataset("yyu/nyt-attrprompt", data_files="attrprompt-v1.jsonl", split="train")
simprompt = load_dataset("yyu/nyt-attrprompt", data_files="simprompt.jsonl", split="train")
progen = load_dataset("yyu/nyt-attrprompt", data_files="progen.jsonl", split="train")
regen = load_dataset("yyu/nyt-simprompt", data_files="regen.jsonl", split="train")
regen_llm_augmented = load_dataset("yyu/nyt-simprompt", data_files="regen_llm_augmented.jsonl", split="train")
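When comparing several generation methods, it can be convenient to select the generated set by name. The sketch below is only an illustration: `NYT_GENERATED_FILES` and `load_generated` are hypothetical names, and the repository and file names are copied verbatim from the commands above, including the different repository used for the two regen files. It covers NYT only; whether other datasets use identical file names is not confirmed here.

```python
# Method name -> (Hub repository, data file), copied from the commands above.
NYT_GENERATED_FILES = {
    "attrprompt": ("yyu/nyt-attrprompt", "attrprompt-v1.jsonl"),
    "simprompt": ("yyu/nyt-attrprompt", "simprompt.jsonl"),
    "progen": ("yyu/nyt-attrprompt", "progen.jsonl"),
    "regen": ("yyu/nyt-simprompt", "regen.jsonl"),
    "regen_llm_augmented": ("yyu/nyt-simprompt", "regen_llm_augmented.jsonl"),
}


def load_generated(method):
    """Load one generated NYT training set by method name (hypothetical helper)."""
    if method not in NYT_GENERATED_FILES:
        known = ", ".join(sorted(NYT_GENERATED_FILES))
        raise ValueError(f"unknown method {method!r}; expected one of: {known}")

    # Imported lazily so the mapping can be inspected without the library installed.
    from datasets import load_dataset

    repo, data_file = NYT_GENERATED_FILES[method]
    # A single JSONL file passed via data_files is exposed as the "train" split.
    return load_dataset(repo, data_files=data_file, split="train")
```

Calling `load_generated("attrprompt")` is then equivalent to the first command above.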

Dataset Attributes

Please see the subfolders in the ./datasets directory for attribute information.

Code for Training Data Generation

See gen_train_data for details.

Code for Classifier Training

See train_classifier for details.

Questions?

Feel free to contact yueyu at gatech.edu with any questions regarding this repo. Please describe the problem in detail so we can help you better and faster!

Citation

If you find this repository helpful, please kindly consider citing the corresponding paper. Thanks in advance!

@inproceedings{yu2023large,
  title={Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias},
  author={Yu, Yue and Zhuang, Yuchen and Zhang, Jieyu and Meng, Yu and Ratner, Alexander and Krishna, Ranjay and Shen, Jiaming and Zhang, Chao},
  booktitle={Thirty-Seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2023}
}

