This repository contains the code for our paper "Self-Guided Noise-Free Data Generation for Efficient Zero-Shot Learning" (SunGen).
Data generation
For data generation via a PLM, the implementation is built on the source code of ZeroGen. For the movie review sentiment classification tasks (IMDb, SST-2, Rotten Tomatoes), we use the same prompts as ZeroGen. For the other tasks, we provide the detailed prompts in this repository under ./tasks/.
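For orientation only, the sketch below shows the kind of content such a task file might hold. The field names here ("task_name", "labels", "prompt") are hypothetical placeholders, not the repository's actual schema; please consult the real JSON files under ./tasks/ for the exact format.

{
  "task_name": "yelp",
  "labels": ["negative", "positive"],
  "prompt": "The restaurant review in <label> sentiment is: \""
}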
We provide sample commands for Yelp data generation:
(1) Generate restaurant names
python main.py --reload_model --task_file tasks/yelp/yelp-x1.json --input_file_type plain --output_dir yelp/output/yelp-x1-gen/ --model_name gpt2-xl --small_model_name distilbert-base-uncased --min_length 1 --max_length 5 --top_k 0 --top_p 0.9 --decay_constant 200 --batch_size 2048 --train_batch_size 32 --learning_rate 2e-5 --num_entries_per_input 500000
(2) Generate restaurant reviews given the restaurant names
python main.py --reload_model --task_file tasks/yelp/yelp-x2.json --output_dir yelp/output/yelp-x1/ --input_file_type plain --input_file tasks/subj/res_names.txt --model_name gpt2-xl --small_model_name distilbert-base-uncased --min_length 10 --max_length 100 --top_k 0 --top_p 0.9 --decay_constant 200 --batch_size 180 --train_batch_size 32 --learning_rate 2e-5 --num_entries_per_input 1000000
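For readers who want to see the decoding settings in code, below is a minimal, self-contained sketch of what the prompted generation roughly corresponds to with Hugging Face transformers. It is not the repository's main.py (flags such as --decay_constant are not replicated), and the prompt string is an illustrative assumption; the real prompts live in tasks/yelp/*.json.

# Minimal sketch of prompted generation with nucleus sampling (top_k=0, top_p=0.9),
# mirroring the sampling flags in the commands above. NOT the repository's main.py.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

# Illustrative prompt; the actual task prompts are defined in the task JSON files.
prompt = "The restaurant review in positive sentiment is: \""
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,
    top_k=0,                 # disable top-k filtering, as in the commands above
    top_p=0.9,               # nucleus sampling
    max_new_tokens=100,      # cf. --max_length 100 in stage (2)
    pad_token_id=tokenizer.eos_token_id,
)
# Strip the prompt tokens and print only the newly generated review text.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))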
More details can be found in our paper.
Run with generated data
After dataset generation, we save the synthetic dataset as train.jsonl. The file is in JSON Lines format, e.g., {"idx": 0, "text": "The Book of Mormon Musical brings all the drama and excitement of a real revival of the Broadway production to the big screen.", "label": 0}. We provide sample synthetic and standard datasets in this Google Drive link.
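As a quick sanity check, a few lines of standard Python suffice to load such a file (the path and field names are taken from the example above):

import json

# Load a JSON Lines dataset: one JSON object per line.
def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

examples = load_jsonl("train.jsonl")
print(len(examples), examples[0]["text"], examples[0]["label"])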
To learn the sample weights using an LSTM as the TAM, please use the following script.
python run_reweight.py --gold_data_path data/imdb/std/ --syn_data_path data/imdb/gpt2-xl/ --task_name imdb --num_use_samples_inner 1000000 --num_use_samples_outer 50000 --epoch_converge 1 --outer_lr 2.5e-1 --inner_lr 1e-3 --seed 12345 --backward_batch_size 4096 --wandb --outer_obj combined --inner_obj ce --init_label 10 --theta_upper_lim 1 --check_ft_every 5 --epoch_converge_fully_train 5 --threshold 0.9 --optim Adam --max_outer_iter 100 --hard --init_theta 1 --subset_outer --use_sigmoid --disable_outer_scheduler --shuffle_train
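Conceptually, the script learns a scalar weight per synthetic example via bilevel optimization: the inner loop fits the small model on the synthetic data under a weighted cross-entropy (--inner_obj ce), and the outer loop updates the weights against the outer objective (--outer_obj combined). The PyTorch sketch below illustrates only the weighted inner objective; the theta variable and sigmoid parameterization mirror the --init_theta and --use_sigmoid flags, while everything else is a simplifying assumption, not the run_reweight.py implementation.

import torch
import torch.nn.functional as F

# Simplified sketch: per-example parameters theta, squashed through a sigmoid
# (cf. --use_sigmoid / --init_theta 1), scale the inner cross-entropy loss.
num_synthetic = 8      # toy size; the real runs use up to 1,000,000 samples
theta = torch.full((num_synthetic,), 1.0, requires_grad=True)

def weighted_ce(logits, labels, idx):
    """Inner objective: cross-entropy weighted by the learned sample weights."""
    w = torch.sigmoid(theta[idx])                  # weights in (0, 1)
    per_example = F.cross_entropy(logits, labels, reduction="none")
    return (w * per_example).sum() / w.sum()       # weight-normalized mean

# Toy usage with random logits/labels for a batch of 4 synthetic examples.
logits = torch.randn(4, 2)
labels = torch.randint(0, 2, (4,))
idx = torch.tensor([0, 1, 2, 3])
loss = weighted_ce(logits, labels, idx)
loss.backward()        # gradients reach theta (and the model, in the real setup)
print(loss.item(), theta.grad[:4])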
Citation
If you find our code useful, please cite our papers:
@inproceedings{
gao2023selfguided,
title={Self-Guided Noise-Free Data Generation for Efficient Zero-Shot Learning},
author={Jiahui Gao and Renjie Pi and Yong Lin and Hang Xu and Jiacheng Ye and Zhiyong Wu and Weizhong Zhang and Xiaodan Liang and Zhenguo Li and Lingpeng Kong},
booktitle={International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=h5OpjGd_lo6}
}
@inproceedings{ye-etal-2022-progen,
title = "ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback",
author = "Ye, Jiacheng and Gao, Jiahui and Wu, Zhiyong and Feng, Jiangtao and Yu,Tao and Kong, Lingpeng",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-emnlp.269",
pages = "3671--3683"
}
@inproceedings{ye-etal-2022-zerogen,
title = "{Z}ero{G}en: Efficient Zero-shot Learning via Dataset Generation",
author = "Ye, Jiacheng and Gao, Jiahui and Li, Qintong and Xu, Hang and Feng, Jiangtao and Wu, Zhiyong and Yu, Tao and Kong, Lingpeng",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
year = "2022"
}