Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation
Official PyTorch Implementation
Paper | Project Page | Run Analysis Baseline
This repo contains the official implementation of our paper "Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation". You can find more details on our project page and in our paper.
Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation
Esteban Garces Arias, Meimingwei Li, Christian Heumann, Matthias Aßenmacher
Department of Statistics, LMU Munich, Munich Center for Machine Learning (MCML)
- [2024/11/21] We have released the whole pre-generated dataset! 🤩
- [2024/11/16] We have released the official code implementation of our paper! 🤩
- [2024/10/08] First version of our paper is available on arXiv now!
Table of Contents [Back to Top]
- Download Pre-generated Dataset
- Dependency Installation
- Run LLM Inference Experiments
- Benchmark Decoding Methods
- Log Benchmark Results
- Enhancements
- BibTeX
- License
- Contributions
Download Pre-generated Dataset [Back to Top]
To download the pre-generated dataset used in our paper, please run the following command:
gdown --folder https://drive.google.com/drive/folders/1Xa1ZtZpqL7bySVEy_Q8fqGjfNN7L-xvG
Dependency Installation [Back to Top]
To install all the dependencies for our paper, run the following commands:
pip install -r requirements.txt
SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True pip install simctg
We recommend creating a fresh conda environment before installing the dependencies:
conda create -n decoding-decoded python=3.11
conda activate decoding-decoded
pip install -r requirements.txt
SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True pip install simctg
Run LLM Inference Experiments [Back to Top]
We compare five decoding methods in our paper: Contrastive Search, Top-k Sampling, Top-p Sampling, Beam Search, and Temperature Scaling, with the following hyperparameter combinations (a minimal sketch of how these map onto Hugging Face generation arguments follows the list):
- Contrastive Search: alpha=0.2, 0.4, 0.6, 0.8, 1.0, k=1, 3, 5, 10, 15, 20, 50
- Top-k Sampling: k=1, 3, 5, 10, 15, 20, 50
- Top-p Sampling: p=0.6, 0.7, 0.8, 0.9, 0.95
- Beam Search: beam_size=3, 5, 10, 15, 20, 50
- Temperature Scaling: temperature=0.1, 0.3, 0.5, 0.7, 0.9, 1.0
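For orientation only, the sketch below shows how these hyperparameters typically map onto Hugging Face transformers generate arguments. It is a minimal illustration rather than the repo's llm_exp scripts; the model name and prompt are placeholders.
# Minimal sketch: mapping the hyperparameters above to Hugging Face generate arguments.
# This is illustrative only; the actual experiments are run via the llm_exp/ scripts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openai-community/gpt2-xl"  # placeholder; any of the models below works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

inputs = tokenizer("DeepMind Company is", return_tensors="pt")

with torch.no_grad():
    # Contrastive search: alpha -> penalty_alpha, k -> top_k
    cs = model.generate(**inputs, penalty_alpha=0.6, top_k=5, max_new_tokens=64)
    # Top-k sampling
    topk = model.generate(**inputs, do_sample=True, top_k=20, max_new_tokens=64)
    # Top-p (nucleus) sampling; top_k=0 disables the default top-k filter
    topp = model.generate(**inputs, do_sample=True, top_p=0.95, top_k=0, max_new_tokens=64)
    # Beam search
    beam = model.generate(**inputs, num_beams=5, do_sample=False, max_new_tokens=64)
    # Temperature scaling
    temp = model.generate(**inputs, do_sample=True, temperature=0.7, max_new_tokens=64)

print(tokenizer.decode(cs[0], skip_special_tokens=True))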
We run the decoding methods on six models, including GPT2-XL, Mistral-7B-v0.1, Mistral-7B-v0.3, Qwen2-7B, and Llama-3.1-8B (see the example commands below). We then benchmark the decoding quality and perplexity of these decoding methods; please check the Benchmark Decoding Methods section for more details.
You may need to authenticate by logging in to Hugging Face to run the experiments with Llama-3.1 and Mistral-7B-v0.3:
huggingface-cli login
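If you prefer to authenticate from Python (e.g., inside a notebook), the huggingface_hub equivalent is sketched below; it is optional and not part of the repo's scripts.
# Optional alternative to `huggingface-cli login`; requires a Hugging Face access token.
from huggingface_hub import login

login()  # interactive prompt; alternatively pass login(token="hf_...")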
To run the LLM inference experiments for the contrastive search decoding method, run the following command:
python llm_exp/llm_contrastive_search.py \
--dataset wikitext \
--k 20 \
--alpha 0.8 \
--save_file mistralv03 \
--save_path_prefix Mistralv03-alpha08 \
--model_name mistralai/Mistral-7B-v0.3 \
--cuda 0 \
--dataset_prefix ./data
To run the LLM inference experiments for the top-k sampling decoding method, run the following command:
python llm_exp/llm_top-k.py \
--k 20 \
--save_file gpt2-xl \
--save_path_prefix GPT2-XL-topk \
--dataset wikitext \
--model_name openai-community/gpt2-xl \
--cuda 0
To run the LLM inference experiments for the top-p sampling decoding method, run the following command:
python llm_exp/llm_top-p.py \
--p 0.95 \
--save_file qwen2 \
--save_path_prefix Qwen2-topp \
--dataset wikitext \
--model_name Qwen/Qwen2-7B \
--cuda 0
To run the LLM inference experiments for the beam search decoding method, run the following command:
python llm_exp/llm_beam-search.py \
--num_beams 5 \
--save_file llama-3_1 \
--dataset wikinews \
--model_name meta-llama/Meta-Llama-3.1-8B \
--save_path_prefix Llama-3_1-beam \
--cuda 0
To run the LLM inference experiments for the temperature scaling decoding method, run the following command:
python llm_exp/llm_temp.py \
--temp 0.1 \
--save_file mistralv01 \
--dataset wikitext \
--model_name mistralai/Mistral-7B-v0.1 \
--save_path_prefix mistralv01-temp \
--cuda 0
🧪 Benchmark Decoding Methods [Back to Top]
To benchmark the decoding methods, please make sure you have all the dependencies installed.
We provide several ways of measuring the diversity, generation length, and MAUVE score of the generated texts. Choose the one that best fits your needs.
🧪 Measure Diversity, Generation Length and MAUVE Score [Back to Top]
To measure the diversity, generation length, MAUVE score, and coherence of the generated texts for a single generated text file, please run the following commands:
# change the test path to the file path you want to evaluate
bash scripts/measure_single_mauve.sh YOUR_TEST_PATH
bash scripts/measure_single_coherence.sh YOUR_TEST_PATH
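For reference, MAUVE can be computed with the mauve-text package and diversity with simple distinct-n statistics. The sketch below only illustrates the idea; the repo's measurement scripts may use different preprocessing, parameters, and diversity definitions, and the placeholder text lists are assumptions.
# Illustrative sketch only: MAUVE and distinct-n diversity for a list of generated texts.
import mauve  # pip install mauve-text

def distinct_n(texts, n):
    # Fraction of unique n-grams over all n-grams (a common diversity proxy).
    total, unique = 0, set()
    for text in texts:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / max(total, 1)

human_texts = ["..."]      # replace with the reference continuations
generated_texts = ["..."]  # replace with the model continuations

out = mauve.compute_mauve(p_text=human_texts, q_text=generated_texts,
                          device_id=0,  # use device_id=-1 to featurize on CPU
                          max_text_length=256, verbose=False)
print("MAUVE:", out.mauve)
print("distinct-2:", distinct_n(generated_texts, 2))
print("avg. generation length:", sum(len(t.split()) for t in generated_texts) / len(generated_texts))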
To measure the diversity, generation length, and MAUVE score of the generated texts for a folder of generated text files (for example, ".../Qwen-beam/Qwen2-beam/", which contains three subfolders of generated text files), please run the following commands:
bash scripts/measure_mauve.sh YOUR_FOLDER_PATH
bash scripts/measure_coherence.sh YOUR_FOLDER_PATH
To run the measurements for all the generated text files under the root directory, please run the following commands:
bash scripts/mauve_pipe.sh
bash scripts/coherence_pipe.sh
You may need to change DATA_DIR in the scripts to the root directory of your generated text files, and adjust BASE_DIR according to the names of the models you used.
Log Benchmark Results [Back to Top]
To log the benchmark results, please run the command that corresponds to the decoding method you used.
To log a result folder generated with the "contrastive search" decoding method, please run the following command:
python scripts/log_cs.py --folder_path YOUR_RESULT_PATH --save_path YOUR_SAVE_PATH
To log a result folder generated with the "top-k sampling" decoding method, please run the following command:
python scripts/log_topk.py --folder_path YOUR_RESULT_PATH --save_path YOUR_SAVE_PATH
To log a result folder generated with the "top-p sampling" decoding method, please run the following command:
python scripts/log_topp.py --folder_path YOUR_RESULT_PATH --save_path YOUR_SAVE_PATH
To log a result folder generated with the "beam search" decoding method, please run the following command:
python scripts/log_beam.py --folder_path YOUR_RESULT_PATH --save_path YOUR_SAVE_PATH
To log a result folder generated with the "temperature scaling" decoding method, please run the following command:
python scripts/log_temp.py --folder_path YOUR_RESULT_PATH --save_path YOUR_SAVE_PATH
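If you want to combine several logged result files into a single overview yourself, a pandas-based aggregation could look like the sketch below. The file layout and metric keys here are assumptions for illustration only, not the repo's actual output format; prefer the scripts/log_*.py scripts above.
# Purely illustrative: collect per-run metric JSON files into one CSV.
# The paths and keys ("results/**/*.json", metric fields) are assumptions, not the repo's format.
import glob
import json
import pandas as pd

rows = []
for path in glob.glob("results/**/*.json", recursive=True):
    with open(path) as f:
        metrics = json.load(f)
    rows.append({"file": path, **metrics})

pd.DataFrame(rows).to_csv("summary.csv", index=False)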
💪 Enhancements [Back to Top]
Generation could likely be sped up by:
- using torch.compile in PyTorch 2.0. We implemented this with the max-autotune mode in the generation scripts; you may need to modify the torch.compile code to fit your needs (a minimal sketch is shown below).
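As a rough illustration (not the repo's exact code), compiling a Hugging Face model with the max-autotune mode before generation looks like this; the model name is a placeholder.
# Illustrative sketch: compile a causal LM with PyTorch 2.x before generation.
# The first generation call is slower because of autotuning; subsequent calls benefit.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openai-community/gpt2-xl"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda").eval()
model = torch.compile(model, mode="max-autotune")

inputs = tokenizer("The", return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(**inputs, penalty_alpha=0.6, top_k=5, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))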
TF32 Note (important for users of Ampere, Hopper, and other recent NVIDIA GPUs).
When we ran the above generation scripts, TF32 matmuls were disabled per PyTorch's defaults.
We've enabled them at the top of measure_CD_mauve_diversity_gen_len.py and measure_diversity_mauve_gen_length.py because it makes sampling much faster on those GPUs. Note, however, that using TF32 may lead to small differences in the results; these differences are likely negligible for most comparison purposes.
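For reference, these are the standard PyTorch switches for TF32; the exact lines in the two measurement scripts may differ slightly.
# Enable TF32 matmuls and cuDNN TF32 kernels on Ampere/Hopper GPUs (faster, slightly lower precision).
import torch

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True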
BibTeX [Back to Top]
@article{garces2024decoding,
title={Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation},
author={Garces Arias, Esteban and Li, Meimingwei and Heumann, Christian and A{\ss}enmacher, Matthias},
journal={arXiv preprint arXiv:2410.06097},
year={2024}
}
License [Back to Top]
See LICENSE.txt for details.
🤝 Contributions [Back to Top]
This repository is based on the following repositories:
- Adaptive Contrastive Search: Uncertainty-Guided Decoding for Open-Ended Text Generation
- Contrastive Search versus Contrastive Decoding
We thank the authors for open-sourcing their code.