Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation
Official PyTorch Implementation
Paper | Project Page | Run Analysis Baseline
This repo contains the official implementation of our paper "Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation". You can find more details on our project page and in our paper.
Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation
Esteban Garces Arias, Meimingwei Li, Christian Heumann, Matthias Aßenmacher
Department of Statistics, LMU Munich, Munich Center for Machine Learning (MCML)
- [2024/11/21] We have released the whole pre-generated dataset! 🤩
- [2024/11/16] We have released the official code implementation of our paper! 🤩
- [2024/10/08] First version of our paper is available on arXiv now!
Table of Contents [Back to Top]
- Download Pre-generated Dataset
- Dependency Installation
- Run LLM Inference Experiments
- Benchmark Decoding Methods
- Log Benchmark Results
- Enhancements
- BibTeX
- License
- Contributions
Download Pre-generated Dataset [Back to Top]
To download the pre-generated dataset used in our paper, please run the following command:
gdown --folder https://drive.google.com/drive/folders/1Xa1ZtZpqL7bySVEy_Q8fqGjfNN7L-xvG
Dependency Installation [Back to Top]
To install all the dependencies for our paper, run the following commands:
pip install -r requirements.txt
SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True pip install simctg
We recommend creating a fresh conda environment before installing the dependencies:
conda create -n decoding-decoded python=3.11
conda activate decoding-decoded
pip install -r requirements.txt
SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True pip install simctg
Run LLM Inference Experiments [Back to Top]
We compare five decoding methods in our paper: Contrastive Search, Top-k Sampling, Top-p Sampling, Beam Search, and Temperature Scaling, with the following hyperparameter combinations (a minimal sketch of how these map onto Hugging Face generation arguments follows the list):
- Contrastive Search: alpha=0.2, 0.4, 0.6, 0.8, 1.0, k=1, 3, 5, 10, 15, 20, 50
- Top-k Sampling: k=1, 3, 5, 10, 15, 20, 50
- Top-p Sampling: p=0.6, 0.7, 0.8, 0.9, 0.95
- Beam Search: beam_size=3, 5, 10, 15, 20, 50
- Temperature Scaling: temperature=0.1, 0.3, 0.5, 0.7, 0.9, 1.0
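For orientation only, the sketch below shows how these hyperparameters typically map onto Hugging Face transformers generate arguments. It is a minimal illustration rather than the repo's llm_exp scripts; the model name and prompt are placeholders.
# Minimal sketch: mapping the hyperparameters above to Hugging Face generate arguments.
# This is illustrative only; the actual experiments are run via the llm_exp/ scripts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openai-community/gpt2-xl"  # placeholder; any of the models below works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

inputs = tokenizer("DeepMind Company is", return_tensors="pt")

with torch.no_grad():
    # Contrastive search: alpha -> penalty_alpha, k -> top_k
    cs = model.generate(**inputs, penalty_alpha=0.6, top_k=5, max_new_tokens=64)
    # Top-k sampling
    topk = model.generate(**inputs, do_sample=True, top_k=20, max_new_tokens=64)
    # Top-p (nucleus) sampling; top_k=0 disables the default top-k filter
    topp = model.generate(**inputs, do_sample=True, top_p=0.95, top_k=0, max_new_tokens=64)
    # Beam search
    beam = model.generate(**inputs, num_beams=5, do_sample=False, max_new_tokens=64)
    # Temperature scaling
    temp = model.generate(**inputs, do_sample=True, temperature=0.7, max_new_tokens=64)

print(tokenizer.decode(cs[0], skip_special_tokens=True))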
We run the decoding methods on six models, including GPT2-XL, Mistral-7B-v0.1, Mistral-7B-v0.3, Qwen2-7B, and Llama-3.1-8B (see the example commands below). We then benchmark the decoding quality and perplexity of these decoding methods; please check the Benchmark Decoding Methods section for more details.
You may need to authenticate by logging in to Hugging Face to run the experiments with Llama-3.1 and Mistral-7B-v0.3:
huggingface-cli login
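If you prefer to authenticate from Python (e.g., inside a notebook), the huggingface_hub equivalent is sketched below; it is optional and not part of the repo's scripts.
# Optional alternative to `huggingface-cli login`; requires a Hugging Face access token.
from huggingface_hub import login

login()  # interactive prompt; alternatively pass login(token="hf_...")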
To run the LLM inference experiments for the contrastive search decoding method, run the following command:
python llm_exp/llm_contrastive_search.py \
--dataset wikitext \
--k 20 \
--alpha 0.8 \
--save_file mistralv03 \
--save_path_prefix Mistralv03-alpha08 \
--model_name mistralai/Mistral-7B-v0.3 \
--cuda 0 \
--dataset_prefix ./data
To run the LLM inference experiments for the top-k sampling decoding method, run the following command:
python llm_exp/llm_top-k.py \
--k 20 \
--save_file gpt2-xl \
--save_path_prefix GPT2-XL-topk \
--dataset wikitext \
--model_name openai-community/gpt2-xl \
--cuda 0
To run the LLM inference experiments for the top-p sampling decoding method, run the following command:
python llm_exp/llm_top-p.py \
--p 0.95 \
--save_file qwen2 \
--save_path_prefix Qwen2-topp \
--dataset wikitext \
--model_name Qwen/Qwen2-7B \
--cuda 0
To run the LLM inference experiments for the beam search decoding method, run the following command:
python llm_exp/llm_beam-search.py \
--num_beams 5 \
--save_file llama-3_1 \
--dataset wikinews \
--model_name meta-llama/Meta-Llama-3.1-8B \
--save_path_prefix Llama-3_1-beam \
--cuda 0
To run the LLM inference experiments for the temperature scaling decoding method, run the following command:
python llm_exp/llm_temp.py \
--temp 0.1 \
--save_file mistralv01 \
--dataset wikitext \
--model_name mistralai/Mistral-7B-v0.1 \
--save_path_prefix mistralv01-temp \
--cuda 0
🧪 Benchmark Decoding Methods [Back to Top]
To benchmark the decoding methods, please make sure you have all the dependencies installed.
We provide several ways of measuring the diversity, generation length, and MAUVE score of the generated texts. Choose the one that best fits your needs.
🧪 Measure Diversity, Generation Length and MAUVE Score [Back to Top]
To measure the diversity, generation length, MAUVE score, and coherence of the generated texts for a single generated text file, please run the following commands:
# change the test path to the file path you want to evaluate
bash scripts/measure_single_mauve.sh YOUR_TEST_PATH
bash scripts/measure_single_coherence.sh YOUR_TEST_PATH
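For reference, MAUVE can be computed with the mauve-text package and diversity with simple distinct-n statistics. The sketch below only illustrates the idea; the repo's measurement scripts may use different preprocessing, parameters, and diversity definitions, and the placeholder text lists are assumptions.
# Illustrative sketch only: MAUVE and distinct-n diversity for a list of generated texts.
import mauve  # pip install mauve-text

def distinct_n(texts, n):
    # Fraction of unique n-grams over all n-grams (a common diversity proxy).
    total, unique = 0, set()
    for text in texts:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / max(total, 1)

human_texts = ["..."]      # replace with the reference continuations
generated_texts = ["..."]  # replace with the model continuations

out = mauve.compute_mauve(p_text=human_texts, q_text=generated_texts,
                          device_id=0,  # use device_id=-1 to featurize on CPU
                          max_text_length=256, verbose=False)
print("MAUVE:", out.mauve)
print("distinct-2:", distinct_n(generated_texts, 2))
print("avg. generation length:", sum(len(t.split()) for t in generated_texts) / len(generated_texts))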
To measure the diversity, generation length, and MAUVE score of the generated texts for a folder of generated text files (for example, ".../Qwen-beam/Qwen2-beam/", which contains three subfolders of generated text files), please run the following commands:
bash scripts/measure_mauve.sh YOUR_FOLDER_PATH
bash scripts/measure_coherence.sh YOUR_FOLDER_PATH
To run the measurements for all the generated text files under the root directory, please run the following commands:
bash scripts/mauve_pipe.sh
bash scripts/coherence_pipe.sh
You may need to change DATA_DIR in the scripts to the root directory of your generated text files, and adjust BASE_DIR according to the names of the models you used.
Log Benchmark Results [Back to Top]
To log the benchmark results, please run the command that corresponds to the decoding method you used.
To log a result folder generated with the "contrastive search" decoding method, please run the following command:
python scripts/log_cs.py --folder_path YOUR_RESULT_PATH --save_path YOUR_SAVE_PATH
To log a result folder generated with the "top-k sampling" decoding method, please run the following command:
python scripts/log_topk.py --folder_path YOUR_RESULT_PATH --save_path YOUR_SAVE_PATH
To log a result folder generated with the "top-p sampling" decoding method, please run the following command:
python scripts/log_topp.py --folder_path YOUR_RESULT_PATH --save_path YOUR_SAVE_PATH
To log a result folder generated with the "beam search" decoding method, please run the following command:
python scripts/log_beam.py --folder_path YOUR_RESULT_PATH --save_path YOUR_SAVE_PATH
To log a result folder generated with the "temperature scaling" decoding method, please run the following command:
python scripts/log_temp.py --folder_path YOUR_RESULT_PATH --save_path YOUR_SAVE_PATH
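If you want to combine several logged result files into a single overview yourself, a pandas-based aggregation could look like the sketch below. The file layout and metric keys here are assumptions for illustration only, not the repo's actual output format; prefer the scripts/log_*.py scripts above.
# Purely illustrative: collect per-run metric JSON files into one CSV.
# The paths and keys ("results/**/*.json", metric fields) are assumptions, not the repo's format.
import glob
import json
import pandas as pd

rows = []
for path in glob.glob("results/**/*.json", recursive=True):
    with open(path) as f:
        metrics = json.load(f)
    rows.append({"file": path, **metrics})

pd.DataFrame(rows).to_csv("summary.csv", index=False)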
💪 Enhancements [Back to Top]
Generation could likely be sped up by:
- using torch.compile in PyTorch 2.0. We implemented this with the max-autotune mode in the generation scripts; you may need to modify the torch.compile code to fit your needs (a minimal sketch is shown below).
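As a rough illustration (not the repo's exact code), compiling a Hugging Face model with the max-autotune mode before generation looks like this; the model name is a placeholder.
# Illustrative sketch: compile a causal LM with PyTorch 2.x before generation.
# The first generation call is slower because of autotuning; subsequent calls benefit.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openai-community/gpt2-xl"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda").eval()
model = torch.compile(model, mode="max-autotune")

inputs = tokenizer("The", return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(**inputs, penalty_alpha=0.6, top_k=5, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))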
TF32 Note (important for users of Ampere, Hopper, and other recent NVIDIA GPUs).
When we ran the above generation scripts, TF32 matmuls were disabled per PyTorch's defaults.
We've enabled them at the top of measure_CD_mauve_diversity_gen_len.py and measure_diversity_mauve_gen_length.py because it makes sampling much faster on those GPUs. Note, however, that using TF32 may lead to small differences in the results; these differences are likely negligible for most comparison purposes.
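For reference, these are the standard PyTorch switches for TF32; the exact lines in the two measurement scripts may differ slightly.
# Enable TF32 matmuls and cuDNN TF32 kernels on Ampere/Hopper GPUs (faster, slightly lower precision).
import torch

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True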
BibTeX [Back to Top]
@article{garces2024decoding,
title={Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation},
author={Garces Arias, Esteban and Li, Meimingwei and Heumann, Christian and A{\ss}enmacher, Matthias},
journal={arXiv preprint arXiv:2410.06097},
year={2024}
}
License [Back to Top]
See LICENSE.txt for details.
🤝 Contributions [Back to Top]
This repository is based on the following repositories:
- Adaptive Contrastive Search: Uncertainty-Guided Decoding for Open-Ended Text Generation
- Contrastive Search versus Contrastive Decoding
We thank the authors for open-sourcing their code.