
CodeJudge: Evaluating Code Generation with Large Language Models

If you like our project, please give us a star ⭐ on GitHub to receive the latest updates.

😮 Highlights

CodeJudge is a code evaluation framework that leverages LLMs to evaluate the semantic correctness of generated code without the need for test cases.
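
At its core, CodeJudge asks an LLM to analyze whether generated code semantically satisfies the task description. The sketch below illustrates this LLM-as-judge idea with the OpenAI Python SDK; the prompt wording and the judge_code helper are illustrative assumptions, not the exact templates used in this repository.

# Minimal sketch of the LLM-as-judge idea (illustrative prompt, not CodeJudge's exact templates).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_code(task_description: str, code: str, model: str = "gpt-3.5-turbo-1106") -> str:
    prompt = (
        "You will be given a programming task and a candidate solution.\n"
        "Analyze step by step whether the code correctly solves the task, "
        "then conclude with either CORRECT or INCORRECT.\n\n"
        f"Task:\n{task_description}\n\nCode:\n{code}\n"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content

print(judge_code("Return the sum of two integers.", "def add(a, b):\n    return a - b"))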

💡 A simple yet efficient insight

Results show that CodeJudge significantly outperformed existing methods across the four LLMs we tested. Furthermore, compared to a state-of-the-art GPT-3.5-based code evaluation method, CodeJudge achieved better results even when using a much smaller model, Llama-3-8B-Instruct.

⚡ Off-the-shelf framework that is easy to use

CodeJudge is an off-the-shelf evaluation framework that can be easily integrated into new LLM-based code generation systems.

🛠️ Requirements and Installation

  • Python >= 3.10
  • PyTorch >= 2.2.0
  • CUDA Version >= 11.7
  • Install required packages:
git clone https://github.com/VichyTong/CodeJudge
cd CodeJudge
conda create -n codejudge python=3.10 -y
conda activate codejudge
pip install -r requirements.txt
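
After installing, you can run a quick sanity check (assuming a CUDA-capable GPU) to confirm that the Python, PyTorch, and CUDA requirements above are met:

# Quick environment check for the requirements listed above.
import sys
import torch

print("Python:", sys.version.split()[0])           # expect >= 3.10
print("PyTorch:", torch.__version__)               # expect >= 2.2.0
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)         # expect >= 11.7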

💾 Dataset Preparation

We have uploaded the datasets used to test CodeJudge under evaluation/data. If you want to generate code samples yourself on the HumanEval-X dataset, please follow the instructions in the markdown file.

🔽 Model Preparation

OpenAI models

export OPENAI_API_KEY=<YOUR_API_KEY>
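
To confirm that the key is picked up, a minimal request like the one below should succeed (the model name is just an example from the supported list):

# Minimal check that the OPENAI_API_KEY environment variable is usable.
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment
reply = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[{"role": "user", "content": "Say 'ready'."}],
)
print(reply.choices[0].message.content)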

Meta Models

  1. Please follow the instructions on the official websites of Code Llama and Llama-3 to obtain download URLs.

  2. Download the models. For Code Llama:

cd evaluation/model/codellama
bash download.sh

For Llama-3:

cd evaluation/model/llama3
bash download.sh

  3. Convert the models to Hugging Face 🤗 format:

cd evaluation/model
bash convert.sh
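
Once converted, the checkpoint should load with the Hugging Face transformers API. The sketch below is a quick sanity check; the model path is a placeholder and depends on where convert.sh writes the converted weights.

# Sanity-check the converted checkpoint (path is a placeholder; adjust to your output directory).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "evaluation/model/llama3/Meta-Llama-3-8B-Instruct-hf"  # hypothetical output path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype="auto")

inputs = tokenizer("def add(a, b):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))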

🚀 Run CodeJudge

Please refer to the sample scripts under the sample_scripts folder for each dataset (HumanEval-X, CoNaLa, APPS, BigCodeBench).

You can choose models from: gpt-3.5-turbo-1106, CodeLlama-34b-Instruct, Meta-Llama-3-8B-Instruct, and Meta-Llama-3-70B-Instruct.

For example, you can run the HumanEval-X test with:

cd evaluation
bash humaneval/sample_script/gpt-3.5-turbo-python.sh

You can also download our test results here.

👍 Acknowledgement

We thank these great works:

  • HumanEval is a widely used Python dataset to evaluate code generation.
  • HumanEval-X is a multi-language extension of HumanEval, including C++, Python, Java, JavaScript, Go, and Rust.
  • CoNaLa is a Python code generation benchmark with 472 tasks collected from StackOverflow.
  • We especially thank Dr. Evtikhiev of JetBrains Research for generously sharing the human-labeled data of the CoNaLa dataset.
  • APPS is a Python code generation benchmark that includes introductory, interview-level, and competition-level tasks collected from coding competition websites.
  • BigCodeBench is a recently released code generation benchmark with 1,140 practical and challenging programming tasks.
  • We also thank MultiPL-E for their excellent code for sampling programs using code generation LLMs.

Citation

If you find our work helpful, please consider citing our paper:

@inproceedings{tong-zhang-2024-codejudge,
    title = "{C}ode{J}udge: Evaluating Code Generation with Large Language Models",
    author = "Tong, Weixi  and  Zhang, Tianyi",
    editor = "Al-Onaizan, Yaser  and  Bansal, Mohit  and  Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.1118",
    pages = "20032--20051"
}