CodeJudge is a code evaluation framework that leverages LLMs to evaluate the semantic correctness of generated code without the need for test cases.
Results show that CodeJudge significantly outperformed existing methods across the four LLMs we tested. Furthermore, compared to a state-of-the-art GPT-3.5-based code evaluation method, CodeJudge achieved better results even when using a much smaller model, Llama-3-8B-Instruct.
CodeJudge is an off-the-shelf evaluation framework that can be easily integrated into new LLM-based code generation systems.
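To illustrate the core idea (prompting an LLM to judge the semantic correctness of generated code, with no test cases involved), here is a minimal generic sketch using the OpenAI Python client. The helper name `judge_code` and the prompt wording are illustrative only and are not CodeJudge's actual prompts or interface:

```python
# Illustrative sketch only: a generic LLM-as-judge call,
# not CodeJudge's actual prompts or interface.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def judge_code(task_description: str, generated_code: str) -> str:
    """Ask an LLM whether the generated code satisfies the task (hypothetical helper)."""
    prompt = (
        "You will be given a programming task and a candidate solution.\n"
        "Analyze whether the code correctly solves the task, then answer\n"
        "'correct' or 'incorrect' with a brief justification.\n\n"
        f"Task:\n{task_description}\n\nCode:\n{generated_code}\n"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```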
- Python >= 3.10
- PyTorch >= 2.2.0
- CUDA Version >= 11.7
- Install required packages:
```bash
git clone https://github.com/VichyTong/CodeJudge
cd CodeJudge
conda create -n codejudge python=3.10 -y
conda activate codejudge
pip install -r requirements.txt
```
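Optionally, you can verify that the environment matches the requirements above with a quick Python check:

```python
# Optional sanity check against the requirements listed above.
import sys
import torch

print(sys.version)                # expect Python >= 3.10
print(torch.__version__)          # expect PyTorch >= 2.2.0
print(torch.version.cuda)         # expect CUDA >= 11.7
print(torch.cuda.is_available())  # expect True on a GPU machine
```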
The datasets we used to test CodeJudge are available under evaluation/data. If you want to generate code samples yourself on the HumanEval-X dataset, please follow the instructions in the markdown file.
```bash
export OPENAI_API_KEY=<YOUR_API_KEY>
```
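The OpenAI Python client reads OPENAI_API_KEY from the environment, so no further configuration is needed; a quick way to confirm the variable is visible to Python:

```python
# Confirm the API key is set before running the GPT-3.5-based scripts.
import os
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
```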
- Please follow the instructions on the official websites of Code Llama and Llama 3 to get the download URLs.
- Download the model.
For Code Llama:
```bash
cd evaluation/model/codellama
bash download.sh
```
For Llama-3:
```bash
cd evaluation/model/llama3
bash download.sh
```
- Convert the model to Huggingface 🤗 format:
```bash
cd evaluation/model
bash convert.sh
```
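After conversion, the checkpoint can be loaded with the Hugging Face Transformers library as usual. A minimal sketch, where the model path is a placeholder for wherever convert.sh writes its output:

```python
# Sketch only: the path below is a placeholder for the converted checkpoint directory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "evaluation/model/llama3/Meta-Llama-3-8B-Instruct-hf"  # illustrative path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",  # requires the accelerate package
)
```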
Please refer to the sample scripts under the sample_scripts folder for each dataset (HumanEval-X, CoNaLa, APPS, BigCodeBench).
You can choose models from gpt-3.5-turbo-1106, CodeLlama-34b-Instruct, Meta-Llama-3-8B-Instruct, and Meta-Llama-3-70B-Instruct.
For example, you can run the HumanEval-X test with:
```bash
cd evaluation
bash humaneval/sample_script/gpt-3.5-turbo-python.sh
```
You can also download our test results here.
We thank these great works:
- HumanEval is a widely used Python dataset to evaluate code generation.
- HumanEval-X is a multi-language extension of HumanEval, including C++, Python, Java, JavaScript, Go, and Rust.
- CoNaLa is a Python code generation benchmark with 472 tasks collected from StackOverflow.
- We especially thank Dr. Evtikhiev of JetBrains Research for generously sharing the human-labeled data of the CoNaLa dataset.
- APPS is a Python code generation benchmark that includes introductory, interview-level, and competition-level tasks collected from coding competition websites.
- BigCodeBench is a recently released code generation benchmark with 1,140 practical and challenging programming tasks.
- We also thank MultiPL-E for their excellent code for sampling programs using code generation LLMs.
If you find our work helpful, please consider citing our paper:
@inproceedings{tong-zhang-2024-codejudge,
title = "{C}ode{J}udge: Evaluating Code Generation with Large Language Models",
author = "Tong, Weixi and Zhang, Tianyi",
editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.1118",
pages = "20032--20051"
}