Multi-Programming Language Evaluation of Large Language Models of Code (MultiPL-E)

MultiPL-E is a system for translating unit test-driven neural code generation benchmarks to new languages. We have used MultiPL-E to translate two popular Python benchmarks (HumanEval and MBPP) to 18 other programming languages.

For more information:

- MultiPL-E is part of the BigCode Code Generation LM Harness. This is the easiest way to use MultiPL-E.
- The Multilingual Code Models Evaluation by BigCode evaluates Code LLMs using several benchmarks, including MultiPL-E.
- We have a tutorial on how to use MultiPL-E directly.
- Read our paper MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation.
- The MultiPL-E dataset of translated prompts is available on the Hugging Face Hub.

Versions

Version 3.0
- We are going to maintain the changelog on the dataset page: https://huggingface.co/datasets/nuprl/MultiPL-E
- The dataset was versioned at 3.0, and we are bumping the software version to stay in sync.
- We have published several new programming languages in the dataset. However, the following languages are not included at this time: Dafny, Coq, Lean, Luau, and MATLAB.
Version 0.5.0: Instruction-following support and new languages
- New languages: Luau, Elixir, Lean, Coq, Dafny
- Support for instruction-following prompts
- vLLM support for faster evaluation
Version 0.4.0: QoL improvements and new languages
- New languages: OCaml, MATLAB
- Using .jsonl instead of .json for prompts
- Several bugfixes to prompts
Version 0.3.0: used to evaluate StarCoder
- This version corrects several bugs in prompts and test cases that resulted in lower pass@k rates for some of the statically typed languages. The most significant difference is that the pass@k for Java increases by about 2% on HumanEval.
Version 0.2.0: used to evaluate SantaCoder
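
The changelog above refers to pass@k rates. As a rough illustration of what that metric measures, here is a minimal sketch of the commonly used unbiased pass@k estimator: for a problem with n sampled completions of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and chosen for this example; this is not taken from MultiPL-E's own scripts, which may compute the metric differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (hypothetical helper).

    n: number of completions sampled for the problem
    c: number of those completions that passed the unit tests
    k: the k in pass@k (must satisfy 1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # If fewer than k completions failed, every size-k subset
    # contains at least one passing completion.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing
    # completion is C(n - c, k) / C(n, k); pass@k is its complement.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, under these assumptions a problem with 20 completions and 3 passing gives `pass_at_k(20, 3, 1)` of about 0.15, and the benchmark-level figure is the mean of the per-problem estimates.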

