Ollama-MMLU-Pro

This is a modified version of TIGER-AI-Lab/MMLU-Pro that lets you run the MMLU-Pro benchmark via the OpenAI Chat Completions API. This fork evaluates against a smaller IRT-tuned subset of MMLU-Pro. It has been tested with Ollama and Llama.cpp, but it should also work with LMStudio, Koboldcpp, Oobabooga with the openai extension, and other OpenAI-compatible servers.

[Open In Colab badge]

Usage

For example, to run the benchmark against Phi-3 on Ollama, use:

pip install -r requirements.txt
python run_openai.py --url http://localhost:11434/v1 --model phi3
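
The same pattern works with other OpenAI-compatible servers; only the --url (and the model name your server expects) changes. A minimal sketch against a local Llama.cpp server, assuming it is serving its OpenAI-compatible API on the default port 8080 and accepts an arbitrary model name:

python run_openai.py --url http://localhost:8080/v1 --model your-model-name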

By default, it tests all subjects, but you can use the --category option to test only a specific subject (see the example after the subject list).

Subjects include: 'business', 'law', 'psychology', 'biology', 'chemistry', 'history', 'other', 'health', 'economics', 'math', 'physics', 'computer science', 'philosophy', 'engineering'
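
For example, to test only the computer science questions (the multi-word category name is quoted here on the assumption that the script takes it as a single argument; adjust to your shell as needed):

python run_openai.py --url http://localhost:11434/v1 --model phi3 --category "computer science"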

The default timeout is 600 seconds (10 minutes). If the model being tested takes a long time to respond and you encounter a "Request timed out" error, use the --timeout option (in seconds) to increase it.
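
For example, to allow up to 20 minutes (1200 seconds) per request:

python run_openai.py --url http://localhost:11434/v1 --model phi3 --timeout 1200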

Parallelism

You can optionally run multiple tests in parallel using the --parallel option. For example, to run 2 tests in parallel:

python run_openai.py --url http://localhost:11434/v1 --model llama3 --parallel 2
