
# SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories

A benchmark and resources for evaluating LLM agents on setting up and executing ML/NLP tasks from research repositories found in the wild on GitHub.

[arXiv]

## 📝 Benchmark Tasks

The benchmark tasks are available on the Hugging Face Hub 🤗.

We provide three sets: Expert (45 problems), Masked (152 problems), and AutoGen (602 problems).
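
For convenience, the tasks can be loaded with the `datasets` library. The dataset identifier and split name below are illustrative assumptions, not confirmed by this README; check the Hub page for the exact names:

```python
# Minimal sketch of loading the benchmark tasks from the Hugging Face Hub.
# NOTE: the dataset ID and split name are assumptions for illustration --
# consult the Hub page for the actual identifiers.
from datasets import load_dataset

tasks = load_dataset("allenai/super", split="Expert")  # hypothetical ID/split
print(len(tasks))       # number of problems in the set
print(tasks[0].keys())  # fields available for each task
```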

Agent trajectories from the paper's experiments are available here.

## 🚀 Quick Start: Running the Agent

### Setup

1. Clone the repo and install the requirements:

```bash
git clone https://github.com/allenai/super-benchmark.git
cd super-benchmark
pip install -r requirements.txt
```

2. Set your OpenAI API key in a `.env` file:

echo "OPENAI_API_KEY=your-openai-api-key" > .env

### Running queries

The following command runs the agent locally, which carries some risk since the agent executes code directly on your machine. We also provide options to run the agent inside a Docker container or on modal.com; we use the latter for the benchmark evaluation.

```bash
python -m super.run_single_query --env-backend local --query "Download the OpenBookQA dataset at https://github.com/allenai/OpenBookQA and tell me how many examples are in the train, dev, and test splits of the datasets."
```
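
For a sandboxed run, the `--env-backend` flag presumably also accepts the other backends mentioned above; the value below is an assumption inferred from that description, so check the CLI's help output for the exact names:

```bash
# Hypothetical: backend values other than "local" are assumptions here,
# inferred from the Docker/modal.com options mentioned above.
python -m super.run_single_query --env-backend docker --query "..."
```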

## 🤖 Running & Evaluating Agents on SUPER

We provide code to evaluate our implemented agents on SUPER.

To run tasks safely and concurrently, we use modal.com. Modal isn't free, but it is quite cheap: running an average problem from the benchmark should generally cost 2-3 cents (assuming CPU-only execution). In addition, users receive $30 of credit per month, which should be enough to run the benchmark evaluation multiple times.
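
As a rough estimate at the 3-cent upper bound per problem: the Expert set (45 problems) costs about $1.35, Masked (152) about $4.56, and AutoGen (602) about $18, so even a full pass over all three sets fits within the monthly credit.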

```bash
python -m super.run_on_benchmark --set Expert
```
