A toolkit for collecting datasets for Agents and Planning models, and for running evaluation pipelines.
To install the dependencies, run:

```bash
pip install -r requirements.txt
```
We use the Hydra library for the evaluation pipeline. Each configuration is specified in `eval.yaml` in the following format:
```yaml
# @package _global_
hydra:
  job:
    name: ${agent.name}_${agent.model_name}_[YOUR_ADDITIONAL_TOKEN_OR_NOTHING]
  run:
    dir: [YOUR_PATH_TO_OUTPUT_DIR]/${hydra:job.name}
  job_logging:
    root:
      handlers: [console, file]

defaults:
  - _self_
  - data_source: hf
  - env: code_engine
  - agent: planning
```
Here you define the data source, environment, and agent you want to evaluate. We provide several implementations of each, defined in sub-YAML files:
field | options |
---|---|
`data_source` | `hf.yaml` |
`env` | `code_engine.yaml`, `http.yaml`, `few_shot.yaml` |
`agent` | `few_shot.yaml`, `planning.yaml`, `vanilla.yaml`, `reflexion.yaml`, `tree_of_thoughts.yaml`, `adapt.yaml` |
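Each option corresponds to a YAML file in the matching Hydra config group. As an illustration, a sub-config for the `planning` agent might look like the sketch below (the field names here are assumptions for illustration, not the actual schema):

```yaml
# Hypothetical agent/planning.yaml; the actual fields in the repo may differ.
name: planning
model_name: gpt-4-1106-preview
temperature: 0.0
```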
The challenge is to generate a project template: a small compilable project, describable in 1-5 sentences, that contains small examples of all the mentioned libraries, technologies, and functionality.
The dataset of template-related repositories collected from GitHub is published on HuggingFace 🤗. Details about the dataset collection, along with the source code, can be found in the `template_generation` directory.
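Once published, the dataset can be inspected locally with the 🤗 `datasets` library. A minimal sketch, assuming a placeholder dataset identifier (see `template_generation` for the actual name):

```python
from datasets import load_dataset

# "YOUR_ORG/template-generation" is a placeholder identifier; substitute the
# actual HuggingFace dataset name referenced in the template_generation directory.
dataset = load_dataset("YOUR_ORG/template-generation", split="train")

# Each record describes a template-related repository collected from GitHub.
print(dataset[0])
```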
To run the evaluation pipeline, please execute the following command in your console:
```bash
python3 -m src.template_generation.run_eval --multirun agent=planning agent.model_name=gpt-3.5-turbo-1106,gpt-4-1106-preview
```
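The `--multirun` flag makes Hydra launch one job per value of `agent.model_name`. For reference, the entry point follows the standard Hydra pattern; a minimal sketch, assuming the config layout of `eval.yaml` above (the `config_path` value is an assumption):

```python
import hydra
from omegaconf import DictConfig


@hydra.main(version_base=None, config_path="configs", config_name="eval")
def main(config: DictConfig) -> None:
    # Hydra resolves the data_source/env/agent sub-configs from the defaults
    # list; with --multirun it calls this function once per swept value.
    print(f"Evaluating {config.agent.name} with {config.agent.model_name}")


if __name__ == "__main__":
    main()
```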
Model | Metrics |
---|---|