InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

This Python project provides a framework for creating and evaluating the models in InterpBench, a collection of semi-synthetic transformers with known circuits for evaluating mechanistic interpretability techniques.

Setup

This project can be set up either by downloading it and installing the dependencies, or by using the Docker image. Dependencies are managed with Poetry, which you can install by following the instructions here. On Ubuntu, you can install Poetry by running the following commands:

apt update && apt install -y --no-install-recommends pipx
pipx ensurepath
source ~/.*rc
pipx install poetry

Run the following Bash commands to download the project and install its dependencies (assuming Ubuntu; adjust accordingly for a different OS):

apt update && apt install -y --no-install-recommends git build-essential python3-dev graphviz graphviz-dev libgl1
git clone git@github.com:FlyingPumba/circuits-benchmark.git
cd circuits-benchmark
poetry env use 3
poetry install

Then, to activate the virtual environment: poetry shell

How to use it

You can either use InterpBench by downloading the pre-trained models from the Hugging Face repository (see an example here), or by running the commands available in the Python framework.
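As a rough illustration of the first route, the sketch below fetches a single file from a Hugging Face Hub repository with the `huggingface_hub` package. The helper name `download_checkpoint`, the repository id, and the file path are all placeholders assumed for illustration; the actual repository layout is not described in this README, so check the linked example for the real values and for how the downloaded weights are loaded.

```python
from pathlib import Path
from typing import Optional


def download_checkpoint(repo_id: str, filename: str, cache_dir: Optional[str] = None) -> Path:
    """Fetch one file from a Hugging Face Hub repository and return its local path."""
    # Imported lazily so this module can be imported even when the
    # huggingface_hub dependency is not installed.
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(repo_id=repo_id, filename=filename, cache_dir=cache_dir)
    return Path(local_path)


# Hypothetical usage; substitute the real repository id and file path:
# weights = download_checkpoint("<user>/<repo>", "<task>/<weights-file>")
```

The function only downloads the file; turning it into a usable model depends on the checkpoint format, which this sketch does not assume.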

Training commands

The two main options for training models in the benchmark are "iit" and "ioi". The first is used for training SIIT models based on Tracr circuits, and the second for training a model on a simplified version of the IOI circuit. As an example, the following command trains a model on the Tracr circuit with id 3 for 30 epochs, using weights 0.4, 1, and 1 for the SIIT loss, IIT loss, and behaviour loss, respectively.

./main.py train iit -i 3 --epochs 30 -s 0.4 -iit 1 -b 1 --early-stop

To check the arguments available for a specific command, you can use the --help flag. E.g., ./main.py train iit --help
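When training several tasks with the same settings, it can help to generate the command lines from a script. The sketch below rebuilds the example training command using only the flags shown above; the helper name `build_train_command` and its parameter names are assumptions made for illustration, not part of the framework.

```python
import shlex


def build_train_command(task_id, epochs=30, siit_weight=0.4, iit_weight=1,
                        behaviour_weight=1, early_stop=True):
    """Return the argv list for `./main.py train iit`, using only flags from this README."""
    cmd = [
        "./main.py", "train", "iit",
        "-i", str(task_id),
        "--epochs", str(epochs),
        "-s", str(siit_weight),         # SIIT loss weight
        "-iit", str(iit_weight),        # IIT loss weight
        "-b", str(behaviour_weight),    # behaviour loss weight
    ]
    if early_stop:
        cmd.append("--early-stop")
    return cmd


if __name__ == "__main__":
    # Task ids 1 and 3 are the ones mentioned in this README.
    for task_id in (1, 3):
        print(shlex.join(build_train_command(task_id)))
```

The script only prints the commands; pass each list to `subprocess.run` from the project root if you want to launch the runs directly.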

Circuit discovery commands

Three circuit discovery techniques are currently supported: ACDC, SP, and EAP. Some examples:

Running ACDC on Tracr-generated model for task 3: ./main.py run acdc -i 3 --tracr --threshold 0.71
Running SP on a locally trained SIIT model for task 1: ./main.py run sp -i 1 --siit-weights 510 --lambda-reg 0.5
Running edgewise SP on InterpBench models for all tasks: ./main.py run sp --interp-bench --edgewise
Running EAP with integrated gradients on a locally trained SIIT model for IOI task: ./main.py run eap -i ioi --siit-weights 510 --integrated-grad-steps=5

After running an algorithm, the output can be found in the results folder.
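Circuit discovery results often depend on the chosen threshold, so one common workflow is to repeat a run over several values. The sketch below builds one ACDC invocation per threshold from the flags shown in the first example above and, by default, only prints them. The helper names `acdc_commands` and `run_sweep` and the threshold values other than 0.71 are assumptions made for illustration.

```python
import shlex
import subprocess


def acdc_commands(task_id, thresholds, tracr=True):
    """Build one `./main.py run acdc` argv list per threshold."""
    commands = []
    for threshold in thresholds:
        cmd = ["./main.py", "run", "acdc", "-i", str(task_id)]
        if tracr:
            cmd.append("--tracr")
        cmd += ["--threshold", str(threshold)]
        commands.append(cmd)
    return commands


def run_sweep(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the sweep on the first failing run.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # 0.71 comes from the example above; the other values are illustrative.
    run_sweep(acdc_commands(3, [0.1, 0.71, 1.0]))
```

Call `run_sweep(..., dry_run=False)` from the project root to execute the runs.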

Evaluation commands

The framework supports several evaluations: iit, iit_acdc, node_realism, ioi, ioi_acdc, and gt_node_realism. See EXPERIMENTS.md for a list of the commands used in the paper's empirical study.

Tests

To run the tests, run poetry run pytest tests from the root directory of the project. To run specific tests, use the -k flag: poetry run pytest tests -k "get_cases_test".
