Syntax-based Contextual Visualizations for SAE & LLM Interpretability

Project Overview

This project aims to improve interpretability measures by developing a new visualization method for Sparse Autoencoder (SAE) feature contexts. Specifically, we propose using syntactic dependencies to illuminate similarities between contexts.

Project Features

We developed three novel views for activation contexts: two use syntactic dependency structures and one uses branching trees. The backend uses the spaCy dependency parser and sentence tagger. These views are meant to supplement activation context lists, such as those developed by Anthropic.

The joint view shows individual contexts side by side. You can enable part-of-speech tagging or view inactive tokens through the top panel.

The merged view aggregates commonly occurring contexts and displays them in a branching format. These trees are instantiated as list structures, and subtree matches are located where possible.
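As a rough illustration of how contexts could be aggregated into a branching structure built from lists, here is a minimal sketch. It is not the code in `graphs.py`: it merges contexts only by shared token prefix, which is a simpler stand-in for the subtree matching described above. The function names and the `[token, count, children]` node layout are assumptions made for this example.

```python
def insert_context(tree, tokens):
    """Insert one token sequence into a tree of [token, count, children] nodes."""
    children = tree
    for tok in tokens:
        for node in children:
            if node[0] == tok:
                node[1] += 1  # an existing branch is shared by this context
                break
        else:
            node = [tok, 1, []]  # no matching branch, so start a new one
            children.append(node)
        children = node[2]
    return tree


def build_tree(contexts):
    """Merge all contexts into one nested-list tree (hypothetical helper)."""
    tree = []
    for tokens in contexts:
        insert_context(tree, tokens)
    return tree


def render(tree, depth=0):
    """Return indented text lines, one per node, with occurrence counts."""
    lines = []
    for token, count, children in tree:
        lines.append("  " * depth + f"{token} ({count})")
        lines.extend(render(children, depth + 1))
    return lines


# Toy contexts invented for this example.
contexts = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]
tree = build_tree(contexts)
print("\n".join(render(tree)))
```

Contexts that begin with the same tokens share a branch, and the count on each node records how many contexts pass through it.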

The updated merged view simplifies the presentation to primarily consider co-occurrence information, giving an overall picture of the relevant contexts for a feature. It fixes the overlap issues encountered earlier.
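One way to think about co-occurrence information is as a count of how many activation contexts each token appears in. The sketch below shows that idea under stated assumptions: the function name, the lowercasing step, and the stopword filter are choices made for this example and are not taken from the project code.

```python
from collections import Counter


def cooccurrence_counts(contexts, stopwords=frozenset()):
    """Count, for each token, the number of contexts it appears in."""
    counts = Counter()
    for tokens in contexts:
        # Use a set so a token repeated within one context counts once.
        counts.update({t.lower() for t in tokens} - stopwords)
    return counts


# Toy contexts invented for this example.
contexts = [
    ["The", "cat", "sat", "on", "the", "mat"],
    ["A", "cat", "ran"],
    ["The", "dog", "sat"],
]
counts = cooccurrence_counts(contexts, stopwords=frozenset({"the", "a", "on"}))

# Sort by descending count, then alphabetically for a stable order.
top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(top)
```

Tokens with the highest counts are the ones shared by the most contexts, which is the kind of summary an overview display can emphasize.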

Acknowledgements

This work was done for David Laidlaw's CSCI2370: Interdisciplinary Scientific Visualization class.
