[NAACL 2025] Extracting and Understanding the Superficial Knowledge in Alignment, Runjin Chen, Gabriel Jacob Perin, Xuxi Chen, Xilun Chen, Yan Han, Nina S. T. Hirata, Junyuan Hong, Bhavya Kailkhura

VITA-Group/Superficial_Alignment


Superficial Alignment

This repository contains the code for the paper "Extracting and Understanding the Superficial Knowledge in Alignment" (NAACL 2025).

Step 1: Extract token logits

bash scripts/extract_logit.sh

Step 2: Train linear model

bash scripts/train_logit.sh

Step 3: Run eval

bash scripts/run_eval.sh
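
The three steps depend on each other, so it can be convenient to run them in order and stop as soon as one fails. The following is a minimal sketch of such a wrapper, not part of the repository: the function name `run_pipeline` and the `DRY_RUN` variable are hypothetical, and it assumes the commands are run from the repository root, where the three scripts listed above live.

```shell
#!/usr/bin/env bash
# Hypothetical convenience wrapper (not shipped with the repository).
# Runs the three README steps in order and stops at the first failure.
# DRY_RUN=1 is an assumed switch that prints the commands instead of running them.

run_pipeline() {
    local step
    for step in scripts/extract_logit.sh scripts/train_logit.sh scripts/run_eval.sh; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Only show what would be executed.
            echo "bash ${step}"
        else
            # Progress goes to stderr so stdout stays clean for the scripts' own output.
            echo "Running ${step}" >&2
            bash "${step}" || return 1
        fi
    done
}
```

After sourcing the snippet, `DRY_RUN=1 run_pipeline` lists the commands without executing them, and `run_pipeline` runs them for real. Any paths, model names, or hyperparameters are still set inside the individual scripts.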
