[NAACL 2025] Extracting and Understanding the Superficial Knowledge in Alignment. Runjin Chen, Gabriel Jacob Perin, Xuxi Chen, Xilun Chen, Yan Han, Nina S. T. Hirata, Junyuan Hong, Bhavya Kailkhura
This repository contains the code for the paper "Extracting and Understanding the Superficial Knowledge in Alignment" (NAACL 2025). Run the three steps below in order.
Step 1: Extract token logits
bash scripts/extract_logit.sh
Step 2: Train the linear model
bash scripts/train_logit.sh
Step 3: Run evaluation
bash scripts/run_eval.sh
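To run all three steps in one go, a small wrapper can help. The following is a minimal sketch, not part of the repository: the function name `run_pipeline` is hypothetical, and it assumes the three scripts above live under `scripts/` relative to the current directory and need no extra arguments. It stops at the first missing or failing script, so a later step never runs on incomplete output.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the repository): runs the three
# README steps in order and stops at the first failure.
run_pipeline() {
    local step script
    for step in extract_logit train_logit run_eval; do
        script="scripts/${step}.sh"
        # Fail early if a step's script is not where the README says it is.
        if [ ! -f "$script" ]; then
            echo "missing script: $script" >&2
            return 1
        fi
        echo "==> $script"
        # Abort the pipeline if this step exits with a non-zero status.
        bash "$script" || return 1
    done
}
```

Source the file (or paste the function into a shell) and call `run_pipeline` from the repository root.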