Eliciting Language Model Behaviors using Reverse Language Models

Abstract

Despite advances in fine-tuning methods, language models (LMs) continue to output toxic and harmful responses on worst-case inputs, including adversarial attacks and jailbreaks. We train an LM on tokens in reverse order, a "reverse LM", as a tool for identifying such worst-case inputs. By prompting a reverse LM with a problematic string, we can sample prefixes that are likely to precede the problematic suffix. We test our reverse LM by using it to guide beam search for prefixes that have a high probability of generating toxic statements when input to a forward LM. Our 160M-parameter reverse LM outperforms the existing state-of-the-art adversarial attack method, Greedy Coordinate Gradient (GCG; Zou et al., 2023), when measuring the probability of toxic continuations from the Pythia-160m LM. Unlike GCG, our method is black-box and does not require access to model weights to compute gradients. We also find that the prefixes generated by our reverse LM for the Pythia model are more likely than GCG-generated attacks to transfer to other models, also eliciting toxic responses from Llama 2.
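The idea in the abstract can be illustrated with a toy sketch. This is not the repository's implementation: it swaps the transformer LMs for count-based bigram models, uses a small benign corpus invented for illustration, and all function names (`train_bigram`, `propose_prefixes`, `suffix_log_prob`) are hypothetical. The reverse model is trained on reversed token sequences, a small beam search proposes prefixes for a target suffix, and the proposals are then ranked by how likely the forward model is to produce the suffix after each prefix.

```python
import math
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count bigrams: counts[prev][nxt] = number of times nxt follows prev."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def log_prob(counts, prev, nxt, vocab_size, alpha=1.0):
    """Add-alpha smoothed log P(nxt | prev)."""
    total = sum(counts[prev].values())
    return math.log((counts[prev][nxt] + alpha) / (total + alpha * vocab_size))

def propose_prefixes(rev_counts, suffix, prefix_len, beam_width, vocab):
    """Beam search with the reverse model: extend the suffix leftwards."""
    # Each beam is (reverse-model log-prob, tokens stored in reverse order).
    beams = [(0.0, list(reversed(suffix)))]
    for _ in range(prefix_len):
        candidates = []
        for score, rev_tokens in beams:
            for tok in vocab:
                lp = log_prob(rev_counts, rev_tokens[-1], tok, len(vocab))
                candidates.append((score + lp, rev_tokens + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    # Flip back to forward order and keep only the newly generated prefix.
    return [list(reversed(toks))[:prefix_len] for _, toks in beams]

def suffix_log_prob(fwd_counts, prefix, suffix, vocab_size):
    """Forward-model log P(suffix | prefix); prefix must be non-empty."""
    tokens = prefix + suffix
    return sum(
        log_prob(fwd_counts, tokens[i - 1], tokens[i], vocab_size)
        for i in range(len(prefix), len(tokens))
    )

# Toy, benign corpus (an assumption for illustration only).
corpus = [s.split() for s in [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "a cat slept on the sofa",
    "the dog sat on the rug",
]]
vocab = sorted({tok for seq in corpus for tok in seq})

fwd_counts = train_bigram(corpus)                      # forward LM stand-in
rev_counts = train_bigram([seq[::-1] for seq in corpus])  # reverse LM stand-in

suffix = ["on", "the", "mat"]
proposals = propose_prefixes(rev_counts, suffix, prefix_len=2,
                             beam_width=3, vocab=vocab)
# Rank the reverse-model proposals by the forward model's suffix likelihood.
ranked = sorted(
    proposals,
    key=lambda p: suffix_log_prob(fwd_counts, p, suffix, len(vocab)),
    reverse=True,
)
print(ranked[0])
```

Only the forward model's output probabilities are queried here, never its gradients, which mirrors the black-box property described above. With real LMs the same structure would apply, with tokenizer-level reversal and neural models in place of the bigram counts.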

abhay-sheshadri/reverse-dynamics-nlp
