
Tutorial Materials: Defending Against Generative AI Threats in NLP

Materials and paper list for the SBP-BRiMS 2024 tutorial: "Defending Against Generative AI Threats in NLP".

Tutorial authors and organizers:

  1. Amrita Bhattacharjee, Arizona State University
  2. Raha Moraffah, Worcester Polytechnic Institute
  3. Christopher Parisien, NVIDIA
  4. Huan Liu, Arizona State University

Paper and Resource List

❗ (Will be periodically updated. Please star or watch this repo to get notified of updates!)

📚 Introduction to Generative AI

  1. Overview paper: Generative AI [paper link]
  2. Generative AI vs. Discriminative AI [blogpost link]
  3. A Comprehensive Overview of Large Language Models [paper link]
  4. Examples of AI Image Generators [list + blogpost]
  5. Generative AI models for Audio and Music: text-to-music and text-to-audio via [AudioCraft by Meta AI], text-to-symbolic-music via [MuseCoco by Microsoft]
  6. Examples of AI Video Generators [list + blogpost]
  7. Examples of open-source AI Video Generators: [CogVideo] , [Text2Video-Zero] , [Open-Sora]

📚 Language Modeling and LLMs

History

  1. Class-Based n-gram Models of Natural Language [paper link]
  2. n-gram models [lecture notes] (see the short bigram sketch after this list)
  3. LSTM paper [paper link]
  4. LSTM Neural Networks for Language Modeling [paper link]
  5. Attention Is All You Need [paper link]
  6. An Overview and History of Large Language Models [blogpost link]
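As light background for items 1 and 2 above: an n-gram language model is essentially a table of conditional word counts. Below is a minimal bigram sketch; the toy corpus and function name are illustrative only, not taken from any of the listed papers.

```python
from collections import Counter, defaultdict

def train_bigram_lm(corpus):
    """Estimate P(next word | previous word) by maximum-likelihood counting."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    # Normalize counts into conditional probabilities.
    return {
        prev: {word: c / sum(nexts.values()) for word, c in nexts.items()}
        for prev, nexts in counts.items()
    }

lm = train_bigram_lm(["the cat sat", "the dog sat", "the cat ran"])
print(lm["the"])  # {'cat': 0.666..., 'dog': 0.333...}
print(lm["cat"])  # {'sat': 0.5, 'ran': 0.5}
```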

Training-related Steps

  1. Cross-Task Generalization via Natural Language Crowdsourcing Instructions [paper link]
  2. Fine-tuned Language Models are Zero-shot Learners [paper link]
  3. Training language models to follow instructions with human feedback [paper link]
  4. Direct Preference Optimization: Your Language Model is Secretly a Reward Model [paper link] (objective shown below)
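For quick reference, the DPO objective from item 4 can be written in the paper's notation as follows, where $\pi_\theta$ is the policy being trained, $\pi_{\text{ref}}$ is a frozen reference model, $\beta$ is a scaling coefficient, and $(x, y_w, y_l)$ is a prompt with a preferred and a dispreferred response:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$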

Evaluation

  1. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [paper link]
  2. MMLU: Measuring Massive Multitask Language Understanding [paper link]
  3. MMLU Leaderboard [leaderboard]
  4. LMSYS Chatbot Arena Leaderboard on Hugging Face [leaderboard]
  5. Cool GitHub repo with LLM evaluation benchmark resources [github repo]

📚 LLM Threats

Part 1: Attacks on LLMs

  1. Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection [paper link]
  2. Universal and Transferable Adversarial Attacks on Aligned Language Models [paper link]
  3. ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs [paper link]
  4. Stealing Part of a Production Language Model [paper link]
  5. Bing Chat: Data Exfiltration Exploit Explained [blogpost link]
  6. FakeToxicityPrompts: Automatic Red Teaming [blogpost link]

Part 2: Misuse of LLMs

  1. Defending Against Social Engineering Attacks in the Age of LLMs [paper link]
  2. Exploring the Deceptive Power of LLM-Generated Fake News: A Study of Real-World Detection Challenges [paper link]
  3. Can LLM-generated Misinformation be Detected? [paper link]
  4. The Looming Threat of Fake and LLM-generated LinkedIn Profiles: Challenges and Opportunities for Detection and Prevention [paper link]

📚 LLM Defenses and Safety

Tools

  1. garak: LLM vulnerability scanner
  2. NVIDIA NeMo Guardrails
  3. AEGIS from NVIDIA

Standard Defenses

  1. Improving Neural Language Modeling via Adversarial Training [paper link]
  2. Certifying LLM Safety against Adversarial Prompting [paper link]
  3. Towards Improving Adversarial Training of NLP Models [paper link]
  4. Adversarial Training for Large Neural Language Models [paper link]
  5. Adversarial Text Purification: A Large Language Model Approach for Defense [paper link]

Model Editing and Parameter-efficient methods

  1. DeTox: Toxic Subspace Projection for Model Editing [paper link]
  2. Editing Models with Task Arithmetic [paper link]
  3. Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks [paper link]
  4. Model Surgery: Modulating LLM’s Behavior Via Simple Parameter Editing [paper link]
  5. Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing [paper link]
  6. Activation Addition: Steering Language Models Without Optimization [paper link]
  7. Steering Llama 2 via Contrastive Activation Addition [paper link]
  8. Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models [paper link]
  9. What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [paper link]

Decoding-time methods

  1. RAIN: Your Language Models Can Align Themselves without Finetuning [paper link]
  2. Parameter-Efficient Detoxification with Contrastive Decoding [paper link]
  3. Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates [paper link]

Contact

For any questions, feedback, comments, or just to say hi, contact Amrita at [email protected].
