Security

Different from the main README🕵️

  • Within this subtopic, we keep track of the latest articles to help researchers in this area quickly grasp recent trends.
  • In addition to the most recent updates, we add keywords to each subtopic so you can find content of interest more quickly.
  • Within each subtopic, we also profile scholars in the field whose work we admire and endorse; their work is often high-quality and forward-looking!

📑Papers

Date Institute Publication Paper Keywords
20.10 Facebook AI Research arxiv Recipes for Safety in Open-domain Chatbots Toxic Behavior&Open-domain
22.02 DeepMind EMNLP2022 Red Teaming Language Models with Language Models Red Teaming&Harm Test
22.03 OpenAI NIPS2022 Training language models to follow instructions with human feedback InstructGPT&RLHF&Harmless
22.04 Anthropic arxiv Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback Helpful&Harmless
22.05 UCSD EMNLP2022 An Empirical Analysis of Memorization in Fine-tuned Autoregressive Language Models Privacy Risks&Memorization
22.09 Anthropic arxiv Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned Red Teaming&Harmless&Helpful
22.12 Anthropic arxiv Constitutional AI: Harmlessness from AI Feedback Harmless&Self-improvement&RLAIF
23.07 UC Berkeley NIPS2023 Jailbroken: How Does LLM Safety Training Fail? Jailbreak&Competing Objectives&Mismatched Generalization
23.08 The Chinese University of Hong Kong (Shenzhen), Tencent AI Lab, The Chinese University of Hong Kong arxiv GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs Via Cipher Safety Alignment&Adversarial Attack
23.08 University College London, Tilburg University arxiv Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities Security&AI Alignment
23.09 Peking University arxiv RAIN: Your Language Models Can Align Themselves without Finetuning Self-boosting&Rewind Mechanisms
23.10 Princeton University, Virginia Tech, IBM Research, Stanford University arxiv FINE-TUNING ALIGNED LANGUAGE MODELS COMPROMISES SAFETY EVEN WHEN USERS DO NOT INTEND TO! Fine-tuning&Safety Risks&Adversarial Training
23.10 UC Riverside arXiv Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks Adversarial Attacks&Vulnerabilities&Model Security
23.10 Rice University NAACL2024(findings) Secure Your Model: An Effective Key Prompt Protection Mechanism for Large Language Models Key Prompt Protection&Large Language Models&Unauthorized Access Prevention
23.11 KAIST AI arxiv HARE: Explainable Hate Speech Detection with Step-by-Step Reasoning Hate Speech&Detection
23.11 CMU AACL2023 (ART of Safety workshop) Measuring Adversarial Datasets Adversarial Robustness&AI Safety&Adversarial Datasets
23.11 UIUC arxiv Removing RLHF Protections in GPT-4 via Fine-Tuning Remove Protection&Fine-Tuning
23.11 IT University of Copenhagen, University of Washington arxiv Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild Red Teaming
23.11 Fudan University, Shanghai AI Lab arxiv Fake Alignment: Are LLMs Really Aligned Well? Alignment Failure&Safety Evaluation
23.11 University of Southern California arxiv SAFER-INSTRUCT: Aligning Language Models with Automated Preference Data RLHF&Safety
23.11 Google Research arxiv AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications Adversarial Testing&AI-Assisted Red Teaming&Application Safety
23.11 Tencent AI Lab arxiv ADVERSARIAL PREFERENCE OPTIMIZATION Human Preference Alignment&Adversarial Preference Optimization&Annotation Reduction
23.11 Docta.ai arxiv Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models Data Credibility&Safety alignment
23.11 CIIRC CTU in Prague arxiv A Security Risk Taxonomy for Large Language Models Security risks&Taxonomy&Prompt-based attacks
23.11 Meta, University of Illinois Urbana-Champaign NAACL2024 MART: Improving LLM Safety with Multi-round Automatic Red-Teaming Automatic Red-Teaming&LLM Safety&Adversarial Prompt Writing
23.11 The Ohio State University, University of California Davis NAACL2024 How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities Open-Source LLMs&Malicious Demonstrations&Trustworthiness
23.12 Drexel University arXiv A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly Security&Privacy&Attacks
23.12 Tenyx arXiv Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation Geometric Interpretation&Intrinsic Dimension&Toxicity Detection
23.12 Independent (Now at Google DeepMind) arXiv Scaling Laws for Adversarial Attacks on Language Model Activations Adversarial Attacks&Language Model Activations&Scaling Laws
23.12 University of Liechtenstein, University of Duesseldorf arxiv NEGOTIATING WITH LLMS: PROMPT HACKS, SKILL GAPS, AND REASONING DEFICITS Negotiation&Reasoning&Prompt Hacking
23.12 University of Wisconsin Madison, University of Michigan Ann Arbor, ASU, Washington University arXiv Exploring the Limits of ChatGPT in Software Security Applications Software Security
23.12 GenAI at Meta arxiv Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations Human-AI Conversation&Safety Risk taxonomy
23.12 University of California Riverside, Microsoft arxiv Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack Safety Alignment&Summarization&Vulnerability
23.12 MIT, Harvard NIPS2023(Workshop) Forbidden Facts: An Investigation of Competing Objectives in Llama-2 Competing Objectives&Forbidden Fact Task&Model Decomposition
23.12 University of Science and Technology of China arxiv Silent Guardian: Protecting Text from Malicious Exploitation by Large Language Models Text Protection&Silent Guardian
23.12 OpenAI OpenAI Practices for Governing Agentic AI Systems Agentic AI Systems&LM Based Agent
23.12 University of Massachusetts Amherst, Columbia University, Google, Stanford University, New York University arxiv Learning and Forgetting Unsafe Examples in Large Language Models Safety Issues&ForgetFilter Algorithm&Unsafe Content
23.12 Tencent AI Lab, The Chinese University of Hong Kong arxiv Aligning Language Models with Judgments Judgment Alignment&Contrastive Unlikelihood Training
24.01 Delft University of Technology arxiv Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks Red Teaming&Hallucinations&Mathematics Tasks
24.01 Apart Research, University of Edinburgh, Imperial College London, University of Oxford arxiv Large Language Models Relearn Removed Concepts Neuroplasticity&Concept Redistribution
24.01 Tsinghua University, Xiaomi AI Lab, Huawei, Shenzhen Heytap Technology, vivo AI Lab, Viomi Technology, Li Auto, Beijing University of Posts and Telecommunications, Soochow University arxiv PERSONAL LLM AGENTS: INSIGHTS AND SURVEY ABOUT THE CAPABILITY EFFICIENCY AND SECURITY Intelligent Personal Assistant&LLM Agent&Security and Privacy
24.01 Zhongguancun Laboratory, Tsinghua University, Institute of Information Engineering Chinese Academy of Sciences, Ant Group arxiv Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems Safety&Risk Taxonomy&Mitigation Strategies
24.01 Google Research arxiv Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models Interpretability
24.01 Ben-Gurion University of the Negev Israel arxiv GPT IN SHEEP’S CLOTHING: THE RISK OF CUSTOMIZED GPTS GPTs&Cybersecurity&ChatGPT
24.01 Shanghai Jiao Tong University arxiv R-Judge: Benchmarking Safety Risk Awareness for LLM Agents LLM Agents&Safety Risk Awareness&Benchmark
24.01 Ant Group arxiv A FAST PERFORMANT SECURE DISTRIBUTED TRAINING FRAMEWORK FOR LLM Distributed LLM&Security
24.01 Shanghai Artificial Intelligence Laboratory, Dalian University of Technology, University of Science and Technology of China arxiv PsySafe: A Comprehensive Framework for Psychological-based Attack Defense and Evaluation of Multi-agent System Safety Multi-agent Systems&Agent Psychology&Safety
24.01 Rochester Institute of Technology arxiv Mitigating Security Threats in LLMs Security Threats&Prompt Injection&Jailbreaking
24.01 Johns Hopkins University, University of Pennsylvania, Ohio State University arxiv The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts Multilingualism&Safety&Resource Disparity
24.01 University of Florida arxiv Adaptive Text Watermark for Large Language Models Text Watermarking&Robustness&Security
24.01 The Hebrew University arXiv TRADEOFFS BETWEEN ALIGNMENT AND HELPFULNESS IN LANGUAGE MODELS Language Model Alignment&AI Safety&Representation Engineering
24.01 Google Research, Anthropic arxiv Gradient-Based Language Model Red Teaming Red Teaming&Safety&Prompt Learning
24.01 National University of Singapore, Pennsylvania State University arxiv Provably Robust Multi-bit Watermarking for AI-generated Text via Error Correction Code Watermarking&Error Correction Code&AI Ethics
24.01 Tsinghua University, University of California Los Angeles, WeChat AI Tencent Inc. arxiv Prompt-Driven LLM Safeguarding via Directed Representation Optimization Safety Prompts&Representation Optimization
24.02 Rensselaer Polytechnic Institute, IBM T.J. Watson Research Center, IBM Research arxiv Adaptive Primal-Dual Method for Safe Reinforcement Learning Safe Reinforcement Learning&Adaptive Primal-Dual&Adaptive Learning Rates
24.02 Jagiellonian University, University of Modena and Reggio Emilia, Alma Mater Studiorum University of Bologna, European University Institute arxiv No More Trade-Offs: GPT and Fully Informative Privacy Policies ChatGPT&Privacy Policies&Legal Requirements
24.02 Florida International University arxiv Security and Privacy Challenges of Large Language Models: A Survey Security&Privacy Challenges&Survey
24.02 Rutgers University, University of California, Santa Barbara, NEC Labs America arxiv TrustAgent: Towards Safe and Trustworthy LLM-based Agents through Agent Constitution LLM-based Agents&Safety&Trustworthiness
24.02 University of Maryland College Park, JPMorgan AI Research, University of Waterloo, Salesforce Research arxiv Shadowcast: Stealthy Data Poisoning Attacks against VLMs Vision-Language Models&Data Poisoning&Security
24.02 Shanghai Artificial Intelligence Laboratory, Harbin Institute of Technology, Beijing Institute of Technology, Chinese University of Hong Kong arxiv SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models Safety Benchmark&Safety Evaluation&Hierarchical Taxonomy
24.02 Fudan University arxiv ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages Tool Learning&Large Language Models (LLMs)&Safety Issues&ToolSword
24.02 Paul G. Allen School of Computer Science & Engineering, University of Washington arxiv SPML: A DSL for Defending Language Models Against Prompt Attacks Domain-Specific Language (DSL)&Chatbot Definitions&System Prompt Meta Language (SPML)
24.02 Tsinghua University arxiv ShieldLM: Empowering LLMs as Aligned Customizable and Explainable Safety Detectors Safety Detectors&Customizable&Explainable
24.02 Dalhousie University arxiv Immunization Against Harmful Fine-tuning Attacks Fine-tuning Attacks&Immunization
24.02 Chinese Academy of Sciences, University of Chinese Academy of Sciences, Alibaba Group arxiv SoFA: Shielded On-the-fly Alignment via Priority Rule Following Priority Rule Following&Alignment
24.02 Universidade Federal de Santa Catarina arxiv A Survey of Large Language Models in Cybersecurity Cybersecurity&Vulnerability Assessment
24.02 Zhejiang University arxiv PRSA: Prompt Reverse Stealing Attacks against Large Language Models Prompt Reverse Stealing Attacks&Security
24.02 Shanghai Artificial Intelligence Laboratory NAACL2024 Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey Large Language Models&Conversation Safety&Survey
24.03 Tulane University arxiv ENHANCING LLM SAFETY VIA CONSTRAINED DIRECT PREFERENCE OPTIMIZATION Reinforcement Learning&Human Feedback&Safety Constraints
24.03 University of Illinois Urbana-Champaign arxiv INJECAGENT: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents Tool Integration&Security&Indirect Prompt Injection
24.03 Harvard University arxiv Towards Safe and Aligned Large Language Models for Medicine Medical Safety&Alignment&Ethical Principles
24.03 Rensselaer Polytechnic Institute, University of Michigan, IBM Research, MIT-IBM Watson AI Lab arxiv ALIGNERS: DECOUPLING LLMS AND ALIGNMENT Alignment&Synthetic Data
24.03 MIT, Princeton University, Stanford University, Georgetown University, AI Risk and Vulnerability Alliance, Eleuther AI, Brown University, Carnegie Mellon University, Virginia Tech, Northeastern University, UCSB, University of Pennsylvania, UIUC arxiv A Safe Harbor for AI Evaluation and Red Teaming AI Evaluation&Red Teaming&Safe Harbor
24.03 University of Southern California arxiv Logits of API-Protected LLMs Leak Proprietary Information API-Protected LLMs&Softmax Bottleneck&Embedding Size Detection
24.03 University of Bristol arxiv Helpful or Harmful? Exploring the Efficacy of Large Language Models for Online Grooming Prevention Safety&Prompt Engineering
24.03 Xiamen University, Yanshan University, IDEA Research, Inner Mongolia University, Microsoft, Microsoft Research Asia arxiv Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models Safety&Guidelines&Alignment
24.03 Tianjin University, Zhengzhou University, China Academy of Information and Communications Technology arxiv OpenEval: Benchmarking Chinese LLMs across Capability, Alignment, and Safety Chinese LLMs&Benchmarking&Safety
24.03 Center for Cybersecurity Systems and Networks, AIShield Bosch Global Software Technologies Bengaluru India arxiv Mapping LLM Security Landscapes: A Comprehensive Stakeholder Risk Assessment Proposal LLM Security&Threat modeling&Risk Assessment
24.03 Queen’s University Belfast arxiv AI Safety: Necessary but insufficient and possibly problematic AI Safety&Transparency&Structural Harm
24.04 Provable Responsible AI and Data Analytics (PRADA) Lab, King Abdullah University of Science and Technology arxiv Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs Dialectical Alignment&3H Principle&Security Threats
24.04 LibrAI, Tsinghua University, Harbin Institute of Technology, Monash University, The University of Melbourne, MBZUAI arxiv Against The Achilles’ Heel: A Survey on Red Teaming for Generative Models Red Teaming&Safety
24.04 University of California, Santa Barbara, Meta AI arxiv Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models Safety&Helpfulness&Controllability
24.04 School of Information and Software Engineering, University of Electronic Science and Technology of China arxiv Exploring Backdoor Vulnerabilities of Chat Models Backdoor Attacks&Chat Models&Security
24.04 Enkrypt AI arxiv INCREASED LLM VULNERABILITIES FROM FINE-TUNING AND QUANTIZATION Fine-tuning&Quantization&LLM Vulnerabilities
24.04 Tongji University, Tsinghua University, Beijing University of Technology, Nanyang Technological University, Peng Cheng Laboratory arxiv Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security Multimodal Large Language Models&Security Vulnerabilities&Image Inputs
24.04 University of Washington, Carnegie Mellon University, University of British Columbia, Vector Institute for AI arxiv CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack of) Multicultural Knowledge AI-Assisted Red-Teaming&Multicultural Knowledge
24.04 Nanjing University DLSP 2024 Subtoxic Questions: Dive Into Attitude Change of LLM’s Response in Jailbreak Attempts Jailbreak&Subtoxic Questions&GAC Model
24.04 Innodata arxiv Benchmarking Llama2, Mistral, Gemma, and GPT for Factuality, Toxicity, Bias, and Propensity for Hallucinations Evaluation&Safety
24.04 University of Cambridge, New York University, ETH Zurich arxiv Foundational Challenges in Assuring Alignment and Safety of Large Language Models Alignment&Safety
24.04 Zhejiang University arxiv TransLinkGuard: Safeguarding Transformer Models Against Model Stealing in Edge Deployment Intellectual Property Protection&Edge-deployed Transformer Model
24.04 Harvard University arxiv More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness Reinforcement Learning from Human Feedback&Trustworthiness
24.04 CSIRO’s Data61 ACM International Conference on AI-powered Software An AI System Evaluation Framework for Advancing AI Safety: Terminology, Taxonomy, Lifecycle Mapping AI Safety&Evaluation Framework&AI Lifecycle Mapping
24.05 University of Maryland arxiv Constrained Decoding for Secure Code Generation Code Generation&Code LLM&Secure Code Generation&AI Safety
24.05 Huazhong University of Science and Technology arxiv Large Language Models for Cyber Security: A Systematic Literature Review Cybersecurity&Systematic Review
24.05 CSAIL and CBMM, MIT arxiv SecureLLM: Using Compositionality to Build Provably Secure Language Models for Private, Sensitive, and Secret Data SecureLLM&Compositionality
24.05 Carnegie Mellon University arxiv Human–AI Safety: A Descendant of Generative AI and Control Systems Safety Human–AI Safety&Generative AI
24.05 University of York arxiv Safe Reinforcement Learning in Black-Box Environments via Adaptive Shielding Safe Reinforcement Learning&Black-Box Environments&Adaptive Shielding
24.05 Princeton University arxiv AI Risk Management Should Incorporate Both Safety and Security AI Safety&AI Security&Risk Management
24.05 University of Oslo arxiv AI Safety: A Climb to Armageddon? AI Safety&Existential Risk&AI Governance
24.06 Zscaler, Inc. arxiv Exploring Vulnerabilities and Protections in Large Language Models: A Survey Prompt Hacking&Adversarial Attacks&Survey
24.06 Texas A & M University - San Antonio arxiv Transforming Computer Security and Public Trust Through the Exploration of Fine-Tuning Large Language Models Fine-Tuning&Cyber Security
24.06 Alibaba Group arxiv How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States LLM Safety&Alignment&Jailbreak
24.06 UC Davis arxiv Security of AI Agents Security&AI Agents&Vulnerabilities
24.06 University of Connecticut USENIX Security ‘24 An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulnerabilities against Strong Detection Backdoor Attack&Code Completion Models&Vulnerability Detection
24.06 University of California, Irvine arxiv TorchOpera: A Compound AI System for LLM Safety TorchOpera&LLM Safety&Compound AI System
24.06 NVIDIA Corporation arxiv garak: A Framework for Security Probing Large Language Models garak&Security Probing
24.06 Carnegie Mellon University arxiv Current State of LLM Risks and AI Guardrails LLM Risks&AI Guardrails
24.06 Johns Hopkins University arxiv Every Language Counts: Learn and Unlearn in Multilingual LLMs Multilingual LLMs&Fake Information&Unlearning
24.06 Tsinghua University arxiv Finding Safety Neurons in Large Language Models Safety Neurons&Mechanistic Interpretability&AI Safety
24.06 Center for AI Safety and Governance, Institute for AI, Peking University arxiv SAFESORA: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset Safety Alignment&Text2Video Generation
24.06 Samsung R&D Institute UK, KAUST, University of Oxford arxiv Model Merging and Safety Alignment: One Bad Model Spoils the Bunch Model Merging&Safety Alignment
24.06 Hofstra University arxiv Analyzing Multi-Head Attention on Trojan BERT Models Trojan Attack&BERT Models&Multi-Head Attention
24.06 Fudan University arxiv SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance Safety Alignment&Jailbreak Attacks&Response Disparity
24.06 Stony Brook University NAACL 2024 Workshop Automated Adversarial Discovery for Safety Classifiers Safety Classifiers&Adversarial Attacks&Toxicity
24.07 University of Utah arxiv Beyond Perplexity: Multi-dimensional Safety Evaluation of LLM Compression Model Compression&Safety Evaluation
24.07 University of Alberta arxiv Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture Multilingual Blending&LLM Safety Alignment&Language Mixture
24.07 Singapore National Eye Centre arxiv A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models – Safety, Consensus, Objectivity, Reproducibility and Explainability Evaluation Framework
24.07 Microsoft arxiv SLIP: Securing LLM’s IP Using Weights Decomposition Hybrid Inference&Model Security&Weights Decomposition
24.07 Microsoft arxiv Phi-3 Safety Post-Training: Aligning Language Models with a “Break-Fix” Cycle Phi-3&Safety Post-Training
24.07 Tsinghua University arxiv Course-Correction: Safety Alignment Using Synthetic Preferences Course-Correction&Safety Alignment&Synthetic Preferences
24.07 Northwestern University arxiv From Sands to Mansions: Enabling Automatic Full-Life-Cycle Cyberattack Construction with LLM Cyberattack Construction&Full-Life-Cycle
24.07 Singapore University of Technology and Design arxiv AI Safety in Generative AI Large Language Models: A Survey Generative AI&AI Safety
24.07 Lehigh University arxiv Blockchain for Large Language Model Security and Safety: A Holistic Survey Blockchain&Security&Safety
24.08 OpenAI OpenAI Rule-Based Rewards for Language Model Safety Reinforcement Learning&Safety&Rule-Based Rewards
24.08 University of Texas at Austin arxiv HIDE AND SEEK: Fingerprinting Large Language Models with Evolutionary Learning Model Fingerprinting&In-context Learning
24.08 Technical University of Munich arxiv Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study Secure Code Assessment&Vulnerability Detection
24.08 Offenburg University of Applied Sciences arxiv "You still have to study" - On the Security of LLM generated code Code Security&Prompting Techniques
24.08 University of Connecticut arxiv Clip2Safety: A Vision Language Model for Interpretable and Fine-Grained Detection of Safety Compliance in Diverse Workplaces Vision Language Model&Safety Compliance&Personal Protective Equipment Detection
24.08 Pabna University of Science and Technology arxiv Risks, Causes, and Mitigations of Widespread Deployments of Large Language Models (LLMs): A Survey Privacy&Bias&Interpretability
24.08 Quinnipiac University arxiv Is Generative AI the Next Tactical Cyber Weapon For Threat Actors? Unforeseen Implications of AI Generated Cyber Attacks Generative AI&Cybersecurity&Cyber Attacks
24.08 Nanyang Technological University arxiv Trustworthy, Responsible, and Safe AI: A Comprehensive Architectural Framework for AI Safety with Challenges and Mitigations AI Safety&Trustworthy&Responsible
24.08 King Abdullah University of Science and Technology arxiv Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models Safety&Helpfulness&LLM Alignment
24.08 University of Calgary arxiv Trustworthy and Responsible AI for Human-Centric Autonomous Decision-Making Systems Trustworthy AI&Algorithmic Bias&Responsible AI
24.08 University of Oxford arxiv AI Security Audits: Challenges and Innovations in Assessing Large Language Models AI Security Audits&Vulnerability Assessment&AI Ethics
24.08 University of Science and Technology of China arxiv Safety Layers of Aligned Large Language Models: The Key to LLM Security Aligned LLM&Safety Layers&Security Degradation
24.09 University of Texas at San Antonio arxiv Enhancing Source Code Security with LLMs: Demystifying The Challenges and Generating Reliable Repairs Source Code Security&LLMs&Reinforcement Learning
24.09 The Hong Kong Polytechnic University arxiv Alignment-Aware Model Extraction Attacks on Large Language Models Model Extraction Attacks&LLM Alignment&Watermark Resistance
24.09 University of Oxford, Redwood Research arxiv Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols AI Control&Safety Protocols&Game Theory
24.09 University of Galway ECAI AIEB Workshop Ethical AI Governance: Methods for Evaluating Trustworthy AI Trustworthy AI&Ethics&AI Evaluation
24.09 University of Texas at San Antonio arxiv AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing Multi-Agent Systems&Code Security&Fuzz Testing&Static Analysis
24.09 Tsinghua University arxiv Language Models Learn to Mislead Humans via RLHF Reinforcement Learning from Human Feedback (RLHF)&U-SOPHISTRY&Misleading AI
24.09 Stevens Institute of Technology arxiv Measuring Copyright Risks of Large Language Model via Partial Information Probing Copyright&Partial Information Probing
24.09 IBM Research arxiv Attack Atlas: A Practitioner’s Perspective on Challenges and Pitfalls in Red Teaming GenAI Red Teaming&LLM Security&Adversarial Attacks
24.09 Pengcheng Laboratory arxiv Multi-Designated Detector Watermarking for Language Models Watermarking&Claimability&Multi-designated Verifier Signature
24.09 ETH Zurich arxiv An Adversarial Perspective on Machine Unlearning for AI Safety Machine Unlearning&Adversarial Attacks&Unlearning Robustness
24.10 Google DeepMind arxiv A Watermark for Black-Box Language Models Watermarking&Black-Box Models&LLM Detection
24.10 Mohamed Bin Zayed University of Artificial Intelligence arxiv Optimizing Adaptive Attacks Against Content Watermarks for Language Models Watermarking&Adaptive Attacks&LLM Security
24.10 Rice University, Rutgers University arxiv Taylor Unswift: Secured Weight Release for Large Language Models via Taylor Expansion Taylor Expansion&Model Security
24.10 PeopleTec arxiv Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders Cybersecurity&Hallucinations
24.10 Fondazione Bruno Kessler, Université Côte d’Azur EMNLP 2024 Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering Counterspeech&Safety Guardrails
24.10 University of California, Davis, AWS AI Labs arxiv Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models Safety alignment&Vision-Language models&Cross-modality representation manipulation
24.10 North Carolina State University arxiv Superficial Safety Alignment Hypothesis: The Need for Efficient and Robust Safety Mechanisms in LLMs Superficial safety alignment&Safety mechanisms&Safety-critical components
24.10 Shanghai Jiao Tong University, Chinese University of Hong Kong (Shenzhen), Tsinghua University arxiv ACHILLES’ HEEL IN SEMI-OPEN LLMS: HIDING BOTTOM AGAINST RECOVERY ATTACKS Semi-open LLMs&Recovery attacks&Model resilience
24.10 University of Tulsa arxiv Weak-to-Strong Generalization beyond Accuracy: A Pilot Study in Safety, Toxicity, and Legal Reasoning Weak-to-Strong Generalization&Safety&Toxicity
24.10 Aalborg University arxiv Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis Language confusion&Multilingual LLMs&Security vulnerabilities
24.10 Carnegie Mellon University arxiv Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents LLM Safety&Browser Agents&Red Teaming
24.10 Palisade Research arxiv LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild LLM Agents&Honeypots&Cybersecurity
24.10 University of Pittsburgh arxiv Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents Embodied Agents&Multimodal Safety&Active Learning
24.10 CSIRO’s Data61 arxiv From Solitary Directives to Interactive Encouragement! LLM Secure Code Generation by Natural Language Prompting Secure Code Generation&Encouragement Prompting
24.10 AppCubic arxiv Jailbreaking and Mitigation of Vulnerabilities in Large Language Models Prompt Injection&Jailbreaking&AI Security
24.10 UC Berkeley arxiv SAFETYANALYST: Interpretable, Transparent, and Steerable LLM Safety Moderation LLM Safety&Interpretability&Content Moderation
24.10 ShanghaiTech University arxiv Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization Safety Alignment&Reinforcement Learning&Policy Optimization
24.11 Zhejiang University arxiv Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control Trustworthiness&Sparse Activation Control&Representation Control
24.11 University of California, Riverside arxiv Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models Vision-Language Models&Safety Alignment&Cross-Layer Vulnerability
24.11 National University of Singapore EMNLP 2024 Multi-expert Prompting Improves Reliability, Safety and Usefulness of Large Language Models Multi-expert Prompting&LLM Safety&Reliability&Usefulness
24.11 OpenAI NeurIPS 2024 Rule Based Rewards for Language Model Safety Rule Based Rewards&Safety Alignment&AI Feedback
24.11 Center for Automation and Robotics, Spanish National Research Council arXiv Can Adversarial Attacks by Large Language Models Be Attributed? Adversarial Attribution&LLM Security&Formal Language Theory
24.11 McGill University arXiv Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset Helpful and Harmless Dataset&Safety Trade-offs&Bias Analysis
24.11 Fudan University arxiv Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding Text-to-Image Generation&Safety&Prompt Embedding Sanitization
24.11 Meta arxiv Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations Multimodal LLM&Content Moderation&Adversarial Robustness
24.11 Columbia University arxiv When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations Backdoor Attacks&Explainability
24.11 Ben-Gurion University of the Negev arxiv The Information Security Awareness of Large Language Models Information Security Awareness&Benchmarking
24.11 Fordham University arxiv Next-Generation Phishing: How LLM Agents Empower Cyber Attackers Phishing Detection&Cybersecurity
24.11 University of Pennsylvania, IBM T.J. Watson Research Center arxiv Cyber-Attack Technique Classification Using Two-Stage Trained Large Language Models Cyber-Attack Classification&Two-Stage Training
24.12 UC Berkeley arxiv Trust & Safety of LLMs and LLMs in Trust & Safety Trust and Safety&Prompt Injection
24.12 Harvard Kennedy School, Avant Research Group arxiv Evaluating Large Language Models' Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects Phishing Attacks&Human-in-the-loop
24.12 University of Massachusetts arxiv Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness Instruction Tuning&Safety&Helpfulness
24.12 University of New South Wales arxiv How Can LLMs and Knowledge Graphs Contribute to Robot Safety? A Few-Shot Learning Approach Robot Safety&Few-Shot Learning&Knowledge Graph Prompting
24.12 Örebro University arxiv Large Language Models and Code Security: A Systematic Literature Review LLM-Generated Code&Vulnerability Detection&Data Poisoning Attacks
24.12 Algiers Research Institute arxiv On the Validity of Traditional Vulnerability Scoring Systems for Adversarial Attacks against LLMs Adversarial Attacks&Vulnerability Metrics&Risk Assessment
25.01 Meta arxiv MLLM-as-a-Judge for Image Safety without Human Labeling Image Safety&Zero-Shot Judgment&Multimodal Large Language Models
25.01 FAU Erlangen-Nürnberg arxiv Refusal Behavior in Large Language Models: A Nonlinear Perspective Refusal Behavior&Mechanistic Interpretability&AI Alignment

💻Presentations & Talks

📖Tutorials & Workshops

Date Type Title URL
23.10 Tutorials Awesome-LLM-Safety link

📰News & Articles

Date Type Title URL
23.01 video ChatGPT and InstructGPT: Aligning Language Models to Human Intention link
23.06 Report “Dual-use dilemma” for GenAI Workshop Summarization link
23.10 News Joint Statement on AI Safety and Openness link

🧑‍🏫Scholars