| 20.10 | Facebook AI Research | arxiv | Recipes for Safety in Open-domain Chatbots | Toxic Behavior&Open-domain |
| 22.02 | DeepMind | EMNLP2022 | Red Teaming Language Models with Language Models | Red Teaming&Harm Test |
| 22.03 | OpenAI | NIPS2022 | Training language models to follow instructions with human feedback | InstructGPT&RLHF&Harmless |
| 22.04 | Anthropic | arxiv | Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback | Helpful&Harmless |
| 22.05 | UCSD | EMNLP2022 | An Empirical Analysis of Memorization in Fine-tuned Autoregressive Language Models | Privacy Risks&Memorization |
| 22.09 | Anthropic | arxiv | Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned | Red Teaming&Harmless&Helpful |
| 22.12 | Anthropic | arxiv | Constitutional AI: Harmlessness from AI Feedback | Harmless&Self-improvement&RLAIF |
| 23.07 | UC Berkeley | NIPS2023 | Jailbroken: How Does LLM Safety Training Fail? | Jailbreak&Competing Objectives&Mismatched Generalization |
| 23.08 | The Chinese University of Hong Kong (Shenzhen), Tencent AI Lab, The Chinese University of Hong Kong | arxiv | GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs Via Cipher | Safety Alignment&Adversarial Attack |
| 23.08 | University College London, Tilburg University | arxiv | Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities | Security&AI Alignment |
| 23.09 | Peking University | arxiv | RAIN: Your Language Models Can Align Themselves without Finetuning | Self-boosting&Rewind Mechanisms |
| 23.10 | Princeton University, Virginia Tech, IBM Research, Stanford University | arxiv | Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Fine-tuning&Safety Risks&Adversarial Training |
| 23.10 | UC Riverside | arxiv | Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks | Adversarial Attacks&Vulnerabilities&Model Security |
| 23.10 | Rice University | NAACL2024(findings) | Secure Your Model: An Effective Key Prompt Protection Mechanism for Large Language Models | Key Prompt Protection&Large Language Models&Unauthorized Access Prevention |
| 23.11 | KAIST AI | arxiv | HARE: Explainable Hate Speech Detection with Step-by-Step Reasoning | Hate Speech&Detection |
| 23.11 | CMU | AACL2023(ART or Safety workshop) | Measuring Adversarial Datasets | Adversarial Robustness&AI Safety&Adversarial Datasets |
| 23.11 | UIUC | arxiv | Removing RLHF Protections in GPT-4 via Fine-Tuning | Remove Protection&Fine-Tuning |
| 23.11 | IT University of Copenhagen, University of Washington | arxiv | Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild | Red Teaming |
| 23.11 | Fudan University, Shanghai AI Lab | arxiv | Fake Alignment: Are LLMs Really Aligned Well? | Alignment Failure&Safety Evaluation |
| 23.11 | University of Southern California | arxiv | SAFER-INSTRUCT: Aligning Language Models with Automated Preference Data | RLHF&Safety |
| 23.11 | Google Research | arxiv | AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications | Adversarial Testing&AI-Assisted Red Teaming&Application Safety |
| 23.11 | Tencent AI Lab | arxiv | Adversarial Preference Optimization | Human Preference Alignment&Adversarial Preference Optimization&Annotation Reduction |
| 23.11 | Docta.ai | arxiv | Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models | Data Credibility&Safety alignment |
| 23.11 | CIIRC CTU in Prague | arxiv | A Security Risk Taxonomy for Large Language Models | Security risks&Taxonomy&Prompt-based attacks |
| 23.11 | Meta, University of Illinois Urbana-Champaign | NAACL2024 | MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | Automatic Red-Teaming&LLM Safety&Adversarial Prompt Writing |
| 23.11 | The Ohio State University, University of California, Davis | NAACL2024 | How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities | Open-Source LLMs&Malicious Demonstrations&Trustworthiness |
| 23.12 | Drexel University | arxiv | A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly | Security&Privacy&Attacks |
| 23.12 | Tenyx | arxiv | Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation | Geometric Interpretation&Intrinsic Dimension&Toxicity Detection |
| 23.12 | Independent (Now at Google DeepMind) | arxiv | Scaling Laws for Adversarial Attacks on Language Model Activations | Adversarial Attacks&Language Model Activations&Scaling Laws |
| 23.12 | University of Liechtenstein, University of Duesseldorf | arxiv | Negotiating with LLMs: Prompt Hacks, Skill Gaps, and Reasoning Deficits | Negotiation&Reasoning&Prompt Hacking |
| 23.12 | University of Wisconsin Madison, University of Michigan Ann Arbor, ASU, Washington University | arxiv | Exploring the Limits of ChatGPT in Software Security Applications | Software Security |
| 23.12 | GenAI at Meta | arxiv | Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations | Human-AI Conversation&Safety Risk taxonomy |
| 23.12 | University of California Riverside, Microsoft | arxiv | Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack | Safety Alignment&Summarization&Vulnerability |
| 23.12 | MIT, Harvard | NIPS2023(Workshop) | Forbidden Facts: An Investigation of Competing Objectives in Llama-2 | Competing Objectives&Forbidden Fact Task&Model Decomposition |
| 23.12 | University of Science and Technology of China | arxiv | Silent Guardian: Protecting Text from Malicious Exploitation by Large Language Models | Text Protection&Silent Guardian |
| 23.12 | OpenAI | OpenAI | Practices for Governing Agentic AI Systems | Agentic AI Systems&LM Based Agent |
| 23.12 | University of Massachusetts Amherst, Columbia University, Google, Stanford University, New York University | arxiv | Learning and Forgetting Unsafe Examples in Large Language Models | Safety Issues&ForgetFilter Algorithm&Unsafe Content |
| 23.12 | Tencent AI Lab, The Chinese University of Hong Kong | arxiv | Aligning Language Models with Judgments | Judgment Alignment&Contrastive Unlikelihood Training |
| 24.01 | Delft University of Technology | arxiv | Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks | Red Teaming&Hallucinations&Mathematics Tasks |
| 24.01 | Apart Research, University of Edinburgh, Imperial College London, University of Oxford | arxiv | Large Language Models Relearn Removed Concepts | Neuroplasticity&Concept Redistribution |
| 24.01 | Tsinghua University, Xiaomi AI Lab, Huawei, Shenzhen Heytap Technology, vivo AI Lab, Viomi Technology, Li Auto, Beijing University of Posts and Telecommunications, Soochow University | arxiv | Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security | Intelligent Personal Assistant&LLM Agent&Security and Privacy |
| 24.01 | Zhongguancun Laboratory, Tsinghua University, Institute of Information Engineering Chinese Academy of Sciences, Ant Group | arxiv | Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems | Safety&Risk Taxonomy&Mitigation Strategies |
| 24.01 | Google Research | arxiv | Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models | Interpretability |
| 24.01 | Ben-Gurion University of the Negev, Israel | arxiv | GPT in Sheep’s Clothing: The Risk of Customized GPTs | GPTs&Cybersecurity&ChatGPT |
| 24.01 | Shanghai Jiao Tong University | arxiv | R-Judge: Benchmarking Safety Risk Awareness for LLM Agents | LLM Agents&Safety Risk Awareness&Benchmark |
| 24.01 | Ant Group | arxiv | A Fast, Performant, Secure Distributed Training Framework for LLM | Distributed LLM&Security |
| 24.01 | Shanghai Artificial Intelligence Laboratory, Dalian University of Technology, University of Science and Technology of China | arxiv | PsySafe: A Comprehensive Framework for Psychological-based Attack Defense and Evaluation of Multi-agent System Safety | Multi-agent Systems&Agent Psychology&Safety |
| 24.01 | Rochester Institute of Technology | arxiv | Mitigating Security Threats in LLMs | Security Threats&Prompt Injection&Jailbreaking |
| 24.01 | Johns Hopkins University, University of Pennsylvania, Ohio State University | arxiv | The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts | Multilingualism&Safety&Resource Disparity |
| 24.01 | University of Florida | arxiv | Adaptive Text Watermark for Large Language Models | Text Watermarking&Robustness&Security |
| 24.01 | The Hebrew University | arxiv | Tradeoffs Between Alignment and Helpfulness in Language Models | Language Model Alignment&AI Safety&Representation Engineering |
| 24.01 | Google Research, Anthropic | arxiv | Gradient-Based Language Model Red Teaming | Red Teaming&Safety&Prompt Learning |
| 24.01 | National University of Singapore, Pennsylvania State University | arxiv | Provably Robust Multi-bit Watermarking for AI-generated Text via Error Correction Code | Watermarking&Error Correction Code&AI Ethics |
| 24.01 | Tsinghua University, University of California Los Angeles, WeChat AI Tencent Inc. | arxiv | Prompt-Driven LLM Safeguarding via Directed Representation Optimization | Safety Prompts&Representation Optimization |
| 24.02 | Rensselaer Polytechnic Institute, IBM T.J. Watson Research Center, IBM Research | arxiv | Adaptive Primal-Dual Method for Safe Reinforcement Learning | Safe Reinforcement Learning&Adaptive Primal-Dual&Adaptive Learning Rates |
| 24.02 | Jagiellonian University, University of Modena and Reggio Emilia, Alma Mater Studiorum University of Bologna, European University Institute | arxiv | No More Trade-Offs: GPT and Fully Informative Privacy Policies | ChatGPT&Privacy Policies&Legal Requirements |
| 24.02 | Florida International University | arxiv | Security and Privacy Challenges of Large Language Models: A Survey | Security&Privacy Challenges&Survey |
| 24.02 | Rutgers University, University of California, Santa Barbara, NEC Labs America | arxiv | TrustAgent: Towards Safe and Trustworthy LLM-based Agents through Agent Constitution | LLM-based Agents&Safety&Trustworthiness |
| 24.02 | University of Maryland College Park, JPMorgan AI Research, University of Waterloo, Salesforce Research | arxiv | Shadowcast: Stealthy Data Poisoning Attacks against VLMs | Vision-Language Models&Data Poisoning&Security |
| 24.02 | Shanghai Artificial Intelligence Laboratory, Harbin Institute of Technology, Beijing Institute of Technology, Chinese University of Hong Kong | arxiv | SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models | Safety Benchmark&Safety Evaluation&Hierarchical Taxonomy |
| 24.02 | Fudan University | arxiv | ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages | Tool Learning&Large Language Models (LLMs)&Safety Issues&ToolSword |
| 24.02 | Paul G. Allen School of Computer Science & Engineering, University of Washington | arxiv | SPML: A DSL for Defending Language Models Against Prompt Attacks | Domain-Specific Language (DSL)&Chatbot Definitions&System Prompt Meta Language (SPML) |
| 24.02 | Tsinghua University | arxiv | ShieldLM: Empowering LLMs as Aligned Customizable and Explainable Safety Detectors | Safety Detectors&Customizable&Explainable |
| 24.02 | Dalhousie University | arxiv | Immunization Against Harmful Fine-tuning Attacks | Fine-tuning Attacks&Immunization |
| 24.02 | Chinese Academy of Sciences, University of Chinese Academy of Sciences, Alibaba Group | arxiv | SoFA: Shielded On-the-fly Alignment via Priority Rule Following | Priority Rule Following&Alignment |
| 24.02 | Universidade Federal de Santa Catarina | arxiv | A Survey of Large Language Models in Cybersecurity | Cybersecurity&Vulnerability Assessment |
| 24.02 | Zhejiang University | arxiv | PRSA: Prompt Reverse Stealing Attacks against Large Language Models | Prompt Reverse Stealing Attacks&Security |
| 24.02 | Shanghai Artificial Intelligence Laboratory | NAACL2024 | Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey | Large Language Models&Conversation Safety&Survey |
| 24.03 | Tulane University | arxiv | Enhancing LLM Safety via Constrained Direct Preference Optimization | Reinforcement Learning&Human Feedback&Safety Constraints |
| 24.03 | University of Illinois Urbana-Champaign | arxiv | InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents | Tool Integration&Security&Indirect Prompt Injection |
| 24.03 | Harvard University | arxiv | Towards Safe and Aligned Large Language Models for Medicine | Medical Safety&Alignment&Ethical Principles |
| 24.03 | Rensselaer Polytechnic Institute, University of Michigan, IBM Research, MIT-IBM Watson AI Lab | arxiv | Aligners: Decoupling LLMs and Alignment | Alignment&Synthetic Data |
| 24.03 | MIT, Princeton University, Stanford University, Georgetown University, AI Risk and Vulnerability Alliance, Eleuther AI, Brown University, Carnegie Mellon University, Virginia Tech, Northeastern University, UCSB, University of Pennsylvania, UIUC | arxiv | A Safe Harbor for AI Evaluation and Red Teaming | AI Evaluation&Red Teaming&Safe Harbor |
| 24.03 | University of Southern California | arxiv | Logits of API-Protected LLMs Leak Proprietary Information | API-Protected LLMs&Softmax Bottleneck&Embedding Size Detection |
| 24.03 | University of Bristol | arxiv | Helpful or Harmful? Exploring the Efficacy of Large Language Models for Online Grooming Prevention | Safety&Prompt Engineering |
| 24.03 | Xiamen University, Yanshan University, IDEA Research, Inner Mongolia University, Microsoft, Microsoft Research Asia | arxiv | Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models | Safety&Guidelines&Alignment |
| 24.03 | Tianjin University, Zhengzhou University, China Academy of Information and Communications Technology | arxiv | OpenEval: Benchmarking Chinese LLMs across Capability, Alignment, and Safety | Chinese LLMs&Benchmarking&Safety |
| 24.03 | Center for Cybersecurity Systems and Networks, AIShield Bosch Global Software Technologies, Bengaluru, India | arxiv | Mapping LLM Security Landscapes: A Comprehensive Stakeholder Risk Assessment Proposal | LLM Security&Threat modeling&Risk Assessment |
| 24.03 | Queen’s University Belfast | arxiv | AI Safety: Necessary but insufficient and possibly problematic | AI Safety&Transparency&Structural Harm |
| 24.04 | Provable Responsible AI and Data Analytics (PRADA) Lab, King Abdullah University of Science and Technology | arxiv | Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs | Dialectical Alignment&3H Principle&Security Threats |
| 24.04 | LibrAI, Tsinghua University, Harbin Institute of Technology, Monash University, The University of Melbourne, MBZUAI | arxiv | Against The Achilles’ Heel: A Survey on Red Teaming for Generative Models | Red Teaming&Safety |
| 24.04 | University of California, Santa Barbara, Meta AI | arxiv | Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models | Safety&Helpfulness&Controllability |
| 24.04 | School of Information and Software Engineering, University of Electronic Science and Technology of China | arxiv | Exploring Backdoor Vulnerabilities of Chat Models | Backdoor Attacks&Chat Models&Security |
| 24.04 | Enkrypt AI | arxiv | Increased LLM Vulnerabilities from Fine-tuning and Quantization | Fine-tuning&Quantization&LLM Vulnerabilities |
| 24.04 | Tongji University, Tsinghua University, Beijing University of Technology, Nanyang Technological University, Peng Cheng Laboratory | arxiv | Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security | Multimodal Large Language Models&Security Vulnerabilities&Image Inputs |
| 24.04 | University of Washington, Carnegie Mellon University, University of British Columbia, Vector Institute for AI | arxiv | CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack of) Multicultural Knowledge | AI-Assisted Red-Teaming&Multicultural Knowledge |
| 24.04 | Nanjing University | DLSP 2024 | Subtoxic Questions: Dive Into Attitude Change of LLM’s Response in Jailbreak Attempts | Jailbreak&Subtoxic Questions&GAC Model |
| 24.04 | Innodata | arxiv | Benchmarking Llama2, Mistral, Gemma, and GPT for Factuality, Toxicity, Bias, and Propensity for Hallucinations | Evaluation&Safety |
| 24.04 | University of Cambridge, New York University, ETH Zurich | arxiv | Foundational Challenges in Assuring Alignment and Safety of Large Language Models | Alignment&Safety |
| 24.04 | Zhejiang University | arxiv | TransLinkGuard: Safeguarding Transformer Models Against Model Stealing in Edge Deployment | Intellectual Property Protection&Edge-deployed Transformer Model |
| 24.04 | Harvard University | arxiv | More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness | Reinforcement Learning from Human Feedback&Trustworthiness |
| 24.05 | University of Maryland | arxiv | Constrained Decoding for Secure Code Generation | Code Generation&Code LLM&Secure Code Generation&AI Safety |
| 24.05 | Huazhong University of Science and Technology | arxiv | Large Language Models for Cyber Security: A Systematic Literature Review | Cybersecurity&Systematic Review |
| 24.04 | CSIRO’s Data61 | ACM International Conference on AI-powered Software | An AI System Evaluation Framework for Advancing AI Safety: Terminology, Taxonomy, Lifecycle Mapping | AI Safety&Evaluation Framework&AI Lifecycle Mapping |
| 24.05 | CSAIL and CBMM, MIT | arxiv | SecureLLM: Using Compositionality to Build Provably Secure Language Models for Private, Sensitive, and Secret Data | SecureLLM&Compositionality |
| 24.05 | Carnegie Mellon University | arxiv | Human–AI Safety: A Descendant of Generative AI and Control Systems Safety | Human–AI Safety&Generative AI |
| 24.05 | University of York | arxiv | Safe Reinforcement Learning in Black-Box Environments via Adaptive Shielding | Safe Reinforcement Learning&Black-Box Environments&Adaptive Shielding |
| 24.05 | Princeton University | arxiv | AI Risk Management Should Incorporate Both Safety and Security | AI Safety&AI Security&Risk Management |
| 24.05 | University of Oslo | arxiv | AI Safety: A Climb to Armageddon? | AI Safety&Existential Risk&AI Governance |
| 24.06 | Zscaler, Inc. | arxiv | Exploring Vulnerabilities and Protections in Large Language Models: A Survey | Prompt Hacking&Adversarial Attacks&Survey |
| 24.06 | Texas A&M University-San Antonio | arxiv | Transforming Computer Security and Public Trust Through the Exploration of Fine-Tuning Large Language Models | Fine-Tuning&Cyber Security |
| 24.06 | Alibaba Group | arxiv | How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States | LLM Safety&Alignment&Jailbreak |
| 24.06 | UC Davis | arxiv | Security of AI Agents | Security&AI Agents&Vulnerabilities |
| 24.06 | University of Connecticut | USENIX Security ’24 | An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulnerabilities against Strong Detection | Backdoor Attack&Code Completion Models&Vulnerability Detection |
| 24.06 | University of California, Irvine | arxiv | TorchOpera: A Compound AI System for LLM Safety | TorchOpera&LLM Safety&Compound AI System |
| 24.06 | NVIDIA Corporation | arxiv | garak: A Framework for Security Probing Large Language Models | garak&Security Probing |
| 24.06 | Carnegie Mellon University | arxiv | Current State of LLM Risks and AI Guardrails | LLM Risks&AI Guardrails |
| 24.06 | Johns Hopkins University | arxiv | Every Language Counts: Learn and Unlearn in Multilingual LLMs | Multilingual LLMs&Fake Information&Unlearning |
| 24.06 | Tsinghua University | arxiv | Finding Safety Neurons in Large Language Models | Safety Neurons&Mechanistic Interpretability&AI Safety |
| 24.06 | Center for AI Safety and Governance, Institute for AI, Peking University | arxiv | SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset | Safety Alignment&Text2Video Generation |
| 24.06 | Samsung R&D Institute UK, KAUST, University of Oxford | arxiv | Model Merging and Safety Alignment: One Bad Model Spoils the Bunch | Model Merging&Safety Alignment |
| 24.06 | Hofstra University | arxiv | Analyzing Multi-Head Attention on Trojan BERT Models | Trojan Attack&BERT Models&Multi-Head Attention |
| 24.06 | Fudan University | arxiv | SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance | Safety Alignment&Jailbreak Attacks&Response Disparity |
| 24.06 | Stony Brook University | NAACL 2024 Workshop | Automated Adversarial Discovery for Safety Classifiers | Safety Classifiers&Adversarial Attacks&Toxicity |
| 24.07 | University of Utah | arxiv | Beyond Perplexity: Multi-dimensional Safety Evaluation of LLM Compression | Model Compression&Safety Evaluation |
| 24.07 | University of Alberta | arxiv | Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture | Multilingual Blending&LLM Safety Alignment&Language Mixture |
| 24.07 | Singapore National Eye Centre | arxiv | A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models – Safety, Consensus, Objectivity, Reproducibility and Explainability | Evaluation Framework |
| 24.07 | Microsoft | arxiv | SLIP: Securing LLM’s IP Using Weights Decomposition | Hybrid Inference&Model Security&Weights Decomposition |
| 24.07 | Microsoft | arxiv | Phi-3 Safety Post-Training: Aligning Language Models with a “Break-Fix” Cycle | Phi-3&Safety Post-Training |
| 24.07 | Tsinghua University | arxiv | Course-Correction: Safety Alignment Using Synthetic Preferences | Course-Correction&Safety Alignment&Synthetic Preferences |
| 24.07 | Northwestern University | arxiv | From Sands to Mansions: Enabling Automatic Full-Life-Cycle Cyberattack Construction with LLM | Cyberattack Construction&Full-Life-Cycle |
| 24.07 | Singapore University of Technology and Design | arxiv | AI Safety in Generative AI Large Language Models: A Survey | Generative AI&AI Safety |
| 24.07 | Lehigh University | arxiv | Blockchain for Large Language Model Security and Safety: A Holistic Survey | Blockchain&Security&Safety |
| 24.08 | OpenAI | OpenAI | Rule-Based Rewards for Language Model Safety | Reinforcement Learning&Safety&Rule-Based Rewards |
| 24.08 | University of Texas at Austin | arxiv | HIDE AND SEEK: Fingerprinting Large Language Models with Evolutionary Learning | Model Fingerprinting&In-context Learning |
| 24.08 | Technical University of Munich | arxiv | Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study | Secure Code Assessment&Vulnerability Detection |
| 24.08 | Offenburg University of Applied Sciences | arxiv | "You still have to study" - On the Security of LLM generated code | Code Security&Prompting Techniques |
| 24.08 | University of Connecticut | arxiv | Clip2Safety: A Vision Language Model for Interpretable and Fine-Grained Detection of Safety Compliance in Diverse Workplaces | Vision Language Model&Safety Compliance&Personal Protective Equipment Detection |
| 24.08 | Pabna University of Science and Technology | arxiv | Risks, Causes, and Mitigations of Widespread Deployments of Large Language Models (LLMs): A Survey | Privacy&Bias&Interpretability |
| 24.08 | Quinnipiac University | arxiv | Is Generative AI the Next Tactical Cyber Weapon For Threat Actors? Unforeseen Implications of AI Generated Cyber Attacks | Generative AI&Cybersecurity&Cyber Attacks |
| 24.08 | Nanyang Technological University | arxiv | Trustworthy, Responsible, and Safe AI: A Comprehensive Architectural Framework for AI Safety with Challenges and Mitigations | AI Safety&Trustworthy&Responsible |
| 24.08 | King Abdullah University of Science and Technology | arxiv | Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models | Safety&Helpfulness&LLM Alignment |
| 24.08 | University of Calgary | arxiv | Trustworthy and Responsible AI for Human-Centric Autonomous Decision-Making Systems | Trustworthy AI&Algorithmic Bias&Responsible AI |
| 24.08 | University of Oxford | arxiv | AI Security Audits: Challenges and Innovations in Assessing Large Language Models | AI Security Audits&Vulnerability Assessment&AI Ethics |
| 24.08 | University of Science and Technology of China | arxiv | Safety Layers of Aligned Large Language Models: The Key to LLM Security | Aligned LLM&Safety Layers&Security Degradation |
| 24.09 | University of Texas at San Antonio | arxiv | Enhancing Source Code Security with LLMs: Demystifying The Challenges and Generating Reliable Repairs | Source Code Security&LLMs&Reinforcement Learning |
| 24.09 | The Hong Kong Polytechnic University | arxiv | Alignment-Aware Model Extraction Attacks on Large Language Models | Model Extraction Attacks&LLM Alignment&Watermark Resistance |
| 24.09 | University of Oxford, Redwood Research | arxiv | Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols | AI Control&Safety Protocols&Game Theory |
| 24.09 | University of Galway | ECAI AIEB Workshop | Ethical AI Governance: Methods for Evaluating Trustworthy AI | Trustworthy AI&Ethics&AI Evaluation |
| 24.09 | University of Texas at San Antonio | arxiv | AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing | Multi-Agent Systems&Code Security&Fuzz Testing&Static Analysis |
| 24.09 | Tsinghua University | arxiv | Language Models Learn to Mislead Humans via RLHF | Reinforcement Learning from Human Feedback (RLHF)&U-SOPHISTRY&Misleading AI |
| 24.09 | Stevens Institute of Technology | arxiv | Measuring Copyright Risks of Large Language Model via Partial Information Probing | Copyright&Partial Information Probing |
| 24.09 | IBM Research | arxiv | Attack Atlas: A Practitioner’s Perspective on Challenges and Pitfalls in Red Teaming GenAI | Red Teaming&LLM Security&Adversarial Attacks |
| 24.09 | Pengcheng Laboratory | arxiv | Multi-Designated Detector Watermarking for Language Models | Watermarking&Claimability&Multi-designated Verifier Signature |
| 24.09 | ETH Zurich | arxiv | An Adversarial Perspective on Machine Unlearning for AI Safety | Machine Unlearning&Adversarial Attacks&Unlearning Robustness |
| 24.10 | Google DeepMind | arxiv | A Watermark for Black-Box Language Models | Watermarking&Black-Box Models&LLM Detection |
| 24.10 | Mohamed Bin Zayed University of Artificial Intelligence | arxiv | Optimizing Adaptive Attacks Against Content Watermarks for Language Models | Watermarking&Adaptive Attacks&LLM Security |
| 24.10 | Rice University, Rutgers University | arxiv | Taylor Unswift: Secured Weight Release for Large Language Models via Taylor Expansion | Taylor Expansion&Model Security |
| 24.10 | PeopleTec | arxiv | Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders | Cybersecurity&Hallucinations |
| 24.10 | Fondazione Bruno Kessler, Université Côte d’Azur | EMNLP 2024 | Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering | Counterspeech&Safety Guardrails |
| 24.10 | University of California, Davis, AWS AI Labs | arxiv | Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models | Safety alignment&Vision-Language models&Cross-modality representation manipulation |
| 24.10 | North Carolina State University | arxiv | Superficial Safety Alignment Hypothesis: The Need for Efficient and Robust Safety Mechanisms in LLMs | Superficial safety alignment&Safety mechanisms&Safety-critical components |
| 24.10 | Shanghai Jiao Tong University, Chinese University of Hong Kong (Shenzhen), Tsinghua University | arxiv | Achilles’ Heel in Semi-Open LLMs: Hiding Bottom Against Recovery Attacks | Semi-open LLMs&Recovery attacks&Model resilience |
| 24.10 | University of Tulsa | arxiv | Weak-to-Strong Generalization beyond Accuracy: A Pilot Study in Safety, Toxicity, and Legal Reasoning | Weak-to-Strong Generalization&Safety&Toxicity |
| 24.10 | Aalborg University | arxiv | Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis | Language confusion&Multilingual LLMs&Security vulnerabilities |
| 24.10 | Carnegie Mellon University | arxiv | Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents | LLM Safety&Browser Agents&Red Teaming |
| 24.10 | Palisade Research | arxiv | LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild | LLM Agents&Honeypots&Cybersecurity |
| 24.10 | University of Pittsburgh | arxiv | Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents | Embodied Agents&Multimodal Safety&Active Learning |
| 24.10 | CSIRO’s Data61 | arxiv | From Solitary Directives to Interactive Encouragement! LLM Secure Code Generation by Natural Language Prompting | Secure Code Generation&Encouragement Prompting |
| 24.10 | AppCubic | arxiv | Jailbreaking and Mitigation of Vulnerabilities in Large Language Models | Prompt Injection&Jailbreaking&AI Security |
| 24.10 | UC Berkeley | arxiv | SAFETYANALYST: Interpretable, Transparent, and Steerable LLM Safety Moderation | LLM Safety&Interpretability&Content Moderation |
| 24.10 | ShanghaiTech University | arxiv | Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization | Safety Alignment&Reinforcement Learning&Policy Optimization |
| 24.11 | Zhejiang University | arxiv | Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control | Trustworthiness&Sparse Activation Control&Representation Control |
| 24.11 | University of California, Riverside | arxiv | Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models | Vision-Language Models&Safety Alignment&Cross-Layer Vulnerability |
| 24.11 | National University of Singapore | EMNLP 2024 | Multi-expert Prompting Improves Reliability, Safety and Usefulness of Large Language Models | Multi-expert Prompting&LLM Safety&Reliability&Usefulness |
| 24.11 | OpenAI | NeurIPS 2024 | Rule Based Rewards for Language Model Safety | Rule Based Rewards&Safety Alignment&AI Feedback |
| 24.11 | Center for Automation and Robotics, Spanish National Research Council | arxiv | Can Adversarial Attacks by Large Language Models Be Attributed? | Adversarial Attribution&LLM Security&Formal Language Theory |
| 24.11 | McGill University | arxiv | Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset | Helpful and Harmless Dataset&Safety Trade-offs&Bias Analysis |
| 24.11 | Fudan University | arxiv | Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding | Text-to-Image Generation&Safety&Prompt Embedding Sanitization |
| 24.11 | Meta | arxiv | Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations | Multimodal LLM&Content Moderation&Adversarial Robustness |
| 24.11 | Columbia University | arxiv | When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations | Backdoor Attacks&Explainability |
| 24.11 | Ben-Gurion University of the Negev | arxiv | The Information Security Awareness of Large Language Models | Information Security Awareness&Benchmarking |
| 24.11 | Fordham University | arxiv | Next-Generation Phishing: How LLM Agents Empower Cyber Attackers | Phishing Detection&Cybersecurity |
| 24.12 | UC Berkeley | arxiv | Trust & Safety of LLMs and LLMs in Trust & Safety | Trust and Safety&Prompt Injection |
| 24.12 | Harvard Kennedy School, Avant Research Group | arxiv | Evaluating Large Language Models' Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects | Phishing Attacks&Human-in-the-loop |
| 24.12 | University of Massachusetts | arxiv | Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness | Instruction Tuning&Safety&Helpfulness |
| 24.11 | University of Pennsylvania, IBM T.J. Watson Research Center | arxiv | Cyber-Attack Technique Classification Using Two-Stage Trained Large Language Models | Cyber-Attack Classification&Two-Stage Training |
| 24.12 | University of New South Wales | arxiv | How Can LLMs and Knowledge Graphs Contribute to Robot Safety? A Few-Shot Learning Approach | Robot Safety&Few-Shot Learning&Knowledge Graph Prompting |
| 24.12 | Örebro University | arxiv | Large Language Models and Code Security: A Systematic Literature Review | LLM-Generated Code&Vulnerability Detection&Data Poisoning Attacks |
| 24.12 | Algiers Research Institute | arxiv | On the Validity of Traditional Vulnerability Scoring Systems for Adversarial Attacks against LLMs | Adversarial Attacks&Vulnerability Metrics&Risk Assessment |
| 25.01 | Meta | arxiv | MLLM-as-a-Judge for Image Safety without Human Labeling | Image Safety&Zero-Shot Judgment&Multimodal Large Language Models |
| 25.01 | FAU Erlangen-Nürnberg | arxiv | Refusal Behavior in Large Language Models: A Nonlinear Perspective | Refusal Behavior&Mechanistic Interpretability&AI Alignment |