Ninareh Mehrabi1, Ahmad Beirami2∗, Fred Morstatter1, Aram Galstyan1
1University of Southern California - Information Sciences Institute 2Meta AI
NAACL 2022 paper
Abstract
Recent NLP research has improved various kinds of toxicity detection
toxicity detection models with the intention of identifying and mitigating toxic language from existing systems.
Despite plenty of prior work, adversarial attacks that force toxic generation, and defenses against them, have been under-explored
adversarial attacks that force the system to generate toxic language and the defense against them
Most prior attacks are human-generated, which is costly and not scalable
Automatic attacks, on the other hand, produce attack vectors that do not conform to human-like language, so they can be detected with a language-model loss
Existing work to generate such attacks is either based on human-generated attacks which is costly and not scalable or, in case of automatic attacks, the attack vector does not conform to human-like language, which can be detected using a language model loss
This work proposes imperceptible attacks against conversational agents, i.e., attacks that, unlike the automatic attacks above, go unnoticed because they fit the conversation in terms of coherency, relevancy, and fluency
propose attacks against conversational agents that are imperceptible, i.e., they fit the conversation in terms of coherency, relevancy, and fluency, while they are effective and scalable, i.e., they can automatically trigger the system into generating toxic language
The work also proposes a defense mechanism against such attacks, one that not only mitigates the attack but also maintains the conversational flow
propose a defense mechanism against such attacks which not only mitigates the attack but also attempts to maintain the conversational flow
In the end, automatic and human evaluations show the defense is effective even against strong attacks
our defense is effective at avoiding toxic language generation even against imperceptible toxicity triggers while the generated language fits the conversation in terms of coherency and relevancy
Introduction
Considering adversarial attacks on dialogue systems is important for keeping conversations safe and robust
consider adversarial attacks on human-centric chatbots and dialogue systems. It is important for these systems to be safe and robust in the face of natural(-looking) human conversations
Conversation example
attacks
Our proposed approach works by augmenting the universal adversarial triggers (UAT) from Wallace et al. (2019) with additional selection criteria to generate imperceptible yet effective triggers
defense
then focus on a defense mechanism for the non-adversarial (defender) model to avoid generating toxic utterances
A simple approach (Xu et al., 2020) can catch adversarial triggers, but it can break the conversational flow, so the focus here is on a defense mechanism that "detoxifies" the response without breaking the flow
Our proposed method relies on two levels of interpretable reasoning that helps the model to
(1) identify the key adversarial tokens responsible for the attack and
(2) avoid generating toxic responses by masking those tokens during the generation process.
Attack Approaches
first discuss the universal adversarial trigger(UAT) attack proposed by Wallace et al. (2019), which we use as our baseline
then propose alterations to this baseline to make the universal triggers more natural-looking and suitable for conversational domain
Methodology
Universal Adversarial Trigger (UAT) (Wallace et al., 2019)
The goal in universal adversarial trigger attack is to find a universal trigger sequence for a given trained model, which if attached to the start of any given input can cause the model to output the desired outcome (Wallace et al., 2019)
In other words, a trigger sequence is a token sequence that, when prepended to any given input, can steer the model's output toward the desired outcome
This attack starts with a fixed-length sequence as the initial trigger, e.g., “the the the the the the” and tries to iteratively replace the tokens in the sequence to satisfy an objective.
The iterations terminate when no improvement (replacement) can be made to further optimize the objective
Because the trigger only needs to elicit toxic tokens and is not constrained by an LM loss, the result is often a repetitive, easily detectable phrase with very high perplexity
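A rough picture of the search loop, as a hedged sketch (not the authors' code): `objective` (higher = more toxic defender output) and `candidate_vocab` are stand-ins, and the real attack picks candidates with gradient-guided (HotFlip-style) approximations rather than brute-force enumeration.

```python
# Hedged sketch of the UAT search: start from a fixed-length placeholder trigger
# and greedily swap tokens while the objective improves.
from typing import Callable, List

def uat_search(
    objective: Callable[[List[str]], float],
    candidate_vocab: List[str],
    trigger_len: int = 6,
) -> List[str]:
    trigger = ["the"] * trigger_len              # e.g. "the the the the the the"
    best = objective(trigger)
    improved = True
    while improved:                              # terminate when no replacement helps
        improved = False
        for i in range(trigger_len):
            for cand in candidate_vocab:
                proposal = trigger[:i] + [cand] + trigger[i + 1:]
                score = objective(proposal)
                if score > best:
                    trigger, best, improved = proposal, score, True
    return trigger
```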
Universal Adversarial Trigger with Language Model Loss (UAT-LM)
An intuitive solution to address the above shortcoming of UAT is to impose a language modeling objective on the trigger tokens.
Even with the LM loss added, the generated triggers may read as plausible text on their own, but there is no guarantee the conversation flow stays coherent and relevant
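A hedged sketch of how the candidate scoring in the search above would change under UAT-LM; `lambda_lm` and the additive form are assumptions, not values from the paper.

```python
# Hedged sketch: score = attack objective plus a language-model penalty on the trigger itself.
from typing import Callable, List

def uat_lm_score(
    trigger: List[str],
    attack_objective: Callable[[List[str]], float],   # higher = more toxic defender output
    trigger_nll: Callable[[str], float],              # e.g. GPT-2 negative log-likelihood
    lambda_lm: float = 1.0,                           # assumed weighting hyperparameter
) -> float:
    # Subtracting the NLL keeps "higher is better" while rewarding fluent triggers.
    return attack_objective(trigger) - lambda_lm * trigger_nll(" ".join(trigger))
```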
So an alternative, modified approach is proposed next
Unigram Trigger with Selection Criteria (UTSC)
propose an alternative approach in which we generate a collection of unigram triggers (with sequence length one) from UAT
Collect unigram triggers from UAT
then feed these triggers along with the history of the conversation h to our dialogue model and generate different attack utterances
Append each unigram trigger to generate different attack utterances
Next, we pick the best suited attack utterance amongst all the generated attack utterances according to our selection criterion as demonstrated in Figure 2
Among the generated candidates, pick the best-suited utterance based on the selection criterion
Concretely: append each unigram trigger to the conversation history, generate candidate utterances with DialoGPT, score them with toxicity classifiers (a single one or an ensemble), and pick one utterance per criterion (UTSC-N); a minimal selection sketch follows this list
UTSC-1: pick the utterance with the highest toxicity score
UTSC-2: among utterances above the threshold, pick the one with the lowest toxicity score (if none pass the threshold, fall back to the highest-scoring one)
UTSC-3: pick the utterance with the lowest toxicity score
Because the trigger is a single token, fluency is not noticeably sacrificed
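A minimal sketch of the three selection criteria above; the threshold value is an assumption, and the scores come from whichever toxicity classifier(s) the adversary uses.

```python
# Hedged sketch of the UTSC-N selection criteria.
from typing import List, Tuple

def utsc_select(candidates: List[Tuple[str, float]], criterion: int, threshold: float = 0.5) -> str:
    """candidates: (attack_utterance, toxicity_score) pairs from the adversary's classifier(s)."""
    if criterion == 1:                                        # UTSC-1: most toxic candidate
        return max(candidates, key=lambda c: c[1])[0]
    if criterion == 2:                                        # UTSC-2: least toxic above threshold
        above = [c for c in candidates if c[1] >= threshold]
        if above:
            return min(above, key=lambda c: c[1])[0]
        return max(candidates, key=lambda c: c[1])[0]         # fallback: most toxic overall
    return min(candidates, key=lambda c: c[1])[0]             # UTSC-3: least toxic candidate
```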
Experimental Setup
General Setup
use DialoGPT
to generate 100 conversations around a specific topic
The topic is determined by the context sentence that starts the conversation between the adversary and the defender.
Each conversation runs for 10 turns
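A hedged sketch of that setup with DialoGPT from Hugging Face; the model size, sampling parameters, and self-chat loop are assumptions about the setup, not the authors' exact code.

```python
# Hedged sketch: seed a conversation with a context sentence and run a 10-turn self-chat.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

def self_chat(context: str, turns: int = 10) -> list:
    history_ids = tokenizer.encode(context + tokenizer.eos_token, return_tensors="pt")
    utterances = [context]
    for _ in range(turns):
        with torch.no_grad():
            output_ids = model.generate(
                history_ids,
                max_length=history_ids.shape[-1] + 40,
                do_sample=True,
                top_p=0.9,
                pad_token_id=tokenizer.eos_token_id,
            )
        new_ids = output_ids[:, history_ids.shape[-1]:]       # keep only the new turn
        utterances.append(tokenizer.decode(new_ids[0], skip_special_tokens=True))
        history_ids = output_ids                              # grow the conversation history
    return utterances
```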
Toxicity Detection Models
utilize an ensemble of three different toxicity detection models:
Toxic-bert, Perspective API, and Safety classifier (Xu et al., 2020)
Toxic-bert is the least sensitive of the three, followed by Perspective API and then the Safety classifier
allow the adversary to only use one of the toxicity detection models to design its attack. We then quantify toxicity using the other two toxicity detection methods, not accessed by the adversary.
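A hedged sketch of scoring an utterance with Toxic-bert via the public `unitary/toxic-bert` checkpoint; Perspective API and the Safety classifier need their own credentials/checkpoints and are omitted, and this is an assumption about how to call the model, not the authors' scoring code.

```python
# Hedged sketch: Toxic-bert toxicity score for a single utterance.
from transformers import pipeline

toxic_bert = pipeline("text-classification", model="unitary/toxic-bert", top_k=None)

def toxicity_score(utterance: str) -> float:
    scores = toxic_bert(utterance)
    if scores and isinstance(scores[0], list):   # some pipeline versions nest the output
        scores = scores[0]
    return next(s["score"] for s in scores if s["label"] == "toxic")
```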
Data
context sentences from two different datasets, Wizard of Wikipedia (Dinan et al., 2018) and ConvoKit’s Reddit Corpus
Wikipedia: neutral topics
Reddit: sensitive topics
picked 50 random context sentences from the Wizard of Wikipedia and 50 from the Reddit datasets.
AMT Experiments
To compare and verify the quality of conversations generated during and after the attacks, we conduct human experiments
AMT workers annotated 100 conversations from each of the three attacks, and each conversation was annotated by 3 AMT workers, giving us overall 900 annotated conversations, 300 from each attack
Results
Attack Effectiveness
two of our proposed attacks UAT-LM and UTSC-1 are performing the best according to the Perspective API and Toxic-bert classifiers
UAT baseline performs the best according to Safety classifier.
Overall results show that UTSC-1 and UAT-LM attacks are competitive attacks in terms of attack effectiveness.
UAT(baseline) attack tends to generate meaningless phrases, e.g., “acist neighborhoodsJohnson carry morals Ukrain” which can easily be detected as an anomaly and make the conversation not flow naturally
Perplexity (PPL) differences measured with GPT-2
UAT is absurdly high (∼10^7) compared to ∼10^4 for UAT-LM, and ∼160 for UTSC-1
The no-attack case is ∼39
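A minimal sketch of how such perplexity numbers are typically computed with GPT-2 (the paper's exact evaluation protocol may differ):

```python
# Hedged sketch: GPT-2 perplexity of an attack phrase, i.e. exp of the mean token NLL.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    ids = gpt2_tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = gpt2(ids, labels=ids).loss    # mean token negative log-likelihood
    return torch.exp(loss).item()

print(perplexity("acist neighborhoodsJohnson carry morals Ukrain"))  # expect a very large value
```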
Attack Transferability
The attack is forcing the defender to generate actual toxic language rather than just fooling the toxicity classifier.
Human Evaluation
Our UTSC-1 attack is rated to have the highest coherency
UTSC-1 is rated to have more fluent attacks generated with mostly moderate to good scores and a higher average–shown by the black dotted lines–compared to the UAT and UAT-LM baselines
Fleiss Kappa (Fleiss, 1971) annotator agreement results from this evaluation are reported in Table 1. Annotators have reasonable overall agreement for all the qualities
Defense Approaches
two components
(a) detecting the attack and
(b) mitigating its effect by ensuring that the defender does not generate a toxic response
detection
The detection problem is rather straightforward, as the defense can simply run a toxicity classifier on the generated response
mitigation
Xu et al. (2020) suggested a mitigating approach which, when a toxic response is detected, simply resets the dialogue and generates a (non-toxic) utterance by randomly sampling from a predefined set of topics
Prior work stops the dialogue and restarts it by randomly sampling from a predefined set of topics
This work instead tries to avoid generating toxic utterances while maintaining the conversational flow
Methodology
defense mechanism in the second stage utilizes two layers of reasoning using two different interpretability techniques
The first layer aims to detect which tokens in the defender's utterance make the toxicity detection model label the utterance as toxic.
Find the offending tokens in the defender's utterance; we call these the L1 tokens
The second layer aims to detect which tokens in the adversary's attack utterance are responsible for the generation of the L1 tokens from the defender's utterance
Find the tokens in the adversary's attack utterance that caused the L1 tokens to be generated (the L2 tokens)
defender then masks the L2 tokens from the adversary, which were responsible for triggering the defender model to generate toxic tokens, and generates a new utterance
Mask the L2 tokens that trigger the L1 tokens, then regenerate the utterance
then apply a toxicity classifier on this new utterance
Check the toxicity of the newly generated utterance
If it is non-toxic, done; otherwise mask more tokens and repeat
For the first layer, we use transformers-interpret, which provides explanations and identifies the L1 tokens according to the Toxic-bert model. The snippet below shows the library's basic usage (here with an SST-2 sentiment classifier; the defense applies the same mechanism to Toxic-bert)
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import SequenceClassificationExplainer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# With both the model and tokenizer initialized we are now able to get explanations
# on an example text.
cls_explainer = SequenceClassificationExplainer(model, tokenizer)
word_attributions = cls_explainer("I love you, I like you")

>>> word_attributions
[('[CLS]', 0.0),
 ('i', 0.2778544699186709),
 ('love', 0.7792370723380415),
 ('you', 0.38560088858031094),
 (',', -0.01769750505546915),
 ('i', 0.12071898121557832),
 ('like', 0.19091105304734457),
 ('you', 0.33994871536713467),
 ('[SEP]', 0.0)]
```
For the second layer, we use LERG (Tuan et al., 2021) that provides local explanations for dialogue response generation and identifies the L2 token
LERG (Local Explanation of Response Generation) is a unified approach to explain why a conditional text generation model will predict a text
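Putting the two layers together, a minimal sketch of the iterative masking loop described above; `generate`, `toxicity`, `find_l1_tokens`, and `find_l2_tokens` are stand-ins for the defender model, the toxicity classifier, transformers-interpret, and LERG respectively, not the authors' code.

```python
# Hedged sketch of the two-layer masking defense loop.
from typing import Callable, List

def masking_defense(
    attack_utterance: str,
    history: List[str],
    generate: Callable[[List[str], str], str],              # (history, masked attack) -> defender reply
    toxicity: Callable[[str], float],                       # reply -> toxicity score in [0, 1]
    find_l1_tokens: Callable[[str], List[str]],             # toxic reply -> offending (L1) tokens
    find_l2_tokens: Callable[[str, List[str]], List[str]],  # attack, L1 -> responsible (L2) tokens
    threshold: float = 0.5,
    max_iters: int = 3,
) -> str:
    masked_attack = attack_utterance
    reply = generate(history, masked_attack)
    for _ in range(max_iters):
        if toxicity(reply) < threshold:
            return reply                          # non-toxic reply that keeps the conversation going
        l1 = find_l1_tokens(reply)                # layer 1: tokens making the reply toxic
        l2 = find_l2_tokens(masked_attack, l1)    # layer 2: attack tokens that triggered them
        for tok in l2:
            masked_attack = masked_attack.replace(tok, "[MASK]")
        reply = generate(history, masked_attack)  # regenerate with the triggers masked
    return reply
```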
Experimental Setup
Baselines
Two-stage Non Sequitur Baseline
If toxicity is detected, make the bot switch topics
uses a toxicity classifier to detect if the utterance is toxic or not. It then changes the topic of the conversation if the utterance was detected to be toxic, e.g., “Hey do you want to talk about something else? How about we talk about X?” where X is a randomly chosen topic from 1087 topics judged as safe from the Wizard of Wikipedia conversational topic list
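A hedged sketch of this two-stage baseline; `toxicity` is any classifier returning a score in [0, 1], and `SAFE_TOPICS` is a placeholder for the 1087 safe Wizard of Wikipedia topics.

```python
# Hedged sketch of the Two-stage Non Sequitur baseline.
import random
from typing import Callable, List

SAFE_TOPICS: List[str] = ["gardening", "astronomy", "baking"]  # placeholder topic list

def non_sequitur(draft_reply: str, toxicity: Callable[[str], float], threshold: float = 0.5) -> str:
    if toxicity(draft_reply) < threshold:
        return draft_reply                        # keep the defender's original reply
    topic = random.choice(SAFE_TOPICS)            # otherwise change the subject
    return f"Hey do you want to talk about something else? How about we talk about {topic}?"
```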
Trigger Masking (TM) Baseline
consider masking the adversarial trigger tokens. Note that the defender does not generally know which tokens were the trigger-tokens used by the adversary, so this approach is not applicable in realistic settings.
In practice the defender does not know which tokens are the triggers, but this oracle baseline is included for insight
AMT Experiments
evaluate the defense quality according to relevancy and fluency, the coherency of the overall conversation, and the toxicity of the defense utterance
27 conversations were rated from each of the three defenses (TM, Two-stage Non Sequitur, and our proposed defense). 3 AMT workers rated each conversation, which gave us 243 annotations, 81 from each defense
Results
Defense Effectiveness
our proposed defense mechanism as well as the Non Sequitur baseline achieve 100% defense effectiveness according to Toxic-bert classifier
with our proposed method, for all the attacks except UAT-LM, we were able to reach 100% defense effectiveness by only masking one token
For UAT-LM, almost 90% of cases were resolved by masking one token and the rest were resolved by the iterative approach that masked multiple tokens (up to 3)
Defense Transferability
Human Evaluation
Beyond Conversational Agents
show the generalizability of our defense method to non-conversational generation tasks by conducting experiments with the RealToxicityPrompts dataset
Conclusion
studied the possibility of generating imperceptible attacks against conversational agents that, while fluent and coherent, trigger the model into generating toxic responses
proposed a defense mechanism that was shown to be effective through various automatic and human evaluations as well as its transferability to human attacks, general generation tasks, and different toxicity classifiers
Future work can focus on improving our proposed attacks both in terms of imperceptibility and effectiveness as well as more advanced defense mechanisms.