arXiv:2412.11172v1 [cs.CL] 15 Dec 2024
Unpacking the Resilience of
SNLI Contradiction Examples to Attacks
Chetan Verma
Department of Computer Science
University of Texas at Austin
chetan.kumar.verma@utexas.edu
Archit Agarwal
Department of Computer Science
University of Texas at Austin
aa2023mscs@utexas.edu
Abstract
Pre-trained models excel on NLI benchmarks
like SNLI and MultiNLI, but their true lan-
guage understanding remains uncertain. Mod-
els trained only on hypotheses and labels
achieve high accuracy, indicating reliance on
dataset biases and spurious correlations. To
explore this issue, we applied the Universal
Adversarial Attack to examine the model’s
vulnerabilities. Our analysis revealed substan-
tial drops in accuracy for the entailment and
neutral classes, whereas the contradiction class
exhibited a smaller decline. Fine-tuning the
model on an augmented dataset with adversar-
ial examples restored its performance to near-
baseline levels for both the standard and chal-
lenge sets. Our findings highlight the value
of adversarial triggers in identifying spurious
correlations and improving robustness while
providing insights into the resilience of the
contradiction class to adversarial attacks.
1 Introduction
Natural Language Inference (NLI) is a foundational task in Natural
Language Processing (NLP) that evaluates a model’s natural language
understanding (NLU). It involves determining whether a hypothesis is
true (Entailment), false (Contradiction) or cannot be determined
(Neutral) given a premise (Dagan et al., 2006, 2013). This reasoning
ability is critical for mimicking human understanding and supports a
wide range of applications. Consequently, when models achieve high
accuracy on this task, it is often claimed that they have strong NLU
capabilities. However, recent research (Poliak et al., 2018;
Gururangan et al., 2018) shows that models achieve high accuracy even
when trained on hypothesis-only datasets. This suggests that models
exploit spurious correlations and superficial patterns, known as
dataset artifacts, to predict the correct label rather than genuinely
understanding the language.
To investigate this phenomenon and explore
ways to improve the model, we focused on
the Stanford NLI (SNLI) dataset (Bowman et al., 2015), one of the most
widely used benchmarks for NLI tasks. For our study, we selected
Efficiently Learning an Encoder that Classifies Token Replacements
Accurately (ELECTRA-small¹) (Clark et al., 2020). This model retains the
same architecture as Bidirectional Encoder Repre-
sentations from Transformers (BERT) but incor-
porates an improved training method. Moreover,
ELECTRA is computationally efficient, making
it a practical alternative to larger, more resource-
intensive models.
Adversarial datasets were chosen as a key tool for assessing
robustness due to their flexibility and cross-dataset applicability.
In particular, we used a previously proposed method for generating
Universal Adversarial Triggers (Wallace et al., 2019) to create a
testing dataset and evaluate the robustness of ELECTRA. These triggers
apply to any input from a dataset and can transfer across models.
Unlike other adversarial attacks, they are context-independent and
hence provide new insights into the general input-output patterns
learned by the model.
To address the identified gaps and enhance the
model’s robustness, we fine-tuned it using a small,
augmented training dataset (Liu et al., 2019). A
detailed discussion of our methodology and anal-
ysis is provided later in the paper.
¹ We will use ELECTRA as an alias to refer to ELECTRA-small throughout this paper.
2 Background and Related Work
The evaluation and enhancement of machine learning models’ robustness
has emerged as a critical focus in recent research. A variety of
techniques have been developed to create challenge sets that expose
model vulnerabilities, with notable approaches including: (1) Contrast
Sets: These are manually crafted modifications to test data that
introduce small, label-changing alterations while maintaining lexical
and syntactic integrity (Gardner et al., 2020); (2) Checklist Sets:
This approach involves a systematic, task-agnostic framework for
testing NLP models. Drawing inspiration from behavioral testing
principles in software engineering, it employs diverse test types to
evaluate model performance (Ribeiro et al., 2020); (3) Adversarial
Challenge Sets: These datasets are deliberately modified to provoke
problematic outputs, revealing critical weaknesses in models (Jia and
Liang, 2017; Wallace et al., 2019).
To address the vulnerabilities identified through such challenge
sets, researchers have proposed and employed several mitigation
techniques, including: (1) Adversarial Data Training: This approach
incorporates challenge sets directly into the training process or
employs adversarial data augmentation to strengthen model performance
against adversarial inputs (Liu et al., 2019); (2) Ensemble-based
Debiasing: This approach first learns a biased model that captures
dataset-specific clues and then trains a debiased model on the
residual errors (He et al., 2019).
These techniques collectively represent signifi-
cant strides in fortifying machine learning models
against potential vulnerabilities, ensuring more re-
liable and robust performance in diverse applica-
tions.
3 Methodology
3.1 Universal Adversarial Attack
Universal adversarial triggers are generated tokens designed to
manipulate a model’s predictions. When added to the beginning or end
of an input, these triggers can compel the model to produce a
prediction that deviates from the gold label.
Why Universal Triggers? (1) Black-Box Capability: These triggers can
be generated without any access to the target model, making them
effective even in black-box scenarios; (2) Model Transferability:
These triggers are universal, that is, they are capable of attacking
and transferring across different models; and (3) Context
Independence: Their independence from context provides valuable
insights into the general input-output patterns of the model. Table 1
illustrates the impact of universal adversarial triggers in altering
predicted labels, highlighting their effectiveness in manipulating
model outputs.
3.1.1 Attack Equation
Given a model $f$ (with white-box access), a text input composed of
tokens $t$ (which could represent words, sub-words, or characters),
and a target label $\tilde{y}$, the goal is to generate triggers
$t_{adv}$ to append to the front or back of the input token $t$. In a
non-universal adversarial setting, this can be mathematically
expressed as:
$$f(t_{adv}; t) = \tilde{y} \tag{1}$$
For a universal adversarial setting, the goal is to optimize the
universal trigger such that the loss function for the target class
$\tilde{y}$ is minimized across all inputs from a dataset.
Mathematically, this can be represented as:
$$\arg\min_{t_{adv}} \mathbb{E}_{t \sim \mathcal{T}} \left[ \mathcal{L}(\tilde{y}, f(t_{adv}; t)) \right] \tag{2}$$
where $\mathcal{T}$ represents all input instances from the dataset,
and $\mathcal{L}$ represents the loss function.
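As a rough illustration of the objective in Equation (2), the sketch below averages the cross-entropy loss toward a target label over a small set of inputs, with a candidate trigger prepended to each one. The classifier `toy_model`, its hand-set logits, and the two example inputs are hypothetical stand-ins introduced only for this sketch; they are not the models or data used in this paper.

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability that the model assigns to the target label."""
    return -math.log(probs[target])

def universal_trigger_loss(model, trigger, inputs, target_label):
    """Average loss toward target_label over all inputs (empirical form of Eq. 2)."""
    losses = [cross_entropy(model(trigger + tokens), target_label) for tokens in inputs]
    return sum(losses) / len(losses)

def toy_model(tokens):
    """Hypothetical stand-in classifier: the word 'nobody' shifts mass to label 2."""
    logits = [1.0, 1.0, 1.0]
    if "nobody" in tokens:
        logits[2] += 3.0
    z = sum(math.exp(v) for v in logits)
    return [math.exp(v) / z for v in logits]

# Two made-up tokenized inputs; label 2 plays the role of the target label.
inputs = [["a", "woman", "buys", "food"], ["a", "man", "sleeps"]]
loss_the = universal_trigger_loss(toy_model, ["the"], inputs, target_label=2)
loss_nobody = universal_trigger_loss(toy_model, ["nobody"], inputs, target_label=2)
print(loss_the, loss_nobody)
```

A trigger search would prefer the candidate with the lower average loss, since it pushes more inputs toward the target label at once.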
3.1.2 Trigger Search Algorithm
We start by selecting a trigger length: longer triggers tend to be
more effective, whereas shorter triggers are less noticeable. To
initialize trigger creation, we prepend a simple token such as the
character "a" or the word "the" to the beginning of all inputs.
We incrementally refine the tokens in the trigger to optimize the loss
function associated with the target prediction, leveraging a technique
inspired by HotFlip (Ebrahimi et al., 2018) that uses the gradient
with respect to each token to select its replacement. To use this
technique, the trigger tokens $t_{adv}$ are represented as one-hot
vectors and embedded to form $e_{adv}$.
The HotFlip-inspired technique uses a linear approximation of the
task loss. Specifically, we update the embedding of each trigger token
$e_{adv_i}$ to minimize the loss by applying a first-order Taylor
approximation around the current token embedding:
$$\arg\min_{e'_i \in \mathcal{V}} \left[ e'_i - e_{adv_i} \right]^{T} \nabla_{e_{adv_i}} \mathcal{L} \tag{3}$$
where $\mathcal{V}$ is the set of all token embeddings in the model’s
vocabulary, and $\nabla_{e_{adv_i}} \mathcal{L}$ is the average
gradient of the task loss over a batch. For our use case, NLI (a
classification task), we use the cross-entropy loss to optimize the
attack.
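The selection rule in Equation (3) can be sketched for a single trigger position as follows. This is a simplified, greedy one-token step under stated assumptions: the three-word vocabulary, the two-dimensional embeddings, and the gradient vector are made up for illustration, and the full search procedure (for example, iterating over positions and batches) is not reproduced here.

```python
def hotflip_replacement(vocab_embeddings, current_embedding, avg_gradient):
    """Return the index of the vocabulary embedding e' that minimizes
    (e' - e_adv)^T grad, the first-order estimate of the change in loss."""
    def score(candidate):
        return sum((c - e) * g
                   for c, e, g in zip(candidate, current_embedding, avg_gradient))
    return min(range(len(vocab_embeddings)), key=lambda i: score(vocab_embeddings[i]))

# Hypothetical toy vocabulary with 2-dimensional embeddings.
vocab = ["the", "nobody", "cats"]
embeddings = [[0.0, 0.0], [1.0, -1.0], [-1.0, 0.5]]

# Hypothetical batch-averaged gradient of the loss w.r.t. the trigger embedding.
grad = [-2.0, 1.0]

# The trigger is initialized to "the" (index 0) and one replacement step is taken.
best = hotflip_replacement(embeddings, embeddings[0], grad)
print(vocab[best])
```

With these toy values the scores are 0 for "the", -3 for "nobody", and 2.5 for "cats", so the step swaps the trigger token to "nobody".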
Gold Label | Input (Premise and Hypothesis) | Predicted Label
Entailment (0) | Premise: A woman in a purple shirt buying food from a street vendor. Hypothesis: nobody a woman makes a purchase from a vendor outside. | 0 → 2
Neutral (1) | Premise: A youth is kicking a soccer ball in an empty brick area. Hypothesis: cats a funny human kicking. | 1 → 2
Contradiction (2) | Premise: A person dressed in red and black outside a cracked wall. Hypothesis: anxiously a person in red and black falling through the ionosphere. | 2 → 1
Table 1: Effect of universal triggers (the first word of each hypothesis, highlighted in red in the original) on predicted labels across three classes
Targeted Class | Universal Trigger | Majority Class | Score | Random Trigger | Majority Class | Score
Entailment (0) | nobody | 2 | 0.96 | diners | 1 | 0.47
Entailment (0) | no | 2 | 0.83 | hands | 2 | 0.38
Neutral (1) | cats | 2 | 0.96 | road | 1 | 0.40
Neutral (1) | cat | 2 | 0.85 | mass | 0 | 0.38
Contradiction (2) | joyously | 1 | 1.00 | remain | 1 | 0.62
Contradiction (2) | celebrating | 1 | 0.79 | rose | 1 | 0.59
Table 2: Comparison of Universal and Random Triggers
3.2 Inoculation by Fine-Tuning
To address the identified vulnerabilities, we employed the
Inoculation by Fine-Tuning technique (Liu et al., 2019). This method
involves fine-tuning a pre-trained model on a small, carefully
designed training dataset. Following fine-tuning, the model typically
exhibits one of three behaviors:
1. Reduced Performance Gap: The performance disparity between the
original test set and the challenge set decreases, with the model
maintaining strong performance across both datasets. This outcome
suggests that the observed gap originates from the dataset itself
rather than inherent limitations of the model.
2. Unchanged Performance: The model’s performance remains static,
indicating an inability to adapt to the challenge set. This points to
potential limitations within the model’s architecture or design as the
root cause.
3. Decreased Performance: The model’s performance on the original
dataset declines, even if improvements are observed on the challenge
set. This behavior indicates potential overfitting to the adversarial
examples introduced during fine-tuning, rather than addressing the
underlying issue.
By applying this technique, we effectively diag-
nosed and mitigated the identified issues, strength-
ening the system’s resilience and addressing key
vulnerabilities.
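The three inoculation outcomes described above can be told apart mechanically from four accuracy figures. The sketch below is one possible reading of them, not a procedure taken from Liu et al. (2019): the function name `inoculation_outcome` and the tolerance `tol` of 2 percentage points are assumptions made purely for illustration.

```python
def inoculation_outcome(orig_before, orig_after, chal_before, chal_after, tol=2.0):
    """Map accuracies (in %) before and after fine-tuning to one of the three
    outcomes. `tol` is a hypothetical tolerance, in percentage points, below
    which a change is treated as noise."""
    if orig_before - orig_after > tol:
        # Accuracy on the original set fell noticeably.
        return "decreased performance"
    if chal_after - chal_before <= tol:
        # Accuracy on the challenge set did not move.
        return "unchanged performance"
    # Original accuracy held while challenge accuracy improved.
    return "reduced performance gap"

# Entailment accuracies as reported in Table 3:
# validation subset 90.23 -> 88.92, Challenge Set I 25.78 -> 90.13.
outcome = inoculation_outcome(90.23, 88.92, 25.78, 90.13)
print(outcome)
```

Under this assumed tolerance, the entailment figures reported in Table 3 fall into the first outcome.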
4 Experiments
This section describes the process of generating
triggers (a word in this case), creating attacks us-
ing these triggers, and subsequently training and
evaluating the ELECTRA model to assess its abil-
ity to learn and perform the underlying NLI task
effectively.
4.1 Generation of Triggers
4.1.1 Universal Triggers
Universal triggers are created using the Universal Adversarial Attack
(Section 3.1). Initially, the token representing the word "the" is
prepended to the targeted examples, and then the trigger search
algorithm (Section 3.1.2) is applied to generate the universal
triggers. These triggers are derived using the Enhanced Sequential
Inference Model (ESIM) (Chen et al., 2017) with GloVe embeddings
(Pennington et al., 2014). To simulate a realistic scenario, the
generated triggers are tested using the ELECTRA model, assuming
black-box access. This setup mimics real-world conditions, where
white-box access to deployed models is typically unavailable, but
testing their robustness is still necessary.
4.1.2 Random Triggers
As a baseline for the Universal Adversarial Attack,
triggers are generated using a Random Attack ap-
proach. In this method, words are randomly se-
lected from the SNLI dataset’s vocabulary, ensur-
ing a uniform distribution across all three classes.
These selected words are then prepended to the hy-
potheses of SNLI examples to create the random
attack.
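A minimal sketch of the random baseline is shown below. It assumes that "uniform distribution across all three classes" means drawing the same number of random words for each label; the helper names, the seed, and the eight-word vocabulary are hypothetical and do not come from the released code.

```python
import random

def sample_random_triggers(vocabulary, labels=(0, 1, 2), per_class=2, seed=0):
    """Draw distinct random words and split them evenly across the label classes
    (an assumed reading of the uniform-distribution requirement)."""
    rng = random.Random(seed)
    words = rng.sample(sorted(vocabulary), per_class * len(labels))
    return {label: words[i * per_class:(i + 1) * per_class]
            for i, label in enumerate(labels)}

def prepend_trigger(hypothesis, trigger):
    """Attach the trigger word to the front of the hypothesis text."""
    return f"{trigger} {hypothesis}"

# Hypothetical miniature vocabulary standing in for the SNLI vocabulary.
vocabulary = {"diners", "hands", "road", "mass", "remain", "rose", "dog", "park"}
triggers = sample_random_triggers(vocabulary)
attacked = prepend_trigger("a man sleeps.", "road")
print(triggers, attacked)
```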
4.1.3 Examples and Correlation Score
Table 2 presents examples of both universal and random triggers,
along with their corresponding majority class (the class in which the
trigger appears most frequently) and correlation score. The
correlation score is defined as the conditional probability of a label
$l$ given a word $w$, and it is mathematically expressed as:
$$p(l \mid w) = \frac{\mathrm{count}(w, l)}{\mathrm{count}(w)} \tag{4}$$
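The correlation score in Equation (4) and the majority class can be computed with simple counting, as sketched below. Two details are assumptions of this sketch rather than statements from the paper: each word is counted once per example (not once per occurrence), and counts are taken over tokenized hypotheses. The four toy examples are made up.

```python
from collections import Counter

def correlation_scores(examples):
    """examples: iterable of (tokens, label) pairs.
    Returns {(word, label): p(label | word)} as in Eq. 4."""
    word_counts = Counter()
    pair_counts = Counter()
    for tokens, label in examples:
        for word in set(tokens):  # assumption: count each word once per example
            word_counts[word] += 1
            pair_counts[(word, label)] += 1
    return {(w, l): c / word_counts[w] for (w, l), c in pair_counts.items()}

def majority_class(scores, word):
    """Label with the highest p(label | word) for the given word."""
    candidates = {l: p for (w, l), p in scores.items() if w == word}
    return max(candidates, key=candidates.get)

# Made-up tokenized hypotheses with labels (0 = entailment, 2 = contradiction).
toy = [
    (["nobody", "is", "outside"], 2),
    (["nobody", "is", "eating"], 2),
    (["nobody", "else", "is", "there"], 0),
    (["a", "man", "is", "outside"], 0),
]
scores = correlation_scores(toy)
print(scores[("nobody", 2)], majority_class(scores, "nobody"))
```

In this toy data "nobody" occurs in three examples, two of them labeled 2, so its correlation score with label 2 is 2/3 and label 2 is its majority class.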
4.2 Challenge Sets and Trigger-Augmented
Dataset
To systematically evaluate model performance
and mitigate reliance on spurious correlations,
we developed two challenge sets and a Trigger-
Augmented dataset. The challenge sets assess
the model’s robustness under adversarial condi-
tions, while the Trigger-Augmented dataset aims
to enhance generalization by addressing dataset-
specific biases. The construction and purpose of
these datasets are described below:
1. Challenge Set I (Validation split with universal triggers²): This
set evaluates the model’s robustness and understanding of the core NLI
task. It consists of 1,000 examples, randomly sampled from each label
class in the validation split of the SNLI dataset. Universal triggers
(detailed in Section 4.1.1) are prepended to the hypothesis in these
examples. This setup enables a focused evaluation of the model’s
performance across all label classes in the presence of adversarial
triggers.
2. Challenge Set II (Validation split with random triggers³): Designed
as a baseline for comparison with universal triggers, this set follows
the same construction process as Challenge Set I but replaces
universal triggers with random triggers (detailed in Section 4.1.2).
This set provides a point of reference for measuring the impact of
universal triggers on model performance by isolating the effect of
non-specific, randomly chosen triggers.
3. Trigger-Augmented Dataset (Train split with universal triggers): To
reduce the model’s reliance on spurious correlations present in the
original SNLI dataset, a fine-tuning dataset was created. This dataset
contains 6,000 training examples, with 3,000 left unmodified and the
remaining 3,000 modified by prepending universal triggers to their
hypothesis. This augmentation encourages the model to prioritize
semantically meaningful features over spurious patterns during
training (refer to Section 3.2).
² Our database for universal triggers can be found at https://huggingface.co/datasets/ckverma/snli_universal
By utilizing these datasets, we aim to system-
atically evaluate and enhance the robustness of the
ELECTRA model under both adversarial and stan-
dard conditions.
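The construction of the Trigger-Augmented dataset can be sketched as below: half of the sampled examples are kept as they are and the other half get a universal trigger prepended to the hypothesis. The trigger lists are the ones shown in Table 2; the function name, the random choice of which examples to modify, the random choice among a class's triggers, and the field names `premise`, `hypothesis`, and `label` are assumptions of this sketch.

```python
import random

# Universal triggers per targeted (gold) class, as listed in Table 2.
UNIVERSAL_TRIGGERS = {
    0: ["nobody", "no"],
    1: ["cats", "cat"],
    2: ["joyously", "celebrating"],
}

def build_trigger_augmented(examples, triggers_by_label, n_clean, n_triggered, seed=0):
    """Sample n_clean + n_triggered examples; leave the first group unmodified
    and prepend a class-specific trigger to the hypotheses of the second group."""
    rng = random.Random(seed)
    chosen = rng.sample(examples, n_clean + n_triggered)
    augmented = [dict(ex) for ex in chosen[:n_clean]]
    for ex in chosen[n_clean:]:
        trigger = rng.choice(triggers_by_label[ex["label"]])
        augmented.append({**ex, "hypothesis": f"{trigger} {ex['hypothesis']}"})
    rng.shuffle(augmented)  # mix clean and triggered examples
    return augmented

# Hypothetical miniature training split; the paper uses 3,000 + 3,000 examples.
toy_train = [{"premise": f"p{i}", "hypothesis": f"h{i}", "label": i % 3}
             for i in range(12)]
augmented = build_trigger_augmented(toy_train, UNIVERSAL_TRIGGERS,
                                    n_clean=3, n_triggered=3)
print(len(augmented))
```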
4.3 Training and Evaluation process
The ELECTRA model⁴ was trained and fine-tuned
on a single machine equipped with an NVIDIA T4
GPU. The training process utilized the Hugging-
Face Trainer framework, configured with a max-
imum sequence length of 128 to ensure that over
96% of examples from the SNLI dataset were fully
captured without truncation. A batch size of 256
was selected, while all other parameters were left
at their default settings.
Training was conducted in two stages. First, the model was trained on
the original SNLI dataset for three epochs to establish a strong
foundational understanding of natural language inference. This was
followed by a fine-tuning phase, where the model was adapted using the
Trigger-Augmented dataset to enhance robustness and mitigate reliance
on spurious correlations. This two-stage process allowed the model to
effectively balance learning from the original data and adapting to
the trigger-augmented examples.
³ Our database for random triggers can be found at https://huggingface.co/datasets/ckverma/snli_random
⁴ Our code for training the model can be found at https://github.com/ckvermaAI/SNLI-Attack-Analysis.git
5 Results and Analysis
5.1 Training ELECTRA on the SNLI Dataset
The ELECTRA model was trained on the SNLI dataset for three epochs,
achieving a validation accuracy of 88.98%. During this initial phase,
evaluations were performed on three datasets: the validation subset
(comprising 1,000 randomly sampled examples from the SNLI validation
split), Challenge Set I, and Challenge Set II. The results from this
stage are summarized in the first three rows of Table 3.
5.2 Fine-Tuning ELECTRA model
In the second phase, the model underwent fine-
tuning on the Trigger-Augmented dataset for
one epoch. This step was designed to reduce
the model’s dependence on spurious correlations
present in the original SNLI dataset, thereby
enhancing its robustness. Post-fine-tuning, the
model was re-evaluated on the validation subset
and Challenge Set I, with the results documented
in the last two rows of Table 3.
5.3 Analyzing the Results
5.3.1 Effectiveness of Triggers
Table 4 highlights the impact of universal and random
triggers. Universal triggers effectively alter
the model’s predictions for entailment and neu-
tral examples, often shifting them to other classes.
In contrast, random triggers have minimal influ-
ence, affecting approximately 20% of the entail-
ment examples. These findings demonstrate the
superior efficacy of universal triggers in manip-
ulating model predictions (compared to random
triggers).
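A per-class prediction distribution of the kind reported in Table 4 can be obtained by row-normalizing a confusion matrix, as in the sketch below. The function name and the eight toy gold/predicted labels are hypothetical; only the row-normalization idea is taken from the table's description.

```python
from collections import Counter

def prediction_distribution(gold_labels, predicted_labels, num_classes=3):
    """For each gold class, return the percentage of its examples that the
    model assigns to each class (each row sums to 100 when non-empty)."""
    counts = {g: Counter() for g in range(num_classes)}
    for gold, pred in zip(gold_labels, predicted_labels):
        counts[gold][pred] += 1
    table = {}
    for gold, row in counts.items():
        total = sum(row.values())
        table[gold] = [100.0 * row[p] / total if total else 0.0
                       for p in range(num_classes)]
    return table

# Made-up labels: 0 = entailment, 1 = neutral, 2 = contradiction.
gold = [0, 0, 0, 0, 1, 1, 2, 2]
pred = [0, 0, 2, 1, 1, 2, 2, 2]
dist = prediction_distribution(gold, pred)
print(dist)
```

The diagonal entry of each row is the per-class accuracy, and the off-diagonal entries show which class absorbs the errors.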
5.3.2 Success of Universal Triggers
The Universal Adversarial Attack generates triggers that are strongly
correlated with a competing class (a class other than the targeted
one). Table 2 highlights the high correlation scores between the
universal triggers and their associated dominant (or majority) class.
When these triggers are prepended to SNLI examples from the targeted
class, they exploit the model’s reliance on spurious correlations,
leading it to favor the competing class over the targeted class. For
example, the universal trigger "nobody" is closely associated with the
contradiction class. When prepended to examples from the entailment
class, it causes the model to misclassify them as contradictions (1st
row in Table 1).
5.3.3 De-biasing the Model
The uniform distribution of universal triggers in the
Trigger-Augmented dataset helps the model unlearn spurious
correlations present in the original SNLI dataset. The two-stage
training process balances foundational learning from the original data
while mitigating biases using the Trigger-Augmented dataset. As shown
in Table 3 (1st, 2nd, and 5th rows), this approach significantly
enhances the model’s overall performance and resilience (as
highlighted in Section 3.2).
5.3.4 Decoding Attacks on the Contradiction
Class
The contradiction class contains more correlated
words in comparison to the entailment and neu-
tral classes. The cumulative frequency of the top
five correlated words is 312 for contradictions, 128
for neutral, and 57 for entailment. This abundance
of correlated words makes contradictions particu-
larly vulnerable. However, flipping predictions for
contradiction-class examples to entailment or neu-
tral by simply prepending tokens is feasible only if
the example lacks these giveaway words. This is the reason why the
ELECTRA model’s accuracy on contradiction examples drops by only 7.43
percentage points with the introduction of universal triggers (refer
to the results in Table 3).
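The 7.43-point figure follows directly from the pre-fine-tuning rows of Table 3. The short computation below reproduces it alongside the corresponding drops for the other two classes; the dictionary names are illustrative only.

```python
# Pre-fine-tuning accuracies (%) from Table 3.
validation = {"entailment": 90.23, "neutral": 86.70, "contradiction": 91.06}
challenge_universal = {"entailment": 25.78, "neutral": 25.76, "contradiction": 83.63}

# Drop in accuracy, in percentage points, caused by universal triggers.
drops = {label: round(validation[label] - challenge_universal[label], 2)
         for label in validation}
print(drops)
```

The drop is about 64.45 points for entailment and 60.94 points for neutral, against 7.43 points for contradiction.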
6 Conclusion and Future work
In this study, we systematically investigated the
vulnerabilities and biases in NLI models, propos-
ing methods to enhance their robustness. Our find-
ings demonstrated the effectiveness of universal
triggers in exploiting spurious correlations to ma-
nipulate NLI model predictions, significantly out-
performing random triggers. Moreover, Trigger-
Augmented training proved successful in mitigat-
ing biases, thereby improving model resilience.
Dataset | Triggers (Model) | Entailment (%) | Neutral (%) | Contradiction (%)
Validation Subset | Original (Pre-Finetune) | 90.23 | 86.70 | 91.06
Challenge Set I | Universal (Pre-Finetune) | 25.78 | 25.76 | 83.63
Challenge Set II | Random (Pre-Finetune) | 71.20 | 83.98 | 91.57
Validation Subset | Original (Post-Finetune) | 88.92 | 84.81 | 91.16
Challenge Set I | Universal (Post-Finetune) | 90.13 | 87.53 | 91.96
Table 3: Performance Summary of ELECTRA on Validation subset of SNLI dataset and Challenge Sets (I and II)
Ground Truth | Data | E% | N% | C%
Entailment | Validation subset | 90.23 | 7.65 | 2.11
Entailment | Challenge Set I | 25.78 | 32.93 | 41.29
Entailment | Challenge Set II | 71.20 | 21.85 | 6.95
Neutral | Validation subset | 7.33 | 86.70 | 5.97
Neutral | Challenge Set I | 0.63 | 25.76 | 73.61
Neutral | Challenge Set II | 5.97 | 83.98 | 10.05
Contradiction | Validation subset | 1.31 | 7.63 | 91.06
Contradiction | Challenge Set I | 0.60 | 15.76 | 83.63
Contradiction | Challenge Set II | 1.41 | 7.03 | 91.57
Table 4: ELECTRA model’s prediction distribution for different datasets. Each row shows a particular dataset
and each column shows how often the model predicts a particular class. For example, on Challenge Set I, neutral
examples are classified as contradiction examples 73.61% of the time.
This approach also underscored the nuanced chal-
lenges associated with attacking the contradiction
class, shedding light on areas requiring further ex-
ploration.
For future work, we aim to explore diverse attack
strategies (Song et al., 2021) beyond merely
prepending triggers to hypotheses. Such strategies
will help uncover additional weaknesses in NLI
datasets, providing deeper insights into designing
more robust datasets and improving the training
processes for NLI models.
References
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. CoRR, abs/1508.05326.
Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced lstm for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: pre-training text encoders as discriminators rather than generators. CoRR, abs/2003.10555.
Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The pascal recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment, pages 177–190, Berlin, Heidelberg. Springer Berlin Heidelberg.
Ido Dagan, Dan Roth, Mark Sammons, and Fabio Zanzotto. 2013. Recognizing textual entailment: Models and applications. Synthesis Lectures on Human Language Technologies, 6(4):1–222.
Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31–36, Melbourne, Australia. Association for Computational Linguistics.
Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models’ local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1307–1323, Online. Association for Computational Linguistics.
Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. CoRR, abs/1803.02324.
He He, Sheng Zha, and Haohan Wang. 2019. Unlearn dataset bias in natural language inference by fitting the residual. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 132–142, Hong Kong, China. Association for Computational Linguistics.
Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. Association for Computational Linguistics.
Nelson F. Liu, Roy Schwartz, and Noah A. Smith. 2019. Inoculation by fine-tuning: A method for analyzing challenge datasets. CoRR, abs/1904.02668.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. CoRR, abs/1805.01042.
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics.
Liwei Song, Xinwei Yu, Hsuan-Tung Peng, and Karthik Narasimhan. 2021. Universal adversarial attacks with natural triggers for text classification.
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for NLP. CoRR, abs/1908.07125.