arXiv paper 2412.11172v1

Unpacking the Resilience of SNLI Contradiction Examples to Attacks

Authors: Chetan Verma, Archit Agarwal

Published: 2024-12-15

Abstract:

Pre-trained models excel on NLI benchmarks like SNLI and MultiNLI, but their true language understanding remains uncertain. Models trained only on hypotheses and labels achieve high accuracy, indicating reliance on dataset biases and spurious correlations. To explore this issue, we applied the Universal Adversarial Attack to examine the model's vulnerabilities. Our analysis revealed substantial drops in accuracy for the entailment and neutral classes, whereas the contradiction class exhibited a smaller decline. Fine-tuning the model on an augmented dataset with adversarial examples restored its performance to near-baseline levels for both the standard and challenge sets. Our findings highlight the value of adversarial triggers in identifying spurious correlations and improving robustness while providing insights into the resilience of the contradiction class to adversarial attacks.

Paper Content:
arXiv:2412.11172v1 [cs.CL] 15 Dec 2024

Chetan Verma, Department of Computer Science, University of Texas at Austin, chetan.kumar.verma@utexas.edu
Archit Agarwal, Department of Computer Science, University of Texas at Austin, aa2023mscs@utexas.edu

1 Introduction

Natural Language Inference (NLI) is a foundational task in Natural Language Processing (NLP) that evaluates a model's natural language understanding (NLU). It involves determining whether a hypothesis is true (Entailment), false (Contradiction), or cannot be determined (Neutral) given a premise (Dagan et al., 2006, 2013). This reasoning ability is critical for mimicking human understanding and supports a wide range of applications. Consequently, when models achieve high accuracy on this task, it is often claimed that they have strong NLU capabilities. However, recent research (Poliak et al., 2018; Gururangan et al., 2018) shows that models achieve high accuracy even when trained on hypothesis-only datasets.
This suggests that models exploit spurious correlations and superficial patterns, known as dataset artifacts, to predict the correct label rather than genuinely understanding the language.

To investigate this phenomenon and explore ways to improve the model, we focused on the Stanford NLI (SNLI) dataset (Bowman et al., 2015), one of the most widely used benchmarks for NLI tasks. For our study, we selected Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA-small [1]) (Clark et al., 2020). This model retains the same architecture as Bidirectional Encoder Representations from Transformers (BERT) but incorporates an improved training method. Moreover, ELECTRA is computationally efficient, making it a practical alternative to larger, more resource-intensive models.

Adversarial datasets were chosen as a key tool for assessing robustness due to their flexibility and cross-dataset applicability. In particular, we explored a method used in a previous study to generate Universal Adversarial Triggers (Wallace et al., 2019) to create a testing dataset and evaluated the robustness of ELECTRA. These triggers are transferable across models for all datasets, and unlike other adversarial attacks, they are context-independent and hence provide new insights into the general input-output patterns learned by the model.

To address the identified gaps and enhance the model's robustness, we fine-tuned it using a small, augmented training dataset (Liu et al., 2019). A detailed discussion of our methodology and analysis is provided later in the paper.

2 Background and Related Work

The evaluation and enhancement of machine learning models' robustness has emerged as a critical focus in recent research.
A variety of techniques have been developed to create challenge sets that expose model vulnerabilities, with notable approaches including: (1) Contrast Sets: These are manually crafted modifications to test data that introduce small, label-changing alterations while maintaining lexical and syntactic integrity (Gardner et al., 2020); (2) Checklist Sets: This approach involves a systematic, task-agnostic framework for testing NLP models. Drawing inspiration from behavioral testing principles in software engineering, it employs diverse test types to evaluate model performance (Ribeiro et al., 2020); (3) Adversarial Challenge Sets: These datasets are deliberately modified to provoke problematic outputs, revealing critical weaknesses in models (Jia and Liang, 2017; Wallace et al., 2019).

[1] We will use ELECTRA as an alias to refer to ELECTRA-small throughout this paper.

To address the vulnerabilities identified through such challenge sets, researchers have proposed and employed several mitigation techniques, including: (1) Adversarial Data Training: This approach incorporates challenge sets directly into the training process or employs adversarial data augmentation to strengthen model performance against adversarial inputs (Liu et al., 2019); (2) Ensemble-based Debiasing: This approach learns a biased model that captures dataset-specific clues and then trains a debiased model on the residual errors (He et al., 2019).

These techniques collectively represent significant strides in fortifying machine learning models against potential vulnerabilities, ensuring more reliable and robust performance in diverse applications.

3 Methodology

3.1 Universal Adversarial Attack

Universal adversarial triggers are generated tokens designed to manipulate a model's predictions. When added to the beginning or end of an input, these triggers can compel the model to produce a prediction that deviates from the gold label.

Why Universal Triggers?
(1) Black-Box Capability: These triggers can be generated without any access to the target model, making them effective even in black-box scenarios; (2) Model Transferability: These triggers are universal, that is, they are capable of attacking and transferring across different models; and (3) Context Independence: Their independence from context provides valuable insights into the general input-output patterns of the model. Table 1 illustrates the impact of universal adversarial triggers in altering predicted labels, highlighting their effectiveness in manipulating model outputs.

3.1.1 Attack Equation

Given a model $f$ (with white-box access), a text input composed of tokens $t$ (which could represent words, sub-words, or characters), and a target label $\tilde{y}$, the goal is to generate triggers $t_{adv}$ to add to the front or back of the input tokens $t$. In a non-universal adversarial setting, this can be mathematically expressed as:

$$f(t_{adv}; t) = \tilde{y} \tag{1}$$

For a universal adversarial setting, the goal is to optimize the universal trigger such that the loss function for the target class $\tilde{y}$ is minimized across all inputs from a dataset. Mathematically, this can be represented as:

$$\arg\min_{t_{adv}} \; \mathbb{E}_{t \sim \mathcal{T}} \left[ \mathcal{L}(\tilde{y}, f(t_{adv}; t)) \right] \tag{2}$$

where $\mathcal{T}$ represents all input instances from the dataset, and $\mathcal{L}$ represents the loss function.

3.1.2 Trigger Search Algorithm

We start by selecting a trigger length: longer triggers tend to be more effective, whereas shorter triggers are less noticeable. To initialize trigger creation, we prepend a simple token such as the character "a" or the word "the" to the beginning of all inputs.

We incrementally refine the tokens in the trigger to optimize the loss function associated with the target prediction, leveraging a technique inspired by HotFlip (Ebrahimi et al., 2018) that uses the gradient with respect to each trigger token to choose its replacement. To use this technique, the trigger tokens $t_{adv}$ are represented as one-hot vectors and embedded to form $e_{adv}$.
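The gradient-guided replacement step can be illustrated with a minimal sketch. It assumes toy two-dimensional embeddings and a made-up gradient vector; the function name `hotflip_candidates` and every value below are hypothetical and are not taken from the authors' code. Each vocabulary token is scored by the first-order term $(e'_i - e_{adv_i})^T \nabla_{e_{adv_i}} \mathcal{L}$ from Equation 3, and the lowest-scoring tokens are kept as replacement candidates.

```python
from typing import Dict, List


def dot(u: List[float], v: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def hotflip_candidates(
    vocab_embeddings: Dict[str, List[float]],
    current_embedding: List[float],
    avg_gradient: List[float],
    top_k: int = 1,
) -> List[str]:
    """Rank vocabulary tokens by the first-order estimate of the loss change.

    score(w) = (e_w - e_adv)^T grad. The lowest score marks the token that the
    linear approximation predicts will lower the target-label loss the most.
    """
    scores = {
        token: dot([e - c for e, c in zip(emb, current_embedding)], avg_gradient)
        for token, emb in vocab_embeddings.items()
    }
    return sorted(scores, key=scores.get)[:top_k]


# Toy embeddings and gradient, invented purely for illustration.
toy_vocab = {
    "the": [0.0, 0.0],
    "nobody": [1.0, -1.0],
    "cats": [0.5, 0.5],
}
current = toy_vocab["the"]   # the trigger is initialised with "the"
gradient = [-2.0, 1.0]       # stand-in for the batch-averaged loss gradient

best = hotflip_candidates(toy_vocab, current, gradient, top_k=2)
print(best)
```

With these toy values the scores are -3 for "nobody", -0.5 for "cats", and 0 for "the", so the call returns `['nobody', 'cats']`. Because the score is only a linear approximation, implementations commonly re-check the top candidates against the true loss before committing to a replacement.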
The HotFlip-inspired technique uses a linear approximation of the task loss. Specifically, we update the embedding of each trigger token $e_{adv_i}$ to minimize the loss by applying a first-order Taylor approximation around the current token embedding:

$$\arg\min_{e'_i \in \mathcal{V}} \left[ e'_i - e_{adv_i} \right]^{T} \nabla_{e_{adv_i}} \mathcal{L} \tag{3}$$

where $\mathcal{V}$ is the set of all token embeddings in the model's vocabulary, and $\nabla_{e_{adv_i}} \mathcal{L}$ is the average gradient of the task loss over a batch. For our use case, NLI (a classification task), we use the cross-entropy loss to optimize the attack.

| Gold Label | Input (Premise and Hypothesis) | Predicted Label |
| --- | --- | --- |
| Entailment (0) | Premise: A woman in a purple shirt buying food from a street vendor. Hypothesis: nobody a woman makes a purchase from a vendor outside. | 0→2 |
| Neutral (1) | Premise: A youth is kicking a soccer ball in an empty brick area. Hypothesis: cats a funny human kicking. | 1→2 |
| Contradiction (2) | Premise: A person dressed in red and black outside a cracked wall. Hypothesis: anxiously a person in red and black falling through the ionosphere. | 2→1 |

Table 1: Effect of universal triggers (the first word of each hypothesis; highlighted in red in the original) on predicted labels across three classes

| Targeted Class | Universal: Trigger | Universal: Majority Class | Universal: Score | Random: Trigger | Random: Majority Class | Random: Score |
| --- | --- | --- | --- | --- | --- | --- |
| Entailment (0) | nobody | 2 | 0.96 | diners | 1 | 0.47 |
| | no | 2 | 0.83 | hands | 2 | 0.38 |
| Neutral (1) | cats | 2 | 0.96 | road | 1 | 0.40 |
| | cat | 2 | 0.85 | mass | 0 | 0.38 |
| Contradiction (2) | joyously | 1 | 1.00 | remain | 1 | 0.62 |
| | celebrating | 1 | 0.79 | rose | 1 | 0.59 |

Table 2: Comparison of Universal and Random Triggers

3.2 Inoculation by Fine-Tuning

To address the identified vulnerabilities, we employed the Inoculation by Fine-Tuning technique (Liu et al., 2019). This method involves fine-tuning a pre-trained model on a small, carefully designed training dataset.
Following fine-tuning, the model typically exhibits one of three behaviors:

1. Reduced Performance Gap: The performance disparity between the original test set and the challenge set decreases, with the model maintaining strong performance across both datasets. This outcome suggests that the observed gap originates from the dataset itself rather than inherent limitations of the model.
2. Unchanged Performance: The model's performance remains static, indicating an inability to adapt to the challenge set. This points to potential limitations within the model's architecture or design as the root cause.
3. Decreased Performance: The model's performance on the original dataset declines, even if improvements are observed on the challenge set. This behavior indicates potential overfitting to the adversarial examples introduced during fine-tuning, rather than addressing the underlying issue.

By applying this technique, we effectively diagnosed and mitigated the identified issues, strengthening the system's resilience and addressing key vulnerabilities.

4 Experiments

This section describes the process of generating triggers (a single word in this case), creating attacks using these triggers, and subsequently training and evaluating the ELECTRA model to assess its ability to learn and perform the underlying NLI task effectively.

4.1 Generation of Triggers

4.1.1 Universal Triggers

Universal triggers are created using the Universal Adversarial Attack (Section 3.1). Initially, the token representing the word "the" is prepended to the targeted examples, and then the trigger search algorithm (Section 3.1.2) is applied to generate the universal triggers. These triggers are derived using the Enhanced Sequential Inference Model (ESIM) (Chen et al., 2017) with GloVe embeddings (Pennington et al., 2014). To simulate a realistic scenario, the generated triggers are tested using the ELECTRA model, assuming black-box access.
This setup mimics real-world conditions, where white-box access to deployed models is typically unavailable, but testing their robustness is still necessary.

4.1.2 Random Triggers

As a baseline for the Universal Adversarial Attack, triggers are generated using a Random Attack approach. In this method, words are randomly selected from the SNLI dataset's vocabulary, ensuring a uniform distribution across all three classes. These selected words are then prepended to the hypotheses of SNLI examples to create the random attack.

4.1.3 Examples and Correlation Score

Table 2 presents examples of both universal and random triggers, along with their corresponding majority class (the class in which the trigger appears most frequently) and correlation score. The correlation score is defined as the conditional probability of a label $l$ given a word $w$, and it is mathematically expressed as:

$$p(l \mid w) = \frac{\mathrm{count}(w, l)}{\mathrm{count}(w)} \tag{4}$$

4.2 Challenge Sets and Trigger-Augmented Dataset

To systematically evaluate model performance and mitigate reliance on spurious correlations, we developed two challenge sets and a Trigger-Augmented dataset. The challenge sets assess the model's robustness under adversarial conditions, while the Trigger-Augmented dataset aims to enhance generalization by addressing dataset-specific biases. The construction and purpose of these datasets are described below:

1. Challenge Set I (Validation split with universal triggers; our database for universal triggers can be found at https://huggingface.co/datasets/ckverma/snli_universal): This set evaluates the model's robustness and understanding of the core NLI task. It consists of 1,000 examples, randomly sampled from each label class in the validation split of the SNLI dataset. Universal triggers (detailed in Section 4.1.1) are prepended to the hypothesis in these examples.
This setup enables a focused evaluation of the model's performance across all label classes in the presence of adversarial triggers.

2. Challenge Set II (Validation split with random triggers [3]): Designed as a baseline for comparison with universal triggers, this set follows the same construction process as Challenge Set I but replaces universal triggers with random triggers (detailed in Section 4.1.2). This set provides a point of reference for measuring the impact of universal triggers on model performance by isolating the effect of non-specific, randomly chosen triggers.

3. Trigger-Augmented Dataset (Train split with universal triggers): To reduce the model's reliance on spurious correlations present in the original SNLI dataset, a fine-tuning dataset was created. This dataset contains 6,000 training examples, with 3,000 left unmodified and the remaining 3,000 modified by prepending universal triggers to their hypothesis. This augmentation encourages the model to prioritize semantically meaningful features over spurious patterns during training (refer to Section 3.2).

By utilizing these datasets, we aim to systematically evaluate and enhance the robustness of the ELECTRA model under both adversarial and standard conditions.

4.3 Training and Evaluation process

The ELECTRA model [4] was trained and fine-tuned on a single machine equipped with an NVIDIA T4 GPU. The training process utilized the HuggingFace Trainer framework, configured with a maximum sequence length of 128 to ensure that over 96% of examples from the SNLI dataset were fully captured without truncation. A batch size of 256 was selected, while all other parameters were left at their default settings. Training was conducted in two stages.
First, the model was trained on the original SNLI dataset for three epochs to establish a strong foundational understanding of natural language inference. This was followed by a fine-tuning phase, where the model was adapted using the Trigger-Augmented dataset to enhance robustness and mitigate reliance on spurious correlations. This two-stage process allowed the model to effectively balance learning from the original data and adapting to the additional challenge sets.

[3] Our database for random triggers can be found at https://huggingface.co/datasets/ckverma/snli_random
[4] Our code for training the model can be found at https://github.com/ckvermaAI/SNLI-Attack-Analysis.git

5 Results and Analysis

5.1 Training ELECTRA on the SNLI Dataset

The ELECTRA model was trained on the SNLI dataset for three epochs, achieving a validation accuracy of 88.98%. During this initial phase, evaluations were performed on three datasets: the validation subset (comprising 1,000 randomly sampled examples from the SNLI validation split), Challenge Set I, and Challenge Set II. The results from this stage are summarized in the first three rows of Table 3.

5.2 Fine-Tuning ELECTRA model

In the second phase, the model underwent fine-tuning on the Trigger-Augmented dataset for one epoch. This step was designed to reduce the model's dependence on spurious correlations present in the original SNLI dataset, thereby enhancing its robustness. After fine-tuning, the model was re-evaluated on the validation subset and Challenge Set I, with the results documented in the last two rows of Table 3.

5.3 Analyzing the Results

5.3.1 Effectiveness of Triggers

Table 4 highlights the impact of universal and random triggers. Universal triggers effectively alter the model's predictions for entailment and neutral examples, often shifting them to other classes. In contrast, random triggers have minimal influence, affecting approximately 20% of the entailment examples.
These findings demonstrate the superior efficacy of universal triggers in manipulating model predictions (compared to random triggers).

5.3.2 Success of Universal Triggers

The Universal Adversarial Attack generates triggers that are strongly correlated with a competing class (a class other than the targeted class). Table 2 highlights the high correlation scores between the universal triggers and their associated dominant (or majority) class. When these triggers are prepended to SNLI examples from the targeted class, they exploit the model's reliance on spurious correlations, leading it to favor the competing class over the correct one. For example, the universal trigger "nobody" is closely associated with the contradiction class. When prepended to examples from the entailment class, it causes the model to misclassify them as contradictions (1st row in Table 1).

5.3.3 De-biasing the Model

The uniform distribution of universal triggers in the Trigger-Augmented dataset helps the model unlearn spurious correlations present in the original SNLI dataset. The two-stage training process balances foundational learning from the original data while mitigating biases using the Trigger-Augmented dataset. As shown in Table 3 (1st, 2nd, and 5th rows), this approach significantly enhances the model's overall performance and resilience (as highlighted in Section 3.2).

5.3.4 Decoding Attacks on the Contradiction Class

The contradiction class contains more correlated words in comparison to the entailment and neutral classes. The cumulative frequency of the top five correlated words is 312 for contradictions, 128 for neutral, and 57 for entailment. This abundance of correlated words makes contradictions particularly vulnerable. However, flipping predictions for contradiction-class examples to entailment or neutral by simply prepending tokens is feasible only if the example lacks these giveaway words.
This is the reason why the ELECTRA model's ability to correctly predict contradiction examples drops by only 7.43 percentage points (from 91.06% to 83.63%) with the introduction of universal triggers (refer to the results in Table 3).

6 Conclusion and Future work

In this study, we systematically investigated the vulnerabilities and biases in NLI models, proposing methods to enhance their robustness. Our findings demonstrated the effectiveness of universal triggers in exploiting spurious correlations to manipulate NLI model predictions, significantly outperforming random triggers. Moreover, Trigger-Augmented training proved successful in mitigating biases, thereby improving model resilience. This approach also underscored the nuanced challenges associated with attacking the contradiction class, shedding light on areas requiring further exploration.

| Dataset | Triggers (Model) | Entailment (%) | Neutral (%) | Contradiction (%) |
| --- | --- | --- | --- | --- |
| Validation Subset | Original (Pre-Finetune) | 90.23 | 86.70 | 91.06 |
| Challenge Set I | Universal (Pre-Finetune) | 25.78 | 25.76 | 83.63 |
| Challenge Set II | Random (Pre-Finetune) | 71.20 | 83.98 | 91.57 |
| Validation Subset | Original (Post-Finetune) | 88.92 | 84.81 | 91.16 |
| Challenge Set I | Universal (Post-Finetune) | 90.13 | 87.53 | 91.96 |

Table 3: Performance Summary of ELECTRA on Validation subset of SNLI dataset and Challenge Sets (I and II)

| Ground Truth | Data | Predicted E (%) | Predicted N (%) | Predicted C (%) |
| --- | --- | --- | --- | --- |
| Entailment | Validation subset | 90.23 | 7.65 | 2.11 |
| | Challenge Set I | 25.78 | 32.93 | 41.29 |
| | Challenge Set II | 71.20 | 21.85 | 6.95 |
| Neutral | Validation subset | 7.33 | 86.70 | 5.97 |
| | Challenge Set I | 0.63 | 25.76 | 73.61 |
| | Challenge Set II | 5.97 | 83.98 | 10.05 |
| Contradiction | Validation subset | 1.31 | 7.63 | 91.06 |
| | Challenge Set I | 0.60 | 15.76 | 83.63 |
| | Challenge Set II | 1.41 | 7.03 | 91.57 |

Table 4: ELECTRA model's prediction distribution for different datasets. Each row shows a particular dataset and each column shows how often the model predicts a particular class (E = Entailment, N = Neutral, C = Contradiction). For example, on Challenge Set I, neutral examples are classified as contradiction examples 73.61% of the time.
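The per-class effect of the universal triggers can be read off Table 3 with a short calculation. The sketch below copies the pre-fine-tuning rows for the validation subset and Challenge Set I from that table; the helper name `accuracy_drop` is hypothetical and only serves this illustration.

```python
# Per-class accuracy (%) of the pre-fine-tuning model, copied from Table 3.
validation_subset = {"Entailment": 90.23, "Neutral": 86.70, "Contradiction": 91.06}
challenge_set_1 = {"Entailment": 25.78, "Neutral": 25.76, "Contradiction": 83.63}


def accuracy_drop(baseline, attacked):
    """Return the per-class drop in percentage points, rounded to 2 decimals."""
    return {label: round(baseline[label] - attacked[label], 2) for label in baseline}


drops = accuracy_drop(validation_subset, challenge_set_1)
for label, drop in drops.items():
    print(f"{label}: {drop:.2f} percentage points")
```

The resulting drops are 64.45 points for entailment, 60.94 for neutral, and 7.43 for contradiction, which matches the 7.43 figure quoted for the contradiction class.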
For future work, we aim to explore diverse attack strategies (Song et al., 2021) beyond merely prepending triggers to hypotheses. Such strategies will help uncover additional weaknesses in NLI datasets, providing deeper insights into designing more robust datasets and improving the training processes for NLI models.

References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. CoRR, abs/1508.05326.
Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. CoRR, abs/2003.10555.
Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment, pages 177–190, Berlin, Heidelberg. Springer Berlin Heidelberg.
Ido Dagan, Dan Roth, Mark Sammons, and Fabio Zanzotto. 2013. Recognizing textual entailment: Models and applications. Synthesis Lectures on Human Language Technologies, 6(4):1–222.
Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31–36, Melbourne, Australia. Association for Computational Linguistics.
Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models' local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1307–1323, Online. Association for Computational Linguistics.
Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. CoRR, abs/1803.02324.
He He, Sheng Zha, and Haohan Wang. 2019. Unlearn dataset bias in natural language inference by fitting the residual. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 132–142, Hong Kong, China. Association for Computational Linguistics.
Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. Association for Computational Linguistics.
Nelson F. Liu, Roy Schwartz, and Noah A. Smith. 2019. Inoculation by fine-tuning: A method for analyzing challenge datasets. CoRR, abs/1904.02668.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. CoRR, abs/1805.01042.
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics.
Liwei Song, Xinwei Yu, Hsuan-Tung Peng, and Karthik Narasimhan. 2021. Universal adversarial attacks with natural triggers for text classification.
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for NLP. CoRR, abs/1908.07125.

---