Paper Content:
Page 1:
LIGHTWEIGHT SAFETY CLASSIFICATION USING PRUNED
LANGUAGE MODELS
Mason Sawtell
Neudesic, an IBM Company
mason.sawtell@neudesic.com
Tula Masterman
Neudesic, an IBM Company
tula.masterman@neudesic.comSandi Besen
IBM
sandi.besen@ibm.com
Jim Brown
Neudesic, an IBM Company
jim.brown@neudesic.com
ABSTRACT
In this paper, we introduce a novel technique for content safety and prompt injection classification for
Large Language Models. Our technique, Layer Enhanced Classification (LEC), trains a Penalized
Logistic Regression (PLR) classifier on the hidden state of an LLM’s optimal intermediate transformer
layer. By combining the computational efficiency of a streamlined PLR classifier with the sophisti-
cated language understanding of an LLM, our approach delivers superior performance surpassing
GPT-4o and special-purpose models fine-tuned for each task. We find that small general-purpose
models (Qwen 2.5 Instruct sizes 0.5B, 1.5B, and 3B) and other transformer-based architectures like
DeBERTa v3 are robust feature extractors allowing simple classifiers to be effectively trained on
fewer than 100 high-quality examples. Importantly, the intermediate transformer layers of these
models typically outperform the final layer across both classification tasks. Our results indicate that
a single general-purpose LLM can be used to classify content safety, detect prompt injections, and
simultaneously generate output tokens. Alternatively, these relatively small LLMs can be pruned to
the optimal intermediate layer and used exclusively as robust feature extractors. Since our results are
consistent on different transformer architectures, we infer that robust feature extraction is an inherent
capability of most, if not all, LLMs.
Keywords Model Pruning, Classification, Large Language Models, Small Language Models, LLM, SLM, Content
Safety, Prompt Injection, Hidden Layers, Transformer
1 Introduction
Since the introduction of LLMs, a primary concern has been detecting inappropriate content in both the user’s input and
the generated output of the LLM. Establishing language model guardrails is a critical requirement of responsible AI
practices. A robust set of guardrails extends beyond the detection of hate speech and sexual content, also detecting when
The opinions expressed in this paper are solely those of the authors and do not necessarily reflect the views or policies of their
respective employers.arXiv:2412.13435v1 [cs.CL] 18 Dec 2024
Page 2:
an LLM strays away from its intended purpose. There have been numerous troubling instances where chat bots have
been coerced into responding to inappropriate requests and therefore produced damaging outputs. Existing solutions for
identifying inappropriate content range in complexity from traditional text classification methods to using an LLM to
classify the text. Our approach combines the computational efficiency of a simple machine learning based classifier
with the robust language understanding provided by an LLM for optimal performance.
Two primary concerns related to LLM use include content safety and prompt injection detection. Content safety involves
identifying inputs and outputs that are harmful, offensive, or otherwise inappropriate [10, 15, 34, 20]. Prompt injections
are attempts by the user to manipulate the language model to behave in unintended ways or respond outside of ethical
guidelines [16, 10]. This is important because prompt injections can compromise the ethical integrity of AI systems,
potentially leading to vulnerabilities in AI-driven applications.
Our contributions to content safety and prompt injection detection are:
•We prove the intermediate hidden state between transformer layers are robust feature extractors. A penalized
logistic regression classifier with the same number of trainable parameters as the size of the hidden state, with
as few as 769 parameters, achieves state-of-the-art performance surpassing GPT-4o – the presumed leader.
•We show that for both content safety and prompt injection classification tasks there exists an optimal interme-
diate transformer layer that produces the necessary features. With only a minimal set of training examples,
the classifier generalizes extremely well to unseen examples. This is particularly valuable in cases where few
high-quality examples are available, making our approach adaptable for a wide variety of custom classification
scenarios.
•We demonstrate that the hidden states of general-purpose LLMs (Qwen 2.5 0.5B Instruct, 1.5B, and 3B)
and special-purpose models fine-tuned for content safety (Llama Guard 3 1B and 8B) or prompt injection
detection (DeBERTa v3 Base Prompt Injection v2), produce features achieving similar classification results.
This indicates LEC generalizes across model architectures and domains. Special-purpose models require even
fewer examples to surpass GPT-4o level performance on both tasks.
•Special-purpose content safety and prompt injection models when pruned and used as feature extractors
outperform their non-pruned versions on their respective tasks.
Content Safety
Source ModelTrainable
Parameter CountMax F1-ScoreF1 at # Examples to Beat
Llama Guard 3 1BF1 at # Examples to Beat
Llama Guard 3 8BF1 at # Examples to Beat
GPT-4o
Qwen 2.5 0.5B Instruct 897 0.95 0.82 (5) 0.82 (5) 0.87 (15)
Llama Guard 3 1B 2049 0.96 0.77 (15) 0.77 (15) 0.83 (55)
Llama Guard 3 8B 4097 0.96 0.82 (15) 0.82 (15) 0.82 (55)
Prompt Injection Detection
Source ModelTrainable
Parameter CountMax F1-ScoreF1 at # Examples to Beat
ProtectAI DeBERTa v3F1 at # Examples to Beat
GPT-4o
Qwen 2.5 0.5B Instruct 897 0.98 0.77 (5) 0.92 (55)
ProtectAI DeBERTa v3 769 0.98 0.81 (15) 0.93 (75)
Table 1: Summary of results from LEC. Each model tested was able to surpass both their base performance as well as GPT-4o for
most layers in fewer than 100 examples. The intermediate model layers attained the highest F1-scores with the fewest number of
training examples across classification tasks. The above results are for the binary content safety classification task and the prompt
injection classification task.
We believe that these results offer promising indications that an LLM’s inference code can be adapted to produce the
necessary embeddings during the course of generating output tokens. Such an approach has the potential to reduce
computational complexity to trivial levels while achieving excellent classification results on content safety and prompt
injection. We discuss this more in the future work section 6.
We concur with other researchers that the various transformer layers focus on different characteristics of the prompt
input [30]. Generally, it appears that early transformer layers focus on local relationships between input tokens and later
layers focus more heavily on global relationships and influencing the next token prediction. Since no part of our content
safety and prompt injection classification tasks involves token generation, it seems natural to prune those layers from
the model to reduce computational complexity.
Although we focused our classification exploration on content safety and prompt injection detection, these same
techniques apply to any text classification task and especially those with limited training examples [5].
2
Page 3:
2 Related Work
2.1 Language Models as Classifiers
Previous work has demonstrated that combining transformer-based language models with penalized logistic regression
on model embeddings can create effective classifiers using a small number of examples (e.g., 10 to 1,000 examples)[5].
By using penalized logistic regression on the model embeddings, Buckmann et al. created classifiers that are robust to
model size as well as quantization. This technique, called linear probing, has traditionally been used to investigate the
hidden states of language models in order to understand what they represent[5, 1, 7]. These works have also shown that
prompting has a significant impact on the performance of these classifiers[5].
Our approach builds on this foundation by introducing layer-specific insights instead of only using the embeddings
from the final layer of the model before the prediction head. We demonstrate that different layers are better suited for
different classification tasks. This layer-focused analysis combined with model pruning enables highly accurate and
efficient classification models for responsible AI focused classification tasks.
2.2 Language Model Pruning
Model pruning techniques aim to reduce the computational requirements of language models by removing non-critical
components of the model to reduce its overall size while maintaining performance[23, 6]. Researchers have found that
in many cases up to half the model layers can be removed and that earlier layers tend to play a more important role
in knowledge understanding compared to later layers[12]. These pruned models are typically fine-tuned to recover
lost performance on tasks like zero-shot classification, generation, and question-answering[23, 12]. Research by D K,
Fischer et al. shows that layer-based pruning can be used to reduce large language models by 30-50% while maintaining
nearly all of their performance as text encoders[17]. They found that using the model as a text-encoder does not require
fine-tuning to recover performance, and that in some cases model performance actually increases when pruned.
Our approach expands on this by using the hidden state from intermediate layers as input to train a classification
model. Unlike other approaches, we do not need to fine-tune the model to recover performance since we find that the
intermediate layers provide better performance than the original model. Our focus is on identifying which of the pruned
layers create the best inputs to use for downstream classification tasks.
2.3 Intermediate Layers
Recent research analyzes the effectiveness of language models’ intermediate layers. Each intermediate layer block
encompasses 2 stages: Multi-Head Self-Attention and a Feedforward Network (MLP) with normalization occurring
before and after the MLP step. After each hidden layer there is an updated contextually aware hidden state that is
produced.
Valeriani et al.’s work demonstrates that in the early layers of the transformer model, the data manifold expands
becoming highly dimensional, then contracts significantly in the intermediate layers, and continues to remain constant
until the later layers of the model where it has a shallow second peak. Their experimentation suggests that the semantic
information of the dataset is better expressed at the end of the first peak – and therefore in the intermediate layers [30].
Skean et al. found that the intermediate layers of SSMs and transformer-based models can yield better performance on a
range of downstream tasks, including classification of embeddings [28]. They show that intermediate model layers have
lower levels of prompt entropy, suggesting that these layers more efficiently eliminate redundancy during training [33].
Our approach further supports the claims made in these papers and goes beyond to practically demonstrate how utilizing
the hidden state of the intermediate layers is beneficial for training a highly effective and computationally efficient
classification model across various tasks.
2.4 Model Explainability
Explainability for deep learning is a critical yet challenging field of research. Due to their complexity and scale, language
models are particularly opaque. Model hallucinations and the generation of harmful content are just some of the many
examples that result from the lack of interpretability within a transformer model. Increased explainability offers more
than the ability for researchers to improve downstream tasks, it also offers the end user clarity and confidence in the
model’s response.
There are several techniques to improve transformer interpretability which Luo et al. categorizes into the two broad
topics of "local analysis" and "global analysis". In local analysis, researchers aim to understand how models generate
3
Page 4:
specific predictions. In global analysis, researchers aim to explain the knowledge or linguistic properties encoded in the
hidden state activations of a model [22]. These local analysis techniques provide human-interpretable explanations of
model outputs by pairing a "black box" model with a more interpretable model [3, 4, 8, 25]. This allows the researcher
to take advantage of the complexity of deep neural networks while maintaining some level of explainability in the
output.
M'
Black-box
ML model
x
y
Transparent design methods
Decision Tree
(Fuzzy) rule-based learning
KNN
Prediction
Explanation
Figure 1: Visualization of a hybrid black-box model.
Mechanistic interpretability is a global analysis technique that seeks to explain behaviors of machine learning models’
internal components. In their work, Wang et al. mechanistically interpret how GPT-2 small implements a natural
language task by iteratively tracing important components back from logits, projecting the embedding space, performing
attention pattern analysis, and using activation patching as part of a circuit analysis. They identify an induced subgraph
of the model’s computational graph that is human-understandable and responsible for generating an output [32].
Although our research does not focus specifically on interpretability, it provides additional insight into the role of
contextual embeddings and the hidden state between transformer layers. Our findings allow us to formulate supported
hypotheses regarding why layers in some parts of the network tend to produce more effective representations for
classification tasks than others. This understanding sheds light on the distinct properties of hidden layer states and their
potential impact on downstream performance.
2.5 Responsible AI Classification Tasks
Content safety and prompt injections are two of the most well-researched and high-priority use cases related to the
responsible use of Generative AI. Without effective mitigation strategies, these issues can compromise model integrity,
user trust, and overall system security. Numerous methods have been developed to address these classification tasks,
with public leaderboards available to benchmark their performance. Notable public leaderboards include the AI Secure
LLM Safety Leaderboard on HuggingFace[2] and the Lakera PINT Benchmark for Prompt Injection [18].
One detection method for prompt injections is proposed by Hung et al. where they analyze patterns in the attention
heads and introduce a concept called the "distraction effect". The "distraction effect" is where select attention heads shift
focus from the original instruction to the newly introduced instruction. They propose Attention Tracker, a training-free
detection method that monitors attention patterns on instruction to detect prompt injection attacks[14].
Another content safety classification approach is presented in the work of Mozes et al., where they show that LLM-based
parameter-efficient fine-tuning (PEFT) can produce high-performance classifiers using small datasets (as few as 80
examples) across three domains: offensive dialogue, toxicity in online comments, and neutral responses to sensitive
topics. This method offers a cost-effective alternative to large-scale fine-tuning[24].
Our approach distinguishes itself from these existing approaches as it prunes the LLM’s hidden layers and uses just
the optimal number of parameters to be the most efficient yet performant classifier for the task. Additionally, it can be
implemented in two distinct ways: (1) integrated directly into the LLM’s forward pass, similar to the work of Hung
et al. [14], or (2) as a separate component in the model pipeline, akin to the approach used by Mozes et al. [24]. By
offering flexibility in deployment, our method provides a versatile and scalable solution for content safety and prompt
injection detection.
3 Experiments
3.1 Overview
Our experiments explore the effectiveness of training a classifier on the hidden states of intermediate transformer layers
and identify which intermediate layer(s) provide the best performance for both content safety and prompt injection
classification. We compare our approach to baseline models using task-specific datasets.
For each task, we evaluate performance against two types of baseline models, GPT-4o and a task-specific, special-
purpose model. We use GPT-4o as a baseline for both classification tasks since it is widely considered one of the most
4
Page 5:
capable general-purpose language models and in some cases outperforms the special-purpose models. We apply LEC to
a general-purpose model and the same special-purpose model selected for the baseline.
This setup allows us to compare three key aspects:
1.How well LEC performs when applied to a general-purpose model compared to both baseline models (GPT-4o
and the special-purpose model).
2.How much LEC improves the performance of the special-purpose model relative to its own baseline perfor-
mance.
3.How well LEC generalizes across model architectures and domains, by evaluating its performance on both
general-purpose and special-purpose models.
Our experiments include models ranging from 184M to 8B parameters. We do not quantize the models, since existing
research suggests that performance is largely preserved for different quantization levels[5]. This setup provides a robust
comparison of general-purpose and special-purpose models to illustrate the advantages of LEC in both responsible
AI-focused classification tasks.
We select Qwen 2.5 Instruct in sizes 0.5B, 1.5B, and 3B as our general-purpose model for both content safety and
prompt injection classification tasks. We select the following special-purpose models based on the classification task:
1. Content Safety: Meta’s Llama Guard 3 1B and Llama Guard 3 8B [15, 11]
2. Prompt Injection: ProtectAI’s DeBERTa v3 Base Prompt Injection v2[26]
It is important to note that we do not modify the system prompts during training or evaluation to include specific
instructions relevant to the classification tasks. This ensures that the LLM inference pipeline maintains its ability to be
adapted to produce the necessary embeddings while generating the output tokens. We believe that this is critical for
allowing classification tasks to be added to the LLM inference code without impacting computational efficiency.
We measure performance by assessing the weighted average F1-score of the baseline models and the models trained
using LEC. We also examine the impact of the number of training examples on the weighted average F1-score. This
method allows us to assess which layers provide the best performance for each task and how many training examples
are required to achieve optimal performance.
3.2 Experiment Implementation
For both general-purpose Qwen 2.5 Instruct models and special-purpose models, we prune individual layers and capture
the hidden state of the transformer layer to train a classification model. Our implementation uses the Python package
l3prune from D K et al.[17] to load the models from their respective HuggingFace repositories and remove the LM
Head. We iterate through each layer of the model, pruning a single layer each time and capturing the hidden state at the
transformer layer. This allows us to understand the impact of individual layers on the task.
After pruning, we train a PLR classifier with L2 regularization on the output vector of our pruned model. Our PLR
classifier uses the RidgeClassifier class from scikit-learn with α=10. All other settings are left to their default values.
Each model was run on a single A100 GPU with 220GB of memory in an Azure Databricks environment. For our
GPT-4o baseline we used an Azure OpenAI deployment with version "2024-06-01".
We used task-specific datasets, each containing 5,000 examples with 66% allocated to training and 33% to testing.
While previous work suggests that our classifiers will only see small improvements after a few hundred examples
[5], we randomly sampled 5,000 examples to ensure enough data diversity and minimize compute time. For each
experiment, we trained multiple classifiers on a random sample of our training set from sizes 5 to 3,000. Then, we
calculated the weighted F1-score on our 1,700 example test set for each unique training size, layer count, and model.
To establish the baseline models’ performance we calculated the weighted F1-score on the 1,700 example test set.
3.3 Datasets
Content Safety : Our content safety dataset is designed for both binary and multi-class classification. For binary
classification content is either "safe" or "unsafe". For multi-class classification, content is either categorized as "safe" or
assigned to a specific fine-grained category under the broader "unsafe" category. To create a balanced dataset of safe
and unsafe content, we combine two existing datasets, the "SALAD Data" dataset from OpenSafetyLab to represent
unsafe content and the "LMSYS-Chat-1M" dataset from LMSYS, to represent safe content[20, 34]. We randomly
sample 2,500 records from each source to create a combined balanced dataset with 5,000 examples.
5
Page 6:
The full SALAD dataset contains over 21,000 unsafe messages collected from various sources and is organized into three
levels of complexity. Each level contains one category for "safe" and progressively finer-grained "unsafe" categories
with 6 unsafe categories at level 1, 16 at level 2, and 65 at level 3. We randomly sampled 2,500 records from the "base"
set, which contains examples that have not been modified to bypass general LLM filters.
The full LMSYS dataset contains one million real-world conversations collected from 25 LLMs. We filter the dataset
for English-only, first-turn messages that were not flagged as unsafe. We randomly sampled 2,500 records from the
filtered dataset to use as our "safe" examples. These examples are labeled as "O0: Safe" to match the naming convention
used in the SALAD dataset.
Prompt Injection Detection : We use the SPML Chatbot Prompt Injection Dataset which contains 1,800 system
prompts and 20,000 user prompts, from which we randomly sampled 5,000 pairs[27]. In this context, a prompt injection
is defined as any attempt to change the intended behavior of the AI system as defined in the system prompt.
We chose the SPML dataset because of its diversity and complexity in representing real-world chat bot scenarios.
Many other datasets for classifying prompt injection attacks are designed for general-purpose chat bots with broad
instructions to not return unsafe or misleading content. Training on these simpler "no-context" prompt injection
tasks often allows small models to achieve near perfect accuracy, making these datasets insufficiently challenging for
evaluating our approach[26]. In contrast, SPML captures nuanced and domain-specific prompt injection challenges,
effectively addressing real-world challenges and providing the complexity necessary for our evaluation. Selecting
real-world system and user prompts further support the possibility of using the LLM hidden state during normal LLM
inference flows.
4 Results
4.1 Results Summary
Our results indicate that for both content safety and prompt injection classification tasks, using the transformer layer’s
hidden states with PLR classifiers consistently outperforms the baseline models, GPT-4o and the special-purpose models.
Furthermore, applying LEC to the special-purpose models outperforms the models’ own baseline performance by
identifying the most task-relevant layer for the classification task. This ultimately results in a significant improvement in
the F1-score compared to the full model’s performance on the same task. Overall, we find that LEC results in improved
performance across all evaluated tasks, models, and number of training examples, often achieving better performance
than the baselines in fewer than 100 examples.
We find that content safety and prompt injection classification are largely driven by local features that are represented
early on in the transformer network, allowing the middle layers of the model to perform better on the classification
tasks than later layers. These middle layers tend to attain the highest F1-scores within the fewest training examples for
content safety and prompt injection classification tasks. For both tasks, LEC enables the special-purpose models to
generalize to new tasks in a similar domain using fewer than 100 training examples. We infer this capability because, to
our knowledge, the base special-purpose models were not trained on the datasets used in this evaluation.
In each model, we observe that the performance appears to mimic a continuous right-skewed function that is concave
down, with one or two local maximums near 50%-75% of the original number of layers. This is consistent with Gromov
et al.’s assertion that the layers of each model are largely dependent on the previous layer [12]. These findings suggest
that the optimal layer for classification tasks is not the LM head or final encoding, but instead one of the intermediate
layers of the model. Figure 5 provides an example of the classifier performance for each layer of a model.
We also find that intermediate transformer layers tend to show the largest improvement in F1 compared to the final
transformer layer when trained on fewer examples. This suggests that low-resource or low-data cases can especially
benefit from LEC. In each result summary, we show the performance of LEC on 3 selected layers per model. These
layers include the full model’s layers, the best-performing layer, and the smallest layer that achieves similar performance
to the full model.
In summary, LEC provides a more computationally efficient and better performing solution (higher F1-scores with few
training examples) for these classification tasks compared to GPT-4o and the unmodified special-purpose models.
4.2 Content Safety Results
In this section, we present our results for both binary and multi-class classification on the content safety task. Of
the baseline models used for this task – Llama Guard 3 1B, Llama Guard 3 8B, and GPT-4o — GPT-4o consistently
achieved the highest weighted F1-scores for both binary and multi-class content safety classification. However, all
6
Page 7:
the models trained using LEC consistently outperform the Llama Guard models and GPT-4o across all classification
tasks, often surpassing GPT-4o’s performance in as few as 20 training examples for binary classification and 50 training
examples for multi-class classification.
Figure 2: LEC performance of select layers on binary content safety classification for Qwen 2.5 0.5B Instruct, Llama Guard 3 1B,
and Llama Guard 3 8B.
Model Name Layers Parameter Count (B) % of Full SizeMax Weighted
F1-ScoreF1 at # Examples to Beat
Llama Guard 3 1BF1 at # Examples to Beat
Llama Guard 3 8BF1 at # Examples to Beat
GPT-4o
Qwen 2.5 0.5B Instruct 5 0.21 42.65 0.95 0.74 (5) 0.74 (5) 0.84 (15)
Qwen 2.5 0.5B Instruct 12 0.32 63.78 0.95 0.82 (5) 0.82 (5) 0.87 (15)
Qwen 2.5 0.5B Instruct 24 0.49 100.0 0.95 0.76 (5) 0.76 (5) 0.85 (35)
Llama Guard 3 1B 4 0.51 40.94 0.94 0.75 (15) 0.75 (15) 0.86 (65)
Llama Guard 3 1B 7 0.69 55.71 0.96 0.77 (15) 0.77 (15) 0.83 (55)
Llama Guard 3 1B 16 1.24 100.0 0.94 0.79 (5) 0.79 (5) 0.83 (15)
Llama Guard 3 8B 9 2.49 33.16 0.95 0.78 (15) 0.78 (15) 0.83 (35)
Llama Guard 3 8B 12 3.14 41.87 0.96 0.82 (15) 0.82 (15) 0.82 (15)
Llama Guard 3 1B 32 7.5 100.0 0.94 0.79 (15) 0.79 (15) 0.83 (35)
Table 2: Content safety binary classification results for select model layers. The baseline F1-scores for Llama Guard 3 1B, Llama
Guard 3 8B, and GPT-4o were 0.65, 0.71, and 0.82, respectively.
Figure 3: LEC performance of Qwen 2.5 0.5B Instruct on all three levels of the multi-class content safety dataset.
For the binary classification task, all models trained using LEC outperformed the three baseline models. The hidden state
of the middle layers provided the highest weighted F1-score across all general and special-purpose models. Notably,
Qwen 2.5 0.5B Instruct outperformed both Llama Guard 3 baselines within 5 examples. It also surpassed GPT-4o’s
performance in 15 examples, attaining an F1-score of 0.87. Interestingly, using LEC, Llama Guard 3 1B surpassed its
own baseline performance as well as the Llama Guard 3 8B baseline performance within 15 examples, attaining an
F1-score of 0.75. Llama Guard 3 8B also surpassed all baseline models’ performance with an F1-score of 0.82 in 15
examples. With additional training examples, it attained a max F1-score of 0.96. When comparing the models trained
using LEC, the general-purpose and special-purpose models performed similarly, demonstrating that our approach
7
Page 8:
Model Name Layers Parameter Count (B) % of Full Size LevelMax Weighted
F1-ScoreF1 at # Examples to Beat
Llama Guard 1BF1 at # Examples to Beat
Llama Guard 8BF1 at # Examples to Beat
GPT-4o
Qwen 2.5 0.5B Instruct 5 0.21 42.65 1 0.81 0.53 (15) 0.57 (35) 0.59 (55)
2 0.77 0.37 (5) 0.44 (15) 0.48 (35)
3 0.72 0.37 (5) 0.41 (15) 0.45 (55)
Qwen 2.5 0.5B Instruct 12 0.32 63.78 1 0.82 0.56 (15) 0.56 (15) 0.59 (25)
2 0.78 0.39 (5) 0.39 (5) 0.47 (15)
3 0.72 0.38 (5) 0.38 (5) 0.45 (35)
Qwen 2.5 0.5B Instruct 24 0.49 100.0 1 0.82 0.54 (15) 0.59 (35) 0.59 (35)
2 0.79 0.38 (5) 0.38 (5) 0.5 (35)
3 0.73 0.37 (5) 0.41 (15) 0.46 (55)
Llama Guard 3 1B 4 0.51 40.94 1 0.79 0.55 (15) 0.57 (35) 0.63 (55)
2 0.73 0.39 (5) 0.39 (5) 0.48 (35)
3 0.72 0.39 (5) 0.39 (5) 0.45 (55)
Llama Guard 3 1B 7 0.69 55.71 1 0.8 0.58 (15) 0.58 (15) 0.58 (15)
2 0.75 0.39 (5) 0.39 (5) 0.47 (15)
3 0.73 0.39 (5) 0.39 (5) 0.45 (45)
Llama Guard 3 1B 16 1.24 100.0 1 0.8 0.43 (5) 0.63 (15) 0.63 (15)
2 0.77 0.43 (5) 0.43 (5) 0.5 (15)
3 0.75 0.42 (5) 0.42 (5) 0.45 (15)
Llama Guard 3 8B 9 2.49 33.16 1 0.86 0.57 (15) 0.57 (15) 0.59 (25)
2 0.83 0.36 (5) 0.46 (15) 0.48 (25)
3 0.77 0.37 (5) 0.41 (15) 0.45 (45)
Llama Guard 3 8B 12 3.14 41.87 1 0.87 0.62 (15) 0.62 (15) 0.62 (15)
2 0.84 0.36 (5) 0.48 (15) 0.48 (15)
3 0.75 0.37 (5) 0.41 (15) 0.45 (45)
Llama Guard 3 8B 32 7.5 100.0 1 0.85 0.59 (15) 0.59 (15) 0.59 (15)
2 0.83 0.49 (15) 0.49 (15) 0.49 (15)
3 0.77 0.45 (5) 0.45 (5) 0.45 (5)
Table 3: Multi-class content safety classification results for select model layers across all three levels of difficulty. The baseline
F1-scores for Llama Guard 3 1B across each level were 0.40, 0.34, and 0.34. The baseline F1-scores for Llama Guard 3 8B across
each level were 0.56, 0.37, and 0.38. The baseline F1-scores for GPT-4o across each level were 0.58, 0.47, and 0.44.
effectively improves classification performance for both model architectures. Detailed results can be found in Figure 2
and Table 2.
For the multi-class classification task, the intermediate layers of all models using the LEC approach outperformed
the GPT-4o baseline and baseline special-purpose models after being trained with fewer than 60 examples. Typical
of multi-class classification problems, there is an inverse correlation between model performance and number of
categories. Our findings did not suggest a clear number of layers that performed best on the multi-class content safety
task, indicating that more experimentation would be needed prior to implementation based on the model in which LEC
is applied to. However, by utilizing the LEC approach even very small models like Qwen 2.5 0.5B Instruct we were
able to surpass the performance of baseline models such as GPT-4o for a 66 category classification problem using as
few as 35 training examples. Detailed results can be found in Figure 3 and Table 3.
4.3 Prompt Injection Results
Figure 4: Performance of select layers on prompt injection classification for both general-purpose Qwen 2.5 0.5B Instruct and
DeBERTa-v3-Prompt-Injection-v2.
In this section, we present our results for the prompt injection classification task. We find that both general-purpose and
special-purpose models trained using LEC consistently outperform all the baseline models. As can be seen in Figure
8
Page 9:
Model Name Layers Parameter Count (B) % of Full SizeMax Weighted
F1-ScoreF1 at # Examples to Beat
ProtectAI DeBERTa v3F1 at # Examples to Beat
GPT-4o
Qwen 2.5 0.5B Instruct 1 0.15 30.57 0.96 0.73 (5) 0.93 (900)
Qwen 2.5 0.5B Instruct 12 0.32 63.78 0.98 0.77 (5) 0.92 (55)
Qwen 2.5 0.5B Instruct 24 0.49 100.0 0.97 0.76 (15) 0.94 (1000)
ProtectAI DeBERTa v3 8 0.16 84.58 0.98 0.81 (15) 0.92 (135)
ProtectAI DeBERTa v3 10 0.17 92.29 0.98 0.81 (15) 0.93 (75)
ProtectAI DeBERTa v3 12 0.18 100.0 0.94 0.73 (15) 0.94 (2000)
Table 4: Results for select layers on the prompt injection task. The baseline F1-scores for ProtectAI DeBERTa v3 and GPT-4o were
0.73 and 0.92, respectively.
4, applying our approach to a general-purpose model with only 0.5B parameters (Qwen 2.5 0.5B Instruct) achieves a
maximum F1-score of 0.98, surpassing DeBERTa-v3-Prompt-Injection-v2’s baseline performance for all the layers
within 20 examples, and outperforming GPT-4o’s performance for the middle layers in 55 examples.
Qwen 2.5 Instruct surpassed DeBERTa-v3-Prompt-Injection-v2’s baseline performance, F1-score of 0.73, with only 5
training examples across all model sizes (0.5B, 1.5B, and 3B) and layers. Notably, the intermediate layers achieved
the highest weighted average F1-scores across model sizes. As expected, the larger Qwen 2.5 Instruct models (Qwen
2.5 1.5B Instruct and Qwen 2.5 3B Instruct) achieve similar but superior performance compared to the smallest Qwen
2.5 Instruct model, surpassing the baseline GPT-4o performance within fewer examples and attaining slightly higher
F1-scores for certain intermediate layers.
When applying LEC to DeBERTa-v3-Prompt-Injection-v2, it attained an F1-score of 0.81 for the intermediate layers
in 15 examples and surpassed the baseline performance of the base DeBERTa-v3-Prompt-Injection-v2 model. In 75
examples, layer 10 surpassed GPT-4o’s performance and in 135 examples layer 12 did as well, attaining F1-scores of
0.93 and 0.94 respectively. DeBERTa-v3-Prompt-Injection-v2’s maximum F1-score for intermediate layers 8 and 10 is
0.98 which it attained with additional training examples. Detailed results can be found in Figure 4 and Table 4.
Overall, our results show that our approach improves the performance for both general-purpose and special-purpose
models on prompt injection classification. The Qwen 2.5 Instruct and DeBERTa-v3-Prompt-Injection-v2 LEC models
achieve similar results in terms of F1-score and number of examples required for training. This suggests that our
classification approach generalizes across both model architectures. The performance for both Qwen 2.5 Instruct and
DeBERTa-v3-Prompt-Injection-v2 LEC models approaches or exceeds GPT-4o’s baseline performance at certain model
sizes and layers. Due to the small size of Qwen 2.5 Instruct models tested, we believe these results demonstrate an
LLM’s inherent ability to extract high-quality features with significant separation so that a classifier can be trained in
as few as 20 training examples. Since the DeBERTa based model also exhibits this behavior, we infer that nearly all
transformer-based LLMs have this inherent ability.
Figure 5: LEC performance at each layer of the DeBERTa-v3-Prompt-Injection-v2 model for the prompt injection task.
9
Page 10:
5 Conclusion
In conclusion, our method, LEC, which uses the hidden state between intermediate transformer layers as robust feature
extractors for content safety and prompt injection classification outperforms all other current methods tested including
GPT-4o. The classification model is easily trained using penalized logistic regression and only a small number of
training examples are needed. Most importantly, our results demonstrate that high-performing content classification is
possible without modifying the LLM’s weights or modifying the input prompt in any way. This classification approach
has trivial computational complexity at inference time because the classifier contains the same number of parameters as
the size of the LLM’s hidden state. At most a few thousand new parameters are needed for the classification.
We believe that such a lightweight approach allows guardrails to be efficiently established around an LLM’s input and
output. This approach also unlocks the ability for other use case specific classifiers to be created. We are intrigued by
the possibility that content classification can be baked into the LLM inference code providing real-time monitoring of
the LLM’s input and output as tokens are generated.
Our results also show that tiny LLMs can be pruned and used only as robust feature extractors for computationally
efficient text classifiers. These tiny pruned models may be run virtually anywhere depending on the use case complexity.
6 Limitations and Future Work
Limitations : These experiments did not fine-tune the baseline models on our datasets, we instead left them unmodified
and focused on training our classifiers. We chose to use the static transformer layers to leave the possibility open that
LEC could be integrated into the inference code during token generation. Fine-tuning standalone lightweight feature
extractors may perform even better, but we did not explore this possibility.
Our findings are highly task dependent. More work is needed to directly compare the general ability of our method in
other unexplored classification domains. Other classification domains may require more robust models, but we limited
our exploration to a single linear model.
Additionally, we were unable to retrieve a small fraction of results from GPT-4o since sensitive content was blocked by
built-in safety filters. Although this can affect our results since the blocked content is more likely to contain content
labeled ’unsafe’, this accounted for less than 1% of our dataset in all cases. Regardless of GPT-4o’s performance on
these examples, our results are conclusive enough to show that our method outperforms it.
Future Work : LEC has numerous implications when it comes to the deployment of NLP classifiers. First, we show
that models with as little as 100 million parameters and fewer than 100 training examples can make accurate predictions
on a range of classification tasks. This speed and flexibility allows data scientists to test very specific use cases without
significant investment in time, compute, or hardware. We also show that the hidden state of small, general-purpose
models like Qwen 0.5B can be used to train a wide range of effective PLR classifiers. By using smaller models and
concentrating resources into one or a handful of LM deployments, organizations can drastically reduce the amount of
computational power devoted to these tasks.
Our findings also have potential implications in the field of NLP explainability. As language models grow more and
more complex, there is increasing interest in interpreting and understanding their outputs [35]. This is especially
important for fields where automated decision-making can cause direct harm to people. Despite this, many frameworks
for understanding, visualizing, or explaining transformer-based models rely on local explainability through attention
layer activations on individual phrases [19, 31, 9, 29, 13]. In comparison, our approach uses a model that is explainable
at both the local and global level. By utilizing techniques such as SHAP values [21], we can better understand which
components are most important for individual classifier predictions as well as for the model as a whole.
Our results suggest that a general-purpose model can be adapted to classify content safety violations and prompt
injections while simultaneously generating output tokens. Applying LEC to a general-purpose model allows us to
identify which model layers are important for each of the classification tasks. We believe it is possible to take the
outputs from each of these layers, run them through their associated classification model to generate the predictions
and based on the results either continue generating output tokens or stop the output from generating if a violation has
occurred. Alternatively, pruning a very small language model and using its relevant layers for classification would work
well and incredibly quickly, allowing for immediate identification of violations before sending the prompt inputs to a
separate LLM for generation.
10
Page 11:
References
[1] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes .
arXiv:1610.01644 [stat]. Nov. 2018. DOI:10.48550/arXiv.1610.01644 .URL:http://arxiv.org/
abs/1610.01644 (visited on 12/08/2024).
[2] An Introduction to AI Secure LLM Safety Leaderboard . en-US. URL:https://huggingface.co/blog/
leaderboard-decodingtrust (visited on 12/13/2024).
[3] Alejandro Barredo Arrieta et al. Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities
and Challenges toward Responsible AI . arXiv:1910.10045 [cs]. Dec. 2019. DOI:10.48550/arXiv.1910.10045 .
URL:http://arxiv.org/abs/1910.10045 (visited on 12/12/2024).
[4] Kurt Bollacker, Natalia Díaz-Rodríguez, and Xian Li. “Extending Knowledge Graphs with Subjective Influence
Networks for Personalized Fashion”. en. In: Designing Cognitive Cities . Ed. by Edy Portmann et al. Cham:
Springer International Publishing, 2019, pp. 203–233. ISBN : 978-3-030-00317-3. DOI:10.1007/978-3-030-
00317-3_9 .URL:https://doi.org/10.1007/978-3-030-00317-3_9 (visited on 12/12/2024).
[5] Marcus Buckmann and Edward Hill. Logistic Regression makes small LLMs strong and explainable "tens-
of-shot" classifiers . arXiv:2408.03414 [cs]. Oct. 2024. DOI:10.48550/arXiv.2408.03414 .URL:http:
//arxiv.org/abs/2408.03414 (visited on 12/06/2024).
[6] Hongrong Cheng, Miao Zhang, and Javen Qinfeng Shi. MINI-LLM: Memory-Efficient Structured Pruning
for Large Language Models . arXiv:2407.11681 [cs]. July 2024. DOI:10.48550/arXiv.2407.11681 .URL:
http://arxiv.org/abs/2407.11681 (visited on 12/08/2024).
[7] Hyunsoo Cho et al. Prompt-Augmented Linear Probing: Scaling beyond the Limit of Few-shot In-Context
Learners . arXiv:2212.10873 [cs]. June 2023. DOI:10.48550/arXiv.2212.10873 .URL:http://arxiv.
org/abs/2212.10873 (visited on 12/08/2024).
[8] Alberto Fernandez et al. “Evolutionary Fuzzy Systems for Explainable Artificial Intelligence: Why, When,
What for, and Where to?” en. In: IEEE Computational Intelligence Magazine 14.1 (Feb. 2019), pp. 69–81.
ISSN : 1556-603X, 1556-6048. DOI:10.1109/MCI.2018.2881645 .URL:https://ieeexplore.ieee.org/
document/8610271/ (visited on 12/12/2024).
[9] Shafie Gholizadeh and Nengfeng Zhou. Model Explainability in Deep Learning Based Natural Language
Processing . arXiv:2106.07410 [cs]. June 2021. DOI:10.48550/arXiv.2106.07410 .URL:http://arxiv.
org/abs/2106.07410 (visited on 12/11/2024).
[10] Shaona Ghosh et al. AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts .
arXiv:2404.05993 [cs]. Sept. 2024. DOI:10.48550/arXiv.2404.05993 .URL:http://arxiv.org/abs/
2404.05993 (visited on 12/16/2024).
[11] Aaron Grattafiori et al. The Llama 3 Herd of Models . arXiv:2407.21783 [cs]. Nov. 2024. DOI:10.48550/arXiv.
2407.21783 .URL:http://arxiv.org/abs/2407.21783 (visited on 12/06/2024).
[12] Andrey Gromov et al. The Unreasonable Ineffectiveness of the Deeper Layers . arXiv:2403.17887 [cs]. Mar. 2024.
DOI:10.48550/arXiv.2403.17887 .URL:http://arxiv.org/abs/2403.17887 (visited on 12/09/2024).
[13] Benjamin Hoover, Hendrik Strobelt, and Sebastian Gehrmann. exBERT: A Visual Analysis Tool to Explore
Learned Representations in Transformers Models . arXiv:1910.05276 [cs]. Oct. 2019. DOI:10.48550/arXiv.
1910.05276 .URL:http://arxiv.org/abs/1910.05276 (visited on 12/11/2024).
[14] Kuo-Han Hung et al. Attention Tracker: Detecting Prompt Injection Attacks in LLMs . arXiv:2411.00348 [cs].
Nov. 2024. DOI:10.48550/arXiv.2411.00348 .URL:http://arxiv.org/abs/2411.00348 (visited on
12/13/2024).
[15] Hakan Inan et al. Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations .
arXiv:2312.06674 [cs]. Dec. 2023. DOI:10 . 48550 / arXiv . 2312 . 06674 .URL:http : / / arxiv . org /
abs/2312.06674 (visited on 12/09/2024).
[16] Shuyu Jiang, Xingshu Chen, and Rui Tang. Prompt Packer: Deceiving LLMs through Compositional Instruction
with Hidden Attacks . arXiv:2310.10077 [cs]. Oct. 2023. DOI:10.48550/arXiv.2310.10077 .URL:http:
//arxiv.org/abs/2310.10077 (visited on 12/16/2024).
[17] Thennal D. K, Tim Fischer, and Chris Biemann. Large Language Models Are Overparameterized Text Encoders .
arXiv:2410.14578 [cs]. Oct. 2024. DOI:10.48550/arXiv.2410.14578 .URL:http://arxiv.org/abs/
2410.14578 (visited on 12/06/2024).
[18] lakeraai/pint-benchmark . original-date: 2024-03-27T19:04:05Z. Dec. 2024. URL:https://github.com/
lakeraai/pint-benchmark (visited on 12/13/2024).
11
Page 12:
[19] Jaesong Lee, Joong-Hwi Shin, and Jun-Seok Kim. “Interactive Visualization and Manipulation of Attention-
based Neural Machine Translation”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural
Language Processing: System Demonstrations . Ed. by Lucia Specia, Matt Post, and Michael Paul. Copenhagen,
Denmark: Association for Computational Linguistics, Sept. 2017, pp. 121–126. DOI:10.18653/v1/D17-2021 .
URL:https://aclanthology.org/D17-2021 (visited on 12/11/2024).
[20] Lijun Li et al. SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models .
arXiv:2402.05044 [cs]. June 2024. DOI:10.48550/arXiv.2402.05044 .URL:http://arxiv.org/abs/
2402.05044 (visited on 12/06/2024).
[21] Scott Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions . arXiv:1705.07874 [cs].
Nov. 2017. DOI:10.48550/arXiv.1705.07874 .URL:http://arxiv.org/abs/1705.07874 (visited on
12/11/2024).
[22] Haoyan Luo and Lucia Specia. From Understanding to Utilization: A Survey on Explainability for Large
Language Models . arXiv:2401.12874 [cs]. Feb. 2024. DOI:10.48550/arXiv.2401.12874 .URL:http:
//arxiv.org/abs/2401.12874 (visited on 12/12/2024).
[23] Xinyin Ma, Gongfan Fang, and Xinchao Wang. LLM-Pruner: On the Structural Pruning of Large Language
Models . arXiv:2305.11627 [cs]. Sept. 2023. DOI:10.48550/arXiv.2305.11627 .URL:http://arxiv.org/
abs/2305.11627 (visited on 12/08/2024).
[24] Maximilian Mozes et al. Towards Agile Text Classifiers for Everyone . arXiv:2302.06541 [cs]. Oct. 2023. DOI:
10.48550/arXiv.2302.06541 .URL:http://arxiv.org/abs/2302.06541 (visited on 12/13/2024).
[25] Nicolas Papernot and Patrick McDaniel. Deep k-Nearest Neighbors: Towards Confident, Interpretable and
Robust Deep Learning . arXiv:1803.04765 [cs]. Mar. 2018. DOI:10 . 48550 / arXiv . 1803 . 04765 .URL:
http://arxiv.org/abs/1803.04765 (visited on 12/12/2024).
[26] Protect AI. “deberta-v3-base-prompt-injection”. In: (). Publisher: Hugging Face Version Number: 7e5dcc7.
DOI:10.57967/HF/2739 .URL:https://huggingface.co/protectai/deberta-v3-base-prompt-
injection (visited on 12/09/2024).
[27] Reshabh K. Sharma, Vinayak Gupta, and Dan Grossman. SPML: A DSL for Defending Language Models
Against Prompt Attacks . arXiv:2402.11755 [cs]. Feb. 2024. DOI:10 . 48550 / arXiv . 2402 . 11755 .URL:
http://arxiv.org/abs/2402.11755 (visited on 12/06/2024).
[28] Oscar Skean, Md Rifat Arefin, and Ravid Shwartz-Ziv. “Does Representation Matter? Exploring Intermediate
Layers in Large Language Models”. In: Workshop on Machine Learning and Compression, NeurIPS 2024 . 2024.
URL:https://openreview.net/forum?id=FN0tZ9pVLz .
[29] Hendrik Strobelt et al. Seq2Seq-Vis: A Visual Debugging Tool for Sequence-to-Sequence Models .
arXiv:1804.09299 [cs]. Oct. 2018. DOI:10 . 48550 / arXiv . 1804 . 09299 .URL:http : / / arxiv . org /
abs/1804.09299 (visited on 12/11/2024).
[30] Lucrezia Valeriani et al. The geometry of hidden representations of large transformer models . arXiv:2302.00294
[cs]. Oct. 2023. DOI:10.48550/arXiv.2302.00294 .URL:http://arxiv.org/abs/2302.00294 (visited
on 12/13/2024).
[31] Jesse Vig. Visualizing Attention in Transformer-Based Language Representation Models . arXiv:1904.02679 [cs].
Apr. 2019. DOI:10.48550/arXiv.1904.02679 .URL:http://arxiv.org/abs/1904.02679 (visited on
12/11/2024).
[32] Kevin Wang et al. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small .
arXiv:2211.00593 [cs]. Nov. 2022. DOI:10.48550/arXiv.2211.00593 .URL:http://arxiv.org/abs/
2211.00593 (visited on 12/13/2024).
[33] Lai Wei et al. Diff-eRank: A Novel Rank-Based Metric for Evaluating Large Language Models . arXiv:2401.17139
[cs]. Oct. 2024. DOI:10.48550/arXiv.2401.17139 .URL:http://arxiv.org/abs/2401.17139 (visited
on 12/11/2024).
[34] Lianmin Zheng et al. LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset . arXiv:2309.11998
[cs]. Mar. 2024. DOI:10.48550/arXiv.2309.11998 .URL:http://arxiv.org/abs/2309.11998 (visited
on 12/07/2024).
[35] Julia El Zini and Mariette Awad. “On the Explainability of Natural Language Processing Deep Models”. In:
ACM Computing Surveys 55.5 (May 2023). arXiv:2210.06929 [cs], pp. 1–31. ISSN : 0360-0300, 1557-7341. DOI:
10.1145/3529755 .URL:http://arxiv.org/abs/2210.06929 (visited on 12/11/2024).
12
Page 13:
7 Appendix
7.1 Cross-Validation Results
We observe that each model had large variations in performance, especially on a very small number of training examples.
In the prompt injection task, there is a sharp drop in performance with a training set of around 40, which can be
observed in Figures 12 , 13. Interestingly, these changes were most present in the intermediate layers of each model,
with the fine-tuned model having the most variability overall. Because this drop in performance is present in each
model and layer trained, our assumption is that the higher variability inherent in small datasets causes sharp changes
in estimated performance. We perform 10-fold cross-validation on the DeBERTa model on the prompt injection task.
From Figure 6, we observe that cross-validation appears to stabilize the performance across all layers of the model. We
conclude that model validation metrics such as cross-validation help provide more reliable estimates of the performance
of low-resource classifiers, especially when the number of examples is less than 100.
Figure 6: Performance graph of cross-validated DeBERTa LEC models on the prompt injection task.
7.2 Layer Concatenation
Since the layer representation of text can be significantly different between layers as the number of layers between them
increases, we performed experiments on whether the model could benefit from using the encoding from multiple layers.
In the experiment, we evaluated Qwen 2.5 0.5B using our prompt injection dataset. Rather than using a single layer’s
encoding, we set the features for our Ridge Classifier to be the concatenation of all previous layers. We then plotted and
compared the concatenated performance to performance using single layers. From Figure 7 and Gromov’s assertion
that model layers are largely dependent on the previous ones[12], we conclude that layer concatenation has little to no
effect on model performance.
13
Page 14:
Figure 7: Performance graphs of Qwen 2.5 0.5B Instruct LEC models using concatenated layers and with a singular layer.
Model Task System Prompt User Prompt
Qwen 2.5 0.5-
3BContent Safety You are Qwen, created by Alibaba Cloud. You are a
helpful assistant.{User Prompt}
Qwen 2.5 0.5-
3BPrompt Injec-
tion Detection{System Prompt} {User Prompt}
GPT-4o Content Safety Provide your safety assessment for ONLY THE
LAST User message in the above conversation: -
First line must read ’safe’ or ’unsafe’. - If unsafe, a
second line must include a comma-separated list of
violated categories.Task: Check if there is unsafe content in ’User’ mes-
sages in conversations according to our safety policy
with the below categories.
<BEGIN UNSAFE CONTENT CATEGORIES>
{categories}
<END UNSAFE CONTENT CATEGORIES>
<BEGIN CONVERSATION>
User: {question}
<END CONVERSATION>
GPT-4o Prompt Injec-
tion Detection# Instructions
Classify whether the User Prompt violates the guide-
lines defined in the System Prompt.
## System Prompt
{System}
## User Prompt
{User}
Llama Guard 3
1B/8BContent Safety Task: Check if there is unsafe content in ’User’ mes-
sages in conversations according our safety policy
with the below categories.
<BEGIN UNSAFE CONTENT CATEGORIES>
{Categories}
<END UNSAFE CONTENT CATEGORIES>
<BEGIN CONVERSATION>
User: {Question}
<END CONVERSATION>
Provide your safety assessment for ONLY THE
LAST User message in the above conversation:
- First line must read ’safe’ or ’unsafe’.
- If unsafe, a second line must include a comma-
separated list of violated categories.
ProtectAI
DeBERTa v3
Base Prompt
Injection*Prompt Injec-
tion Detection# Instructions
Classify whether the User Prompt violates the guide-
lines defined in the System Prompt.
## System Prompt
{System Prompt}
## User Prompt
{User Prompt}
Table 5: Prompts used to evaluate each model on both tasks.
*DeBERTa models are not chat-based, meaning the "User Prompt" is simply the text that was sent directly to the model.
14
Page 15:
Figure 8: Full performance of each Qwen 2.5 Instruct LEC model on all 3 levels of the content safety binary classification dataset.
15
Page 16:
Figure 9: Full performance of each Llama Guard 3 LEC model on all 3 levels of the content safety binary classification dataset.
16
Page 17:
Figure 10: Full performance of each Qwen 2.5 Instruct LEC model on all 3 levels of the content safety multi-class classification
dataset.
17
Page 18:
Figure 11: Full performance of each Llama Guard 3 LEC model on all 3 levels of the content safety multi-class classification dataset.
Figure 12: Full performance of each Qwen 2.5 Instruct LEC model on the prompt injection classification dataset.
Figure 13: Full performance of the DeBERTa v3 Base Prompt Injection v2 LEC models on the prompt injection dataset.
18