Paper 2410.18527

Understanding Ranking LLMs: A Mechanistic Analysis for Information Retrieval

Authors: Tanya Chowdhury, Atharva Nijasure, James Allan

Published: 2024-10-24

Abstract:

Transformer networks, particularly those achieving performance comparable to GPT models, are well known for their robust feature extraction abilities. However, the nature of these extracted features and their alignment with human-engineered ones remain unexplored. In this work, we investigate the internal mechanisms of state-of-the-art, fine-tuned LLMs for passage reranking. We employ a probing-based analysis to examine neuron activations in ranking LLMs, identifying the presence of known human-engineered and semantic features. Our study spans a broad range of feature categories, including lexical signals, document structure, query-document interactions, and complex semantic representations, to uncover underlying patterns influencing ranking decisions. Through experiments on four different ranking LLMs, we identify statistical IR features that are prominently encoded in LLM activations, as well as others that are notably missing. Furthermore, we analyze how these models respond to out-of-distribution queries and documents, revealing distinct generalization behaviors. By dissecting the latent representations within LLM activations, we aim to improve both the interpretability and effectiveness of ranking models. Our findings offer crucial insights for developing more transparent and reliable retrieval systems, and we release all necessary scripts and code to support further exploration.

Paper Content:
Tanya Chowdhury, University of Massachusetts Amherst, MA, USA
Atharva Nijasure, University of Massachusetts Amherst, MA, USA
James Allan, University of Massachusetts Amherst, MA, USA

ACM Reference Format: Tanya Chowdhury, Atharva Nijasure, and James Allan. 2018. Understanding Ranking LLMs: A Mechanistic Analysis for Information Retrieval. In . ACM, New York, NY, USA, 11 pages.
https://doi.org/XXXXXXX.XXXXXXX

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. Conference’17, July 2017, Washington, DC, USA. ©2018 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM https://doi.org/XXXXXXX.XXXXXXX

1 INTRODUCTION

For many decades, the domain of passage retrieval and reranking has predominantly used algorithms grounded in statistical human-engineered features, typically derived from the query, the document set, or their interactions. In particular, the features in training and evaluation datasets such as the learning-to-rank MSLR dataset [42] are largely derived from features manually developed to support successful statistical ranking systems. However, recent advances have seen state-of-the-art passage retrieval and ranking algorithms increasingly pivot to neural network-based models. The introduction of Large Language Models (LLMs) has notably widened the performance disparity between neural-based and traditional statistical rankers.

Access to open-source ranking models, particularly those exhibiting performance comparable to GPTs, provides a unique opportunity to explore the inner workings of transformer-based architectures.
Neural networks are recognized for their robust feature extraction capabilities; however, the representation of these features remains elusive, making it challenging to discern their nature and potential correlation with features engineered by humans. As a result, despite their efficacy, neural networks present a challenge in terms of transparency: their feature representations are complex and the location of critical features within the network remains obscure. In this context, mechanistic interpretability [44] aims to demystify the internal workings of transformer architectures, showing how LLMs process and learn information differently compared to traditional methods. Our research is motivated by the desire to determine whether the statistical features traditionally valued in algorithms like BM25 and tf*idf are somehow encoded within LLM architectures. This study stands to bridge the gap between neural and statistical methodologies by using probes to generate hypotheses on the inner workings of the LLMs, offering insights that could enhance the field of information retrieval by enabling more intuitive explanations and better support for model design and analysis.

Summary: This study is built upon four different LLM architectures: Llama2-7b, Llama2-13b, Llama3.1-8b, and Pythia-6.9b. For each of these we use or create a LoRA fine-tuned variant (e.g., RankLlama [35]), optimized for passage reranking tasks using the MS MARCO dataset. These pointwise rankers demonstrate substantial accuracy improvements over their non-LLM counterparts. Our analysis concentrates on extracting activations from the MLP unit of each transformer block, which is posited to contain the key feature extractors [24]. We aggregate these activations for each input sequence (query-document pairs) and assign labels to these pairs corresponding to each feature that is a target of our probe.
Subsequently, we employ a regression with regularization to correlate these activations with the labels. Our findings reveal a pronounced representation of certain MSLR features within the ranking LLMs, while others are markedly absent. We also observe that distinct LLM architectures capture very similar statistical IR features when fine-tuned using the same dataset. These insights will allow future researchers to formulate hypotheses concerning the underlying circuitry of ranking LLMs.

Research Questions: In this study, we aim to explore several internal mechanistic aspects of ranking LLMs through probing techniques. Specifically, we seek to determine whether known statistical information retrieval (IR) features are present within the activations of LLMs. We are also interested in identifying groups of features and understanding how they may combine or interact within the LLM’s activations. Additionally, we investigate whether LLMs contain components that mimic similarity scores from models like BERT or T5. It is also of interest to find out if different LLM architectures encode the same statistical IR features within their activations. Finally, this inquiry extends to examining whether the latent features encoded within activations of LLMs remain consistent or change when the model encounters out-of-distribution queries or documents. By answering these questions, we aim to gain a deeper understanding of the inner workings of ranking LLMs and the extent to which they align with traditional IR methodologies.

arXiv:2410.18527v2 [cs.IR] 22 Feb 2025

1.1 Contributions and Findings

(1) We discovered that several human-engineered metrics, such as covered query term number, covered query term ratio, min of term frequency, min and mean of stream length normalized term frequency, and variance of tf*idf, are prominently encoded within LLM activations.
In contrast, certain features, including sum of stream length normalized term frequency, max of stream length normalized term frequency, and BM25, show no discernible representation in LLM activations.

(2) We fine-tuned and probed two additional reranking models, Llama3.1-8b and Pythia-6.9b, resulting in RankLlama3-8b and RankPythia, alongside probing RankLlama2-7b and RankLlama2-13b. The results revealed broadly consistent outcomes across all models, suggesting that different LLM architectures, when fine-tuned similarly, encode latent features in a comparable way within their activations.

(3) We found that the activation patterns of the listed features remain consistent even when the model encounters out-of-distribution queries or documents on RankLlama2-7b, RankLlama3-8b and RankPythia-6.9b. However, they do not remain consistent for certain features of RankLlama2-13b, suggesting potential overfitting in the LLM during fine-tuning.

(4) We identified specific combinations of MSLR features, like the sum of covered query term ratio, mean stream length normalized term frequency, and variance of tf*idf, along with the sum’s square and cube, all of which also exhibit a strong correlation with LLM activations.

(5) Finally, we show further evidence on the presence of these encoded features using Shapley values.

All scripts, datasets, and fine-tuned ranking LLMs developed and/or used in this study are made publicly available (see footnote 1).

2 BACKGROUND & RELATED WORK

We discuss concepts in Mechanistic Interpretability, in particular sparse probing. Then we touch on inner interpretability approaches in information retrieval.

2.1 Mechanistic Interpretability & Sparse Probing

Mechanistic Interpretability is the ambitious goal of gaining an algorithmic-level understanding of any deep neural network’s computations [44].
This goal can be pursued in numerous ways: by studying weights (e.g., weight masking [12, 50], continual learning [15]), by studying individual neurons (e.g., excitement based [52], activation patching [37], gradient-based [2], perturbation and ablation based [53]), by studying subnetworks (e.g., sparsity based [39], circuit analysis based [20]), by studying representations (e.g., tokens [33], attention [11], probing [4, 7, 26]), by training sparse autoencoders [13, 31, 38], etc. In this study, we focus on studying the activations of individual neurons and also some groups of neurons with the help of probing regressors. To that end, we use concepts from a host of techniques mentioned above.

Footnote 1: https://suppressed.for.review

While the concept of reverse-engineering specific neurons within large language models (LLMs) is relatively new, existing studies [23, 24] illustrate that the feed-forward layers of transformers, comprising two-thirds of the model’s parameters, function as key-value memories. These layers activate in response to specific inputs, a mechanism we aim to demystify by reverse-engineering the activation function of these neurons. A notable challenge in this endeavor is the phenomenon of superposition, where early layers in LLMs select and store a vast array of features (often exceeding the number of available neurons) in a linear combination across multiple neurons. In contrast, later layers tend to focus on more abstract features, discarding those deemed non-essential [25].

Probing aims to determine if a given representation effectively captures a specific type of information, as discussed by Belinkov [4]. This technique employs transfer learning to test whether embeddings contain information pertinent to a target task.
The three essential steps in probing are: (1) obtaining a dataset with examples that exhibit variation in a particular quality of interest, (2) embedding these examples, and (3) training a model on these embeddings to assess if it can learn the quality of interest. This method is versatile as it can utilize any inner representation from any model. However, a limitation of probing methods is that a successful probe does not necessarily mean that the probed model actually utilizes that information about the data [45]. Belinkov [4] provides a comprehensive survey on probing methods for large language models, discussing their advantages, disadvantages, and complexities. Gurnee et al. [25] further introduce sparse probing, where they mine features of interest over groups of representations up to $k$ in size. We build upon sparse probing in this work.

2.2 Inner Interpretability in IR

Most works on interpretability in information retrieval [1] have been model extrinsic [9, 43, 47]. Among model intrinsic interpretability methods, some works focus on using gradient-based methods to identify important neurons [6, 19]. Although gradient-based methods give an accurate perspective on the flow of information, they are too fine-grained to give a human-level understanding of the LLM’s inner circuit [1]. A number of works have attempted to examine the inner workings of neural retrievers to understand if they satisfy IR axioms and/or to spot known features. Fan et al. [18] probe fine-tuned BERT models on three different tasks, namely document retrieval, answer retrieval, and passage retrieval, to find out if these different forms of relevance lead to different models. Zhan et al.
[51] study the attention patterns of BERT after fine-tuning on the document ranking task and note how BERT dumps duplicate attention weights on high frequency tokens (like periods). Similar to them, Choi et al. [8] study attention maps in BERT and report the discovery of the IDF feature within BERT attention patterns. ColBERT investigations [21, 22] study its term-matching mechanism and conclude that it captures a notion of term importance, which is enhanced by fine-tuning. MacAvaney et al. [36] propose ABNIRML, a set of diagnostic probes for neural retrieval models that allow searching for features like writing styles, factuality, sensitivity and word order in models like ColBERT and T5. Recently, Parry et al. [41] provide a framework for diagnostic analysis and intervention within neural ranking models by generating perturbed inputs followed by activation patching.

Table 1: The ranking effectiveness of various fine-tuned ranking LLMs, in comparison to the performance of a BM25 reranker. We observe that while LLMs are effective for reranking, they are opaque and complex to understand.

Model Name        Base Model    DEV MRR@10   DL19 NDCG@10   DL20 NDCG@10
BM25              Lucene [29]   16.7         50.6           48.0
RankLlama2-7b     Llama2 [49]   44.9         75.6           77.4
RankLlama2-13b    Llama2 [49]   45.2         76.0           77.9
RankLlama3-8b     Llama3 [16]   42.8         74.3           72.4
RankPythia-6.9b   Pythia [5]    42.9         76.0           71.5

Figure 1: The RankLlama2-7B internal architecture with 32 layers and 4096 dimensional vectors. Similarly, the 13B model contains 40 layers and 5120 dimensional vectors. Note the score layer, added to the general Llama2 architecture during fine-tuning using LoRA.

These studies, however, were conducted in the pre-LLM era, and the models they examine are hence much smaller in size.
It is unknown if the findings from BERT will carry over to LLMs like Llama, given that BERT is an encoder-only model whereas most modern LLMs are decoder-only models. While most of the above works focus on attention heads and patterns, our work focuses on probing the MLP activation layers, which are now believed to be the primary feature extraction location within the LLM [25]. To the best of our knowledge, this is one of the first works towards the study of neuron activations in LLMs for a large scale of statistical and semantic features.

3 IDENTIFYING CONTEXT NEURONS

We describe our probing pipeline to identify context neurons (neurons that are sensitive to or encode desired features) in the LLM architecture.

Ranking Models: For our ranking LLM interpretability study, we first select RankLlama [35], an open source pointwise reranker that has been fine-tuned on Llama-2 with the MS MARCO dataset using LoRA. Given an input pair (query Q, document D), RankLlama scores it as:

\[ \mathit{input} = \text{`query: \{Q\} document: \{D\}</s>'} \]

\[ \mathit{Sim}(Q, D) = \mathit{Linear}(\mathit{Decoder}(\mathit{input})[-1]) \]

where Linear(·) is a linear projection layer that projects the last layer representation of the end-of-sequence token to a scalar. The model is fine-tuned using a contrastive loss function. RankLlama is accessible on Huggingface (see footnote 2) and has demonstrated near state-of-the-art performance on the MS MARCO Dev set, as noted by Ma et al. [35]. We experiment on both the 7b and 13b parameter models fine-tuned for passage reranking.

To facilitate comparison with other ranking architectures, we additionally fine-tune Llama3.1-8b and Pythia-6.9b using an approach similar to the fine-tuning methodology employed by the RankLlama authors. Each of these LLMs is fine-tuned with a LoRA rank of 32 for 1 epoch, and their performance is evaluated on the commonly used TREC DL19 and DL20 datasets.
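The pointwise scoring rule given for RankLlama in this section can be illustrated with a small, self-contained sketch. This is not RankLlama's implementation: `toy_decoder` is a hypothetical stand-in that produces a deterministic pseudo-random vector per token prefix, and the projection weights are made up. Only the prompt template and the idea of projecting the last token's representation to a scalar are taken from the text.

```python
import random


def build_input(query: str, document: str) -> str:
    """Format a query-document pair with the template given in the text."""
    return f"query: {query} document: {document}</s>"


def toy_decoder(text: str, hidden_size: int) -> list[list[float]]:
    """Hypothetical stand-in for the decoder (NOT a real language model).

    Returns one vector per whitespace token. Each vector is seeded by the
    whole prefix up to that token, loosely mimicking how a causal decoder's
    state at a position depends on all earlier tokens.
    """
    tokens = text.replace("</s>", " </s>").split()
    states = []
    for end in range(1, len(tokens) + 1):
        rng = random.Random(" ".join(tokens[:end]))
        states.append([rng.uniform(-1.0, 1.0) for _ in range(hidden_size)])
    return states


def linear(vector: list[float], weights: list[float], bias: float = 0.0) -> float:
    """Project a hidden vector to a scalar relevance score."""
    return sum(w * x for w, x in zip(weights, vector)) + bias


def sim(query: str, document: str, weights: list[float]) -> float:
    """Sim(Q, D) = Linear(Decoder(input)[-1]): score from the last token."""
    hidden_states = toy_decoder(build_input(query, document), len(weights))
    return linear(hidden_states[-1], weights)


weights = [0.5, -0.25, 0.1, 0.3]  # made-up projection weights
score = sim("what is bm25", "BM25 is a ranking function.", weights)
print(f"toy relevance score: {score:.4f}")
```

In a real pointwise reranker the same call would be made once per candidate document, and candidates would then be sorted by the returned score.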
Table 1 compares the ranking performance of all LLMs used in this study, indicating that, once fine-tuned, they achieve similar effectiveness.

Internal Architecture: We refer to the internal architecture of RankLlama2-7b in Figure 1. Each block of the Llama transformer architecture can be divided into two main groups: the multi-head self-attention (MHA) layers and the MLP layers. For example, the Llama-2 7b and 13b architectures consist of 32 and 40 such identical blocks, where each component has a dimensionality of 4096 and 5120, respectively. The feed-forward sublayer takes in the output of the multi-head attention layer and performs two linear transformations over it, with an element-wise non-linear function $\sigma$ in between:

\[ f_i^l = W_v^l \times \sigma(W_k^l a_i^l + b_k^l) + b_v^l \]

where $a_i^l$ is the output of the MHA block in layer $l$, and $W_v^l$, $W_k^l$, $b_k^l$, and $b_v^l$ are learned weight and bias matrices. This MLP layer applies pointwise nonlinearities to each token independently and therefore performs the majority of feature extraction for the LLM. Additionally, the MLP layers account for two-thirds of the total parameters. As a result, our study focuses on the value of the residual stream (also known as hidden state) for token $t$ at layer $l$, right after applying the MLP layers. Elhage et al. [17] provide further details about the transformer architecture math.

Footnote 2: https://huggingface.co/castorini/rankllama-v1-7b-lora-passage

Activation Sourcing: We set up PyTorch transformer forward hooks to mine activations from the ranking LLMs. We tokenize and feed the query-document pairs to the pointwise LLM discussed above, and capture activations corresponding to each input sequence across all layers. The activation of each layer has a dimension equal to the number of tokens times the number of hidden units in the LLM. The number of hidden units is 4096 in Llama2-7b and 5120 in Llama2-13b.
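To make the shapes concrete, the feed-forward sublayer formula from this section can be sketched on toy dimensions. The weights below are invented, and ReLU is used for $\sigma$ purely as an assumption for illustration (the actual models use their own activation functions). Applying the sublayer to each token independently yields one row per token, which matches the tokens-by-hidden-units activation shape described above.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def relu(x):
    """Element-wise non-linearity; an assumed choice for sigma."""
    return max(0.0, x)


def feed_forward(a, W_k, b_k, W_v, b_v, sigma=relu):
    """Compute f = W_v * sigma(W_k a + b_k) + b_v for a single token."""
    hidden = [sigma(p + b) for p, b in zip(matvec(W_k, a), b_k)]
    return [o + b for o, b in zip(matvec(W_v, hidden), b_v)]


# Toy parameters: model width 2, inner width 3 (made-up values).
W_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_k = [0.0, 0.0, -1.0]
W_v = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
b_v = [0.5, -0.5]

# A "sequence" of three token vectors, each processed independently.
sequence = [[1.0, 2.0], [-1.0, 0.5], [0.0, 0.0]]
activations = [feed_forward(a, W_k, b_k, W_v, b_v) for a in sequence]

print(len(activations), "tokens x", len(activations[0]), "hidden units")
```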
To reduce the complexity of the computation, we aggregate activations across tokens in an input sequence within a layer. Following Gurnee et al. [25], we try out mean and max activation aggregation to ultimately obtain a 4096/5120 dimensional activation vector per layer. This corresponds to aggregated activations for 4096/5120 neurons in each layer, which is the focal point of our study. We quantize and store these activation tensors to be used later for further experiments. Intuitively, we have captured the internals of the model at this point, in which we can start searching for desired features.

Target Features: The MSLR dataset [42] provides a comprehensive list of features that have been historically recognized as highly effective for ranking models [27, 28]. Consequently, we aim to search for a subset of these MSLR features within the LLM activations. Specifically, we focus on mining the following 19 features from the MSLR dataset within the LLM (“stream” means the document or passage text): Min TF, Min(TF/L), Min TF*IDF, Covered QT Number, Covered QT Ratio, Mean(TF/L), Variance TF*IDF, BM25, Mean TF*IDF, Variance(TF/L), Mean TF, Variance TF, Sum TF, Max TF, Max(TF/L), Stream length, Sum TF, Max TF*IDF and Sum TF*IDF.

In addition to MSLR features, we mine for known query-document similarity metrics within the LLM architecture. For this we consider traditional query-document similarity scores like tf*idf cosine score, Euclidean score, Manhattan score, KL-divergence score, and Jensen-Shannon divergence score, as well as popular semantic relevance metrics like BERT and T5 scores. Each feature is probed individually; for instance, the probe for stream length is dedicated solely to that feature without searching for any others. Although many of these features are known to be correlated and may share neurons, this study does not include correlation analysis amongst features.
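As an illustration of how probe labels for some of the lexical features could be computed, here is a minimal sketch. It is not the MSLR reference implementation: the lowercased whitespace tokenizer, the use of population variance, and the function and key names are all assumptions, and the TF*IDF and BM25 features are left out because they additionally need corpus statistics.

```python
from collections import Counter
from statistics import mean, pvariance


def tokenize(text: str) -> list[str]:
    """Whitespace tokenizer on lowercased text (an illustrative assumption)."""
    return text.lower().split()


def lexical_features(query: str, document: str) -> dict[str, float]:
    """Compute a few TF-based features for one query-document pair."""
    query_terms = list(dict.fromkeys(tokenize(query)))  # unique terms, order kept
    doc_tokens = tokenize(document)
    if not query_terms or not doc_tokens:
        raise ValueError("query and document must each contain a token")

    stream_length = len(doc_tokens)           # L: length of the passage
    counts = Counter(doc_tokens)
    tf = [counts[term] for term in query_terms]
    tf_norm = [count / stream_length for count in tf]  # TF/L
    covered = sum(1 for count in tf if count > 0)

    return {
        "covered_qt_number": covered,
        "covered_qt_ratio": covered / len(query_terms),
        "min_tf": min(tf),
        "max_tf": max(tf),
        "sum_tf": sum(tf),
        "mean_tf": mean(tf),
        "variance_tf": pvariance(tf),          # population variance (assumed)
        "min_tf_l": min(tf_norm),
        "max_tf_l": max(tf_norm),
        "sum_tf_l": sum(tf_norm),
        "mean_tf_l": mean(tf_norm),
        "variance_tf_l": pvariance(tf_norm),
        "stream_length": stream_length,
    }


features = lexical_features("cat sat", "the cat sat on the cat mat")
for name, value in features.items():
    print(f"{name}: {value}")
```

Each dictionary entry would serve as the regression target for one probe, in line with the statement that every feature is probed individually.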
Probing Datasets: To facilitate probing [46], we need to curate a dataset of input sequences to study the model. We select query-document pairs from the MS MARCO test set [40] for this purpose. For each query, we include documents that are highly relevant, highly irrelevant, and of intermediate relevance. Our input sequences for the ranking LLM consist of these query-document pairs. Note that it is important for each probing dataset to be balanced, i.e., to contain a uniform number of samples across the range of values of the feature being probed [4, 25]. For instance, when analyzing the BM25 feature, our dataset included a balanced ratio of query-document pairs with both low and high BM25 values. Having an unbalanced probing dataset can lead to a biased analysis as the model might have a tendency to overfit the dominant class. For each input sequence, we calculate the expected value of the feature being studied. Consequently, we compute the values for each of the 19 MSLR features and 7 similarity scores for our query-document pairs.

Identifying context neurons: After collecting activations and feature labels across a wide range of input sequences (query-document pairs), we start the process of identifying context neurons. Unlike most prior research that employs classification-based probes [18, 36], we utilize regularized regression-based probes to capture the continuous nature of the features under study. We first split the activations and their corresponding labels into training, validation, and test sets (60:20:20). Subsequently, we perform a layer-wise analysis. For each layer and feature, we employ a sparse probe by fitting the activation vectors of the training split to the corresponding feature labels. To maintain the sparsity of the context neurons, we use Lasso (L1 regularization) with $\alpha = 0.1$. We perform 5-fold cross validation to avoid overfitting and provide a more reliable estimate.
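The split-then-fit procedure above can be sketched without any external library. The block below is a toy illustration under several assumptions: the "activations" are synthetic random vectors in which only neuron 2 carries the label, the Lasso objective is taken to be (1/2n)·||y − Xw − b||² + α·||w||₁ and is minimized by plain coordinate descent, and cross validation is omitted for brevity. All function names are hypothetical.

```python
import random


def split_60_20_20(rows, seed=0):
    """Shuffle and split rows into train/validation/test (60:20:20)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 6 // 10
    n_val = len(rows) * 2 // 10
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def soft_threshold(value, threshold):
    """Shrink value towards zero; values within the threshold become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def predict(x, weights, bias):
    return bias + sum(w * v for w, v in zip(weights, x))


def fit_lasso(X, y, alpha=0.1, sweeps=100):
    """Coordinate descent for L1-regularized least squares with intercept."""
    n, d = len(X), len(X[0])
    weights = [0.0] * d
    bias = sum(y) / n
    for _ in range(sweeps):
        for j in range(d):
            rho = 0.0   # correlation of neuron j with the partial residual
            norm = 0.0  # squared norm of neuron j's activations
            for x, target in zip(X, y):
                partial = predict(x, weights, bias) - weights[j] * x[j]
                rho += x[j] * (target - partial)
                norm += x[j] * x[j]
            weights[j] = soft_threshold(rho / n, alpha) / (norm / n) if norm > 0 else 0.0
        bias = sum(t - predict(x, weights, 0.0) for x, t in zip(X, y)) / n
    return weights, bias


# Synthetic data: 6 "neurons", label driven by neuron 2 plus small noise.
rng = random.Random(0)
rows = []
for _ in range(200):
    x = [rng.uniform(-1.0, 1.0) for _ in range(6)]
    rows.append((x, 3.0 * x[2] + rng.gauss(0.0, 0.05)))

train, val, test = split_60_20_20(rows)
weights, bias = fit_lasso([x for x, _ in train], [t for _, t in train], alpha=0.1)

# Neurons with non-zero weight are the probe's candidate context neurons.
context_neurons = [j for j, w in enumerate(weights) if w != 0.0]
print("candidate context neurons:", context_neurons)
```

The L1 penalty is what drives most weights to exactly zero, so the surviving indices can be read directly as the neurons the probe relies on; the held-out splits would then be used to score the fitted probe.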
After fitting the activations to the feature’s labels, we compute the $R^2$ score to measure how well the regression curve explains the variability of the labels. $R^2$ is at most 1, with 1 indicating that the curve perfectly explains the variability in the data. A high $R^2$ score indicates the presence of a neuron that is sensitive to or activates for the feature being studied, whereas a negative $R^2$ means that the probe predicts the labels worse than their mean, i.e., the feature could not be recovered from the activations.

Experimental Details: For our probing experiments, we select 500 queries from the MS MARCO dev set and retrieve the top 100 documents for each query from the MS MARCO collections corpus using a BM25 ranker. We then compute activations corresponding to each query-document pair using all four models. The results reported in the following section are after mean aggregation and quantization for efficient storage. The size of each probing dataset is greater than 5,000 to ensure incorporation of a wide range of values for the feature being probed. The 100 documents retrieved for a query serve as the query’s corpus for idf calculation. We use the bert-base and t5-base models for BERT and T5 score computations. We use the well-known Okapi implementation of BM25 in our experiments.

4 RESEARCH QUESTIONS

4.1 Statistical features within RankLlama2-7b

4.1.1 MSLR features. For the first set of experiments, we designed sparse probes for each of the selected MSLR features and conducted experiments on each layer of the RankLlama2-7b architecture. All experiments were run with both mean and max activation aggregation over the input sequence. Our findings (Figure 2) are categorized into two groups: (i) features that exhibit a strong fit in certain layers, and (ii) features that do not correlate with the LLM activations in any layer.

Positives: Features that achieve an $R^2$ score greater than 0.85 in any particular layer are considered strong positives.
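For reference, here is a minimal sketch of the coefficient of determination as it is commonly defined, together with the 0.85 strong-positive rule stated above. Under this definition the score is 1 for a perfect fit, 0 when predictions are no better than the label mean, and negative when they are worse. The function names are illustrative.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0.0:
        raise ValueError("R^2 is undefined when all labels are identical")
    return 1.0 - ss_res / ss_tot


STRONG_POSITIVE_THRESHOLD = 0.85  # threshold taken from the text


def is_strong_positive(layerwise_r2):
    """A feature is a strong positive if any layer exceeds the threshold."""
    return max(layerwise_r2) > STRONG_POSITIVE_THRESHOLD


labels = [1.0, 2.0, 3.0]
print(r2_score(labels, [1.0, 2.0, 3.0]))  # perfect predictions
print(r2_score(labels, [2.0, 2.0, 2.0]))  # always predicting the mean
print(r2_score(labels, [3.0, 2.0, 1.0]))  # worse than the mean
```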
This means it is highly likely that a neuron’s activation has extracted the particular feature. We found that seven metrics, namely Min of TF, Min(TF/L), Min of TF*IDF, covered QT Number, covered QT Ratio, Mean(TF/L), and variance of TF*IDF, frequently exhibit low Mean Square Errors and high $R^2$ scores. This indicates that there exist MLP neurons within the investigated layers that perform feature extractions similar to these MSLR features. Additionally, we observed that the $R^2$ scores obtained with mean and max activation aggregation are comparable, with minor exceptions, suggesting either aggregation method is suitable.

Figure 2: Probing for statistical features from the MSLR dataset in the RankLlama2-7b model. Here QT stands for Query Term, TF stands for Term Frequency and ·/L stands for length normalized. The graph lines indicate the presence of a particular feature along the layers of the LLM. Certain features like Min TF*IDF show consistent presence across the layers. Other features like Covered QT Number, Covered QT Ratio, Mean(TF/L) and Variance TF*IDF show increasing prominence from the 1st layer to the last, ultimately playing an important role in ranking decision making. Other MSLR features like Sum(TF/L), Max(TF/L), and Sum TF*IDF show negative correlation with RankLlama decision making.

Figure 3: Plot showing $R^2$ scores of statistical query-document distance metrics when used to probe RankLlama2-7b. Scores indicate that these distance metrics are not encapsulated within RankLlama2-7b as is.

Negatives: Certain features failed to achieve an $R^2$ score greater than 0.1 in the final layer, often achieving a highly negative $R^2$.
This indicates that the LLM does not consider these features important in their current form. The 10 features from the MSLR set that fell into this category include: Sum and Max of TF*IDF, Sum(TF/L) and Max(TF/L), Max and variance of TF, stream length, and Sum of TF. Thus, while Mean(TF/L) is something the LLM highly seeks, it does not seek the sum and max of normalized term frequency at all!

Figure 4: Plot comparing the maximum $R^2$ score of each query-document IR feature metric when used to probe different ranking LLM architectures: RankLlama2-7b and 13b, RankLlama3-8b and RankPythia-6.9b. Here ·/L stands for length normalized and QT stands for Query Term. We observe that different LLM architectures, when fine-tuned on the same dataset and loss function, learn very similar ranking related latent features. This suggests that our probing findings aren’t LLM specific, but generalizable.

4.1.2 Traditional similarity metrics. We also probe the LLM architecture for non-semantic statistical query-document distance metrics like tf*idf cosine score, Euclidean score, Manhattan score, Kullback-Leibler divergence score and Jensen-Shannon divergence score. Our motive behind this is to identify if the LLMs include components that mimic statistical similarity measures. Results for the probe on RankLlama2-7b are visualized in Figure 3. We observe that statistical score based methods do not correlate well with neuron activations. Out of the features examined, the Euclidean score shows the highest correlation, with $R^2$ reaching 0.6 and dipping before the final layer. Given its well-known success in the past, it is ironic that the tf*idf cosine score between the query and document shows the least correlation when probed for within RankLlama activations.

4.1.3 BERT and T5 scores. BERT and T5 are neural models widely used for retrieval and reranking tasks before the advent of LLMs.
LLMs are much larger in size compared to BERT models. As a result, it is of interest to see if BERT and T5 subnetworks are present within the studied LLM’s activations. We design probes to mine for the cosine distance between the BERT and T5 embeddings of the query and the document. Our observations show that both BERT and T5 obtain moderate $R^2$ scores in our probes, reaching 0.7 and 0.82 on RankLlama2-7b and 13b, respectively. The BERT and T5 scores follow each other across the LLM layers. This indicates that the LLM likely does not encode BERT and T5 subnetworks as-is. We compare our findings to related works on BERT probing in Section 5.

Figure 5: Probing RankLlama2-7b and 13b with in-distribution vs out-of-distribution datasets. We witness that most MSLR features, when probed for in RankLlama2-7b, show similar performance with both in-distribution and out-of-distribution datasets. This is however not the case with RankLlama2-13b, where certain features like stream length and sum of term frequency show a strong presence in the in-distribution dataset probe, even though they are unlikely features to influence a ranking decision. This suggests overfitting on the MS MARCO dataset in RankLlama2-13b.

4.2 Comparing Ranking LLMs

We next explore whether different ranking LLMs encode similar or distinct information retrieval features. To investigate this, we extend our analysis beyond the RankLlama2-7b model and compare it to the two additional fine-tuned LLMs, RankLlama3-8b and RankPythia-6.9b, as well as the RankLlama2-13b model. We fine-tune these models under an identical passage reranking framework, with an identical dataset and loss function, to ensure a fair and consistent comparison.
Once the models are fine-tuned, we conduct the same probing experiments to analyze how well each ranking LLM encodes the known human-engineered and semantically meaningful attributes relevant to ranking tasks. The results of these probing experiments are summarized and visualized in a spider chart (Fig. 4), allowing for an intuitive comparison of feature encoding across the four models. It highlights the similarities and differences in the latent features captured by LLMs with varying architectures and sizes, shedding light on the generalizability and architecture-specific tendencies of ranking-focused fine-tuning. Each value in the figure is the $R^2$ score for an IR feature when probed in a particular LLM, and lies in [0, 1].

All four ranking LLMs demonstrate broadly comparable patterns for most features, particularly for the Min metrics (e.g., Min of TF, Min of TF*IDF), where they uniformly show top scores (all 1.0). They also appear closely aligned on coverage-based features, such as Covered Query Term Number and Ratio, with each scoring in the high 0.95 to 0.98 range. By contrast, the greatest differences emerge in TF-based statistics like Max of Term Frequency and Variance of TF/TF*IDF, where RankPythia tends to exhibit slightly higher peaks (e.g., a larger $R^2$ for Max of TF) than the RankLlama models. Additionally, all LLMs show near zero performance in encoding Sum of Stream Length Normalized TF, suggesting similarity in how these models handle term-frequency weighting. Overall, the four models encode IR features in a largely similar manner, indicating that our probing findings are generalizable and not LLM specific.
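One way such a cross-model summary could be assembled from layer-wise probe scores is sketched below. The numbers are placeholders invented for illustration and are not results from the paper; the model and feature names are likewise hypothetical, and flooring negative scores at 0 is an assumption made so that every summarized value falls in [0, 1].

```python
def max_r2_per_feature(layerwise_r2):
    """Map each feature to its best layer-wise R^2, floored at 0 (assumed)."""
    return {
        feature: max(0.0, max(scores))
        for feature, scores in layerwise_r2.items()
    }


def largest_gap(per_model):
    """Return (feature, gap) with the widest spread of scores across models."""
    features = next(iter(per_model.values())).keys()
    gaps = {
        feature: max(m[feature] for m in per_model.values())
        - min(m[feature] for m in per_model.values())
        for feature in features
    }
    widest = max(gaps, key=gaps.get)
    return widest, gaps[widest]


# Placeholder layer-wise scores (made up, NOT measurements from the paper).
probe_results = {
    "model_a": {"min_tf": [0.4, 0.9, 1.0], "max_tf": [-0.5, 0.1, 0.3]},
    "model_b": {"min_tf": [0.5, 0.8, 1.0], "max_tf": [-0.2, 0.2, 0.6]},
}

summary = {
    model: max_r2_per_feature(scores) for model, scores in probe_results.items()
}
feature, gap = largest_gap(summary)
print(summary)
print("feature with the largest cross-model gap:", feature)
```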
Figure 6: $R^2$ scores of the feature groups $(\mathit{QTR}+\mathit{STF}+\mathit{VTFIDF})$, $(\mathit{QTR}+\mathit{STF}+\mathit{VTFIDF})^2$ and $(\mathit{QTR}+\mathit{STF}+\mathit{VTFIDF})^3$ over the layers of the RankLlama2-7b and 13b architectures, where QTR represents covered query term ratio, STF represents mean stream length normalized term frequency, and VTFIDF represents variance of tf*idf.

4.3 Out-of-distribution queries and documents

All four of the LLMs we study were obtained by fine-tuning the base models using LoRA on the MS MARCO dev set [35]. It is of interest to see whether, during inference, the fine-tuned LLM extracts different features from in-distribution and out-of-distribution query-document pairs. We probe all the studied LLMs with query-document pairs from two other datasets. First, we use the BEIR Scidocs dataset [48], comprising scientific documents and queries. We select 200 queries and use BM25 to retrieve the top 100 documents for each. We then probe the MS MARCO fine-tuned LLMs with BEIR activations. We repeat the process with the SoDup dataset [32], in which each query is a question from StackOverflow and the candidates are other StackOverflow questions that are potential duplicates. For each question, we select known duplicate questions from the dataset and treat them as relevant documents. We again pick 200 queries from this dataset for our out-of-distribution probing experiments.

We probe for each of the 19 MSLR features on all the fine-tuned ranking LLMs and compare probing results between the in-distribution and the out-of-distribution datasets. We show our comparison of the probing results of RankLlama2-7b vs RankLlama2-13b in Figure 5, reporting the $R^2$ score of each feature in the last layer of the respective LLM. For the feature mean of tf, we see that our probes cannot find any activations representing the feature in RankLlama2-7b, with either in-distribution or out-of-distribution datasets.
However, when probing RankLlama2-13b, while out-of-distribution activations do not capture this feature, it gets a hit within in-distribution activations. We see some variance in results between RankLlama2-7b and RankLlama2-13b in this set of experiments. On RankLlama2-7b, the probes fare similarly on most features between in-distribution and out-of-distribution data. The only exception is sum of stream length normalized tf, which has a negative $R^2$ score for both in-distribution and out-of-distribution data, making its magnitude insignificant. For RankLlama2-13b, however, the probes yield different results between in-distribution and out-of-distribution data. In particular, they find activations strongly correlated with stream length, sum and mean of tf, and mean of tf*idf in the in-distribution data, but not in the out-of-distribution data. A likely reason is that RankLlama2-13b has overfit to the MS MARCO dev set and is hence seeking features like stream length, which are known not to generalize as indicators of query-document relevance [4]. We also compared the probing performance of in-distribution and out-of-distribution query-document pairs in RankLlama3-8b and RankPythia-6.9b, observing similar performance for both types of inputs, consistent with the results observed for RankLlama2-7b in Figure 5a. The graphs depicting these probing results are not included due to space constraints.

4.4 Feature Groups within RankLlama

In the previous subsection, we found that several MSLR features (covered QT Number, covered QT Ratio, mean(TF/L), min TF, min of TF*IDF, Min(TF/L) and variance of TF*IDF) appear to be modeled within LLM activations, especially within the later layers. Intuitively, these features also seem to be a fair choice for modeling relevance. However, these features might not be present individually but in combination with one another, or individually with different exponents.
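Probing for combinations of features requires building combined regression targets from the per-pair feature values. The sketch below shows one way to do this for sums, products, and powers of the sum, using the abbreviations QTR, STF and VTFIDF from the Figure 6 caption; the helper `group_targets`, its naming scheme, and the toy values are assumptions made for illustration, not the authors' code.

```python
import numpy as np
from itertools import combinations

def group_targets(features, max_power=3):
    """Build combined probe targets from a dict of name -> (n,) feature arrays.

    Produces the sum and the product of every subset of two or more features,
    plus powers 2..max_power of the sum of all features.
    """
    names = sorted(features)
    targets = {}
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            cols = [features[name] for name in combo]
            targets["+".join(combo)] = np.sum(cols, axis=0)
            targets["*".join(combo)] = np.prod(cols, axis=0)
    total = np.sum([features[name] for name in names], axis=0)
    for k in range(2, max_power + 1):
        targets[f"({'+'.join(names)})^{k}"] = total ** k
    return targets

# Toy per-pair feature values (hypothetical numbers for three query-document pairs).
features = {
    "QTR": np.array([0.5, 1.0, 0.25]),
    "STF": np.array([0.1, 0.2, 0.05]),
    "VTFIDF": np.array([2.0, 0.5, 1.0]),
}
targets = group_targets(features)
for name in targets:
    print(name, targets[name])
```

Each resulting array can then be used as the regression target of a layer-wise probe in place of a single MSLR feature. In practice the features would likely need rescaling to comparable ranges before being summed, which this sketch omits.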
To test for this possibility, we probe for combinations of these features within different layers of the LLM, covering various combinations of covered query term ratio, mean of stream length normalized term frequency and variance of tf*idf. We find strong indication of their combined presence within the activations of the later layers of the LLM. For example, if QTR represents covered query term ratio, STF represents mean of stream length normalized term frequency and VTFIDF represents variance of tf*idf, we find high scores on average for all of QTR+STF, QTR+VTFIDF, STF+VTFIDF, QTR+STF+VTFIDF, QTR*STF, STF*VTFIDF, QTR*VTFIDF, and QTR*STF*VTFIDF when probing the last layer of the LLM. This potentially indicates that the LLM has, over the course of its layers, learned some representation of $(\mathit{QTR}+\mathit{STF}+\mathit{VTFIDF})^k$. In Figure 6, we show the probing performance of the sum of those three features and its powers, relative to BM25, in the RankLlama2-7b and 13b models. We observe that the sum of this feature group, its square, and its cube consistently show a strong correlation with neuron activations within RankLlama.

This analysis provides a foundation for future work to uncover and investigate complex groups of features encoded within LLM activations. By highlighting how various ranking architectures capture and represent different IR features, it opens avenues for deeper exploration into latent feature interactions and the mechanistic underpinnings of ranking-focused language models.

5 DISCUSSION

Comparing Notes with Previous Probing Work: A number of studies have probed BERT and T5 with the aim of understanding whether they encode concepts like term frequency and inverse document frequency. Formal et al.
[21] study the matching process of ColBERT [30] and conclude that the model is able to capture a notion of term importance and relies on exact matches for important terms. This is in agreement with our findings, where we find strong correlations with term matching features like covered query term number/ratio and stream length normalized term frequency within LLM activations. In a later work, Formal et al. [22] measure the out-of-domain zero-shot capabilities of BERT/T5 models in lexical matching on the BEIR dataset, and show that these models fail to generalize lexical matching to out-of-domain datasets or to terms not seen at training time. This finding holds for RankLlama as well, where the model is unable to generalize lexical matching to terms not seen beforehand (Figure 5, RankLlama2-13b out-of-distribution vs in-distribution probing discrepancies).

Different types of neurons: A monosemantic neuron is a neuron within a neural network that responds to a single, specific feature or concept in the input data [3]. However, large neural networks often try to extract more features than they have neurons. In such scenarios, the model has to superpose features on single neurons in order to compress the desired features into the limited number of neurons. This gives rise to polysemantic neurons, i.e., neurons encoding more than one feature. This phenomenon is usually witnessed in the early layers of the LLM architecture and is difficult to disentangle using linear probes. There is unfortunately no known method to distinguish monosemantic from polysemantic neurons via probing. Another common phenomenon in sparse LLMs is that multiple neurons within a layer together represent a particular human-known context. Among the features probed in our study, 23% were found in a single neuron, 32% were distributed between 2-3 neurons and the other 45% were distributed across 4 or more neurons.
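One way to estimate how many neurons a feature is spread across is to rank neurons by their correlation with the feature and grow the probe's input one neuron at a time until it explains the feature well enough. The sketch below illustrates this greedy heuristic on synthetic data; it is an assumption about how such counts could be obtained, not a description of the authors' procedure, and the 0.85 threshold, `max_k`, and the in-sample scoring are illustrative simplifications.

```python
import numpy as np

def neurons_needed(acts, target, threshold=0.85, max_k=8):
    """Smallest number of top-correlated neurons whose least-squares fit reaches
    the R^2 threshold (scored in-sample for brevity).

    Returns (k, neuron_indices), or (None, top max_k indices) if never reached.
    """
    xc = acts - acts.mean(axis=0)
    yc = target - target.mean()
    # Absolute Pearson correlation of every neuron with the feature.
    denom = np.linalg.norm(xc, axis=0) * np.linalg.norm(yc)
    corr = np.abs(xc.T @ yc) / np.where(denom == 0, np.inf, denom)
    order = np.argsort(-corr)
    ss_tot = yc @ yc
    for k in range(1, max_k + 1):
        idx = order[:k]
        coef, *_ = np.linalg.lstsq(xc[:, idx], yc, rcond=None)
        resid = yc - xc[:, idx] @ coef
        if 1.0 - (resid @ resid) / ss_tot >= threshold:
            return k, idx
    return None, order[:max_k]

# Synthetic activations (hypothetical): one feature lives in a single neuron,
# another is spread evenly over three neurons.
rng = np.random.default_rng(0)
acts = rng.normal(size=(500, 32))
single = 3.0 * acts[:, 7] + 0.1 * rng.normal(size=500)
spread = acts[:, 1] + acts[:, 4] + acts[:, 9] + 0.1 * rng.normal(size=500)

k_single, idx_single = neurons_needed(acts, single)
k_spread, idx_spread = neurons_needed(acts, spread)
print(k_single, idx_single, k_spread, sorted(idx_spread.tolist()))
```

A correlation ranking of this kind can miss neurons that only matter jointly, so the returned count is best read as an upper-bound style estimate rather than an exact measure of how distributed a feature is.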
Validating Probing Results: While probing techniques are widely used to understand the internal workings of LLMs, there are several limitations to using probing for this purpose. (1) Probing techniques reveal correlations between specific features and neuron activations, but do not provide a causal analysis of the decision-making process. (2) The insights gained from probing are heavily dependent on the dataset used for probing. If the dataset is not representative of the model's typical input, the results may not accurately reflect the model's general behavior. (3) LLMs often rely on complex interactions between multiple features. Probing techniques may fail to capture these interactions, leading to an incomplete understanding of how the model makes decisions. (4) Probing can identify features but may mislead regarding their role or significance in the model's decision-making process. However, many of these limitations can be addressed by creating balanced probing datasets and by independently validating probing results through methods like ablation studies and feature attribution analyses.

To validate our probing results, we analyzed neurons in the final layer of the RankLlama models that exhibited strong correlations ($R^2 > 0.85$) with probed features, as this layer directly influences ranking decisions. Using Shapley values [10, 14, 34] for feature attribution, we assessed the contributions of individual neurons and neuron groups to the ranking predictions, leveraging a value function based on changes in NDCG scores for MS MARCO query-document pairs. Probing validation was conducted on 100 query-document pairs for each RankLlama model, and we computed average attributions across neuron groups of varying sizes.
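The attribution step can be illustrated with a permutation-sampling estimate of Shapley values, where the value of a set of neurons is the NDCG of the ranking produced when only those neurons are active. This is a minimal sketch under stated assumptions: ablation by zeroing, the linear `readout`, the toy activations and labels, and the sample count `n_perm` are all hypothetical and do not reproduce the authors' value function.

```python
import numpy as np

def ndcg(rels_ranked):
    """NDCG of a list of relevance labels given in ranked order."""
    rels_ranked = np.asarray(rels_ranked, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(rels_ranked) + 2))
    idcg = float(np.sort(rels_ranked)[::-1] @ discounts)
    return float(rels_ranked @ discounts) / idcg if idcg > 0 else 0.0

def shapley_monte_carlo(players, value_fn, n_perm=200, seed=0):
    """Estimate Shapley values by averaging marginal contributions over random orderings."""
    rng = np.random.default_rng(seed)
    phi = {p: 0.0 for p in players}
    for _ in range(n_perm):
        coalition = set()
        prev = value_fn(frozenset(coalition))
        for i in rng.permutation(len(players)):
            p = players[i]
            coalition.add(p)
            cur = value_fn(frozenset(coalition))
            phi[p] += cur - prev
            prev = cur
    return {p: total / n_perm for p, total in phi.items()}

# Toy final layer: 10 candidate documents, 4 neurons, and a linear scoring head.
rng = np.random.default_rng(1)
acts = rng.normal(size=(10, 4))
readout = np.array([1.5, 0.8, 0.0, -0.3])   # neuron 2 has no influence on the score
labels = (acts @ readout > 0).astype(int)    # hypothetical relevance labels

def value_fn(kept):
    """NDCG of the ranking when only the neurons in `kept` are left active."""
    mask = np.zeros(acts.shape[1])
    for j in kept:
        mask[j] = 1.0
    scores = (acts * mask) @ readout
    ranking = np.argsort(-scores, kind="stable")
    return ndcg(labels[ranking])

neurons = [0, 1, 2, 3]
phi = shapley_monte_carlo(neurons, value_fn)
print(phi)
```

Because each sampled ordering telescopes from the empty set to the full set, the estimates always sum to the NDCG gap between all neurons active and none active, and a neuron that never changes the ranking receives an attribution of zero.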
Results showed that the identified neuron groups ranked within the 95th percentile in 79 out of 100 cases for RankLlama2-7b and 84 out of 100 for RankLlama2-13b, confirming that these neurons are instrumental in ranking decisions and validating the location of the probes.

6 CONCLUSION AND FUTURE WORK

In this study, we probed four different large language models, each fine-tuned for passage reranking, to look for a subset of features from the MSLR dataset. These human-engineered features, which include query-only, document-only, and query-document interaction features, are recognized as significant in ranking tasks. Using a layer-wise probe, we discovered that the activations in most layers could accurately replicate select features such as Min TF*IDF, covered QT Number/Ratio, Min TF, Min(TF/L), and Mean(TF/L). This suggests that the fine-tuned Llama network deems these features important. Conversely, we found no correspondence for features like sum and max of tf*idf and sum and max of tf in the probed neurons. This indicates the absence of these features in RankLlama's neural feature extractors. We also reported observations on feature groups that show a strong correlation with RankLlama neurons and hypothesized that some abstract features are mined by the LLM. We also observed that different LLM architectures, when fine-tuned on the same dataset with the same loss function, exhibit similar latent features when probed. Finally, we compared the latent features encoded for in-distribution and out-of-distribution datasets and encountered a case of the LLM apparently overfitting during fine-tuning. These findings enhance our understanding of ranking LLMs and give us generalizable insights for the design of more effective and transparent ranking models.
The long-term objectives of this endeavor include: (i) identifying potential modifications to existing MSLR features, such that they further align with LLM activations, (ii) deciphering and reverse-engineering segments of the LLM that do not correspond to recognized MSLR features, and (iii) ultimately cataloging all features deemed significant by LLMs and plugging them back into simpler statistical models to enhance the performance and interpretability of statistical ranking models. These directions will help further explain the underlying mechanisms of ranking LLMs and improve human trust in them.

Acknowledgments

This work was supported in part by the Center for Intelligent Information Retrieval. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

REFERENCES

[1] Avishek Anand, Lijun Lyu, Maximilian Idahl, Yumeng Wang, Jonas Wallat, and Zijian Zhang. 2022. Explainable information retrieval: A survey. arXiv preprint arXiv:2211.02405 (2022).
[2] Marco Ancona, Enea Ceolini, Cengiz Öztireli, and Markus Gross. 2019. Gradient-based attribution methods. Explainable AI: Interpreting, explaining and visualizing deep learning (2019), 169–191.
[3] David Bau, Jun-Yan Zhu, Hendrik Strobelt, Agata Lapedriza, Bolei Zhou, and Antonio Torralba. 2020. Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences 117, 48 (2020), 30071–30078.
[4] Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics 48, 1 (2022), 207–219.
[5] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 2023. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning. PMLR, 2397–2430.
[6] Catherine Chen, Jack Merullo, and Carsten Eickhoff. 2024. Axiomatic Causal Interventions for Reverse Engineering Relevance Computation in Neural Retrieval Models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1401–1410.
[7] Nuo Chen, Ning Wu, Shining Liang, Ming Gong, Linjun Shou, Dongmei Zhang, and Jia Li. 2023. Beyond Surface: Probing LLaMA Across Scales and Layers. arXiv preprint arXiv:2312.04333 (2023).
[8] Jaekeol Choi, Euna Jung, Sungjun Lim, and Wonjong Rhee. 2022. Finding Inverse Document Frequency Information in BERT. arXiv preprint arXiv:2202.12191 (2022).
[9] Tanya Chowdhury, Razieh Rahimi, and James Allan. 2023. Rank-lime: local model-agnostic feature attribution for learning to rank. In Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval. 33–37.
[10] Tanya Chowdhury, Yair Zick, and James Allan. 2024. RankSHAP: a Gold Standard Feature Attribution Method for the Ranking Task. arXiv preprint arXiv:2405.01848 (2024).
[11] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. 2019. What does bert look at? an analysis of bert’s attention. arXiv preprint arXiv:1906.04341 (2019).
[12] Róbert Csordás, Sjoerd van Steenkiste, and Jürgen Schmidhuber. 2020. Are neural nets modular? inspecting functional modularity through differentiable weight masks. arXiv preprint arXiv:2010.02066 (2020).
[13] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. 2023. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600 (2023).
[14] Anupam Datta, Shayak Sen, and Yair Zick. 2016. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In 2016 IEEE symposium on security and privacy (SP). IEEE, 598–617.
[15] Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. 2019. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency. 120–128.
[16] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024).
[17] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. 2021. A mathematical framework for transformer circuits. Transformer Circuits Thread 1 (2021), 1.
[18] Yixing Fan, Jiafeng Guo, Xinyu Ma, Ruqing Zhang, Yanyan Lan, and Xueqi Cheng. 2021. A linguistic study on relevance modeling in information retrieval. In Proceedings of the Web Conference 2021. 1053–1064.
[19] Zeon Trevor Fernando, Jaspreet Singh, and Avishek Anand. 2019. A study on the Interpretability of Neural Retrieval Models using DeepSHAP. In Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval. 1005–1008.
[20] James Fiacco, Samridhi Choudhary, and Carolyn Rose. 2019. Deep neural model inspection and comparison via functional neuron pathways. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 5754–5764.
[21] Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021. A white box analysis of ColBERT. In Advances in Information Retrieval: 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28–April 1, 2021, Proceedings, Part II 43. Springer, 257–263.
[22] Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2022. Match your words! a study of lexical matching in neural information retrieval. In European Conference on Information Retrieval. Springer, 120–127.
[23] Mor Geva, Avi Caciularu, Kevin Ro Wang, and Yoav Goldberg. 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. arXiv preprint arXiv:2203.14680 (2022).
[24] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2020. Transformer feed-forward layers are key-value memories. arXiv preprint arXiv:2012.14913 (2020).
[25] Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. 2023. Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610 (2023).
[26] Wes Gurnee and Max Tegmark. 2023. Language models represent space and time. arXiv preprint arXiv:2310.02207 (2023).
[27] Xinzhi Han and Sen Lei. 2018. Feature selection and model comparison on microsoft learning-to-rank data sets. arXiv preprint arXiv:1803.05127 (2018).
[28] Yael Hochma, Yuval Felendler, and Mark Last. 2024. Efficient Feature Ranking and Selection using Statistical Moments. IEEE Access (2024).
[29] Georgios Katsimpras and Georgios Paliouras. 2024. GENRA: Enhancing Zero-shot Retrieval with Rank Aggregation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 7566–7577.
[30] Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 39–48.
[31] Connor Kissane, Robert Krzyzanowski, Arthur Conmy, and Neel Nanda. 2024. Saes (usually) transfer between base and chat models. In AI Alignment Forum.
[32] Nathan Lambert, Lewis Tunstall, Nazneen Rajani, and Tristan Thrush. 2023. HuggingFace H4 Stack Exchange Preference Dataset. https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences
[33] Belinda Z Li, Maxwell Nye, and Jacob Andreas. 2021. Implicit representations of meaning in neural language models. arXiv preprint arXiv:2106.00737 (2021).
[34] Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. Advances in neural information processing systems 30 (2017).
[35] Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, and Jimmy Lin. 2023. Fine-tuning llama for multi-stage text retrieval. arXiv preprint arXiv:2310.08319 (2023).
[36] Sean MacAvaney, Sergey Feldman, Nazli Goharian, Doug Downey, and Arman Cohan. 2022. ABNIRML: Analyzing the behavior of neural IR models. Transactions of the Association for Computational Linguistics 10 (2022), 224–239.
[37] Aleksandar Makelov, Georg Lange, and Neel Nanda. 2023. Is this the subspace you are looking for? an interpretability illusion for subspace activation patching. arXiv preprint arXiv:2311.17030 (2023).
[38] Aleksandar Makelov, George Lange, and Neel Nanda. 2024. Towards principled evaluations of sparse autoencoders for interpretability and control. arXiv preprint arXiv:2405.08366 (2024).
[39] Clara Meister, Stefan Lazov, Isabelle Augenstein, and Ryan Cotterell. 2021. Is sparse attention more interpretable? arXiv preprint arXiv:2106.01087 (2021).
[40] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. Ms marco: A human-generated machine reading comprehension dataset. (2016).
[41] Andrew Parry, Catherine Chen, Carsten Eickhoff, and Sean MacAvaney. 2024. MechIR: A Mechanistic Interpretability Framework for Information Retrieval. (2024).
[42] Tao Qin and Tie-Yan Liu. 2013. Introducing LETOR 4.0 Datasets. CoRR abs/1306.2597 (2013). http://arxiv.org/abs/1306.2597
[43] Razieh Rahimi, Youngwoo Kim, Hamed Zamani, and James Allan. 2021. Explaining documents’ relevance to search queries. arXiv preprint arXiv:2111.01314 (2021).
[44] Tilman Räuker, Anson Ho, Stephen Casper, and Dylan Hadfield-Menell. 2023. Toward transparent ai: A survey on interpreting the inner structures of deep neural networks. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE, 464–483.
[45] Abhilasha Ravichander, Yonatan Belinkov, and Eduard Hovy. 2020. Probing the probing paradigm: Does probing accuracy entail task relevance? arXiv preprint arXiv:2005.00719 (2020).
[46] Hassan Sajjad, Nadir Durrani, and Fahim Dalvi. 2022. Neuron-level interpretation of deep nlp models: A survey. Transactions of the Association for Computational Linguistics 10 (2022), 1285–1303.
[47] Jaspreet Singh and Avishek Anand. 2019. Exs: Explainable search using local model agnostic interpretability. In Proceedings of the twelfth ACM international conference on web search and data mining. 770–773.
[48] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663 (2021).
[49] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
[50] Mitchell Wortsman, Vivek Ramanujan, Rosanne Liu, Aniruddha Kembhavi, Mohammad Rastegari, Jason Yosinski, and Ali Farhadi. 2020. Supermasks in superposition. Advances in Neural Information Processing Systems 33 (2020), 15173–15184.
[51] Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Min Zhang, and Shaoping Ma. 2020. An analysis of BERT in document ranking. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 1941–1944.
[52] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2014. Object detectors emerge in deep scene cnns. arXiv preprint arXiv:1412.6856 (2014).
[53] Bolei Zhou, Yiyou Sun, David Bau, and Antonio Torralba. 2018. Revisiting the importance of individual units in cnns via ablation. arXiv preprint arXiv:1806.02891 (2018).

---