Metrics for Evaluating RAG Systems
1. Recall@K
Measures the fraction of all relevant documents that appear in the top K retrieved results.
Formula:
Recall@K = \( \frac{\text{Number of relevant documents in top K results}}{\text{Total number of relevant documents}} \)
Example:
If K = 2, there is a single relevant document, and it appears in the top 2 results:
Recall@2 = \( \frac{1}{1} = 1.0 \)
2. Mean Reciprocal Rank (MRR)
Evaluates how highly the first relevant document is ranked, averaged across queries.
Formula:
MRR = \( \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i} \), where N is the number of queries and \( \text{rank}_i \) is the rank of the first relevant document for query i.
Example:
Query 1: Reciprocal Rank = \( \frac{1}{2} \)
Query 2: Reciprocal Rank = \( 1.0 \)
Query 3: Reciprocal Rank = \( 0 \) (no relevant document retrieved)
MRR = \( \frac{1}{3} (\frac{1}{2} + 1.0 + 0) = 0.5 \)
3. BLEU (Bilingual Evaluation Understudy)
Measures the overlap of n-grams between the generated text and a reference answer.
Formula:
BLEU = \( BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) \), where BP is the brevity penalty, \( p_n \) is the modified n-gram precision, and \( w_n \) is the weight assigned to n-grams of length n.
4. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Focuses on recall by measuring how much of the reference text is captured in the generated text. Common variants include ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence).
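As a quick sketch, ROUGE-1 and ROUGE-L can be computed with the third-party rouge-score package (assumed to be installed, e.g. via pip install rouge-score); the exact numbers depend on its tokenization and stemming settings.
from rouge_score import rouge_scorer

# ROUGE-1 counts unigram overlap; ROUGE-L uses the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "OpenAI was founded by Elon Musk and Sam Altman"
candidate = "OpenAI was started by Sam Altman and Elon Musk"
scores = scorer.score(reference, candidate)  # score(target, prediction)
print("ROUGE-1 recall:", scores["rouge1"].recall)
print("ROUGE-L F1:", scores["rougeL"].fmeasure)
Each entry is a named tuple exposing precision, recall, and fmeasure.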
5. Factual Consistency
Ensures that the generated response is factually consistent with the retrieved content. This can be evaluated manually, with natural language inference (NLI) models, or with other automated tools.
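There is no single standard formula here. As a rough, purely lexical illustration (not a production method), the sketch below flags numbers and capitalized terms in the generated answer that never appear in the retrieved context; the example strings and the unsupported_terms helper are made up for demonstration.
import re

def unsupported_terms(answer, context):
    # Rough heuristic: collect numbers and capitalized terms from the answer
    # and report those that never appear in the retrieved context.
    # Real evaluations typically rely on NLI models or human judges instead.
    terms = re.findall(r"\b(?:[A-Z][a-zA-Z]+|\d+(?:\.\d+)?)\b", answer)
    context_lower = context.lower()
    return [t for t in terms if t.lower() not in context_lower]

context = "OpenAI was founded in December 2015 by a group including Elon Musk and Sam Altman."
answer = "OpenAI was founded in 2015 by Elon Musk, Sam Altman, and Jeff Bezos."
# "Jeff" and "Bezos" are deliberately unsupported by the context above.
print("Potentially unsupported terms:", unsupported_terms(answer, context))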
Implementation Examples
Python Code for Recall@K and MRR
def recall_at_k(retrieved_docs, relevant_docs, k):
    # Fraction of all relevant documents that appear in the top-k results.
    retrieved_at_k = retrieved_docs[:k]
    return sum(1 for doc in retrieved_at_k if doc in relevant_docs) / len(relevant_docs)

def mean_reciprocal_rank(retrieved_docs, relevant_docs_list):
    # Average of 1/rank of the first relevant document, over each relevance set.
    reciprocal_ranks = []
    for relevant_docs in relevant_docs_list:
        for rank, doc in enumerate(retrieved_docs, start=1):
            if doc in relevant_docs:
                reciprocal_ranks.append(1 / rank)
                break
        else:
            # No relevant document was retrieved: reciprocal rank is 0.
            reciprocal_ranks.append(0)
    return sum(reciprocal_ranks) / len(relevant_docs_list)

retrieved_docs = ["doc1", "doc2", "doc3"]
relevant_docs_list = [["doc2"], ["doc3"]]

print("Recall@2:", recall_at_k(retrieved_docs, relevant_docs_list[0], 2))
print("MRR:", mean_reciprocal_rank(retrieved_docs, relevant_docs_list))
Python Code for BLEU
from nltk.translate.bleu_score import sentence_bleu

# Tokenized inputs; sentence_bleu expects a list of reference token lists.
reference = [["OpenAI", "was", "founded", "by", "Elon", "Musk", "and", "Sam", "Altman"]]
candidate = ["OpenAI", "was", "started", "by", "Sam", "Altman", "and", "Elon", "Musk"]

# Default weights use up to 4-grams; see the smoothing note below for short sentences.
score = sentence_bleu(reference, candidate)
print("BLEU Score:", score)