MIRAGE is a benchmark designed to evaluate the performance of retrieval-augmented generation (RAG) systems using various QA datasets. It includes 7,560 Q/A pairs and a retrieval pool of 37,800 context chunks collected from Wikipedia-based QA benchmarks such as IfQA, Natural Questions, TriviaQA, DROP, and PopQA.
- RAG Evaluation: Measures the robustness of LLMs in RAG environments using three setups:
- Base: Closed-book QA where only the query is provided.
- Oracle: Open-book QA with the correct context provided.
- Mixed: Realistic RAG environment with both correct and noisy contexts.
- MIRAGE Metrics: Evaluates LLM adaptability in RAG environments through four metrics (illustrated by the sketch after this list):
- Noise Vulnerability: Assesses the model's susceptibility to noise in the context.
- Context Acceptability: Evaluates the model's ability to effectively leverage the provided context to generate accurate answers.
- Context Insensitivity: Highlights cases where the model fails to utilize the context information.
- Context Misinterpretation: Identifies cases where the model answers correctly without context but hallucinates when given the oracle context.
- Retriever Dependency: Noise Vulnerability and Context Acceptability metrics show significant differences based on the retriever used, indicating that the retrieval phase is a bottleneck in RAG pipelines.
- LLM Capability: Context Insensitivity and Context Misinterpretation metrics are more related to the inherent capabilities of the LLM, showing improvements with newer models.
- Efficient Evaluation: Uses a retrieval pool of 37.8k chunks (1% of the full Wikipedia dump), significantly reducing computational cost while remaining highly correlated with large-scale benchmarks like MTEB.
- Scaling Effect: Accurately reflects scaling effects within the same model family and trends observed in top-performing models like NV-embed-v2.
- Realistic Setup: Builds the mixed context from the top-5 chunks returned by the actual retriever, so that performance always falls between the base and oracle setups.
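The four metrics contrast a model's correctness across the base, oracle, and mixed setups. The sketch below is only an illustration of that idea as a per-question bucketing; the `Outcome`/`bucket`/`rates` names and the exact conditions and aggregation are assumptions for this README, not the paper's formal definitions.

```python
# Illustrative bucketing of questions into the four MIRAGE metric categories,
# derived from the informal definitions above. The paper's exact conditions
# and aggregation may differ; treat this as a conceptual sketch only.
from dataclasses import dataclass

@dataclass
class Outcome:
    base: bool    # answered correctly with the query alone (closed-book)
    oracle: bool  # answered correctly with the gold context
    mixed: bool   # answered correctly with gold + noisy retrieved context

def bucket(o: Outcome) -> list[str]:
    """Metric categories this question contributes to (may be more than one)."""
    tags = []
    if o.oracle and not o.mixed:
        tags.append("noise_vulnerability")        # noise breaks an otherwise answerable case
    if not o.base and o.oracle:
        tags.append("context_acceptability")      # context turns a failure into a success
    if not o.base and not o.oracle:
        tags.append("context_insensitivity")      # context is provided but not exploited
    if o.base and not o.oracle:
        tags.append("context_misinterpretation")  # correct closed-book, derailed by context
    return tags

def rates(outcomes: list[Outcome]) -> dict[str, float]:
    """Fraction of questions falling into each category."""
    names = ["noise_vulnerability", "context_acceptability",
             "context_insensitivity", "context_misinterpretation"]
    counts = {name: 0 for name in names}
    for o in outcomes:
        for tag in bucket(o):
            counts[tag] += 1
    n = max(len(outcomes), 1)
    return {name: count / n for name, count in counts.items()}
```

Read this way, noise vulnerability and context acceptability depend on what the retriever places in the mixed context, while context insensitivity and misinterpretation are properties of the LLM itself, which matches the retriever-dependency and LLM-capability observations above.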
- Create Conda Environment: `conda create -n mirage python==3.11.11`, then `conda activate mirage`
- Clone Repository: `git clone https://github.com/JohnnyNLP/MIRAGE.git`, then `cd MIRAGE`
- Install Requirements: `pip install -r requirements.txt`
- Run Main Script: `python main.py`
- Modify Configuration (if needed): edit `config.yaml` as required.
- Run Evaluation Script: `python evaluation.py`
- Contains default settings for 4 LLMs and 5 retrievers used in the main experiments.
- Designed to run on a single GPU (A6000).
- Supports three modes: RAG, LLM, RET.
- Configurable arguments via `config.yaml`.
- Uses vLLM for LLM inference and SentenceTransformer for retriever inference.
- The default setup is 5-shot, chosen to balance retrieval pool size against RAG performance.
- Evaluates retriever, LLM, and RAG performance.
- LLMs are evaluated using EM Score, retrievers using F1, NDCG, and Acc, and RAG performance using four metrics proposed in the MIRAGE paper.
- The detailed report can be found in the `Evaluation_result` directory:
  - `LLM_result.jsonl` shows F1, EM_loose, and EM_strict scores. (EM_loose is more reliable than EM_strict, since LLMs tend to generate verbose responses.)
  - `RET_result.jsonl` shows F1, NDCG, precision, and recall at cutoffs 1, 3, and 5.
  - `Metrics.jsonl` shows the 4 MIRAGE metric scores: noise vulnerability, context acceptability, context insensitivity, and context misinterpretation.
- When running the script, you can also see the ranking and overall score of each system.
- The overall score is calculated as `-NV + CA - CI - CM` (see the sketch below).
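As a point of reference, here is a rough sketch of how a loose vs. strict EM check and the overall score could be computed. The SQuAD-style normalization and the helper names (`normalize`, `em_strict`, `em_loose`, `overall_score`) are assumptions for illustration; the repo's actual scoring code may differ.

```python
# Rough scoring sketch (illustration only; not the repo's implementation).
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def em_strict(prediction: str, gold: str) -> bool:
    """Exact match after normalization; penalizes verbose but correct answers."""
    return normalize(prediction) == normalize(gold)

def em_loose(prediction: str, gold: str) -> bool:
    """Counts the answer as correct if the gold string appears in the prediction."""
    return normalize(gold) in normalize(prediction)

def overall_score(nv: float, ca: float, ci: float, cm: float) -> float:
    """Overall score as stated above: -NV + CA - CI - CM (higher is better)."""
    return -nv + ca - ci - cm
```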
- Simple and Fast: Designed for quick and easy use with minimal computational resources.
- Effective for LLM/Retriever/RAG Experiments: Provides datasets and code for effective experimentation without heavy resource requirements.
- vLLM Framework: Supports multi-GPU inference for LLMs.
- Single GPU for Retriever: Currently supports single GPU inference for retrievers using SentenceTransformer.
- Batch API for Cost Efficiency: Consider using batch API to reduce costs, especially for GPT-4o inference.
- GPT-4o Inference: Costs approximately $70 for a single run; this may change depending on OpenAI's pricing policy.
- Batch API: Recommended for cost savings (see the sketch below).
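For GPT-4o, the general Batch API flow with the standard `openai` Python client looks roughly like the sketch below. This is not part of the MIRAGE codebase; the file name and request contents are illustrative, and you would need to adapt them to the prompts your run actually produces.

```python
# Sketch of the OpenAI Batch API flow (not part of this repository).
# Assumes OPENAI_API_KEY is set and "batch_requests.jsonl" holds one request per line:
# {"custom_id": "q-0001", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": "..."}]}}
from openai import OpenAI

client = OpenAI()

# 1. Upload the JSONL file of requests.
batch_file = client.files.create(file=open("batch_requests.jsonl", "rb"), purpose="batch")

# 2. Create the batch job; it completes within 24 hours at a reduced per-token price.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Poll later and download the responses once the job has finished.
job = client.batches.retrieve(batch.id)
if job.status == "completed":
    results_jsonl = client.files.content(job.output_file_id).text
```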