MIRAGE is a benchmark designed to evaluate the performance of retrieval-augmented generation (RAG) systems using various QA datasets. It includes 7,560 Q/A pairs and a retrieval pool of 37,800 context chunks collected from Wikipedia-based QA benchmarks such as IfQA, Natural Questions, TriviaQA, DROP, and PopQA.
- RAG Evaluation: Measures the robustness of LLMs in RAG environments using three setups:
- Base: Closed-book QA where only the query is provided.
- Oracle: Open-book QA with the correct context provided.
- Mixed: Realistic RAG environment with both correct and noisy contexts.
- MIRAGE Metrics: Evaluates LLM adaptability in RAG environments through four metrics (illustrated by the sketch after this list):
- Noise Vulnerability: Assesses the model's susceptibility to noise in the context.
- Context Acceptability: Evaluates the model's ability to effectively leverage the provided context to generate accurate answers.
- Context Insensitivity: Highlights cases where the model fails to utilize the context information.
- Context Misinterpretation: Identifies cases where the model answers correctly without context but hallucinates when given the oracle context.
- Retriever Dependency: Noise Vulnerability and Context Acceptability metrics show significant differences based on the retriever used, indicating that the retrieval phase is a bottleneck in RAG pipelines.
- LLM Capability: Context Insensitivity and Context Misinterpretation metrics are more related to the inherent capabilities of the LLM, showing improvements with newer models.
- Efficient Evaluation: Uses a retrieval pool of 37.8k chunks (1% of the full Wikipedia dump), significantly reducing computational cost while remaining highly correlated with large-scale benchmarks like MTEB.
- Scaling Effect: Accurately reflects scaling effects within the same model family and trends observed in top-performing models like NV-embed-v2.
- Realistic Setup: Builds the mixed context from the top-5 chunks returned by the actual retriever, so that performance always falls between the base and oracle setups.
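The four metrics contrast a model's correctness across the base, oracle, and mixed setups. The sketch below is only an illustration of that idea as a per-question bucketing; the `Outcome`/`bucket`/`rates` names and the exact conditions and aggregation are assumptions for this README, not the paper's formal definitions.

```python
# Illustrative bucketing of questions into the four MIRAGE metric categories,
# derived from the informal definitions above. The paper's exact conditions
# and aggregation may differ; treat this as a conceptual sketch only.
from dataclasses import dataclass

@dataclass
class Outcome:
    base: bool    # answered correctly with the query alone (closed-book)
    oracle: bool  # answered correctly with the gold context
    mixed: bool   # answered correctly with gold + noisy retrieved context

def bucket(o: Outcome) -> list[str]:
    """Metric categories this question contributes to (may be more than one)."""
    tags = []
    if o.oracle and not o.mixed:
        tags.append("noise_vulnerability")        # noise breaks an otherwise answerable case
    if not o.base and o.oracle:
        tags.append("context_acceptability")      # context turns a failure into a success
    if not o.base and not o.oracle:
        tags.append("context_insensitivity")      # context is provided but not exploited
    if o.base and not o.oracle:
        tags.append("context_misinterpretation")  # correct closed-book, derailed by context
    return tags

def rates(outcomes: list[Outcome]) -> dict[str, float]:
    """Fraction of questions falling into each category."""
    names = ["noise_vulnerability", "context_acceptability",
             "context_insensitivity", "context_misinterpretation"]
    counts = {name: 0 for name in names}
    for o in outcomes:
        for tag in bucket(o):
            counts[tag] += 1
    n = max(len(outcomes), 1)
    return {name: count / n for name, count in counts.items()}
```

Read this way, noise vulnerability and context acceptability depend on what the retriever places in the mixed context, while context insensitivity and misinterpretation are properties of the LLM itself, which matches the retriever-dependency and LLM-capability observations above.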
- Create Conda Environment: `conda create -n mirage python==3.11.11`, then `conda activate mirage`
- Clone Repository: `git clone https://github.com/JohnnyNLP/MIRAGE.git`, then `cd MIRAGE`
- Install Requirements: `pip install -r requirements.txt`
- Run Main Script: `python main.py`
- Modify Configuration (if needed): edit `config.yaml` as required.
- Run Evaluation Script: `python evaluation.py`
- Contains default settings for 4 LLMs and 5 retrievers used in the main experiments.
- Designed to run on a single GPU (A6000).
- Supports three modes: RAG, LLM, RET.
- Configurable arguments via `config.yaml`.
- Uses vLLM for LLM inference and SentenceTransformer for retriever inference.
- The default setup is 5-shot, chosen to balance retrieval pool size against RAG performance.
- Evaluates retriever, LLM, and RAG performance.
- LLMs are evaluated using EM Score, retrievers using F1, NDCG, and Acc, and RAG performance using four metrics proposed in the MIRAGE paper.
- The detailed report can be found in the `Evaluation_result` directory:
  - `LLM_result.jsonl` shows F1, EM_loose, and EM_strict scores. (EM_loose is more reliable than EM_strict, since LLMs tend to generate verbose responses.)
  - `RET_result.jsonl` shows F1, NDCG, precision, and recall at cutoffs 1, 3, and 5.
  - `Metrics.jsonl` shows the 4 MIRAGE metric scores: noise vulnerability, context acceptability, context insensitivity, and context misinterpretation.
- When running the script, you can also see the ranking and overall score of each system.
- The overall score is calculated as `-NV + CA - CI - CM` (see the sketch below).
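As a point of reference, here is a rough sketch of how a loose vs. strict EM check and the overall score could be computed. The SQuAD-style normalization and the helper names (`normalize`, `em_strict`, `em_loose`, `overall_score`) are assumptions for illustration; the repo's actual scoring code may differ.

```python
# Rough scoring sketch (illustration only; not the repo's implementation).
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def em_strict(prediction: str, gold: str) -> bool:
    """Exact match after normalization; penalizes verbose but correct answers."""
    return normalize(prediction) == normalize(gold)

def em_loose(prediction: str, gold: str) -> bool:
    """Counts the answer as correct if the gold string appears in the prediction."""
    return normalize(gold) in normalize(prediction)

def overall_score(nv: float, ca: float, ci: float, cm: float) -> float:
    """Overall score as stated above: -NV + CA - CI - CM (higher is better)."""
    return -nv + ca - ci - cm
```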
- Simple and Fast: Designed for quick and easy use with minimal computational resources.
- Effective for LLM/Retriever/RAG Experiments: Provides datasets and code for effective experimentation without heavy resource requirements.
- vLLM Framework: Supports multi-GPU inference for LLMs.
- Single GPU for Retriever: Currently supports single GPU inference for retrievers using SentenceTransformer.
- Batch API for Cost Efficiency: Consider using batch API to reduce costs, especially for GPT-4o inference.
- GPT-4o Inference: Costs approximately $70 for a single run; this may change depending on OpenAI's pricing policy.
- Batch API: Recommended for cost savings (see the sketch below).
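For GPT-4o, the general Batch API flow with the standard `openai` Python client looks roughly like the sketch below. This is not part of the MIRAGE codebase; the file name and request contents are illustrative, and you would need to adapt them to the prompts your run actually produces.

```python
# Sketch of the OpenAI Batch API flow (not part of this repository).
# Assumes OPENAI_API_KEY is set and "batch_requests.jsonl" holds one request per line:
# {"custom_id": "q-0001", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": "..."}]}}
from openai import OpenAI

client = OpenAI()

# 1. Upload the JSONL file of requests.
batch_file = client.files.create(file=open("batch_requests.jsonl", "rb"), purpose="batch")

# 2. Create the batch job; it completes within 24 hours at a reduced per-token price.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Poll later and download the responses once the job has finished.
job = client.batches.retrieve(batch.id)
if job.status == "completed":
    results_jsonl = client.files.content(job.output_file_id).text
```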