Toward Reliable Scientific Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models
- Our paper is accepted to IJCAI 2025!
TruthHypo is a benchmark for assessing the capabilities of LLMs in generating truthful scientific hypotheses. This repo also contains the source code of KnowHD, a knowledge-based hallucination detector to evaluate how well hypotheses are grounded in existing knowledge. Our paper shows that LLMs struggle to generate truthful hypotheses. By analyzing hallucinations in reasoning steps, we demonstrate that the groundedness scores provided by KnowHD serve as an effective metric for filtering truthful hypotheses from the diverse outputs of LLMs.
The TruthHypo dataset is directly accessible via HuggingFace:
from datasets import load_dataset
data = load_dataset("TruthHypo/edges_test")The processed knowledge sources for knowledge-enhanced hypothesis generation can be found at
- Literature
- Knowledge Graph
Our repository contains the following contents:
- data: the data of TruthHypo benchmark
- edges_test.tsv: the test data used for LLM evaluation
- src: the source code of agents and verifiers used in our experiments
- agent: the LLM agents used to generated biomedical hypotheses
- base.py: the base agent
- cot.py: the agent using parametric knowledge only
- kg.py: the agent using both parametric knowledge and information fromknowledge graphs
- rag.py: the agent using both parametric knowledge and information from scientific literature
- rag_kg.py: the agent using parametric knowledge and information from both knowledge graphs and scientific literature
- verifier: the LLM verifiers used to measure the groundedness of generated hypotheses
- rag_verifier.py: the verifier with scientific literature as the supporting knowledge base
- kg_verifier.py: the verifier with knowledge graphs as the supporting knowledge base
- rag_kg_verifier.py: the verifier with both scientific literature and knowledge graphs as the supporting knowledge base
- agent: the LLM agents used to generated biomedical hypotheses
@inproceedings{xiong2025toward,
title = {Toward Reliable Scientific Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models},
author = {Xiong, Guangzhi and Xie, Eric and Williams, Corey and Kim, Myles and Shariatmadari, Amir Hassan and Guo, Sikun and Bekiranov, Stefan and Zhang, Aidong},
booktitle = {Proceedings of the Thirty-Fourth International Joint Conference on
Artificial Intelligence, {IJCAI-25}},
publisher = {International Joint Conferences on Artificial Intelligence Organization},
editor = {James Kwok},
pages = {7849--7857},
year = {2025},
month = {8},
note = {Main Track},
doi = {10.24963/ijcai.2025/873},
url = {https://doi.org/10.24963/ijcai.2025/873},
}