Toward Reliable Scientific Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models

News

Our paper has been accepted to IJCAI 2025!

Introduction

TruthHypo is a benchmark for assessing how well LLMs generate truthful scientific hypotheses. This repo also contains the source code of KnowHD, a knowledge-based hallucination detector that evaluates how well hypotheses are grounded in existing knowledge. Our paper shows that LLMs struggle to generate truthful hypotheses. By analyzing hallucinations in reasoning steps, we demonstrate that the groundedness scores provided by KnowHD serve as an effective metric for filtering truthful hypotheses from the diverse outputs of LLMs.
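To illustrate the filtering idea, here is a minimal sketch of choosing among several generated hypotheses by groundedness score. It is not the KnowHD implementation: the `Candidate` class, the `select_most_grounded` function, the `min_score` parameter, and the example hypotheses and scores are all hypothetical and made up for illustration. It only assumes that each hypothesis comes with a numeric score where higher means better grounded.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """A generated hypothesis paired with a groundedness score (hypothetical structure)."""
    hypothesis: str
    groundedness: float  # assumed: higher means better grounded


def select_most_grounded(candidates: List[Candidate],
                         min_score: float = 0.0) -> Optional[Candidate]:
    """Return the highest-scoring candidate at or above min_score, or None if none qualify."""
    eligible = [c for c in candidates if c.groundedness >= min_score]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.groundedness)


# Made-up candidates and scores, for illustration only.
candidates = [
    Candidate("Gene A stimulates Disease B", 0.40),
    Candidate("Gene A inhibits Disease B", 0.85),
    Candidate("Gene A has no relation to Disease B", 0.10),
]

best = select_most_grounded(candidates, min_score=0.5)
print(best.hypothesis if best else "no candidate passed the threshold")
```

Returning `None` when nothing clears the threshold lets the caller decide whether to abstain or to regenerate, rather than silently accepting a poorly grounded hypothesis.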

Usage

The TruthHypo dataset is directly accessible via HuggingFace:

from datasets import load_dataset

data = load_dataset("TruthHypo/edges_test")
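If you have cloned the repository, the same test split is also shipped as data/edges_test.tsv. The sketch below reads a tab-separated file with the standard library only. The `read_tsv` helper is hypothetical and not part of this repo, and it assumes the file starts with a header row, which should be checked against the actual file before relying on the column names.

```python
import csv
from typing import Dict, List


def read_tsv(path: str) -> List[Dict[str, str]]:
    """Read a tab-separated file into a list of dicts keyed by the header row.

    Assumption: the first line of the file is a header row.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))
```

Calling `read_tsv("data/edges_test.tsv")` from the repository root returns one dict per row. Inspecting the keys of the first row is a quick way to see which columns are available.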

The processed knowledge sources for knowledge-enhanced hypothesis generation can be found at:

Literature
- PubMed Articles
Knowledge Graph
- PubTator Edges
- PubTator Nodes

Structure

Our repository contains the following contents:

data: the data of TruthHypo benchmark
- edges_test.tsv: the test data used for LLM evaluation
src: the source code of agents and verifiers used in our experiments
- agent: the LLM agents used to generate biomedical hypotheses
  - base.py: the base agent
  - cot.py: the agent using parametric knowledge only
  - kg.py: the agent using both parametric knowledge and information from knowledge graphs
  - rag.py: the agent using both parametric knowledge and information from scientific literature
  - rag_kg.py: the agent using parametric knowledge and information from both knowledge graphs and scientific literature
- verifier: the LLM verifiers used to measure the groundedness of generated hypotheses
  - rag_verifier.py: the verifier with scientific literature as the supporting knowledge base
  - kg_verifier.py: the verifier with knowledge graphs as the supporting knowledge base
  - rag_kg_verifier.py: the verifier with both scientific literature and knowledge graphs as the supporting knowledge base

Citation

@inproceedings{xiong2025toward,
  title     = {Toward Reliable Scientific Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models},
  author    = {Xiong, Guangzhi and Xie, Eric and Williams, Corey and Kim, Myles and Shariatmadari, Amir Hassan and Guo, Sikun and Bekiranov, Stefan and Zhang, Aidong},
  booktitle = {Proceedings of the Thirty-Fourth International Joint Conference on
               Artificial Intelligence, {IJCAI-25}},
  publisher = {International Joint Conferences on Artificial Intelligence Organization},
  editor    = {James Kwok},
  pages     = {7849--7857},
  year      = {2025},
  month     = {8},
  note      = {Main Track},
  doi       = {10.24963/ijcai.2025/873},
  url       = {https://doi.org/10.24963/ijcai.2025/873},
}
