📄 Paper | 🤗 Hugging Face | 🤗 Model | ⏬ Data
This repo contains the code for the paper SCI-Verifier: Scientific Verifier with Thinking.
- 2026.01: 🎉 SCI-Verifier is accepted to ICLR 2026.
SCI-VerifyBench is a cross-disciplinary benchmark for evaluating the scientific verification abilities of large language models (LLMs), covering mathematics, physics, chemistry, biology, and general scientific QA. It includes real LLM responses enhanced with domain-specific equivalence transformations, with high-quality annotations from both models and human experts.
SCI-Verifier is a reasoning-augmented model designed for scientific verification. It leverages logical reasoning and equivalence judgment to verify LLM answers accurately while providing concise and stable outputs.
Together, SCI-VerifyBench and SCI-Verifier offer a principled framework for systematic evaluation and reliable scientific reasoning with LLMs.
SCI-VerifyBench is a comprehensive benchmark designed to evaluate the scientific verification capabilities of Large Language Models (LLMs), spanning mathematics, physics, chemistry, biology, and general scientific QA.
The field names in the files are explained as follows:
- uid: Unique identifier for each question
- question: The question text
- gold_answer: The correct/reference answer
- raw_llm_response: The response generated by an LLM
- llm_response: The final answer extracted from the LLM response according to rules
- answer_type: The format of the answer: "Expression", "Numerical", "Interval", "Equation", etc.
- data_source: The source dataset from which the question was taken
- domain: The domain of the problem: "math", "physics", "chemistry", "biology", or "QA"
- task_type: Category corresponding to the task
- gold_judgment: The verification judgment: true/false
- aug: Whether the answer was generated through an equivalence transformation
- llm: The LLM that produced llm_response
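To make the schema concrete, here is a minimal sketch that builds one record with the fields listed above and validates it before writing it as JSONL. The field values and the `validate` helper are illustrative assumptions, not taken from the released data.

```python
import json

# Hypothetical example record following the field schema above;
# the values are made up for illustration only.
record = {
    "uid": "math-000123",
    "question": "What is the derivative of x^2?",
    "gold_answer": "2x",
    "raw_llm_response": "The derivative of x^2 is 2*x.",
    "llm_response": "2*x",
    "answer_type": "Expression",
    "data_source": "example-source",
    "domain": "math",
    "task_type": "symbolic",
    "gold_judgment": True,
    "aug": True,
    "llm": "example-llm",
}

# All field names documented in the README.
REQUIRED_FIELDS = {
    "uid", "question", "gold_answer", "raw_llm_response", "llm_response",
    "answer_type", "data_source", "domain", "task_type",
    "gold_judgment", "aug", "llm",
}

def validate(rec: dict) -> bool:
    """Check that a benchmark record carries every expected field."""
    return REQUIRED_FIELDS <= rec.keys()

# Records are stored one JSON object per line (JSONL).
line = json.dumps(record)
assert validate(json.loads(line))
```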
We propose a two-stage post-training approach using SFT and RL to develop a scientific verifier with concise reasoning capabilities, demonstrating strong ability in judging answer equivalence.
Use the following command to evaluate the verifier:
python src/local_eval.py \
--model_path \
--data_root \
--dataset_name \ # Location of the data: {data_root}/{dataset_name}.jsonl
--output_dir \ # Location of the output summary: {output_dir}/{dataset_name}/{model_name}
--prompt_type \ # ["instruct", "cot", "xverify"]
--batch_size \
--tensor_parallel_size \
--temperature \
--max_tokens
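Once the evaluation has run, the per-example judgments can be scored against the gold labels. The sketch below computes verification accuracy over a JSONL results file; the field names `pred_judgment` and `gold_judgment` are assumptions for illustration and should be adjusted to match the actual summary files written by `src/local_eval.py`.

```python
import json
from pathlib import Path

def accuracy_from_jsonl(path: str, pred_key: str = "pred_judgment",
                        gold_key: str = "gold_judgment") -> float:
    """Compute verification accuracy over a JSONL results file.

    Assumes each line is a JSON object holding the model's predicted
    judgment and the gold judgment (field names are illustrative).
    """
    records = [
        json.loads(line)
        for line in Path(path).read_text().splitlines()
        if line.strip()
    ]
    if not records:
        return 0.0
    correct = sum(r[pred_key] == r[gold_key] for r in records)
    return correct / len(records)
```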
The key results on SCI-VerifyBench are as follows:
If you are interested in our work, please contact us at:
- Shenghe Zheng: shenghez.zheng@gmail.com
@article{zheng2025sci,
title={SCI-Verifier: Scientific Verifier with Thinking},
author={Zheng, Shenghe and Huang, Chenyu and Yu, Fangchen and Yao, Junchi and Ye, Jingqi and Chen, Tao and Luo, Yun and Ding, Ning and Bai, Lei and Cui, Ganqu and others},
journal={arXiv preprint arXiv:2509.24285},
year={2025}
}

