📄 Paper | 🤗 Hugging Face | 🤗 Model | ⏬ Data
This repo contains the code for the paper SCI-Verifier: Scientific Verifier with Thinking.
- 2026.01: 🎉 SCI-Verifier is accepted to ICLR 2026.
SCI-VerifyBench is a cross-disciplinary benchmark for evaluating the scientific verification abilities of large language models (LLMs), covering mathematics, physics, chemistry, biology, and general scientific QA. It includes real LLM responses enhanced with domain-specific equivalence transformations, with high-quality annotations from both models and human experts.
SCI-Verifier is a reasoning-augmented model designed for scientific verification. It leverages logical reasoning and equivalence judgment to verify LLM answers accurately while providing concise and stable outputs.
Together, SCI-VerifyBench and SCI-Verifier offer a principled framework for systematic evaluation and reliable scientific reasoning with LLMs.
SCI-VerifyBench is a comprehensive benchmark designed to evaluate the scientific verification capabilities of Large Language Models (LLMs), spanning mathematics, physics, chemistry, biology, and general scientific QA.
The field names in the files are explained as follows:
- uid: Unique identifier for each question
- question: The question text
- gold_answer: The correct/reference answer
- raw_llm_response: The response generated by an LLM
- llm_response: The final answer extracted from the LLM response according to rules
- answer_type: The format of the answer: "Expression", "Numerical", "Interval", "Equation", etc.
- data_source: The source dataset from which the question was taken
- domain: The domain of the problem: "math", "physics", "chemistry", "biology", or "QA"
- task_type: Category corresponding to the task
- gold_judgment: The verification judgment: true/false
- aug: Whether the answer was generated through an equivalence transformation
- llm: The LLM that produced llm_response
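To make the schema concrete, here is a minimal sketch that builds one record with the fields listed above and validates it before writing it as JSONL. The field values and the `validate` helper are illustrative assumptions, not taken from the released data.

```python
import json

# Hypothetical example record following the field schema above;
# the values are made up for illustration only.
record = {
    "uid": "math-000123",
    "question": "What is the derivative of x^2?",
    "gold_answer": "2x",
    "raw_llm_response": "The derivative of x^2 is 2*x.",
    "llm_response": "2*x",
    "answer_type": "Expression",
    "data_source": "example-source",
    "domain": "math",
    "task_type": "symbolic",
    "gold_judgment": True,
    "aug": True,
    "llm": "example-llm",
}

# All field names documented in the README.
REQUIRED_FIELDS = {
    "uid", "question", "gold_answer", "raw_llm_response", "llm_response",
    "answer_type", "data_source", "domain", "task_type",
    "gold_judgment", "aug", "llm",
}

def validate(rec: dict) -> bool:
    """Check that a benchmark record carries every expected field."""
    return REQUIRED_FIELDS <= rec.keys()

# Records are stored one JSON object per line (JSONL).
line = json.dumps(record)
assert validate(json.loads(line))
```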
We propose a two-stage post-training approach using SFT and RL to develop a scientific verifier with concise reasoning capabilities, demonstrating strong ability in judging answer equivalence.
Use the following command to evaluate the verifier:
python src/local_eval.py \
--model_path \
--data_root \
--dataset_name \ # Location of the data: {data_root}/{dataset_name}.jsonl
--output_dir \ # Location of the output summary: {output_dir}/{dataset_name}/{model_name}
--prompt_type \ # ["instruct", "cot", "xverify"]
--batch_size \
--tensor_parallel_size \
--temperature \
--max_tokens
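Once the evaluation has run, the per-example judgments can be scored against the gold labels. The sketch below computes verification accuracy over a JSONL results file; the field names `pred_judgment` and `gold_judgment` are assumptions for illustration and should be adjusted to match the actual summary files written by `src/local_eval.py`.

```python
import json
from pathlib import Path

def accuracy_from_jsonl(path: str, pred_key: str = "pred_judgment",
                        gold_key: str = "gold_judgment") -> float:
    """Compute verification accuracy over a JSONL results file.

    Assumes each line is a JSON object holding the model's predicted
    judgment and the gold judgment (field names are illustrative).
    """
    records = [
        json.loads(line)
        for line in Path(path).read_text().splitlines()
        if line.strip()
    ]
    if not records:
        return 0.0
    correct = sum(r[pred_key] == r[gold_key] for r in records)
    return correct / len(records)
```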
The key results on SCI-VerifyBench are as follows:
If you are interested in our work, please contact us at:
- Shenghe Zheng: shenghez.zheng@gmail.com
@article{zheng2025sci,
title={SCI-Verifier: Scientific Verifier with Thinking},
author={Zheng, Shenghe and Huang, Chenyu and Yu, Fangchen and Yao, Junchi and Ye, Jingqi and Chen, Tao and Luo, Yun and Ding, Ning and Bai, Lei and Cui, Ganqu and others},
journal={arXiv preprint arXiv:2509.24285},
year={2025}
}

