SCI-Verifier: Scientific Verifier with Thinking

📄 Paper   |   🤗 Hugging Face   |   🤗 Model   |   ⏬ Data

This repo contains the code for the paper SCI-Verifier: Scientific Verifier with Thinking.

🔥 News

Overview

SCI-VerifyBench is a cross-disciplinary benchmark for evaluating the scientific verification abilities of large language models (LLMs), covering mathematics, physics, chemistry, biology, and general scientific QA. It includes real LLM responses enhanced with domain-specific equivalence transformations, with high-quality annotations from both models and human experts.

SCI-Verifier is a reasoning-augmented model designed for scientific verification. It leverages logical reasoning and equivalence judgment to verify LLM answers accurately while providing concise and stable outputs.

Together, SCI-VerifyBench and SCI-Verifier offer a principled framework for systematic evaluation and reliable scientific reasoning with LLMs.

Data Processing

SCI-VerifyBench is a comprehensive benchmark designed to evaluate the scientific verification capabilities of large language models (LLMs). It spans multiple domains: mathematics, physics, chemistry, biology, and general QA.

The field names in the files are explained as follows:

- uid: unique identifier for each question
- question: the question text
- gold_answer: the correct/reference answer
- raw_llm_response: the response generated by an LLM
- llm_response: the final answer extracted from the LLM response according to rules
- answer_type: the format of the answer: "Expression", "Numerical", "Interval", "Equation", etc.
- data_source: the source dataset from which the question was taken
- domain: the domain of the problem: "math", "physics", "chemistry", "biology", or "QA"
- task_type: the category corresponding to the task
- gold_judgment: the verification judgment: true/false
- aug: whether the answer was generated through an equivalence transformation
- llm: the LLM that produced the llm_response
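As a minimal sketch of working with this schema, the snippet below loads a JSONL file of such records and filters by domain. The record values shown are illustrative placeholders, not taken from the dataset, and the helper names are our own:

```python
import json

def load_jsonl(path):
    """Read a SCI-VerifyBench-style JSONL file into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def filter_domain(records, domain):
    """Keep only records from a given domain, e.g. 'math'."""
    return [r for r in records if r["domain"] == domain]

# Hypothetical example record following the schema above
# (all field values are illustrative only).
record = {
    "uid": "math-0001",
    "question": "What is 2 + 2?",
    "gold_answer": "4",
    "raw_llm_response": "The answer is 4.",
    "llm_response": "4",
    "answer_type": "Numerical",
    "data_source": "example",
    "domain": "math",
    "task_type": "arithmetic",
    "gold_judgment": True,
    "aug": False,
    "llm": "example-llm",
}

math_records = filter_domain([record], "math")
print(len(math_records))  # 1
```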

SCI-Verifier

We propose a two-stage post-training approach using SFT and RL to develop a scientific verifier with concise reasoning capabilities, demonstrating strong ability in judging answer equivalence.

Experiments

Eval

Use the following command to evaluate the verifier:

```shell
python src/local_eval.py \
  --model_path \
  --data_root \
  --dataset_name \ # Location of the data: {data_root}/{dataset_name}.jsonl
  --output_dir \ # Location of the output summary: {output_dir}/{dataset_name}/{model_name}
  --prompt_type \ # ["instruct", "cot", "xverify"]
  --batch_size \
  --tensor_parallel_size \
  --temperature \
  --max_tokens
```
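A filled-in invocation might look like the following. All paths, the dataset name, and the parameter values here are placeholders for illustration, not values from this repo:

```shell
# Hypothetical invocation; every path and value below is a placeholder.
python src/local_eval.py \
  --model_path ./checkpoints/sci-verifier \
  --data_root ./data \
  --dataset_name sci_verifybench \
  --output_dir ./results \
  --prompt_type cot \
  --batch_size 32 \
  --tensor_parallel_size 1 \
  --temperature 0.0 \
  --max_tokens 4096
```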

The key results on SCI-VerifyBench are as follows:
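Since each record carries a boolean gold_judgment, a natural way to score a verifier on this benchmark is agreement with those gold labels. A minimal accuracy computation, assuming the verifier's judgments have already been parsed into booleans (the function name and values are our own illustration):

```python
def judgment_accuracy(predictions, golds):
    """Fraction of verifier judgments that match the gold judgments."""
    if not golds:
        raise ValueError("empty gold list")
    correct = sum(p == g for p, g in zip(predictions, golds))
    return correct / len(golds)

# Illustrative values only.
preds = [True, False, True, True]
golds = [True, False, False, True]
print(judgment_accuracy(preds, golds))  # 0.75
```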

Contact

If you are interested in our work, please contact us at:

- Shenghe Zheng: shenghez.zheng@gmail.com

Citation

```bibtex
@article{zheng2025sci,
  title={SCI-Verifier: Scientific Verifier with Thinking},
  author={Zheng, Shenghe and Huang, Chenyu and Yu, Fangchen and Yao, Junchi and Ye, Jingqi and Chen, Tao and Luo, Yun and Ding, Ning and Bai, Lei and Cui, Ganqu and others},
  journal={arXiv preprint arXiv:2509.24285},
  year={2025}
}
```

About

Official GitHub repo for SCI-Verifier: Scientific Verifier with Thinking (ICLR2026).
