BizFinBench.v2: A Unified Dual-Mode Bilingual Benchmark for Expert-Level Financial Capability Alignment
Xin Guo1,2,*, Rongjunchen Zhang1,*,♠, Guilong Lu1, Xuntao Guo1, Jia Shuai1, Zhi Yang2, Liwen Zhang2,♠
*Co-first authors, ♠Corresponding authors: zhangrongjunchen@myhexin.com, zhang.liwen@shufe.edu.cn
📖Paper |🏠Homepage|🤗Huggingface
BizFinBench.v2 is the second release of BizFinBench. It is built entirely on real-world user queries from the Chinese and U.S. equity markets, and it aims to bridge the gap between academic evaluation and actual financial operations.
- Authentic & Real-Time: 100% derived from real financial platform queries, integrating online assessment capabilities.
- Expert-Level Difficulty: A challenging dataset of 28,860 Q&A pairs requiring professional financial reasoning.
- No Judge Model: Utilizes rule-based metrics instead of dynamic judge models to ensure 100% reproducibility, high efficiency, and reliable scoring.
- High Difficulty: Even ChatGPT-5 achieves only 61.5% accuracy on main tasks, highlighting a significant gap vs. human experts.
- Online Prowess: DeepSeek-R1 outperforms all other commercial LLMs in dynamic online tasks, achieving a total return of 13.46% with a maximum drawdown of -8%.
- 🚀 [28/01/2026] BizFinBench.v2 is ready for one-click evaluation, and we have also integrated it into GAGE for faster evaluation.
- 🚀 [08/01/2026] BizFinBench.v2 is out: 28,860 real-world financial questions so tough that ChatGPT-5 only scores 61.5/100.
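The feature list above states that scoring relies on rule-based metrics rather than a judge model. As a rough illustration of why that makes results reproducible, here is a minimal sketch of one such rule, exact-match accuracy after light normalization. The function names and the normalization steps are assumptions made for this example; the benchmark's actual metrics may differ per task.

```python
import re


def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse whitespace so formatting noise is ignored."""
    return re.sub(r"\s+", " ", answer.strip().lower())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after normalization.

    Hypothetical helper for illustration; not taken from the BizFinBench.v2 code.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Toy data: three of the four answers match once case and spacing are ignored.
preds = ["A", " b ", "C", "d"]
golds = ["a", "B", "D", "D"]
score = exact_match_accuracy(preds, golds)
print(f"accuracy = {score:.2%}")
```

Because the rule is deterministic, scoring the same predictions twice always yields the same number, which is the property a sampled judge model cannot guarantee.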
BizFinBench.v2 contains multiple subtasks, each focusing on a different financial understanding and reasoning ability, as follows:
| Scenarios | Tasks | Avg. Input Tokens | # Questions |
|---|---|---|---|
| Business Information Provenance | Anomaly Information Tracing | 8,679 | 3,963 |
| Business Information Provenance | Financial Multi-turn Perception | 10,361 | 4,497 |
| Business Information Provenance | Financial Data Description | 3,577 | 3,803 |
| Financial Logic Reasoning | Financial Quantitative Computation | 1,984 | 2,000 |
| Financial Logic Reasoning | Event Logic Reasoning | 437 | 3,944 |
| Financial Logic Reasoning | Counterfactual Inference | 2,267 | 604 |
| Stakeholder Feature Perception | User Sentiment Analysis | 3,326 | 4,000 |
| Stakeholder Feature Perception | Financial Report Analysis | 19,681 | 2,000 |
| Real-time Market Discernment | Stock Price Prediction | 5,510 | 4,049 |
| Real-time Market Discernment | Portfolio Asset Allocation | — | — |
| Total | — | — | 28,860 |
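As a quick sanity check on the table, the sketch below tallies the question counts per scenario and confirms that they add up to the stated total of 28,860. The grouping of tasks under scenarios follows the table as read here; Portfolio Asset Allocation is left out because the table gives no question count for it.

```python
# Question counts copied from the table, grouped by scenario.
question_counts = {
    "Business Information Provenance": {
        "Anomaly Information Tracing": 3963,
        "Financial Multi-turn Perception": 4497,
        "Financial Data Description": 3803,
    },
    "Financial Logic Reasoning": {
        "Financial Quantitative Computation": 2000,
        "Event Logic Reasoning": 3944,
        "Counterfactual Inference": 604,
    },
    "Stakeholder Feature Perception": {
        "User Sentiment Analysis": 4000,
        "Financial Report Analysis": 2000,
    },
    "Real-time Market Discernment": {
        "Stock Price Prediction": 4049,
    },
}

# Sum within each scenario, then across scenarios.
scenario_totals = {name: sum(tasks.values()) for name, tasks in question_counts.items()}
grand_total = sum(scenario_totals.values())

for name, total in scenario_totals.items():
    print(f"{name}: {total:,}")
print(f"Total: {grand_total:,}")
```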
Online results can be found HERE.
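The online tasks are reported in terms of total return and maximum drawdown. For readers unfamiliar with these quantities, the sketch below computes both from a series of portfolio values using their standard definitions. The function names and the sample equity curve are made up for illustration, and the benchmark's own bookkeeping (fees, rebalancing frequency, and so on) is not specified here.

```python
def total_return(equity):
    """Final value relative to the initial value, minus one."""
    if not equity or equity[0] <= 0:
        raise ValueError("equity must be non-empty and start with a positive value")
    return equity[-1] / equity[0] - 1.0


def max_drawdown(equity):
    """Largest peak-to-trough decline, returned as a non-positive fraction."""
    if not equity or equity[0] <= 0:
        raise ValueError("equity must be non-empty and start with a positive value")
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)                 # highest value seen so far
        worst = min(worst, value / peak - 1.0)  # decline from that peak
    return worst


# Illustrative portfolio values only; not benchmark data.
curve = [100.0, 110.0, 99.0, 120.0, 108.0]
ret = total_return(curve)
mdd = max_drawdown(curve)
print(f"total return = {ret:.2%}, max drawdown = {mdd:.2%}")
```

Reporting drawdown as a negative fraction matches the sign convention used in the results above, where the drawdown is quoted as -8%.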
Install the dependencies and download the dataset:
pip install -r requirements.txt
huggingface-cli download --repo-type dataset HiThink-Research/BizFinBench.v2 --local-dir ./datasets --local-dir-use-symlinks False

Evaluate a local model (--config is the evaluation config, --model_path is your model path):
python run_pipeline.py \
    --config config/offical/BizFinBench_v2_cn.yaml \
    --model_path models/chat/Qwen3-0.6B

Evaluate a model through an API:
export API_NAME=chatgpt # The API name; currently supports chatgpt
export API_KEY=xxx # Your API key
export MODEL_NAME=gpt-4.1
# Pass in the config file path to start evaluation
python run_pipeline.py --config config/offical/BizFinBench_v2_cn.yaml

Note: You can adjust the API's queries-per-second limit by modifying the semaphore_limit setting in envs/constants.py. For example: GPTClient(api_name=api_name, api_key=api_key, model_name=model_name, base_url='https://api.openai.com/v1/chat/completions', timeout=600, semaphore_limit=5)
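To make the effect of a setting like semaphore_limit concrete, here is a minimal sketch of semaphore-based throttling with asyncio. It is not the repository's GPTClient implementation, whose internals are not shown here; fake_api_call and run_all are hypothetical names, and the network request is replaced by a short sleep. Note that a semaphore caps the number of requests in flight at once, which bounds throughput indirectly rather than setting an exact queries-per-second rate.

```python
import asyncio


async def fake_api_call(i, semaphore, state):
    """Stand-in for one API request; at most `limit` of these run concurrently."""
    async with semaphore:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # placeholder for the real network round trip
        state["active"] -= 1
        return i * 2


async def run_all(n_requests, semaphore_limit):
    """Launch all requests at once and let the semaphore throttle them."""
    semaphore = asyncio.Semaphore(semaphore_limit)
    state = {"active": 0, "peak": 0}
    tasks = (fake_api_call(i, semaphore, state) for i in range(n_requests))
    results = await asyncio.gather(*tasks)  # results keep submission order
    return results, state["peak"]


results, peak = asyncio.run(run_all(n_requests=20, semaphore_limit=5))
print(f"completed {len(results)} requests, peak concurrency = {peak}")
```

Raising the limit speeds up evaluation until the provider's own rate limit is hit, so a conservative value is a reasonable starting point.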
@article{guo2026bizfinbench,
title={BizFinBench.v2: A Unified Dual-Mode Bilingual Benchmark for Expert-Level Financial Capability Alignment},
author={Guo, Xin and Zhang, Rongjunchen and Lu, Guilong and Guo, Xuntao and Jia, Shuai and Yang, Zhi and Zhang, Liwen},
journal={arXiv preprint arXiv:2601.06401},
year={2026}
}
Usage and License Notices: The data and code are intended and licensed for research use only.
License: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). Use should also abide by the policy of OpenAI: https://openai.com/policies/terms-of-use
- Special thanks to Ning Zhang, Siqi Wei, Kai Xiong, Kun Chen and colleagues at HiThink Research's data team for their support in building BizFinBench.v2.




