🤗 HF Collection | 📄 Paper
ToolRM is a family of lightweight generative and discriminative reward models (RMs) tailored for agentic tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging dataset of critique tasks that supports reinforcement learning with verifiable feedback. To evaluate tool-use RMs, we also introduce TRBench-BFCL, a benchmark built on the agentic evaluation suite BFCL. Trained on our constructed data, models from the Qwen3-4B/8B series outperform several much larger LLMs in pairwise reward judgments. Beyond its training objective, ToolRM generalizes to broader critique tasks, including Best-of-N sampling and self-correction, and it also effectively supports downstream RL training.
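Concretely, each training instance pairs a preferred and a dispreferred tool-use response for the same context, with the preference assigned by rule-based scoring. Below is a minimal sketch of what one such pairwise record might look like; every field name here is an illustrative assumption, not the released ToolPref-Pairwise-30K schema:

```python
# Illustrative only: field names are assumptions, not the released schema.
preference_example = {
    "messages": [  # tool-use context shared by both candidate responses
        {"role": "user", "content": "Book a table for two at 7pm tonight."},
    ],
    "tools": [  # tool schema available to the model
        {"name": "book_restaurant", "parameters": {"party_size": "integer", "time": "string"}},
    ],
    "chosen": {"name": "book_restaurant", "arguments": {"party_size": 2, "time": "19:00"}},
    "rejected": {"name": "book_restaurant", "arguments": {"party_size": 2}},  # missing required argument
}
```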
- [2026-01-14]: We have updated our paper with additional experimental results.
- [2025-11-10]: Datasets and ToolRM model checkpoints have been released in this Hugging Face collection.
- Download ToolPref-Pairwise-30K to train ToolRM, TRBench-BFCL to evaluate reward models in general tool-use scenarios, and the ToolRM checkpoints to directly facilitate your agentic tool-use research.
- Note that we use the `think` and `no_think` prompt templates, respectively, to create the GenRM-formatted datasets. Please ensure you select the appropriate dataset for both training and evaluation, based on whether the target LLM is a reasoning or non-reasoning model.
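Once downloaded, the datasets can be loaded with the Hugging Face `datasets` library. The repository IDs below are placeholders (use the exact names from the collection), and the choice between the `think` and `no_think` GenRM variants should match your target model:

```python
from datasets import load_dataset

# Placeholder repo IDs -- substitute the exact names from the Hugging Face collection,
# and pick the think / no_think GenRM-formatted variant matching your target model.
train_data = load_dataset("<hf_org>/ToolPref-Pairwise-30K")
bench_data = load_dataset("<hf_org>/TRBench-BFCL")

print(train_data)  # inspect available splits and record fields
print(bench_data)
```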
- Install verl: Clone from the verl repository for generative ToolRM training. Set up `verl` within a dedicated Python virtual environment (e.g., using `conda` or `venv`). Follow the official verl installation guide to ensure all prerequisites are met:
  ```bash
  git clone https://github.com/volcengine/verl
  ```
- Install OpenRLHF: Clone from the OpenRLHF repository for discriminative ToolRM training:
  ```bash
  git clone https://github.com/OpenRLHF/OpenRLHF
  ```
- Activate Environment: Ensure your `verl` virtual environment is active in your current terminal session:
  ```bash
  conda activate <your_venv_name>           # Example if using conda
  # source <your_venv_path>/bin/activate    # Example if using venv
  ```
- Prepare Training Files:
  - Copy the custom reward function script into the `verl` library structure (a hedged sketch of the reward-function interface follows the training steps below):
    ```bash
    cd <your_toolrm_project_root_path>
    cp train/toolrm_reward_function.py <your_verl_project_root_path>/verl/utils/reward_score/
    ```
  - Copy the training configuration script `train_toolrm_gen.sh` to the `verl` examples directory:
    ```bash
    cp scripts/train_toolrm_gen.sh <your_verl_project_root_path>/examples/grpo_trainer/
    ```
- Execute Training: Navigate to the `verl` directory and run the training script:
  ```bash
  cd <your_verl_project_root_path>
  bash examples/grpo_trainer/train_toolrm_gen.sh
  # FSDP model checkpoints are converted to Huggingface-compatible checkpoints after training.
  ```
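For orientation, the custom reward function copied in the Prepare Training Files step plugs into verl's rule-based reward scoring. The sketch below only illustrates the generic `compute_score` convention verl uses for scripts under `verl/utils/reward_score/`; the verdict format and parsing logic are assumptions, not the actual contents of `toolrm_reward_function.py`:

```python
import re


def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Hypothetical rule-based scorer: reward 1.0 when the GenRM's final verdict
    matches the ground-truth preference label, 0.0 otherwise."""
    match = re.search(r"\\boxed\{([AB])\}", solution_str)  # assumed verdict format, e.g. \boxed{A}
    if match is None:
        return 0.0  # unparsable critiques receive no reward
    return 1.0 if match.group(1) == ground_truth else 0.0
```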
- Prepare Training Files: Copy the training configuration script `train_toolrm_disc.sh` to the `openrlhf` example scripts directory:
  ```bash
  cp scripts/train_toolrm_disc.sh <your_openrlhf_project_root_path>/examples/scripts/
  ```
- Execute Training: Navigate to the `openrlhf` directory and run the training script:
  ```bash
  cd <your_openrlhf_project_root_path>
  bash examples/scripts/train_toolrm_disc.sh
  ```
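For context, OpenRLHF's reward-model trainer optimizes a pairwise ranking objective of the Bradley-Terry form, pushing the scalar reward of the chosen response above that of the rejected one. A minimal PyTorch sketch of that objective (illustrative only, not ToolRM's exact training code):

```python
import torch
import torch.nn.functional as F


def pairwise_ranking_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Toy example: the loss shrinks as chosen rewards pull ahead of rejected ones.
print(pairwise_ranking_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.1, 0.4])).item())
```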
- Prepare Evaluation Script: Ensure the evaluation script `scripts/eval_trbench_*.sh` is correctly configured with the paths to the checkpoints of your trained model or any baseline models you wish to evaluate.
- Run Evaluation: Execute the script to evaluate locally deployed models (with the `vllm` inference backend by default):
  - To evaluate generative reward models:
    ```bash
    cd <your_toolrm_project_root_path>
    bash scripts/eval_trbench_genrm.sh
    ```
  - To evaluate discriminative reward models:
    ```bash
    cd <your_toolrm_project_root_path>
    bash scripts/eval_trbench_discrm.sh
    ```
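TRBench-BFCL evaluates reward models on pairwise judgments, so a natural headline metric is the fraction of pairs where the model prefers the better response. A generic sketch of that pairwise-accuracy computation (not the repository's evaluation code):

```python
def pairwise_accuracy(preferred_scores, dispreferred_scores):
    """Fraction of pairs where the reward model scores the preferred response higher."""
    assert len(preferred_scores) == len(dispreferred_scores) and preferred_scores
    correct = sum(p > d for p, d in zip(preferred_scores, dispreferred_scores))
    return correct / len(preferred_scores)


# Toy example: 2 of 3 pairs ranked correctly -> ~0.667
print(pairwise_accuracy([0.9, 0.2, 0.7], [0.4, 0.5, 0.1]))
```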
Evaluation results of several proprietary and open-source LLMs on TRBench-BFCL are reported in the paper.
ToolRM is a research project developed by Alibaba Cloud and licensed under the CC BY-NC-SA 4.0 License.
Thanks to the APIGen, APIGen-MT, BUTTON, ComplexFuncBench, Glaive-Function-Calling, Hermes-Function-Calling, ToolAlpaca, and BFCL projects for open-sourcing tool-call trajectory data.
```bibtex
@misc{li2026toolrmagentictoolusereward,
      title={ToolRM: Towards Agentic Tool-Use Reward Modeling},
      author={Renhao Li and Jianhong Tu and Yang Su and Yantao Liu and Fei Huang and Hamid Alinejad-Rokny and Derek F. Wong and Junyang Lin and Min Yang},
      year={2026},
      eprint={2510.26167},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.26167},
}
```

