🤗 HF Collection | 📄 Paper
ToolRM is a family of lightweight generative and discriminative reward models (RMs) tailored for agentic tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging dataset of critique tasks that supports reinforcement learning with verifiable feedback. To evaluate tool-use RMs, we also introduce TRBench-BFCL, a benchmark built on the agentic evaluation suite BFCL. Trained on our constructed data, models from the Qwen3-4B/8B series outperform several much larger LLMs in pairwise reward judgments. Beyond its training objective, ToolRM generalizes to broader critique tasks, including Best-of-N sampling and self-correction, and it also effectively supports downstream RL training.
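Concretely, each training instance pairs a preferred and a dispreferred tool-use response for the same context, with the preference assigned by rule-based scoring. Below is a minimal sketch of what one such pairwise record might look like; every field name here is an illustrative assumption, not the released ToolPref-Pairwise-30K schema:

```python
# Illustrative only: field names are assumptions, not the released schema.
preference_example = {
    "messages": [  # tool-use context shared by both candidate responses
        {"role": "user", "content": "Book a table for two at 7pm tonight."},
    ],
    "tools": [  # tool schema available to the model
        {"name": "book_restaurant", "parameters": {"party_size": "integer", "time": "string"}},
    ],
    "chosen": {"name": "book_restaurant", "arguments": {"party_size": 2, "time": "19:00"}},
    "rejected": {"name": "book_restaurant", "arguments": {"party_size": 2}},  # missing required argument
}
```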
- [2026-01-14]: We have updated our paper with additional experimental results.
- [2025-11-10]: Datasets and ToolRM model checkpoints have been released in this Hugging Face collection.
- Download ToolPref-Pairwise-30K to train ToolRM, TRBench-BFCL to evaluate reward models in general tool-use scenarios, and the ToolRM checkpoints to directly facilitate your agentic tool-use research.
- Note that we use the `think` and `no_think` prompt templates, respectively, to create the GenRM-formatted datasets. Please ensure you select the appropriate dataset for both training and evaluation, based on whether the target LLM is a reasoning or non-reasoning model.
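Once downloaded, the datasets can be loaded with the Hugging Face `datasets` library. The repository IDs below are placeholders (use the exact names from the collection), and the choice between the `think` and `no_think` GenRM variants should match your target model:

```python
from datasets import load_dataset

# Placeholder repo IDs -- substitute the exact names from the Hugging Face collection,
# and pick the think / no_think GenRM-formatted variant matching your target model.
train_data = load_dataset("<hf_org>/ToolPref-Pairwise-30K")
bench_data = load_dataset("<hf_org>/TRBench-BFCL")

print(train_data)  # inspect available splits and record fields
print(bench_data)
```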
- Install verl: Clone from the verl repository for generative ToolRM training. Set up `verl` within a dedicated Python virtual environment (e.g., using `conda` or `venv`). Follow the official verl installation guide to ensure all prerequisites are met:
  ```bash
  git clone https://github.com/volcengine/verl
  ```
- Install OpenRLHF: Clone from the OpenRLHF repository for discriminative ToolRM training:
  ```bash
  git clone https://github.com/OpenRLHF/OpenRLHF
  ```
- Activate Environment: Ensure your `verl` virtual environment is active in your current terminal session:
  ```bash
  conda activate <your_venv_name>           # Example if using conda
  # source <your_venv_path>/bin/activate    # Example if using venv
  ```
- Prepare Training Files:
  - Copy the custom reward function script into the `verl` library structure (a hedged sketch of the reward-function interface follows the training steps below):
    ```bash
    cd <your_toolrm_project_root_path>
    cp train/toolrm_reward_function.py <your_verl_project_root_path>/verl/utils/reward_score/
    ```
  - Copy the training configuration script `train_toolrm_gen.sh` to the `verl` examples directory:
    ```bash
    cp scripts/train_toolrm_gen.sh <your_verl_project_root_path>/examples/grpo_trainer/
    ```
- Execute Training: Navigate to the `verl` directory and run the training script:
  ```bash
  cd <your_verl_project_root_path>
  bash examples/grpo_trainer/train_toolrm_gen.sh
  # FSDP model checkpoints are converted to Huggingface-compatible checkpoints after training.
  ```
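For orientation, the custom reward function copied in the Prepare Training Files step plugs into verl's rule-based reward scoring. The sketch below only illustrates the generic `compute_score` convention verl uses for scripts under `verl/utils/reward_score/`; the verdict format and parsing logic are assumptions, not the actual contents of `toolrm_reward_function.py`:

```python
import re


def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Hypothetical rule-based scorer: reward 1.0 when the GenRM's final verdict
    matches the ground-truth preference label, 0.0 otherwise."""
    match = re.search(r"\\boxed\{([AB])\}", solution_str)  # assumed verdict format, e.g. \boxed{A}
    if match is None:
        return 0.0  # unparsable critiques receive no reward
    return 1.0 if match.group(1) == ground_truth else 0.0
```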
- Prepare Training Files: Copy the training configuration script `train_toolrm_disc.sh` to the `openrlhf` example scripts directory:
  ```bash
  cp scripts/train_toolrm_disc.sh <your_openrlhf_project_root_path>/examples/scripts/
  ```
- Execute Training: Navigate to the `openrlhf` directory and run the training script:
  ```bash
  cd <your_openrlhf_project_root_path>
  bash examples/scripts/train_toolrm_disc.sh
  ```
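For context, OpenRLHF's reward-model trainer optimizes a pairwise ranking objective of the Bradley-Terry form, pushing the scalar reward of the chosen response above that of the rejected one. A minimal PyTorch sketch of that objective (illustrative only, not ToolRM's exact training code):

```python
import torch
import torch.nn.functional as F


def pairwise_ranking_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Toy example: the loss shrinks as chosen rewards pull ahead of rejected ones.
print(pairwise_ranking_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.1, 0.4])).item())
```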
- Prepare Evaluation Script: Ensure the evaluation script `scripts/eval_trbench_*.sh` is correctly configured with the paths to the checkpoints of your trained model or any baseline models you wish to evaluate.
- Run Evaluation: Execute the script to evaluate locally deployed models (with the `vllm` inference backend by default):
  - To evaluate generative reward models:
    ```bash
    cd <your_toolrm_project_root_path>
    bash scripts/eval_trbench_genrm.sh
    ```
  - To evaluate discriminative reward models:
    ```bash
    cd <your_toolrm_project_root_path>
    bash scripts/eval_trbench_discrm.sh
    ```
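TRBench-BFCL evaluates reward models on pairwise judgments, so a natural headline metric is the fraction of pairs where the model prefers the better response. A generic sketch of that pairwise-accuracy computation (not the repository's evaluation code):

```python
def pairwise_accuracy(preferred_scores, dispreferred_scores):
    """Fraction of pairs where the reward model scores the preferred response higher."""
    assert len(preferred_scores) == len(dispreferred_scores) and preferred_scores
    correct = sum(p > d for p, d in zip(preferred_scores, dispreferred_scores))
    return correct / len(preferred_scores)


# Toy example: 2 of 3 pairs ranked correctly -> ~0.667
print(pairwise_accuracy([0.9, 0.2, 0.7], [0.4, 0.5, 0.1]))
```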
Evaluation results of several proprietary and open-source LLMs on TRBench-BFCL are reported in the paper.
ToolRM is a research project developed by Alibaba Cloud and licensed under the CC BY-NC-SA 4.0 License.
Thanks to the APIGen, APIGen-MT, BUTTON, ComplexFuncBench, Glaive-Function-Calling, Hermes-Function-Calling, ToolAlpaca, and BFCL projects for open-sourcing tool-call trajectory data.
```bibtex
@misc{li2026toolrmagentictoolusereward,
      title={ToolRM: Towards Agentic Tool-Use Reward Modeling},
      author={Renhao Li and Jianhong Tu and Yang Su and Yantao Liu and Fei Huang and Hamid Alinejad-Rokny and Derek F. Wong and Junyang Lin and Min Yang},
      year={2026},
      eprint={2510.26167},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.26167},
}
```

