This repository contains the official code for the paper "Reinforcement Learning from Human Feedback with Active Queries".
Authors: Kaixuan Ji*, Jiafan He*, Quanquan Gu
Active Direct Preference Optimization (ADPO) is a query-efficient alternative to Direct Preference Optimization (DPO). More specifically, at each training step, ADPO first estimates the model's uncertainty about each preference pair. It then queries for human preference labels only on the pairs with low confidence scores; for pairs with high confidence scores, ADPO uses the model's predicted preference label (a pseudo-label) to update itself. Experiments on Zephyr-β and Zephyr-gemma show that ADPO matches the performance of DPO with only about one quarter of the queries. For more details, please refer to our paper.
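The selection step above can be sketched as follows. This is a simplified illustration, not the exact criterion from the paper: the function name `select_pairs`, the sigmoid-based confidence, and the `threshold` cutoff are all hypothetical stand-ins for the paper's uncertainty estimate and Gamma threshold.

```python
import math

def select_pairs(margins, threshold=0.9):
    """Split preference pairs into a human-query set and a pseudo-label set.

    `margins[i]` is the model's implicit reward gap for pair i (first minus
    second response); `threshold` is a hypothetical confidence cutoff.
    """
    to_query, pseudo_labels = [], {}
    for i, m in enumerate(margins):
        p = 1.0 / (1.0 + math.exp(-m))   # predicted P(first response preferred)
        confidence = max(p, 1.0 - p)
        if confidence < threshold:
            to_query.append(i)            # uncertain pair: ask for a human label
        else:
            pseudo_labels[i] = p >= 0.5   # confident pair: trust the model's prediction
    return to_query, pseudo_labels
```

Only the pairs in `to_query` cost a human annotation; the rest are trained on with the model's own predictions.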
The following steps set up the environment needed to run our code. First, create a conda environment:

```shell
conda create -n adpo python=3.10.9
conda activate adpo
```

Next, install the required packages:

```shell
python3 -m pip install -e .
```
To reproduce the results in our paper, please follow the steps below. To replicate the results on Zephyr-β, run the following command:

```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes=8 --main_process_port 30000 scripts/run_dpo.py recipes/training_configs/zephyr-beta.yaml \
    --beta=0.1 \
    --data_selection=true \
    --Gamma=1.3 \
    --num_train_epochs=1 \
    --output_dir={path_to_your_output_dir}
```
To replicate the results on Zephyr-gemma, run the following command:

```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes=8 --main_process_port 30000 scripts/run_dpo.py recipes/training_configs/zephyr-gemma.yaml \
    --beta=0.1 \
    --data_selection=true \
    --Gamma=1.3 \
    --num_train_epochs=1 \
    --output_dir={path_to_your_output_dir}
```
In both commands, `beta` is the weight of the KL-divergence term in the loss function, and `data_selection` is a boolean flag that enables active querying. When active querying is enabled, `Gamma` is the confidence threshold used to decide which preference pairs to query.
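For reference, a minimal sketch of the standard DPO loss shows where `beta` enters as the KL-penalty weight. The function name and log-probability interface here are illustrative assumptions, not this repository's actual API:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss; `beta` scales the implicit KL penalty.

    Inputs are log-probabilities of the chosen/rejected responses under
    the policy model (pi_*) and the frozen reference model (ref_*).
    """
    # beta-scaled implicit reward margin between chosen and rejected responses
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)), written in a numerically direct form
    return math.log1p(math.exp(-margin))
```

A larger `beta` penalizes deviation from the reference model more strongly; when policy and reference agree exactly, the loss reduces to log 2.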
The evaluation is based on the official repositories of the [Open LLM Leaderboard](https://github.com/EleutherAI/lm-evaluation-harness), AlpacaEval, and MT-Bench. We refer to those repositories for more details about the evaluation procedure.
If you find this repository helpful, please kindly cite our paper:
```
@article{ji2024reinforcement,
  title={Reinforcement learning from human feedback with active queries},
  author={Ji, Kaixuan and He, Jiafan and Gu, Quanquan},
  journal={arXiv preprint arXiv:2402.09401},
  year={2024}
}
```

This repo is built upon The Alignment Handbook. We thank the authors for their great work.
