This repository contains the official code for the paper "Reinforcement Learning from Human Feedback with Active Queries".
Authors: Kaixuan Ji*, Jiafan He*, Quanquan Gu
Active Direct Preference Optimization (ADPO) is a query-efficient alternative to Direct Preference Optimization (DPO). More specifically, at each training step, ADPO first estimates the model's uncertainty about each preference pair. It then queries for human preference labels only on the pairs with low confidence scores; for pairs with high confidence scores, ADPO uses the model's predicted preference label (a pseudo-label) to update itself. Experiments on Zephyr-β and Zephyr-gemma show that ADPO matches the performance of DPO with only about one quarter of the queries. For more details, please refer to our paper.
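The selection step above can be sketched as follows. This is a simplified illustration, not the exact criterion from the paper: the function name `select_pairs`, the sigmoid-based confidence, and the `threshold` cutoff are all hypothetical stand-ins for the paper's uncertainty estimate and Gamma threshold.

```python
import math

def select_pairs(margins, threshold=0.9):
    """Split preference pairs into a human-query set and a pseudo-label set.

    `margins[i]` is the model's implicit reward gap for pair i (first minus
    second response); `threshold` is a hypothetical confidence cutoff.
    """
    to_query, pseudo_labels = [], {}
    for i, m in enumerate(margins):
        p = 1.0 / (1.0 + math.exp(-m))   # predicted P(first response preferred)
        confidence = max(p, 1.0 - p)
        if confidence < threshold:
            to_query.append(i)            # uncertain pair: ask for a human label
        else:
            pseudo_labels[i] = p >= 0.5   # confident pair: trust the model's prediction
    return to_query, pseudo_labels
```

Only the pairs in `to_query` cost a human annotation; the rest are trained on with the model's own predictions.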
The following steps set up the environment needed to run our code. First, create a conda environment:

```shell
conda create -n adpo python=3.10.9
conda activate adpo
```

Next, install the required packages:

```shell
python3 -m pip install -e .
```
To reproduce the results in our paper, please follow the steps below. To replicate the results on Zephyr-β, run the following command:

```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes=8 --main_process_port 30000 scripts/run_dpo.py recipes/training_configs/zephyr-beta.yaml \
    --beta=0.1 \
    --data_selection=true \
    --Gamma=1.3 \
    --num_train_epochs=1 \
    --output_dir={path_to_your_output_dir}
```
To replicate the results on Zephyr-gemma, run the following command:

```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes=8 --main_process_port 30000 scripts/run_dpo.py recipes/training_configs/zephyr-gemma.yaml \
    --beta=0.1 \
    --data_selection=true \
    --Gamma=1.3 \
    --num_train_epochs=1 \
    --output_dir={path_to_your_output_dir}
```
In both commands, `beta` is the weight of the KL-divergence term in the loss function, and `data_selection` is a boolean flag that enables active querying. When active querying is enabled, `Gamma` is the confidence threshold used to decide which preference pairs to query.
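For reference, a minimal sketch of the standard DPO loss shows where `beta` enters as the KL-penalty weight. The function name and log-probability interface here are illustrative assumptions, not this repository's actual API:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss; `beta` scales the implicit KL penalty.

    Inputs are log-probabilities of the chosen/rejected responses under
    the policy model (pi_*) and the frozen reference model (ref_*).
    """
    # beta-scaled implicit reward margin between chosen and rejected responses
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)), written in a numerically direct form
    return math.log1p(math.exp(-margin))
```

A larger `beta` penalizes deviation from the reference model more strongly; when policy and reference agree exactly, the loss reduces to log 2.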
The evaluation is based on the official repositories of the [Open LLM Leaderboard](https://github.com/EleutherAI/lm-evaluation-harness), AlpacaEval, and MT-Bench. We refer to those repositories for more details about the evaluation procedure.
If you find this repository helpful, please kindly cite our paper:
```
@article{ji2024reinforcement,
  title={Reinforcement learning from human feedback with active queries},
  author={Ji, Kaixuan and He, Jiafan and Gu, Quanquan},
  journal={arXiv preprint arXiv:2402.09401},
  year={2024}
}
```

This repo is built upon The Alignment Handbook. We thank the authors for their great work.
