Active Direct Preference Optimization (ADPO)

This repository contains the official code for the paper "Reinforcement Learning from Human Feedback with Active Queries".

Authors: Kaixuan Ji*, Jiafan He*, Quanquan Gu

About ADPO

Active Direct Preference Optimization (ADPO) is a query-efficient alternative to Direct Preference Optimization (DPO). More specifically, at each training step, ADPO first estimates the model's uncertainty about each preference pair. It then queries for preference labels only on the pairs with low confidence scores; for pairs with high confidence scores, it uses the model's predicted preference label (a pseudo-label) to update the model. Experiments on Zephyr-β and Zephyr-gemma show that ADPO matches the performance of DPO with only about a quarter of the queries. For more details, please refer to our paper.
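The querying rule described above can be sketched as follows. This is an illustrative sketch, not the repository's actual API: the names `confidence`, `gamma`, and `partition_pairs`, and the toy confidence scores, are all placeholders (see scripts/run_dpo.py for the real implementation).

```python
import torch

def partition_pairs(confidence, gamma):
    """Split preference pairs by confidence score.

    Pairs with confidence below the threshold gamma are sent to the
    annotator for a preference label; the rest keep the model's own
    predicted label (pseudo-label). Names here are illustrative.
    """
    query_mask = confidence < gamma
    return query_mask, ~query_mask

# Toy example: four pairs with made-up confidence scores
confidence = torch.tensor([0.9, 1.5, 1.1, 2.0])
query_mask, pseudo_mask = partition_pairs(confidence, gamma=1.3)
print(query_mask.tolist())  # pairs 0 and 2 go to the annotator
```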

Environment Setup

The following steps set up the environment needed to run our code. First, create a conda environment as follows

conda create -n adpo python=3.10.9
conda activate adpo

Next, install the required packages as follows

python3 -m pip install -e .

Reproducing Results

To reproduce the results in our paper, follow the steps below. To replicate the results on Zephyr-β, run the following command.

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes=8 --main_process_port 30000 scripts/run_dpo.py recipes/training_configs/zephyr-beta.yaml \
--beta=0.1 \
--data_selection=true \
--Gamma=1.3 \
--num_train_epochs=1 \
--output_dir={path_to_your_output_dir}

To replicate the results on Zephyr-gemma, please run the following command.

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes=8 --main_process_port 30000 scripts/run_dpo.py recipes/training_configs/zephyr-gemma.yaml \
--beta=0.1 \
--data_selection=true \
--Gamma=1.3 \
--num_train_epochs=1 \
--output_dir={path_to_your_output_dir}

In both commands, beta is the weight of the KL divergence term in the loss function, and data_selection is a boolean flag that enables active querying. When active querying is enabled, Gamma is the confidence threshold $\gamma$ for active querying. Please refer to the recipe file for Zephyr-β (recipes/training_configs/zephyr-beta.yaml) or the recipe file for Zephyr-gemma (recipes/training_configs/zephyr-gemma.yaml) for more details about the hyperparameters.
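For reference, the role of beta can be sketched through the standard DPO objective that ADPO builds on: beta scales the implicit reward margin and thus the strength of the regularization toward the reference model. This is the textbook DPO loss, not an excerpt of this repository's code; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # beta scales the implicit reward margin, controlling how strongly
    # the policy is regularized toward the reference model.
    margin = ((policy_chosen_logps - ref_chosen_logps)
              - (policy_rejected_logps - ref_rejected_logps))
    return -F.logsigmoid(beta * margin).mean()

# Sanity check: with policy == reference the margin is zero,
# so the loss is -log(1/2) = log 2.
zeros = torch.zeros(4)
loss = dpo_loss(zeros, zeros, zeros, zeros, beta=0.1)
print(round(loss.item(), 4))  # 0.6931
```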

The evaluation is based on the official repositories of the Open LLM Leaderboard (https://github.com/EleutherAI/lm-evaluation-harness), AlpacaEval, and MT-Bench. Please refer to their official repositories for more details about the evaluation procedure.

Citation

If you find this repository helpful, please cite our paper:

@article{ji2024reinforcement,
  title={Reinforcement learning from human feedback with active queries},
  author={Ji, Kaixuan and He, Jiafan and Gu, Quanquan},
  journal={arXiv preprint arXiv:2402.09401},
  year={2024}
}

Acknowledgement

This repo is built upon The Alignment Handbook. We thank the authors for their great work.
