diff --git a/README.md b/README.md
index a29d575..867b29e 100644
--- a/README.md
+++ b/README.md
@@ -6,69 +6,7 @@
 🤗 Models & Datasets | 📝 Technical Report
 -->
-# The Alignment Handbook
+# Reinforcement Learning with Active Queries
-## Installation instructions
+The website for Reinforcement Learning with Active Queries. See [here](https://jkx19.github.io/ActiveQuery/).
-
-To run the code in this project, first create a Python virtual environment, e.g. with Conda:
-
-```shell
-conda create -n handbook python=3.10 && conda activate handbook
-```
-
-Next, install PyTorch `v2.1.0` - the precise version is important for reproducibility! Since this is hardware-dependent, we
-direct you to the [PyTorch Installation Page](https://pytorch.org/get-started/locally/).
-
-You can then install the remaining package dependencies as follows:
-
-```shell
-git clone https://github.com/huggingface/alignment-handbook.git
-cd ./alignment-handbook/
-python -m pip install .
-```
-
-You will also need Flash Attention 2 installed, which can be done by running:
-
-> **Note**
-> If your machine has less than 96GB of RAM and many CPU cores, reduce `MAX_JOBS`, e.g. `MAX_JOBS=4 pip install flash-attn --no-build-isolation`
-
-```shell
-python -m pip install flash-attn --no-build-isolation
-```
-
-Next, log into your Hugging Face account as follows:
-
-```shell
-huggingface-cli login
-```
-
-Finally, install Git LFS so that you can push models to the Hugging Face Hub:
-
-```shell
-sudo apt-get install git-lfs
-```
-
-You can now check out the `scripts` and `recipes` directories for instructions on how to train some models 🪁!
-
-## Project structure
-
-```
-├── LICENSE
-├── Makefile   <- Makefile with commands like `make style`
-├── README.md  <- The top-level README for developers using this project
-├── chapters   <- Educational content to render on hf.co/learn
-├── recipes    <- Recipe configs, accelerate configs, slurm scripts
-├── scripts    <- Scripts to train and evaluate chat models
-├── setup.cfg  <- Installation config (mostly used for configuring code quality & tests)
-├── setup.py   <- Makes project pip installable (pip install -e .) so `alignment` can be imported
-├── src        <- Source code for use in this project
-└── tests      <- Unit tests
-```
-
-## Running
-
-First, check out `recipes/zephyr-7b-beta/dpo/config_lora.yaml` and set the following arguments: `gradient_accumulation_steps`, `loss_type` (choose from "corr", "sigmoid", "hinge"), and `per_device_train_batch_size`. Then edit `--num_processes` in `pipeline.sh`. Make sure that `gradient_accumulation_steps * per_device_train_batch_size * num_processes` equals the true batch size. Then run the following command in shell:
-
-```
-bash pipeline.sh
-```
diff --git a/assets/handbook.png b/assets/handbook.png
deleted file mode 100644
index a1146bf..0000000
Binary files a/assets/handbook.png and /dev/null differ
diff --git a/chapters/en/_toctree.yml b/chapters/en/_toctree.yml
deleted file mode 100644
index e8fc7c0..0000000
--- a/chapters/en/_toctree.yml
+++ /dev/null
@@ -1,4 +0,0 @@
-- title: Unit 0. Welcome to the RLHF Handbook!
-  sections:
-  - local: chapter0/introduction
-    title: What is this about?
\ No newline at end of file
diff --git a/chapters/en/chapter0/introduction.mdx b/chapters/en/chapter0/introduction.mdx
deleted file mode 100644
index 26f500f..0000000
--- a/chapters/en/chapter0/introduction.mdx
+++ /dev/null
@@ -1,3 +0,0 @@
-# Welcome to the RLHF Handbook!
-
-Stay tuned for more details 🤗
\ No newline at end of file
diff --git a/images/algo.png b/images/algo.png
new file mode 100644
index 0000000..4af409e
Binary files /dev/null and b/images/algo.png differ
diff --git a/images/chart.png b/images/chart.png
new file mode 100644
index 0000000..3e37c01
Binary files /dev/null and b/images/chart.png differ
diff --git a/images/table.png b/images/table.png
new file mode 100644
index 0000000..98b862a
Binary files /dev/null and b/images/table.png differ
diff --git a/images/them.png b/images/them.png
new file mode 100644
index 0000000..b106f5a
Binary files /dev/null and b/images/them.png differ
diff --git a/index.html b/index.html
new file mode 100644
index 0000000..71bae19
--- /dev/null
+++ b/index.html
@@ -0,0 +1,234 @@
+Aligning large language models (LLMs) with human preferences plays a key role in building modern generative models and can be achieved by reinforcement learning from human feedback (RLHF). Despite their superior performance, current RLHF approaches often require a large amount of human-labelled preference data, which is expensive to collect. In this paper, inspired by the success of active learning, we address this problem by proposing query-efficient RLHF methods. We first formalize the alignment problem as a contextual dueling bandit problem and design an active-query-based proximal policy optimization (APPO) algorithm with a constant regret bound and a constant query complexity. We then propose ADPO, a practical version of our algorithm based on direct preference optimization (DPO), and apply it to fine-tuning LLMs. Our experiments show that ADPO, while making only about half as many queries for human preference, matches the performance of the state-of-the-art DPO method.
+We propose Active Proximal Policy Optimization (APPO) for learning linear contextual bandits with a global sub-optimality gap. In each round, the algorithm first uses the following MLE estimator to estimate the parameter:
+$$ \lambda\kappa_{\sigma} \mathbf{\theta} + \sum_{\tau \in \mathcal{C}_{t-1}}\Big(o_{\tau} - \mu\big(\langle \mathbf{\theta}, \mathbf{\phi}^1_{\tau}-\mathbf{\phi}^2_{\tau} \rangle\big)\Big) (\mathbf{\phi}^1_{\tau}-\mathbf{\phi}^2_{\tau}) = \mathbf{0}. $$
+With the estimated parameter, APPO then computes the estimated reward of each arm and chooses the best one. APPO does not query for the preference label when the uncertainty is low. Our theoretical analysis establishes the following guarantees on the regret upper bound and the query complexity.
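This estimation step can be sketched numerically. The snippet below is a minimal illustration, not the paper's implementation: it assumes a logistic link μ(z) = 1/(1+e⁻ᶻ), an illustrative value λκ_σ = 1, and the standard sign convention for the regularized log-likelihood stationarity condition; the function names and synthetic data are ours.

```python
import numpy as np

def sigmoid(z):
    """Logistic link mu(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def appo_mle(x_diff, labels, lam_kappa=1.0, iters=25):
    """Regularized MLE for theta, given feature differences
    x_diff[t] = phi1[t] - phi2[t] and preference labels o[t] in {0, 1}.
    Solves the stationarity condition of the regularized logistic
    log-likelihood by Newton's method (the objective is strictly convex,
    so the stationary point is unique)."""
    n, d = x_diff.shape
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(x_diff @ theta)
        grad = lam_kappa * theta - x_diff.T @ (labels - p)
        hess = lam_kappa * np.eye(d) + x_diff.T @ (x_diff * (p * (1 - p))[:, None])
        theta -= np.linalg.solve(hess, grad)
    return theta

# Synthetic check: recover a known preference parameter.
rng = np.random.default_rng(0)
theta_star = np.array([1.0, -1.0, 0.5])
x = rng.normal(size=(2000, 3))
o = (rng.random(2000) < sigmoid(x @ theta_star)).astype(float)
theta_hat = appo_mle(x, o)
```

In the full algorithm this estimate then drives arm selection and the query rule; the sketch only checks parameter recovery on synthetic comparisons.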
+
+We further propose Active Direct Preference Optimization (ADPO) as a label-efficient alternative to Direct Preference Optimization (DPO). For each prompt and its pair of candidate answers, DPO treats the LLM as a reward model and assigns a reward to each answer. The difference between the two rewards therefore indicates the model's confidence in its predicted preference label. In ADPO, the LLM only queries the labels of answer pairs with high uncertainty. For pairs with low uncertainty, the LLM uses its own predicted label as the training target.
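The query rule described above can be sketched with the DPO implicit reward r(y) = β(log π(y|x) − log π_ref(y|x)): the reward margin between the two answers serves as the confidence measure. The function names and the threshold value below are illustrative assumptions, not the paper's implementation.

```python
def dpo_implicit_reward(logp, ref_logp, beta=0.1):
    """DPO's implicit reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp - ref_logp)

def adpo_query_decision(logp_a, logp_b, ref_logp_a, ref_logp_b,
                        beta=0.1, margin_threshold=1.0):
    """Return ('query', None) when the model is uncertain which answer is
    preferred, else ('pseudo', i) where i is the index (0 or 1) of the
    answer the model itself predicts as preferred.
    margin_threshold is an illustrative hyperparameter."""
    margin = (dpo_implicit_reward(logp_a, ref_logp_a, beta)
              - dpo_implicit_reward(logp_b, ref_logp_b, beta))
    if abs(margin) < margin_threshold:
        return "query", None                  # low confidence: ask the labeller
    return "pseudo", 0 if margin > 0 else 1   # high confidence: self-label
```

A pair whose reward margin falls inside the threshold is sent to the human labeller; everything else trains on the model's own prediction, which is how ADPO saves roughly half of the queries.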
+We trained zephyr-7b-sft-full on the 62k UltraFeedback dataset using both DPO and our method, ADPO. We evaluated the trained models on the Open LLM Leaderboard benchmarks. We also tested a variant of our method that does not train on the pseudo-labels (denoted ADPO w/o PL). The key points of our results are:
+
+
+
+ @misc{deng2023rephrase,
+ title={Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves},
+ author={Yihe Deng and Weitong Zhang and Zixiang Chen and Quanquan Gu},
+ year={2023},
+ eprint={2311.04205},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+