RLHF (Reinforcement Learning from Human Feedback) with PPO (Proximal Policy Optimization) is a reinforcement learning method where a model is fine-tuned using human feedback, optimized through PPO to ensure stable and efficient learning.

RLHF with Proximal Policy Optimization (PPO)

This project implements Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO) to fine-tune a large language model (the pre-trained GPT-2 checkpoint lvwerra/gpt2-imdb from Hugging Face) on human preferences. A reward model (siebert/sentiment-roberta-large-english) guides the policy updates, improving the quality of the generated text against human-aligned objectives.

Project Overview

This project uses PPO to optimize the text generation of GPT-2, guided by a reward model that reflects human feedback. The main components, which the sketch after this list wires together, are:

  • Policy Model (GPT-2): The language model that generates text from prompts.
  • Reward Model (siebert/sentiment-roberta-large-english): Evaluates the generated responses, assigning a score based on sentiment or human preferences.
  • PPO Algorithm: Updates the GPT-2 model to maximize the reward signal while keeping updates stable through clipping.
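
The components above can be wired together with the Hugging Face transformers and datasets APIs. The snippet below is an illustrative sketch rather than this repository's code: the generation settings, the prompt-prefix length, and the mapping of the classifier's POSITIVE/NEGATIVE score to a scalar reward are all assumptions.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Policy model: GPT-2 already adapted to IMDB-style text.
policy_name = "lvwerra/gpt2-imdb"
tokenizer = AutoTokenizer.from_pretrained(policy_name)
policy = AutoModelForCausalLM.from_pretrained(policy_name).to(device)

# Reward model: a sentiment classifier whose positive score serves as the reward.
reward_fn = pipeline(
    "sentiment-analysis",
    model="siebert/sentiment-roberta-large-english",
    device=0 if device == "cuda" else -1,
)

# Prompts come from the IMDB dataset; here, a short prefix of one review.
imdb = load_dataset("imdb", split="train")
prompt = " ".join(imdb[0]["text"].split()[:8])

# Generate a continuation with the policy and score the full text.
inputs = tokenizer(prompt, return_tensors="pt").to(device)
output_ids = policy.generate(**inputs, max_new_tokens=32, do_sample=True, top_k=50)
text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

result = reward_fn(text)[0]  # e.g. {"label": "POSITIVE", "score": 0.98}
reward = result["score"] if result["label"] == "POSITIVE" else -result["score"]
print(text, reward)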

How It Works

  1. Language Model (Policy):

    • GPT-2 generates text token by token, and each token prediction is considered an action.
    • The model learns to generate meaningful and coherent sequences by optimizing over multiple training epochs.
  2. Reward Model:

    • The reward model (siebert/sentiment-roberta-large-english) evaluates the text generated by the language model, assigning rewards based on how well the output aligns with positive sentiment or other desired outcomes.
  3. PPO Training:

    • The Proximal Policy Optimization (PPO) algorithm fine-tunes the language model by maximizing the rewards while keeping training stable.
    • The policy model is updated in a way that prevents drastic changes, using advantage estimation, importance sampling, and a clipped objective (see the sketch after this list).
  4. Training Data:

    • The IMDB dataset is used both for training the reward model and as the source of input prompts for the GPT-2 model.
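
To make step 3 concrete, here is a generic sketch of the two pieces named above: advantage estimation and the clipped PPO objective built on importance-sampling ratios. It is illustrative and not the exact code in PPOTrainer.py; tensor shapes, the discount and GAE parameters, and the clip range are assumptions.

import torch

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    # Generalized Advantage Estimation over one generated sequence.
    # rewards, values: 1-D tensors with one entry per generated token.
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    return advantages

def ppo_policy_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Importance-sampling ratio between the updated policy and the policy
    # that originally sampled the text.
    ratio = torch.exp(new_logprobs - old_logprobs)

    # Unclipped and clipped surrogate objectives; taking the element-wise
    # minimum prevents drastic policy changes.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # PPO maximizes the surrogate, so the training loss is its negative mean.
    return -torch.min(unclipped, clipped).mean()

In a full training loop this policy loss is typically combined with a value-function loss, and the update is applied over several epochs per batch of generated text.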

Files and Structure

  • train.py: Contains the training loop that fine-tunes the GPT-2 model using PPO. After training, the model is saved to the saved_models folder.
  • generate.py: Generates text sequences using the trained model based on user input.
  • PPOTrainer.py: Implements the PPO algorithm, including advantage estimation and policy updates.
  • model.py: Defines the policy model (PolicyModel) and the reward model (RewardModel).
  • dataset.py: Loads and processes the IMDB dataset.
  • config.py: Contains the configuration parameters used for training (e.g., learning rate, batch size).
  • main.py: The entry point for running training or generation tasks via command-line arguments; a sketch of such a dispatch follows this list.
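
The command-line interface described for main.py might look roughly like the following argparse-based dispatch. This is a hypothetical sketch: only the --task train invocation is documented below, and the run_training / run_generation helper names are placeholders, not functions from this repository.

import argparse

def main():
    # Hypothetical entry point dispatching on a --task argument.
    parser = argparse.ArgumentParser(description="RLHF with PPO on GPT-2")
    parser.add_argument("--task", choices=["train", "generate"], required=True,
                        help="Fine-tune with PPO or generate text with a saved model.")
    args = parser.parse_args()

    if args.task == "train":
        from train import run_training        # placeholder name
        run_training()
    else:
        from generate import run_generation    # placeholder name
        run_generation()

if __name__ == "__main__":
    main()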

Setup and Installation

To run the training:

python main.py --task train

Requirements

  • Python 3.8+
  • PyTorch
  • Hugging Face transformers library
  • datasets library
  • torchtyping

You can install the dependencies individually with pip, or create a conda environment from the requirements.yml file:

conda env create -f requirements.yml
