LLM Reasoning and Self-Correction

A reinforcement learning-based inference-time reflection and self-correction framework that rectifies errors without external agents or knowledge distillation.

This repository contains an unofficial implementation of the paper Training Language Models to Self-Correct via Reinforcement Learning by Aviral Kumar et al.

Overview

Large Language Models (LLMs) often produce errors and struggle to self-correct when prompted. The paper introduces SCoRe, a multi-turn online reinforcement learning framework that trains an LLM to progressively correct its own outputs without external supervision or a larger teacher model.

This project aims to reproduce and experiment with SCoRe, training the model entirely on self-generated data so that it learns to iteratively refine its own responses, with the goal of improving mathematical reasoning and code generation.
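
The snippet below is a minimal sketch of the two-turn self-correction rollout that SCoRe trains for, using the Transformers library. The model name, prompt wording, and generation settings are illustrative assumptions and do not necessarily match what this repository uses.

    # A minimal sketch of the two-turn rollout, assuming a Hugging Face instruction-tuned
    # model; the model name and prompts below are illustrative, not the exact ones used
    # in this repository.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-1.5B-Instruct"  # assumption: any instruction-tuned model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    def generate(prompt, max_new_tokens=512):
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        # Decode only the newly generated tokens, not the prompt.
        return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    question = "What is the remainder when 2^10 is divided by 7?"

    # Turn 1: the model answers the question directly.
    first_attempt = generate(f"Solve the problem step by step.\n\nProblem: {question}\nSolution:")

    # Turn 2: the same model reviews and corrects its own first attempt,
    # with no external feedback about whether that attempt was right.
    second_attempt = generate(
        f"Problem: {question}\n\nPrevious attempt:\n{first_attempt}\n\n"
        "There may be an error in the attempt above. Re-examine it and give a corrected final answer."
    )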

SCoRe improves self-correction by:

  • In a first training stage, optimizing the correction (second) attempt while keeping the first-attempt responses close to the base model's outputs.
  • Leveraging multi-turn reinforcement learning with a reward-shaping bonus that pays off only for meaningful improvements between attempts (see the sketch after this list).
  • Using self-generated training data, avoiding any dependency on human feedback or a stronger teacher model.
  • Preventing failure modes such as behavior collapse and overfitting through regularization.
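
As a concrete illustration of the reward shaping mentioned above, the following sketch computes the shaped reward for the second attempt, assuming a binary correctness reward per attempt (e.g., exact answer matching on MATH); the bonus weight alpha is a hyperparameter, not a value taken from this repository.

    # A minimal sketch of the shaped reward for the second attempt, assuming a binary
    # correctness reward per attempt. The bonus weight `alpha` is a hyperparameter,
    # not a value taken from this repository.

    def shaped_reward(r_first: float, r_second: float, alpha: float = 1.0) -> float:
        # The base term rewards a correct final answer; the bonus term pays extra for
        # turning a wrong first attempt into a correct one and penalizes breaking an
        # already-correct answer, discouraging the model from simply repeating itself.
        return r_second + alpha * (r_second - r_first)

    shaped_reward(0.0, 1.0)   # 2.0  -> genuine self-correction is rewarded most
    shaped_reward(1.0, 1.0)   # 1.0  -> staying correct is still rewarded
    shaped_reward(1.0, 0.0)   # -1.0 -> un-correcting a correct answer is penalized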

Note: This is NOT the official repository; it is a personal implementation for experimentation and learning purposes.

Prerequisites

  • Python 3.8+
  • PyTorch
  • Hugging Face Transformers
  • Conda (for creating the environment from environment.yml)

Steps to execute the code:

  1. Create the necessary directories.

     mkdir data base_models saved_models

  2. Download the MATH dataset and extract it into the data folder.

     cd data
     wget https://people.eecs.berkeley.edu/~hendrycks/MATH.tar
     tar -xvf MATH.tar
     cd ..

  3. Create the conda environment from environment.yml.

     conda env create -f environment.yml

  4. Activate the environment.

     conda activate llm-self-correct

  5. Download the base model from Hugging Face into base_models by running main.py with the download task. Note that a Hugging Face access token may be required for gated models.

     cd code/
     python main.py --task download

  6. Once the model has been downloaded, run main.py with the train task.

     python main.py --task train

     Alternatively, submit the training job on HPRC using the provided run.slurm file:

     cd ..
     sbatch run.slurm

  7. To run the saved model in evaluation mode, use the following command.

     python main.py --task evaluate
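
For reference, the sketch below shows one way a binary correctness reward can be computed for MATH-style solutions, where the final answer is wrapped in \boxed{...}; the repository's actual evaluation logic may normalize answers differently.

    # A minimal sketch of a binary correctness check for MATH-style answers. This
    # simple regex does not handle nested braces such as \boxed{\frac{1}{2}}.
    import re
    from typing import Optional

    def extract_boxed(solution: str) -> Optional[str]:
        # Return the contents of the last \boxed{...} in a solution string, if any.
        matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
        return matches[-1].strip() if matches else None

    def correctness_reward(model_solution: str, reference_solution: str) -> float:
        # Binary reward: 1.0 if the predicted boxed answer matches the reference, else 0.0.
        pred = extract_boxed(model_solution)
        gold = extract_boxed(reference_solution)
        return 1.0 if pred is not None and pred == gold else 0.0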

Citation

If you build upon this work, please cite the original paper:

@article{kumar2024training,
  title={Training Language Models to Self-Correct via Reinforcement Learning},
  author={Kumar, Aviral and Zhuang, Vincent and Agarwal, Rishabh and Su, Yi and Co-Reyes, JD and Singh, Avi and Baumli, Kate and Iqbal, Shariq and Bishop, Colton and Roelofs, Rebecca and others},
  journal={arXiv preprint arXiv:2409.12917},
  year={2024}
}


License

This project is licensed under the MIT License.


Contact

For questions or collaboration, please contact the authors via the email addresses provided in the paper or open an issue on this repository.

