This repository contains an unofficial implementation of the paper Training Language Models to Self-Correct via Reinforcement Learning by Aviral Kumar et al.
Large Language Models (LLMs) often produce errors and struggle to self-correct when prompted. The paper introduces SCoRe, a multi-turn online reinforcement learning framework that trains LLMs to progressively self-correct their own outputs without external supervision or larger teacher models. This project aims to reproduce and experiment with SCoRe; the approach improves reasoning and code generation by training LLMs to iteratively refine their own responses.
SCoRe improves self-correction by:
- Training a first-stage model whose first attempts stay close to the base model's outputs while its corrections are optimized.
- Leveraging multi-turn reinforcement learning with reward shaping to encourage meaningful improvements (see the sketch after this list).
- Using self-generated training data to avoid dependency on human feedback.
- Preventing failure modes like behavior collapse and overfitting through regularization.
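For intuition, here is a minimal sketch of the kind of reward shaping mentioned above, assuming a binary correctness signal from an answer checker; the function name and the bonus coefficient `alpha` are illustrative, not taken from the paper or this codebase.

```python
# Illustrative reward shaping for a two-attempt episode (sketch only).
# `first_correct` / `second_correct` are 1.0 if the attempt's final answer
# matches the reference solution, else 0.0; `alpha` is a hypothetical
# coefficient for the progress bonus.

def shaped_reward(first_correct, second_correct, alpha=2.0):
    """Reward for the second attempt: its own correctness plus a bonus
    proportional to the improvement over the first attempt. This rewards
    flipping wrong -> right and penalizes breaking a correct answer,
    discouraging the model from simply repeating its first response."""
    progress = second_correct - first_correct  # in {-1.0, 0.0, 1.0}
    return second_correct + alpha * progress

print(shaped_reward(0.0, 1.0))  # fixed a wrong first attempt -> 3.0
print(shaped_reward(1.0, 0.0))  # broke a correct first attempt -> -2.0
print(shaped_reward(1.0, 1.0))  # stayed correct -> 1.0
```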
Note: This is NOT the official repository; it is a personal implementation for experimentation and learning purposes.
- Python 3.8+
- PyTorch
- Transformers
- Create necessary directories.
```bash
mkdir data base_models saved_models
```

- Download the MATH dataset and store it in the `data` folder.

```bash
cd data
wget https://people.eecs.berkeley.edu/~hendrycks/MATH.tar
tar -xvf MATH.tar
cd ..
```
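After extraction you should have a `MATH/` directory containing `train/` and `test/` splits of per-problem JSON files. Below is a small sanity check, assuming the archive was extracted inside `data/` as above (the directory layout is an assumption based on the public MATH release):

```python
# Sanity-check the extracted MATH dataset: in the public release, each
# problem is a JSON file with "problem", "level", "type", and "solution" keys.
import json
from pathlib import Path

math_root = Path("data/MATH/train")  # assumes the tar was extracted inside data/
sample_path = next(math_root.rglob("*.json"))
with open(sample_path) as f:
    record = json.load(f)

print(sample_path)
print(record["problem"][:200])   # problem statement
print(record["solution"][:200])  # reference solution
```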
- Create the conda environment from `environment.yml`.

```bash
conda env create -f environment.yml
```

- Activate the environment.
```bash
conda activate llm-self-correct
```

- Download the base model from Hugging Face into the `base_models` folder by running `main.py` with the `download` task. Note that a Hugging Face access token may be required for certain models.
```bash
cd code/
python main.py --task download
```
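For reference, the `download` task presumably wraps something like the snippet below; the repo id, save path, and token value are illustrative placeholders rather than this repository's actual configuration.

```python
# Illustrative sketch of downloading a base model with huggingface_hub;
# the repo id and local path are assumptions, not the repo's defaults.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="google/gemma-2-2b-it",         # hypothetical (gated) base model
    local_dir="base_models/gemma-2-2b-it",  # store under base_models/ as above
    token="hf_...",                         # your Hugging Face access token
)
```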
- Once the model is downloaded, run `main.py` with the `train` task (a sketch of the two-attempt rollout this training builds on appears after these steps).

```bash
python main.py --task train
```

Or, use the `run.slurm` file to submit a job on HPRC.
```bash
cd ..
sbatch run.slurm
```
- To run the saved model in evaluation mode, use the following command.

```bash
python main.py --task evaluate
```
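As promised above, here is a minimal sketch of a two-attempt self-correction rollout, assuming a causal LM saved under `base_models/`; the model path and self-correction prompt are illustrative, not the exact ones used by this repository or the paper.

```python
# Minimal two-attempt self-correction rollout (illustrative sketch only;
# the model path and prompts are assumptions, not this repo's actual code).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "base_models/gemma-2-2b-it"  # hypothetical local model
tok = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

def two_attempt_rollout(problem, max_new_tokens=512):
    # Attempt 1: answer the problem directly.
    prompt1 = f"Problem: {problem}\nSolution:"
    ids1 = tok(prompt1, return_tensors="pt").input_ids
    out1 = model.generate(ids1, max_new_tokens=max_new_tokens, do_sample=True)
    attempt1 = tok.decode(out1[0, ids1.shape[1]:], skip_special_tokens=True)

    # Attempt 2: ask the model to review and revise its own answer.
    prompt2 = (
        prompt1 + attempt1
        + "\nThere might be an error in the solution above. "
        + "Please correct it and give your final answer.\nRevised solution:"
    )
    ids2 = tok(prompt2, return_tensors="pt").input_ids
    out2 = model.generate(ids2, max_new_tokens=max_new_tokens, do_sample=True)
    attempt2 = tok.decode(out2[0, ids2.shape[1]:], skip_special_tokens=True)
    return attempt1, attempt2
```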
If you build upon this work, please cite the original paper:

```bibtex
@article{kumar2024training,
  title={Training Language Models to Self-Correct via Reinforcement Learning},
  author={Kumar, Aviral and Zhuang, Vincent and Agarwal, Rishabh and Su, Yi and Co-Reyes, JD and Singh, Avi and Baumli, Kate and Iqbal, Shariq and Bishop, Colton and Roelofs, Rebecca and others},
  journal={arXiv preprint arXiv:2409.12917},
  year={2024}
}
```
This project is licensed under the MIT License.
For questions or collaboration, please contact the authors via the email addresses provided in the paper or open an issue on this repository.