This repository contains an unofficial implementation of the paper Training Language Models to Self-Correct via Reinforcement Learning by Aviral Kumar et al.
Large Language Models (LLMs) often produce errors and struggle to self-correct when prompted. The paper introduces SCoRe, a multi-turn online reinforcement learning framework that trains LLMs to progressively self-correct their own outputs without external supervision or larger teacher models. This project aims to reproduce and experiment with SCoRe; the approach improves reasoning and code generation by training LLMs to iteratively refine their own responses.
SCoRe improves self-correction by:
- Training a first-stage model whose first attempts stay close to the base model's outputs while its corrections are optimized.
- Leveraging multi-turn reinforcement learning with reward shaping to encourage meaningful improvements (see the sketch after this list).
- Using self-generated training data to avoid dependency on human feedback.
- Preventing failure modes like behavior collapse and overfitting through regularization.
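For intuition, here is a minimal sketch of the kind of reward shaping mentioned above, assuming a binary correctness signal from an answer checker; the function name and the bonus coefficient `alpha` are illustrative, not taken from the paper or this codebase.

```python
# Illustrative reward shaping for a two-attempt episode (sketch only).
# `first_correct` / `second_correct` are 1.0 if the attempt's final answer
# matches the reference solution, else 0.0; `alpha` is a hypothetical
# coefficient for the progress bonus.

def shaped_reward(first_correct, second_correct, alpha=2.0):
    """Reward for the second attempt: its own correctness plus a bonus
    proportional to the improvement over the first attempt. This rewards
    flipping wrong -> right and penalizes breaking a correct answer,
    discouraging the model from simply repeating its first response."""
    progress = second_correct - first_correct  # in {-1.0, 0.0, 1.0}
    return second_correct + alpha * progress

print(shaped_reward(0.0, 1.0))  # fixed a wrong first attempt -> 3.0
print(shaped_reward(1.0, 0.0))  # broke a correct first attempt -> -2.0
print(shaped_reward(1.0, 1.0))  # stayed correct -> 1.0
```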
Note: This is NOT the official repository; it is a personal implementation for experimentation and learning purposes.
- Python 3.8+
- PyTorch
- Transformers
- Create necessary directories.
```bash
mkdir data base_models saved_models
```

- Download the MATH dataset and store it in the `data` folder.

```bash
cd data
wget https://people.eecs.berkeley.edu/~hendrycks/MATH.tar
tar -xvf MATH.tar
cd ..
```
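After extraction you should have a `MATH/` directory containing `train/` and `test/` splits of per-problem JSON files. Below is a small sanity check, assuming the archive was extracted inside `data/` as above (the directory layout is an assumption based on the public MATH release):

```python
# Sanity-check the extracted MATH dataset: in the public release, each
# problem is a JSON file with "problem", "level", "type", and "solution" keys.
import json
from pathlib import Path

math_root = Path("data/MATH/train")  # assumes the tar was extracted inside data/
sample_path = next(math_root.rglob("*.json"))
with open(sample_path) as f:
    record = json.load(f)

print(sample_path)
print(record["problem"][:200])   # problem statement
print(record["solution"][:200])  # reference solution
```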
- Create the conda environment from `environment.yml`.

```bash
conda env create -f environment.yml
```

- Activate the environment.
```bash
conda activate llm-self-correct
```

- Download the base model from Hugging Face into the `base_models` folder by running `main.py` with the `download` task. Note that a Hugging Face access token may be required for certain models.
```bash
cd code/
python main.py --task download
```
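For reference, the `download` task presumably wraps something like the snippet below; the repo id, save path, and token value are illustrative placeholders rather than this repository's actual configuration.

```python
# Illustrative sketch of downloading a base model with huggingface_hub;
# the repo id and local path are assumptions, not the repo's defaults.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="google/gemma-2-2b-it",         # hypothetical (gated) base model
    local_dir="base_models/gemma-2-2b-it",  # store under base_models/ as above
    token="hf_...",                         # your Hugging Face access token
)
```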
- Once the model is downloaded, run `main.py` with the `train` task (a sketch of the two-attempt rollout this training builds on appears after these steps).

```bash
python main.py --task train
```

Or, use the `run.slurm` file to submit a job on HPRC.
```bash
cd ..
sbatch run.slurm
```
- To run the saved model in evaluation mode, use the following command.

```bash
python main.py --task evaluate
```
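As promised above, here is a minimal sketch of a two-attempt self-correction rollout, assuming a causal LM saved under `base_models/`; the model path and self-correction prompt are illustrative, not the exact ones used by this repository or the paper.

```python
# Minimal two-attempt self-correction rollout (illustrative sketch only;
# the model path and prompts are assumptions, not this repo's actual code).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "base_models/gemma-2-2b-it"  # hypothetical local model
tok = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

def two_attempt_rollout(problem, max_new_tokens=512):
    # Attempt 1: answer the problem directly.
    prompt1 = f"Problem: {problem}\nSolution:"
    ids1 = tok(prompt1, return_tensors="pt").input_ids
    out1 = model.generate(ids1, max_new_tokens=max_new_tokens, do_sample=True)
    attempt1 = tok.decode(out1[0, ids1.shape[1]:], skip_special_tokens=True)

    # Attempt 2: ask the model to review and revise its own answer.
    prompt2 = (
        prompt1 + attempt1
        + "\nThere might be an error in the solution above. "
        + "Please correct it and give your final answer.\nRevised solution:"
    )
    ids2 = tok(prompt2, return_tensors="pt").input_ids
    out2 = model.generate(ids2, max_new_tokens=max_new_tokens, do_sample=True)
    attempt2 = tok.decode(out2[0, ids2.shape[1]:], skip_special_tokens=True)
    return attempt1, attempt2
```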
If you build upon this work, please cite the original paper:

```bibtex
@article{kumar2024training,
  title={Training Language Models to Self-Correct via Reinforcement Learning},
  author={Kumar, Aviral and Zhuang, Vincent and Agarwal, Rishabh and Su, Yi and Co-Reyes, JD and Singh, Avi and Baumli, Kate and Iqbal, Shariq and Bishop, Colton and Roelofs, Rebecca and others},
  journal={arXiv preprint arXiv:2409.12917},
  year={2024}
}
```
This project is licensed under the MIT License.
For questions or collaboration, please contact the authors via the email addresses provided in the paper or open an issue on this repository.