Step 1: Deep Reinforcement Learning from Human Preferences [Christiano et al., 2017].
Step 2: Fine-Tuning Language Models from Human Preferences, 2020.
1. https://github.com/lucidrains/PaLM-rlhf-pytorch: RLHF training built on Google's PaLM.
2. https://github.com/CarperAI/trlx: a framework for RLHF training.
3. https://github.com/LAION-AI/Open-Assistant: a project for building a chatbot.