
# DeepEvolve - An Engineering-Oriented LLM Training Framework

## Commands

- SFT (supervised fine-tuning)

  ```bash
  python scripts/run_sft.py --config configs/sft.yaml
  ```

- GRPO (Group Relative Policy Optimization)

  ```bash
  # Note: requires num_generations >= 2 and generation_batch_size divisible by num_generations
  python scripts/run_grpo.py --config configs/grpo.yaml
  ```

- DPO (Direct Preference Optimization)

  ```bash
  # With the default base model
  python scripts/run_dpo.py --config configs/dpo.yaml

  # With an SFT checkpoint (recommended)
  python scripts/run_dpo.py --config configs/dpo.yaml \
    --model_path output/sft/DialoGPT-small_YYYYMMDD_HHMMSS/
  ```

- OpenEvolve (code evolution)

  ```bash
  python scripts/run_evolve.py --config configs/evolve.local.yaml
  ```


---

## Data Formats

- SFT: plain-text samples or chat-style messages
```json
{
  "text": "Human: 你好!\nAssistant: 你好! 很高兴为你服务。"
}
```
or
```json
{
  "messages": [
    {"role": "user", "content": "请介绍一下Python"},
    {"role": "assistant", "content": "Python 是一种易读、功能强大的编程语言..."}
  ]
}
```

- DPO: preference pairs (chosen/rejected)
```json
{
  "prompt": "如何高效学习?",
  "chosen": "制定计划-刻意练习-及时反馈-长期复盘。",
  "rejected": "多看就会了。"
}
```

- GRPO: minimal fields needed for preference/reward computation
```json
{
  "prompt": "Explain overfitting.",
  "meta": {"id": "sample-0001"}
}
```
> During training, `num_generations` completions are sampled per prompt and scored with the reward functions in `scripts/reward_functions.py`.
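
As a concrete illustration, here is a minimal sketch of a custom reward function that could live in `scripts/reward_functions.py`. It assumes TRL's `GRPOTrainer` calling convention, where each reward function receives the sampled completions (plus dataset columns as keyword arguments) and returns one float per completion; the `length_penalty_reward` name and the 200-character threshold are made up for this example and are not part of the repo.

```python
# Illustrative sketch only; the reward functions actually shipped in
# scripts/reward_functions.py may differ.
def length_penalty_reward(completions, **kwargs):
    """Toy reward: prefer concise answers of at most ~200 characters."""
    rewards = []
    for completion in completions:
        # TRL passes plain strings in the standard format, or lists of
        # {"role": ..., "content": ...} messages in the conversational format.
        text = completion if isinstance(completion, str) else completion[-1]["content"]
        rewards.append(1.0 if len(text) <= 200 else 200.0 / len(text))
    return rewards
```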

---

## Architecture

```
deepevolve/
├── configs/
│   ├── sft.yaml                  # SFT config (CPU-optimized, mixed precision disabled)
│   ├── grpo.yaml                 # GRPO config (valid combination with num_generations >= 2)
│   ├── dpo.yaml                  # DPO config (beta, loss_type, etc.)
│   └── evolve.local.yaml         # OpenEvolve config (API, iterations, etc.)
├── scripts/
│   ├── run_sft.py                # Wraps `trl sft`; saves timestamped outputs and config
│   ├── run_grpo.py               # Wraps `trl grpo`; includes CPU/memory optimization flags
│   ├── run_dpo.py                # Wraps `trl dpo`; supports `--model_path`
│   ├── run_evolve.py             # Runs the OpenEvolve pipeline
│   └── reward_functions.py       # GRPO reward functions (customizable)
├── output/
│   ├── sft/
│   │   └── {model}_{timestamp}/  # config.yaml, checkpoint-*, logs/
│   ├── grpo/
│   │   └── {model}_{timestamp}/  # config.yaml, training_summary.txt, checkpoint-*
│   ├── dpo/
│   │   └── {model}_{timestamp}/  # config.yaml, training_summary.txt, checkpoint-*
│   └── evolve/
│       └── evolve_{timestamp}/   # config.yaml, best/, logs/
└── models/                       # Local model cache (optional)
```
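
To make the wrapper pattern above concrete, here is a minimal sketch of what a `run_sft.py`-style script could look like, assuming it builds the timestamped output directory, copies the config next to the checkpoints, and shells out to the `trl sft` CLI. The flags passed to `trl sft` here are assumptions; the actual scripts in this repo may differ.

```python
# Hypothetical sketch of the run-script pattern; not the repo's actual code.
import shutil
import subprocess
import sys
from datetime import datetime
from pathlib import Path


def run_sft(config_path: str, model_name: str = "DialoGPT-small") -> Path:
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    out_dir = Path("output/sft") / f"{model_name}_{timestamp}"
    out_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(config_path, out_dir / "config.yaml")  # keep the config with the run
    # Assumes the `trl sft` CLI accepts --config and --output_dir; adjust as needed.
    subprocess.run(
        ["trl", "sft", "--config", config_path, "--output_dir", str(out_dir)],
        check=True,
    )
    return out_dir


if __name__ == "__main__":
    run_sft(sys.argv[1] if len(sys.argv) > 1 else "configs/sft.yaml")
```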

---

## Essentials

- GRPO:
  - `num_generations >= 2` is required
  - `generation_batch_size % num_generations == 0` is required (see the sketch after this list)
  - If memory is tight: lower `num_generations`/`group_size`/`max_*_length` and raise `gradient_accumulation_steps`
- DPO: run SFT first, then start DPO from the SFT checkpoint (recommended)
- SFT: disable mixed precision on CPU; reduce `batch_size`/`max_steps` if needed
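
As a quick illustration of the GRPO batching rule above, the following standalone snippet (not part of the repo) checks a candidate configuration and reports how many distinct prompts fit in one generation batch:

```python
# Standalone sanity check for the GRPO batching constraints; illustrative only.
def check_grpo_batching(num_generations: int, generation_batch_size: int) -> int:
    """Return the number of distinct prompts per generation batch if the config is valid."""
    if num_generations < 2:
        raise ValueError("GRPO compares completions within a group, so num_generations must be >= 2")
    if generation_batch_size % num_generations != 0:
        raise ValueError("generation_batch_size must be divisible by num_generations")
    return generation_batch_size // num_generations


print(check_grpo_batching(num_generations=4, generation_batch_size=8))  # 2 prompts per batch
```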

For reference implementations and parameter details, see the TRL project: [https://github.com/huggingface/trl](https://github.com/huggingface/trl)
