This repository contains the official implementation of
📄 “KnowRL: Reinforcement Learning for Reliable Self-Knowledge in Large Language Models”.
KnowRL is inspired by the SeRL framework but significantly modified to strengthen self-knowledge rather than task reasoning.
Recent research shows that even the best LLMs misjudge their own competence in more than 1 in 5 cases, limiting trust in their responses.
KnowRL addresses this by enabling models to recognize what they know and what they don’t through a lightweight self-improvement process.

- Introspection – The model generates and classifies tasks it judges feasible or infeasible for itself.
- Consensus-based Rewarding – Rewards are derived from the stability of internal agreement, requiring no external labels.
- Minimal Data – Training begins from a small manually verified seed set and scales via model-generated tasks.
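The consensus-based rewarding step can be sketched roughly as follows. This is a minimal illustration only: the function name `consensus_reward` and the exact-match agreement criterion are assumptions for exposition, not the repository's actual reward logic.

```python
from collections import Counter

def consensus_reward(answers: list[str]) -> float:
    """Label-free reward sketch: fraction of sampled answers that
    agree with the majority answer.

    High agreement across repeated generations is treated as a proxy
    for the model reliably 'knowing' the answer; low agreement signals
    a task the model should classify as infeasible for itself.
    """
    if not answers:
        return 0.0
    counts = Counter(a.strip().lower() for a in answers)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(answers)
```

For example, four sampled generations of which three agree would yield a reward of 0.75, while a fully unstable set of answers drives the reward toward `1 / n_samples`.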
In experiments on LLaMA-3.1-8B and Qwen-2.5-7B, KnowRL improved self-knowledge by up to 28 % in accuracy and 12 % in F1 within roughly 30 RL iterations, all without human annotations.
```
openrlhf/        # Core RL training scripts (adapted from SeRL/OpenRLHF)
knowrl_prompts/  # Seed prompts and validation instructions
evaluation/      # Intrinsic and extrinsic evaluation code
scripts/         # Example launch scripts for different models
```
We recommend Python 3.11 on Ubuntu 20.04 or later.
```bash
# 1. Clone repository
git clone <url>
cd KnowRL

# 2. Install dependencies
pip install -r requirements.txt

# 3. Install OpenRLHF backend (needed for RL training)
cd openrlhf
pip install -e .
```

KnowRL supports the same training infrastructure as SeRL but with new prompts and reward logic.
Recommended starter script for LLaMA-3.1-8B using Reinforce++:
```bash
ray start --head --node-ip-address 0.0.0.0
cd openrlhf
bash ../scripts/train_llama31_8b_knowrl.sh
```

Key hyperparameters (adjust to your hardware):
```
--micro_train_batch_size 2
--train_batch_size 16
--micro_rollout_batch_size 4
--rollout_batch_size 16
--n_samples_per_prompt 8
--max_epochs 1
--prompt_max_len 1024
--generate_max_len 1024
--actor_learning_rate 5e-7
--init_kl_coef 1e-4
```
Evaluation includes:
- Intrinsic self-consistency – agreement across repeated generations.
- Extrinsic benchmarking – domain-expert verification on held-out feasible/infeasible tasks.
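The extrinsic metrics (accuracy and F1 on held-out feasible/infeasible tasks) can be computed as below. This is an illustrative sketch of standard binary-classification metrics, not code from `evaluation/`; the function name and signature are assumptions.

```python
def accuracy_f1(preds: list[bool], labels: list[bool]) -> tuple[float, float]:
    """Accuracy and F1 for binary feasible(True)/infeasible(False) judgments.

    `preds` are the model's self-knowledge judgments; `labels` are the
    expert-verified ground truth on the held-out set.
    """
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(not p and l for p, l in zip(preds, labels))
    accuracy = sum(p == l for p, l in zip(preds, labels)) / len(preds)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1
```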
Example:

```bash
python evaluation/run_eval.py --model-path /path/to/your/checkpoint
```

| Model | Accuracy Gain | F1 Gain | Iterations |
|---|---|---|---|
| LLaMA-3.1-8B | +28 % | +12 % | ~30 |
| Qwen-2.5-7B | +25 % | +10 % | ~30 |
All improvements were achieved without external supervision.
- Single-language focus: experiments are currently English-only.
- Limited training horizon: performance beyond ~30 RL iterations remains untested due to compute constraints.
- Scaling uncertainty: effectiveness on models larger than 8B parameters is unverified.