- 🚀 Full RL training pipeline runnable on your laptop (via Tinker API)
- 📈 AIME 2024: 43.3% → 56.7% (+13.3%)
- 🛡️ Novel redundancy penalty to prevent reward hacking
- 💰 Total cost: < $150 (including failed experiments)
TL;DR: A minimal implementation of JustRL-style reasoning model training using the Tinker platform (run it on your MacBook!).
JustRL: Simplicity is all you need. No KL penalty, no entropy regularization, no length penalty—just RL.
This repository demonstrates a two-stage training pipeline to transform a standard instruction-tuned model into a reasoning model with explicit thinking capabilities:
| Stage | Purpose |
|---|---|
| Stage 1: Cold-Start SFT | Teach the model to use <think>...</think> tokens |
| Stage 2: JustRL (GRPO) | Reinforce reasoning via verifiable rewards |
Based on the paper "JustRL: Simplicity at Scale", we adopt a minimalist approach (sketched in code below):
- No KL penalty (kl_coef=0)
- No length penalty (found harmful in experiments)
- Clip-higher (clip_ratio_high=1.28)
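For concreteness, here is a minimal sketch of the resulting objective in PyTorch, assuming per-token log-probabilities and group-relative advantages as inputs. It is a plain clipped surrogate with asymmetric bounds and no KL or entropy terms, not the repository's exact code:

```python
import torch

def grpo_clip_higher_loss(logprobs, old_logprobs, advantages,
                          clip_low=0.8, clip_high=1.28):
    """Clipped policy-gradient surrogate with asymmetric ("clip-higher") bounds.

    No KL penalty and no entropy bonus, following the JustRL recipe.
    `logprobs` / `old_logprobs`: per-token log-probs of the sampled responses
    under the current and rollout policies; `advantages`: group-relative
    advantages broadcast over the response tokens.
    """
    ratio = torch.exp(logprobs - old_logprobs)             # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, clip_low, clip_high) * advantages
    return -torch.min(unclipped, clipped).mean()           # minimize negative surrogate
```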
Our implementation follows the evolution from standard GRPO to the simplified JustRL approach, with our own targeted improvements:
| Feature | Standard GRPO | JustRL | JustTinker (Ours) |
|---|---|---|---|
| Rollout | N responses per problem | Same | rollout_n=8 |
| Advantage | Group-relative rewards | Same | Same |
| Critic Model | None (Monte Carlo) | Same | Same |
| KL Penalty | Yes (kl_coef > 0) | Removed | None |
| Entropy Regularization | Optional | Removed | None |
| Clip Ratio | Symmetric (0.8, 1.2) | Asymmetric (0.8, 1.28) | Same (clip-higher, 0.8/1.28) |
| Training Samples | All samples | Positive advantage only | Same |
| Length Penalty | Optional | Removed (harmful) | None |
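The "group-relative, no critic" rows above reduce to normalizing each rollout's reward against the other rollouts of the same problem. A minimal sketch, assuming scalar rewards of shape (num_problems, rollout_n); the positive-advantage filter mirrors the "Training Samples" row:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_problems, rollout_n) scalar rewards, one row per problem.

    Each rollout is scored against its own group's mean/std, so no learned
    critic is needed (pure Monte Carlo baseline).
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# "Positive advantage only": keep just the rollouts that beat their group.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
advantages = group_relative_advantages(rewards)
train_mask = advantages > 0
```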
Our Additional Contributions (JustTinker):
| Modification | Source | Purpose |
|---|---|---|
| format_reward_weight=0.1 | Added | Encourage <think> format |
| redundancy_penalty | Original | Prevent reward hacking via repetitive content |
The original JustRL paper uses DeepSeek-R1-Distill-Qwen-1.5B, which was distilled from DeepSeek-R1 (671B) and natively outputs <think>...</think> format. This model already "thinks out loud" by default.
In contrast, Qwen3-4B-Instruct-2507 is an internalized reasoning model:
| Model | Reasoning Style | <think> Output |
|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | Explicit (shows reasoning) | Native |
| Qwen3-4B-Thinking-2507 | Explicit (shows reasoning) | Native |
| Qwen3-4B-Instruct-2507 | Internalized (hidden reasoning) | No |
Qwen3-4B-Instruct-2507 was trained to produce concise, direct answers without exposing the step-by-step reasoning process. The reasoning capability is "compressed" into the model weights through distillation, optimizing for fewer output tokens.
Why use it anyway? It's the most suitable small model available on the Tinker platform. To align with the JustRL setup, we perform a cold-start SFT to teach the model the <think>...</think> format before RL training.
Since Qwen3-4B-Instruct-2507 doesn't output thinking tokens by default, we need to "awaken" this capability:
Before SFT: "The answer is 42."
After SFT: "<think>Let me analyze this step by step...</think>\n\nThe answer is \boxed{42}."
This is a very short training phase (~800 steps, <$30) that teaches format compliance, not reasoning ability. The model already has strong reasoning capabilities—we're just teaching it to show its work.
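As an illustration, cold-start targets can be built by wrapping an existing worked solution in <think> tags so the model imitates the explicit format. The function and field names below are hypothetical; map them to whatever columns the filtered OpenR1-Math-220k split actually provides.

```python
def build_coldstart_example(problem: str, reasoning: str, answer: str) -> dict:
    """Turn (problem, worked solution, final answer) into a chat-style SFT example
    whose target response uses the explicit <think>...</think> format."""
    target = (
        f"<think>\n{reasoning.strip()}\n</think>\n\n"
        f"The answer is \\boxed{{{answer}}}."
    )
    return {
        "messages": [
            {"role": "user", "content": problem},
            {"role": "assistant", "content": target},
        ]
    }
```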
| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen3-4B-Instruct-2507 |
| Dataset | OpenR1-Math-220k (10K samples, filtered) |
| Max Sequence Length | 8,192 tokens |
| Training Steps | 800 |
| Batch Size | 16 (8 × 2 gradient accumulation) |
| Learning Rate | 2e-5 |
| Total Cost | < $30 |
The trained model is made publicly available on Tinker:
tinker://b0af3bd0-9638-583f-8c2c-2bb348453023:train:0/weights/coldstart_sft_final
You can load this checkpoint directly for Stage 2 (JustRL) training:
python scripts/tinker/justrl_math_reasoning.py \
--checkpoint {coldstart_sft_final} \
--reasoning \
  --scale medium

The cold-start SFT successfully taught the model to produce structured thinking:
| Metric | Initial | Final |
|---|---|---|
| Thinking Rate | 0% | 70% |
| Boxed Answer Rate | 36.7% | 80% |
| Training Loss | 0.86 | 0.33 |
Key Observations:
- Rapid Format Learning: The model learned the <think> format around steps 250-300, with the thinking rate jumping from 0% to ~50%
- Stable Convergence: Both the thinking rate (~70-80%) and the boxed-answer rate (~80%) stabilized after step 500
- Response Length Growth: Average response length increased from ~8K to ~12K tokens as the model learned to produce detailed reasoning
- Efficient Training: 800 steps were sufficient for format learning; further training showed diminishing returns
Input: How many terms will there be if we expand (4x³ + x⁻³ + 2)²⁰¹⁶?
Output:
<think>
Okay, let's see. I need to figure out how many terms there will be when we
expand (4x³ + x⁻³ + 2)²⁰¹⁶ and combine like terms. Hmm, expanding such a
high exponent might be complicated, but maybe there's a pattern or formula
I can use instead of multiplying everything out.
First, let me recall...
</think>
The number of distinct terms is \boxed{12097}.
We conducted three experiments to explore JustRL-style training:
| Experiment | Status | Key Finding |
|---|---|---|
| Exp 001 | Failed | Training collapse due to reward hacking |
| Exp 002 | Partial | Redundancy penalty prevents collapse |
| Exp 003 | In Progress | AIME accuracy +13.3% with harder data |
Experiment ID: failed_exp_001_training_collapse_20260111
Status: Failed — Documented for learning purposes
During our first JustRL training run, we observed a classic reward hacking phenomenon where the model exploited the reward function in unintended ways.
# This configuration led to training collapse
algorithm:
clip_ratio_low: 0.8
clip_ratio_high: 1.28
kl_coef: 0.0 # ❌ No KL penalty - allowed unconstrained drift
training:
learning_rate: 1e-6
batch_size: 32
rollout_n: 8
max_response_length: 8192
eval:
eval_interval: 20 # ❌ Too infrequent to catch collapse early
eval_samples: 200
# - No format reward weight

The plot shows clear signs of collapse after step ~120: accuracy drops sharply while response length explodes.
| Step Range | Accuracy | Thinking Rate | Avg Response Length | Status |
|---|---|---|---|---|
| 1-50 | 67-88% | 78-92% | 2,600-4,200 | Normal |
| 51-80 | 70-90% | 80-89% | 3,000-4,400 | Normal |
| 80-120 | 60-85% | 72-88% | 3,500-5,000 | Warning signs |
| 120-135 | 55-75% | 65-75% | 4,500-5,200 | Degrading |
| 135-145 | 36-56% | 37-56% | 4,900-5,700 | Rapid collapse |
| 145-158 | 10-28% | 14-31% | 6,000-7,200 | Complete collapse |
The model discovered that generating longer responses occasionally led to correct answers by chance. Without constraints, this behavior was reinforced:
Feedback Loop:
┌─────────────────────────────────────────────────────────────┐
│ Occasionally long response → correct answer → reward │
│ ↓ │
│ Policy reinforces "generate longer" │
│ ↓ │
│ Quality drops → fewer correct samples │
│ ↓ │
│ Remaining correct samples are mostly long → more bias │
│ ↓ │
│ Collapse: 35,000+ char responses, no reasoning, ~10% acc │
└─────────────────────────────────────────────────────────────┘
We extracted typical reward hacking samples from Step 140 evaluation:
| Sample | Response Length | Has Thinking | Extracted Answer |
|---|---|---|---|
| #1 | 35,739 chars | No | (extraction failed) |
| #2 | 32,268 chars | No | (extraction failed) |
| #3 | 30,484 chars | No | (extraction failed) |
| #4 | 29,964 chars | No | ( ( ( ( ( ( ... (repetitive) |
| #5 | 29,195 chars | No | (extraction failed) |
Common patterns:
- Responses hit max_length (8,192 tokens) and get truncated
- No <think>...</think> structure
- Repetitive text loops (e.g., "Therefore, the three sides..." repeated 100+ times)
- Self-aware "Wait" statements showing the model recognizes the issue but cannot stop

Full analysis: docs/research/reward_hacking_mechanism.md
| JustRL Paper Recommendation | Our Experience |
|---|---|
| kl_coef=0 | Allowed unconstrained policy drift |
| No length penalty | Contributed to response explosion |
| 800+ steps training | Collapse started at step ~100 |
Key insight: JustRL's "simplicity" works for models already trained to reason (DeepSeek-R1-Distill), but may need guardrails for models learning to reason from scratch.
Rather than abandoning JustRL's simplicity, we developed targeted interventions:
- Format reward: format_reward_weight=0.1 to incentivize <think> usage
- Redundancy penalty ⭐: Penalize repetitive content (our original contribution)
  - Uses compression ratio + N-gram analysis
  - Only activates when redundancy > 30%
  - Max penalty: 0.3 on correct answers
- Early stopping: Stop if eval accuracy drops 5%+ for 3 consecutive evals
- Health monitoring: Warn if response length > 5000, thinking rate < 60%, or redundancy > 40% (both monitors are sketched in code below)
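A minimal sketch of these two guardrails; the function names and the "drop relative to the best eval so far" interpretation of early stopping are our assumptions:

```python
def should_stop_early(eval_accuracies, drop=0.05, patience=3) -> bool:
    """True once eval accuracy has sat >= `drop` below its best for `patience` consecutive evals."""
    best, below = 0.0, 0
    for acc in eval_accuracies:
        best = max(best, acc)
        below = below + 1 if acc <= best - drop else 0
        if below >= patience:
            return True
    return False


def health_warnings(avg_response_len, thinking_rate, redundancy) -> list[str]:
    """Thresholds mirror the monitoring rules above (5000 / 60% / 40%)."""
    warnings = []
    if avg_response_len > 5000:
        warnings.append("response length exploding")
    if thinking_rate < 0.60:
        warnings.append("thinking rate dropping")
    if redundancy > 0.40:
        warnings.append("redundancy above 40%")
    return warnings
```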
What we deliberately avoided (following JustRL):
- ❌ KL penalty
- ❌ Length penalty (found harmful in JustRL experiments)
Experiment ID: exp_002_with_redundancy_penalty
Status: Not fully trained — Step 120/800
After implementing the mitigations above, we resumed training from Step 81 with the new configuration.
Following JustRL's minimalist philosophy, we avoid KL penalty and length penalty. Instead, we introduce an original redundancy penalty to combat reward hacking.
algorithm:
name: grpo
rollout_n: 8
clip_ratio_low: 0.8
clip_ratio_high: 1.28 # clip-higher (JustRL style)
kl_coef: 0.0 # ❌ No KL penalty (JustRL style)
entropy_coef: 0.0 # ❌ No entropy regularization
reward:
type: binary
correct_reward: 1.0
incorrect_reward: 0.0
length_penalty: false # ❌ No length penalty (harmful per JustRL)
format_reward_weight: 0.1 # ✅ Encourage <think> token usage
redundancy_weight: 0.3 # ✅ Original: penalize repetitive content
redundancy_threshold: 0.3 # ✅ Only penalize when redundancy > 30%
training:
learning_rate: 1e-6
max_response_length: 8192
  eval_interval: 10

| Technique | JustRL | Our Approach | Rationale |
|---|---|---|---|
| KL Penalty | ❌ No | ❌ No | On-policy RL has implicit regularization |
| Length Penalty | ❌ No | ❌ No | Harmful per JustRL findings |
| Format Reward | N/A | ✅ 0.1 | Encourage structured thinking |
| Redundancy Penalty | N/A | ✅ Original | Combat reward hacking without length penalty |
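Putting the reward-side settings together, here is a minimal sketch of how the weights above could combine into a single scalar reward; the repository's exact formula (e.g. whether the penalty is scaled by the redundancy score or applied as a flat value) may differ.

```python
def combined_reward(is_correct: bool, has_think_block: bool, redundancy_score: float,
                    format_weight: float = 0.1,
                    redundancy_weight: float = 0.3,
                    redundancy_threshold: float = 0.3) -> float:
    """Binary correctness reward, plus a small format bonus for <think> usage,
    minus a redundancy penalty that only activates above the threshold."""
    reward = 1.0 if is_correct else 0.0            # correct_reward / incorrect_reward
    if has_think_block:
        reward += format_weight                    # format_reward_weight
    if redundancy_score > redundancy_threshold:    # redundancy_threshold
        reward -= redundancy_weight * redundancy_score  # <= 0.3 since score <= 1
    return reward
```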
Why No KL Penalty?
According to RL's Razor, on-policy RL training naturally exhibits an implicit bias that keeps the policy close to the base model:
"On-policy RL training implicitly regularizes KL divergence from the base model, even without explicit KL penalties."
This theoretical insight, combined with JustRL's empirical success with kl_coef=0, supports our decision to skip the KL penalty.
Redundancy Penalty: Our Original Contribution
Instead of penalizing long responses (which can hurt legitimate reasoning), we penalize repetitive/redundant content — the true signature of reward hacking.
Method: Dual-metric fusion (sketched in code below)
- Compression ratio (60% weight): High repetition → high compression → high penalty
- N-gram repetition (40% weight): Repeated 5-grams indicate redundancy
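A minimal sketch of the dual-metric idea, not the repository's exact scoring: zlib compression gain stands in for the compression-ratio term, and duplicate word 5-grams stand in for the n-gram term. Ordinary prose also compresses somewhat, so a calibrated implementation would subtract a baseline so that normal reasoning scores near zero, as in the validation table below.

```python
import zlib
from collections import Counter

def redundancy_score(text: str, n: int = 5) -> float:
    """Fuse a compression-based signal (60%) with an n-gram repetition signal (40%).

    Highly repetitive text compresses much harder and reuses the same n-grams;
    the exact normalization in the repo likely differs from this sketch.
    """
    if not text.strip():
        return 0.0
    raw = text.encode("utf-8")
    # Compression gain: repetitive loops push this toward 1.0.
    compression = max(0.0, 1.0 - len(zlib.compress(raw)) / len(raw))
    # N-gram repetition: fraction of word 5-grams that are duplicates.
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(max(0, len(words) - n + 1))]
    if ngrams:
        counts = Counter(ngrams)
        ngram_rep = sum(c - 1 for c in counts.values()) / len(ngrams)
    else:
        ngram_rep = 0.0
    return 0.6 * compression + 0.4 * ngram_rep
```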
Validation on reward hacking samples:
| Sample Type | Redundancy Score | Penalty Applied |
|---|---|---|
| Normal reasoning | 0-4% | None |
| Reward hacking | 62-89% | 0.14-0.25 |
Full methodology: docs/research/redundancy_penalty_methodology.md
| Metric | Step 80 (Before) | Step 120 (Current) | Change |
|---|---|---|---|
| Eval Accuracy (MATH) | 83.00% | 84.00% | +1.0% |
| Best Eval Accuracy | 83.50% | 85.50% (Step 100) | +2.0% |
| Thinking Rate | ~80% | ~84% | Stable |
| Avg Response Length | ~4000 | ~3700 | Controlled |
| Avg Redundancy Score | N/A | ~38% | Within limits |
Key Observations:
- No collapse at Step 120 (unlike Exp 001 which collapsed at Step 120-140)
- Redundancy penalty keeping repetitive content in check (~38%, threshold 30%)
- Eval accuracy improved from 83% to 85.5% peak
- Response length stable, not exploding
Experiment ID: exp_003_justrl_aligned
Status: In Progress — Step 60/800
This experiment aligns more closely with the JustRL paper's training setup, using the DAPO-Math-17k dataset and adding AIME 2024 as an additional benchmark.
| Metric | SFT Baseline (Step 0) | Step 60 (Current) | Best | Change |
|---|---|---|---|---|
| Eval MATH Accuracy | 91.00% | 90.50% | 91.00% (Step 0) | -0.5% |
| Eval AIME Accuracy | 43.33% | 50.00% | 56.67% (Step 30/50) | +13.3% |
| MATH Thinking Rate | 99% | 98% | — | Stable |
| AIME Thinking Rate | 83% | 83% | — | Stable |
Key Observations:
- MATH accuracy remains stable around 88-91% (no degradation)
- AIME accuracy improved significantly: 43.33% → 56.67% peak (+13.3%)
- Thinking rate stable on both benchmarks
- No reward hacking observed (redundancy score ~39%, within limits)
Same reward/algorithm setup as Experiment 002, with key changes:
| Parameter | Exp 002 | Exp 003 | Note |
|---|---|---|---|
| Training Dataset | MATH train (~7.5K) | DAPO-Math-17k (11.4K) | Harder problems |
| Eval Datasets | MATH only | MATH + AIME-2024 | Added competition benchmark |
| Max Response Length | 8,192 tokens | 15,360 tokens | Aligned with JustRL paper |
| Dataset | Source | Size | Purpose |
|---|---|---|---|
| DAPO-Math-17k | BytedTsinghua-SIA/DAPO-Math-17k | 11,384 | Training (RL) |
| MATH-test | HuggingFaceH4/MATH-500 | 200 (stratified) | Evaluation |
| AIME-2024 | HuggingFaceH4/aime_2024 | 30 | Evaluation (competition-level) |
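All three datasets are on the Hugging Face Hub; a loading sketch, where split names and the random 200-sample subsetting (the repo stratifies) are assumptions to check against each dataset card and src/data/:

```python
from datasets import load_dataset

# RL training problems (~11.4K), per the table above.
train = load_dataset("BytedTsinghua-SIA/DAPO-Math-17k", split="train")

# Evaluation sets: a 200-problem subset of MATH-500 and the 30 AIME 2024 problems.
math500 = load_dataset("HuggingFaceH4/MATH-500", split="test")
math_eval = math500.shuffle(seed=0).select(range(200))   # stand-in for stratified sampling
aime_eval = load_dataset("HuggingFaceH4/aime_2024", split="train")
```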
| Aspect | Exp 002 | Exp 003 |
|---|---|---|
| Training Data | MATH train | DAPO-Math-17k |
| Eval Benchmarks | MATH only | MATH + AIME |
| Max Response | 8,192 tokens | 15,360 tokens |
| Training Samples | ~7,500 | ~11,384 |
The Exp 003 checkpoint is available on Tinker:
tinker://fbadbbce-0cfc-53dd-ad26-9117748c5070:train:0/weights/checkpoint_step_50
RLVR/
├── README.md # This file
├── scripts/
│ ├── launchers/ # Shell scripts (run_coldstart_sft.sh, run_justrl_reasoning.sh)
│ ├── tinker/ # Tinker API scripts (coldstart_sft.py, justrl_math_reasoning.py)
│ └── utils/ # Utilities (plot_rlvr_training.py, plot_sft_training.py)
├── src/
│ ├── configs/ # Training configurations (SFTConfig, RLConfig)
│ ├── data/ # Dataset loading (MATH, GSM8K, DAPO-Math-17k, AIME)
│ ├── evaluation/ # Math verification (MathVerifier)
│ └── prompts/ # Prompt formatting utilities
└── resources/ # Training curves and artifacts
├── coldstartSFT/ # Stage 1 results
├── justRL_exp001/ # Exp 001 (collapsed)
├── justRL_exp002/ # Exp 002 (with redundancy penalty)
└── justRL_exp003/ # Exp 003 (DAPO + AIME)
# Install dependencies
pip install -r requirements.txt
# Set API key
export TINKER_API_KEY=your_api_key

# Stage 1: Cold-start SFT
./scripts/launchers/run_coldstart_sft.sh small

# Stage 2: JustRL training (Exp 003 setup)
./scripts/launchers/run_justrl_reasoning.sh medium --reasoning \
  --checkpoint tinker://b0af3bd0-9638-583f-8c2c-2bb348453023:train:0/weights/coldstart_sft_final

Credits: All experiments were conducted using $150 in free credits gifted by Tinker.
| Stage | Steps | Estimated | Actual |
|---|---|---|---|
| Cold-Start SFT | 800 | ~$46 | < $30 |
| JustRL Exp 001 + 002 | 160 (120+40) | ~$100 | $72 |
| JustRL Exp 003 | 60 | ~$48 | $34 |
| Total Spent | — | — | ~$136 |
| Remaining Credit | — | — | ~$14 |
JustRL Cost Breakdown:
- Exp 001 (collapsed): ~120 steps before reward hacking
- Exp 002 (with redundancy penalty): 40 steps (Step 81-120)
- Average cost: ~$0.45/step
Note: Costs are based on Tinker pricing for Qwen3-4B-Instruct-2507 ($0.22/M tokens). Actual costs are often lower than estimates due to early stopping and response length variance.
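As a rough back-of-the-envelope check of the ~$0.45/step figure (ignoring prompt tokens and any sampling-vs-training billing split):

```python
price_per_million_tokens = 0.22   # Tinker rate quoted above for Qwen3-4B-Instruct-2507
cost_per_step = 0.45              # observed average in Exp 001/002
responses_per_step = 32 * 8       # batch_size x rollout_n from the configs above

tokens_per_step = cost_per_step / price_per_million_tokens * 1e6        # ~2.0M tokens
print(f"~{tokens_per_step / responses_per_step:,.0f} tokens per response")  # ~8,000, near the 8,192 cap
```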
- JustRL: Simplicity at Scale — Core methodology
- DeepSeek-R1 Technical Report — GRPO algorithm
- DAPO: Decoupled Clip and Dynamic Sampling — Advanced techniques
- RL's Razor: On-Policy Implicit Regularization — Why KL penalty may be unnecessary
- Tinker Platform — Training infrastructure
MIT License
If you find this repository helpful, please cite:
@misc{ning2026justtinker,
author = {Ning, Guanghan},
title = {JustTinker: Minimal Reinforcement Learning with Verifiable Rewards},
year = {2026},
publisher = {GitHub},
url = {https://github.com/Guanghan/JustTinker},
note = {A minimal implementation of JustRL-style reasoning model training with redundancy penalty}
}


