This is the official repository for the paper:
EvoVLA: Self-Evolving Vision-Language-Action Model
Zeting Liu*, Zida Yang*, Zeyu Zhang*†, and Hao Tang#
*Equal contribution. †Project lead. #Corresponding author.
[Teaser video: teaser.mp4]
If you find our work helpful, please cite:
```bibtex
@article{liu2025evovla,
  title={EvoVLA: Self-Evolving Vision-Language-Action Model},
  author={Liu, Zeting and Yang, Zida and Zhang, Zeyu and Tang, Hao},
  journal={arXiv preprint arXiv:2511.16166},
  year={2025}
}
```

- [2025-12-05] Code released! We have released the training code, evaluation scripts, and pre-trained models.
- [2025-11-27] Paper released on arXiv.
In long-horizon manipulation tasks, robots often suffer from "Stage Hallucination": they think they've completed a step because the visual scene looks "close enough," but they haven't actually finished the job (e.g., a block is near the target but not stacked). This "high confidence, low competence" failure mode causes catastrophic task failure.
EvoVLA is a self-evolving framework designed to cure this hallucination. By treating Gemini 2.5 Pro as a strict teacher that generates "Hard Negatives" (near-miss scenarios), EvoVLA learns to distinguish between almost done and actually done.
Combined with Pose-Based Object Exploration (POE) and Long-Horizon Memory, EvoVLA achieves SOTA performance on the challenging Discoverse-L benchmark.
Traditional task rewards are too sparse for long-horizon manipulation. EvoVLA's Stage-Aligned Reward (SAR) uses Gemini to generate a "Mistake Book" of Hard Negatives: states that look successful but are actually failures (e.g., "gripper near object but not touching").
- Positive: "Gripper firmly grasping the block."
- Hard Negative: "Gripper near the block, but fingers are empty."

This pairing forces the VLA to learn precise, stage-aware visual discrimination (see the sketch below).
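A minimal sketch of how such a contrastive, stage-aware reward could be scored, assuming a CLIP-style encoder that produces normalized embeddings. The function names and margin are illustrative, not the exact implementation in `evovla/rewards/`:

```python
# Illustrative stage-aligned reward with hard negatives (a sketch, not the
# repo's exact SAR module). `encode_text` is any CLIP-style text encoder,
# passed in rather than tied to a specific library.
import torch
import torch.nn.functional as F

def stage_aligned_reward(obs_embed, positive, hard_negatives,
                         encode_text, margin=0.1):
    """High reward only when the observation matches the positive stage
    description by a margin over every near-miss hard negative."""
    pos_sim = F.cosine_similarity(obs_embed, encode_text(positive), dim=-1)
    neg_sims = torch.stack([
        F.cosine_similarity(obs_embed, encode_text(n), dim=-1)
        for n in hard_negatives
    ])
    # "Almost done" states score high on a hard negative, so they earn
    # little or no reward here.
    return torch.clamp(pos_sim - neg_sims.max(dim=0).values - margin, min=0.0)
```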
Instead of being curious about raw pixel changes (which can be triggered by noisy shadows or lighting), EvoVLA's curiosity is grounded in geometry: POE explores how to change the relative pose between the gripper and the object, leading to efficient, structure-aware discovery.
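A toy version of that geometric curiosity signal. The 6-DoF pose layout and the weights are assumptions for illustration; the repo's POE module in `evovla/rewards/` may differ:

```python
# Toy pose-based exploration bonus: reward change in the gripper-object
# relative pose rather than raw pixel change. Assumes poses are
# (x, y, z, roll, pitch, yaw) vectors; the weights are illustrative.
import numpy as np

def poe_bonus(gripper_pose, obj_pose, prev_rel, w_pos=1.0, w_rot=0.5):
    rel = obj_pose - gripper_pose                     # simplified relative pose
    d_pos = np.linalg.norm(rel[:3] - prev_rel[:3])    # translation change
    d_rot = np.linalg.norm(rel[3:] - prev_rel[3:])    # orientation change
    return w_pos * d_pos + w_rot * d_rot, rel         # bonus + new reference
```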
For tasks with 70+ steps, simple memory averaging fails. EvoVLA uses Context Selection to recall only the critical history tokens needed for the current decision, preventing "catastrophic forgetting."
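One simple way to realize such context selection, sketched here with a dot-product relevance score. The names and the scorer are assumptions, not the repo's exact mechanism:

```python
# Keep only the k history tokens most relevant to the current decision,
# instead of averaging the whole trajectory.
import torch

def select_context(history, query, k=16):
    """history: (T, d) past tokens; query: (d,) current state embedding."""
    scores = history @ query                       # relevance per past token
    k = min(k, history.size(0))
    idx = scores.topk(k).indices.sort().values     # top-k, in temporal order
    return history[idx]                            # (k, d) compact memory
```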
EvoVLA dominates across three tasks of increasing difficulty, especially on the 74-stage Block Bridge task.
Figure: Success rates on Discoverse-L benchmark. EvoVLA (Purple) significantly outperforms OpenVLA and π0.
- Success Rate: 69.2% (+10.2% vs OpenVLA-OFT)
- Hallucination Rate: Reduced from 38.5% to 14.8%
- Sample Efficiency: 1.5x faster convergence
Deployed on the AIRBOT-Play robot, EvoVLA shows remarkable robustness.
Figure: OpenVLA (Top) opens the gripper too early (hallucination). EvoVLA (Bottom) waits for stable contact.
```
EvoVLA/
├── evovla/              # Core module
│   ├── models/          # Policy models (OpenVLA-OFT)
│   ├── rewards/         # Reward modules (SAR, POE)
│   ├── ppo/             # PPO trainer
│   ├── data/            # Data utilities (Discoverse wrapper)
│   └── utils/           # Utilities
├── configs/             # Configuration files
├── scripts/             # Training and evaluation scripts
├── tools/               # Data preparation tools
└── requirements.txt     # Python dependencies
```
```bash
# Clone the repository
git clone https://github.com/AIGeeksGroup/EvoVLA.git
cd EvoVLA

# Create environment
conda create -n evovla python=3.10
conda activate evovla

# Install dependencies
pip install -r requirements.txt
```

```bash
# Train on Discoverse-L Bridge Task
python train.py --config configs/evovla_bridge.yaml
```

