
EvoVLA: Self-Evolving Vision-Language-Action Model

Links: arXiv · Project Page · Model · Data · License

This is the official repository for the paper:

EvoVLA: Self-Evolving Vision-Language-Action Model

Zeting Liu*, Zida Yang*, Zeyu Zhang*, and Hao Tang#

*Equal contribution. Project lead. #Corresponding author.

Teaser video: teaser.mp4

🔗 Citation

If you find our work helpful, please cite:

@article{liu2025evovla,
  title={EvoVLA: Self-Evolving Vision-Language-Action Model},
  author={Liu, Zeting and Yang, Zida and Zhang, Zeyu and Tang, Hao},
  journal={arXiv preprint arXiv:2511.16166},
  year={2025}
}

📰 News

  • [2025-12-05] Code released! We have released the training code, evaluation scripts, and pre-trained models.
  • [2025-11-27] Paper released on arXiv.

🎥 Demo Video

Click to watch the high-resolution demo video (1080p).


📖 Abstract: Ending the Robot's "Daydream"

In long-horizon manipulation tasks, robots often suffer from "Stage Hallucination": they think they've completed a step because the visual scene looks "close enough," but they haven't actually finished the job (e.g., a block is near the target but not stacked). This "high confidence, low competence" failure mode causes catastrophic task failure.

EvoVLA is a self-evolving framework designed to mitigate this hallucination. It treats Gemini 2.5 Pro as a strict teacher that generates "Hard Negatives" (near-miss scenarios), so the policy learns to distinguish between almost done and actually done.

Combined with Pose-Based Object Exploration (POE) and Long-Horizon Memory, EvoVLA achieves SOTA performance on the challenging Discoverse-L benchmark.


🚀 Key Features

1. Stage-Aligned Reward (SAR): The "Anti-Hallucination" Mechanism

Traditional rewards are too sparse. We use Gemini to generate a "Mistake Book" of Hard Negatives—states that look successful but are actually failures (e.g., "gripper near object but not touching").

  • Positive: "Gripper firmly grasping the block."
  • Hard Negative: "Gripper near the block, but fingers are empty."

Contrasting such pairs forces the VLA to learn precise, stage-aware visual discrimination (a minimal sketch of this objective follows the figure below).

Figure: The EvoVLA Data Engine generating triplets for contrastive learning.
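
For intuition, here is a minimal sketch (in PyTorch) of the triplet-style contrastive objective that such positive/hard-negative pairs enable. The function name, margin, and tensor shapes are illustrative assumptions, not the repository's actual API.

# Hypothetical sketch: contrast "actually done" frames against Gemini-generated
# near-miss frames so stage completion is judged by state, not by appearance.
import torch
import torch.nn.functional as F

def stage_contrastive_loss(anchor, positive, hard_negative, margin=0.2):
    """Pull truly-completed-stage embeddings together; push near-misses away."""
    pos_dist = 1.0 - F.cosine_similarity(anchor, positive, dim=-1)
    neg_dist = 1.0 - F.cosine_similarity(anchor, hard_negative, dim=-1)
    # Hinge: the hard negative must sit at least `margin` farther than the positive.
    return torch.clamp(pos_dist - neg_dist + margin, min=0.0).mean()

# Random tensors stand in for features of
# (stage description, successful frame, near-miss frame) triplets.
anchor = torch.randn(8, 512)
positive = torch.randn(8, 512)
hard_negative = torch.randn(8, 512)
loss = stage_contrastive_loss(anchor, positive, hard_negative)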

2. Pose-Based Object Exploration (POE)

Instead of being curious about pixel changes (which can be noisy shadows or lighting), EvoVLA's curiosity is grounded in geometry. It explores how to change the relative pose between the gripper and the object, leading to efficient, structure-aware discovery.
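
As a rough illustration, the sketch below computes a curiosity bonus from the change in the gripper-to-object relative pose between consecutive steps. The 4x4 homogeneous-transform inputs, function names, and the simple translation-plus-rotation norm are assumptions for this example, not the repository's implementation.

# Hypothetical sketch: reward geometric change, not pixel change.
import numpy as np

def relative_pose(T_gripper, T_object):
    """Object pose expressed in the gripper frame (4x4 homogeneous transforms)."""
    return np.linalg.inv(T_gripper) @ T_object

def poe_bonus(T_g_now, T_o_now, T_g_prev, T_o_prev, scale=1.0):
    """Intrinsic bonus proportional to how much the relative pose changed."""
    rel_now = relative_pose(T_g_now, T_o_now)
    rel_prev = relative_pose(T_g_prev, T_o_prev)
    d_trans = np.linalg.norm(rel_now[:3, 3] - rel_prev[:3, 3])             # translation drift
    d_rot = np.linalg.norm(rel_now[:3, :3] - rel_prev[:3, :3], ord="fro")  # rotation drift
    return scale * (d_trans + d_rot)

# Identical consecutive poses yield zero bonus; only real geometric change
# (grasping, lifting, re-orienting) is rewarded, so shadow/lighting noise is ignored.
T_eye = np.eye(4)
assert poe_bonus(T_eye, T_eye, T_eye, T_eye) == 0.0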

3. Long-Horizon Memory

For tasks with 70+ steps, simple memory averaging fails. EvoVLA uses Context Selection to recall only the critical history tokens needed for the current decision, preventing "catastrophic forgetting."
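
The following sketch shows one simple form of context selection: score each past-step feature against the current query and keep only the top-k, in temporal order. The dot-product scoring rule, token dimension, and k value are illustrative assumptions.

# Hypothetical sketch: keep only the history tokens relevant to the current decision.
import torch

def select_context(history, query, k=16):
    """history: (T, D) past-step features; query: (D,) current feature.
    Returns the k most relevant history tokens, kept in temporal order."""
    scores = history @ query                        # dot-product relevance
    k = min(k, history.shape[0])
    idx = torch.topk(scores, k=k).indices.sort().values
    return history[idx]                             # (k, D) compact memory

memory = torch.randn(74, 256)    # e.g. one feature per stage of a 74-stage task
query = torch.randn(256)
context = select_context(memory, query, k=16)      # only 16 tokens enter the policy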


📊 Performance

Simulation Results (Discoverse-L)

EvoVLA leads across three tasks of increasing difficulty, most notably on the 74-stage Block Bridge task.

Figure: Success rates on Discoverse-L benchmark. EvoVLA (Purple) significantly outperforms OpenVLA and π0.

  • Success Rate: 69.2% (+10.2% vs OpenVLA-OFT)
  • Hallucination Rate: Reduced from 38.5% to 14.8%
  • Sample Efficiency: 1.5x faster convergence

Real-World Sim2Real Transfer

Deployed on the AIRBOT-Play robot, EvoVLA shows remarkable robustness.

Figure: Real-world deployment performance.

Qualitative Analysis: "No More Faking It"

Figure: OpenVLA (Top) opens the gripper too early (hallucination). EvoVLA (Bottom) waits for stable contact.


📂 Project Structure

EvoVLA/
├── evovla/                    # Core module
│   ├── models/                # Policy models (OpenVLA-OFT)
│   ├── rewards/               # Reward modules (SAR, POE)
│   ├── ppo/                   # PPO trainer
│   ├── data/                  # Data utilities (Discoverse wrapper)
│   └── utils/                 # Utilities
├── configs/                   # Configuration files
├── scripts/                   # Training and evaluation scripts
├── tools/                     # Data preparation tools
└── requirements.txt           # Python dependencies

🛠️ Quick Start

Installation

# Clone the repository
git clone https://github.com/AIGeeksGroup/EvoVLA.git
cd EvoVLA

# Create environment
conda create -n evovla python=3.10
conda activate evovla

# Install dependencies
pip install -r requirements.txt

Training

# Train on Discoverse-L Bridge Task
python train.py --config configs/evovla_bridge.yaml

