This is the official repository for the paper:
EvoVLA: Self-Evolving Vision-Language-Action Model
Zeting Liu*, Zida Yang*, Zeyu Zhang*†, and Hao Tang#
*Equal contribution. †Project lead. #Corresponding author.
[Teaser video: teaser.mp4]
If you find our work helpful, please cite:
```bibtex
@article{liu2025evovla,
  title={EvoVLA: Self-Evolving Vision-Language-Action Model},
  author={Liu, Zeting and Yang, Zida and Zhang, Zeyu and Tang, Hao},
  journal={arXiv preprint arXiv:2511.16166},
  year={2025}
}
```

- [2025-12-05] Code released! We have released the training code, evaluation scripts, and pre-trained models.
- [2025-11-27] Paper released on arXiv.
In long-horizon manipulation tasks, robots often suffer from "Stage Hallucination": they think they've completed a step because the visual scene looks "close enough," but they haven't actually finished the job (e.g., a block is near the target but not stacked). This "high confidence, low competence" failure mode causes catastrophic task failure.
EvoVLA is a self-evolving framework designed to cure this hallucination. By treating Gemini 2.5 Pro as a strict teacher that generates "Hard Negatives" (near-miss scenarios), EvoVLA learns to distinguish between almost done and actually done.
Combined with Pose-Based Object Exploration (POE) and Long-Horizon Memory, EvoVLA achieves SOTA performance on the challenging Discoverse-L benchmark.
Traditional task rewards are too sparse for long-horizon manipulation. EvoVLA's Stage-Aligned Reward (SAR) uses Gemini to generate a "Mistake Book" of Hard Negatives: states that look successful but are actually failures (e.g., "gripper near object but not touching").
- Positive: "Gripper firmly grasping the block."
- Hard Negative: "Gripper near the block, but fingers are empty."

This pairing forces the VLA to learn precise, stage-aware visual discrimination (see the sketch below).
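A minimal sketch of how such a contrastive, stage-aware reward could be scored, assuming a CLIP-style encoder that produces normalized embeddings. The function names and margin are illustrative, not the exact implementation in `evovla/rewards/`:

```python
# Illustrative stage-aligned reward with hard negatives (a sketch, not the
# repo's exact SAR module). `encode_text` is any CLIP-style text encoder,
# passed in rather than tied to a specific library.
import torch
import torch.nn.functional as F

def stage_aligned_reward(obs_embed, positive, hard_negatives,
                         encode_text, margin=0.1):
    """High reward only when the observation matches the positive stage
    description by a margin over every near-miss hard negative."""
    pos_sim = F.cosine_similarity(obs_embed, encode_text(positive), dim=-1)
    neg_sims = torch.stack([
        F.cosine_similarity(obs_embed, encode_text(n), dim=-1)
        for n in hard_negatives
    ])
    # "Almost done" states score high on a hard negative, so they earn
    # little or no reward here.
    return torch.clamp(pos_sim - neg_sims.max(dim=0).values - margin, min=0.0)
```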
Instead of being curious about raw pixel changes (which can be triggered by noisy shadows or lighting), EvoVLA's curiosity is grounded in geometry: POE explores how to change the relative pose between the gripper and the object, leading to efficient, structure-aware discovery.
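A toy version of that geometric curiosity signal. The 6-DoF pose layout and the weights are assumptions for illustration; the repo's POE module in `evovla/rewards/` may differ:

```python
# Toy pose-based exploration bonus: reward change in the gripper-object
# relative pose rather than raw pixel change. Assumes poses are
# (x, y, z, roll, pitch, yaw) vectors; the weights are illustrative.
import numpy as np

def poe_bonus(gripper_pose, obj_pose, prev_rel, w_pos=1.0, w_rot=0.5):
    rel = obj_pose - gripper_pose                     # simplified relative pose
    d_pos = np.linalg.norm(rel[:3] - prev_rel[:3])    # translation change
    d_rot = np.linalg.norm(rel[3:] - prev_rel[3:])    # orientation change
    return w_pos * d_pos + w_rot * d_rot, rel         # bonus + new reference
```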
For tasks with 70+ steps, simple memory averaging fails. EvoVLA uses Context Selection to recall only the critical history tokens needed for the current decision, preventing "catastrophic forgetting."
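One simple way to realize such context selection, sketched here with a dot-product relevance score. The names and the scorer are assumptions, not the repo's exact mechanism:

```python
# Keep only the k history tokens most relevant to the current decision,
# instead of averaging the whole trajectory.
import torch

def select_context(history, query, k=16):
    """history: (T, d) past tokens; query: (d,) current state embedding."""
    scores = history @ query                       # relevance per past token
    k = min(k, history.size(0))
    idx = scores.topk(k).indices.sort().values     # top-k, in temporal order
    return history[idx]                            # (k, d) compact memory
```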
EvoVLA dominates across three tasks of increasing difficulty, especially on the 74-stage Block Bridge task.
Figure: Success rates on Discoverse-L benchmark. EvoVLA (Purple) significantly outperforms OpenVLA and π0.
- Success Rate: 69.2% (+10.2% vs OpenVLA-OFT)
- Hallucination Rate: Reduced from 38.5% to 14.8%
- Sample Efficiency: 1.5x faster convergence
Deployed on the AIRBOT-Play robot, EvoVLA shows remarkable robustness.
Figure: OpenVLA (Top) opens the gripper too early (hallucination). EvoVLA (Bottom) waits for stable contact.
```
EvoVLA/
├── evovla/              # Core module
│   ├── models/          # Policy models (OpenVLA-OFT)
│   ├── rewards/         # Reward modules (SAR, POE)
│   ├── ppo/             # PPO trainer
│   ├── data/            # Data utilities (Discoverse wrapper)
│   └── utils/           # Utilities
├── configs/             # Configuration files
├── scripts/             # Training and evaluation scripts
├── tools/               # Data preparation tools
└── requirements.txt     # Python dependencies
```
```bash
# Clone the repository
git clone https://github.com/AIGeeksGroup/EvoVLA.git
cd EvoVLA

# Create environment
conda create -n evovla python=3.10
conda activate evovla

# Install dependencies
pip install -r requirements.txt
```

```bash
# Train on Discoverse-L Bridge Task
python train.py --config configs/evovla_bridge.yaml
```

