Humans can perceive what is happening in a scene and situate it within the dynamic structure of a task, allowing them to estimate task progress from a single observation. We study whether VLMs possess a similar form of progress reasoning:
Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing
what is visible, it remains unclear whether they can infer how far a task has progressed from partial
observations. To this end, we first introduce PROGRESS-BENCH, a benchmark for systematically
evaluating progress reasoning in VLMs. Beyond benchmarking, we further explore a human-inspired two-stage progress reasoning paradigm through both training-free prompting and a training-based approach built on our curated PROGRESSLM-45K dataset. Experiments on 14 VLMs
show that most models are not yet ready for task progress estimation, exhibiting sensitivity
to demonstration modality and viewpoint changes, as well as poor handling of unanswerable
cases. While training-free prompting that enforces structured progress reasoning yields limited
and model-dependent gains, the training-based PROGRESSLM-3B achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the
evaluation tasks. Further analyses reveal characteristic error patterns and clarify when and why
progress reasoning succeeds or fails.
PROGRESS-BENCH evaluates whether a model can situate a single observation within the temporal structure of an ongoing task, going beyond static perception to reason about task progression. Rather than introducing all design factors at once, we construct the benchmark in successive stages, gradually increasing the reasoning challenges involved in progress estimation.
Specifically, PROGRESS-BENCH is designed around three key dimensions: demonstration modality, viewpoint, and answerability.
We frame progress reasoning as a human-inspired two-stage process (Figure 1). Given a demonstration and a partial observation, humans first perform episodic retrieval to identify a representative reference step as a coarse anchor, and then apply mental simulation to reason how the task state evolves from this anchor to the current observation. This formulation treats progress estimation as reasoning over a latent task trajectory, rather than matching observations to fixed timestamps.
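In pseudocode, this formulation might look like the following sketch; the function and method names (`retrieve_reference`, `simulate_from_anchor`) are illustrative placeholders under our assumptions, not the paper's actual interface:

```python
from dataclasses import dataclass

@dataclass
class ProgressEstimate:
    ref_step: int    # coarse anchor retrieved from the demonstration
    score: float     # fine-grained progress estimate

def estimate_progress(model, demonstration, observation) -> ProgressEstimate:
    # Stage 1: episodic retrieval -- identify the demonstration step
    # that best matches the current observation, as a coarse anchor.
    ref_step = model.retrieve_reference(demonstration, observation)
    # Stage 2: mental simulation -- reason about how the task state
    # evolves from that anchor to the current observation, refining
    # the coarse anchor into a fine-grained progress score.
    score = model.simulate_from_anchor(demonstration, ref_step, observation)
    return ProgressEstimate(ref_step=ref_step, score=score)
```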
We instantiate this two-stage reasoning via structured prompting without parameter updates. The prompt enforces an explicit schema with four fields: <ref_think> (episodic retrieval reasoning), <ref> (retrieved reference step), <score_think> (mental simulation), and <score> (final progress estimate), which the model follows at inference time.
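As a concrete illustration, a prompt enforcing this schema might look like the sketch below; only the four tag names come from the description above, while the instruction wording and the score scale are our own assumptions:

```python
# Illustrative prompt template. Only the four tags (<ref_think>, <ref>,
# <score_think>, <score>) are from the paper; the instructions and the
# 0-1 score scale are assumptions made for this sketch.
PROGRESS_PROMPT = """\
You are shown a task demonstration and a single partial observation.
Estimate how far the task has progressed, reasoning in two stages.

Answer in exactly this format:
<ref_think> Compare the observation to the demonstration and explain
which step it most closely matches. </ref_think>
<ref> index of the retrieved reference step </ref>
<score_think> Starting from that reference step, mentally simulate how
the task state evolves to the current observation. </score_think>
<score> progress as a number between 0 and 1 </score>
"""
```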
We further explore a training-based strategy to explicitly teach the two-stage reasoning process of episodic retrieval and mental simulation. We first perform supervised fine-tuning on our PROGRESSLM-25K-CoT dataset to internalize the structured reasoning format, where each example includes a task demonstration, a single observation, and a target reasoning trace specifying the reference step and progress score. To improve robustness and score calibration, we further apply a reinforcement learning stage that jointly rewards structured output, accurate reference retrieval, and precise progress estimation. This two-stage training procedure encourages models to produce interpretable reasoning while improving reliability under challenging progress estimation scenarios.
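A minimal sketch of what such a composite RL reward could look like is shown below; the equal weighting and exact functional forms are assumptions, since the text only states that structured output, reference retrieval, and progress accuracy are jointly rewarded:

```python
import re

def progress_reward(output: str, gold_ref: int, gold_score: float) -> float:
    """Composite reward sketch: format + retrieval + score terms,
    equally weighted (the weighting scheme is an assumption)."""
    # Format term: all four schema fields must be present and well formed.
    fields = {t: re.search(rf"<{t}>(.*?)</{t}>", output, re.S)
              for t in ("ref_think", "ref", "score_think", "score")}
    if not all(fields.values()):
        return 0.0  # malformed output earns no reward
    try:
        pred_ref = int(fields["ref"].group(1).strip())
        pred_score = float(fields["score"].group(1).strip())
    except ValueError:
        return 0.0
    r_format = 1.0
    # Retrieval term: exact match on the reference step index.
    r_ref = 1.0 if pred_ref == gold_ref else 0.0
    # Score term: decays linearly with absolute progress error.
    r_score = max(0.0, 1.0 - abs(pred_score - gold_score))
    return (r_format + r_ref + r_score) / 3.0
```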
We study progress estimation as a long-horizon, dynamic reasoning problem beyond static visual understanding. We introduce PROGRESS-BENCH to systematically evaluate progress reasoning from a single observation under controlled variations of modality, viewpoint, and answerability. Experiments on 14 VLMs show that existing models struggle with this task, exhibiting strong sensitivity to modality and viewpoint changes, degenerate progress predictions, and weak handling of unanswerable cases. Our analyses expose systematic failure modes in existing VLMs and show that robust progress estimation emerges only when coarse anchor retrieval and fine-grained reasoning are explicitly learned.
We would like to thank the Cambrian authors for providing this webpage template.