
Motion-Refined DINOSAUR for Unsupervised Multi-Object Discovery

Xinrui Gong*1, Oliver Hahn*1, Christoph Reich1,2,3,4, Krishnakant Singh1, Simone Schaub-Meyer1,5, Daniel Cremers2,3,4, Stefan Roth1,4,5

1TU Darmstadt  2TU Munich  3MCML  4ELIZA  5hessian.AI  *equal contribution

ICCVW 2025

Paper PDF


TL;DR: We present MR-DINOSAUR, a fully unsupervised framework for multi-object discovery (MOD). Our method constructs pseudo labels by extracting video frames without camera motion and applying motion segmentation. Using our pseudo labels, we extend the object-centric learning model DINOSAUR to unsupervised MOD. MR-DINOSAUR achieves strong performance on TRI-PD and KITTI, surpassing prior state-of-the-art methods despite being fully unsupervised.

Abstract

Unsupervised multi-object discovery (MOD) aims to detect and localize distinct object instances in visual scenes without any form of human supervision. Recent approaches leverage object-centric learning (OCL) and motion cues from video to identify individual objects. However, these approaches use supervision to generate pseudo labels to train the OCL model. We address this limitation with MR-DINOSAUR---Motion-Refined DINOSAUR---a minimalistic unsupervised approach that extends the self-supervised pre-trained OCL model, DINOSAUR, to the task of unsupervised multi-object discovery. We generate high-quality unsupervised pseudo labels by retrieving video frames without camera motion for which we perform motion segmentation of unsupervised optical flow. We refine DINOSAUR's slot representations using these pseudo labels and train a slot deactivation module to assign slots to foreground and background. Despite its conceptual simplicity, MR-DINOSAUR achieves strong multi-object discovery results on the TRI-PD and KITTI datasets, outperforming the previous state of the art despite being fully unsupervised.

Method: MR-DINOSAUR

Figure 1. Pseudo-label generation from quasi-static frames via motion segmentation and clustering.

Figure 2. MR-DINOSAUR architecture and training overview. Stage 1 refines the DINOSAUR slot representations (left). Stage 2 learns foreground/background discrimination (right).

We propose MR-DINOSAUR, an unsupervised multi-object discovery method leveraging motion-based pseudo-labels to refine the object-centric learning model DINOSAUR. Our pipeline consists of two steps: pseudo-label generation and DINOSAUR refinement. For pseudo-label generation, we first retrieve quasi-static frames, characterized by minimal camera motion, to ensure optical flow predominantly arises from moving objects. Foreground masks are obtained by thresholding optical flow magnitudes, and connected components are partitioned into instance masks using gradient-based motion discontinuities and HDBSCAN clustering. For DINOSAUR refinement, we introduce a two-stage training scheme: first, pseudo-instance masks supervise DINOSAUR's slot-attention module to improve object segmentation; second, we propose a slot-deactivation module that distinguishes foreground from background, guided by pseudo-labels and a similarity-based drop-loss.
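
To make the pseudo-label generation step concrete, the following is a minimal sketch of the flow-thresholding and clustering idea. It is not the repository code: the magnitude threshold, the per-pixel features, and the use of the hdbscan package are illustrative assumptions, and the actual pipeline additionally exploits gradient-based motion discontinuities (see the scripts under pseudo_label_generation/ for the exact implementation).

# Illustrative sketch (assumptions noted above); flow is a (H, W, 2) optical-flow
# field predicted for a quasi-static frame, e.g., by SMURF or RAFT.
import numpy as np
from scipy import ndimage
import hdbscan

def pseudo_instance_masks(flow, mag_thresh=1.0, min_cluster_size=200):
    """Return an integer mask of pseudo-instance ids (0 = background)."""
    mag = np.linalg.norm(flow, axis=-1)
    foreground = mag > mag_thresh                  # keep pixels with significant motion
    components, num = ndimage.label(foreground)    # connected components of the foreground mask

    instances = np.zeros_like(components)
    next_id = 1
    for c in range(1, num + 1):
        ys, xs = np.nonzero(components == c)
        if len(xs) < min_cluster_size:             # small components become a single instance
            instances[ys, xs] = next_id
            next_id += 1
            continue
        # Per-pixel features combining position and flow, so that motion
        # discontinuities can separate touching objects within one component.
        feats = np.stack([xs, ys, flow[ys, xs, 0], flow[ys, xs, 1]], axis=1)
        labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(feats)
        for k in np.unique(labels):
            if k == -1:                            # HDBSCAN noise label
                continue
            instances[ys[labels == k], xs[labels == k]] = next_id
            next_id += 1
    return instances

The resulting per-frame instance masks serve as pseudo-labels for the two-stage refinement described above.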

Development Setup

This project is built on the object-centric-learning-framework (OCLF), which avoids duplicate code for ablations by defining models and experiments in configuration files and composing them via Hydra. For more information about OCLF, please refer to the tutorial.

Follow the steps below to install the dependencies and CLI scripts for running experiments in a Poetry-managed virtual environment. The installation requires at least Python 3.8.

git clone https://github.com/XinruiGong/mr-dinosaur.git
cd mr-dinosaur
conda create -n mrdinosaur python=3.8
conda activate mrdinosaur
pip install poetry==1.8.5
export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring 
poetry install -E timm -E clip

Dataset Organization

To prepare a dataset, follow this pipeline: (1) download each dataset and reorganize it into the required directory structure (in our project, we use KITTI, TRI-PD, and MOVi-E); (2) generate pseudo-labels; and (3) convert the training set, validation set, and pseudo-label set into the WebDataset format. In the following, we provide an example using the KITTI dataset.

Dataset Download and Reorganize

Follow the steps below to install the dependencies needed for dataset download, reorganization and conversion to the WebDataset format.

cd scripts/datasets
conda create -n webdataset python=3.9
conda activate webdataset
pip install poetry==1.8.5 
poetry install
pip install imageio
bash download_and_convert.sh KITTI_pre

After this, the KITTI training set is downloaded and reorganized into the required structure. Please also download the KITTI test set. For TRI-PD, please download the dataset shared by DOM or refer to the script. The required folder structure is as follows.

Dataset/
├── Dataset_train/
│   └── camera_folder/
│       └── PNGImage/
│           ├── scene_01/
│           │   ├── XXXXXXXX.png
│           │   ├── XXXXXXXX.png
│           │   └── ...
│           └── scene_02/
│               ├── XXXXXXXX.png
│               └── ...
└── Dataset_test/
    ├── image/
    │   ├── XXXXXXXX.png
    │   ├── XXXXXXXX.png
    │   └── ...
    └── ground_truth/
        ├── XXXXXXXX.png
        ├── XXXXXXXX.png
        └── ...

Pseudo-Label Generation

To obtain the pretrained SMURF and RAFT models, please refer to SMURF, the PyTorch reimplementation of SMURF, and RAFT, and download the required checkpoints.

For step (2) pseudo-label generation, run the following commands to generate the labels for the KITTI dataset.

cd ../../pseudo_label_generation
conda create -n pseudo python=3.8
conda activate pseudo
pip install -r requirements.txt
python preprocess_kitti_smurf.py --compute_flow --base_input_dir "KITTI_train/image_02" --rgb_path "PNGImages_02"
python preprocess_kitti_smurf.py --compute_flow --base_input_dir "KITTI_train/image_03" --rgb_path "PNGImages_03"

Note: To generate pseudo-labels for the TRI-PD dataset, the images need to be resized and cropped by passing the --diod_pd flag.

python preprocess_pd_smurf.py --compute_flow --diod_pd --base_input_dir "PD_simplified/camera_01" --rgb_path "PNGImages_01"
python preprocess_pd_smurf.py --compute_flow --diod_pd --base_input_dir "PD_simplified/camera_05" --rgb_path "PNGImages_05"
python preprocess_pd_smurf.py --compute_flow --diod_pd --base_input_dir "PD_simplified/camera_06" --rgb_path "PNGImages_06"

Convert to WebDataset

For step (3), convert the training set, validation set, and pseudo-label set into the WebDataset format, following the steps below.

cd ../../scripts/datasets
conda activate webdataset
bash download_and_convert.sh KITTI

Now, we can use the dataset and pseudo labels for model training and evaluation.

MR-DINOSAUR Training & Evaluation

Model Training

The baseline model DINOSAUR can be trained as follows:

cd ../..   # Go back to root folder
export DATASET_PREFIX=scripts/datasets/outputs
conda activate mrdinosaur
poetry run ocl_train +experiment=projects/mr_dinosaur/dinosaur_dinov2_kitti/kitti_config.yaml

First, we refine the slot representations in MR-DINOSAUR training stage 1:

poetry run ocl_train_sa +experiment=projects/mr_dinosaur/mr_dinosaur_dinov2_kitti/kitti_config_sa.yaml\
+load_checkpoint=path/to/dinosaur_checkpoint.ckpt

Next, we learn to distinguish foreground from background in MR-DINOSAUR training stage 2:

poetry run ocl_train_mlp +experiment=projects/mr_dinosaur/mr_dinosaur_dinov2_kitti/kitti_config_mlp.yaml\
+load_checkpoint=path/to/mr_stage1_checkpoint.ckpt

Model Evaluation

For the evaluation of the DINOSAUR model or MR-DINOSAUR stage 1, run:

poetry run ocl_eval +evaluation=projects/mr_dinosaur/metrics_kitti_dinov2.yaml\
train_config_path=.yaml\
checkpoint_path=.ckpt\
output_dir=outputs

For the evaluation of the final MR-DINOSAUR model, run:

poetry run ocl_eval +evaluation=projects/mr_dinosaur/metrics_kitti_mr_dinosaur_dinov2.yaml\
train_config_path=.yaml\
checkpoint_path=.ckpt\
output_dir=outputs

Inference on Custom Data

Next, we provide instructions to run MR-DINOSAUR on custom data. Given images at /path/to/custom, run the following steps.

1. Download the checkpoints and config files.

Download the trained checkpoints and corresponding config files from here, e.g., kitti_mr_stage2.ckpt and kitti_config_mr_stage2.yaml.

2. Convert custom images to WebDataset format.

cd scripts/datasets
conda activate webdataset
bash download_and_convert.sh Custom /path/to/custom

3. Run inference.

Adapt the custom.yaml file: change the test_shards path and test_size according to the location of the processed data and the number of images to be processed, as illustrated below.
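
As an illustration, the two fields could look as follows; the shard pattern and values are assumptions, and the exact nesting inside the shipped custom.yaml may differ.

# custom.yaml (illustrative excerpt)
test_shards: /path/to/custom_webdataset/shard-{000000..000009}.tar
test_size: 100

Next, run inference via: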

cd ../..
conda activate mrdinosaur
poetry run ocl_eval +evaluation=projects/mr_dinosaur/inference_custom_mr_dinosaur_dinov2.yaml\
train_config_path=path/to/kitti_config_mr_stage2.yaml\
checkpoint_path=path/to/kitti_mr_stage2.ckpt\
output_dir=outputs

Checkpoints

Here we provide checkpoints of the trained models.

Citation

If you find our work useful, please consider citing the following paper.

@inproceedings{Gong:2025:MRD,
  author = {Xinrui Gong and 
            Oliver Hahn and 
            Christoph Reich and 
            Krishnakant Singh and 
            Simone Schaub-Meyer and 
            Daniel Cremers and 
            Stefan Roth},
  title = {Motion-Refined {DINOSAUR} for Unsupervised Multi-Object Discovery},
  booktitle = {Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW)},
  year  = {2025},
}

Acknowledgements

The code is built on the OCLF codebase and uses pre-trained checkpoints from DINOv2, SMURF, and RAFT. We thank the authors for their excellent work and for open-sourcing their code and pre-trained models.

This project has received funding from the ERC under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 866008). This work has further been co-funded by the LOEWE initiative (Hesse, Germany) within the emergenCITY center [LOEWE/1/12/519/03/05.001(0016)/72] and the Deutsche Forschungsgemeinschaft (German Research Foundation, DFG) under Germany’s Excellence Strategy (EXC 3066/1 “The Adaptive Mind”, Project No. 533717223). This project was also partially supported by the European Research Council (ERC) Advanced Grant SIMULACRON, DFG project CR 250/26-1 “4D-YouTube”, and GNI Project “AICC”. Christoph Reich is supported by the Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA) through the DAAD programme Konrad Zuse Schools of Excellence in Artificial Intelligence, sponsored by the Federal Ministry of Education and Research. Finally, we acknowledge the support of the European Laboratory for Learning and Intelligent Systems (ELLIS). Special thanks go to Divyam Sheth for his last-minute help with the paper.