MOCHA recipe: (1) pretraining of the student model; (2) knowledge distillation of rich joint visual and textual features from a frozen teacher; (3) few-shot personalization with a frozen student and a prototypical learner.
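For a rough picture of what stage (2) optimizes, the snippet below sketches a feature-alignment distillation loss between projected student features and frozen teacher features. It is only a conceptual illustration under assumed shapes and a cosine objective, not MOCHA's actual loss:

```python
# Conceptual sketch of feature distillation: project student features into the
# teacher's joint visual-textual space and align them. Shapes and the cosine
# objective are illustrative assumptions, not the exact MOCHA loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

student_dim, teacher_dim, batch = 256, 512, 8
proj = nn.Linear(student_dim, teacher_dim)            # trainable projection head

student_feats = torch.randn(batch, student_dim)       # from the student backbone
teacher_feats = torch.randn(batch, teacher_dim)       # frozen teacher (e.g., CLIP/LLaVA) features

aligned = F.normalize(proj(student_feats), dim=-1)
target = F.normalize(teacher_feats, dim=-1).detach()  # teacher stays frozen
distill_loss = (1 - (aligned * target).sum(dim=-1)).mean()  # 1 - cosine similarity
```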
- Create a Python 3.11 environment and install the requirements:

  ```bash
  conda create -n mocha python==3.11.5
  conda activate mocha
  pip install -r requirements.txt
  ```
- Download the evaluation datasets processed into YOLO format and place them in the `data` directory. If needed, modify the paths in `datasets/data_paths.yaml`. The datasets include PerSeg, CORe50, and iCubWorld (see the evaluation commands below).
- Download OpenImages in `data/openimages`. Put the script `OI_subset.py` in `data/openimages/OpenImages` so that the folder structure is the following:

  ```
  data/openimages/
  ├── OpenImagesRaw/
  │   ├── oidv6-class-descriptions.csv
  │   ├── oidv6-train-annotations-bbox.csv
  │   ├── oidv6-test-annotations-bbox.csv
  │   ├── validation-annotations-bbox.csv
  │   ├── train/
  │   ├── val/
  │   └── test/
  └── OpenImages/
      └── OI_subset.py
  ```
- Rearrange the folder structure to match the YOLO dataset format (for more details, refer to the Ultralytics documentation):

  ```
  data/openimages/
  ├── train/
  │   ├── images/
  │   │   ├── im1.jpg
  │   │   ├── im2.jpg
  │   │   └── ...
  │   └── labels/
  │       ├── im1.txt
  │       ├── im2.txt
  │       └── ...
  ├── val/
  │   ├── images/
  │   └── labels/
  └── ...
  ```
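Before the first run, it can help to verify the rearranged layout. The snippet below is a hypothetical helper (not part of the repo) that checks that every image under `images/` has a matching annotation under `labels/`:

```python
# Hypothetical helper (not part of the repo): verify the YOLO-style layout by
# checking that every image has a matching label file.
from pathlib import Path

root = Path("data/openimages")
for split in ("train", "val"):
    images = sorted((root / split / "images").glob("*.jpg"))
    missing = [im.name for im in images
               if not (root / split / "labels" / f"{im.stem}.txt").exists()]
    print(f"{split}: {len(images)} images, {len(missing)} missing labels")
```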
When running the code for the first time on the OpenImages dataset using LLaVA or CLIP as the teacher model, the necessary cache files are generated automatically.
To ensure consistency with prior work, we use a subset of the OpenImages dataset. This subset can be extracted by running the `OI_subset.py` script.
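For reference, a YOLO label file holds one line per box in the form `class_id x_center y_center width height`, with all coordinates normalized to [0, 1]. The sketch below shows how a single OpenImages box (given as normalized `XMin`, `XMax`, `YMin`, `YMax`) maps to that format; it illustrates the format only and is not the logic of `OI_subset.py`:

```python
# Illustration of the YOLO label format: OpenImages boxes come as normalized
# XMin/XMax/YMin/YMax, while YOLO expects normalized center coordinates and sizes.
def oi_box_to_yolo(class_id: int, xmin: float, xmax: float,
                   ymin: float, ymax: float) -> str:
    x_c = (xmin + xmax) / 2.0
    y_c = (ymin + ymax) / 2.0
    w = xmax - xmin
    h = ymax - ymin
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

print(oi_box_to_yolo(0, 0.10, 0.50, 0.20, 0.80))  # -> "0 0.300000 0.500000 0.400000 0.600000"
```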
After the LLaVA and CLIP features have been cached separately, they can be merged and compressed via PCA, as follows.
To cache the Text-Vision embeddings:

```bash
python caching_oi.py
```

To compute the PCA on the merged OpenImages embeddings:

```bash
python intitialize_pca_oi.py
```

If needed, you can repeat the previous steps with the `fs` version of the scripts to compute the PCA on the few-shot datasets, specifying the dataset.
Note that you can skip this step by downloading the precomputed PCA for OpenImages from here and storing it in `ckpts/pca_oi.npz`.
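For intuition, the PCA initialization conceptually amounts to concatenating the cached LLaVA and CLIP embeddings per sample, fitting a PCA with the target dimensionality, and saving the projection. The sketch below uses hypothetical cache file names and `.npz` keys and scikit-learn; it is not the repo's script:

```python
# Minimal sketch of merging cached features and fitting a PCA. File names,
# array keys, and the use of scikit-learn are assumptions, not the repo's code.
import numpy as np
from sklearn.decomposition import PCA

llava = np.load("cache/llava_features.npy")    # hypothetical cache, shape (N, D1)
clip = np.load("cache/clip_features.npy")      # hypothetical cache, shape (N, D2)

merged = np.concatenate([llava, clip], axis=1)         # joint visual-textual features
pca = PCA(n_components=512).fit(merged)                # matches --pca_dim 512

np.savez("ckpts/pca_oi.npz",
         components=pca.components_, mean=pca.mean_)   # keys are illustrative
```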
The checkpoints `ckpts/auxft.pth` and `ckpts/base.pth` come from the original AuXFT repository (here) and can also be downloaded.
To train the MOCHA architecture you can run the following command:
```bash
torchrun --nproc_per_node=1 distillation.py --pca_dim 512 --epochs 50
```

By default, the code initializes the YOLOv8 architecture on COCO (`--init_ckpt=none`). You can set `--init_ckpt=ckpts/auxft.pth` and `--epochs=20` to train the MOCHA (AuXFT) variant.
To evaluate MOCHA:
```bash
python train_protonet.py --model mocha --pca_dim 512 --pca_path ckpts/pca_oi.npz --dataset perseg --ckpt ckpts/mocha.pth
```

Note that, similarly to AuXFT, before evaluating on the CORe50 and iCubWorld datasets you should generate the finetuned checkpoints by running the `finetune_core_icub.py` script.
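As a reminder of what the prototypical learner does at evaluation time: class prototypes are the mean embeddings of the few support samples, and each query is assigned to its nearest prototype. A minimal sketch (shapes and the Euclidean distance are illustrative assumptions, not MOCHA's exact evaluator):

```python
# Minimal sketch of prototypical-network classification over frozen student
# embeddings. Shapes and the Euclidean distance are illustrative assumptions.
import torch

n_way, k_shot, n_query, dim = 5, 5, 15, 512
support = torch.randn(n_way, k_shot, dim)    # few-shot support embeddings per class
queries = torch.randn(n_query, dim)          # query embeddings from the frozen student

prototypes = support.mean(dim=1)             # (n_way, dim) class prototypes
dists = torch.cdist(queries, prototypes)     # (n_query, n_way) pairwise distances
predictions = dists.argmin(dim=1)            # nearest-prototype class indices
print(predictions)
```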
If you find this work useful, please cite:
```bibtex
@article{camuffo2025mocha,
  title={MOCHA: Multimodal Few-Shot Learning with Knowledge Distillation},
  author={Camuffo, Elena and Barbato, Francesco and Ozay, Mete and Milani, Simone and Michieli, Umberto},
  journal={arXiv preprint arXiv:2509.14001},
  year={2025}
}
```