Official implementation of the paper "Hierarchical Vector Quantization for Unsupervised Action Segmentation"
Full implementation coming soon!
If you find this code or our model useful, please cite our paper:
@inproceedings{hvq2025spurio,
author = {Federico Spurio and Emad Bahrami and Gianpiero Francesca and Juergen Gall},
title = {Hierarchical Vector Quantization for Unsupervised Action Segmentation},
booktitle = {AAAI Conference on Artificial Intelligence (AAAI)},
year = {2025}
}

Hierarchical Vector Quantization (HVQ) is an unsupervised action segmentation approach that exploits the natural compositionality of actions. By employing a hierarchical stack of vector quantization modules, it effectively achieves accurate segmentation.
Here is an overview of our proposed model:
In more detail, the pre-extracted features are processed by an Encoder, implemented as a two-stage MS-TCN. The resulting encodings are then progressively quantized by a sequence of vector quantization modules. Each module operates with a decreasing codebook size, gradually refining the representation until the desired number of action classes is reached.
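Since the full implementation is not released yet, here is only a minimal, illustrative PyTorch sketch of the core idea: frame-wise encodings passed through a stack of vector quantization layers with decreasing codebook sizes. The class names, feature dimension and codebook sizes are assumptions for illustration and do not reflect the official code (which also includes the two-stage MS-TCN encoder and the training losses).

```python
# Minimal sketch of hierarchical vector quantization (illustrative, not the official code).
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""

    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, frames, dim) frame-wise encodings from the encoder
        dist = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, T, K)
        idx = dist.argmin(dim=-1)                  # hard code assignment per frame
        z_q = self.codebook(idx)                   # quantized encodings
        return z + (z_q - z).detach()              # straight-through estimator


class HierarchicalVQ(nn.Module):
    """Stack of quantizers with progressively smaller codebooks."""

    def __init__(self, dim: int, codebook_sizes=(64, 32, 16)):
        super().__init__()
        self.levels = nn.ModuleList(VectorQuantizer(k, dim) for k in codebook_sizes)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        for vq in self.levels:                     # progressively coarser quantization
            z = vq(z)
        return z                                   # last codebook size = number of action clusters


# Toy usage: 2 videos, 100 frames each, 64-dimensional encoder features
features = torch.randn(2, 100, 64)
print(HierarchicalVQ(dim=64)(features).shape)      # torch.Size([2, 100, 64])
```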
The Jensen-Shannon Distance (JSD) is a metric for evaluating the bias in the predicted segment lengths. For each video within the same activity, we compute the histogram of the predicted segment lengths, using a bin width of 20 frames. We then compare this histogram with the corresponding ground-truth histogram using the Jensen-Shannon Distance. These JSD scores are averaged across all videos of each activity. Finally, we calculate a weighted average across all activities, where the weights are the number of frames in each activity. In particular:

$$\mathrm{JSD}_v = \mathrm{JSD}\left(h_v^{\mathrm{pred}}, h_v^{\mathrm{gt}}\right),$$

where $h_v^{\mathrm{pred}}$ and $h_v^{\mathrm{gt}}$ are the histograms of the predicted and ground-truth segment lengths of video $v$, and

$$\mathrm{JSD} = \sum_{a \in \mathcal{A}} \frac{N_a}{\sum_{a' \in \mathcal{A}} N_{a'}} \, \frac{1}{|\mathcal{V}_a|} \sum_{v \in \mathcal{V}_a} \mathrm{JSD}_v,$$

where $\mathcal{A}$ is the set of activities, $N_a$ is the number of frames of activity $a$, and $\mathcal{V}_a$ is the set of videos of activity $a$.
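As an illustration of this metric, the sketch below computes the per-video Jensen-Shannon distance between segment-length histograms and the frame-weighted average over activities. The function names, the histogram range cap, and the logarithm base are assumptions, not taken from the released evaluation code.

```python
# Illustrative sketch of the segment-length JSD metric (assumptions noted in comments).
import numpy as np
from scipy.spatial.distance import jensenshannon


def segment_length_jsd(pred_lengths, gt_lengths, bin_width=20, max_len=2000):
    """Jensen-Shannon distance between segment-length histograms of one video."""
    # max_len is an assumed cap on segment length; bin width of 20 frames as in the text.
    bins = np.arange(0, max_len + bin_width, bin_width)
    p, _ = np.histogram(pred_lengths, bins=bins)
    q, _ = np.histogram(gt_lengths, bins=bins)
    # jensenshannon normalizes the histograms internally; base 2 is an assumption.
    return jensenshannon(p, q, base=2)


def weighted_jsd(per_activity):
    """per_activity: {activity: (n_frames, [per-video JSD scores])}."""
    total = sum(n for n, _ in per_activity.values())
    return sum(n / total * np.mean(scores) for n, scores in per_activity.values())


# Toy example: one video of one activity, segment lengths in frames
score = segment_length_jsd([120, 80, 300], [100, 90, 310])
print(weighted_jsd({"cereals": (500, [score])}))
```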
- Breakfast [1]
The features and annotations of the Breakfast dataset can be downloaded from the features, ground-truth and mapping links, as in [5].
- YouTube INRIA Instructional (YTI) [2]
- IKEA ASM [3]
Link for the features coming soon.
The data folder should be arranged in the following way:
data
|--breakfast
| `--features
| `--cereals
| `--P03_cam01_P03_cereals.txt
| `...
| `--coffee
| `--friedegg
| `...
| `--groundTruth
| `--P03_cam01_P03_cereals
| `...
| `--mapping
| `--mapping.txt
|
|--YTI
| `...
|
|--IKEA
| `...
To create the conda environment run the following command:
conda env create --name hvq --file environment.yml
source activate hvq

Then run pip install -e . to avoid a "No module named hvq" error.
After activating the conda environment, just run the training file for the chosen dataset, e.g. for breakfast:
python ./BF_utils/bf_train.py

To run and evaluate at every epoch with FIFA [4] decoding and Hungarian matching, set opt.epochs=20 and opt.vqt_epochs=1. With the default combination of parameters, opt.epochs=1 and opt.vqt_epochs=20, the model is trained and then evaluated once at the end.
For more stable predictions across epochs, it is suggested to set opt.use_cls=True. With this option, a classifier is trained on the embeddings produced by the HVQ model, using the pseudo-labels as ground truth. The paper's results are produced WITHOUT this option.
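As a purely hypothetical sketch of how these options might be set, the snippet below uses the option names mentioned above (opt.epochs, opt.vqt_epochs, opt.use_cls); the import path is an assumption based on the CTE codebase this work builds on, so adapt it to the actual module layout once the code is released.

```python
# Hypothetical sketch only: the import path is an assumption, not the released API.
from hvq.utils.arg_pars import opt  # assumed location of the shared options object

# Evaluate at every epoch with FIFA decoding and Hungarian matching:
opt.epochs = 20
opt.vqt_epochs = 1

# Default combination (train, then evaluate once at the end):
# opt.epochs = 1
# opt.vqt_epochs = 20

# Optional classifier on top of the HVQ embeddings (not used for the paper's results):
opt.use_cls = True
```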
In our code we made use of the following repositories: MS-TCN, CTE and VQ. We sincerely thank the authors for their codebases!
Segmentation results for a sample from the Breakfast dataset (P22 friedegg). HVQ delivers highly consistent results across multiple videos (V1, V2, V3, V4) recorded from different cameras, but with the same ground truth.
[1] Kuehne, H.; Arslan, A.; and Serre, T. The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities. In CVPR, 2014.
[2] Alayrac, J.-B.; Bojanowski, P.; Agrawal, N.; Sivic, J.; Laptev, I.; and Lacoste-Julien, S. Unsupervised Learning From Narrated Instruction Videos. In CVPR, 2016.
[3] Ben-Shabat, Y.; Yu, X.; Saleh, F.; Campbell, D.; Rodriguez-Opazo, C.; Li, H.; and Gould, S. The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose. In WACV, 2021.
[4] Souri, Y.; Farha, Y. A.; Despinoy, F.; Francesca, G.; and Gall, J. FIFA: Fast Inference Approximation for Action Segmentation. In GCPR, 2021.
[5] Kukleva, A.; Kuehne, H.; Sener, F.; and Gall, J. Unsupervised Learning of Action Classes with Continuous Temporal Embedding. In CVPR, 2019.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.



