
Hard Labels In! Rethinking the Role of Hard Labels in Mitigating Local Semantic Drift

Jiacheng Cui, Bingkui Tong, Xinyue Bi, Xiaohan Zhao, Jiacheng Liu, Zhiqiang Shen

Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE   

Corresponding Author


📰 News

  • 🚀 [2025-] We’re thrilled to release HALD v1.0.0. Everything is ready for plug-and-play reproducibility and benchmarking. 👉 Grab it here

🚀 Overview

Soft labels generated by teacher models have become a dominant paradigm for knowledge transfer and for recent large-scale dataset distillation methods such as SRe2L, RDED, and LPLD, offering richer supervision than conventional hard labels. However, we observe that when only a limited number of crops per image are used, soft labels are prone to local semantic drift: a crop may visually resemble another class, causing its soft embedding to deviate from the ground-truth semantics of the original image. This mismatch between local visual content and global semantic meaning introduces systematic errors and distribution misalignment between training and testing. In this work, we revisit the overlooked role of hard labels and show that, when appropriately integrated, they provide a powerful content-agnostic anchor to calibrate semantic drift. We theoretically characterize the emergence of drift under limited soft-label supervision and demonstrate that hybridizing soft and hard labels restores alignment between visual content and semantic supervision. Building on this insight, we propose a new training paradigm, Hard Label for Alleviating Local Semantic Drift (HALD), which leverages hard labels as intermediate corrective signals while retaining the fine-grained advantages of soft labels. Extensive experiments on dataset distillation and large-scale conventional classification benchmarks validate our approach, showing consistent improvements in generalization. On ImageNet-1K, we achieve 41.8% accuracy with only 285 MB of storage for soft labels, outperforming the prior state-of-the-art LPLD by 8.1%. Our findings re-establish the importance of hard labels as a complementary tool and call for a rethinking of their role in soft-label–dominated training.
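To make the drift intuition concrete, here is a toy numeric sketch (the class indices, probabilities, and mixing weight alpha are made up for illustration and are not the paper's calibration rule): a crop's teacher soft label peaks on the wrong class, and blending in the one-hot hard label of the source image pulls the supervision target back to the ground-truth class.

import numpy as np

# Toy 4-class example: the source image is class 0, but the crop shows mostly
# background, so the teacher's soft label drifts toward class 2.
soft_label = np.array([0.25, 0.05, 0.60, 0.10])   # drifted: argmax is class 2
hard_label = np.eye(4)[0]                          # one-hot anchor for class 0

alpha = 0.5                                        # illustrative mixing weight
calibrated = alpha * hard_label + (1 - alpha) * soft_label

print(soft_label.argmax())    # 2 -> crop-level supervision contradicts the image label
print(calibrated.argmax())    # 0 -> the hard label restores the image-level semantics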

⭐ Contributions

We revisit the role of hard labels in dataset distillation and introduce a hybrid training paradigm, Hard Label for Alleviating Local Semantic Drift (HALD).
Our core idea is to strategically combine hard and soft labels:

  • Hard labels recalibrate the semantic space of image crops
  • Soft labels preserve fine-grained, nuanced supervision

From a theoretical perspective, we show that using only a limited number of soft labels inevitably induces local semantic drift. We then mathematically demonstrate how integrating hard labels effectively counteracts this drift; a minimal loss sketch follows the checklist below. Extensive experiments across multiple benchmarks validate that HALD:

✅ Reduces distribution mismatch
✅ Improves generalization
✅ Remains robust even under aggressive soft-label compression
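As a rough illustration of what hybrid hard/soft supervision can look like in PyTorch, the sketch below combines a cross-entropy term on the image-level hard label with a KL term on the crop-level soft label. This is an assumption-laden reading of the idea, not the repository's training code: the function name hybrid_label_loss and the fixed weight lam are hypothetical, whereas HALD uses hard labels as intermediate corrective signals during training.

import torch.nn.functional as F

def hybrid_label_loss(logits, soft_label, hard_label, lam=0.5):
    # logits:     student predictions for a crop, shape (batch, classes)
    # soft_label: teacher probabilities for the same crop, shape (batch, classes)
    # hard_label: ground-truth class indices of the source image, shape (batch,)
    # lam:        hypothetical mixing weight, not HALD's actual schedule
    log_probs = F.log_softmax(logits, dim=1)
    soft_term = F.kl_div(log_probs, soft_label, reduction="batchmean")  # fine-grained soft supervision
    hard_term = F.cross_entropy(logits, hard_label)                     # content-agnostic hard anchor
    return lam * hard_term + (1.0 - lam) * soft_term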

🛠️ Installation Guide

Follow these steps to set up the environment and access necessary resources:

📦 Step 1: Clone the Repository

git clone https://github.com/Jiacheng8/HALD.git
cd HALD

🧪 Step 2: Set Up Conda Environment

Create and activate the PyTorch environment using the provided configuration:

conda env create -f environment.yml
conda activate HALD

📁 Step 3: Download Required Files

Some files must be downloaded manually from Google Drive. After downloading, update the corresponding paths in the config.sh file.

📌 Reminder: Don’t forget to update config.sh!
Set Main_Data_Path to the folder where your generated data is saved; this is required for the data to load correctly. Place the pretrained models under Main_Data_Path. The patches and the validation set can live anywhere, as long as their paths in config.sh are updated accordingly.

🚀 Running Experiments

All experiment pipelines are modularized into the following stages:

  1. 🏷️ relabel/ — Generate the corresponding soft labels
  2. 📊 validate/ — Evaluate the performance of the distilled dataset

📚 Bibliography

If you find this repository helpful for your research or project, please consider citing our work:

🔖 Citation (BibTeX):

@article{cui2025hard,
  title={Hard Labels In! Rethinking the Role of Hard Labels in Mitigating Local Semantic Drift},
  author={Cui, Jiacheng and Tong, Bingkui and Bi, Xinyue and Zhao, Xiaohan and Liu, Jiacheng and Shen, Zhiqiang},
  journal={arXiv preprint arXiv:2512.15647},
  year={2025}
}

📌 Your citation helps support and acknowledge our research contributions to the dataset distillation community.
