WassersteinSSL: A New Uniformity Metric for Self-Supervised Learning

Rethinking the Uniformity Metric in Self-Supervised Learning

Overview

This repository contains four components. First, the "Distribution Approximation" folder visualizes the asymptotic equivalence between a uniform spherical distribution and an isotropic Gaussian distribution. Second, the "Empirical Study" folder presents an empirical analysis covering dimensional collapse degrees, sensitivity to dimensions, the Feature Baby Constraint, the Feature Cloning Constraint, and the Instance Cloning Constraint. Third, the "Large Means" folder illustrates how large means can lead to severe representation collapse. Lastly, the "code" folder integrates the Wasserstein distance $\mathcal{W}_{2}$ as an additional loss term into various self-supervised learning methods such as BYOL, BarlowTwins, and MoCo v2, improving performance on downstream tasks. This package reproduces the empirical results reported in Table 2 of the paper.

Distribution Approximation

To illustrate the asymptotic equivalence between a uniform spherical distribution (where $Y_i$ denotes the $i$-th coordinate) and an isotropic Gaussian distribution ($\hat{Y}_i \sim \mathcal{N}(0, 1/m)$), we begin by randomly sampling data points and estimating their distributions by running either of:

```bash
python ./DistributionApproximation/Density1DPlot.py
python ./DistributionApproximation/Density2DPlot.py
```

Then, we draw the figures by running the Jupyter notebooks `Density1DPlot.ipynb` or `Density2DPlot.ipynb`.

Using the estimated distributions, we visualize $Y_i$ and $\hat{Y}_i$ across dimensions $m \in \{2, 4, 8, 16, 32, 64, 128, 256\}$.
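For intuition, the construction behind these plots is easy to reproduce. Below is a minimal sketch (our illustration, not the repository script): uniform spherical samples are obtained by normalizing isotropic Gaussian draws, and one coordinate's binned density is compared against samples from $\mathcal{N}(0, 1/m)$.

```python
# Minimal sketch (not the repository script): compare the marginal density of
# one coordinate of a uniform spherical sample against N(0, 1/m).
import numpy as np

m, n = 32, 100_000
rng = np.random.default_rng(0)

# Uniform on the unit sphere S^{m-1}: normalize isotropic Gaussian draws.
X = rng.standard_normal((n, m))
Y = X / np.linalg.norm(X, axis=1, keepdims=True)

# Isotropic Gaussian with matching per-coordinate variance 1/m.
Y_hat = rng.standard_normal((n, m)) / np.sqrt(m)

# Binned (histogram) densities of the first coordinate.
bins = np.linspace(-1.0, 1.0, 101)
p, _ = np.histogram(Y[:, 0], bins=bins, density=True)
q, _ = np.histogram(Y_hat[:, 0], bins=bins, density=True)
print("max marginal density gap:", np.abs(p - q).max())
```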

We also analyze the joint binning densities and present 2D joint binning densities of $(Y_i, Y_j)$ ($i \neq j$) and $(\hat{Y}_i, \hat{Y}_j)$ ($i \neq j$). Even when $m$ is relatively small (e.g., $m = 32$), the densities of the two distributions are already close.
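Under the same assumptions as the sketch above, the 2D joint comparison can be sketched with a joint histogram:

```python
# Minimal sketch: joint binned densities of (Y_i, Y_j) versus (Y_hat_i, Y_hat_j).
import numpy as np

m, n = 32, 500_000
rng = np.random.default_rng(0)

X = rng.standard_normal((n, m))
Y = X / np.linalg.norm(X, axis=1, keepdims=True)   # uniform on S^{m-1}
Y_hat = rng.standard_normal((n, m)) / np.sqrt(m)   # coordinates ~ N(0, 1/m)

edges = np.linspace(-0.8, 0.8, 41)
P, _, _ = np.histogram2d(Y[:, 0], Y[:, 1], bins=(edges, edges), density=True)
Q, _, _ = np.histogram2d(Y_hat[:, 0], Y_hat[:, 1], bins=(edges, edges), density=True)
print("max joint density gap:", np.abs(P - Q).max())
```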

Empirical Study

We empirically compare our proposed uniformity metric $-\mathcal{W}_2$ with the baseline uniformity metric $-\mathcal{L_U}$ from five different perspectives.

On Dimensional Collapse Degrees

To generate data reflecting varying degrees of dimensional collapse, we sample data vectors from an isotropic Gaussian distribution, normalize them to unit $\ell_2$ norm, and then zero out a proportion $\eta$ of the coordinates by running:

```bash
python ./EmpiricalStudy/DimensionalCollapseDegrees/AnalysisOnCollapseLevel.py
```

Then we draw the figures with `AnalysisOnCollapseLevel.ipynb`, which visualizes the results.
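For intuition, here is a minimal self-contained sketch of the same experiment. The estimators are our reading of the two metrics: $-\mathcal{L_U}$ with temperature $t = 2$ following Wang and Isola, and $-\mathcal{W}_2$ via the closed-form Wasserstein distance between a Gaussian fit of the embeddings and $\mathcal{N}(\mathbf{0}, \mathbf{I}_m/m)$; consult the script for the authoritative settings.

```python
# Minimal sketch of the dimensional-collapse experiment (our illustration;
# see AnalysisOnCollapseLevel.py for the authoritative settings).
import numpy as np

def neg_lu(Z, t=2.0):
    """Baseline uniformity -L_U: negative log mean pairwise Gaussian potential."""
    sq_norms = (Z ** 2).sum(axis=1)
    sq = sq_norms[:, None] + sq_norms[None, :] - 2.0 * Z @ Z.T
    iu = np.triu_indices(len(Z), k=1)
    return -np.log(np.exp(-t * np.clip(sq[iu], 0.0, None)).mean())

def neg_w2(Z):
    """Proposed uniformity -W2: Wasserstein distance between a Gaussian fit of Z
    and N(0, I_m / m), the Gaussian proxy for the uniform spherical distribution."""
    m = Z.shape[1]
    mu, cov = Z.mean(axis=0), np.cov(Z, rowvar=False)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    w2_sq = mu @ mu + 1.0 + eig.sum() - 2.0 / np.sqrt(m) * np.sqrt(eig).sum()
    return -np.sqrt(max(w2_sq, 0.0))

rng = np.random.default_rng(0)
n, m = 2048, 256
for eta in (0.0, 0.25, 0.5, 0.75):
    Z = rng.standard_normal((n, m))
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)  # unit l2 norm
    Z[:, : int(eta * m)] = 0.0                     # zero out a proportion eta of dims
    print(f"eta={eta:.2f}  -W2={neg_w2(Z):.3f}  -L_U={neg_lu(Z):.3f}")
```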

On Sensitivity to Dimensions

We also analyze the sensitivity of the metrics to the embedding dimension. We generate data points and draw the figures with:

```bash
python ./EmpiricalStudy/Dimensions/AnalysisOnDimension.py
```

followed by `AnalysisOnDimension.ipynb`. The results are visualized in the generated figures.

On Feature Cloning Constraint

We generate data points and draw the figures with:

```bash
python ./EmpiricalStudy/FeatureCloningConstraint/AnalysisOnProperty4.py
```

followed by `AnalysisOnProperty4.ipynb`. The results are visualized in the generated figures; a sketch of all three constraint transformations appears after the Instance Cloning Constraint subsection below.

On Feature Baby Constraint

We generate data points and draw the figures with:

```bash
python ./EmpiricalStudy/FeatureBabyConstraint/AnalysisOnProperty5.py
```

followed by `AnalysisOnProperty5.ipynb`. The results are visualized in the generated figures.

On Instance Cloning Constraint

We generate data points and draw the figures with:

```bash
python ./EmpiricalStudy/InstanceCloningConstraint/AnalysisOnProperty3.py
```

followed by `AnalysisOnProperty3.ipynb`. The results are visualized in the generated figures.
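The three constraints above amount to simple, deterministic transformations of an embedding matrix. The sketch below is our construction for illustration (the $1/\sqrt{2}$ rescaling in `clone_features` is our choice to keep rows on the unit sphere); the repository scripts define the authoritative setups.

```python
# Our illustration of the three constraint transformations; the repository
# scripts (AnalysisOnProperty3/4/5.py) define the authoritative setups.
import numpy as np

def clone_instances(Z):
    """Instance Cloning Constraint: duplicate every sample."""
    return np.concatenate([Z, Z], axis=0)

def clone_features(Z):
    """Feature Cloning Constraint: duplicate every feature dimension
    (the 1/sqrt(2) rescaling, our choice, keeps rows on the unit sphere)."""
    return np.concatenate([Z, Z], axis=1) / np.sqrt(2.0)

def add_baby_features(Z, k=1):
    """Feature Baby Constraint: append k all-zero feature dimensions."""
    return np.concatenate([Z, np.zeros((len(Z), k))], axis=1)

rng = np.random.default_rng(0)
Z = rng.standard_normal((512, 64))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)

# How a uniformity metric should respond to each transformation is exactly
# what the paper's properties formalize; this sketch only builds the inputs.
for f in (clone_instances, clone_features, add_baby_features):
    print(f.__name__, f(Z).shape)
```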

Large Means

To investigate the influence of the mean on uniformity, we consider $\mathbf{X}\in \mathbb{R}^2$ following a Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I}_2)$ and set $\mathbf{Y} = \mathbf{X} + k\cdot \mathbf{1}$, so that $\mathbf{Y} \sim \mathcal{N}(k \cdot\mathbf{1}, \mathbf{I}_2)$, where $\mathbf{1} \in \mathbb{R}^2$ is the all-ones vector. Varying $k$ from $0$ to $32$, we generate $\mathbf{Y}$ and draw the figures with:

```bash
python ./LargeMeans/PlotMean2D.py
```

The resulting figures show that an excessively large mean causes the ($\ell_2$-normalized) representations to collapse toward a single point, even though the covariance matrix is isotropic.
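To see the collapse numerically rather than visually, here is a minimal sketch (ours, with the $\ell_2$ normalization step made explicit): as $k$ grows, the normalized points concentrate and the mean pairwise cosine similarity approaches 1.

```python
# Minimal sketch of the large-mean effect: shift an isotropic 2D Gaussian by
# k * 1, project onto the unit circle, and watch the points concentrate.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2048, 2))                    # X ~ N(0, I_2)

for k in (0, 1, 4, 32):
    Y = X + k                                         # Y ~ N(k * 1, I_2)
    Z = Y / np.linalg.norm(Y, axis=1, keepdims=True)  # project onto the unit circle
    cos = Z @ Z.T                                     # pairwise cosine similarities
    iu = np.triu_indices(len(Z), k=1)
    print(f"k = {k:2d}: mean pairwise cosine = {cos[iu].mean():.3f}")
```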

Code

In the "code" folder, we integrate our proposed uniformity loss $\mathcal{W}_{2}$ as an additional loss term into existing self-supervised learning methods such as BYOL, BarlowTwins, and MoCo v2.
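As a rough sketch of how such an auxiliary term can be computed in PyTorch (our illustration; the function name, the weighting, and the eigendecomposition route are our choices, and the scripts under ./code are authoritative):

```python
# A rough PyTorch sketch of W2 as an auxiliary loss term. The function name,
# the weighting, and the eigendecomposition route are our choices; see the
# scripts under ./code for the authoritative integration.
import torch
import torch.nn.functional as F

def w2_uniformity_loss(z: torch.Tensor) -> torch.Tensor:
    """W2 distance between a Gaussian fit of the embeddings z (n x m) and
    N(0, I_m / m), the Gaussian proxy for the uniform spherical distribution."""
    z = F.normalize(z, dim=1)                  # work on the unit sphere
    n, m = z.shape
    mu = z.mean(dim=0)
    zc = z - mu
    cov = zc.T @ zc / (n - 1)
    eig = torch.linalg.eigvalsh(cov).clamp_min(0)
    w2_sq = mu @ mu + 1.0 + eig.sum() - 2.0 / m ** 0.5 * eig.sqrt().sum()
    return w2_sq.clamp_min(0).sqrt()

# Hypothetical usage inside a training step, with z1, z2 the two views'
# projections and lambda_w2 a tunable weight:
#   loss = ssl_loss + lambda_w2 * (w2_uniformity_loss(z1) + w2_uniformity_loss(z2))
```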

BYOL

Train and evaluate on either CIFAR-10 or CIFAR-100 without our proposed uniformity loss $\mathcal{W}_{2}$:

```bash
bash ./code/BYOL/run_vanilla_byol.sh
```

Train and evaluate with our proposed uniformity loss $\mathcal{W}_{2}$ incorporated:

```bash
bash ./code/BYOL/run_byol+w2.sh
```

BarlowTwins

Train and evaluate on either CIFAR-10 or CIFAR-100 without our proposed uniformity loss $\mathcal{W}_{2}$:

```bash
bash ./code/BarlowTwins/run_vanilla_barlowtwins.sh
```

Train and evaluate with our proposed uniformity loss $\mathcal{W}_{2}$ incorporated:

```bash
bash ./code/BarlowTwins/run_barlowtwins+w2.sh
```

MoCo v2

Train and evaluate on either CIFAR-10 or CIFAR-100 without our proposed uniformity loss $\mathcal{W}_{2}$:

```bash
bash ./code/MoCov2/run_vanilla_moco.sh
```

Train and evaluate with our proposed uniformity loss $\mathcal{W}_{2}$ incorporated:

```bash
bash ./code/MoCov2/run_moco+w2.sh
```

We report the experimental results in the following table, where ↑/↓ mark improvement/degradation relative to the corresponding vanilla baseline and subscripts give the magnitude:

| Methods | Proj. | Pred. | CIFAR-10 Acc@1↑ | CIFAR-10 Acc@5↑ | CIFAR-10 $\mathcal{W}_{2}$ | CIFAR-10 $\mathcal{L_U}$ | CIFAR-10 $\mathcal{A}$ | CIFAR-100 Acc@1↑ | CIFAR-100 Acc@5↑ | CIFAR-100 $\mathcal{W}_{2}$ | CIFAR-100 $\mathcal{L_U}$ | CIFAR-100 $\mathcal{A}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MoCo v2 | 256 | – | 90.65 | 99.81 | 1.06 | -3.75 | 0.51 | 60.27 | 86.29 | 1.07 | -3.60 | 0.46 |
| MoCo v2 + $\mathcal{L_U}$ | 256 | – | 90.98 ↑₀.₃₃ | 99.67 | 0.98 ↑₀.₀₈ | -3.82 | 0.53 ↓₀.₀₂ | 61.21 ↑₀.₉₄ | 87.32 | 0.98 ↑₀.₀₉ | -3.81 | 0.52 ↓₀.₀₆ |
| MoCo v2 + $\mathcal{W}_{2}$ | 256 | – | 91.41 ↑₀.₇₆ | 99.68 | 0.33 ↑₀.₇₃ | -3.84 | 0.63 ↓₀.₁₂ | 63.68 ↑₃.₄₁ | 88.48 | 0.28 ↑₀.₇₉ | -3.86 | 0.66 ↓₀.₂₀ |
| BYOL | 256 | 256 | 89.53 | 99.71 | 1.21 | -2.99 | 0.31 | 63.66 | 88.81 | 1.20 | -2.87 | 0.33 |
| BYOL + $\mathcal{L_U}$ | 256 | 256 | 90.09 ↑₀.₅₆ | 99.75 | 1.09 ↑₀.₁₂ | -3.66 | 0.40 ↓₀.₀₉ | 62.68 ↓₀.₉₈ | 88.44 | 1.08 ↑₀.₁₂ | -3.70 | 0.51 ↓₀.₁₈ |
| BYOL + $\mathcal{W}_{2}$ | 256 | 256 | 90.31 ↑₀.₇₈ | 99.77 | 0.38 ↑₀.₈₃ | -3.90 | 0.65 ↓₀.₃₄ | 65.16 ↑₁.₅₀ | 89.25 | 0.36 ↑₀.₈₄ | -3.91 | 0.69 ↓₀.₃₆ |
| BarlowTwins | 256 | – | 91.16 | 99.80 | 0.22 | -3.91 | 0.75 | 68.19 | 90.64 | 0.23 | -3.91 | 0.75 |
| BarlowTwins + $\mathcal{L_U}$ | 256 | – | 91.38 ↑₀.₂₂ | 99.77 | 0.21 ↑₀.₀₁ | -3.92 | 0.76 ↓₀.₀₁ | 68.41 ↑₀.₂₂ | 90.99 | 0.22 ↑₀.₀₁ | -3.91 | 0.76 ↓₀.₀₁ |
| BarlowTwins + $\mathcal{W}_{2}$ | 256 | – | 91.43 ↑₀.₂₇ | 99.78 | 0.19 ↑₀.₀₃ | -3.92 | 0.76 ↓₀.₀₁ | 68.47 ↑₀.₂₈ | 90.64 | 0.19 ↑₀.₀₄ | -3.91 | 0.79 ↓₀.₀₄ |
| Zero-CL | 256 | – | 91.35 | 99.74 | 0.15 | -3.94 | 0.70 | 68.50 | 90.97 | 0.15 | -3.93 | 0.75 |
| Zero-CL + $\mathcal{L_U}$ | 256 | – | 91.28 ↓₀.₀₇ | 99.74 | 0.15 | -3.94 | 0.72 ↓₀.₀₂ | 68.44 ↓₀.₀₆ | 90.91 | 0.15 | -3.93 | 0.74 ↑₀.₀₁ |
| Zero-CL + $\mathcal{W}_{2}$ | 256 | – | 91.42 ↑₀.₀₇ | 99.82 | 0.14 ↑₀.₀₁ | -3.94 | 0.71 ↓₀.₀₁ | 68.55 ↑₀.₀₅ | 91.02 | 0.14 ↑₀.₀₁ | -3.94 | 0.76 ↓₀.₀₁ |

Citation

If our paper assists your research, feel free to give us a star or cite us using:

```bibtex
@inproceedings{Fang2024RethinkingTU,
  title     = {Rethinking the Uniformity Metric in Self-Supervised Learning},
  author    = {Xianghong Fang and Jian Li and Qiang Sun and Benyou Wang},
  booktitle = {ICLR},
  year      = {2024}
}
```
