# Scalable Clustering: Large Scale Unsupervised Learning of Gaussian Mixture Models with Outliers [Paper]
Clustering is a widely used technique with a long and rich history in a variety of areas. However, most existing algorithms do not scale well to large datasets, or lack theoretical guarantees of convergence. We introduce a provably robust clustering algorithm based on loss minimization that performs well on Gaussian mixture models with outliers. We also provide theoretical guarantees that the algorithm obtains high accuracy with high probability under certain assumptions. Moreover, it can also be used as an initialization strategy for $k$-means clustering.
A Gaussian mixture model with outliers is a weighted sum of $m$ Gaussian component densities together with an outlier distribution,

$$f(x) = w_0 f_0(x) + \sum_{i=1}^{m} w_i f_i(x), \qquad \sum_{i=0}^{m} w_i = 1, \quad w_i \ge 0.$$

Each component density is a multivariate Gaussian,

$$f_i(x) = \frac{1}{(2\pi)^{p/2} \lvert \Sigma_i \rvert^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i) \right), \qquad i = 1, \dots, m.$$

The distribution of outliers is given by the density $f_0$, e.g., a uniform distribution over a bounded region containing the inliers.
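As a concrete illustration, here is a minimal sketch of sampling $N$ points from such a mixture; the weights, the shared spherical covariance, and the uniform outlier box are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters: m clusters in p dimensions, with w[0] the outlier weight.
N, p, m = 1000, 2, 3
w = np.array([0.1, 0.3, 0.3, 0.3])        # mixture weights, sum to 1
mus = rng.normal(scale=5.0, size=(m, p))  # component means mu_i
sigma = 1.0                               # shared spherical covariance sigma^2 I

# Draw each point's source: 0 = outlier, 1..m = Gaussian components.
source = rng.choice(m + 1, size=N, p=w)
X = np.empty((N, p))
for j in range(N):
    if source[j] == 0:
        X[j] = rng.uniform(-10.0, 10.0, size=p)       # outlier: uniform on a box
    else:
        X[j] = rng.normal(mus[source[j] - 1], sigma)  # inlier: N(mu_i, sigma^2 I)
```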
See `Feature_Extraction` for more details.
| | MNIST | CIFAR-10 | CIFAR-100 | ImageNet validation | ImageNet train |
| --- | --- | --- | --- | --- | --- |
| $p$ (dimension) | 512 | 4096 | 4096 | 640 | 640 |
| $N$ (number of observations) | 60000 | 50000 | 50000 | 50000 | 1281167 |
| $m$ (number of clusters) | 10 | 10 | 100 | 1000 | 1000 |
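Deep features like those in the table can be obtained with a pretrained network. Below is a minimal sketch for the MNIST row, assuming a pretrained ResNet-18 whose 512-dimensional penultimate-layer output matches $p = 512$; the model choice and preprocessing are illustrative assumptions, and the repository's actual pipeline is in `Feature_Extraction`.

```python
import torch
import torchvision
from torchvision import transforms

# Illustrative sketch, not the repository's exact pipeline: extract 512-dim
# penultimate-layer features for MNIST with a pretrained ResNet-18.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Identity()  # drop the classifier head, keep 512-dim features
model.eval()

preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # MNIST is single-channel
    transforms.Resize(224),                       # match the ImageNet input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
dataset = torchvision.datasets.MNIST("data", train=True, download=True,
                                     transform=preprocess)
loader = torch.utils.data.DataLoader(dataset, batch_size=256)

chunks = []
with torch.no_grad():
    for x, _ in loader:
        chunks.append(model(x))
features = torch.cat(chunks)  # shape: (60000, 512), i.e. (N, p) for MNIST
```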
See `test_bounds` for more details.
- $k$-means++
- Robust $k$-means++
- Spectral Clustering (SC)
- Tensor Decomposition (TD)
- Expectation Maximization (EM)
- Complete Linkage Clustering (CL)
- $t$-Distributed Stochastic Neighbor Embedding ($t$-SNE)
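Several of these baselines have standard implementations in scikit-learn (robust $k$-means++ and tensor decomposition do not). A minimal sketch of running the scikit-learn ones follows; the placeholder data and the cluster count are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))  # placeholder features; substitute the extracted ones
m = 10                          # illustrative cluster count

# k-means++ (scikit-learn's default initialization for KMeans)
kmeans_labels = KMeans(n_clusters=m, init="k-means++", n_init=10).fit_predict(X)

# Spectral Clustering (SC)
sc_labels = SpectralClustering(n_clusters=m,
                               affinity="nearest_neighbors").fit_predict(X)

# Expectation Maximization (EM) for a Gaussian mixture
em_labels = GaussianMixture(n_components=m).fit_predict(X)

# Complete Linkage Clustering (CL)
cl_labels = AgglomerativeClustering(n_clusters=m,
                                    linkage="complete").fit_predict(X)

# t-SNE (a 2-D embedding for visualization rather than a clustering method)
embedding = TSNE(n_components=2).fit_transform(X)
```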