How to Use K-means for Big Data Clustering?

Mussabayev, Ravil

Files

Mussabayev_washington_0250E_26648.pdf (4.41 MB)

Date

2024-09-09

Abstract

K-means plays a vital role in data mining as the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model. However, its performance drops drastically when it is applied to vast amounts of data. It is therefore crucial to scale K-means to big data while using as little as possible of three computational resources: data, time, and algorithmic ingredients. We introduce a novel parallel scheme that leverages the K-means and K-means++ algorithms for big data clustering, offering a "true big data" algorithm that excels in both solution quality and runtime and surpasses classical and recent state-of-the-art MSSC approaches. The new approach naturally implements global search by decomposing the MSSC problem, without using additional metaheuristics. At the same time, the approach can be generalized to a novel metaheuristic, providing fresh perspectives for creating new, powerful optimization heuristics. This work shows that data decomposition is the basic approach to solving the big data clustering problem. The empirical success of the new algorithm and its derivatives allows us to challenge the common belief that more data is required to obtain a good clustering solution. Moreover, the present work questions the established trend that more sophisticated hybrid approaches and algorithms are required to obtain a better clustering solution.
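
The abstract does not spell out the algorithm, so the following is only a minimal illustrative sketch of the general idea of data decomposition, not the dissertation's method. It assumes that each small random chunk is clustered independently with K-means++ seeding followed by standard K-means, and that the candidate with the lowest MSSC objective on a shared validation sample is kept. All function names and parameters (`chunked_kmeans`, `chunk_size`, `n_chunks`) are hypothetical, and the loop runs sequentially here although the chunks are independent and could be processed in parallel.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two points given as tuples.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mssc_objective(points, centers):
    # MSSC objective: sum of squared distances to the nearest center.
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def kmeans_pp_init(points, k, rng):
    # K-means++ seeding: each new center is drawn with probability
    # proportional to its squared distance from the nearest chosen center.
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:  # all points coincide with existing centers
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

def kmeans(points, centers, max_iter=50):
    # Standard Lloyd iterations; an empty cluster keeps its old center.
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[j].append(p)
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def chunked_kmeans(data, k, chunk_size, n_chunks, seed=0):
    # Hypothetical driver: cluster several small random chunks and keep
    # the centers that score best on one shared validation sample.
    rng = random.Random(seed)
    size = min(chunk_size, len(data))
    validation = rng.sample(data, size)
    best_centers, best_obj = None, float("inf")
    for _ in range(n_chunks):
        chunk = rng.sample(data, size)
        centers = kmeans(chunk, kmeans_pp_init(chunk, k, rng))
        obj = mssc_objective(validation, centers)
        if obj < best_obj:
            best_centers, best_obj = centers, obj
    return best_centers
```

In this sketch, no single run ever touches the full dataset: each K-means call sees only `chunk_size` points, which is one simple way to read the claim that a good solution need not require more data.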