How to Use K-means for Big Data Clustering?

Mussabayev, Ravil

dc.contributor.advisor	Uhlmann, Gunther
dc.contributor.author	Mussabayev, Ravil
dc.date.accessioned	2024-09-09T23:12:41Z
dc.date.issued	2024-09-09
dc.date.submitted	2024
dc.description	Thesis (Ph.D.)--University of Washington, 2024
dc.description.abstract	K-means plays a vital role in data mining, being the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model. However, its performance drops drastically when applied to vast amounts of data. Therefore, it is crucial to improve K-means by scaling it to big data using as few of the following computational resources as possible: data, time, and algorithmic ingredients. We introduce a novel parallel scheme that leverages the K-means and K-means++ algorithms for big data clustering, offering a "true big data" algorithm that excels in both solution quality and runtime, surpassing classical and recent state-of-the-art MSSC approaches. The new approach naturally implements global search by decomposing the MSSC problem without using additional metaheuristics. On the other hand, this approach can be generalized to a novel metaheuristic, providing fresh perspectives for creating new powerful optimization heuristics. This work shows that data decomposition is the basic approach to solving the big data clustering problem. The empirical success of the new algorithm and its derivatives allowed us to challenge the common belief that more data is required to obtain a good clustering solution. Moreover, the present work questions the established trend that more sophisticated hybrid approaches and algorithms are required to obtain a better clustering solution.
dc.embargo.lift	2025-09-09T23:12:41Z
dc.embargo.terms	Delay release for 1 year -- then make Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	Mussabayev_washington_0250E_26648.pdf
dc.identifier.uri	https://hdl.handle.net/1773/52103
dc.language.iso	en_US
dc.rights	CC BY-NC-SA
dc.subject	Big Data
dc.subject	Clustering
dc.subject	Global Optimization
dc.subject	K-means
dc.subject	Minimum Sum-of-Squares Clustering (MSSC)
dc.subject	Parallel Computing
dc.subject	Mathematics
dc.subject	Computer science
dc.subject	Artificial intelligence
dc.subject.other	Mathematics
dc.title	How to Use K-means for Big Data Clustering?
dc.type	Thesis

Files

Original bundle

Name: Mussabayev_washington_0250E_26648.pdf
Size: 4.41 MB
Format: Adobe Portable Document Format

Collections

Mathematics