Evolutionary clustering using classifiers: definition, implementation, scalability, and applications
Export citation
Abstract
Clustering is a Machine Learning tool for partitioning multi-dimensional data automatically into mutually exclusive groups, aiming to reflect the patterns of the phenomena it represents. Clustering algorithms perform this task conditioned by the clustering criterion modeled in its objective function. However, selecting the optimal criterion is a domain-dependent task that requires information on the cluster structure that a user often does not count on due to the unsupervised nature of the technique. Available approaches accentuate this problem as they perform clustering according to a similarity notion often limited to the concepts of compactness and connectedness, inducing bias and favoring clusters with certain shape, size, or density properties from using conventional distance functions. However, we cannot consider this a complete notion of a cluster because not every dataset will comply with both notions in the same proportion. Hence, research on this topic has not converged to a standard definition of a cluster, which raises the need for algorithms that produce adaptive solutions that mirror the underlying structures and relations within the data.
This thesis is focused on the design of single-objective Evolutionary Clustering Algorithms that generate solutions that are not biased towards any cluster structure by optimizing a novel generalization clustering criterion. To achieve that, we designed objective functions modeled as a supervised learning problem, considering that a good partition should induce a well-trained classifier. That is how we decided to assess the quality of a clustering solution, according to its capability to train an ensemble of classifiers. The main contribution of this thesis is our series of Evolutionary Clustering Algorithms using Classifiers (the ECAC series), which introduces the aforementioned clustering criterion along with evolutionary computation. This meta-heuristic allows us to model distinct criteria to optimize while creating and evaluating multiple solutions along the process. The experimental results in the design of our family of methods ECAC, F1-ECAC, and ECAC-S, show an increase in similarity between the partitions created by our algorithms and the ground truth labels (obtained from the publicly available repositories where we retrieved the data) with a maximum Adjusted RAND Index of 0.96. Our second algorithm, F1-ECAC, proved the competitiveness of our contributions against traditional, single, and multi-objective Evolutionary Clustering algorithms showing no statistically significant difference against k-means, HG-means, and MOCLE. Our latest contribution, ECAC-S, was tested on a satellite image segmentation task, and it produced segmentations with higher average Adjusted RAND Index than k-means, Spectral-clustering, Birch, and DBSCAN in 4 out of 10 images.