Tuesday, December 6, 2022

Distance-Based Algorithms for Clustering

The differences among clustering algorithms, as explained in the Data Clustering post, lie in how "similarity" is defined. In probabilistic models, the main idea is to model the data with a generative process in which a specific distribution is assumed (e.g., Gaussian); the parameters of the model are then estimated with the Expectation-Maximization (EM) algorithm. Once the parameters are found, the generative probabilities (fit probabilities) of the underlying data points are calculated.

Many different kinds of generative models can be reduced to distance-based algorithms, because the mixture components often use a distance function inside the probability distribution. For example, the Gaussian distribution expresses probabilities in terms of the Euclidean distance from the mean of the mixture component. Distance-based methods are often preferable to probabilistic models because of their simplicity and ease of implementation. They can generally be divided into two types: flat and hierarchical algorithms.

Flat methods: The data is divided into different clusters in one shot using partitioning representatives, optimizing a specific objective function (generally a distance function) by iteratively improving the quality of the partitions. In each iteration, the data points are assigned to their closest partitioning representative, and then each representative is adjusted according to the data points assigned to its cluster; in this way the quality of the partitions improves with each iteration.

  • k-Means: This algorithm employs the Euclidean distance to measure the distance between the data points and the partitioning representative of each cluster, which corresponds to the mean of the cluster. It is considered one of the simplest and most classical models for data clustering and is widely used because of its simplicity (a minimal sketch follows this list).
  • k-Medians: The operation of this algorithm is equivalent to k-Means, but the partitioning representative is the median along each dimension. This provides several advantages, such as stability in the presence of outliers and certain types of noise, because the median is less sensitive than the mean to extreme values in the data.
  • k-Medoids: In this algorithm, contrary to k-Means and k-Medians, the partitioning representative is sampled from the original data. Such techniques are useful in cases where the data points are arbitrary objects, for example when a set of discrete sequences needs to be clustered, since it may not make sense to talk about a mean or median of such objects.
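As a minimal sketch of the flat, representative-based iteration described above (assuming NumPy; the function name flat_cluster and its parameters are illustrative choices, not from any library), a single loop can implement both k-Means and k-Medians by swapping the distance and the representative update:

    import numpy as np

    def flat_cluster(X, k, representative="mean", n_iters=100, seed=0):
        """Minimal flat clustering sketch: k-Means when representative="mean",
        k-Medians (Manhattan distance, per-dimension median) when "median"."""
        rng = np.random.default_rng(seed)
        # Initialize the representatives by sampling k distinct data points.
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        for _ in range(n_iters):
            diff = X[:, None, :] - centers[None, :, :]  # shape (n, k, d)
            if representative == "median":
                dists = np.abs(diff).sum(axis=2)        # Manhattan distance
            else:
                dists = np.sqrt((diff ** 2).sum(axis=2))  # Euclidean distance
            # Assignment step: each point goes to its closest representative.
            labels = dists.argmin(axis=1)
            # Update step: recompute each representative from its members.
            new_centers = centers.copy()
            for j in range(k):
                members = X[labels == j]
                if len(members) == 0:
                    continue  # keep the old representative for empty clusters
                if representative == "median":
                    new_centers[j] = np.median(members, axis=0)
                else:
                    new_centers[j] = members.mean(axis=0)
            if np.allclose(new_centers, centers):
                break  # converged: representatives stopped moving
            centers = new_centers
        return labels, centers

Calling flat_cluster(X, k=3) on an (n, d) NumPy array returns one label per point plus the final representatives; passing representative="median" turns the same loop into k-Medians. A k-Medoids variant would instead restrict each representative to the member of the cluster that minimizes the total distance to the others.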

Hierarchical methods: The clusters are represented hierarchically through a dendrogram (a binary-tree-based data structure) at varying levels of granularity. Hierarchical algorithms were developed to overcome some disadvantages of flat models, which generally require a predefined parameter k (the number of clusters) to obtain a solution and are often nondeterministic. Hierarchical methods can be categorized into agglomerative and divisive algorithms.

  • Agglomerative: These methods start from singleton clusters (clusters with only one data object) at the bottom level and continue by merging two clusters at a time to build a bottom-up hierarchy. There are a variety of options for how clusters are merged, which provide different trade-offs between quality and efficiency, for example single-linkage, all-pairs linkage, sampled-linkage, and centroid-linkage clustering (see the sketch after this list).
  • Divisive: These work in the opposite direction to agglomerative methods: they start with all the data objects in one large cluster and repeatedly split it into two groups, producing a top-down hierarchy. Divisive algorithms are generally more efficient than agglomerative clustering, especially when there is no need to generate the complete hierarchy all the way down to individual leaves; in other words, it is not necessary to produce a perfectly balanced tree in terms of node depths.
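As a hedged illustration of the agglomerative case (assuming SciPy is available; the toy data and the cut level are arbitrary choices made here for the example), scipy.cluster.hierarchy can build the bottom-up hierarchy with single-linkage merging and then cut the dendrogram into a flat partition:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Two well-separated blobs of 2-D points (toy data).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.5, (20, 2)),
                   rng.normal(5, 0.5, (20, 2))])

    # Build the bottom-up hierarchy; method="single" is single-linkage
    # (merge the two clusters whose closest pair of points is nearest).
    Z = linkage(X, method="single", metric="euclidean")

    # Cut the dendrogram to recover a flat partition into 2 clusters.
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(labels)

Note that the number of clusters is chosen only at the cut step: the same dendrogram Z can be cut at different levels of granularity, which is precisely the advantage over flat methods mentioned above.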

