Clustering is an ‘unsupervised’ analysis which categorises your observations into groups, or ‘clusters’. There are numerous variations, but in each case some form of distance measurement determines how close or far apart observations are, so that observations which sit close together end up in the same cluster.
k-means clustering is probably the most common partitioning method for segmenting data. It requires the analyst to specify ‘k’, which is the number of distinct clusters the data will be segmented into.
This method begins by placing ‘k’ initial markers, known as cluster centroids, among the data (a common choice is to start them at randomly chosen positions). It then assigns each observation to whichever cluster centroid it is nearest to.
Usually this initial clustering won’t have the observations very evenly split around each cluster’s centroid, so the algorithm takes the mean value of all observations in each cluster and re-positions the centroid there. It then looks again to see whether any observations are now nearer to a different centroid and need moving into a different cluster. The algorithm keeps reassigning observations and re-positioning the centroids until no observation changes cluster, at which point it has settled on its best fit.
Best fit is defined as the point at which the average distance (strictly, the squared distance) from each observation to its cluster centre is at its smallest.
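The assign-then-re-position loop described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function name `k_means`, the sample `points`, and the choice to start the centroids at randomly picked observations are all assumptions made for the example.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Illustrative k-means loop; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: use k randomly chosen observations as the starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no observation changed cluster, so stop
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels

# Hypothetical data: two visibly separate groups of observations.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(labels)
```

With data this cleanly separated, the first three observations end up sharing one label and the last three the other. On messier data the outcome can depend on the starting centroids, which is why the algorithm is often run several times from different starts.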
You can draw lines of demarcation between clusters with a Voronoi diagram, which displays the area belonging to each cluster. Whichever side of a line an observation sits on determines the cluster it is assigned to.
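In code, the Voronoi areas never need to be drawn to be used: finding which area an observation falls in is the same as finding its nearest centroid. The sketch below assumes two hypothetical centroids and a helper named `nearest_centroid`, neither of which comes from a particular library.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid whose Voronoi area contains `point`."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

# Hypothetical centroids; the dividing line between them is the set of
# positions that are exactly the same distance from both.
centroids = [(1.0, 1.0), (9.0, 9.0)]

left_side = nearest_centroid((2.0, 3.0), centroids)
right_side = nearest_centroid((7.0, 8.0), centroids)
print(left_side, right_side)
```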
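k-means is often confused with the k-nearest neighbour (KNN) algorithm, but the two are different: KNN is a supervised method for classifying new cases, not a way of forming clusters. Given a known set of labelled cases, KNN finds the ‘k’ points nearest to the values of the new case, i.e. its ‘nearest neighbours’, and assigns the new case the label most common among them.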
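A minimal sketch of the k-nearest neighbour idea, assuming each known case is stored as a (point, label) pair and that the new case takes the most common label among its ‘k’ closest known cases. The function name `knn_classify` and the labels "A" and "B" are hypothetical.

```python
import math
from collections import Counter

def knn_classify(new_case, known_cases, k=3):
    """Label `new_case` by majority vote among its k nearest known cases."""
    # Sort the known cases by distance to the new case and keep the closest k.
    nearest = sorted(known_cases, key=lambda case: math.dist(new_case, case[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled cases: (point, label).
known_cases = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((1.0, 1.5), "A"),
    ((8.0, 8.0), "B"), ((9.0, 8.5), "B"), ((8.5, 9.0), "B"),
]

print(knn_classify((1.2, 1.1), known_cases, k=3))
print(knn_classify((8.7, 8.8), known_cases, k=3))
```

Note that this needs labels to exist already, which is what separates it from the clustering methods described here.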
Example: You may run an unsupervised machine learning exercise using k-means clustering on a customer base, then compare the susceptibility to marketing campaigns of the “brand loyalist” cluster of customers against that of your “value conscious” cluster.
Hierarchical clustering builds multiple levels of clusters, creating a hierarchy with a cluster tree.
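One way to build such a tree is from the bottom up: start with every observation as its own cluster and repeatedly merge the two closest clusters, recording each merge as a level of the hierarchy. The sketch below assumes this bottom-up approach and measures the gap between two clusters by their closest pair of members; both are choices made for the example, and other variants exist. The function name `build_cluster_tree` and the sample `points` are hypothetical.

```python
import math

def build_cluster_tree(points):
    """Bottom-up merging; returns a list of (cluster_a, cluster_b, distance)."""
    clusters = [[i] for i in range(len(points))]  # each observation starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical observations, referred to by index in the output.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
merges = build_cluster_tree(points)
for left, right, distance in merges:
    print(left, right, round(distance, 2))
```

Each recorded merge is one level of the tree: the close pairs join first, and the final merge joins everything into a single cluster at the top.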