A cluster analysis uses various algorithms to form groups of data points. It does not, however, assign any significance or meaning to the resulting clusters. Nor does it explain the processes behind the formation of those clusters, even when the clusters are significant. The explaining is up to the scientist.
In yesterday's post I presented the results I had obtained with artificial data. I also implied that a cluster analysis would be most successful if the scientist already suspected some sort of clustering in the data. Yes, it seems like a bit of a circular process.
One (other) problem with any type of cluster analysis is that even when no meaningful clustering is expected in a data set, the analysis is likely to come up with clusters anyway. Today I will demonstrate this by applying cluster analysis to real data.
This is a histogram of the distribution of shell lengths of 54 live Vertigo pygmaea from my backyard. These types of measurements are usually distributed normally or near-normally. There is really no requirement that they be normal, but because many independent or almost independent factors contribute in an additive fashion to the development of the measured values, the resulting distributions usually approach normality.
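The argument that many small additive factors push a measurement toward normality can be illustrated with a small simulation. This is only a sketch: the "factors" are uniform random numbers, the count of 12 factors is an arbitrary assumption, and the name `simulated_trait` is made up for illustration. None of it models real shell growth.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

def simulated_trait(n_factors=12):
    """Sum of independent uniform 'factors' (a hypothetical stand-in for growth influences)."""
    return sum(rng.uniform(0.0, 1.0) for _ in range(n_factors))

values = [simulated_trait() for _ in range(10000)]
mean = statistics.fmean(values)
sd = statistics.stdev(values)

# In a normal distribution about 68% of values fall within 1 SD of the mean.
within_1sd = sum(abs(v - mean) <= sd for v in values) / len(values)
print(f"mean={mean:.2f}  SD={sd:.2f}  fraction within 1 SD={within_1sd:.3f}")
```

Even though each individual factor is flat rather than bell-shaped, the fraction of sums within 1 SD of the mean should land close to the normal value.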
The histogram for the heights of Vertigo shells also appears normal, although its tails seem to extend a bit too far in either direction. The 2 shortest shells were 2.75 and 2.25 standard deviations (SD) below the sample mean, while the 2 longest shells were 2 and 2.5 SD above it. In a theoretical normal distribution, only about 5% of the data points are expected to lie more than 2 SD from the mean, which in a sample of 54 specimens amounts to 2 or 3 shells. So the presence of almost 4 measurements outside the 2-SD limit seems unusual. But, as I explained above, snails don't have to stay within the boundaries of any theory and may behave any way they wish. Besides, there may also have been measurement errors.
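The arithmetic behind the 2-SD expectation can be checked directly. The sketch below assumes an exactly normal population and independent measurements, and the helper `prob_at_least` is my own name, not something from PAST. It computes the exact two-sided tail probability, the expected number of shells beyond 2 SD in a sample of 54, and the binomial chance of seeing at least 3 or at least 4 such shells.

```python
import math

n = 54  # sample size from the post

# P(|Z| > 2) for a standard normal variable, via the complementary error function.
p_tail = math.erfc(2 / math.sqrt(2))
expected = n * p_tail

def prob_at_least(k, n, p):
    """Binomial probability of k or more 'successes' in n independent trials."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p_three_or_more = prob_at_least(3, n, p_tail)
p_four_or_more = prob_at_least(4, n, p_tail)

print(f"P(|Z| > 2) = {p_tail:.4f}")
print(f"expected shells beyond 2 SD = {expected:.2f}")
print(f"P(at least 3) = {p_three_or_more:.3f}, P(at least 4) = {p_four_or_more:.3f}")
```

The exact tail probability is a little under the rounded 5%, so the expected count comes out slightly lower than 5% of 54.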
These measurements belonged to one species and were obtained from a sample of shells collected at the same time from a very small plot. So there really was no reason to expect any clustering in the data. But here is the clustering dendrogram PAST created with this data set.
You can see that the analysis comes up with all sorts of clusters. It seems to group the 2 data points that are closest to each other into a cluster and then to join those clusters with other clusters or points. As a matter of fact, that is exactly how agglomerative hierarchical clustering algorithms work.
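The merging procedure described here can be sketched in a few lines. This is a minimal illustration, not PAST's implementation: I am assuming single linkage (the distance between two clusters is the smallest gap between their members) because the post does not say which linkage was used, and the example values are hypothetical shell lengths rather than the measured data.

```python
from itertools import combinations

def single_linkage_distance(a, b):
    """Smallest gap between any member of cluster a and any member of cluster b."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(values):
    """Return the merge history as a list of (cluster_a, cluster_b, distance)."""
    clusters = [(v,) for v in values]  # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters that are currently closest together.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage_distance(
                clusters[pair[0]], clusters[pair[1]]
            ),
        )
        a, b = clusters[i], clusters[j]
        merges.append((a, b, single_linkage_distance(a, b)))
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(a + b)
    return merges

# Hypothetical shell lengths in mm, for illustration only.
example = [1.80, 1.82, 1.90, 1.95, 2.10]
for a, b, d in agglomerate(example):
    print(a, "+", b, "at distance", round(d, 3))
```

Each printed line corresponds to one junction in a dendrogram, and the distance at which two groups merge sets the height of that junction. Note that the procedure always ends with everything joined into one tree, whether or not the data contain real groups.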
Interestingly, the 2 clusters that stand out most from the others, and which I colored red and green, correspond to the tails of the histogram. The red one includes the 2 shortest shells and the green one the 2 longest shells. These are, however, artifactual clusters without any biological significance. The lesson from this exercise is that one needs to be careful about how one interprets a dendrogram. Once again, it is almost as if one needs to have some idea of whether or not clusters are expected before the analysis is performed.
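That tail clusters arise even in a perfectly homogeneous sample can be checked by simulation. The sketch below rests on assumptions of my own: it draws 54 values from a single normal distribution (simulated numbers, not the snail measurements) and it assumes single linkage. For one-dimensional data under single linkage, the top division of the dendrogram falls at the widest gap between neighboring sorted values, so the helper `top_split` only needs to find that gap.

```python
import random

def top_split(values):
    """Split 1-D data at the widest gap between neighbouring sorted values.

    Under single linkage in one dimension, this is the top division
    of the dendrogram.
    """
    s = sorted(values)
    gaps = [s[k + 1] - s[k] for k in range(len(s) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return s[:cut], s[cut:]

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 1000
small = 0
for _ in range(trials):
    # One homogeneous "population": 54 draws from a single normal distribution.
    sample = [rng.gauss(0.0, 1.0) for _ in range(54)]
    left, right = top_split(sample)
    if min(len(left), len(right)) <= 3:
        small += 1

print(f"{small} of {trials} simulated samples split off a tail group of 3 or fewer points")
```

Because the points in the tails of a normal sample are sparse, the widest gap tends to sit there, and the most distinct "cluster" tends to be a handful of extreme values rather than a biologically meaningful group.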