05 June 2007

Cluster analysis - part 2

A cluster analysis uses various algorithms to form groups of data points. It does not, however, assign any significance or meaning to the resulting clusters. Nor does it explain the processes behind the formation of those clusters, even when the clusters are significant. The explaining is up to the scientist.

In yesterday's post I presented the results I had obtained with artificial data. I also implied that a cluster analysis would be most successful if the scientist already suspected some sort of clustering in the data. Yes, it seems like a bit of a circular process.

One (other) problem with any type of cluster analysis is that even when no meaningful clustering is expected in a data set, the analysis is likely to come up with clusters. Today I will demonstrate this by applying cluster analysis to real data.

[Figure: histogram of Vertigo pygmaea shell lengths]

This is a histogram of the distribution of shell lengths of 54 live Vertigo pygmaea from my backyard. These types of measurements are usually distributed normally or near-normally. There is really no requirement that they be normal, but because many independent or almost independent factors contribute in an additive fashion to the development of the measured values, the resulting distributions usually approach normality.
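The additive argument above can be checked with a small simulation. This is only a sketch under stated assumptions: the number of factors (20), the uniform distribution of each factor, and the function names are all made up for illustration and say nothing about how snail shells actually grow.

```python
import random
import statistics


def simulate_trait(n_individuals, n_factors, seed=0):
    """Each simulated measurement is the sum of many small independent factors.

    The factors are drawn from a uniform distribution (an assumption made
    only for this sketch), which is itself far from normal.
    """
    rng = random.Random(seed)
    return [
        sum(rng.uniform(0.0, 1.0) for _ in range(n_factors))
        for _ in range(n_individuals)
    ]


def fraction_within(values, k):
    """Fraction of values lying within k sample SDs of the sample mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return sum(1 for v in values if abs(v - mean) <= k * sd) / len(values)


if __name__ == "__main__":
    values = simulate_trait(n_individuals=10000, n_factors=20)
    # For a normal distribution these fractions are roughly 0.68 and 0.95.
    print("within 1 SD:", fraction_within(values, 1))
    print("within 2 SD:", fraction_within(values, 2))
```

Even though each individual factor is uniformly distributed, the sums should come out close to the fractions expected for a normal distribution.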

The histogram for the heights of Vertigo shells also appears normal, although its tails seem to extend a bit too far out in either direction. The 2 shortest shells were 2.75 and 2.25 standard deviation (SD) units away from the sample mean, while the 2 longest shells were 2 and 2.5 SD units away from the sample mean. In a theoretical normal distribution, only about 5% of the data points are expected to be more than 2 SD units away from the mean. So in a sample of 54 specimens, the presence of 3 measurements beyond the 2-SD limit, and a fourth right at it, seems unusual. But, as I explained above, snails don't have to stay within the boundaries of any theory and may behave any way they wish. Besides, there may also have been measurement errors.
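The arithmetic behind the 2-SD argument can be sketched as follows. The function names are hypothetical, and the binomial calculation assumes that the 54 measurements are independent draws from an exactly normal distribution, which is an idealization.

```python
import math
from statistics import NormalDist


def tail_probability(k):
    """Two-sided probability of falling more than k SDs from the mean."""
    return 2 * (1 - NormalDist().cdf(k))


def expected_beyond(n, k):
    """Expected number of points, out of n, more than k SDs from the mean."""
    return n * tail_probability(k)


def prob_at_least(n, p, m):
    """Binomial probability that at least m of n independent points are in the tails."""
    return 1 - sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(m)
    )


if __name__ == "__main__":
    p = tail_probability(2)
    print("P(beyond 2 SD):", p)
    print("expected count in 54 shells:", expected_beyond(54, 2))
    print("P(at least 3 beyond 2 SD):", prob_at_least(54, p, 3))
    print("P(at least 4 beyond 2 SD):", prob_at_least(54, p, 4))
```

The exact two-sided tail beyond 2 SD is a little under the rounded 5% figure, so the expected count for 54 shells is about 2.5 rather than 2.7.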

These measurements belonged to one species and were obtained from a sample of shells collected at the same time from a very small plot. So there really was no reason to expect any clustering in the data. But here is the clustering dendrogram PAST created with this data set.

[Figure: clustering dendrogram of the Vertigo pygmaea shell measurements, created with PAST]

You can see that the analysis comes up with all sorts of clusters. It appears to group the 2 data points that are closest to each other into a cluster, and then to join that cluster with other clusters or points. As a matter of fact, that is essentially how agglomerative (hierarchical) clustering algorithms work.
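The merging procedure described above can be sketched in a few lines. This is a minimal illustration, not the algorithm PAST used: it assumes one-dimensional measurements, single linkage (the distance between two clusters is the distance between their closest members), and made-up example values.

```python
def single_linkage_merges(values):
    """Repeatedly merge the two closest clusters until only one remains.

    Returns a list of (cluster_a, cluster_b, distance) tuples, one per merge.
    Any input with n points yields exactly n - 1 merges, whether or not the
    data contain meaningful groups.
    """
    clusters = [[v] for v in values]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


if __name__ == "__main__":
    # Hypothetical measurements, chosen only to make the merges easy to follow.
    for a, b, d in single_linkage_merges([1.0, 1.1, 1.5, 3.0]):
        print(a, "+", b, "at distance", round(d, 3))
```

The merge distances are what a dendrogram plots as branch heights. A value far from the rest, such as 3.0 here, joins last and at the greatest height, which is why the tails of a distribution tend to stand out in the tree.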

Interestingly, the 2 clusters that stand out most from the others, and which I colored red and green, correspond to the tails of the histogram. The red one includes the 2 shortest shells and the green one the 2 longest shells. These are, however, artifactual clusters without any biological significance. The lesson from this exercise is that one needs to be careful about how one interprets a dendrogram. Once again, it is almost as if one needs to have some idea of whether or not clusters are expected before the analysis is performed.

4 comments:

Anonymous said...

Aydin,
While you can apply cluster analysis to your size data within one species, its most effective use is for ecological assemblages, where a similarity matrix can be calculated for between-site or -species analysis. Since the data you present is normally distributed, you could easily (assuming that you have a sufficient n from each locality) calculate a mean and then test the significance of any differences between sites using a parametric (because it looks normally distributed) test, like a t-test/Chi-square or something. I believe that applying cluster analysis to your data is not a good representation of what this technique, which is an exploratory procedure in the first place, is capable of.

Chad Ferguson
University of Cincinnati
ferguscd@email.uc.edu

Frank Anderson said...

But, as I explained above, snails don't have to stay within the boundaries of any theory and may behave any way they wish.

Surely you don't mean this. I can think of several theories that describe natural phenomena that undeniably constrain snail behavior, development, physiology, etc. The laws of thermodynamics and gravity leap to mind.

Interestingly, the 2 clusters that stand out most from the others, and which I colored red and green, correspond to the tails of the histogram. The red one includes the 2 shortest shells and the green one the 2 longest shells. These are, however, artifactual clusters...

I don't find this to be very surprising, so I'm not sure what you mean. I don't know exactly what algorithm or measure of similarity (i.e., distance) you used, but you asked an algorithm to cluster data based on some measure of similarity (although as I just alluded to above, sometimes it is easier to think of "similarity" as "distance"). The largest group of most similar data points are in that hump in the middle of the distribution, and they are clustered in the dendrogram to reflect that. The tails of the distribution "stick out" by this distance measure, so they are clustered far from the points in the center of the distribution. The method seems to have done its job. Now, if it was a stupid job to do....well, see below.

...without any biological significance.

Yes, but as you noted, the significance of any finding (leaving aside the issue of statistical significance) is of course always open to interpretation.

The lesson from this exercise is that one needs to be careful about how one interprets a dendrogram.

I agree, but of course that can't be taken as an indictment of the method. But I'm sure you weren't trying to imply that.

Once again, it is almost as if one needs to have some idea of whether or not clusters are expected before the analysis is performed.

We do have some expectation of clusters in phylogenetics, which uses related methods, if we accept descent with modification.

I agree with Chad here. Cluster analysis is typically used for species assemblages and can be very useful in that context. The statistical method used must be tailored to the data to address a particular question.

I guess my point is that you can analyze data with a wide array of different tests and algorithms and find apparently highly significant results that are, as you wrote, meaningless. This isn't a failure of the method, but it can be a failure of the investigator. The tests are just tools for detecting interesting patterns in data.

AYDIN ÖRSTAN said...

Chad: I agree. There really is no point in applying cluster analysis to a set of normally distributed measurements. I was only trying to show that cluster analysis creates clusters even when there are none.

Andy:
Surely you don't mean this.

I surely don't mean that. What I meant was snails don't have to distribute themselves normally. A normal distribution is a mathematical construction. Snails and other biological entities approximate it.

...that can't be taken as an indictment of the method. But I'm sure you weren't trying to imply that.

Right. I am not trying to discredit cluster analysis. As a matter of fact, I am hoping to use it in a paper I am writing.

This isn't a failure of the method, but it can be a failure of the investigator. The tests are just tools for detecting interesting patterns in data.

I agree.

Frank Anderson said...

Right. I am not trying to discredit cluster analysis. As a matter of fact, I am hoping to use it in a paper I am writing.

I think I was reacting more to a comment about your post that ended up on CONCH-L than to what you actually wrote in the post. People should be careful when interpreting dendrograms (or phylogenies, or any other form of data interpretation), but my hackles go up when a method is derided because it's "slick". I heard a bit too much of that as a young phylogeneticist. It's all tools -- if someone wants to try to remove a bolt with a drill, go ahead...but don't blame the drill afterwards!