04 June 2007

Cluster analysis - part 1

I don't know much about cluster analysis. In fact, I know so little about it that I shouldn't even be writing about it. But I am learning about it and one way to learn about something is by writing about it on one's blog. That helps because, unless you want the entire world to think what an idiot you are, before you write about something new, you study the subject matter and then put in your blog post only those things that you think you have understood and hope for the best.

Another way of learning something like cluster analysis is by running simulations. So, to carry out a cluster analysis simulation, I created a data set of 25 numbers intentionally arranged in 2 groups. What could such a set of numbers represent in real life? Well, they could be the measurements of a body part of a sample of organisms, for example, the lengths of snail shells.

10.1, 10, 10.2, 10.1, 6.4, 10.0, 6.2, 5.9, 10.4, 6.0, 6.0, 6.1, 10.3, 10.2, 6.3, 6.2, 10.0, 10.3, 6.1, 6.0, 10.4, 10.3, 6.0, 6.1, 10.0

I just made these numbers up so as to have 2 sufficiently distant sets of numbers. Even a casual glance at these numbers will make one suspect that there are indeed 2 clusters.

6.4, 6.2, 5.9, 6.0, 6.0, 6.1, 6.3, 6.2, 6.1, 6.0, 6.0, 6.1

10.1, 10, 10.2, 10.1, 10.0, 10.4, 10.3, 10.2, 10.0, 10.3, 10.4, 10.3, 10.0
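
The "casual glance" split above can also be done mechanically. What follows is only a minimal sketch in Python, assuming that the widest gap between neighbouring sorted values is a sensible place to cut; the function name split_at_largest_gap is made up for this illustration and is not part of any library.

```python
# The 25 made-up measurements from the post, in their original order.
data = [10.1, 10, 10.2, 10.1, 6.4, 10.0, 6.2, 5.9, 10.4, 6.0, 6.0, 6.1, 10.3,
        10.2, 6.3, 6.2, 10.0, 10.3, 6.1, 6.0, 10.4, 10.3, 6.0, 6.1, 10.0]


def split_at_largest_gap(values):
    """Split values into two groups at the widest gap between sorted neighbours.

    Hypothetical helper for illustration only.
    """
    ordered = sorted(values)
    # Pair every gap with the index of the value on its lower side.
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    widest, idx = max(gaps)
    # Cut halfway across the widest gap.
    cut = (ordered[idx] + ordered[idx + 1]) / 2
    low = [v for v in values if v < cut]
    high = [v for v in values if v > cut]
    return low, high, widest


low, high, widest = split_at_largest_gap(data)
print(len(low), "values in the lower group:", low)
print(len(high), "values in the upper group:", high)
```

This only works because the data were built with one obvious gap; with real measurements the widest gap need not separate anything meaningful.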

Obviously, in this case one doesn't need any sophisticated statistical tests to confirm one's suspicions. Nevertheless, a quick and easy check is to compare the 95% confidence intervals (CI) of the means of the 2 groups. The 95% CI of a sample mean is calculated from its standard error; if the sampling were repeated many times, about 95% of the intervals built this way would contain the population mean. In this case, the CI of the mean (6.11) of the 1st group is 6.02 to 6.20, while the CI of the mean (10.18) of the 2nd group is 10.09 to 10.27. The upper limit of the CI of the 1st group doesn't even come close to the lower limit of the CI of the 2nd group. So, yes, we can be quite confident that we have 2 non-overlapping clusters.
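
For anyone who wants to redo the arithmetic, here is a minimal sketch in Python using only the standard library. It assumes the intervals are based on Student's t distribution (the post does not say which method was used), and the critical values are copied from a standard t-table rather than computed. The results should land close to the intervals quoted above, though the last decimal may differ by rounding.

```python
from math import sqrt
from statistics import mean, stdev

group1 = [6.4, 6.2, 5.9, 6.0, 6.0, 6.1, 6.3, 6.2, 6.1, 6.0, 6.0, 6.1]
group2 = [10.1, 10, 10.2, 10.1, 10.0, 10.4, 10.3, 10.2, 10.0, 10.3, 10.4,
          10.3, 10.0]

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
# Assumption: table values; only the two sample sizes used here are covered.
T_CRIT_95 = {11: 2.201, 12: 2.179}


def ci95(sample):
    """Return (mean, lower limit, upper limit) of the 95% CI of the mean."""
    n = len(sample)
    m = mean(sample)
    se = stdev(sample) / sqrt(n)      # standard error of the mean
    margin = T_CRIT_95[n - 1] * se    # half-width of the interval
    return m, m - margin, m + margin


for name, group in (("group 1", group1), ("group 2", group2)):
    m, lower, upper = ci95(group)
    print(f"{name}: mean = {m:.2f}, 95% CI = {lower:.2f} to {upper:.2f}")
```

Note that stdev is the sample standard deviation (dividing by n - 1), which is what the standard error of a sample mean calls for.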

Now, let's do a cluster analysis with this set of numbers. Luckily, there is a free and fairly user-friendly statistics program called PAST (Palaeontological Statistics) that does cluster analysis. Here is the dendrogram I got back from PAST for my data set.


The vertical axis gives the similarity values that PAST calculated using an algorithm called "unweighted pair-group average", the details of which need not concern us here. Note that the similarities have negative values; the more negative the value at the branching point of 2 clusters, the less similar they are.

In this dendrogram there are 2 well-separated clusters of numbers corresponding to the 2 groups. But what about all those other clusters within each main cluster? They have relatively small similarity values at their branch points and appear to be artifacts of the analysis; we are going to ignore them.
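
For the curious, the general idea of average-linkage clustering can be sketched in a few lines of Python. This is not PAST's code, only a bare-bones illustration under two assumptions: that the distance between two numbers is simply their absolute difference, and that a similarity like the one on the dendrogram's axis would be the negative of that distance. The function name upgma is made up for this sketch.

```python
data = [10.1, 10, 10.2, 10.1, 6.4, 10.0, 6.2, 5.9, 10.4, 6.0, 6.0, 6.1, 10.3,
        10.2, 6.3, 6.2, 10.0, 10.3, 6.1, 6.0, 10.4, 10.3, 6.0, 6.1, 10.0]


def upgma(values):
    """Average-linkage agglomerative clustering of 1-D data.

    Returns the merge history as (members_a, members_b, distance) tuples,
    in the order in which the merges happen.
    """
    clusters = [[v] for v in values]   # start with one cluster per value
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                # Average of all pairwise distances between the two clusters.
                d = sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Replace the two closest clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


merges = upgma(data)
left, right, top_distance = merges[-1]   # the final merge joins everything
previous_distance = merges[-2][2]        # the merge just before it

print("last merge joins", len(left), "and", len(right), "values")
print(f"last merge distance: {top_distance:.2f}")
print(f"previous merge distance: {previous_distance:.2f}")
```

The last merge happens at a much larger distance than any merge before it, which is the numerical counterpart of the 2 well-separated branches in the dendrogram; all the earlier merges happen at small distances within one group or the other.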

The next post in this series will apply cluster analysis to real data.


Clare said...

Interesting. I think it's a good idea to do posts like these for two reasons - interesting for your readers but also good for getting the blogger's thoughts in order!

pascal said...

I wish there was something like this for the mac. One more reason to get one of the intel-based powerbooks next time...