Chapter 4: Sequential Methods.

Agglomerative Methods.

1. Nearest-neighbour or Single-linkage Method.

All agglomerative methods start with t clusters, one for each of the t objects, so that each object forms its own separate cluster. The aim of the method is then to join these together to form a smaller number of larger clusters.

Let us first consider the nearest-neighbour method applied to a distance matrix corresponding to the objects shown in Figure 4.1. At each step of the method, two clusters are merged: one row and column of the distance matrix is removed, and the distances in the remaining row and column are recalculated according to the following formula. Assume clusters P and Q have been merged. If the distance between clusters P and R is d(P,R) and the distance of R from the new combined cluster is d(P+Q,R), then we have:

d(P+Q,R) = 0.5 (d(P,R)+d(Q,R)) - 0.5|(d(P,R)-d(Q,R))|

This is equivalent to saying that

d(P+Q,R) = min( d(P,R),d(Q,R) )

This means that the distance of any other object to the cluster is the distance of that object to the nearest object within the cluster. Likewise, the distance between two multi-object clusters will be the distance between the nearest pair of points, one in each cluster.
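As a minimal illustration, this update rule can be written as a short Python function. The names below are chosen for this example only and do not come from any particular package.

# Minimal sketch of the single-linkage (nearest-neighbour) distance update.
def single_link_update(d_PR, d_QR):
    """Distance from the merged cluster P+Q to another cluster R."""
    # Equivalent to 0.5*(d_PR + d_QR) - 0.5*abs(d_PR - d_QR).
    return min(d_PR, d_QR)

# Example: if d(P,R) = 3.0 and d(Q,R) = 5.0, the merged cluster is 3.0 from R.
print(single_link_update(3.0, 5.0))   # prints 3.0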

Now, looking at the simple case in Figure 4.1, the two closest points are probably objects 8 and 9. These are merged into a new cluster, labelled 8, and row and column 9 are removed from the distance matrix. The next closest pair is probably the new cluster 8 and object 7, so these become the new cluster 7. Then objects 1, 2 and 3 join together in a cluster.

Figure 4.1 shows the process at this stage, with each object or cluster joined by lines to its nearest neighbours. The shortest of these lines connects object 6 and cluster 7, so this is the next pair to be merged. The process may halt with two clusters, one containing (1, 2, 3, and 4) and the other containing (5, 6, 7, 8, and 9), or it may continue until all are joined together in one large cluster.

Now let us consider the nearest-neighbour method as applied to a similarity matrix. In this case we take the largest similarity rather than the smallest distance. When calculating the new values of similarity, we may use either of the equivalent equations:

s(P+Q,R) = 0.5 (s(P,R)+s(Q,R)) + 0.5|(s(P,R)-s(Q,R))|
or
s(P+Q,R) = max(s(P,R),s(Q,R))

Let us take a small 6x6 similarity matrix and work through this method. The initial similarity matrix is:

Objects    1    2    3    4    5    6
   1     1.0  0.6  0.8  0.0  0.6  0.0
   2     0.6  1.0  0.8  0.4  1.0  0.4
   3     0.8  0.8  1.0  0.2  0.8  0.2
   4     0.0  0.4  0.2  1.0  0.4  1.0
   5     0.6  1.0  0.8  0.4  1.0  0.4
   6     0.0  0.4  0.2  1.0  0.4  1.0

We begin the method by scanning down the columns of the array, one after the other, looking only at those values below the diagonal.

If we adopt the convention that we only change our entry when we find a strictly larger value, then we select the first value of 1.0 which we find, that is, the similarity between object 5 and object 2. So we join objects 5 and 2 into a new cluster, called cluster 2, and set row 5 and column 5 to zeros. The new values in row and column 2 are the larger of the two previous values. In this case objects 5 and 2 are identical (similarity = 1.0), and so row and column 2 remain unchanged.

The array now takes the form:

Objects    1    2    3    4    5    6
   1     1.0  0.6  0.8  0.0  0.0  0.0
   2     0.6  1.0  0.8  0.4  0.0  0.4
   3     0.8  0.8  1.0  0.2  0.0  0.2
   4     0.0  0.4  0.2  1.0  0.0  1.0
   5     0.0  0.0  0.0  0.0  0.0  0.0
   6     0.0  0.4  0.2  1.0  0.0  1.0

The record of clusters now contains the entry:

Cluster 2 contains 2 and 5 at similarity level 1.0.

Looking again at the values in this array below the diagonal, we join object 4 and object 6 at similarity level 1.0. This effectively reduces our array to 4x4 and adds a second entry to the record of clusters:

Cluster 2 contains 2 and 5 at similarity level 1.0
Cluster 4 contains 4 and 6 at similarity level 1.0

The next search gives object 1 and object 3 at similarity 0.8, so these are merged. To obtain the new values in column 1, we have:

For cluster 2, compare 0.6 and 0.8 and choose 0.8.
For cluster 4, compare 0.0 and 0.2 and choose 0.2.

This gives the array:

Clusters    1    2    3    4
    1     1.0  0.8  0.0  0.2
    2     0.8  1.0  0.0  0.4
    3     0.0  0.0  0.0  0.0
    4     0.2  0.4  0.0  1.0

and the record of clusters:

Cluster 2 contains 2 and 5 at similarity level 1.0.
Cluster 4 contains 4 and 6 at similarity level 1.0.
Cluster 1 contains 1 and 3 at similarity level 0.8.

Now we see that clusters 1 and 2 should be combined at similarity level 0.8 and their similarity to the remaining cluster is the largest of 0.2 and 0.4, which is 0.4. Probably we should stop here, with the record:

Cluster 2 contains 2 and 5 at level 1.0.
Cluster 4 contains 4 and 6 at level 1.0.
Cluster 1 contains 1 and 3 at level 0.8.
Cluster 1 contains clusters 1 and 2 at level 0.8.

The dendrogram for this clustering is shown in Figure 4.2.
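For readers who prefer a computational statement of the worked example, the following Python sketch carries out the same nearest-neighbour merges on the 6x6 similarity matrix. The function name and the stopping level of 0.5 are choices made for this illustration only.

# Sketch of the nearest-neighbour method on a similarity matrix, following
# the worked example above.  All names are illustrative.
import numpy as np

def nearest_neighbour_similarity(S, stop_level=0.0):
    """Merge the most similar pair of clusters repeatedly.

    S is a symmetric similarity matrix.  Returns the record of merges as a
    list of (kept_cluster, absorbed_cluster, similarity) tuples, 1-based.
    """
    S = S.astype(float).copy()
    t = S.shape[0]
    active = list(range(t))
    record = []
    while len(active) > 1:
        # Find the largest similarity below the diagonal among active clusters,
        # keeping the first occurrence when there are ties.
        best = None
        for j in active:
            for i in active:
                if i > j and (best is None or S[i, j] > S[best]):
                    best = (i, j)
        i, j = best
        level = S[i, j]
        if level < stop_level:
            break
        record.append((j + 1, i + 1, level))
        # New row and column for the kept cluster j: the larger of the two values.
        for r in active:
            if r not in (i, j):
                S[j, r] = S[r, j] = max(S[i, r], S[j, r])
        S[i, :] = 0.0          # remove cluster i by zeroing its row and column
        S[:, i] = 0.0
        active.remove(i)
    return record

S = np.array([[1.0, 0.6, 0.8, 0.0, 0.6, 0.0],
              [0.6, 1.0, 0.8, 0.4, 1.0, 0.4],
              [0.8, 0.8, 1.0, 0.2, 0.8, 0.2],
              [0.0, 0.4, 0.2, 1.0, 0.4, 1.0],
              [0.6, 1.0, 0.8, 0.4, 1.0, 0.4],
              [0.0, 0.4, 0.2, 1.0, 0.4, 1.0]])

# Stopping at level 0.5 reproduces the four entries in the record above.
for kept, absorbed, level in nearest_neighbour_similarity(S, stop_level=0.5):
    print(f"Cluster {kept} absorbs {absorbed} at similarity level {level}")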

Furthest-neighbour or Complete-linkage Method.

Once again the method is very similar, but now we take the distance from the furthest point in the cluster for the new value of distance.

Figure 4.3 shows the same example as Figure 4.1, but here the lines indicate the distances used for complete linkage. The equations take the forms:

d(P+Q,R) = 0.5(d(P,R)+d(Q,R)) + 0.5|(d(P,R)-d(Q,R))|
and
d(P+Q,R) = max( d(P,R),d(Q,R) )

for the distances, while for the similarities we have:

s(P+Q,R) = 0.5 (s(P,R)+s(Q,R)) - 0.5|(s(P,R)-s(Q,R))|
and
s(P+Q,R) = min( s(P,R),s(Q,R) )

These calculations may be carried out in the same way as for the first method. However, you will find that complete linkage favours compact, roughly spherical clusters, while single linkage may lead to an effect called “chaining”, in which long straggly clusters form as adjacent points are merged in one after another. Sometimes this is exactly the effect we require.

The paper by Day and Edelsbrunner gives a single equation for the new distance coefficients, with different parameter values producing the different methods. Taking the part of the equation which relates to the methods discussed so far in this chapter, we have:

d(P+Q,R) = aP d(P,R) + aQ d(Q,R) + gamma |d(P,R) - d(Q,R)|

When gamma = -0.5 we have the single-linkage method and when gamma = +0.5 we have the complete-linkage method. When gamma = 0 we have McQuitty's method, described in a later section. As gamma varies from -0.5 to +0.5, this gives the full range of behaviour, from the long straggly clusters of the single-linkage method to the compact spherical clusters of the complete-linkage method. This equation assumes that the coefficients aP and aQ are both equal to 0.5; the average-linkage method does not make this assumption.
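A small numerical check of this behaviour, assuming aP = aQ = 0.5, can be written in a few lines of Python (illustrative names only):

# Sketch of the single-parameter update; gamma = -0.5 gives single linkage,
# gamma = 0 gives McQuitty, gamma = +0.5 gives complete linkage.
def gamma_update(d_PR, d_QR, gamma):
    return 0.5 * (d_PR + d_QR) + gamma * abs(d_PR - d_QR)

d_PR, d_QR = 3.0, 5.0
print(gamma_update(d_PR, d_QR, -0.5))  # 3.0  (single linkage: the minimum)
print(gamma_update(d_PR, d_QR,  0.0))  # 4.0  (McQuitty: the simple average)
print(gamma_update(d_PR, d_QR, +0.5))  # 5.0  (complete linkage: the maximum)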

Average-linkage Method.

In this case, when calculating the new values of similarity or distance, we take a weighted average of the individual similarities or distances, weighted according to the number of objects already in each cluster. If cluster P contains NP objects and cluster Q contains NQ objects, then when these are combined we have the equation:

s(P+Q,R) = NP s(P,R) /{NP+NQ} + NQ s(Q,R) /{NP+NQ}

and similarly for distances. This method gives results similar to those obtained with the complete-linkage method.
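A one-line Python sketch of this weighted average (names are illustrative):

# Sketch of the group-average (average-linkage) update; NP and NQ are cluster sizes.
def average_link_update(s_PR, s_QR, NP, NQ):
    return (NP * s_PR + NQ * s_QR) / (NP + NQ)

# A cluster of 3 objects merged with a cluster of 1 object:
print(average_link_update(0.8, 0.4, NP=3, NQ=1))  # 0.7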

McQuitty's Method.

This is a further simplification: it takes a simple, unweighted average of the previous similarities or distances, i.e.

s(P+Q,R) = 0.5( s(P,R)+s(Q,R) )

This method often gives results similar to those obtained from the single-linkage method.

Metric Methods.

The remaining methods may only be applied to distance matrices, that is dissimilarity matrices whose values satisfy the conditions for a metric. The general equation for these methods will take the form:

d(P+Q,R) = aP d(P,R) + aQ d(Q,R) + b d(P,Q)

Different values for the parameters will give the four different methods.

a) Centroid Sorting Method.

Here the two closest clusters are merged and the new values of distance are obtained from the equation:

d(P+Q,R) = NP d(P,R)/{NP+NQ} + NQ d(Q,R)/{NP+NQ} - NP.NQ d(P,Q)/{NP+NQ}^2

Thus it takes the weighted average and subtracts a term related to the size of the new cluster.

b) Gower's Median Method.

In this case, the new distance is the distance from the centroid of R to the mid-point of the line joining the centroids of P and Q. The equation is an obvious simplification of the previous method.

d(P+Q,R) = 0.5 d(P,R) + 0.5 d(Q,R) - 0.25 d(P,Q)

c) Ward's Method.

This makes use of the “error sum of squares” which is the sum of squares of the distances of each object from the centroid of its parent cluster. At each step, we combine the pair of clusters whose fusion causes the least increase in this sum. The new distances are calculated according to the formula

d(P+Q,R) = {(NR+NP) d(P,R) + (NR+NQ) d(Q,R) - NR d(P,Q)} / {NP+NQ+NR}

d) Lance-Williams Flexible-beta Method

The equation for the new values of distance is given by:

d(P+Q,R) = (1-b)/2 (d(P,R) + d(Q,R)) + b d(P,Q)

The behaviour of this method depends on the value chosen for the parameter beta. It is claimed that beta = -0.25 gives behaviour very similar to that of Ward's method.
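The four metric methods can all be written as choices of the parameters aP, aQ and b in the general equation above. The following Python sketch collects them in one function; the function name, argument names and the example values are assumptions made for this illustration, not part of any package.

# Sketch of the four metric updates as special cases of
# d(P+Q,R) = aP d(P,R) + aQ d(Q,R) + b d(P,Q).
def metric_update(method, d_PR, d_QR, d_PQ, NP=1, NQ=1, NR=1, beta=-0.25):
    if method == "centroid":
        aP, aQ = NP / (NP + NQ), NQ / (NP + NQ)
        b = -NP * NQ / (NP + NQ) ** 2
    elif method == "median":
        aP = aQ = 0.5
        b = -0.25
    elif method == "ward":
        total = NP + NQ + NR
        aP, aQ = (NR + NP) / total, (NR + NQ) / total
        b = -NR / total
    elif method == "flexible":
        aP = aQ = (1.0 - beta) / 2.0
        b = beta
    else:
        raise ValueError("unknown method: " + method)
    return aP * d_PR + aQ * d_QR + b * d_PQ

# Compare the four rules on the same pair of clusters (NP=2, NQ=3, NR=1).
for m in ("centroid", "median", "ward", "flexible"):
    print(m, metric_update(m, d_PR=4.0, d_QR=6.0, d_PQ=2.0, NP=2, NQ=3, NR=1))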

References.

Clustan manual mark 2C.

Everitt, B. Cluster Analysis. 2nd edition. Heinemann, 1980.

Lance and Williams. Computer Journal 9 (1967) p373-380. “A General Theory of Classificatory Sorting Strategies. 1. Hierarchical Systems”.

Defays. Computer Journal 20 (1977) p364-366. “An Efficient Algorithm for a Complete Link Method”.

Murtagh. Computer Journal 26 (1983) p354-359. “A Survey of Recent Advances in Hierarchical Clustering Algorithms”.

Day and Edelsbrunner. Journal of Classification 1 (1984) p7-24. “Efficient Algorithms for Agglomerative Hierarchical Clustering Methods”.

Divisive Methods.

Association Analysis.

This method assumes that all the attributes are binary and is applied to the original data matrix. One attribute k is chosen and the data are split so that the objects with k=0 form one cluster and those with k=1 form the other. The attribute is chosen so that the separation between the two clusters is maximised, measured by one of the following quantities. Here a, b, c and d are the counts in the 2x2 table comparing attribute k with another attribute j (a = both present, b = k only, c = j only, d = neither), t is the total number of objects, and each sum runs over all attributes j other than k.

SUM{chi2} = SUM{ t(ad-bc)^2 / (a+b)(c+d)(a+c)(b+d) }

or SUM{ sqrt(chi2) }, or SUM{ |ad-bc| }, or SUM{ (ad-bc)^2 }
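As an illustration of the SUM{chi2} criterion, the following Python sketch scores each binary attribute of a small 0/1 data matrix and picks the best splitting attribute. The data layout and the function name are assumptions made for this example.

# Sketch of choosing the splitting attribute in association analysis.
import numpy as np

def association_split_attribute(X):
    """X is a (t objects) x (m binary attributes) 0/1 matrix.

    For each candidate attribute k, sum chi-squared over all other attributes j
    and return the attribute with the largest sum.
    """
    t, m = X.shape
    best_k, best_score = None, -1.0
    for k in range(m):
        score = 0.0
        for j in range(m):
            if j == k:
                continue
            a = np.sum((X[:, k] == 1) & (X[:, j] == 1))   # both present
            b = np.sum((X[:, k] == 1) & (X[:, j] == 0))   # k only
            c = np.sum((X[:, k] == 0) & (X[:, j] == 1))   # j only
            d = np.sum((X[:, k] == 0) & (X[:, j] == 0))   # neither
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom > 0:
                score += t * (a * d - b * c) ** 2 / denom
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

X = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1], [1, 0, 0]])
k, score = association_split_attribute(X)
print(f"split on attribute {k} (criterion = {score:.2f})")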

References for this method.

Everitt. “Cluster Analysis” 2nd edition. section 3.2 p 37-39.

Clustan Version 1C. Section “DIVIDE”.

Sneath & Sokal. “Numerical Taxonomy” p202-204.

Lance & Williams. Computer Journal 8 (1965) p246-249. “Computer Programs for Monothetic Classification.”

Dissimilarity Analysis.

This is applied to a dissimilarity matrix obtained from any mixture of attribute types. It starts with a single group containing all t objects and evaluates the sum of dissimilarities within groups. This will usually be large, because the single group still contains the comparatively large dissimilarities between objects which belong in different clusters.

Select the object with the greatest total dissimilarity to all the other objects and use it to start a new group. It may be very dissimilar to ALL other objects and form a single-object cluster, but more often it will be a member of a larger cluster which is very dissimilar to the rest of the objects in the study.

Try moving other objects from the initial group to the new one, until a configuration is found which minimises the sum of dissimilarities within groups. This is taken to be the new grouping.

Continue with this method until no improvement is found by further splitting.
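The following Python sketch illustrates a single split of this kind, using a greedy search rather than an exhaustive one; all names are illustrative.

# Sketch of one split in dissimilarity analysis.
import numpy as np

def dissimilarity_split(D, members):
    """Split the cluster `members` (list of indices into the matrix D) in two.

    Start the splinter group with the object of largest total dissimilarity,
    then keep moving the object whose transfer most reduces the within-group
    sum of dissimilarities, stopping when no move gives an improvement.
    """
    def within_sum(group):
        idx = np.ix_(group, group)
        return D[idx].sum() / 2.0            # each pair counted once

    main = list(members)
    seed = max(main, key=lambda i: sum(D[i, j] for j in main if j != i))
    main.remove(seed)
    splinter = [seed]
    improved = True
    while improved and len(main) > 1:
        improved = False
        current = within_sum(main) + within_sum(splinter)
        moves = [(within_sum([x for x in main if x != i]) +
                  within_sum(splinter + [i]), i) for i in main]
        best_total, best_i = min(moves)
        if best_total < current:
            main.remove(best_i)
            splinter.append(best_i)
            improved = True
    return main, splinter

# Two well-separated pairs of objects:
D = np.array([[0, 1, 9, 9],
              [1, 0, 9, 9],
              [9, 9, 0, 1],
              [9, 9, 1, 0]], dtype=float)
print(dissimilarity_split(D, [0, 1, 2, 3]))   # main group [2, 3], splinter [0, 1]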

References for this method

Everitt. “Cluster Analysis” 2nd edition. p.35-37.

MacNaughton-Smith et al. Nature (1964) p1034-1035.

Dendrite Method.

This is applied to distances in attribute space. It requires the calculation of a “minimum spanning tree” or “shortest dendrite”, a technique described in the references. This is the graph of t-1 edges connecting the t objects which has the shortest overall length. To split it into k clusters, it is necessary to remove k-1 edges, and these edges are chosen to minimise either SUM{d(ij)} or SUM{d(ij)^2/n(j)}. The method tests all possible configurations and stops when no further improvement is possible.
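The sketch below illustrates the idea in Python, building the minimum spanning tree with Prim's algorithm and then using the common shortcut of deleting the k-1 longest edges, rather than testing all configurations as described above. All names are illustrative.

# Sketch of the dendrite method: minimum spanning tree, then cut k-1 edges.
import numpy as np

def mst_edges(D):
    """Return the t-1 edges (i, j, length) of a minimum spanning tree of D."""
    t = D.shape[0]
    in_tree = {0}
    edges = []
    while len(in_tree) < t:
        d, i, j = min((D[i, j], i, j) for i in in_tree for j in range(t)
                      if j not in in_tree)
        edges.append((i, j, d))
        in_tree.add(j)
    return edges

def dendrite_clusters(D, k):
    """Cut the k-1 longest MST edges and return the resulting clusters."""
    t = D.shape[0]
    kept = sorted(mst_edges(D), key=lambda e: e[2])[:t - k]   # keep the short edges
    # Union-find over the kept edges.
    parent = list(range(t))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j, _ in kept:
        parent[find(i)] = find(j)
    groups = {}
    for x in range(t):
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())

D = np.array([[0, 1, 8, 9],
              [1, 0, 7, 8],
              [8, 7, 0, 2],
              [9, 8, 2, 0]], dtype=float)
print(dendrite_clusters(D, k=2))   # [[0, 1], [2, 3]]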

References for this method

Clustan 1C. section “DNDRITE”.

Rohlf. Computer J. 16 (1973) p 93-95. “Algorithm 76. Hierarchical Clustering Using the Minimum Spanning Tree”.

Overlapping Methods.

At the early stages of these methods, some objects may belong to more than one group. This means the method must conclude with a process which checks through and ensures that each object appears in only one group in the final configuration. The paper by Jardine and Sibson develops a class of Bk methods (in which any two groups may have at most k-1 objects in common) as an extension of the nearest-neighbour method.

A later paper by Cole and Wishart gives an improved version of the algorithm, and this was included by Wishart in his Clustan package (section KDEND in version 1C of Clustan). The KDEND method is applied to t objects for which a similarity matrix has been calculated. It requires values of the ‘linkage parameter’ k and the ‘similarity threshold’ H. Each of the t objects is a node of the graph, and all pairs of objects with a similarity greater than or equal to the threshold H are connected.

1. The method selects maximal complete subgraphs (i.e. groups of points for which all pairs are connected).

2. It also selects subgraphs for which at least k nodes are connected and then the additional links are added.

When this process is complete, the Jardine-Sibson overlapping classification has been obtained. Further configurations with different values of k and/or H may be obtained and compared with it.
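The first of these steps, finding the maximal complete subgraphs of the threshold graph, can be illustrated by the brute-force Python sketch below. It is practical only for small t, and all names are illustrative.

# Sketch: list the maximal complete subgraphs of the graph linking pairs with S >= H.
from itertools import combinations
import numpy as np

def maximal_complete_subgraphs(S, H):
    t = S.shape[0]

    def complete(group):
        # A group is a complete subgraph if every pair within it is connected.
        return all(S[i, j] >= H for i, j in combinations(group, 2))

    subgraphs = [set(g) for r in range(1, t + 1)
                 for g in combinations(range(t), r) if complete(g)]
    # Keep only those not contained in a larger complete subgraph.
    return [g for g in subgraphs if not any(g < other for other in subgraphs)]

S = np.array([[1.0, 0.9, 0.8, 0.1],
              [0.9, 1.0, 0.9, 0.1],
              [0.8, 0.9, 1.0, 0.7],
              [0.1, 0.1, 0.7, 1.0]])
print(maximal_complete_subgraphs(S, H=0.7))   # [{2, 3}, {0, 1, 2}] - object 2 overlaps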

References for this method

Clustan 1C. section KDEND.

Jardine and Sibson. Computer Journal 11 (1968) p177-184. “Construction of Hierarchic and Non-hierarchic Classifications.”

Cole and Wishart. Computer Journal 13 (1970) p156-163. “Improved Algorithm for the Jardine-Sibson method of Generating Overlapping Clusters.”

Copyright (c) Susan Laflin 1998.