Chapter 6: Density Methods.
Useful Concepts.
These methods regard clusters as patches of “high-density”
in attribute space, surrounded by lower density. To devise methods using this
concept, we need some means of calculating density. The simplest is to
take the number of objects in a unit volume of attribute space, defining a
unit volume as that enclosed by a unit step along each attribute. This assumes
the attributes are not scaled to lie in the range (0,1), but it does expect
them all to be scaled to the same range of values, say 0-100.
In discussing these methods, there are several concepts associated with density
that we shall find useful:
1. Straggliness.
Straggliness implies a much greater extension of the group in a few dimensions
than in all the others. A simple example is a “sausage-shaped” cluster in
three dimensions.
2. Gaps and Moats.
A “moat” is a volume of low density surrounding a
high-density cluster. A gap is also a low-density volume, but it need not surround a
cluster. The distance from each member within a cluster to the closest object
in another cluster may be computed and will give a measure of “separateness” or
“isolation” of clusters.
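This measure of separateness is easy to compute once a distance matrix is available. The sketch below is my own illustration, not part of the original presentation: it assumes a complete symmetric distance matrix and an array of cluster labels, and returns both the minimum and the mean of the nearest-outside distances.

```python
import numpy as np

def isolation(dist, labels, cluster):
    """For each member of `cluster`, find the distance to the closest
    object belonging to any other cluster; the minimum (and mean) of
    these give simple measures of how isolated the cluster is."""
    inside = np.where(labels == cluster)[0]
    outside = np.where(labels != cluster)[0]
    # for each member, the distance to its nearest outside object
    nearest = dist[np.ix_(inside, outside)].min(axis=1)
    return nearest.min(), nearest.mean()
```

A large minimum value indicates a wide moat; a mean much larger than the minimum suggests a straggly cluster with one member near the boundary.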
3. Connectivity.
Two objects are said to be “connected” if their distance apart in attribute
space is less than a certain specified value. In some presentations of results, such
objects are shown joined by a straight line.
A cluster is said to be “fully-connected” if every
object in the cluster is directly connected to all other objects in the
cluster. This occurs when the dimensions of the cluster are less than the
specified value for connectivity.
A cluster is said to be “minimally-connected” if each object
is connected to all the others, either directly or via a chain, and there are
t-1 connections joining the t objects. Thus severing any one connection
will result in two separate clusters.
A cluster anywhere between these two extremes is said to be “partially-connected”,
and this is the most common state of affairs.
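These three states can be checked directly from a distance matrix. The sketch below is illustrative (the function name and the use of a union-find test for connectedness are my own choices); it classifies a group of t objects for a given connectivity value.

```python
from itertools import combinations

def connectivity(dist, threshold):
    """Classify a cluster, given its t-by-t distance matrix, as
    'fully', 'minimally' or 'partially' connected (or 'disconnected')
    for the specified connectivity threshold."""
    t = len(dist)
    edges = [(i, j) for i, j in combinations(range(t), 2)
             if dist[i][j] < threshold]
    # union-find to test whether the connection graph is in one piece
    parent = list(range(t))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in edges:
        parent[find(i)] = find(j)
    if len({find(i) for i in range(t)}) > 1:
        return 'disconnected'
    if len(edges) == t * (t - 1) // 2:
        return 'fully'      # every pair directly connected
    if len(edges) == t - 1:
        return 'minimally'  # exactly t-1 connections: a chain or tree
    return 'partially'
```

Note that a minimally-connected cluster has exactly t-1 connections, so severing any one of them splits the cluster in two, as described above.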
Carmichael's TAXMAP Method.
TAXMAP stands for “Taxonomic Mapping” and is an extension of
the nearest-neighbour method. It attempts to reduce the effects of chaining by
testing for the presence of a moat before each new object is added. This is done
by evaluating the discontinuity at each step, where
“discontinuity” = “new average similarity” – “drop in average similarity”.
This measure has been found to decrease smoothly until there is a
discontinuity, unlike the drop in average similarity, which may vary widely. A
worked example of this method may be found in the text by Everitt.
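Taking the formula above at face value, the discontinuity at each step can be computed from the sequence of average similarities recorded as objects are added. This is only a sketch of the arithmetic, not of Carmichael's program; the full bookkeeping is given in Everitt's worked example.

```python
def discontinuity(avg_sims):
    """Given the average similarity of the growing cluster after each
    object is added, return the discontinuity at each step:
    new average similarity minus the drop (previous minus new)."""
    out = []
    for prev, new in zip(avg_sims, avg_sims[1:]):
        drop = prev - new
        out.append(new - drop)
    return out
```

A sudden fall in this sequence signals the moat: the candidate object lies across a region of low density and should not be added to the cluster.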
Cartet-Count Method.
This technique was described by Cattell and Coulter in 1966. The basic idea is to partition the attribute space into “cartets”, or hypercubes, and then count the number of objects in each cartet. By choosing suitable boundary values, it is possible to identify the clusters in the distribution.
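A minimal sketch of the cartet idea, assuming the raw data matrix is available (the `step` parameter, the side of each hypercube, is my own name):

```python
import numpy as np
from collections import Counter

def cartet_count(data, step):
    """Partition attribute space into hypercubes ('cartets') of side
    `step` and count the number of objects falling in each one."""
    cells = [tuple(np.floor(row / step).astype(int)) for row in data]
    return Counter(cells)
```

Cartets with high counts mark the dense regions; runs of empty cartets between them correspond to the gaps and moats separating the clusters.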
Hypersphere Method.
This method, which I developed to deal with a particular
problem, uses a similar concept to the Cartet-count method. However, my method
was designed to deal with cases where a distance or similarity matrix is
available but the original data are not. This occurs in a number of cases of
published archaeological data. Here we have values of the distance
between objects, but no information on the direction of the line joining
them.
Thus if we take one object, somewhere in
the data, as the centre, we know that another object at distance 0.1 from it
lies somewhere on the hypersphere of radius 0.1. Figure 6.1 shows a set of
clusters in two-dimensions and successive circles, centred on one of the
points, with increasing radii. Note that for this illustration, we have
two-dimensional data and the distribution is known.
In the general case, we would start with the distance (or similarity) matrix and deduce the distribution in n-dimensional attribute space. The number of points lying in each annulus is plotted in the histogram in Figure 6.2. This suggests an approach along the following lines.
1. Find a starting point. Choose some value d for the increment
in distance. For example, if the distances are normalised to lie in the range
(0,1), then d=0.1 would be a reasonable value for a first attempt.
For each object in the distance matrix, count the number of
other objects lying within a distance d of this object. Select the object with the
maximum number of other objects close to it. If there are any clusters in the
data, then this object must lie near the centre of one such cluster.
2. Calculate a histogram centred on this object. For each annulus
of width d, count the number of objects lying in the annulus. Then plot the
histogram, giving a result similar to Figure 6.3.
3. Identify the high-density cluster and the low-density moat from this histogram. If the edge of the moat is at distance D from the seed-point, then remove the seed-point and all other objects within distance D from it and take these as one cluster.
A histogram centred on this seed-point will probably not identify the
other clusters very clearly. However, the whole
method may easily be re-applied to the remainder of the data, generating
histograms centred on other points, each of which allows one further cluster to be
clearly identified. Sometimes different values of d must be tried to give
clear resolution of the edge of the moat.
Using the histogram will give a poor value of the density of objects in the
outer annuli, because the area over which the points are spread is so much
larger. One possible solution would be to divide the number of points by the
area of the annulus. In two dimensions, the area has the equation
a(i) = pi * (r(i)^2 - r(i-1)^2)
and so the density of the annulus from r(i-1) to r(i) is
given by
D(i) = n(i) / a(i)
If the whole distribution contains N objects spread over a circle
of radius R, then the average density D is given by
D = N / (pi * R^2)
Thus the relative density of each annulus is given by D(i)/D. If we plot this against the distance from the seed-point, we would expect higher than average values of density within the cluster, lower than average values in the surrounding moat, and a settling down to average values (relative density close to 1.0) as we get further away from the seed-point.
The plot in Figure 6.5 shows this graph for the data-set used in the earlier examples, and you will note that it has a very large value for the centre of the cluster and much lower values for the other two clusters. With this small set of data, there is no sign of the data settling down to the average value, but it is not a typical problem: there are only 50 values in all, instead of the several hundred in most real problems, and they are organised into three clearly defined clusters. Many real problems have a scatter of points which do not belong to any cluster, and this will blur both the histogram and the density plot.
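The conversion from annulus counts to relative densities follows directly from the formulae above; this sketch assumes the two-dimensional case, as in the text.

```python
import numpy as np

def relative_density(counts, d, N, R):
    """Convert annulus counts n(i) into relative densities D(i)/D.
    The i-th annulus, from (i-1)*d to i*d, has area
    a(i) = pi*((i*d)**2 - ((i-1)*d)**2); the average density over a
    circle of radius R containing N objects is D = N/(pi*R**2)."""
    i = np.arange(1, len(counts) + 1)
    area = np.pi * ((i * d) ** 2 - ((i - 1) * d) ** 2)
    avg = N / (np.pi * R ** 2)
    return np.asarray(counts) / area / avg
```

Values well above 1.0 mark the cluster around the seed-point, values near zero mark the moat, and values near 1.0 indicate the background level.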
Mode Analysis.
This method is supplied in the Clustan package instead of any
of the density methods. It is described in terms of distance coefficients and
requires a parameter k. For each object in the matrix, we calculate the
average of the 2k smallest distances and call this a(i).
If a(i) is small, then many other objects lie close to
object i and so it occupies a dense area of attribute space. Sort the objects
from the data matrix into increasing order of a(i). This means that the
“densest” objects will be placed at the top of the array and so treated first.
Now work in terms of a threshold R, which is reset to R = a(i)
as each new object i is introduced; at each step there are four
possibilities.
1. The next object i is separated from all other “dense” points by a distance
greater than R. In this case, this object starts a new cluster.
2. The next object i is within distance R of only one cluster. In this case,
add object i to this cluster.
3. The next object is within distance R of more than one cluster. In this case
the clusters are merged together and the object is added to the new combined
cluster.
4. Each time the threshold
R is changed, all the clusters must be re-checked, and any which are now less
than distance R apart are merged together.
The cluster nuclei are defined as “dense” points, as
are any other objects within distance R of them. All other objects are
“unclassified” at this stage. The process continues until some finishing
criterion has been satisfied. This may be expressed as an upper limit on R, or
the percentage of objects to be classified, or the minimum number of clusters
required in the final configuration.
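A sketch of the whole loop, with the four cases marked in comments. The details (tie-breaking and the exact finishing criterion) are my own simplifications and will differ from the Clustan implementation; here the finishing criterion is an upper limit on R, so objects whose a(i) exceeds it remain unclassified.

```python
import numpy as np

def mode_analysis(dist, k, r_max):
    """Sketch of mode analysis: average the 2k smallest distances to
    get a(i), sort objects into increasing a(i) (densest first), then
    introduce them one at a time with threshold R = a(i)."""
    a = np.sort(dist, axis=1)[:, 1:2 * k + 1].mean(axis=1)
    order = np.argsort(a)
    clusters = []                         # each cluster is a set of dense points
    for i in order:
        R = a[i]
        if R > r_max:                     # finishing criterion: upper limit on R
            break                         # remaining objects stay unclassified
        touching = [c for c in clusters
                    if min(dist[i][j] for j in c) <= R]
        if not touching:
            clusters.append({i})          # case 1: object starts a new cluster
        elif len(touching) == 1:
            touching[0].add(i)            # case 2: add object to that cluster
        else:                             # case 3: merge clusters, add object
            merged = set().union(*touching) | {i}
            clusters = [c for c in clusters if c not in touching]
            clusters.append(merged)
        # case 4: R has changed, so re-check all cluster separations
        changed = True
        while changed:
            changed = False
            for x in range(len(clusters)):
                for y in range(x + 1, len(clusters)):
                    if min(dist[p][q] for p in clusters[x]
                           for q in clusters[y]) <= R:
                        clusters[x] |= clusters[y]
                        del clusters[y]
                        changed = True
                        break
                if changed:
                    break
    return clusters
```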
References.
Clustan package, sections MODE and DENSITY.
Everitt, B.S., Cluster Analysis, 2nd edition.
Copyright (c) Susan Laflin 1998.