Chapter Two: Similarity and Distance.

Similarity Matrices.

For t objects, we have a symmetric t x t matrix, calculated from the data matrix, which contains numbers giving a measure of the similarity of the objects to each other. The element s{i,j} gives a measure of the similarity of object i to object j, and so is equal to the element s{j,i}. In theory, it is possible to have an asymmetric similarity matrix, but all the methods used in practice lead to symmetric similarity matrices.

It would be easy to invent an example which leads to asymmetry. If we wished to study ancient trade-routes, we might calculate the similarity of one town to another according to the number of days taken to travel between them, using packhorses overland and boats when travelling by river. This would introduce asymmetry since boats travel faster with the current than they do against it. If n{i,j} is the number of days to travel from town i to town j, then our similarity could be:

IF n{i,j} < 1
THEN s{i,j} = 1.0
ELSE s{i,j} = 1.0/n{i,j}

This gives a similarity matrix, with the most similar towns having a similarity of 1.0 and the value decreasing as the towns become less similar, but it does introduce some asymmetry into the resulting matrix. Those methods which have been coded on the assumption that these matrices are always symmetric will be unable to cope with this. It is not necessary to produce asymmetric matrices, and I'm sure you can all think of other definitions of the similarity which would be just as useful for this problem and would give a symmetric matrix.

From now on, we shall assume all similarity matrices are symmetric. We shall also assume they are scaled to lie in the range 0 to 1, with identical objects having a similarity of 1. Since the diagonal elements of the matrix always give the similarity of an object to itself, these values will all be 1. Since the similarity matrices are usually very large (remember most studies involve several hundred if not several thousand objects) it can be very convenient to store only the elements below the diagonal and assume symmetry and diagonal elements equal to 1 when writing the code for our various methods.
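
Since this packed storage is easy to get wrong, here is a minimal sketch in Python (the names pack_index and similarity are hypothetical). Only the t(t-1)/2 elements below the diagonal are held in a one-dimensional array:

def pack_index(i, j):
    # Position of element (i,j), with i > j, in the packed array.
    # Rows 0..i-1 contribute 0 + 1 + ... + (i-1) = i*(i-1)/2
    # sub-diagonal elements in total. (0-based indices.)
    return i * (i - 1) // 2 + j

def similarity(packed, i, j):
    if i == j:
        return 1.0        # diagonal elements are always 1
    if i < j:
        i, j = j, i       # symmetry: s{i,j} = s{j,i}
    return packed[pack_index(i, j)]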

Sometimes we may prefer to consider the "dissimilarity" between the objects; this will be zero when the objects are identical and will increase steadily as they become more dissimilar. The dissimilarity, d{i,j}, may be calculated directly from the data matrix or it may be related to the value of similarity. The relation

d{i,j} = 1.0 - s{i,j}

gives a measure of dissimilarity whose maximum value is one. This is a useful property in some cases. The relation

d{i,j} = 1.0 / s{i,j}

has no upper bound, and the division will fail (or overflow) on any computer when the objects have a similarity of zero (completely dissimilar).
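
The division must therefore be guarded explicitly in any program; a sketch (the treatment of s = 0 is an assumption about how one might handle it):

def dissimilarity(s):
    # d = 1/s has no upper bound and fails outright at s = 0,
    # whereas d = 1 - s is always safe and bounded by one.
    if s == 0.0:
        return float('inf')   # completely dissimilar objects
    return 1.0 / s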

In many cases, this measure of dissimilarity may be thought of as a distance between the objects in attribute space. Strictly speaking, this is only true if the dissimilarity coefficients satisfy the conditions for a metric. If d{i,j} is the distance between i and j, then the following three conditions must be satisfied

d{i,j} = d{j,i}
d{i,j} >= 0, and d{i,j} = 0 if and only if i = j
d{i,k} + d{k,j} >= d{i,j}

The last condition is known as the "triangle inequality". Most of the rules for calculating dissimilarities from ordered data (numeric or integer) will automatically lead to distances in attribute space, but this may not always be the case for the binary or unordered multistate data.
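
For a complete dissimilarity matrix these conditions can be checked by machine; a minimal sketch in Python (the function name is hypothetical, and the check is O(t^3) because of the triangle inequality):

def is_metric(d):
    # d is a t x t dissimilarity matrix held as a list of lists.
    t = len(d)
    for i in range(t):
        for j in range(t):
            if d[i][j] != d[j][i]:
                return False              # symmetry fails
            if d[i][j] < 0.0:
                return False              # negative distance
            if (d[i][j] == 0.0) != (i == j):
                return False              # zero if and only if i = j
            for k in range(t):
                if d[i][k] + d[k][j] < d[i][j]:
                    return False          # triangle inequality fails
    return True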

Association Coefficients.

Binary Data.

For these coefficients, we compare the two rows of the data matrix, one for object i and the other for object j, and count the number of attributes for which the values occur as specified. If I is the value of the attribute for object i and J is its value for object j then:

a = number of cases where I=1 and J=1.
b = number of cases where I=1 and J=0.
c = number of cases where I=0 and J=1.
d = number of cases where I=0 and J=0.

Since we cannot have missing values for binary data, we know that for n attributes we must have,

a + b + c + d = n
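
All four counts come from a single pass along the two rows; a minimal sketch in Python (the name binary_counts is hypothetical):

def binary_counts(row_i, row_j):
    # Count the four kinds of agreement between two binary rows.
    a = b = c = d = 0
    for I, J in zip(row_i, row_j):
        if I == 1 and J == 1:
            a += 1
        elif I == 1 and J == 0:
            b += 1
        elif I == 0 and J == 1:
            c += 1
        else:
            d += 1
    return a, b, c, d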

Unordered Data.

In this case, there will be a fixed number of possible values for each attribute, some of which may be used to indicate missing data. Once again we compare the two rows of the data matrix and note whether the values differ or not. This time we have to count up as follows:

x = number of cases where the value of i equals the value of j.
y = number of cases where the value of i does not equal the value of j.
z = number of cases where one or both of the values are missing.

Once again,

n = x + y + z

and, relating this back to the binary case, we should have

x = a + d and y = b + c

The only case where this is not true is when the count d mixes zeros which mean absence of the attribute with zeros which mean missing data.
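
The counts for unordered data are obtained in the same way; a sketch, assuming missing values are marked by the hypothetical value MISSING:

MISSING = None   # hypothetical marker for missing data

def multistate_counts(row_i, row_j):
    # Count matches (x), mismatches (y) and missing comparisons (z).
    x = y = z = 0
    for I, J in zip(row_i, row_j):
        if I is MISSING or J is MISSING:
            z += 1
        elif I == J:
            x += 1
        else:
            y += 1
    return x, y, z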

Examples of Coefficients.

The literature contains many examples of these coefficients and the CLUSTAN package also contains a large selection. We shall look at just two examples, which give good results in most cases.

1. Simple Matching Coefficient.

In our general form, this has the equation

s{i,j} = x/n

The CLUSTAN package only covers this for binary data, and has three related forms, e.g.

Coefficient number 4.
s{i,j} = (a+d)/(a+b+c+d)

Coefficient number 5. Jaccard coefficient
s{i,j} = a / (a+b+c)

Coefficient number 13. Dot-product
s{i,j} = a / (a+b+c+d)

All of these are equivalent to the value "x/n" when generalised, and the reason for this duplication is the uncertainty over the significance of mutual absence of values, especially when this includes confusion between absent and missing data.
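
Given the counts a, b, c and d, the three forms may be written down directly; a sketch with hypothetical function names (the numbering is CLUSTAN's):

def simple_matching(a, b, c, d):
    return (a + d) / (a + b + c + d)   # coefficient 4

def jaccard(a, b, c, d):
    # Coefficient 5; assumes a + b + c > 0, i.e. the two objects
    # are not both rows of zeros.
    return a / (a + b + c)

def dot_product(a, b, c, d):
    return a / (a + b + c + d)         # coefficient 13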

2. Dice-Sorenson Coefficient (also called Czekanowski-Dice Coefficient).

This coefficient gives more weight to matches in the data than to differences and so it has the equation:

s{i,j} = 2x /(2x+y)

Once again the CLUSTAN package has two related forms for this coefficient, namely

Coefficient number 6.
s{i,j} = 2a/(2a +b+c)
Coefficient number 7.
s{i,j} = 2(a+d)/(2a+2d+b+c)

This weighting is contrary to the previous assumption that all attributes should have equal weight in determining the final classification, but there may be some circumstances in which we may prefer this coefficient.
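
For completeness, a sketch of the three Dice-Sorenson forms, again with hypothetical function names:

def dice(x, y):
    # General form 2x/(2x + y); assumes not every attribute is missing.
    return 2 * x / (2 * x + y)

def dice_binary(a, b, c):
    return 2 * a / (2 * a + b + c)                 # coefficient 6

def dice_binary_with_d(a, b, c, d):
    return 2 * (a + d) / (2 * a + 2 * d + b + c)   # coefficient 7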

3. Distance Coefficients.

If we want to calculate a "distance" or dissimilarity coefficient for this type of data, we may obtain it from the similarity coefficient, probably from the formula:

d{i,j} = 1.0 - s{i,j}

or we may calculate it directly from the data by a "Simple Difference Coefficient" using the formula

d{i,j} = y/n = y/(x+y+z)

If there is no missing data, then these two values will be identical. If some data is missing, then we have from the first equation

d{i,j} = 1 - x/(x+y+z) = (x+y+z - x)/(x+y+z)
and so
d{i,j} = (y + z)/n

thus implying that all missing values are unequal for the two objects. This may be true, but there is no justification for assuming that it must be true for all missing data. Consequently, when data is missing, we should calculate the coefficients directly from the data matrix.
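
A short sketch makes the difference explicit, using the (x, y, z) counts defined earlier:

def simple_difference(x, y, z):
    # Direct calculation d = y/n: missing values add nothing to
    # the count of differences.
    return y / (x + y + z)

def difference_from_similarity(x, y, z):
    # Indirect calculation d = 1 - x/n, which equals (y + z)/n and
    # so treats every missing value as a mismatch.
    return 1.0 - x / (x + y + z)

# The two agree whenever z = 0, i.e. when no data is missing.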

Ordered Data.

When we are dealing with numeric data, it is most unlikely that any of the readings will be duplicated exactly. Consequently we shall have to extend the idea of matching and say that "when the difference between values is less than some small number, we count this as a match". i.e.

if |a{k,i} - a{k,j}| < e

then attribute k is a match. With this extension, it is possible to derive association coefficients for numeric data, but they are not well suited to this approach.

When we are dealing with integer or ordered-multistate data, there are a limited number of values and we can check for equality. It is easier to apply this method to integer data, but again it is not well suited to this approach.

For these types of data, the ordering implies information about the distances between unequal values and it is better to choose a coefficient which allows us to make use of this extra information.

Metric Coefficients.

These are distance coefficients which can only be applied to ordered data. Distances in attribute space are calculated according to a number of formulae.

"Manhattan" or "City-block" metric.

If d{i,j} is the distance between objects i and j, a{k,i} is the value of attribute k for object i and n is the number of attributes, then we have the formula

d{i,j} = sum(k=1..n) |a{k,i} - a{k,j}|

In two dimensions, this value is obtained by taking the difference in x-values and adding it to the difference in y-values. When walking from one place to another in a city laid out in rectangular blocks, you have to walk the required distance along the east-west streets and then along the north-south streets. Hence the name "city-block" for this metric.

Euclidean metric.

This has the formula:

d{i,j} = sqrt( sum(k=1..n) (a{k,i} - a{k,j})^2 )

In two-dimensional space, the distances resulting from this are the straight-line distances joining the two points.

General Minkowski Metric.

This has the formula:

d{i,j} = ( sum(k=1..n) |a{k,i} - a{k,j}|^r )^(1/r)

Although this is a more general form which includes the previous two metrics (r = 1 gives the Manhattan metric and r = 2 the Euclidean), it is unusual to use values of r > 2.
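
All three metrics may be computed by one routine; a minimal sketch, assuming each object is held as a sequence of its n attribute values:

def minkowski(obj_i, obj_j, r):
    # r = 1 gives the Manhattan metric, r = 2 the Euclidean metric.
    total = sum(abs(a - b) ** r for a, b in zip(obj_i, obj_j))
    return total ** (1.0 / r)

# For the points (1,2) and (4,6):
# minkowski([1.0, 2.0], [4.0, 6.0], 1) = 3 + 4 = 7.0       (city-block)
# minkowski([1.0, 2.0], [4.0, 6.0], 2) = sqrt(9 + 16) = 5.0 (straight line)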

All these metrics have the same disadvantage: there is no upper bound on the value, and so they may become very large for some values of some attributes. To overcome this, two metrics have been proposed which have an upper bound of one. This means that we may also define similarities by the equation

s{i,j} = 1.0-d{i,j}

Canberra Metric.

This has the equation:

d{i,j} = (1/n) sum(k=1..n) |a{k,i} - a{k,j}| / (a{k,i} + a{k,j})

The value will never exceed one, and each term equals one whenever either of the attribute values is zero. Thus it would seem to be a good expression to use. Its disadvantage is rather a subtle one.

Consider two values of one particular attribute, say 2.0 and 4.0. Their difference is 2.0 and their contribution to the Canberra metric is given by

(4.0 - 2.0)/(4.0 + 2.0) = 1/3

Now consider two more values of this attribute, differing by the same amount, but whose values are 24.0 and 22.0. In this case their contribution to the Canberra metric will be

(24.0 - 22.0)/(24.0 + 22.0) = 1/23

Thus the same difference between attributes will generate different values of distance at different parts of attribute space. This warping effect may lead to inaccurate analyses in some cases.
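
A sketch of the Canberra metric; the case where both values are zero gives 0/0, and counting it as a zero contribution is an assumption not stated above:

def canberra(obj_i, obj_j):
    # Scaled by 1/n so the result lies in the range (0,1).
    n = len(obj_i)
    total = 0.0
    for a, b in zip(obj_i, obj_j):
        if a == 0.0 and b == 0.0:
            continue              # treat 0/0 as zero (assumption)
        total += abs(a - b) / (a + b)
    return total / n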

Scaled Minkowski Metric.

For each attribute, we have to calculate the range of values taken by that attribute,

R(k) = |max(a{k,i}) - min(a{k,i}) |

measured over all the objects in the study. When r = 1, we are dealing with the scaled Manhattan metric and in this case the equation takes the form

d{i,j} = (1/n) sum(k=1..n) |a{k,i} - a{k,j}| / R(k)

Because the range R(k) is constant for each attribute, this does not have the warping effect of the Canberra metric. It does, however, require extra computation to search all the values of each attribute and find its maximum and minimum. If another object with a value of some attribute outside the already-calculated range is added to the study, then all the distance values will have to be recalculated. In some cases, it may be possible to decide on the maximum and minimum values which are theoretically possible for an attribute, and such a range would not need any recalculation.
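
A sketch, assuming the data is held column-wise so that data[k] lists the values of attribute k for every object (a hypothetical layout), and that no attribute is constant:

def attribute_ranges(data):
    # R(k) = max - min of attribute k over all objects.
    return [max(col) - min(col) for col in data]

def scaled_manhattan(data, R, i, j):
    # Scaled Manhattan metric, lying in the range (0,1).
    n = len(data)
    return sum(abs(data[k][i] - data[k][j]) / R[k] for k in range(n)) / n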

General Coefficients for Mixed Data.

Gower's General Coefficient.

This was proposed for a mixture of binary and numeric data and assumes some method of indicating missing data. It gives a similarity in the range 0 to 1 for objects described by such data.

The equation is:

s{i,j} = sum(k=1..n) w{k} * s{i,j,k} / sum(k=1..n) w{k}

where the weight w{k} is defined by

w{k} = 0 if the comparison is invalid.
w{k} = 1 if the comparison is valid. (The usual reason for an invalid comparison is that one or both of the attribute values are missing.)

The value s{i,j,k} is defined by the expressions:

s{i,j,k} = 0 if w{k} = 0, or if we have a binary attribute whose values are unequal.

If w{k} = 1 then
s{i,j,k} = 1 if we have a binary attribute whose values are equal.
s{i,j,k} = 1 - |a{k,i} - a{k,j}|/R(k) if we have a numeric attribute of range R(k).

This makes use of the Simple Matching Coefficient and the Scaled Manhattan Metric to give a general coefficient for binary and numeric data.
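
A sketch of the whole coefficient for two rows of mixed data, assuming a hypothetical list kinds marking each attribute as 'binary' or 'numeric', None marking missing values, and at least one valid comparison:

def gower(row_i, row_j, kinds, R):
    # R[k] is the range of numeric attribute k; entries of R for
    # binary attributes are ignored.
    num = den = 0.0
    for k, (a, b) in enumerate(zip(row_i, row_j)):
        if a is None or b is None:
            continue                      # w{k} = 0: invalid comparison
        den += 1.0                        # w{k} = 1
        if kinds[k] == 'binary':
            num += 1.0 if a == b else 0.0
        else:
            num += 1.0 - abs(a - b) / R[k]
    return num / den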

Laflin's General Coefficient.

Following on from Gower's coefficient, we note that if we have N1 binary attributes and N2 numeric attributes, we may write the same coefficient in a different form:

s{i,j} = (N1*s1 + N2*s2)/(N1 + N2)

where s1 is the Simple Matching Coefficient for the binary data and s2 = 1 - d1, d1 being the Scaled Manhattan metric for the numeric data. We may extend this to the case where we have:

N1 Numeric attributes (real numbers).
N2 Binary attributes.
N3 Integer attributes (ordered multistate).
N4 Multistate attributes (unordered).

For each of these groups of attributes, we calculate a similarity, scaled to lie in the range (0,1), with suitable provision for missing values. Then the general similarity coefficient is given by the equation:

s{i,j} = (N1*s1 + N2*s2 + N3*s3 + N4*s4)/(N1 + N2 + N3 + N4)

This form of equation ensures that each attribute makes an equal contribution to the measure of similarity between the objects.
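
The combination step itself is short; a sketch, assuming the four group similarities have already been calculated and scaled to the range (0,1):

def general_similarity(counts, sims):
    # counts = [N1, N2, N3, N4], sims = [s1, s2, s3, s4]; each group
    # contributes in proportion to the number of attributes in it.
    return sum(N * s for N, s in zip(counts, sims)) / sum(counts)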

Similarly, we may define a general distance coefficient, with the values for numeric and integer coefficients given by a suitable metric, and those for binary and multistate attributes probably given by the Simple Difference Coefficient. All these distances/differences must be scaled to the same range, and since the Simple Difference Coefficient gives values in the range (0,1) it seems reasonable to use the Scaled Minkowski metric for the distances, giving a final coefficient in the range (0,1). The equation for distances takes the same form as that for similarities.

Reading and Exercises.

1. Chapter Four of "Numerical Taxonomy" by Sneath and Sokal gives a good account of this topic. Other accounts may be found in "Cluster Analysis" by B. S. Everitt (2nd edition, chapter 2) and "An Introduction to Mathematical Taxonomy" by G. Dunn and B. S. Everitt, chapter 3. The manual for the CLUSTAN package contains a vast number of similarity and distance coefficients, many more than you are ever likely to need.

2. Design a procedure which will calculate a General Coefficient, either for similarities, using the Simple Matching Coefficient with its extension for numeric data, OR for distances, using the Simple Difference Coefficient and the scaled Euclidean metric. In either case, you will need some means of indicating which attributes are of which type. You may require the user to group the types of data, so that he or she must input all the numeric data before proceeding to the integer data and so on, or you may arrange to accept data in any order and then treat the different types of data in the appropriate ways.

3. The assessment for this course will require you to develop software to carry out cluster analysis on several sets of data and interpret the results. Design, write and test a procedure, function or subroutine, in the language of your choosing, to accept n numeric attributes for each of t objects and calculate a distance or similarity matrix for this data.

Copyright (c) Susan Laflin 1998.