Curve Fitting - Approximation.

Copyright (c) Susan Laflin. August 1999.

Least-Squares Approximation.

This method is mathematically elegant, but horribly ill-conditioned when you come to evaluate it, and so it should be used with great caution. It is derived by finding the minimum of the L2 norm of the error function. For the continuous case, where we wish to replace a complicated equation by a simpler one, the equation is:


SQRT{S} = || e(x,ak) ||
        = SQRT{ integral from a to b of w(x) * | f(x) - g(x,ak) |^2 dx }

Here w(x) is a weight function, frequently set equal to one everywhere, which allows us to attach greater weight to the values in part of the interval if we wish to do so.

For the discrete case, where we have values of (xj,yj) at a number of points and we wish to fit a curve y = g(x,ak), we have the equation:

SQRT{S}=|| e(xj,ak) || 
       = SQRT{ sum {j = 0 to n} w(xj)*( yj - g(xj,ak) )^2 }

Again we usually set w(xj) = 1 and give equal weight to each of the datapoints.
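
For concreteness, the discrete sum S can be evaluated directly. A minimal sketch in Python with NumPy (the function name residual_norm_squared, the candidate curve g and the optional weights are illustrative assumptions, not part of the method):

    import numpy as np

    def residual_norm_squared(xj, yj, g, w=None):
        # S = sum w(xj) * ( yj - g(xj) )^2, with unit weights when none are given.
        xj, yj = np.asarray(xj, dtype=float), np.asarray(yj, dtype=float)
        w = np.ones_like(yj) if w is None else np.asarray(w, dtype=float)
        return np.sum(w * (yj - g(xj))**2)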

Least-squares curve fitting is frequently used to find the best-fitting polynomial to a set of data points. To find the values of the coefficients ak which minimise the norm of the error function, we note that the same values give the minimum of ||e|| and the minimum of S = ||e||^2, and we set the partial derivatives dS/dak to zero. This gives a set of simultaneous linear equations to be solved for the unknown coefficients ak.

The disadvantage with this method is that this set of equations (known as the `normal equations') rapidly becomes very ill-conditioned. This means that as we solve the set of equations, we have to perform calculations on numbers which differ greatly in magnitude and so the roundoff errors, which are always present in calculations, become large, thus reducing the accuracy of the final result. For some sets of data, if we attempt to fit a polynomial of degree 7, we can finish up with NO correct digits in the resulting coefficients.
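
As a rough illustration of how quickly the conditioning deteriorates, the sketch below (Python with NumPy; the data range, the number of points and the degrees tried are arbitrary choices) builds the matrix of the normal equations for the monomial basis and prints its condition number:

    import numpy as np

    xj = np.linspace(0.0, 10.0, 50)                      # 50 equally spaced data points
    for degree in (1, 3, 5, 7):
        V = np.vander(xj, degree + 1, increasing=True)   # columns 1, x, x^2, ..., x^degree
        A = V.T @ V                                      # matrix of the normal equations
        print(degree, np.linalg.cond(A))                 # grows by many powers of ten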

Least-squares curve fitting is widely used to fit straight lines to sets of data (often called `linear regression') and it is a reliable method for this size of problem. Accordingly, the discussion here will give the general derivation for a quadratic and then consider examples of linear regression in some detail.

Derivation of Normal Equations.

In general, a polynomial may be expressed as a linear combination of basis polynomials, and the powers of x are the simplest choice of basis:

Pn (x) = sum{k=0 to n} ak*x^k

For a quadratic, n=2 and so we have:

P2(x) = a0 + a1*x + a2*x^2

To fit a set of data points ( (xj,yj), j=1,...,N) we need to minimise :

S = sum{j=1 to N} [ yj - P2(xj)] ^2

Setting the partial derivatives dS/dak to zero for each k gives the equations:

dS/dak = sum{j=1 to N} d/dak ( yj - a0 - a1*xj - a2*xj*xj )^2

Differentiating these gives the three equations:

	dS/da0 = 2 sum{j=1 to N}  (yj-a0-a1*xj-a2*xj*xj)*(-1)
	dS/da1 = 2 sum{j=1 to N}  (yj-a0-a1*xj-a2*xj*xj)*(-xj)
	dS/da2 = 2 sum{j=1 to N}  (yj-a0-a1*xj-a2*xj*xj)*(-xj*xj)

and each of these is equal to zero. Now we note that the summation only applies to xj and yj, and anything independent of j can be treated as a constant and taken outside the sums. This gives the set of equations:

	sum yj       = a0* sum 1    + a1* sum xj   + a2* sum xj^2
	sum xj*yj    = a0* sum xj   + a1* sum xj^2 + a2* sum xj^3
	sum xj^2*yj  = a0* sum xj^2 + a1* sum xj^3 + a2* sum xj^4

This can also be written in matrix form, which is the usual form for the normal equations. The pattern is already becoming clear and it should be easy to extend this to a polynomial of any degree.
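
For the quadratic case the normal equations can be assembled and solved directly. A sketch in Python with NumPy (the function name fit_quadratic is illustrative; xj and yj are the data arrays):

    import numpy as np

    def fit_quadratic(xj, yj):
        xj, yj = np.asarray(xj, dtype=float), np.asarray(yj, dtype=float)
        # Sums of powers of xj appearing in the normal equations: sum xj^0 .. sum xj^4.
        S = [np.sum(xj**k) for k in range(5)]
        # Right-hand sides: sum yj, sum xj*yj, sum xj^2*yj.
        b = np.array([np.sum(yj * xj**k) for k in range(3)])
        A = np.array([[S[0], S[1], S[2]],
                      [S[1], S[2], S[3]],
                      [S[2], S[3], S[4]]])
        return np.linalg.solve(A, b)                     # coefficients (a0, a1, a2)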

If it were possible to re-arrange this set of equations to make the matrix diagonally dominant, then we could be certain that the solution would be stable, or well-conditioned. However, you will notice that the lines of equal values run from bottom-left to top-right (along the anti-diagonals), which is the wrong direction for achieving diagonal dominance. Also, with the large range of powers of xj, the actual values vary by several powers of ten, and so the solution rapidly becomes unstable, leading to large errors.

Case 1. Minimising dy.

In this case, the normal equations will fit the data to a straight line with the equation:

y = a1*x + a0

We have to minimise the equation:

S = sum (yj - a1*xj -a0)^2

which gives the normal equations:

	a0*sum 1 + 	a1* sum xj 	=	sum yj
	a0*sum xj   + 	a1* sum xj*xj 	=	sum xj*yj

These two equations have to be solved for a0 and a1. There are many suitable methods, but Gaussian Elimination is probably the easiest and most efficient. All methods are easy for a problem of this size and because the amount of computation is small, the errors which build up during the computation will also remain fairly small.
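
A sketch of Case 1 in Python with NumPy (the function name fit_line is illustrative). Because the system is only 2x2, the elimination can be written out by hand instead of calling a general solver:

    import numpy as np

    def fit_line(xj, yj):
        # Normal equations for y = a1*x + a0:
        #   a0*N      + a1*sum xj   = sum yj
        #   a0*sum xj + a1*sum xj^2 = sum xj*yj
        xj, yj = np.asarray(xj, dtype=float), np.asarray(yj, dtype=float)
        N = len(xj)
        Sx, Sy = np.sum(xj), np.sum(yj)
        Sxx, Sxy = np.sum(xj * xj), np.sum(xj * yj)
        a1 = (N * Sxy - Sx * Sy) / (N * Sxx - Sx * Sx)   # eliminate a0 between the equations
        a0 = (Sy - a1 * Sx) / N
        return a0, a1

For example, fit_line(np.array([0.0, 1.0, 2.0]), np.array([1.0, 3.0, 5.0])) returns (1.0, 2.0), since those points lie exactly on the line y = 2x + 1.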

Case 2. Minimising dx.

This is similar to the previous example, but with x and y interchanged. It is relevant when the errors in xj are greater than those in yj.

Case 3. Minimising Perpendicular Distances.

This is relevant when we have large errors in both xj and yj. In this case, we need to minimise dj, the perpendicular distance between the line and the points (xj,yj).

Let the line have the equation y = a1*x + a0 and let (Xj,Yj) be the point where the normal through (xj,yj) meets the line. Then we wish to minimise:

S = sum dj^2 = sum ((xj-Xj)^2 + (yj-Yj)^2)

The equation of the line through (xj,yj) perpendicular to the line y = a1*x + a0 is:

	(y-yj) = {-1}/{a1} *(x-xj) 
or
 	a1 * (y-yj) = xj-x 

The two lines meet at the point (Xj,Yj) and so we have: 

	Xj = (yj+ {xj}/{a1} -a0) *{a1}/{(1+a1^2)}
and
	Yj = (yj + {xj}/{a1} + {a0}/{a1^2} )*{a1^2}/{(1+a1^2)}
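
As a quick check on these expressions, a small sketch in Python with NumPy (the function name foot_of_perpendicular is illustrative) which computes the same point after multiplying the formula for Xj through by a1:

    import numpy as np

    def foot_of_perpendicular(a0, a1, xj, yj):
        # (Xj, Yj) is where the normal through (xj, yj) meets the line y = a1*x + a0.
        Xj = (a1 * yj + xj - a0 * a1) / (1.0 + a1**2)
        Yj = a1 * Xj + a0
        return Xj, Yj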

Substituting the expressions for Xj and Yj into our equation for S and simplifying, we get:

		S = sum { (yj-a1*xj-a0)^2 }/{ (1+a1^2) }

Now we have to evaluate the partial derivatives and set them to zero.

dS/da0 = 0  gives   a0* sum 1 + a1* sum xj = sum yj

dS/da1 = 0 is more complicated, because a1 appears both in the squared terms and in the factor 1/(1+a1^2).

Eventually this gives a non-linear equation in a1. Solving this for a1 gives two values: one is the "best-fit" (the minimum of S) and the other the "worst-fit" (the maximum of S), so it should be easy to decide between them. Then substitute a1 in the first equation to find a0.
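
One way to carry this through, sketched in Python with NumPy (the centring of the data about its means and the function name fit_line_perpendicular are choices of this sketch, and degenerate cases such as a vertical best-fit line are ignored): eliminating a0 via dS/da0 = 0 reduces dS/da1 = 0 to a quadratic in a1, and evaluating S at both roots separates the best fit from the worst fit.

    import numpy as np

    def fit_line_perpendicular(xj, yj):
        xj, yj = np.asarray(xj, dtype=float), np.asarray(yj, dtype=float)
        # From dS/da0 = 0 we get a0 = ybar - a1*xbar, so centre the data about the means.
        xbar, ybar = np.mean(xj), np.mean(yj)
        Sxx = np.sum((xj - xbar)**2)
        Syy = np.sum((yj - ybar)**2)
        Sxy = np.sum((xj - xbar) * (yj - ybar))
        # dS/da1 = 0 then reduces to  Sxy*a1^2 + (Sxx - Syy)*a1 - Sxy = 0,
        # whose two roots are the slopes of the best-fit and worst-fit lines.
        roots = np.roots([Sxy, Sxx - Syy, -Sxy]).real
        def S(a1):
            a0 = ybar - a1 * xbar
            return np.sum((yj - a1 * xj - a0)**2) / (1.0 + a1**2)
        a1 = min(roots, key=S)               # keep the root giving the smaller S
        a0 = ybar - a1 * xbar
        return a0, a1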

While minimising the perpendicular distances of the points from the curve may give the most satisfactory results in many cases, the equations become excessively complicated for higher-order curves.

With this problem, as with so many in numerical analysis, we can continue to modify the problem in the hope of finding one which is easier to solve. We have already converted the problem of finding the MINIMUM of the L2 norm into the problem of solving a set of equations. Now we will make use of some properties of orthogonal polynomials (and in particular the Chebyshev polynomials) to find a solution of these equations.

Use of Chebyshev Polynomials.

If we have a set of polynomials Qi(xj), which are orthogonal over the set of values (xj, j=1,N), then for any two polynomials Qi(xj) and Qk(xj) belonging to the set (i not equal to k), we know

	sum Qi(xj) *  Qk(xj) = 0 
and 	sum Qi(xj) *  Qi(xj)  is non-zero 

Now any polynomial Pn(x) can be expressed as a linear combination of the orthogonal polynomials, i.e.

	Pn(x) = sum bi* Qi(x)

So applying the least-squares method to a polynomial of this form, we obtain the normal equations for P2(x) in the form:

	b0*sum Q0*Q0 + b1*sum Q0*Q1 + b2*sum Q0*Q2 = sum Q0*yj
	b0*sum Q0*Q1 + b1*sum Q1*Q1 + b2*sum Q1*Q2 = sum Q1*yj
 	b0*sum Q0*Q2 + b1*sum Q1*Q2 + b2*sum Q2*Q2 = sum Q2*yj

If the points xj are chosen so that the polynomials Qi(xj) are orthogonal over these points, then FOR THESE POINTS ONLY, the normal equations have a diagonal matrix and can be solved easily and accurately.
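
A minimal sketch of this point in Python with NumPy (the function name fit_orthogonal and the layout of Q are assumptions of the sketch): when the Qi are orthogonal over the chosen points, the off-diagonal sums vanish and each coefficient is just a ratio of two sums.

    import numpy as np

    def fit_orthogonal(Q, yj):
        # Q[i] is an array holding Qi(xj) at the data points; discrete orthogonality assumed.
        # The diagonal normal equations give bi = sum Qi(xj)*yj / sum Qi(xj)^2 directly.
        return [np.sum(Qi * yj) / np.sum(Qi * Qi) for Qi in Q]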

So we have exchanged the difficult problem of solving an ill-conditioned set of simultaneous linear equations for the problem of finding a set of polynomials which are orthogonal for our set of data points. In general, this problem is equally intractable and so this approach is no help. However there is one particular case for which we do know the answer. If the data points are equally spaced, we can find a set of Chebyshev polynomials which are orthogonal over these points.

The Chebyshev polynomials were discovered by the Russian mathematician Chebyshev (or Tchebyshev) and named after him, which is why we use the symbol Ti(x) for the first set of these polynomials.

They are defined over the interval (-1,+1) by the recurrence relationship:

	T0(x) = 1 	T1(x) = x 	
and	T{r+1}(x) = 2x*Tr(x) - T{r-1}(x)

So we have 
		T2(x) = 2x*T1(x) - T0(x) = 2x^2 - 1 
		T3(x) = 2x*(2x^2-1) - x  = 4x^3 - 3x		
and so on.
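
The recurrence translates directly into code. A sketch in Python with NumPy (the function name chebyshev_T is illustrative) which evaluates T0(x), ..., Tn(x) at an array of points:

    import numpy as np

    def chebyshev_T(n, x):
        # T0(x) = 1, T1(x) = x, and T{r+1}(x) = 2x*Tr(x) - T{r-1}(x).
        x = np.asarray(x, dtype=float)
        T = [np.ones_like(x), x]
        for r in range(1, n):
            T.append(2 * x * T[r] - T[r - 1])
        return T[:n + 1]                     # list of arrays T0(x) .. Tn(x)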

In particular, if x = cos t then Tn(x) = cos(n*t), and the polynomials are orthogonal over suitably chosen sets of values of t equally spaced over the interval (0,pi), for example the mid-point values t = (j+1/2)*pi/N for j = 0,...,N-1.

So for any set of values of xj equally spaced over the interval (a,b), we can convert these to points tj over the interval (-1,+1), over which the Chebyshev polynomials are orthogonal. The conversion is done in two steps.


1. Linear scaling to convert the interval (a,b) to the interval (0,pi).

		thetaj = (xj-a)*pi/(b-a)

2. Non-linear scaling to convert the interval (0,pi) to (-1,+1).

		tj = cos thetaj

Then the Chebyshev polynomials are orthogonal over the tj.
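
A sketch of the two-step conversion, continuing in Python with NumPy and reusing the chebyshev_T function sketched above. The interval (a,b) and the number of points N are arbitrary choices, and the equally spaced xj are placed at mid-intervals so that the angles thetaj are (j+1/2)*pi/N, a spacing for which the orthogonality sums below vanish to within roundoff:

    import numpy as np

    a, b, N = 2.0, 5.0, 8                    # arbitrary interval and number of points
    xj = a + (np.arange(N) + 0.5) * (b - a) / N

    thetaj = (xj - a) * np.pi / (b - a)      # step 1: (a,b) -> (0,pi)
    tj = np.cos(thetaj)                      # step 2: (0,pi) -> (-1,+1)

    T = chebyshev_T(N - 1, tj)               # T0(tj) .. T{N-1}(tj)
    print(np.sum(T[1] * T[3]))               # i != k : zero to within roundoff
    print(np.sum(T[2] * T[2]))               # i == k : non-zero (here N/2)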