Lecture 3:

Least Squares and regression techniques, goodness of fit and tests, non-linear least squares techniques

Relevant Reading:


Table of Contents (Last Changed 27 Sep 2002)

3.1 Statistical Basis for Regression:

3.2 Least Squares Fit to a Straight Line:

3.3 General Linear Least Squares:

3.4 Non-Linear Least Squares:



3.1 Statistical Basis for Regression

Often in working with data, we must ask specific quantitative questions about how well the data reflects some underlying model of processes in nature. The question may arise, for example, as to whether some property (e.g., dissolved oxygen in seawater) changes linearly with time. The actual rate of change (the slope of the relationship) may be of quantitative interest, or the initial value (the intercept of the relationship) may be important. Moreover, in addition to obtaining the most accurate estimate of one or more of these parameters, we need to know how precisely we know the values, or more explicitly, the confidence interval (range of probable values) of the parameters. Finally, we must decide if our linear model (or some other function) is appropriate or robust by checking the goodness-of-fit.

3.1.1 Underlying Distributions Sometimes Lie

Implicit in the Least Squares techniques is the belief that there is some underlying “truth” that we are attempting to model, however imperfectly, due to errors in our observations. A typical assumption is that the data are normally distributed, i.e., the deviations of the actually measured data from the underlying, true values, if plotted as a histogram, form a Gaussian (bell shaped) curve. As experimentalists, or as consumers of experimental data, you must realize that although the bulk of experimental data does indeed adhere to this Gaussian ideal, there is a tendency for unknown, episodic interferences to produce outliers (also known as fliers) whose probability of occurrence approaches the astronomically small under the assumption of a normal distribution. We are taught in undergraduate laboratories that with a Gaussian distribution, only about 1/3 of the data lies more than one standard deviation from the mean (the standard deviation is the root mean square width, aka RMS width, of the Gaussian curve). Further, you learn that only about 5% fall more than two standard deviations from the mean, about 0.3% more than three standard deviations, etc. What is the probability, then, of that point you just measured that is 10 standard deviations away from the others? Try less than one in 10^11. To put this in perspective, having just one flier like that, you would be obliged to make at least of order 10^11 more measurements to adequately unbias the data set that has been contaminated by that one flier. Thus the take-home lesson is that you must be very vigilant in determining whether you have outliers influencing your data. Often what experimentalists do, if they want an unbiased estimate of some specific parameter (e.g., a mean, slope, or intercept), is to first fit all of the data to the model, eliminate the improbable data (i.e., those data more than 3 or 4 standard deviations from the model fit), and then refit the model to the reduced data set. If you find yourself throwing out large portions of your data, though, you had better think again.
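To make that refit-after-trimming recipe concrete, here is a minimal MATLAB sketch for the simplest case, a mean (the 3-standard-deviation cutoff and the variable names are illustrative choices, not a prescription):

x     = data(:);                          % your measurements, as a column vector ("data" is a placeholder)
resid = x - mean(x);                      % deviations from the first-pass fit (here, just the mean)
keep  = abs(resid) <= 3*std(x);           % flag points within 3 standard deviations of the fit
xmean = mean(x(keep));                    % refit (recompute the mean) with the improbable points removed
fprintf('discarded %d of %d points\n', sum(~keep), length(x));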

3.1.2 The Chi Squared Defined (and Goodness of Fit)

How do you judge (aside from the eyeball test) the goodness of a fit? The most common goal is to reduce the "distance" between the observations and the model. The standard measure, which can be derived from the Gaussian nature of the underlying distributions, is the Chi-Squared, which is defined as

$$\chi^2 = \sum_{i=1}^{N}\left(\frac{y_i - y_f}{\sigma_i}\right)^2 \qquad\qquad (3.1.1)$$

where yf is the model (the fit) estimate, yi is the actual observation, and σi in the denominator is the uncertainty in the individual measurement (yi). You could of course choose other measures of distance, e.g., the absolute values |yi − yf|. But chi-squared has the nice property that it results in relatively simple, analytical forms, and it can be derived for a normal distribution from the method of Maximum Likelihood. You may arrive at the choice of σi in a number of ways:

Note also that this σi may (and often will) vary from measurement to measurement, and hence represents a kind of weighting factor that you might use when incorporating data into a larger set. That is, you would not want to value a poorly made or inaccurate measurement as much as a more carefully made, precise measurement.

Regardless of how you arrived at your choice of σi, you would expect the Reduced Chi Squared, which is really just the mean squared deviation normalized to the measurement errors, to be close to 1 if things are working correctly. Here we define the reduced chi squared (note the subscript r, to distinguish it from its bigger brother) as

$$\chi_r^2 = \frac{\chi^2}{\nu} = \frac{1}{\nu}\sum_{i=1}^{N}\left(\frac{y_i - y_f}{\sigma_i}\right)^2 \qquad\qquad (3.1.2)$$

where ν = N − n is the number of degrees of freedom, N is the number of data points, and n is the number of coefficients or parameters used in the regression fit. For example, computing a mean gives N − 1 as the degrees of freedom, and a regression to a straight line gives N − 2.

If your reduced chi squared is much larger than 1, say 10 or 100, it means either that you are doing a lousy job making measurements (i.e., something is wrong with your apparatus or technique), or that you have been overly optimistic about your measurement uncertainties. Another possibility is that you have a bad model, that is, an inappropriate fitting function; for example, the data may be better fit by a quadratic or exponential relationship than by a straight line. If the reduced chi squared is too small, say 0.1 or 0.01, it may mean that you have been too pessimistic about your measurement errors. It is very important to pay careful attention to the estimation of your measurement errors.
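As a concrete illustration (not from the course m-files), the reduced chi squared for a straight-line fit might be computed in MATLAB as follows, assuming column vectors x, y, and sigma and fitted coefficients a(1) (intercept) and a(2) (slope):

yfit   = a(1) + a(2)*x;                 % model estimates at each x
chisqr = sum(((y - yfit)./sigma).^2);   % equation (3.1.1)
nu     = length(y) - 2;                 % degrees of freedom: N - 2 for a straight line
redchi = chisqr/nu;                     % equation (3.1.2); should be near 1 for a good fit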


3.1.3 Look at Your Residuals

Finally, a good reduced chi squared may not mean you have a good fit or model. It may be a conspiracy between the above factors, or that you have overlooked something. Look at your data, and particularly look for patterns or structure in the residuals of the data. That is, plot up the differences between your data and the values predicted by your regression model. If the deviations between your data and your model fit show a characteristic large scale structure, this is indicative of unresolved and unfit characteristics. Look, for example, at the two plots below:

Figure 3.1.1 The data in the left panel appear to be a good fit to a straight-line. But when you look at the data residuals (right panel) from the least squares fit, you see that the residuals actually have structure (in this case a sine wave)!

The left hand figure shows a straight line regression to a data set in which the fit looks rather good, yet when you subtract the fit values from the observations (figure on the right) you see a definite pattern. The reduced chi squared is just one of many diagnostics of model success, and like any single number indicator, does not tell the whole story.

If you do not have an independent determination of your measurement uncertainty (this, however, should be a most unusual and unphysical thing to happen) do not despair. You can use the reduced chi squared as a means of determining your measurement error, presupposing that you know you have a good "model" of your data, i.e., that you are fitting it to the correct curve. Regardless of your situation, one thing is important to remember:

MINIMIZATION OF THE CHI SQUARED IS THE FOUNDATION OF ALL THE LEAST SQUARES REGRESSION TECHNIQUES.

Hence the term "least squares". All of the routines discussed below are aimed at finding the model parameters (coefficients) which minimize the chi squared of a given data population.



3.2 Least Squares Fitting a Straight Line

Perhaps the most common model of data regression (aside from mean and standard deviation) is the fit to a straight line. It is also the easiest formulation to visualize and derive from basic principles (although we will try to convince you that there is an even more robust, succinct, general, and easy to understand approach in section 3.3). By substituting into the definition of χ² the formula for a straight line, y = a1 + a2x (where a1 is the intercept and a2 is the slope), you have

$$\chi^2 = \sum_{i=1}^{N}\left(\frac{y_i - a_1 - a_2 x_i}{\sigma_i}\right)^2 \qquad\qquad (3.2.1)$$

The name of the game is to choose values for the coefficients a1 and a2 to minimize the value of chi squared. What does that really mean? If you look at the figure below, it is equivalent to minimizing the sum of the squares of the lengths of the green lines, which are the vertical distances between the observations (yi) and the regression estimated values (a1 + a2xi):

Figure 3.2.1 The basis of the least-squares method is to minimize the sum of the squared model-data misfit, for a Type I regression taken as the vertical distance between the data points and the fit line.


3.2.1 Doing Things the Hard Way (the Normal Equations)

Now how do we choose the coefficients a1 and a2 to minimize chi squared? Well, you need to recall your first-year calculus course. The extrema (maxima or minima) of a function are characterized by its first derivative going to zero. So if we differentiate the chi squared with respect to each of the coefficients and set the derivatives equal to zero, we have two equations in two unknowns. That is, we have

$$\frac{\partial \chi^2}{\partial a_1} = -2\sum_{i=1}^{N}\frac{y_i - a_1 - a_2 x_i}{\sigma_i^2} = 0 \qquad\qquad (3.2.2)$$

and

$$\frac{\partial \chi^2}{\partial a_2} = -2\sum_{i=1}^{N}\frac{x_i\left(y_i - a_1 - a_2 x_i\right)}{\sigma_i^2} = 0 \qquad\qquad (3.2.3)$$

This can be reduced to two equations in two unknowns (the coefficients),

$$a_1 S + a_2 S_x = S_y \qquad\qquad (3.2.4)$$

$$a_1 S_x + a_2 S_{xx} = S_{xy} \qquad\qquad (3.2.5)$$

where we have used the shorthand notation

$$S = \sum_{i=1}^{N}\frac{1}{\sigma_i^2},\qquad S_x = \sum_{i=1}^{N}\frac{x_i}{\sigma_i^2},\qquad S_y = \sum_{i=1}^{N}\frac{y_i}{\sigma_i^2},\qquad S_{xx} = \sum_{i=1}^{N}\frac{x_i^2}{\sigma_i^2},\qquad S_{xy} = \sum_{i=1}^{N}\frac{x_i y_i}{\sigma_i^2} \qquad (3.2.6)$$

Keep in mind that the above are just numbers that you simply and mechanically compute from your data. You then have a pair of simultaneous equations that you can easily solve for your coefficients. You can solve this with MATLAB by first populating a matrix A with the sums (the S's), such that

$$A = \begin{pmatrix} S & S_x \\ S_x & S_{xx} \end{pmatrix} \qquad\qquad (3.2.7)$$

and (for the right hand side of the equations),

$$b = \begin{pmatrix} S_y \\ S_{xy} \end{pmatrix} \qquad\qquad (3.2.8)$$

and thus you solve the equations Aa = b (here the unknowns to solve for are the coefficients and are in the column vector "a") like we did in lecture number 1 with a = A\b. Pretty simple, huh?
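A minimal MATLAB sketch of this recipe might look like the following, assuming x, y, and sigma are column vectors of your data and measurement uncertainties (the variable names are illustrative):

w   = 1./sigma.^2;                            % weights, 1/sigma_i^2
S   = sum(w);        Sx  = sum(w.*x);    Sy  = sum(w.*y);
Sxx = sum(w.*x.^2);  Sxy = sum(w.*x.*y);      % the sums of equation (3.2.6)
A   = [S Sx; Sx Sxx];                         % equation (3.2.7)
b   = [Sy; Sxy];                              % equation (3.2.8)
a   = A\b;                                    % a(1) = intercept, a(2) = slope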

These equations are called the normal equations and can be generalized for higher order polynomial fits. For example, if you were to fit a quadratic equation, y = a1 + a2x + a3x2, the normal equations would be

$$\begin{aligned} a_1 S + a_2 S_x + a_3 S_{xx} &= S_y \\ a_1 S_x + a_2 S_{xx} + a_3 S_{xxx} &= S_{xy} \\ a_1 S_{xx} + a_2 S_{xxx} + a_3 S_{xxxx} &= S_{xxy} \end{aligned} \qquad\qquad (3.2.9)$$

and so on. Do note one thing, however: the size of the sums begins to mount very rapidly (for the quadratic, you are summing up the 4th powers of x) and roundoff errors soon become a problem.

The solution to the normal equations for the straight line regression can be easily obtained in MATLAB using the a = A\b solution. Commonly, however, you'll find these equations solved by something called Cramer's Rule (we won't go into this here), which gives explicitly

$$\begin{aligned} a_1 &= \frac{S_{xx} S_y - S_x S_{xy}}{D} \\ a_2 &= \frac{S\, S_{xy} - S_x S_y}{D} \end{aligned} \qquad\qquad (3.2.10)$$

where we have defined the denominator (actually the determinant of A) as

$$D = S\, S_{xx} - \left(S_x\right)^2 \qquad\qquad (3.2.11)$$

Remembering the concept of singular matrices, if A is singular, due to inadequate data, then the determinant will be zero, and the solutions to equations 3.2.10 will fail for obvious reasons. This is almost never a problem for straight-line fits but can be problematic for more complex situations, where the normal equations become "almost singular" (we will talk about this more later).


3.2.2 Uncertainties in Coefficients:

But we also need to know the uncertainties in the coefficients. This is a relatively straightforward thing to do, if a little tedious to calculate. But if you assume there are no systematic errors in your measurements (systematic errors induce covariances, not random errors), then you can derive that the uncertainty in any function f(x) is given by

$$\sigma_f^2 = \sum_{i=1}^{N} \sigma_i^2 \left(\frac{\partial f}{\partial y_i}\right)^2 \qquad\qquad (3.2.12)$$

But here, the function f that we are interested in is a1 (or a2), and the variables are the yi, since those are the ones that are uncertain. We differentiate equations 3.2.10 with respect to the observations yi (i.e., the contribution of each data point to the fit parameters) and substitute into the equation above to obtain

$$\begin{aligned} \sigma_{a_1}^2 &= \frac{S_{xx}}{D} \\ \sigma_{a_2}^2 &= \frac{S}{D} \end{aligned} \qquad\qquad (3.2.13)$$

Note something quite profound in the structure of equations 3.2.13: the size of the uncertainties in the coefficients depends not on the data you measured (the yi's) but on where you made the measurements (the xi's) and on the uncertainties in the measurements (the σi's). No summation incorporating yi is involved! This says something about experiment design; obviously maximizing D (the determinant of A) in relation to Sxx and S is a good thing. Maximizing the determinant is a question of maximizing the difference between S·Sxx and (Sx)², which is equivalent to maximizing the spread (range of x values) of your measurements. The larger the range in your data, the better you know the slope and intercept. That makes sense.
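Continuing the MATLAB sketch above (same assumed variable names), the coefficient uncertainties of equation 3.2.13 follow in a couple more lines:

D   = S*Sxx - Sx^2;        % the determinant, equation (3.2.11)
sa1 = sqrt(Sxx/D);         % uncertainty in the intercept, equation (3.2.13)
sa2 = sqrt(S/D);           % uncertainty in the slope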

Figure 3.2.2 The error in the slope estimate depends on the x sampling locations; closely clumped samples in x will have a larger error.

Look again at the upper equation in 3.2.13. The numerator (Sxx) says that the larger the 2nd moment of the x distribution (i.e., roughly, the farther your data lie from the x = 0 axis, where the intercept is), the larger your intercept error. This also makes sense.

Figure 3.2.3 The error estimate for the intercept depends on how far the x sampling locations are from x=0.

But where does the number of measurements come into this? You would expect that increasing the number of measurements (all other things being equal) would improve our knowledge of the coefficients. It only seems fair. Well, N does come into the coefficients, but through the determinant D. The reason is not too subtle to argue here. Consider a random distribution of measurements (i.e., various xi's) centered around zero (this argument works regardless of this stipulation, but it makes it easier to understand this way). Note the terms in the equation that defines the determinant. The second term is the square of Sx, which is the sum of the xi's. This sum will be close to zero, since about half of the xi's will be negative and the other half positive. The first term, Sxx, which is the sum of the squares of the xi's, will always be positive and will continue to grow as the number of measurements increases. Thus D will always increase for increasing numbers of measurements. The figure below shows the uncertainties in the intercept (upper curve) and slope (lower curve) for a random group of measurements of varying numbers (horizontal axis).

Figure 3.2.4 The uncertainty estimates (y-axis) for the intercept (upper, red line) and slope (lower, green line) parameters from a least-squares fit decrease as the number of data points increases (x-axis), shown here for randomly generated data.

You could easily do this yourself because we have a MATLAB routine called linfit.m, which you can download and use as a general straight line fitting routine. It is very simple to use, requiring three arguments (x, y, and sy) that you need to supply. It returns the coefficients, uncertainties in the coefficients, the covariance of intercept on slope, and the linear correlation coefficient.


3.2.3 Uncertainties in an Estimated Y-value

Now that you have determined the coefficients of your straight-line fit, and the uncertainties in those coefficients, you might want to go ahead and calculate some best fit value of ye at some desired location xe. That's easy, ye = a1 + a2xe does the trick. But how well do you know this new estimate? Your first instinct is to use the error propagation equation, but that estimate

$$\sigma_{y_e}^2 = \sigma_{a_1}^2 + x_e^2\, \sigma_{a_2}^2 \qquad\qquad (3.2.14)$$

is fundamentally wrong! You have left out the fact that the uncertainties in the intercept (a1) and the slope (a2) are correlated and a third term enters into the above equation involving the covariance of the two uncertainties

$$\sigma_{y_e}^2 = \sigma_{a_1}^2 + x_e^2\, \sigma_{a_2}^2 + 2 x_e\, \sigma_{a_1 a_2}^2 \qquad\qquad (3.2.15)$$

This is a specific case of error propagation, which gives the uncertainty in a function f due to uncertainties in variables xi as

$$\sigma_f^2 = \sum_{i=1}^{m}\sum_{j=1}^{m} \frac{\partial f}{\partial x_i}\,\frac{\partial f}{\partial x_j}\, \sigma_{ij}^2 \qquad\qquad (3.2.16)$$

where m is the number of variables and where i and j vary over all possible combinations of the variables. The factor of two in the last term of the explicit equation above is because the covariance of a1a2 is the same as the covariance of a2a1 (the covariance matrix is always symmetric) and thus enters twice into the summation.

The σ²a1a2 in the very last term is the covariance term that is returned from linfit.m, and is defined by

$$\sigma_{a_1 a_2}^2 = \frac{-S_x}{D} \qquad\qquad (3.2.17)$$

Look again at the plots above, and note how the intercept changes with the slope. The only time that the two uncertainties are uncorrelated is when the x-variable is normalized so that its mean is zero (you can picture the slope pivoting around the centroid of the data cloud, and since the intercept now resides at the centroid, the two coefficients are independent). If you do this, then Sx = 0, and equations 3.2.10 and 3.2.11 reduce to

$$\begin{aligned} a_1 &= \frac{S_y}{S} \\ a_2 &= \frac{S_{xy}}{S_{xx}} \\ D &= S\, S_{xx} \end{aligned} \qquad\qquad (3.2.18)$$

And the uncertainties reduce to

$$\begin{aligned} \sigma_{a_1}^2 &= \frac{1}{S} \\ \sigma_{a_2}^2 &= \frac{1}{S_{xx}} \end{aligned} \qquad\qquad (3.2.19)$$

which makes life a little simpler. So all you need to do is subtract the mean of x from your x-values before you do the regression and add it back in before you calculate your results. It pays to normalize! In the last lecture we normalized by dividing each x by the sum of all x, but subtracting the mean works too. Which you choose depends on the units: if the x and the y are in the same units this trick will work, but if they are in different units it is better to standardize.
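Putting equations 3.2.13, 3.2.15, and 3.2.17 together, a short MATLAB sketch of the correct error propagation for an estimated y-value might look like this (again using the assumed variables from the sketches above; xe is an arbitrary prediction location):

cov12 = -Sx/D;                                    % covariance of intercept and slope, equation (3.2.17)
xe    = 2.5;                                      % an arbitrary location at which to predict y
ye    = a(1) + a(2)*xe;                           % the best-fit estimate
sye   = sqrt(Sxx/D + xe^2*(S/D) + 2*xe*cov12);    % equation (3.2.15), NOT the naive (3.2.14)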


3.2.4 Type II Regressions (Two Dependent Variables):

The regression performed above presumes that you know x infinitely well. That is one way of defining the independent variable. What happens when both y and x have uncertainties? Do you regress y against x, or x against y? It turns out that neither is correct. You can prove this experimentally by taking the same data set and doing both. In a perfect world, the slope of y regressed against x should be the inverse of the slope of x regressed on y. When you try this for a synthetic data set that has some noise in it, however, you get a difference:

Figure 3.2.5 Type I regressions (assume one independent variable and one dependent variable) can give very different slope estimates depending on whether y is regressed on x or x is regressed on y. For many real world applications, both variables have error and a Type II regression should be used.

In that case the process of minimizing only the vertical distances between your data and the fit line is not correct; you should instead be minimizing the perpendicular distance between your data points and the regression line. This sounds simple, and it is conceptually, but the math becomes a little tangled because the equations corresponding to those for a Type I regression above are now non-linear and require an iterative solution. There is a simple approach, however, that works well. You do the following for a data set called x and y:

  1. Regress y on x and obtain a slope, which we'll call myx
  2. Regress x on y and obtain an inverse slope, call it mxy
  3. Calculate the new slope as the geometric mean of the combined fits:

$$m = \sqrt{m_{yx}\, m_{xy}} \qquad\qquad (3.2.20)$$

  4. From this new slope, calculate the new intercept from:

$$b = \bar{y} - m\,\bar{x} \qquad\qquad (3.2.21)$$

There are other, more general techniques, which we can use (see, for example, Numerical Recipes). The reader is also directed toward a MATLAB m-file on our WWW site that was developed by Ed Peltzer at WHOI called lsqfitma.m, which uses one of these techniques. For the full meal deal (which deals with the nasty math), we also have a more advanced least squares routine known as the "least squares cubic". However, this routine requires that you have error estimates of both the x and y for each point; you can find this routine at lsqcubic.m.
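The geometric-mean recipe above is also easy to code directly; here is a minimal, unweighted MATLAB sketch of equations 3.2.20 and 3.2.21 (this is only an illustration, not lsqfitma.m or lsqcubic.m):

pyx = polyfit(x, y, 1);   myx = pyx(1);    % slope of y regressed on x
pxy = polyfit(y, x, 1);   mxy = 1/pxy(1);  % inverse of the slope of x regressed on y
m   = sign(myx)*sqrt(myx*mxy);             % geometric-mean slope, equation (3.2.20), sign made explicit
b   = mean(y) - m*mean(x);                 % intercept through the data centroid, equation (3.2.21)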



3.3 General Linear Least Squares Technique

3.3.1 Choose your model functions wisely

It is possible to set up the normal equations for any arbitrary set of functions (just as we did for the straight-line formula above). The difficulty is that the more complicated the functions, the more difficult the formulation, and the more the risk that the solution to the normal equations becomes ill-behaved. This happens when the functions chosen are not very orthogonal (that is, they are more similar than not), or when the measurements actually made are not well posed to distinguish between the differing functions. The latter happens more than people realize. Finally, the choice of functions may not really do a good job of matching the actual observations. Regardless of the underlying reason, when this happens the solutions end up being a delicate balance between very large numbers. This may yield the minimum chi squared for the actual measurements, but outside of the range of the measurements, or sometimes in between data points (if there are gaps), the solution does rather strange things. A classic example (and often the worst offender) is a polynomial regression to an abruptly changing data set, which is otherwise very quiescent. Consider, for example, a step function in the data, which you try desperately to fit with polynomials of increasing order.

Figure 3.3.1 Discontinuities in a data set, such as the step function in the left panel, can lead to all sorts of problems for fitting routines. Also note that once you get beyond the range of the data, either to the left or the right, the polynomial fits go haywire. Don’t use polynomial fits for extrapolation!

The left hand figure above shows the data, which is 0 for x less than 0, and 1 for x greater than 0 (with a little noise added on). The right hand plot shows the efforts of various order polynomial fits ranging from order 3 (cubic) up to 13th order. Note that you have to go to very high order (the cyan curve is 13th order) to begin to approximate the step function's sharpness, but the price you pay is tight little oscillations where you have data (kind of an induced "ringing") plus wildly aberrant behavior outside of the data range. The latter may be particularly troublesome if you wish to try extrapolating your fits beyond your measurement range.

3.3.2 There Is an Easier Way: the Design Matrix Approach

The solution by means of the normal equations can be unstable, and it is rather tedious to have to build the analytical forms for them when the functions are more complicated. Now we'll show you an easier, numerical way. By the way, the term "Linear Least Squares" means not that the functions you are fitting are linear, but rather that the formulation is linear in the coefficients. Thus you could have a function to fit that looks like, for example

$$y = a_1 + a_2 x + a_3 \sin(x) + a_4 e^{-x} \qquad\qquad (3.3.1)$$

which can be fit with linear least squares, as long as there are no unknown coefficients inside the arguments of the non-linear functions. For example, the following is definitely a nonlinear regression prospect:

$$y = a_1 + a_3 e^{a_2 x} \qquad\qquad (3.3.2)$$

The culprit is the a2 term, which appears as part of the argument of a non-linear function. We will deal with this kind of problem in section 3.4.

Now about this supposed easy way. Well, think of constructing a series of simultaneous equations with one equation for each measurement (we'll use the nasty little equation we gave as an example):

$$\begin{aligned} y_1 &= a_1 + a_2 x_1 + a_3 \sin(x_1) + a_4 e^{-x_1} \\ y_2 &= a_1 + a_2 x_2 + a_3 \sin(x_2) + a_4 e^{-x_2} \\ &\;\;\vdots \\ y_N &= a_1 + a_2 x_N + a_3 \sin(x_N) + a_4 e^{-x_N} \end{aligned} \qquad\qquad (3.3.3)$$

The above can be represented by a matrix called the design matrix, which would look like

$$A = \begin{pmatrix} 1 & x_1 & \sin(x_1) & e^{-x_1} \\ 1 & x_2 & \sin(x_2) & e^{-x_2} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_N & \sin(x_N) & e^{-x_N} \end{pmatrix} \qquad\qquad (3.3.4)$$

It would be an N × 4 matrix (N measurements and 4 columns). One way to think about the problem is that the design matrix is just a set of basis functions that we are trying to fit to the data. Remember that all of the elements in this design matrix are simply numbers that you calculate from your x-data. Then you would have a column vector consisting of your 4 unknown coefficients (which you want to solve for):

$$a = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{pmatrix} \qquad\qquad (3.3.5)$$

And the column vector of your N observations:

$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix} \qquad\qquad (3.3.6)$$

The matrix version of these equations now reduces to Aa = y. Looks simple, doesn't it? But what about the weighting factors (the σ's)? Well, just like the straight-line case, you divide the entries in each row of the design matrix and the corresponding y entry by that measurement's σi. We won't repeat the matrices listed above, but for example the ith row of the design matrix would look like

$$\begin{pmatrix} \dfrac{1}{\sigma_i} & \dfrac{x_i}{\sigma_i} & \dfrac{\sin(x_i)}{\sigma_i} & \dfrac{e^{-x_i}}{\sigma_i} \end{pmatrix} \qquad\qquad (3.3.7)$$

and the ith element of the y-vector would likewise be divided by σi. Everything else proceeds as before.


3.3.3 Solving the Design Matrix Equation with SVD

Now that we've shown you how to build a design matrix, etc., we'll show you what to do. You cannot uniquely solve this as a set of simultaneous equations because it is overdetermined. That is, there are more equations than unknowns. Now since data is always imperfect, the data points will never agree on the true coefficients. The equations will always be inconsistent to some extent. This is equivalent to saying that not all the points will lie on the regression line. But what you want to do is to minimize the square of the distance between the regression function and the observations. That is, you want to minimize the function |Aa − y|². This is exactly what singular value decomposition does for you. So in MATLAB you enter the commands (after constructing the matrices, of course)

 
[U,S,V] = svd(A,0);        % SVD of design matrix
W = diag(1./diag(S));      % not checked for zero singular values
a = V*W*(U'*y);              % your coefficients!
Covmat = V*W.^2*V';     % compute covariance matrix
[N M] = size(A);   % size of design matrix (row X column)
redchisqr = sum((A*a-y).^2)/(N-M);  % compute reduced chi squared
sa = sqrt(redchisqr*diag(Covmat));   % uncertainties in coefficients

And that's your answer. Note that the primes denote matrix transposes in MATLAB, and in the second line I've skipped an important step of checking for zero singular values before inverting. The second line is a bit tricky: it is equivalent to taking the diagonal of S, inverting the individual elements of the resulting vector, and then reconstructing a square matrix with those new elements on the diagonal. The calculations look obscure, but they are efficient and powerful. You can use these few lines of code as the engine of a general linear least squares regression program. All you need to do is build the design matrix and data vector for the specific linear model you want to fit.
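For example, to fit a model like equation 3.3.1 with this engine, the only problem-specific work is building the weighted design matrix and data vector; a sketch, assuming x, y, and sigma are column vectors of your data and uncertainties:

Aw = [1./sigma, x./sigma, sin(x)./sigma, exp(-x)./sigma];  % weighted design matrix, one column per basis function
yw = y./sigma;                                             % weighted data vector
% now feed Aw and yw into the SVD engine above (in place of A and y)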


3.3.4 Multidimensional Regressions

What if you want to fit your data to higher dimensions? For example, suppose you wanted to model the distribution of dissolved oxygen at some depth level in the North Atlantic. You might do this, for example, if you had observations in an area of the North Atlantic and you were interested in calculating the large scale gradient (the rate and direction of change with distance). Supposing further that you believed that the distribution was best fit with a biquadratic function, that is, a function which was quadratic in both x (longitude) and y (latitude). Then your data, z would be modeled after

$$z = a_1 + a_2 x + a_3 y + a_4 x y + a_5 x^2 + a_6 y^2 \qquad\qquad (3.3.8)$$

Be careful not to be confused here, because we are now using "y" as an independent variable, rather than as an observation, and we have introduced "z" as your dependent variable (observation). Actually, this whole thing sounds more complicated than it is, but it is mathematically the same as the general linear least squares for which you already have the equations. You calculate your design matrix just like before, as

$$A = \begin{pmatrix} 1 & x_1 & y_1 & x_1 y_1 & x_1^2 & y_1^2 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_N & y_N & x_N y_N & x_N^2 & y_N^2 \end{pmatrix} \qquad\qquad (3.3.9)$$

the right hand data column vector would be

$$z = \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_N \end{pmatrix} \qquad\qquad (3.3.10)$$

and your coefficient vector is given by

$$a = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_6 \end{pmatrix} \qquad\qquad (3.3.11)$$

which you then solve with the MATLAB code we showed you above. Not too hard. You might want to look at an m-file called surfit.m, which does an unweighted two-dimensional fit to an arbitrary order polynomial, to see how you might extend it to weighted fits (i.e., with different σi's for each data point).
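As a sketch, the design matrix for a biquadratic surface like equation 3.3.8 might be built as follows (x, y, and z are assumed column vectors of longitude, latitude, and the observed property; unweighted for simplicity):

A    = [ones(size(x)), x, y, x.*y, x.^2, y.^2];   % one column per term in (3.3.8)
a    = A\z;                                       % unweighted least-squares solution for a1..a6
zfit = A*a;                                       % modeled values at the observation points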

3.3.5 Transformably Linear Models

There is one obvious case where the model equation is non-linear in its coefficients, but you can transform your data to make the model linear. Consider the model equation

$$y = a_1 e^{a_2 x} \qquad\qquad (3.3.12)$$

If you take the logarithm of this equation, it reduces to

$$\ln(y) = \ln(a_1) + a_2 x \qquad\qquad (3.3.13)$$

So all you have to do is take the log of your observations (the yi's), do a linear fit, and transform the first coefficient by taking the exp of the intercept. You can sometimes avoid doing non-linear fits in this fashion (bear in mind, though, that the fit is then least squares in log space, so the effective weighting of the data points changes).
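A two-line MATLAB sketch of this transform trick (x and y are assumed column vectors, with all y > 0):

p  = polyfit(x, log(y), 1);    % straight-line fit in log space: log(y) = log(a1) + a2*x
a2 = p(1);  a1 = exp(p(2));    % slope gives a2 directly; exponentiate the intercept to recover a1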

3.3.6 Non-Coefficients

Also, beware the "non-coefficient" problem. That is, you can sometimes introduce two model functions that are identical in behavior. The net result is that the normal equations become exactly singular. The SVD approach above, however, will give you an answer (after telling you there is a problem by giving you a zero singular value). Consider the following equation:

$$y = a_1 e^{a_3 x} + e^{a_2 + a_3 x} \qquad\qquad (3.3.14)$$

which looks reasonable, until you realize that a1 and exp(a2) are just two constants multiplying the same basis function, exp(a3x).



3.4 Non-Linear Least Squares Techniques

3.4.1 Iterative Techniques (Like a Rolling Stone …)

What happens when the model equations are non-linear in the coefficients? You do the same thing: MINIMIZE CHI SQUARED (χ²). The difference is that you have to do it iteratively. There is no simple way of writing down and solving a set of linear analytic equations.

Most non-linear fitting routines systematically search the coefficient space for a χ² minimum by repeatedly calculating the χ² for your model equation and data and then nudging the coefficients in differing directions until they reach a minimum (or as close to a minimum as you stipulate). The only difference between the techniques is the precise search mechanism, which can range from very crude to quite elaborate. The Bevington and Robinson book (chapter 8) describes three basic approaches: grid search, gradient search, and expansion methods. The grid search and expansion methods work well when you are near the χ² minimum, but are not very efficient or effective at moving large distances in coefficient space. The gradient search moves large distances well, since it travels down the path of steepest descent, adjusting all the coefficients at once. But it sometimes tends to get trapped in long valleys and doesn't converge well near the global χ² minimum.
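As a toy illustration of the crudest of these approaches, a brute-force grid search of the χ² surface for the straight-line model of equation 3.2.1 might look like this (x, y, and sigma are assumed data vectors; a real non-linear problem would use its own model function and a far more efficient search):

a1grid = linspace(-2, 2, 201);          % trial intercepts
a2grid = linspace(0, 4, 201);           % trial slopes
chi2   = zeros(length(a1grid), length(a2grid));
for i = 1:length(a1grid)
  for j = 1:length(a2grid)
    yfit      = a1grid(i) + a2grid(j)*x;            % model for this pair of trial coefficients
    chi2(i,j) = sum(((y - yfit)./sigma).^2);        % chi-squared at this grid point
  end
end
[cmin, k] = min(chi2(:));               % smallest chi-squared found on the grid...
[ib, jb]  = ind2sub(size(chi2), k);     % ...and the indices of where it occurs
best      = [a1grid(ib), a2grid(jb)];   % crude estimate of the best-fit coefficients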

The more sophisticated routines tend to use gradient search routines initially, which adjust the size of the nudge that they give to the coefficients to the size of the change in χ². Near the minimum they may perform a grid search, and the last effort usually involves some kind of polynomial expansion (interpolation) of the χ² surface. The price is that you need to supply them with the means not only to evaluate your model function for your data (the xi's) but also the gradient of χ². A popular method is the "Levenberg-Marquardt Method". We won't go into the details right now.

Think of the algorithm like a marble rolling around on a surface of hills and valleys (a little like the arcade game "marble madness"). The ball will continue to roll down hill until it reaches the lowest point (sometimes oscillating around the minimum as it goes, depending on the character of the algorithm used).

All of these routines require you to provide: your data (the xi's and yi's, and ideally the measurement uncertainties σi), a function that evaluates your model for a trial set of coefficients (and, for some routines, its derivatives), and an initial guess for the coefficients.

Keep one thing in mind. Nothing works better than a good initial first guess. You might get a little sloppy and tend to let the computer do the work, but the χ² surface in coefficient hyperspace may have more than one minimum, and your algorithm may get trapped in a local minimum, which is not nearly as good as the global minimum perhaps just over the next “hill” in hyperspace. As a corollary, you should always plot your results: the eyeball is not often fooled. You may not be able to distinguish precise fits, but you can tell if the fit is far from optimal. Sometimes it is worthwhile to do a grid search of initial starting values, or to approach the optimum from extremely different directions (e.g., from positive, then negative values).

If your problem is particularly plagued by “false minima”, one alternative method to consider is simulated annealing (see Numerical Recipes for more discussion and example programs). In this and related techniques, the linear (“straight-down-the-hill”) approach is somewhat abandoned for a more probabilistic view. In essence, the search routine starts by hopping randomly around parameter space computing χ² and then iteratively homes in on the hoped-for (but not rigorously provable) “global minimum”. These searching techniques are practical for small to moderately sized problems, but be forewarned that the same caveats apply as for the grid and gradient searches.

For large data sets and/or complicated non-linear functions with many coefficients, the iterative techniques are computationally expensive. You may find that you have no choice. Keep in mind that a sensible choice of model equations, based on a realistic physical or chemical model of the system you are fitting will take you much further than any sophisticated mathematics or powerful computers. Common sense is a powerful ally.


3.4.2 Uncertainties in Coefficients

Computing the uncertainties in coefficients for non-linear regressions is a challenging problem. It involves examining the shape of the χ² surface in coefficient hyperspace and estimating the boundaries of confidence intervals associated with “elevation” changes in χ². Many sophisticated routines return the covariance matrix (for the coefficients) on convergence, from which the associated parameter uncertainties are computed. In addition to the excellent discussion by Bevington and Robinson in chapter 8, the reader is referred to section 15.6 of Numerical Recipes for a discussion.


3.4.3 Example Calculation: Gaussian on a Constant Background

To give you an idea of how this might work, consider a data set which you suspect to be a Gaussian shaped peak embedded in a constant background, with some measurement noise. The measurements are represented by the dots below.

Figure 3.4.1 For some problems, such as the Gaussian peak in this data set, non-linear least-squares fitting methods are required.

The first thing you do is create a MATLAB m-file whose sole purpose is to compute your model function. You can call it anything you want, but here we will call it “modfunc.m”. It is a function that takes as its arguments the x data and a vector containing the coefficients, and returns the model fitted data. The m-file in this example contains a whole two lines:

 function out=modfunc(x,p)
 out = p(1) + p(2)*exp(-((x-p(3))/p(4)).^2);

Note the following about this m-file. First, the function statement makes this a function rather than just a simple script; you need to do this if you want it to return any values to MATLAB. Second, note the "dot" before the "caret": this way you don't end up getting a matrix from squaring "x", and the statement calculates all of the y's from the x's at once. The above is equivalent to the following mathematical equation:

$$y = a_1 + a_2\, e^{-\left(\frac{x - a_3}{a_4}\right)^2} \qquad\qquad (3.4.1)$$

Next, you download from our Web Site a non-linear least squares fitting routine called nlleasqr.m and a helper routine called dfdp.m, which you also need. Both files ultimately came from the MathWorks Web Site, but have been fixed up a little to remove some bugs. The first file is the main engine. The second is a program that calculates the χ² gradient in coefficient space. It does this numerically rather than analytically. You can make your own analytic function (and call it some distinct name), which may do a better job, but it is not really necessary. Finally, there is the data x.dat and y.dat (how original!).

OK, now you have the m-files, you start up MATLAB and load in the x and y data. You then plot it up, since you need to have some idea of your initial starting guesses for the parameters. You eyeball the plot and guess what the baseline value should be (this is a1), which we'll guess to be 1. You then guess the height of the Gaussian (a2) to be 8. The center of the Gaussian (a3) we'll guess to be at 5, and its width (a4) to be 2. (Note that from equation 3.4.1, a4 = σ, the width of the Gaussian.) So we set our initial parameter guess to be

 load x.dat
 load y.dat
 pin=[1 8 5 2];

Note that the order is important and should be the same as you use in the m-file "modfunc.m". Next, you type "help nlleasqr" to find out how to feed it information. Note that most of the input parameters are optional, so don't be intimidated, and don't wear out your fingers. We'll just try

[f,p,kvg,iter,corp,covp,covr,stdresid,Z,r2] = nlleasqr(x,y,pin,'modfunc');

The stuff on the left is all the goodies you get back, but we'll only look at a few of them. Your “answer” or coefficient vector is stored in “p”. First, let's look at what kind of job this routine did for us. Let's plot it:

xf=[0:.1:10];  % dummy array to plot
yf=modfunc(xf,p); % compute fit for the dummy array
plot(x,y,'*',xf,yf) % plot both data and fit on same graph
 

Hmmm, not bad. About the numbers: "p" is the output coefficients, "kvg" is a flag to say if convergence was achieved before the routine gave up, "iter" is the number of iterations, "covp" is the covariance matrix for the coefficients (the square roots of the diagonal elements are the uncertainties in the parameters), and "r2" is the overall correlation coefficient. For this example, we get

 
coefficients = [0.6595    9.9624    4.4780    0.9561]
uncertainties = [0.1220    0.3760    0.0249    0.0466]
r2 = 0.9160
 

Oh yes, the "real" values were [0.5 10.0 4.50 1.0]. The answer is roughly within errors of the real values, but will differ because of the noise that was added to the data. Your mileage may differ slightly when you do this experiment due to roundoff error.




The text, graphics, and other materials contained in this homepage and attached documents are intended solely for scholarly use by the scientific and academic community. No reproduction, re-transmission or linking of this page to any other page without the author's expressed written permission is permitted.


1998, 2000, 2002 -- David M. Glover, William J. Jenkins, and Scott C. Doney, WHOI --