4.1 Statistical Foundations
In analyzing data for underlying or hidden structure, we make use of a number of statistical concepts that you have seen before. In fact, these same concepts arise again and again in the search for order and regularity in data that, at first blush, looks hopelessly random. Of all these concepts, perhaps none are as powerful as covariance and correlation.
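You will recall that the variance of a data set is given by:
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}{n-1}$$
This is its computational form; perhaps you will remember it better in its definition form:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
Although more familiar, this form does not bear as much resemblance to the equation used to compute the covariance as does the first. The covariance between two data sets (or more generally between data sets j and k) is given by:
$$\mathrm{cov}_{jk} = \frac{\sum_{i=1}^{n} x_{ij}x_{ik} - \frac{1}{n}\sum_{i=1}^{n} x_{ij}\sum_{i=1}^{n} x_{ik}}{n-1} = \frac{SP_{jk}}{n-1}$$
Again, if this computational form looks confusing, it may be easier to grasp in its definition form:
$$\mathrm{cov}_{jk} = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)}{n-1}$$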
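As a quick check that the computational and definition forms agree, here is a minimal Python sketch. It assumes the usual sample (n - 1) denominator; the function names and the two small data sets are made up for illustration.

```python
# Sketch: computational vs. definition forms of sample variance and covariance.
# Function names and data values are illustrative assumptions.

def variance_computational(x):
    n = len(x)
    # corrected sum of squares: sum(x^2) - (sum x)^2 / n
    ss = sum(v * v for v in x) - sum(x) ** 2 / n
    return ss / (n - 1)

def variance_definition(x):
    n = len(x)
    mean = sum(x) / n
    return sum((v - mean) ** 2 for v in x) / (n - 1)

def covariance_computational(xj, xk):
    n = len(xj)
    # corrected sum of products: sum(xj*xk) - sum(xj)*sum(xk) / n
    sp = sum(a * b for a, b in zip(xj, xk)) - sum(xj) * sum(xk) / n
    return sp / (n - 1)

def covariance_definition(xj, xk):
    n = len(xj)
    mean_j = sum(xj) / n
    mean_k = sum(xk) / n
    return sum((a - mean_j) * (b - mean_k) for a, b in zip(xj, xk)) / (n - 1)

xj = [2.0, 4.0, 6.0, 8.0]
xk = [1.0, 3.0, 2.0, 6.0]

print(variance_computational(xj), variance_definition(xj))
print(covariance_computational(xj, xk), covariance_definition(xj, xk))
```

Each printed pair should match to within floating-point rounding. The computational form needs only running sums, which is why it was favored for hand calculation, although it can lose precision when the mean is large relative to the spread.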
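In eqn (4.3) the numerator is shown as $SP_{jk}$, which is shorthand for sum of products; another shorthand that is used a lot is $SS$, for sum of squares. This makes it easier to write down sometimes messy equations like that for the correlation coefficient, which normally would be written as:
$$r_{jk} = \frac{\sum x_{ij}x_{ik} - \frac{1}{n}\sum x_{ij}\sum x_{ik}}{\sqrt{\left[\sum x_{ij}^2 - \frac{1}{n}\left(\sum x_{ij}\right)^2\right]\left[\sum x_{ik}^2 - \frac{1}{n}\left(\sum x_{ik}\right)^2\right]}}$$
but can be rewritten more compactly as:
$$r_{jk} = \frac{SP_{jk}}{\sqrt{SS_j\,SS_k}}$$
Here $SS_j$ refers to the sum of squares of data set j. This nomenclature will show up again when we discuss ANOVAs.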
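The sum-of-products and sum-of-squares shorthand translates almost directly into code. The following is a sketch only: the helper names `sum_of_squares`, `sum_of_products`, and `correlation` are hypothetical, and the data are invented.

```python
import math

# Sketch: correlation coefficient from SP and SS (illustrative names and data).

def sum_of_squares(x):
    """Corrected sum of squares, SS = sum(x^2) - (sum x)^2 / n."""
    n = len(x)
    return sum(v * v for v in x) - sum(x) ** 2 / n

def sum_of_products(xj, xk):
    """Corrected sum of products, SP = sum(xj*xk) - sum(xj)*sum(xk) / n."""
    n = len(xj)
    return sum(a * b for a, b in zip(xj, xk)) - sum(xj) * sum(xk) / n

def correlation(xj, xk):
    """r_jk = SP_jk / sqrt(SS_j * SS_k)."""
    return sum_of_products(xj, xk) / math.sqrt(sum_of_squares(xj) * sum_of_squares(xk))

xj = [2.0, 4.0, 6.0, 8.0]
xk = [1.0, 3.0, 2.0, 6.0]
r = correlation(xj, xk)
print(r)
```

Because the (n - 1) factors cancel between numerator and denominator, the correlation coefficient needs only the three sums SP_jk, SS_j, and SS_k.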
The statistical machinery we've talked about so far has been useful for comparing data sets to parent population (or theoretical) values and for comparing two data sets to each other. But how do we compare two or more data sets that contain groups of observations (such as replicates)? Suppose we had n replicates of m samples of the same thing. We could create a null hypothesis that there is no difference between the means of the samples and the alternate hypothesis that at least one sample mean is different. How would we test this? We do this with a technique known as analysis of variance (ANOVA).
In order to apply this technique we need to calculate some sums of squares. We calculate the sum of squares among the samples, within the replicates, and for the total data set. We then arrange these results into a table that has the following entries.
| Source | Sum of Squares | Deg. of Freedom | Mean Squares | F-test |
|---|---|---|---|---|
| Among Samples | SSA | m-1 | MSA = SSA/(m-1) | F = MSA/MSW |
| Within Replicates | SSW | N-m | MSW = SSW/(N-m) | |
| Total | SST | N-1 | | |
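The entries of the one-way table can be computed in a few lines. This is a minimal sketch under stated assumptions: each inner list holds the replicates of one sample, the function name `one_way_anova` is hypothetical, and the numbers are invented. It stops at the F statistic; the critical value would still come from an F table or a statistics package.

```python
# Sketch: one-way ANOVA table entries (illustrative function name and data).

def one_way_anova(groups):
    """groups: list of samples, each a list of replicate measurements."""
    all_values = [v for g in groups for v in g]
    N = len(all_values)          # total number of measurements
    m = len(groups)              # number of samples
    grand_mean = sum(all_values) / N
    group_means = [sum(g) / len(g) for g in groups]

    # Among-samples, within-replicates, and total sums of squares
    ssa = sum(len(g) * (gm - grand_mean) ** 2 for g, gm in zip(groups, group_means))
    ssw = sum((v - gm) ** 2 for g, gm in zip(groups, group_means) for v in g)
    sst = sum((v - grand_mean) ** 2 for v in all_values)

    msa = ssa / (m - 1)
    msw = ssw / (N - m)
    return {"SSA": ssa, "SSW": ssw, "SST": sst,
            "df_among": m - 1, "df_within": N - m, "df_total": N - 1,
            "MSA": msa, "MSW": msw, "F": msa / msw}

samples = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
table = one_way_anova(samples)
for name, value in table.items():
    print(name, value)
```

A useful sanity check on any such calculation is that SSA + SSW equals SST, just as the degrees of freedom (m - 1) + (N - m) add up to N - 1.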
The F statistic in this table is used just like the F-test we performed in lecture two, with m-1 and N-m degrees of freedom, where N represents the total number of measurements. This type of analysis will turn up many times; for example, you will be able to use it to test the statistical significance of adding an additional term to a polynomial fit (in 1-, 2-, or n-D problems).
The table above essentially examines the variance in a data set by looking at the variance computed by going down the columns, but what if one's data set contained m samples and n different treatments of that data (not replicates as above)? To perform an ANOVA on this kind of data set we use a two-way ANOVA. In this case the ANOVA looks at the variance between samples by going down the columns as above, but also at the variance between treatments by going across the columns. The table created for a two-way ANOVA appears as:
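| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F-Tests |
|---|---|---|---|---|
| Among Samples | SSA | m-1 | MSA = SSA/(m-1) | MSA/MSE |
| Among Treatments | SSB | n-1 | MSB = SSB/(n-1) | MSB/MSE |
| Error | SSE | (m-1)(n-1) | MSE = SSE/((m-1)(n-1)) | |
| Total Variation | SST | N-1 | | |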
Here the first F-test tests the significance of differences between samples and the second F-test tests the significance of differences between treatments. For more about two-way ANOVA, refer to Davis, Chapters 2, 5, and 6.
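The two-way case described above can be sketched the same way. The layout here is an assumption: each column of the data matrix is a sample and each row is a treatment, with one measurement per cell (no replication). The function name `two_way_anova` and the numbers are invented for illustration.

```python
# Sketch: two-way ANOVA without replication (illustrative name, layout, and data).
# Assumed layout: data[i][j] is treatment i (row) applied to sample j (column).

def two_way_anova(data):
    n = len(data)           # number of treatments (rows)
    m = len(data[0])        # number of samples (columns)
    N = n * m
    grand_mean = sum(sum(row) for row in data) / N
    col_means = [sum(row[j] for row in data) / n for j in range(m)]
    row_means = [sum(row) / m for row in data]

    ss_samples = n * sum((c - grand_mean) ** 2 for c in col_means)
    ss_treatments = m * sum((r - grand_mean) ** 2 for r in row_means)
    ss_total = sum((v - grand_mean) ** 2 for row in data for v in row)
    ss_error = ss_total - ss_samples - ss_treatments

    ms_samples = ss_samples / (m - 1)
    ms_treatments = ss_treatments / (n - 1)
    ms_error = ss_error / ((m - 1) * (n - 1))
    return {"SS_samples": ss_samples, "SS_treatments": ss_treatments,
            "SS_error": ss_error, "SS_total": ss_total,
            "F_samples": ms_samples / ms_error,
            "F_treatments": ms_treatments / ms_error}

data = [[1.0, 2.0, 4.0],
        [2.0, 3.0, 6.0],
        [3.0, 7.0, 8.0]]
result = two_way_anova(data)
for name, value in result.items():
    print(name, value)
```

Both F ratios share the same error mean square in the denominator, so a noisy data set weakens the test for samples and the test for treatments alike.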
4.1.3 Standardization and Normalization
Quite often our data sets are a mixture of measurements made on different scales and/or in different units. We use normalization and standardization to get around these sorts of difficulties and to keep one component with very large magnitudes from dominating other components that are small in magnitude. By normalization we mean the transformation of the variable vectors into vectors of unit length, as shown in:
$$\mathbf{x}' = \frac{\mathbf{x}}{\lVert\mathbf{x}\rVert} = \frac{\mathbf{x}}{\sqrt{\sum_{i=1}^{n} x_i^2}}$$
By standardization we are referring to the transformation that gives the variable vector a mean of zero and a standard deviation of one. This is just the Z-score transformation seen in the second lecture:
$$z_i = \frac{x_i - \bar{x}}{s}$$
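The two transformations can be compared side by side in a short sketch. The function names `normalize` and `standardize` are illustrative, the sample standard deviation (n - 1 denominator) is assumed, and the data are invented.

```python
import math

# Sketch: normalization (unit length) vs. standardization (Z-scores).
# Names and data are illustrative assumptions.

def normalize(x):
    """Scale a vector to unit Euclidean length."""
    length = math.sqrt(sum(v * v for v in x))
    return [v / length for v in x]

def standardize(x):
    """Z-score transform: subtract the mean, divide by the sample std. dev."""
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    return [(v - mean) / s for v in x]

x = [2.0, 4.0, 6.0, 8.0]
unit = normalize(x)
z = standardize(x)

print(unit, math.sqrt(sum(v * v for v in unit)))   # length of normalized vector
print(z, sum(z) / len(z))                          # mean of standardized vector
```

The standardized vector has length sqrt(n - 1) rather than one, so the two transformations are not interchangeable.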
Typically these transformations are done column-wise on the data matrix, in effect giving all measurements the same units: units of standard deviation. It's important that the data columns be as normally distributed as possible, since this makes the transformation symmetric. An asymmetric transformation complicates the statistics and invalidates some of the underlying assumptions we made when we began the analysis.
However, it is important to note one aspect of standardized data: the covariance matrix of the standardized data and the correlation matrix of the original data are one and the same. If you don't believe me, try making up a data set and use MatLab's CORRCOEF and COV on the unstandardized and standardized data respectively; I think you'll see what I mean. This brings up an important point: while it is usually a good idea to standardize your data, sometimes it's unnecessary work and it may not be what you want to do. For example, if all of your data has been measured in the same units, then it makes little difference if you extract the eigenvectors from the unstandardized covariance matrix or from the correlation coefficient matrix. Furthermore, standardization can have a significant effect on the variance-covariance matrix structure, which is the very thing you are trying to extract information about. For example, each variable influences this structure in proportion to its variance; if you make all of the variables have a standard deviation (and hence variance) of one, then they all have equal influence. I think you can imagine analyses where you may not want to diminish this influence.
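The experiment suggested above can also be tried without MatLab. The sketch below uses plain Python in place of CORRCOEF and COV; the helper names and the made-up data matrix (rows are observations, columns are variables) are assumptions for illustration.

```python
import math

# Sketch: covariance matrix of standardized data vs. correlation matrix of raw data.
# Helper names and the data matrix are illustrative assumptions.

def cov(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def corr(x, y):
    return cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))

def standardize(x):
    mean = sum(x) / len(x)
    s = math.sqrt(cov(x, x))
    return [(v - mean) / s for v in x]

data = [[2.0, 1.0, 10.0],
        [4.0, 3.0, 30.0],
        [6.0, 2.0, 25.0],
        [8.0, 6.0, 50.0]]

columns = [[row[j] for row in data] for j in range(3)]
z_columns = [standardize(c) for c in columns]

corr_raw = [[corr(a, b) for b in columns] for a in columns]
cov_std = [[cov(a, b) for b in z_columns] for a in z_columns]

for r_row, c_row in zip(corr_raw, cov_std):
    print(r_row, c_row)
```

Each row of the correlation matrix of the raw data should print next to an identical row of the covariance matrix of the standardized data, up to rounding.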
4.1.4 Linear Independence and Complete Sets of Basis Functions
Quite often in mathematics we say that a problem space is composed of some small set of basis functions or vectors. This is best demonstrated by way of an example. Suppose you had the following data set, represented here as a matrix.
This data set could be represented as three column vectors as in the following diagram:
But now consider a different set of vectors.
Plotting these up in 3-D space reveals a different set of circumstances.
An important thing to keep in mind is that the number of possible sets of basis vectors (functions) for a space of any dimension is infinite; any set of n linearly independent vectors can form a basis for an n-dimensional space.
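For three vectors in 3-D space, linear independence can be checked with the scalar triple product, which is the determinant of the matrix whose columns are the vectors. This is only a sketch: the function names, the tolerance, and the example vectors are hypothetical and are not the vectors from the figures.

```python
# Sketch: testing three 3-D vectors for linear independence.
# Names, tolerance, and example vectors are illustrative assumptions.

def triple_product(a, b, c):
    """a . (b x c), equal to the determinant of the matrix [a b c]."""
    cross = (b[1] * c[2] - b[2] * c[1],
             b[2] * c[0] - b[0] * c[2],
             b[0] * c[1] - b[1] * c[0])
    return a[0] * cross[0] + a[1] * cross[1] + a[2] * cross[2]

def linearly_independent(a, b, c, tol=1e-12):
    """True if the three vectors span 3-D space (nonzero determinant)."""
    return abs(triple_product(a, b, c)) > tol

standard_basis = ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])
another_basis = ([1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0])
coplanar = ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0])

print(linearly_independent(*standard_basis))
print(linearly_independent(*another_basis))
print(linearly_independent(*coplanar))
```

The first two sets are both valid bases for 3-D space even though they look nothing alike, while the third set lies entirely in a plane and so cannot represent anything with a nonzero third component.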