**Sequence Analysis I: Uniform Series, Cross- and
Auto-Correlation, and Fourier Transforms**

*File last modified 2 October 2000*

**6.2 Analysis in Time and Space**

**6.2.1 Auto-covariance and auto-correlation**

An obvious question to ask when you are collecting a time series
is "to what extent will the next measurement depend on the
one that I have just collected?" Or, for that matter, on the
one before. There is a simple statistical measure of this dependence
called the **auto-covariance**, and the procedure is very easy.
Consider a time series of length N (there are N equally spaced
measurements), which we will call "y". Each individual
element we will refer to as "y_{i}". First,
make a duplicate of the complete time series. You then build a
new series which is the lagged product of the two: multiply every
element in the first by the corresponding element in the second,
and sum them up. Divide the sum by the length N, and subtract
from this the product of the means of the two series. This is given by

$$\mathrm{cov}_0 = \frac{1}{N}\sum_{i=1}^{N} y_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N} y_i\right)^2$$

which you should immediately recognize as the variance. This we call the autocovariance at lag 0. This is the first element in your series. In MATLAB, this would look like

cov0 = mean(y.^2) - mean(y)^2;

Note the dot before the exponentiation: you are squaring the
individual elements of "y", not raising a matrix to a power. We called
it "cov0" because MATLAB doesn't recognize a "zeroth"
element in a vector or array.

Now let's shift the second series to the left by one interval.
Since the series are of finite length, there will be *one fewer
element* overlapping between the two (picture one element hanging
out to the left in the second series, and one hanging out to the
right in the first). You now compute a covariance at lag 1, as

$$\mathrm{cov}_1 = \frac{1}{N-1}\sum_{i=2}^{N} y_i\,y_{i-1} - \bar{y}_2\,\bar{y}_1$$

Notice that things are a little different now: for example, we divide by (N-1) because the overlap is shorter by 1, and the summation has to start at i=2. The means are different too, since they are averaged over different intervals. We've used a politically incorrect nomenclature (it changes from equation to equation) to designate these means:

$$\bar{y}_1 = \frac{1}{N-1}\sum_{j=1}^{N-1} y_j \qquad \bar{y}_2 = \frac{1}{N-1}\sum_{j=2}^{N} y_j$$

Again, pay careful attention to the summation limits. One average starts at j=1, and the other starts at j=2. The corresponding MATLAB statement would be

cov(1) = mean(y(2:N).*y(1:N-1)) - mean(y(2:N)).*mean(y(1:N-1));

Yuch! This is getting a little messy here. Remember the little dot before the multiplication sign between the y's (it's not necessary between scalars, but it doesn't hurt to put it in). This is the second element in your series. Now if we try 2 lags… no, let's make that "n" lags, you have (sigh!):

$$\mathrm{cov}_n = \frac{1}{N-n}\sum_{i=n+1}^{N} y_i\,y_{i-n} - \bar{y}_2\,\bar{y}_1$$

where you have now only (N-n) elements to compare, and a new definition for the means to subtract:

$$\bar{y}_1 = \frac{1}{N-n}\sum_{j=1}^{N-n} y_j \qquad \bar{y}_2 = \frac{1}{N-n}\sum_{j=n+1}^{N} y_j$$

Yeah, we know we've been playing fast and loose with definitions here, sorry about that. We're trying to make things a little clearer. The corresponding MATLAB code is

cov(n)=mean(y(1+n:N).*y(1:N-n)) - mean(y(1+n:N))*mean(y(1:N-n));

Sheesh, this is getting even worse. Don't forget the little dot
again. Now remember one thing: we *could* continue to increase
"n" until it is as large as "N-1" (a lag by
N would leave *no* overlap). But as the lag becomes larger,
the degree of overlap becomes smaller (that's the "N-n"
speaking to you). The **general rule of thumb**, largely driven
by common sense, is to not let "n" get any bigger than about 1/5^{th}
of "N". A plot of the auto-covariance *vs.* lag time is called
an **auto-covariogram**. We wonder why?
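As a cross-check of the MATLAB snippets above, here is a minimal pure-Python sketch of the unbiased auto-covariance for lags 0 through about N/5 (a sketch only; the function and variable names here are ours, not from the lecture):

```python
def autocov(y, n):
    """Unbiased auto-covariance of y at lag n (n >= 0)."""
    N = len(y)
    a = y[n:]          # late segment: y_{n+1} .. y_N
    b = y[:N - n]      # early segment: y_1 .. y_{N-n}
    m = N - n          # number of overlapping elements
    mean_a = sum(a) / m
    mean_b = sum(b) / m
    # lagged product-sum over the overlap, minus product of segment means
    return sum(ai * bi for ai, bi in zip(a, b)) / m - mean_a * mean_b

y = [2.0, 4.0, 3.0, 5.0, 4.0, 6.0, 5.0, 7.0, 6.0, 8.0]
N = len(y)
# rule of thumb: keep the lag below about 1/5 of N
covs = [autocov(y, n) for n in range(N // 5 + 1)]
```

At lag 0 this reduces to mean(y.^2) - mean(y)^2, exactly the variance computed by the MATLAB cov0 line.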

Well, don't get out your text editors to make an m-file to do all that nasty stuff in MATLAB, because it's already built in. You just call one routine:

c = xcov(y,'unbiased');

which returns a vector of length 2N-1 containing all possible lags between -(N-1) and +(N-1), with the zero lag in the middle. That is, cov0 = c(N). Now common sense tells you not to believe those lags much more than N/5 from the center, but the makers of MATLAB decided to give you everything and let you choose. Moreover, the symmetry of the situation demands that the vector have mirror symmetry about its center (a lag of 1 is the same as a lead of 1), so that c(N-1)=c(N+1), c(N-2)=c(N+2), and so on. Go figure.

Now what's with this 'unbiased' argument (note the single quotes around it)? Well, in their finite wisdom, the people at MathWorks offer you four flavors of the covariance. The default (with no argument) gives you the lagged summation without dividing by (N-n) at each element (where n is the lag). They must have their reasons for this, somewhere. If you put in 'biased', you get the summation divided by N at each lag. The 'unbiased' option gives you our equation, and is "unbiased" because it corrects for the number of overlapping elements. Finally, 'coeff' normalizes the sequence so the zero-lag element is 1.0. These different options have their uses.

What does the covariance tell you? Well, it speaks to the original question we posed at the beginning of this lecture: to what extent does a given measurement depend on its predecessors?
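The four flavors can be sketched in pure Python (assuming xcov's convention of removing the overall mean of the series first; the helper names here are ours, not MathWorks'):

```python
def lagged_sum(y, n):
    """Raw lagged product-sum of the demeaned series at lag n >= 0."""
    ybar = sum(y) / len(y)
    d = [v - ybar for v in y]
    return sum(a * b for a, b in zip(d[n:], d[:len(d) - n]))

def xcov_flavor(y, n, flavor='none'):
    N = len(y)
    s = lagged_sum(y, n)
    if flavor == 'biased':
        return s / N
    if flavor == 'unbiased':
        return s / (N - n)           # corrects for the shrinking overlap
    if flavor == 'coeff':
        return s / lagged_sum(y, 0)  # zero-lag element becomes 1.0
    return s                         # default: no scaling at all

y = [2.0, 4.0, 3.0, 5.0, 4.0, 6.0, 5.0, 7.0, 6.0, 8.0]
c_unb = xcov_flavor(y, 2, 'unbiased')
```

Note that with 'coeff' the zero-lag value is exactly 1.0 by construction, and with 'biased' the zero-lag value is the (population) variance of the series.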

Below is the monthly sunspot index from 1749 to 1994 (you can download it if you like):

As you can see (and as most people know), there is a regular fluctuation in the index with a roughly 11-year periodicity. Performing the auto-covariance (unbiased), and looking at the result as a function of lag, we see the following:

Now we've shown it on two scales, so you can see the detail. Note
that the largest value of the covariance is at zero lag. This
makes sense, since the series will never match up perfectly except
at zero lag. The covariance decreases with increasing lag (left panel).
Note that there is a kind of "decorrelation time": the
sunspot index you see this year is a good predictor of what you'll
see next year. This tails off so that by the time you get to three
years, the auto-covariance is zero, and there is no direct correspondence.
This zero crossing is an important concept, since it characterizes
the length of time that must elapse before the time series becomes
uncorrelated, *i.e.*, how long it takes to forget what you
were doing. This sequence is strongly periodic, so that the auto-covariance
becomes significantly non-zero at longer lags.
Most sequences do not have such strong periodicity, and the
auto-covariance drops to zero and stays there.

This first "zero" is known as the decorrelation time. If we were looking
at a space series (*e.g.,* a hydrographic section of regularly
spaced stations), the first "zero" of the auto-correlation
is a measure of the spatial decorrelation length. This would be
a measure of the eddy mixing scale: stations closer than this
distance will tend to be correlated, and hence not independent
of one another. If you were testing some statistical theory about
property distributions, this implies that *the number of degrees
of freedom may be less than the number of stations (observations)
if their spacing is less than this scale.* Computation of goodness
of fit parameters must take into account the degrees of freedom.

But look at about 5.5 years! There is a negative correspondence.
If the sunspot index is high this year, it will likely be low
about 5 years from now, and *vice versa.* Notice that as
you approach the 11 year lag, the covariance rises to another
maximum. Because there are longer term variations in the index,
the match cannot be perfect, so the covariance never reaches the
zero lag high, but it does go through a maximum. This is the indicator
of periodicity. Note that as you increase the lag, you periodically
swing progressively through minima and maxima associated with
times when the series matches up. This certainly looks like a
legitimate signal, and not noise.

What would noise look like? We're glad you asked! Below is a synthetic data set of normally distributed noise, and its auto-covariance:

Very distinctive, isn't it? The covariance drops to zero immediately, since you cannot predict what the next value would be with true noise. In fact this is regarded as a very good test of your ability to generate random numbers. Generating truly random numbers if you are a computer is a non-trivial process, unless you're broken.

Now we can also look at the **auto-correlation**, which is
simply defined as a normalized form of the auto-covariance:

$$r_n = \frac{\mathrm{cov}_n}{\mathrm{cov}_0}$$

Sensibly, this value starts at 1 for zero lag and decreases thereafter, always remaining between -1 and +1. Have a look at Davis, p. 222, for examples of various "idealized" autocorrelation plots.
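The normalization is just the auto-covariance divided by its zero-lag value, which can be sketched in a few lines of Python (the helper names are ours):

```python
def autocov(y, n):
    """Unbiased auto-covariance of y at lag n >= 0 (segment means)."""
    N = len(y)
    a, b, m = y[n:], y[:N - n], N - n
    return (sum(x * z for x, z in zip(a, b)) / m
            - (sum(a) / m) * (sum(b) / m))

def autocorr(y, n):
    """Auto-correlation: auto-covariance normalized by the lag-0 value."""
    return autocov(y, n) / autocov(y, 0)

y = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0, 4.0, 6.0]
r = [autocorr(y, n) for n in range(3)]   # r[0] is exactly 1.0
```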

The question remains: when you look at **autocorrelograms**, how
significant are the numbers? For example, instinct tells you that
the autocorrelation of a random series should go to zero very
quickly (like the example above).
So the **null hypothesis** is that a given autocorrelation
at a specific lag has been generated
by a random series (*i.e.,* that it is indistinguishable
from zero). Well, the expected variance for any random sequence
of length "N" at some lag "n" is

$$\sigma_{r_n}^2 = \frac{1}{N - n - 3}$$

*Be warned that there is an error in Davis: equations 4.63
and 4.65 on page 223 should have a minus sign in front of the 3.*
The 3 comes from a
correction for the degrees of freedom used in computing the means
of each of the two series (even if they are the same) and
in the choice of the starting point of the autocorrelation. Now,
just as in any statistical distribution, the ratio of the observed
deviation from the mean correlation (which is zero for a random
sequence) to the expected spread is a measure of the probability of
its occurrence. So we take the ratio of the observed correlation at
lag n to this number, and after a little simplification we get

$$Z_n = r_n \sqrt{N - n - 3}$$
So if Z is 2, then there is a probability of 0.05 that the null hypothesis is correct (that the series is random at that lag and that the autocorrelation coefficient is not significantly different from zero). Put another way (although not considered politically correct), if

$$r_n > \frac{2}{\sqrt{N - n - 3}}$$

then it is likely (>95%) non-random. This dependence, by the
way, underlines the general importance of small relative lags.
As the size of the lag increases, the uncertainty in the autocorrelation
coefficient increases. This trend doesn't show for the time series
we have examined so far, because they are so long, but common
sense tells you that you ought to avoid doing this kind of analysis
for time series shorter than 50-100 observations. Below are the autocorrelation
plots for both the sunspot series and the random noise, along
with the 95% confidence intervals (Z_{n}=2).
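The 95% cutoff above can be sketched as a small Python helper (the function name is ours; it assumes the minus-sign-corrected formula discussed above):

```python
import math

def r_threshold(N, n, z=2.0):
    """Smallest |r_n| significant at ~95% (z = 2) for a random
    series of length N at lag n, using N - n - 3 degrees of freedom."""
    return z / math.sqrt(N - n - 3)

# longer series (and smaller lags) give tighter thresholds
t_short = r_threshold(100, 20)
t_long = r_threshold(1000, 20)
```

This makes the point in the text concrete: as the lag n grows toward N, the denominator shrinks and the autocorrelation must be larger and larger to be believed.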

The sunspot series correlogram is clearly significant all the way out to 50 year lags.

**6.2.2 Cross-covariance and cross-correlation**

Well, what you can do for one series, you can do for two. The
**cross-covariance **of two time series is a tool for
discerning the relationship between two phenomena. For example,
you might be interested in the influence of sunspot index on the
weather. There was a long period of anomalously cold weather,
called the "Little Ice Age" several hundred years ago,
which appeared to be associated with a minimum in the sunspot
index (the "Maunder Minimum"). This correspondence shows
up in a number of other records, for example radiocarbon in tree
rings and corals.

Of interest is not only the degree of correspondence
between two records, but possible *time lags* between them.
For example, there are records of atmospheric CO_{2} concentrations
in ice core bubbles. There is a clear indication that atmospheric
CO_{2} levels were *lower* during glacial maxima.
Were these lower concentrations a result of the glacial events,
or did they cause them? If the CO_{2} record *leads*
the glacial advance/retreat, then this would imply, at least circumstantially,
that there is a causal relation where the impact of CO_{2}
on the radiative balance of the earth drives the ice advance and
retreat.

Further, the degree of the lag may be of interest. For example,
the time lag between the CO_{2} record and other glacial
indicators indicates the response times of the various systems.
A more direct example would be time lag between sea surface temperature
and a temperature proxy indicator (del-O18) in foram tests in
a sediment trap record. The lag would represent the settling time
of particles, and hence the settling velocity.

The cross-correlation of two time series (y_{1} and y_{2})
is given by

$$r_m = \frac{\mathrm{cov}_{12}(m)}{s_1\,s_2}$$

which is just the cross-covariance divided by the
product of the standard deviations (the square roots of the variances)
of the individual series, each computed over the overlapping segment.
This should look familiar to those who have done linear correlation. Here
"m" is the number of overlapping pairs in the correlation
calculation. Here the zero-lag correlation coefficient "r_{0}"
will in general not be 1, since the series will never perfectly match.
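A pure-Python sketch of this calculation, assuming segment means and standard deviations over the overlap (the helper names are ours):

```python
import math

def crosscorr(y1, y2, m):
    """Cross-correlation of y1 and y2 at lag m >= 0
    (y2 shifted left by m relative to y1)."""
    a = y1[m:]              # overlapping part of y1
    b = y2[:len(y2) - m]    # overlapping part of y2
    k = len(a)              # number of overlapping pairs
    ma, mb = sum(a) / k, sum(b) / k
    cov = sum((x - ma) * (z - mb) for x, z in zip(a, b)) / k
    sa = math.sqrt(sum((x - ma) ** 2 for x in a) / k)
    sb = math.sqrt(sum((z - mb) ** 2 for z in b) / k)
    return cov / (sa * sb)

y1 = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
y2 = y1[1:] + [0.0]         # y2 leads y1 by one step
```

Correlating a series with itself at zero lag gives 1 by construction; correlating y1 against the shifted copy y2 peaks at the lag that realigns them.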

Like the autocorrelation function, we can test the significance of the correlation coefficient (remember, the null hypothesis is that the series are uncorrelated) with the t-test:

$$t = r_m \sqrt{\frac{m-2}{1 - r_m^2}}$$
thus as r_{m} approaches 1, t approaches infinity (*i.e.,*
the null hypothesis becomes very improbable), and as r_{m}
approaches 0, t approaches 0.
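That limiting behavior is easy to verify with a small Python sketch (the function name is ours; m is the number of overlapping pairs, as above):

```python
import math

def t_stat(r, m):
    """t statistic for a correlation coefficient r computed
    from m overlapping pairs (m - 2 degrees of freedom)."""
    return r * math.sqrt((m - 2) / (1.0 - r ** 2))

# t grows rapidly as |r| approaches 1, and vanishes as r approaches 0
t_small = t_stat(0.1, 30)
t_big = t_stat(0.9, 30)
```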

**6.2.3 Convolution and implications for signal theory**

The process of cross-covariance is closely related to a mathematical
operation called **convolution**. This is mathematically represented
for two functions "x" and "y" by

$$(x * y)_k = \sum_i x_i\, y_{k-i} \qquad (x * y)(t) = \int x(\tau)\, y(t - \tau)\, d\tau$$

where the right-hand equation is the continuum equivalent representation
(an integral, after all, is just the continuum equivalent of a
summation). In these equations, the * represents **convolution**
and should not be confused with the MATLAB multiplication operator.
It has a specific connotation in mathematics. The integral is
done over τ, which is the time lag. If we
were to "normalize" our data by subtracting the means
and dividing by the standard deviations, *i.e.,* make them "Z-scores",
then the convolution of the two series becomes the cross-correlation
coefficient.
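The discrete convolution sum can be sketched in pure Python as a naive double loop (the function name is ours):

```python
def convolve(x, y):
    """Full discrete convolution: element k is the sum of
    x_i * y_{k-i} over all valid i."""
    out = [0.0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] += xi * yj
    return out

# convolving a spike with a 3-point running-mean kernel smears it out
smoothed = convolve([0.0, 0.0, 3.0, 0.0, 0.0], [1/3, 1/3, 1/3])
```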

This has an interesting application to signal processing. If you are measuring some kind of uncontaminated process (we'll call it "u") with a device, and the device has some kind of response function (which we'll call "r", which may, for example, smear your signal out because the device has some finite response time), then the output (signal "s") of your device will be the convolution of the response function with the uncontaminated process, that is

$$s(t) = r(t) * u(t)$$

What would the "ideal" instrument response function look like? Well, you'd want something that would put out exactly what came in, independent of previous or following measurements. That is, you'd want

$$r(t) * u(t) = u(t)$$

In the finite-sum case, this kind of function would simply be a series which is 1 at zero lag and zero everywhere else. In the continuum case, this corresponds to a "delta function", which is zero everywhere except at exactly lag zero, and has an area of 1. This, you say, is impossible, because if it is infinitely narrow, it must be infinitely tall to have unit area (area = height × width). Well, it can be done; don't worry about it. This idealized function is also called a "Green's Function", which simply means that when convolved with a function, it yields the same function back.
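The finite-sum version of this identity is easy to demonstrate in Python (the convolve helper is our own sketch):

```python
def convolve(x, y):
    """Full discrete convolution of two finite series."""
    out = [0.0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] += xi * yj
    return out

u = [2.0, 5.0, 3.0, 7.0]
delta = [1.0]            # 1 at zero lag, zero everywhere else
s = convolve(u, delta)   # identical to u: the ideal instrument
```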

In real life, however, there is also the additional problem of noise (random interference with your signal). We call this "n", and say that the thing that finally makes it to your chart recorder or computer ADC would be

$$s(t) = r(t) * u(t) + n(t)$$

Believe it or not, there are ways of filtering your data to extract
the original, desired signal. We'll get back to this later, but
in general, **filtering** your data involves a similar
process, whereby you design some kind of function (your filter)
"f", with which you alter the original data:

$$p(t) = f(t) * s(t)$$

where f(t) is your filter, and p(t) is the final, and hopefully
better product. You might do this to eliminate noise, compensate
for imperfect instrumentation, or to enhance some specific aspect
of your data for further analysis. MATLAB has a whole host of
functions in its *Signal Processing Toolbox *that address
these issues.
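As a crude stand-in for such a designed filter, here is a Python sketch that convolves a noisy series with a 5-point running mean (a sketch under our own assumptions, not a substitute for a properly designed filter or the Signal Processing Toolbox):

```python
import random

def convolve(x, y):
    """Full discrete convolution of two finite series."""
    out = [0.0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] += xi * yj
    return out

random.seed(0)
signal = [float(i % 10) for i in range(100)]                # "u": clean sawtooth
noisy = [v + random.gauss(0.0, 0.5) for v in signal]        # "s": signal + noise
f = [0.2] * 5                                               # 5-point running mean
p = convolve(noisy, f)                                      # "p": smoothed product
```

Note that the output is longer than the input by (length of filter - 1), so the ends of p are edge-contaminated and are usually trimmed off.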



© 1998, 2000 -- David M. Glover, WHOI --