Sequence Analysis I: Uniform Series, Cross- and Auto-Correlation, and Fourier Transforms
File last modified 2 October 2000
6.2 Analysis in Time and Space
6.2.1 Auto-covariance and auto-correlation
An obvious question to ask when you are collecting a time series is "to what extent will the next measurement depend on the one I have just collected?" Or, for that matter, on the one before that. There is a simple statistical measure of this dependence called the auto-covariance, and the procedure for computing it is very easy. Consider a time series of length N (there are N equally spaced measurements), which we will call "y"; each individual element we will refer to as "yi". First, make a duplicate of the complete time series. You then build a new series which is the lagged product of the two: multiply every element in the first series by the corresponding element in the second, sum the products, divide the sum by the length N, and subtract the product of the means of the two series. At zero lag this is given by

cov_0 = \frac{1}{N} \sum_{i=1}^{N} y_i^2 - \left( \frac{1}{N} \sum_{i=1}^{N} y_i \right)^2

which you should immediately recognize as the variance. This we call the auto-covariance at lag 0, and it is the first element in your series. In MATLAB, this would look like
cov0 = mean(y.^2) - mean(y)^2;
Note the dot before the exponentiation: you are squaring the elements of "y", not raising a matrix to a power. We called it "cov0" because MATLAB doesn't recognize a "zeroth" element in a vector or array.

Now let's shift the second series to the left by one interval. Since the series are of finite length, there will be one fewer element of overlap (picture one element hanging out to the left in the second series, and one element hanging out to the right in the first series). You now compute a covariance at lag 1, as

cov_1 = \frac{1}{N-1} \sum_{i=2}^{N} y_i y_{i-1} - \bar{y}_a \bar{y}_b
Notice that things are a little different now: we divide by (N-1) because the overlap is shorter by one, and the summation starts at i=2 for the same reason. The means are different too, since they are averaged over different intervals. We've used an admittedly sloppy nomenclature (it changes from equation to equation) to designate these means:

\bar{y}_a = \frac{1}{N-1} \sum_{j=2}^{N} y_j , \qquad \bar{y}_b = \frac{1}{N-1} \sum_{j=1}^{N-1} y_j

Again, pay careful attention to the summation limits: one average starts at j=1, and the other starts at j=2. The corresponding MATLAB statement would be
cov(1) = mean(y(2:N).*y(1:N-1)) - mean(y(2:N)).*mean(y(1:N-1));
Yuch! This is getting a little messy. Remember the little dot before the multiplication sign between the y's (it's not necessary between scalars, but it doesn't hurt to put it in). This is the second element in your series. Now if we try 2 lags... no, let's make that "n" lags, you have (sigh!):

cov_n = \frac{1}{N-n} \sum_{i=n+1}^{N} y_i y_{i-n} - \bar{y}_a \bar{y}_b
where you now have only (N-n) overlapping elements to compare, and a new definition of the means to subtract:

\bar{y}_a = \frac{1}{N-n} \sum_{j=n+1}^{N} y_j , \qquad \bar{y}_b = \frac{1}{N-n} \sum_{j=1}^{N-n} y_j
Yes, we know we've been playing fast and loose with definitions here; sorry about that. We're trying to make things a little clearer. The corresponding MATLAB code is
cov(n) = mean(y(1+n:N).*y(1:N-n)) - mean(y(1+n:N))*mean(y(1:N-n));
Sheesh, this is getting even worse. Don't forget the little dot again. (One caution: using "cov" as a variable name shadows MATLAB's built-in cov function; a name like "acov" is safer.) Now remember one thing: we could continue to increase "n" until it is as large as N-1 (a lag of N would leave no overlap at all). But as the lag becomes larger, the degree of overlap becomes smaller (that's the "N-n" speaking to you), and the estimate becomes noisier. The general rule of thumb, largely driven by common sense, is to not let "n" get any bigger than about 1/5 of N. A plot of the auto-covariance vs. lag time is called an auto-covariogram. We wonder why.
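To see the whole recipe in one place, here is a minimal sketch in Python/NumPy (the document's code is MATLAB; this translation, and the function name autocov, are ours):

```python
import numpy as np

def autocov(y, max_lag):
    """Unbiased auto-covariance of y at lags 0..max_lag.

    At lag n we average the (N - n) overlapping products and
    subtract the product of the two segment means, exactly as in
    the summations above.
    """
    y = np.asarray(y, dtype=float)
    N = len(y)
    c = np.empty(max_lag + 1)
    for n in range(max_lag + 1):
        a, b = y[n:], y[:N - n]        # the two overlapping segments
        c[n] = np.mean(a * b) - np.mean(a) * np.mean(b)
    return c

# A pure sine wave: it matches itself at a full period and
# anti-matches at a half period.
t = np.arange(400)
y = np.sin(2 * np.pi * t / 40)
c = autocov(y, 80)                     # rule of thumb: lags out to ~N/5
print(round(c[0], 3))                  # lag 0 is just the variance (~0.5)
print(round(c[40] / c[0], 2))          # one full period: ~ +1
print(round(c[20] / c[0], 2))          # half period: ~ -1
```

The periodic peaks and troughs in c are exactly the covariogram structure described below for the sunspot record.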
Well, don't get out your text editors to write an m-file that does all that nasty stuff, because it's already built into MATLAB. You just call one routine:
c = xcov(y,'unbiased');

which returns a vector of length 2N-1 containing all possible lags between -(N-1) and +(N-1), with the zero lag in the middle; that is, cov0 = c(N). Common sense tells you not to believe the lags much more than N/5 from the center, but the makers of MATLAB decided to give you everything and let you choose. Moreover, the symmetry of the situation demands that the vector have mirror symmetry about its center (a lag of 1 is the same as a lead of 1), so that c(N-1) = c(N+1), c(N-2) = c(N+2), and so on. Go figure.
Now what's with this 'unbiased' argument (note the single quotes around it)? Well, in their finite wisdom, the people at MathWorks offer you four flavors of the covariance. The default (with no argument) gives you the raw summation, without dividing by anything at each element. They must have their reasons for this, somewhere. If you put in 'biased', you get the summation divided by N at each element. The option 'unbiased' gives you our equation, dividing by (N-n) at each lag (where n is the lag), and is "unbiased" because it corrects for the number of overlapping elements. Finally, 'coeff' normalizes the sequence so that the zero-lag element is 1.0. These different options all have their uses. What does the covariance tell you? It speaks to the original question we posed at the beginning of this lecture: to what extent does a given measurement depend on its predecessors?
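If you want to see what xcov is doing under the hood (or don't have MATLAB handy), here is a rough emulation in Python/NumPy. One assumption to flag: we take xcov to remove the overall series mean before forming the lagged sums, which differs slightly from the segment-mean recipe above:

```python
import numpy as np

def xcov_like(y, scale='none'):
    """Rough emulation of MATLAB's xcov(y, scale) for a single series.

    We assume xcov removes the overall series mean and then forms the
    lagged sums; 'biased' divides by N, 'unbiased' by (N - |n|), and
    'coeff' scales so the zero-lag element is 1.
    """
    y = np.asarray(y, dtype=float)
    N = len(y)
    d = y - y.mean()
    c = np.correlate(d, d, mode='full')   # lags -(N-1) .. +(N-1)
    lags = np.arange(-(N - 1), N)
    if scale == 'biased':
        c = c / N
    elif scale == 'unbiased':
        c = c / (N - np.abs(lags))
    elif scale == 'coeff':
        c = c / c[N - 1]                  # zero lag sits in the middle
    return lags, c

lags, c = xcov_like(np.sin(np.linspace(0, 20, 200)), 'coeff')
print(len(c), c[len(c) // 2])             # 2N-1 values, 1.0 at zero lag
```

Note the mirror symmetry: c[N-2] equals c[N], just as c(N-1) = c(N+1) in MATLAB's 1-based indexing.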
Below is the monthly sunspot index from 1749 to 1994 (you can download it if you like):
As you can see (and as most people know), there is a regular fluctuation in the index with an approximately 11-year periodicity. Performing the (unbiased) auto-covariance and looking at the result as a function of lag, we see the following:
We've shown it on two scales so you can see the detail. Note that the largest value of the covariance is at zero lag. This makes sense, since the series never matches itself perfectly except at zero lag. The covariance decreases with increasing lag (left panel). Note that there is a kind of "decorrelation time": the sunspot index you see this year is a good predictor of what you'll see next year, but this predictive power tails off, so that by the time you get out to about three years the auto-covariance is zero and there is no direct correspondence. This zero crossing is an important concept, since it characterizes the length of time that must elapse before the time series becomes uncorrelated with itself, i.e., how long it takes to forget what it was doing. This particular sequence is strongly periodic, so the auto-covariance becomes significantly non-zero again at longer lags; most sequences do not have such strong periodicity, and their auto-covariance drops to zero and stays there.
This first "zero" is known as the decorrelation time. If we were looking at a space series (e.g., a hydrographic section of regularly spaced stations), the first "zero" of the auto-correlation would be a measure of the spatial decorrelation length. This would be a measure of the eddy mixing scale: stations closer together than this distance will tend to be correlated, and hence not independent of one another. If you were testing some statistical theory about property distributions, this implies that the number of degrees of freedom may be less than the number of stations (observations) if their spacing is less than this scale. Computation of goodness-of-fit parameters must take this reduced number of degrees of freedom into account.
But look at about 5.5 years: there is a negative correspondence. If the sunspot index is high this year, it will likely be low about five and a half years from now, and vice versa. Notice that as you approach the 11-year lag, the covariance rises to another maximum. Because there are longer-term variations in the index, the match cannot be perfect, so the covariance never again reaches the zero-lag high, but it does go through a maximum. This is the indicator of periodicity. Note that as you increase the lag further, you swing periodically through minima and maxima associated with the times when the series matches up. This certainly looks like a legitimate signal, and not noise.
What would noise look like? We're glad you asked! Below is a synthetic data set of normally distributed noise, and its auto-covariance:
Very distinctive, isn't it? The covariance drops to zero immediately, since with true noise you cannot predict the next value from the current one. In fact, this is regarded as a very good test of your ability to generate random numbers. Generating truly random numbers on a computer is a non-trivial process, unless your computer is broken.
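You can run this test yourself; a quick sketch in Python/NumPy (the seed and series length are arbitrary choices of ours):

```python
import numpy as np

# With true noise, the unbiased auto-covariance should collapse to
# zero at every nonzero lag, while the zero-lag value is the variance.
rng = np.random.default_rng(0)          # the seed is arbitrary
y = rng.standard_normal(10_000)

def autocov_at(y, n):
    a, b = y[n:], y[:len(y) - n]
    return np.mean(a * b) - np.mean(a) * np.mean(b)

c0 = autocov_at(y, 0)                   # ~1 for unit-variance noise
c1 = autocov_at(y, 1)                   # ~0: the series has no memory
print(round(c0, 1), abs(c1) < 0.05)
```

A generator whose lag-1 covariance sits well away from zero is failing exactly the randomness test described above.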
Now we can also look at the auto-correlation, which is simply defined as a normalized form of the auto-covariance:

r_n = \frac{cov_n}{cov_0}
Sensibly, this value starts at 1 for zero lag and decreases thereafter, always remaining between -1 and +1. Have a look at Davis, p. 222, for examples of various "idealized" autocorrelation plots.
The question remains: when you look at autocorrelograms, how significant are the numbers? Instinct tells you that the autocorrelation of a random series should go to zero very quickly (like the example above), so the null hypothesis is that the autocorrelation at a given lag has been generated by a random series (i.e., that it is indistinguishable from zero). Well, the expected variance of the autocorrelation for a random sequence of length "N" at some lag "n" is

\sigma_{r_n}^2 = \frac{1}{N - n - 3}
Be warned that there is an error in Davis: equations 4.63 and 4.65 on page 223 are in error, and the sign in front of the 3 in those equations should be minus. The 3 comes from a correction for the degrees of freedom used in computing the means of each of the two series (even if they are the same) and in the choice of the starting point of the autocorrelation. Now, just as in any statistical distribution, the ratio of the observed deviation from the mean correlation (which is zero for a random sequence) to the expected standard deviation is a measure of the probability of its occurrence. So we take the ratio of the observed correlation at lag n to this number, and after a little simplification we get

Z_n = r_n \sqrt{N - n - 3}
So if Z_n is 2, then there is a probability of only 0.05 that the null hypothesis is correct (that the series is random at that lag, and that the autocorrelation coefficient is not significantly different from zero). Put another way (although a statistician might wince at the phrasing), if

\left| r_n \right| > \frac{2}{\sqrt{N - n - 3}}
then it is likely (>95%) non-random. This dependence, by the way, underlines the general importance of keeping lags small relative to the record length: as the size of the lag increases, the uncertainty in the autocorrelation coefficient increases. This trend doesn't show up for the time series we have examined so far because they are so long, but common sense tells you that you ought to avoid doing this kind of analysis on time series shorter than about 50-100 observations. Below are the autocorrelation plots for both the sunspot series and the random noise, along with the 95% confidence intervals (Z_n = 2).
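The confidence band itself is trivial to compute. A sketch in Python/NumPy, using the corrected (N - n - 3) form and a record length chosen to mimic roughly 246 years of monthly sunspot numbers:

```python
import numpy as np

# 95% band for the autocorrelation of a random series: |r_n| must
# exceed roughly 2 / sqrt(N - n - 3) to be distinguishable from zero.
N = 246 * 12                            # ~246 years of monthly data
lags = np.arange(1, N // 5)             # honor the N/5 rule of thumb
band = 2.0 / np.sqrt(N - lags - 3)
print(round(band[0], 4))                # tight near zero lag
print(band[-1] > band[0])               # and wider as the lag grows
```

For a record this long the band stays narrow, which is why the sunspot correlogram clears it so easily; for a 50-point series the band is an order of magnitude wider.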
The sunspot series correlogram is clearly significant all the way out to 50 year lags.
6.2.2 Cross-covariance and cross-correlation
Well, what you can do for one series, you can do for two. The cross-covariance of two time series is a tool for discerning the relationship between two phenomena. For example, you might be interested in the influence of sunspot index on the weather. There was a long period of anomalously cold weather, called the "Little Ice Age" several hundred years ago, which appeared to be associated with a minimum in the sunspot index (the "Maunder Minimum"). This correspondence shows up in a number of other records, for example radiocarbon in tree rings and corals.
Of interest is not only the degree of correspondence between two records, but possible time lags between them. For example, there are records of atmospheric CO2 concentrations in ice core bubbles. There is a clear indication that atmospheric CO2 levels were lower during glacial maxima. Were these lower concentrations a result of the glacial events, or did they cause them? If the CO2 record leads the glacial advance/retreat, then this would imply, at least circumstantially, that there is a causal relation where the impact of CO2 on the radiative balance of the earth drives the ice advance and retreat.
Further, the size of the lag may itself be of interest. For example, the time lag between the CO2 record and other glacial indicators reflects the response times of the various systems. A more direct example would be the time lag between sea surface temperature and a temperature proxy indicator (del-O18) in foram tests in a sediment trap record: the lag would represent the settling time of the particles, and hence their settling velocity.
The cross-correlation of two time series (y1 and y2) at a given lag is given by

r_m = \frac{ m \sum y_{1i} y_{2i} - \sum y_{1i} \sum y_{2i} }{ \sqrt{ m \sum y_{1i}^2 - \left( \sum y_{1i} \right)^2 } \; \sqrt{ m \sum y_{2i}^2 - \left( \sum y_{2i} \right)^2 } }

where the sums run over the overlapping pairs,
which is just the cross-covariance divided by the product of the standard deviations of the individual (overlapping) series. This should look familiar to those who have done linear correlation. Here "m" is the number of overlapping pairs in the correlation calculation. Note that the zero-lag correlation coefficient "r0" will in general not be 1, since the two series will never match perfectly.
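A quick sketch of the lagged correlation in Python/NumPy (the function name crosscorr and the sign convention for the lag are our own choices):

```python
import numpy as np

def crosscorr(y1, y2, lag):
    """Correlation of y1 against y2 shifted by `lag` samples.

    A negative lag compares y1(t) with y2(t + |lag|); the sign
    convention is ours, not a standard.
    """
    y1, y2 = np.asarray(y1, float), np.asarray(y2, float)
    if lag >= 0:
        a, b = y1[lag:], y2[:len(y2) - lag]
    else:
        a, b = y1[:len(y1) + lag], y2[-lag:]
    return np.corrcoef(a, b)[0, 1]      # m = len(a) overlapping pairs

# Two sinusoids where the second lags the first by 5 samples:
t = np.arange(300)
y1 = np.sin(2 * np.pi * t / 50)
y2 = np.sin(2 * np.pi * (t - 5) / 50)
print(round(crosscorr(y1, y2, 0), 2))   # ~0.81: imperfect at zero lag
print(round(crosscorr(y1, y2, -5), 2))  # ~1.0 at the true lag
```

Sweeping the lag and finding where the coefficient peaks is exactly the lead/lag question posed for the CO2 and glacial records above.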
Like the autocorrelation function, we can test the significance of the correlation coefficient (remember, the null hypothesis is that the series are uncorrelated) with the t-test:

t = r_m \sqrt{ \frac{m - 2}{1 - r_m^2} }

which has m-2 degrees of freedom. Thus as r_m approaches 1, t approaches infinity (i.e., the null hypothesis becomes very improbable), and as r_m approaches 0, t approaches 0.
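The test is one line of arithmetic; a minimal sketch in Python (the function name corr_t is ours):

```python
import math

# t = r_m * sqrt((m - 2) / (1 - r_m**2)), with m - 2 degrees of
# freedom, where m is the number of overlapping pairs.
def corr_t(r_m, m):
    return r_m * math.sqrt((m - 2) / (1 - r_m ** 2))

print(corr_t(0.5, 30))                     # ~3.06 for a modest correlation
print(corr_t(0.99, 30) > corr_t(0.5, 30))  # t blows up as r_m -> 1
```

Compare the result against tabulated t values for m-2 degrees of freedom at your chosen confidence level.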
6.2.3 Convolution and implications for signal theory
The process of cross-covariance is closely related to a mathematical operation called convolution. This is represented for two functions "x" and "y" by

(x * y)_t = \sum_i x_i \, y_{t-i} \qquad \mathrm{or} \qquad (x * y)(t) = \int x(\tau) \, y(t - \tau) \, d\tau
where the right-hand equation is the continuum equivalent representation (an integral, after all, is just the continuum equivalent of a summation). In these equations the * represents convolution and should not be confused with the MATLAB multiplication operator; it has a specific connotation in mathematics. The integral is done over \tau, the time lag. If we were to "normalize" our data by subtracting the means and dividing by the standard deviations, i.e., make them "Z-scores", then this lagged-product operation on the two series yields the cross-correlation coefficient (strictly speaking, correlation runs one series forward in time rather than reversed, but the mechanics are otherwise identical).
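The kinship between convolution and correlation is easy to demonstrate numerically. In Python/NumPy (used here only as a convenient calculator), correlating two sequences is the same as convolving one with the time-reverse of the other:

```python
import numpy as np

# Correlation is convolution with one sequence reversed in time:
# np.correlate(x, y, 'full') matches np.convolve(x, y[::-1]) exactly.
x = np.array([1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 0.5])
conv = np.convolve(x, y[::-1])          # convolve with time-reversed y
corr = np.correlate(x, y, mode='full')  # all possible lags
print(np.allclose(conv, corr))          # True
```

This is why a routine built for one operation can always be pressed into service for the other.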
This has an interesting application to signal processing. If you are measuring some kind of uncontaminated process (we'll call it "u") with a device, and the device has some kind of response function (which we'll call "r", and which may, for example, smear your signal out because the device has a finite response time), then the output (signal "s") of your device will be the convolution of the response function with the true signal, that is

s(t) = r(t) * u(t)
What would the "ideal" instrument response function look like? Well, you'd want something that puts out exactly what came in, independent of previous or following measurements. That is, you'd want

r(t) * u(t) = u(t)
In the finite-sum case, this kind of function is simply a series which is 1 at zero lag and zero everywhere else. In the continuum case, it corresponds to a "delta function", which is zero everywhere except exactly at zero lag, and has an area of 1. This, you say, is impossible: if it is infinitely narrow, it must be infinitely tall to have unit area (area = height × width). Well, it can be done; don't worry about it. Such a function acts as an identity for convolution: convolved with any function, it yields that same function back. (It is also the impulse used to define a system's "Green's function", the response you get out when you put a perfect spike in.)
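In discrete terms this identity is easy to verify; a quick sketch in Python/NumPy (the example series is arbitrary):

```python
import numpy as np

# The discrete "ideal response" is a unit impulse: 1 at zero lag and
# zero everywhere else.  Convolving any series with it returns the
# series unchanged.
u = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
delta = np.array([0.0, 1.0, 0.0])       # impulse at the center
s = np.convolve(u, delta, mode='same')  # keep the central len(u) samples
print(np.array_equal(s, u))             # True
```

Replace the impulse with anything broader and the output is a smeared version of u, which is exactly the instrument-response problem described above.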
In real life, however, there is also the additional problem of noise (random interference with your signal). We call this "n", and say that the thing that finally makes it to your chart recorder or computer ADC would be

s(t) = r(t) * u(t) + n(t)
Believe it or not, there are ways of filtering your data to extract the original, desired signal. We'll get back to this later, but in general, filtering your data involves a similar process, whereby you design some kind of function (your filter) "f", and alter the original data with

p(t) = f(t) * s(t)
where f(t) is your filter, and p(t) is the final, and hopefully better product. You might do this to eliminate noise, compensate for imperfect instrumentation, or to enhance some specific aspect of your data for further analysis. MATLAB has a whole host of functions in its Signal Processing Toolbox that address these issues.
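As a concrete (if simplistic) illustration, here is a sketch in Python/NumPy of one of the most common filters, a running mean; the 5-point window and the noise level are arbitrary choices of ours:

```python
import numpy as np

# One simple choice of filter f: a 5-point running mean applied by
# convolution, smoothing noise riding on a slow signal.
rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 500)
u = np.sin(2 * np.pi * 2 * t)            # the clean, slow signal
s = u + 0.3 * rng.standard_normal(500)   # signal plus noise
f = np.ones(5) / 5                       # the filter; it sums to 1
p = np.convolve(s, f, mode='same')       # the filtered product

# The filtered record should sit closer to the clean signal than
# the raw one does.
print(np.mean((s - u) ** 2) > np.mean((p - u) ** 2))
```

A running mean is only one crude choice of f; the Signal Processing Toolbox routines mentioned above let you design far more selective filters.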