12.747 Lecture 7: Section 5:

Sequence Analysis II: Optimal Filtering and Spectral Analysis

File last modified 9 October 1996


7.5 Non-Uniform Time Series

Everything we've done so far requires evenly spaced data. There is no fundamental reason why this must be so, it is just one of mathematical convenience. The Fourier integrals presented in section 6.3 of the last lecture may be taken over any arbitrary distribution of points in time, provided that the appropriate numerical integration scheme is applied, and the attention is paid to the real limits to accuracy of such an integration (especially related to the nyquist limits). The latter becomes a serious, non-trivial issue, because the nyquist limit becomes a rather diffuse issue when points are unevenly spaced in time. Do you take the total time span of measurements divided by the number of points (i.e., the average sample interval) as the measure of ? Or do you take the closest (or furthest) spacing? We'll dwell on that a little later, but first we need to get down to pragmatic issues: how do we do the calculation if spacing is uneven? FFT seems definitely out, so what do we do now?

Some kludge methods have been used. First, people may interpolate their data onto a regular grid. This is not too bad if the data are more-or-less evenly spaced, but need to be tweaked into submission. Difficulties arise because you may be destroying information by averaging over data points. For example if some data points are very close together, you have some kind of high frequency information (the pair have a much higher nyquist frequency than the average of the series), and you throw that information out by gridding onto a courser grid. Worse yet, you may create information by interpolating to a finer grid in some areas where your data is sparse. Depending on your interpolation scheme (e.g., splines) you may introduce all sorts of bizarre frequencies.

Another involves the fact that you have a regular series, but some of the data are missing. This so-called missing data problem is quite common, and the way to deal with it ranges from simply putting zeros in where the data is missing, to clamping the missing points to the actually measured value, to interpolating the segment. This may be forgivable if you are missing a very small percentage of your data, and that the gaps are 1 or perhaps 2 data points in length. The problem arises when the gaps become large or a significant portion of the record. Any patching technique introduces spurious frequencies, generally at the low frequency end.

The answer is to use Lomb's method, which involves least squares fitting the time domain data points to sine/cosine series.

7.5.1 Lomb's Method

Lomb (Astr. Space Sci., 39 (1976) pp447-462) devised a scheme of performing a general linear least squares regression of unevenly spaced data to a sine/cosine series of different frequencies. This is similar to the approach we used to solve problem 4 in Problem Set 2. What you do is decide what frequencies that you wish to fit (usually a range of evenly spaced frequency values) and build the normal equations and solve them. That turns out to be a little tedious, so Lomb presented a simplification that permits you to solve for each coefficient independently. You first define the mean and variance, using the usual formulae:

Then you generate the Lomb normalized periodogram using, for the ith angular frequency ()

where the sums over j are from 1 to N (the series length) and the constant is defined for a given angular frequency by

This seems a rather tedious formulation, but is easily coded up in MATLAB, which we give you as lomb.m for your computing pleasure. Now because of this normalization, it turns out that the null hypothesis (i.e. that a given PN is due to a random sequence) is given by

where z is the observed value, and M is the number of independent frequencies determined by the series. This number "M" is difficult to pin down, but for an evenly spaced time series with unbinned statistics, the number of independent frequencies is N, the number of data points in the sequence. For unevenly spaced data, the answer is not so simple, but if the data is not particularly clumped, then is not a bad approximation. If the data are clumped in groups of "n" (e.g. groups of two), then the number of independent frequencies is approximately N/n. Knowing M very precisely is not really that important, for if you are thinking in terms of confidence intervals in the range of 0.1 to 5% (for the null hypothesis) then estimates which differ in probability by a factor of two are not really of concern: would you feel significantly better about a 0.1% (99.9% confidence interval) than a 0.2% (99.8% confidence interval)? Not!

Now the Lomb method has some nice features (aside from working!). First, it weights the spectral fit appropriately by the actual data points. This is the natural character of least squares regressions. Second, it takes advantage of the fact that some of your data is clustered more tightly than half the average nyquist interval (, where is the average spacing obtained by dividing the total time span of the measurements by the number of points in the series. That is, if some of the points are separated by as little as , some small fraction of the average spacing, then you have made a determination of frequencies at this much higher rate i.e. .

Below is a series of plots showing the data and the Lomb Periodogram for integral frequencies from 1 to 200 Hz. The file that created this is Lombtest.m which creates a random series of 100 points between t=0 and 1, and then computes some artificial data with noise with two dominant frequencies. Note that both frequencies are picked out, and that the lower of the two frequencies is exactly at the average nyquist frequency, and the higher of the two frequencies (130 Hz) is above the average nyquist frequency. A uniform time series of 100 points would have a sampling interval of .01 seconds, and hence a nyquist frequency of 50 Hz. Because the data are scattered, there are enough points close enough together to pick out the higher frequencies. You can see this in the upper right plot of the group, which is a histogram of the log10(separation) of the points in the series. Although the bulk of the points are separated by a distance of .01 (log10 = -2) there are a number of points with finer spacing. It is these that give you the high frequency distribution. Furthermore, there is no frequency folding with this technique.

The test algorithm also returns the frequencies, values and probabilities of the most probable peaks:

Frequency Pn Prob ---------------------------- 130.0000 34.5164 0.0000 50.0000 12.8013 0.0006 3.0000 4.1033 0.9643 93.0000 4.0804 0.9669 63.0000 3.8605 0.9858

Remember that Prob is the probability of the Null Hypothesis (that the peak is random) being true.

The above points to an important tactic in experiment design. Although it is perhaps simpler to design your sampling around a regular grid, you learn more by having an irregular sampling strategy.


GoTo Next Lecture
GoTo Lecture TOC
GoTo 12.747 TOC


The text, graphics, and other materials contained in this webpage and attached documents are intended solely for scholarly use by the scientific and academic community. No reproduction, re-transmission or linking of this page to any other page without the author's expressed written permission is permitted.
© 1998, 2000 -- David M. Glover, WHOI --