Sequence Analysis II: Optimal Filtering and Spectral Analysis
File last modified 9 October 1996
7.5 Non-Uniform Time Series
Everything we've done so far requires evenly spaced data. There
is no fundamental reason why this must be so, it is just one of
mathematical convenience. The Fourier integrals presented in
section 6.3 of the
last lecture may be taken over any arbitrary distribution of points
in time, provided that the appropriate numerical integration scheme
is applied, and the attention is paid to the real limits to accuracy
of such an integration (especially related to the nyquist limits).
The latter becomes a serious, non-trivial issue, because the nyquist
limit becomes a rather diffuse issue when points are unevenly
spaced in time. Do you take the total time span of measurements
divided by the number of points (i.e., the average sample
interval) as the measure of ? Or do you
take the closest (or furthest) spacing? We'll dwell on that a
little later, but first we need to get down to pragmatic issues:
how do we do the calculation if spacing is uneven? FFT seems definitely
out, so what do we do now?
Some kludge methods have been used. First, people may interpolate their data onto a regular grid. This is not too bad if the data are more-or-less evenly spaced, but need to be tweaked into submission. Difficulties arise because you may be destroying information by averaging over data points. For example if some data points are very close together, you have some kind of high frequency information (the pair have a much higher nyquist frequency than the average of the series), and you throw that information out by gridding onto a courser grid. Worse yet, you may create information by interpolating to a finer grid in some areas where your data is sparse. Depending on your interpolation scheme (e.g., splines) you may introduce all sorts of bizarre frequencies.
Another involves the fact that you have a regular series, but some of the data are missing. This so-called missing data problem is quite common, and the way to deal with it ranges from simply putting zeros in where the data is missing, to clamping the missing points to the actually measured value, to interpolating the segment. This may be forgivable if you are missing a very small percentage of your data, and that the gaps are 1 or perhaps 2 data points in length. The problem arises when the gaps become large or a significant portion of the record. Any patching technique introduces spurious frequencies, generally at the low frequency end.
The answer is to use Lomb's method, which involves least squares fitting the time domain data points to sine/cosine series.
Lomb (Astr. Space Sci., 39 (1976) pp447-462) devised a scheme of performing a general linear least squares regression of unevenly spaced data to a sine/cosine series of different frequencies. This is similar to the approach we used to solve problem 4 in Problem Set 2. What you do is decide what frequencies that you wish to fit (usually a range of evenly spaced frequency values) and build the normal equations and solve them. That turns out to be a little tedious, so Lomb presented a simplification that permits you to solve for each coefficient independently. You first define the mean and variance, using the usual formulae:
Then you generate the Lomb normalized periodogram using, for the
ith angular frequency ()
where the sums over j are from 1 to N (the series length) and
the constant is defined for a given angular
frequency by
This seems a rather tedious formulation, but is easily coded up in MATLAB, which we give you as lomb.m for your computing pleasure. Now because of this normalization, it turns out that the null hypothesis (i.e. that a given PN is due to a random sequence) is given by
where z is the observed value, and M is the number of independent
frequencies determined by the series. This number "M"
is difficult to pin down, but for an evenly spaced time series
with unbinned statistics, the number of independent frequencies
is N, the number of data points in the sequence. For unevenly
spaced data, the answer is not so simple, but if the data is not
particularly clumped, then is not a bad
approximation. If the data are clumped in groups of "n"
(e.g. groups of two), then the number of independent frequencies
is approximately N/n. Knowing M very precisely is not really that
important, for if you are thinking in terms of confidence intervals
in the range of 0.1 to 5% (for the null hypothesis) then estimates
which differ in probability by a factor of two are not
really of concern: would you feel significantly better about a
0.1% (99.9% confidence interval) than a 0.2% (99.8% confidence
interval)? Not!
Now the Lomb method has some nice features (aside from working!).
First, it weights the spectral fit appropriately by the actual
data points. This is the natural character of least squares regressions.
Second, it takes advantage of the fact that some of your data
is clustered more tightly than half the average nyquist interval
(, where
is the
average spacing obtained by dividing the total time span of the
measurements by the number of points in the series. That is, if
some of the points are separated by as little as
,
some small fraction of the average spacing, then you have made
a determination of frequencies at this much higher rate i.e.
.
Below is a series of plots showing the data and the Lomb Periodogram for integral frequencies from 1 to 200 Hz. The file that created this is Lombtest.m which creates a random series of 100 points between t=0 and 1, and then computes some artificial data with noise with two dominant frequencies. Note that both frequencies are picked out, and that the lower of the two frequencies is exactly at the average nyquist frequency, and the higher of the two frequencies (130 Hz) is above the average nyquist frequency. A uniform time series of 100 points would have a sampling interval of .01 seconds, and hence a nyquist frequency of 50 Hz. Because the data are scattered, there are enough points close enough together to pick out the higher frequencies. You can see this in the upper right plot of the group, which is a histogram of the log10(separation) of the points in the series. Although the bulk of the points are separated by a distance of .01 (log10 = -2) there are a number of points with finer spacing. It is these that give you the high frequency distribution. Furthermore, there is no frequency folding with this technique.

The test algorithm also returns the frequencies, values and probabilities of the most probable peaks:
Frequency Pn Prob ---------------------------- 130.0000 34.5164 0.0000 50.0000 12.8013 0.0006 3.0000 4.1033 0.9643 93.0000 4.0804 0.9669 63.0000 3.8605 0.9858
Remember that Prob is the probability of the Null Hypothesis (that the peak is random) being true.
The above points to an important tactic in experiment design. Although it is perhaps simpler to design your sampling around a regular grid, you learn more by having an irregular sampling strategy.
GoTo Next Lecture
GoTo Lecture TOC
GoTo 12.747 TOC