439/639: Vector Time Series and Spurious Correlation

Author

Dr Sergey Kushnarev

1 Cross-covariance and Cross-correlation Functions

Cross-covariance and cross-correlation functions are used to analyze the relationship between two time series. The cross-covariance function measures the covariance between two time series at different lags, while the cross-correlation function measures the correlation between two time series at different lags. \[ \gamma_m(x,y) = Cov(X_t, Y_{t+m}) \]

Cross-correlation function is defined as: \[ r_m(x,y) = \frac{\gamma_m(x,y)}{\sqrt{\gamma_0(x,x)\gamma_0(y,y)}} \]

where \(\gamma_0(x,x)\) and \(\gamma_0(y,y)\) are the variances of the time series \(x\) and \(y\), respectively. The cross-correlation function is a normalized version of the cross-covariance function, which allows for easier interpretation and comparison between different time series.

Often cross-correlation function is displayed in a grid of plots, where each plot shows the cross-correlation function between two time series at different lags. This allows for a visual representation of the relationship between the two time series and can help identify any patterns or trends in the data.

Code
# Load required packages
library(TSA)

set.seed(639)
X=rnorm(105) 
Y=zlag(X,2)+0.5*rnorm(105)
X=ts(X[-(1:5)],start=1,freq=1)
Y=ts(Y[-(1:5)],start=1,freq=1)

# Make X, Y a multivariate time series
library(astsa)
X=ts(X,start=1,freq=1)
Y=ts(Y,start=1,freq=1)
XY <- ts(cbind(X, Y), start = 1, frequency = 1)
# Cross-covariance function
acfm(XY)

1.1 Bartlett’s Theorem for Sample Cross-Correlation Function

Bartlett’s theorem states that the sample cross-correlation function converges to the true cross-correlation function as the sample size increases. This means that as the number of observations in the time series increases, the sample cross-correlation function will become a more accurate estimate of the true cross-correlation function.

\[ r_m(X,Y)\sim N(\rho_m(X,Y),\frac{1}{n}\left[1+2\sum_{k=1}^\infty\rho_k(X)\rho_k(Y)\right]) \]

where \(\rho_k(X)\) and \(\rho_k(Y)\) are the autocorrelation functions of the time series \(X\) and \(Y\), respectively.

The term \(2\sum_{k=1}^\infty\rho_k(X)\rho_k(Y)\) often result in inflation of the variance of the sample cross-correlation function, which can lead to misleading conclusions about the relationship between the two time series. This is particularly important when analyzing time series data with long memory or strong autocorrelation, as these characteristics can significantly affect the sample cross-correlation function.

In the example of \(X_t\sim AR(1)\) and \(Y_t\sim AR(1)\) the variance becomes: \[ \frac{1}{n}\frac{1+\phi_X\phi_Y}{1-\phi_X\phi_Y} \]

Code
set.seed(639)
library(astsa)
x <- rnorm(100)
y <- rnorm(100)
xy <- ts(cbind(x, y), start = 1, frequency = 1)
acfm(xy)

Code
ccf(xy[,1], xy[,2], lag.max = 20, main = 'Cross-correlation function of X and Y')

Code
x <- arima.sim(model = list(order = c(1, 0, 0), ar = 0.8), n = 100)
y <- arima.sim(model = list(order = c(1, 0, 0), ar = 0.8), n = 100)
xy <- ts(cbind(x, y), start = 1, frequency = 1)
acfm(xy)

1.2 Prewhitening

Prewhitening is a technique used to remove the effects of autocorrelation from a time series before analyzing the cross-correlation between two time series. This is particularly important when the two time series have strong autocorrelation, as this can lead to misleading conclusions about the relationship between the two time series.

Code
set.seed(123)
x <- arima.sim(n = 100, model = list(ar = 0.9))
y <- 0.5 * x + arima.sim(n = 100, model = list(ar = 0.9))

ccf(x,y)

Code
# Prewhiten x and filter y
prewhiten(x, y)

1.3 Milk vs Electricity production example

Code
data(milk) 
data(electricity)
milk.electricity <- ts.intersect(milk,log(electricity))
plot(milk.electricity,yax.flip=T)

Code
acfm(milk.electricity)

Code
ccf(milk.electricity[,1],milk.electricity[,2],
    lag.max=20,main='CCF of Milk and Electricity')

Let’s prewhiten the data

Code
me.dif <- ts.intersect(diff(diff(milk,12)),
                    diff(diff(log(electricity),12)))
prewhiten(as.vector(me.dif[,1]),
        as.vector(me.dif[,2]),
        ylab='CCF')

1.4 Bluebird potato chips example

Our first example of this section is a sales and price dataset of a certain potato chip from Bluebird Foods Ltd., New Zealand. The data consist of the log-transformed weekly unit sales of large packages of standard potato chips sold and the weekly average price over a period of 104 weeks.

Code
data(bluebird)
plot(bluebird,yax.flip=T)

Code
acfm(bluebird)

Code
prewhiten(y=diff(bluebird)[,1],x=diff(bluebird)[,2],ylab='CCF')

We see strong contemporaneous correlation between the two series. Therefore the model should be \[ Log.sales_t = \beta_0 + \beta_1 Price_t + \epsilon_t \] The model can be estimated using OLS regression.

Code
sales=bluebird[,1] 
price=bluebird[,2]
chip.m1=lm(sales~price,data=bluebird)
summary(chip.m1)

Call:
lm(formula = sales ~ price, data = bluebird)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.54950 -0.12373  0.00667  0.13136  0.45170 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   15.890      0.217   73.22   <2e-16 ***
price         -2.489      0.126  -19.75   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.188 on 102 degrees of freedom
Multiple R-squared:  0.7926,    Adjusted R-squared:  0.7906 
F-statistic: 389.9 on 1 and 102 DF,  p-value: < 2.2e-16

Analyzing the residuals of the model, we see that they are not white noise. This indicates that there is still some autocorrelation present in the data, which suggests that the model may not be fully capturing the relationship between the two time series.

Code
acf(residuals(chip.m1),ci.type='ma')

Code
pacf(residuals(chip.m1))

Code
eacf(residuals(chip.m1))
AR/MA
  0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 x x x x o o x x o o o  o  o  o 
1 x o o x o o o o o o o  o  o  o 
2 x x o x o o o o o o o  o  o  o 
3 x x o x o o o o o o o  o  o  o 
4 o x x o o o o o o o o  o  o  o 
5 x x x o x o o o o o o  o  o  o 
6 x x o x x x o o o o o  o  o  o 
7 x o x o o o o o o o o  o  o  o 

The diagnostics suggest the model: _______

Code
chip.m2=arima(sales,order=c(1,0,4),xreg=data.frame(price))
chip.m2

Call:
arima(x = sales, order = c(1, 0, 4), xreg = data.frame(price))

Coefficients:
         ar1      ma1     ma2     ma3     ma4  intercept    price
      0.1989  -0.0554  0.2521  0.0735  0.5269    15.7792  -2.4234
s.e.  0.1843   0.1660  0.0865  0.1084  0.1376     0.2166   0.1247

sigma^2 estimated as 0.02556:  log likelihood = 42.35,  aic = -70.69
Code
chip.m3=arima(sales,order=c(1,0,4),xreg=data.frame(price),
fixed=c(NA,0,NA,0,NA,NA,NA)) 
chip.m3

Call:
arima(x = sales, order = c(1, 0, 4), xreg = data.frame(price), fixed = c(NA, 
    0, NA, 0, NA, NA, NA))

Coefficients:
         ar1  ma1     ma2  ma3     ma4  intercept    price
      0.1444    0  0.2676    0  0.5210    15.8396  -2.4588
s.e.  0.0985    0  0.0858    0  0.1171     0.2027   0.1166

sigma^2 estimated as 0.02572:  log likelihood = 42.09,  aic = -74.18
Code
chip.m4=arima(sales,order=c(0,0,4),xreg=data.frame(price),
fixed=c(0,NA,0,NA,NA,NA)) 
chip.m4

Call:
arima(x = sales, order = c(0, 0, 4), xreg = data.frame(price), fixed = c(0, 
    NA, 0, NA, NA, NA))

Coefficients:
      ma1     ma2  ma3     ma4  intercept    price
        0  0.2884    0  0.5416    15.8559  -2.4682
s.e.    0  0.0794    0  0.1167     0.1909   0.1100

sigma^2 estimated as 0.02623:  log likelihood = 41.02,  aic = -74.05

Note that the regression coefficient estimate on Price is similar to that from the OLS regression fit earlier, but the standard error of the estimate is about 10% lower than that from the simple OLS regression. This illustrates the general result that the simple OLS estimator is consistent but the associated standard error is generally not trustworthy.

Diagnosing the residuals from this model

Code
tsdiag(chip.m4)