Let's take a look at the phenomenon of spurious regression; a detailed treatment can be found in Harris, R., & Sollis, R. (2003) [Applied Time Series Modelling and Forecasting. Wiley.]. Data trends can create the illusion of a correlation between variables in a regression equation when, in reality, only correlated time trends exist. To address this, the time trend in a trend-stationary variable can be eliminated in one of two ways: 1) regress the variable on time and use the residuals as a new, trend-free stationary variable; or 2) include a deterministic time trend as a regressor in the model. Either way, the standard regression model then works with stationary series that have constant means and finite variances, so statistical inference based on $t$- and $F$-tests remains valid.
This remedy works only for trend-stationary series. When a variable is non-stationary because of a stochastic trend (a unit root), regressing it on a deterministic trend does not produce a stationary variable; the series must be differenced before analysis. Applying standard regression methods to such non-stationary data leads to the problem of spurious regressions and, in turn, to invalid inferences based on $t$- and $F$-tests. For example, consider the following DGP (data generating process):
\begin{equation*} \begin{array}{ll} x_{1t}=x_{1,t-1}+u_t & u_t \sim \operatorname{IN}(0,1) \\ x_{2t}=x_{2,t-1}+v_t & v_t \sim \operatorname{IN}(0,1) \end{array} \end{equation*}Here $x_{1t}$ and $x_{2t}$ are independent random walks: non-stationary and, by construction, uncorrelated with each other. Suppose we nevertheless estimate the regression model
\begin{equation*} x_{1t}=\alpha_{1}+\alpha_{2}x_{2t}+\varepsilon_t. \end{equation*}Since the true $\alpha_2$ is zero, we would expect the null hypothesis $H_0:\alpha_{2}=0$ not to be rejected and the coefficient of determination ($R^2$) to be close to zero. However, because both regressand and regressor are non-stationary, the error term ($\varepsilon_t$) is also non-stationary, and the usual $t$- and $F$-based inference breaks down.
How can we detect this phenomenon? Suppose we generate 500 observations of $u_t \sim N(0, 1)$ and $v_t \sim N(0, 1)$, assume that the initial values of both $x_1$ and $x_2$ are zero, and assume that $u_t$ and $v_t$ are serially uncorrelated as well as mutually uncorrelated. As you know by now, both resulting time series are nonstationary; that is, they are I(1) or exhibit stochastic trends.
# The Phenomenon of Spurious Regression
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import scipy.stats as stats
# fix the random seed (I chose 753951):
np.random.seed(753951)
# I set T=500 and generate errors (ut and vt) from IIDN(0,1)
T=500
u=stats.norm.rvs(0,1,T)
u[0]=0
v=stats.norm.rvs(0,1,T)
v[0]=0
A random walk model is a type of time series model that, by substituting back $n$ periods, can be represented mathematically as:
\begin{equation*} x_{1t}=x_{1,t-n}+\sum_{j=0}^{n-1} u_{t-j}, \end{equation*}where substituting all the way back to the initial value (i.e., $n=t$) gives
\begin{equation*} x_{1t}=x_{1,0}+\sum_{j=0}^{t-1} u_{t-j}. \end{equation*}# Finally, I generate the actual nonstationary series by cumulating the errors
x1=np.cumsum(u)
x2=np.cumsum(v)
hypotheticalData=pd.DataFrame({'x1':x1,'x2':x2})
Let's look at the time series of both series:
plt.plot(x1,color='black',marker='',linestyle='solid', label='x1')
plt.plot(x2,color='red',marker='',linestyle='dotted', label='x2')
plt.ylabel('x1,x2')
plt.legend()
plt.savefig('//.../images/nonsense regression')
# get regression results with OLS
reg=smf.ols(formula='x1 ~ x2 ',data=hypotheticalData)
results=reg.fit()
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                     x1   R-squared:                       0.202
Model:                            OLS   Adj. R-squared:                  0.201
Method:                 Least Squares   F-statistic:                     126.4
Date:                Sat, 28 Jan 2023   Prob (F-statistic):           2.77e-26
Time:                        20:26:44   Log-Likelihood:                -1943.8
No. Observations:                 500   AIC:                             3892.
Df Residuals:                     498   BIC:                             3900.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.4471      1.742      0.257      0.798      -2.976       3.870
x2             0.8312      0.074     11.241      0.000       0.686       0.976
==============================================================================
Omnibus:                     1954.485   Durbin-Watson:                   0.012
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               46.378
Skew:                           0.310   Prob(JB):                     8.49e-11
Kurtosis:                       1.642   Cond. No.                         77.7
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The coefficient of $x_2$ is highly statistically significant ($t = 11.24$) even though the two series were generated independently of each other. This is the phenomenon of spurious or nonsense regression, first identified by Yule, who demonstrated that correlation can persist in non-stationary time series even when the sample size is large. The extremely low Durbin-Watson $d$ value (0.012) indicates strong first-order autocorrelation in the residuals, signaling a problem with the regression. Granger and Newbold propose $R^2 > d$ as a rule of thumb for flagging a spurious regression, and that is clearly the case here ($0.202 > 0.012$). It's important to note that the $R^2$ and $t$ statistics in such a regression are unreliable and cannot be used for testing hypotheses about the parameters.
G. U. Yule, “Why Do We Sometimes Get Nonsense Correlations Between Time Series? A Study in Sampling and the Nature of Time Series,” Journal of the Royal Statistical Society, vol. 89, 1926.
C. W. J. Granger and P. Newbold, “Spurious Regressions in Econometrics,” Journal of Econometrics, vol. 2, 1974, pp. 111–120.