I've recently been working with linear fixed-effects panel models for my research. This class of models is a special case of fixed-effects models (used in fixed-effects regression), which have wide applicability to a number of problems. Although these models can be fit in R using the built-in lm() function most users are familiar with, fitting a model with thousands of fixed effects this way can be slow to the point of intractability and can return poor results. In addition, the fixed effects clutter the statistical summary() of your model because they are reported alongside any covariates of interest. In my work, I have about 4,000-6,000 fixed effects and, fortunately, the R community has delivered two excellent libraries for working with these models: lfe and plm. A more detailed introduction to these packages can be found in [1] and [2], respectively. Here, I'll summarize how to fit these models with each of these packages and how to develop goodness-of-fit tests and tests of the linear model assumptions, which are trickier when working with these packages (as of this writing).

I should state up front that I am going to gloss over much of the statistical red meat, speaking, as I usually do, to practitioners rather than statisticians. Also, there are a variety of flavors of models that can be estimated within this framework. I'm going to focus on just one type of model: the panel model fit with the "within" estimator.

Introduction to Fixed-Effects Panel Models

Fixed-effects panel models have several salient features for investigating drivers of change. They originate in the social sciences, where experimental setups allow for intervention-based prospective studies, and in economics, where intervention is typically impossible and inference must be drawn from observational data alone. In these prospective studies, a panel of subjects (e.g., patients, children, families) is observed at multiple times (at least twice) over the study period. The chief premise behind fixed-effects panel models is that each observational unit or individual (e.g., a patient) is used as its own control, exploiting powerful estimation techniques that remove the effects of any unobserved, time-invariant heterogeneity [3,4]. By estimating the effects of parameters of interest within an individual over time, we eliminate the effects of all unobserved, time-invariant heterogeneity between the observational units [5]. This feature has led some investigators to propose fixed-effects panel models for weak causal inference [3], as the common problem of omitted variable bias (or "latent variable bias") is removed through differencing. Causal inference with panel models still requires an assumption of strong exogeneity (simply put: no hidden variables and no feedbacks).

The linear fixed-effects panel model extends the well-known linear model, below. The response of individual \(i\), denoted \(y_i\), is a function of some group mean effect or intercept, \(\alpha\), one or more predictors, \(\beta x_i\), and an error term, \(\varepsilon_i\).

$$ y_i = \alpha + \beta x_i + \varepsilon_i $$

The basic linear fixed-effect panel model can be formulated as follows, where we add an intercept term for each of the individual units of observation, \(i\), which are observed at two or more times, \(t\):

$$ y_{it} = \alpha_i + \beta x_{it} + \varepsilon_{it} $$

It's important to note that this approach requires multiple observations of each individual. Obviously, if the number of observations \(N\) were equal to the number of individuals \(i \in M\), we would exhaust the degrees of freedom in our model simply by adding the \(M\) intercept terms, \(\alpha_1, \ldots, \alpha_M\). With as few as two observations \((t \in \{1, 2\})\) of each subject, however, we've doubled the number of observations, and the individual intercept terms now absorb any time-invariant, idiosyncratic differences between the observational units.

This model can be extended further to include both individual fixed effects (as above) and time fixed effects; the "two-ways" model:

$$ y_{it} = \alpha_i + \beta x_{it} + \mu_t + \varepsilon_{it} $$

Here, \(\mu_t\) is an intercept term specific to the time period of observation; it represents any change over time that affects all observational units in the same way (e.g., the weather or the news in an outpatient study). These effects can also be thought of as "transitory and idiosyncratic forces acting upon [observational] units (i.e., disturbances)" [3].

Thus, there are two basic kinds of fixed-effects panel models. The "unobserved effects" or individual model accounts for unobserved heterogeneity between individuals by partitioning the error term into two parts. One part is specific to the individual unit of observation and doesn't change over time while the other is the idiosyncratic error term, \(\varepsilon\), we are familiar with from basic linear models [2]. The second type is an extension of the first, the "two-ways" panel model that includes both individual and time fixed effects.

A further extension of the panel model, one often seen in the literature, is given below.

$$ y_{it} = \alpha_i + \beta x_{it} + \phi z_i + \mu_t + \varepsilon_{it} $$

While \(\beta\) and \(\varepsilon\) retain the meanings they have in the basic linear model, \(\alpha_i\) is the individual fixed effect and \(\phi\) is a vector of coefficients for time-invariant, unit-specific effects. These effects can be estimated in a linear model but are removed in some kinds of panel model estimation (\(\phi \equiv 0\)). They are removed in estimation through differencing; since each observational unit is used as its own control, we are unable to distinguish between heterogeneity that we didn't observe (and can't account for)---the kind we wish to remove from our model---and known (observed) time-invariant differences between observational units (e.g., race or sex in patients) [3].
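
To make the differencing argument concrete, here is a toy illustration (hypothetical data, not from my research) of why a time-invariant covariate carries no within-unit variation after demeaning, leaving \(\phi\) unidentifiable:

# Toy panel: three units observed twice each
toy <- data.frame(
  id = rep(1:3, each=2),                  # unit identifier
  t  = rep(1:2, times=3),                 # time period
  x  = c(1.2, 1.9, 0.4, 0.7, 2.1, 2.5),   # time-varying predictor
  z  = rep(c(10, 20, 30), each=2))        # time-invariant, unit-specific covariate
# The "within" transformation subtracts each unit's own mean from its observations
demean <- function(v, g) v - ave(v, g)
demean(toy$x, toy$id)  # still varies within units
demean(toy$z, toy$id)  # all zeros: z is absorbed by the individual intercepts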

Fitting Fixed-Effects Panel Models in R

In these examples, I'm using data from my research, where the observational units are residential neighborhoods and the response variable is percent change in vegetation cover (d.veg). Here, the neighborhood's FIPS code is the unique identifier for individuals (neighborhoods) and the year is the identifier for the time period; these are the individual and time fixed effects, respectively. Both the individual and time fixed effects, FIPS and year, must be factors where the levels correspond to the individual and time period identifiers, respectively. I'll investigate to what extent variation in the response is predicted by the baseline vegetation cover (mean.veg) and the change in precipitation (d.ppt).
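
Before fitting anything, it's worth confirming that structure; a minimal sketch, assuming the column names described above:

# Ensure the identifiers are factors whose levels are the FIPS codes and years
panel.data$FIPS <- factor(panel.data$FIPS)
panel.data$year <- factor(panel.data$year)
str(panel.data)  # expect d.veg, mean.veg, d.ppt plus the FIPS and year factors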

Let's start with lfe. A basic panel model can be fit in lfe with the provided felm() method. This approach exploits the vertical bar in the Wilkinson-Rogers syntax (R formulas) to specify the grouping factors by which our panel data are organized. Here, we specify the individual fixed effects by placing the FIPS factor after the vertical bar:

library(Matrix)
library(lfe)
m1 <- felm(d.veg ~ d.ppt | FIPS, data=panel.data)
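
As a sanity check on a small problem, this fit can be compared against the dummy-variable ("least squares dummy variable") form in lm() mentioned in the introduction; a sketch (with thousands of FIPS levels this is exactly the fit that becomes slow):

# LSDV equivalent of the felm() fit above; slow and memory-hungry with many levels
m1.lm <- lm(d.veg ~ d.ppt + factor(FIPS), data=panel.data)
coef(m1.lm)['d.ppt']  # should agree with coef(m1)['d.ppt']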

We can be more explicit by specifying the contrasts of our model but the result is the same.

m1 <- felm(d.veg ~ d.ppt | FIPS, data=panel.data, contrasts=c('FIPS', 'year'))

We can see that our predictor, change in precipitation (d.ppt), is highly significant, and we are given two pairs of goodness-of-fit statistics: the multiple and adjusted R-squared for the "full" and "projected" models. (Note that the summary reproduced below is from a fit that also includes year after the vertical bar, the two-ways model introduced shortly; the structure of the output is the same either way.) The full model is our model with the individual fixed effects included; the projected model is the estimated model with the fixed effects projected out. The full model always performs better than the projected model because the individual fixed effects always explain additional variation in the response: they account for any idiosyncratic differences between the observational units.

> summary(m1)

Call:
   felm(formula = d.veg ~ d.ppt | FIPS + year, data = panel.data)

Residuals:
     Min       1Q   Median       3Q      Max
-0.19850 -0.02941  0.00000  0.02941  0.19850

Coefficients:
       Estimate Std. Error t value Pr(>|t|)    
d.ppt 0.0001131  0.0000150   7.539 6.17e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.05668 on 3127 degrees of freedom
Multiple R-squared(full model): 0.1861   Adjusted R-squared: -0.6286
Multiple R-squared(proj model): 0.01785   Adjusted R-squared: -0.9652
F-statistic(full model):0.2284 on 3130 and 3127 DF, p-value: 1
F-statistic(proj model): 56.84 on 1 and 3127 DF, p-value: 6.169e-14

We can fit the two-ways fixed effects model in lfe simply by adding the time factor, year, after the vertical bar.

m2 <- felm(d.veg ~ d.ppt | FIPS + year, data=panel.data)

Switching to plm, we can fit the two-ways fixed effects model using the plm() function. The plm library doesn't use the vertical bar to specify fixed effects; rather, it requires us to specify the index argument with the names of the individual and time identifiers as a character vector (in that order). We also indicate that the model we want to estimate is the "within" model and that we are estimating "twoways" effects.

library(Matrix)
library(plm)
m1 <- plm(d.veg ~ d.ppt, data=panel.data, model='within',
  index=c('FIPS', 'year'), effect='twoways')

The output of summary() for plm is different and we get a little more detail in some areas. We see that we have a balanced panel (same number of observations for each individual) over 3,129 individuals and two (2) time periods (for a total of 6,258 observations).

> summary(m1)

Twoways effects Within Model

Call:
plm(formula = d.veg ~ d.ppt, data = panel.data, effect = "twoways",
    model = "within", index = c("FIPS", "year"))

Balanced Panel: n=3129, T=2, N=6258

Residuals :
   Min. 1st Qu.  Median 3rd Qu.    Max.
-0.1990 -0.0294  0.0000  0.0294  0.1990

Coefficients :
        Estimate Std. Error t-value  Pr(>|t|)    
d.ppt 1.1311e-04 1.5004e-05   7.539 6.169e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    10.227
Residual Sum of Squares: 10.045
R-Squared:      0.017851
Adj. R-Squared: 0.00892
F-statistic: 56.836 on 1 and 3127 DF, p-value: 6.169e-14

What we don't see is a goodness-of-fit statistic for the full model. The value of the R-Squared statistic presented here, when compared to the lfe output, is obviously that of the projected model.

Note that the individual effects model can be fit with plm by removing the second variable from the character vector provided to index and specifying an individual effects model:

m1 <- plm(d.veg ~ d.ppt, data=panel.data, model='within',
  index=c('FIPS'), effect='individual')
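
As a side note, plm also provides an F test for whether the individual effects are needed at all, by comparing the "within" fit against a pooled OLS fit of the same formula; a quick sketch:

# Pooled OLS fit of the same formula for comparison
m.pool <- plm(d.veg ~ d.ppt, data=panel.data, model='pooling',
  index=c('FIPS'))
pFtest(m1, m.pool)  # H0: no significant individual effects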

Goodness-of-Fit for Panel Models

To get a goodness-of-fit metric for the full model, we have to calculate various sums-of-squares. Returning to basic statistics, we note that:

$$ R^2 = \frac{SSR}{SST} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} $$

Where \(\hat{y}_i\) is the estimated value of \(y_i\) and \(\bar{y}\) is the mean of the observed response, and \(SSR\) and \(SST\) are, respectively, the regression (explained) sum-of-squares and the total sum-of-squares. Together with the error (residual) sum-of-squares, \(SSE\), they are related by the following identity:

$$ \sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2\\ SST = SSE + SSR $$

Thus, if we can calculate \(SSR\) and \(SST\), we can calculate R-squared. \(SST\) is easily obtained from our model data, as it depends only on the observed values and mean value of our response:

$$ SST = \sum_{i=1}^n (y_i - \bar{y})^2 $$

In R, this is:

sst <- with(panel.data, sum((d.veg - mean(d.veg))^2))

Without the fitted values, \(\hat{y}\), we can't calculate \(SSR\) directly (I'll get into deriving fitted values later). For now, we can exploit the fact that \(SST = SSE + SSR\) and derive \(SSR\) as the difference between \(SST\) and \(SSE\), the sum of squared errors, which we can calculate from the model residuals via the least-squares criterion:

$$ SSE = \varepsilon'\varepsilon = (y - X\beta)'(y - X\beta) $$

In R, this is:

m1.sse <- t(residuals(m1)) %*% residuals(m1)

Putting this all together in R, we can derive R-squared as follows. Recall that, here, d.veg is my response variable.

> sst <- with(panel.data, sum((d.veg - mean(d.veg))^2))
> m1.sse <- t(residuals(m1)) %*% residuals(m1)
> (sst - m1.sse) / sst

          d.veg
d.veg 0.1860799

We obtain \(R^2 = 0.1860799\), which matches the R-squared that lfe reports for the full model. Why bother calculating this when lfe does it for free? In my work, I found that lfe and felm() choked on some of the two-ways panel models I was fitting; if that's not a problem for you, just use lfe. In addition, if we want an adjusted version of the full-model R-squared we just computed, we have to calculate that ourselves as well; the adjusted statistic is a better estimate when comparing models with differing numbers of parameters.

Adjusted R-squared is defined as:

$$ \bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1} = 1 - \frac{SSE(n-p-1)^{-1}}{SST(n-1)^{-1}} $$

Provided we have \(R^2\), here denoted m1.r2, we can calculate \(n\), the number of observations, and the adjusted R-squared statistic in R as follows, taking \(p\) to be the number of reported coefficients:

N <- dim(panel.data)[1]
1 - (1 - m1.r2)*((N - 1)/(N - length(coef(m1)) - 1))
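
For convenience, the steps above can be wrapped into a small helper; a sketch (not part of plm or lfe), where the response column is named by a string and \(p\) is taken, as above, to be the number of reported coefficients:

# Full-model R-squared and adjusted R-squared for a fitted panel model
panel.r.squared <- function(fit, data, response) {
  sst <- sum((data[[response]] - mean(data[[response]]))^2)  # total sum-of-squares
  sse <- sum(residuals(fit)^2)                               # error sum-of-squares
  r2 <- (sst - sse) / sst
  n <- nrow(data)
  p <- length(coef(fit))
  adj.r2 <- 1 - (1 - r2) * ((n - 1) / (n - p - 1))
  c(r.squared=r2, adj.r.squared=adj.r2)
}
panel.r.squared(m1, panel.data, 'd.veg')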

Calculating Fitted Values

While the lfe library is compatible with the built-in fitted() function for extracting the fitted values, plm is not. The fitted values can be extracted easily enough, however, after we create a model matrix for our predictors.

# The predictor columns must match (and be ordered like) the coefficients in the fitted model
X <- panel.data[, names(coef(m1)), drop=FALSE]
yfit <- as.matrix(X) %*% coef(m1)

We'll need the fitted values for model diagnostics.

Model Diagnostics

Two critical assumptions of any linear model, including linear fixed-effects panel models, are constant variance (homoskedasticity) and normally distributed errors [6]. We might also want to determine the leverage of our observations to see if there are any highly influential points (which might be outliers). In addition, since we're working with spatial data (in this case), we'll do a crude check for spatial autocorrelation in the residuals, which, if present, would be problematic for inference.

Normally Distributed Errors

This is a quick check using a couple of built-in functions. We want to see whether the residuals in a QQ-Normal plot follow a straight line. Some curvature in the tails is to be expected; it is a somewhat subjective test, but this assumption is also commonly satisfied with observational data.

qqnorm(residuals(m1), ylab='Residuals')
qqline(residuals(m1))

We might also examine a histogram of the residuals.

hist(residuals(m1), xlab='Residuals')

Constant Variance

Here, a plot of the residuals against the fitted values should reveal whether or not there is non-constant variance (heteroskedasticity) in the residuals, which would violate one of the assumptions of our linear model. We look for whether or not the spread of points is uniform in the y-direction as we move along the x-axis (constant variance along the x-axis).

plot(yfit, residuals(m1))
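
To make any fanning or trend easier to see, we might add a reference line at zero and a smoother on top of the same plot; a quick sketch using base R's lowess():

abline(h=0, col='red', lty='dashed')                                    # zero-residual reference
lines(lowess(as.numeric(yfit), as.numeric(residuals(m1))), col='blue')  # trend in the residuals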

Checking for Influential Observations

To derive the leverage of the observations, we need the hat matrix (or "projection matrix") of our linear model. This is the matrix of the linear transformation that projects the observed response values onto the fitted values implied by the estimated coefficients, \(\hat{\beta}\).

$$ X\hat{\beta} = X(X^T X)^{-1} X^Ty = Py $$

Where \(X\) is the design matrix (matrix of explanatory variables) and \(y\) is the vector of our observed response values. The hat (or projection) matrix is denoted by \(P\). The diagonal of this \(N\times N\) matrix (diag(P) in R) contains the leverages for each observation point.

In R, we use matrix multiplication and the solve() function (to obtain the inverse of a matrix). With the faraway package, we can draw a half-normal plot which sorts the observations by their leverage.

# Use the same predictors that appear in the fitted model
X <- panel.data[, names(coef(m1)), drop=FALSE]
X <- as.matrix(X)
P <- X %*% solve(t(X) %*% X) %*% t(X)

require(faraway)
halfnorm(diag(P), labs=row.names(X), ylab='Leverages', nlab=1)
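
If we prefer a numeric summary to the plot, the largest leverages can also be inspected directly; a quick sketch:

head(sort(diag(P), decreasing=TRUE))  # the most influential observations first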

If we need to remove influential observations, in order to maintain a balanced panel, it's important to remove all of the observations for that individual; that is, to remove that individual entirely from the panel, rather than just the observation(s) at those time period(s). In R, we do this by querying those points that have an influence above a certain threshold, say 0.013; here, the individual units of observation are denoted by unique FIPS codes.

panel.data.mod <- subset(panel.data,
  !(FIPS %in% panel.data[names(diag(P)[diag(P) > 0.013]), 'FIPS']))
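
After trimming, we can refit the same two-ways model on the reduced panel to check that the estimate for d.ppt is stable; a sketch reusing the earlier call:

m1.trim <- plm(d.veg ~ d.ppt, data=panel.data.mod, model='within',
  index=c('FIPS', 'year'), effect='twoways')
summary(m1.trim)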

Checking for Spatial Autocorrelation

An investigation of spatial autocorrelation deserves much more space than I will give it in this article. However, a crude check for spatial autocorrelation can be performed by plotting the residuals against some spatial variable, such as distance along a transect or latitude or longitude.

plot(panel.data$latitude, residuals(m1),
  xlab='Latitude', ylab='Residuals', main='Residuals vs. Latitude')
abline(h=0, col='red', lty='dashed')

Again, this is crude. If spatial autocorrelation is obviously present (e.g., the residuals trend or cluster along the spatial gradient), then this check alone is enough to flag the problem. If spatial autocorrelation does not appear to be present, we would require more rigorous tests (such as Moran's I test) before ruling it out.
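
For a more rigorous check, Moran's I can be computed on the residuals with the spdep package; a minimal sketch, assuming panel.data has longitude and latitude columns and is ordered by FIPS and year so the residuals align with the coordinates (the choice of k = 8 neighbors is arbitrary):

library(spdep)
coords <- as.matrix(panel.data[, c('longitude', 'latitude')])  # assumed coordinate columns
nb <- knn2nb(knearneigh(coords, k=8, longlat=TRUE))            # k-nearest-neighbor graph
lw <- nb2listw(nb, style='W')                                  # row-standardized spatial weights
moran.test(as.numeric(residuals(m1)), lw)                      # H0: no spatial autocorrelation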

References

  1. Gaure, S. 2013. lfe: Linear Group Fixed Effects. The R Journal 5(2):104–117.
  2. Croissant, Y., and G. Millo. 2008. Panel data econometrics in R: The plm package. Journal of Statistical Software 27(2):1–43.
  3. Halaby, C. N. 2004. Panel Models in Sociological Research: Theory into Practice. Annual Review of Sociology 30(1):507–544.
  4. Gelman, A., and J. Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. New York, New York, USA: Cambridge University Press.
  5. Allison, P. D. 2009. Fixed Effects Regression Models ed. T. F. Liao. Thousand Oaks, California, U.S.A.: SAGE.
  6. Crawley, M. J. 2015. Statistics: An Introduction Using R 2nd ed. Chichester, West Sussex, United Kingdom: John Wiley & Sons.