K. Arthur Endsley - Blog

Approximate Bayesian computation in Python

2021-11-13T12:00:00+01:00

The PyMC library offers a solid foundation for probabilistic programming and Bayesian inference in Python, but it has some drawbacks. Although the API is robust, it has changed frequently along with the shifting momentum of the entire PyMC project (formerly "PyMC3"). This is most evident in the abandoned "PyMC4" project and the stranding of Theano. The PyMC developers have picked up where the developers of Theano left off, introducing "Aesara" as a replacement; but, as it stands, little has been written about how to migrate from Theano to Aesara, especially for PyMC users who don't have prior experience with computational graphs or tensor libraries.

PyMC is actually quite expressive and there are rich examples of how to use it for more than just Bayesian inference. These community-contributed examples are not always updated, however, to reflect the changes to PyMC over the years. In particular, there are few examples of how to use PyMC for approximate Bayesian computation (ABC), and those I've seen [1,2] are functionally deprecated given the current state of the library.

ABC, also referred to as likelihood-free inference [3], involves Bayesian simulation, primarily through Markov Chain Monte Carlo (MCMC), for models which have intractable or analytically unavailable likelihood functions. This includes models that have no formal likelihood, such as physical models or other simulations that have no analytical representation. In such cases, we require an approach to modeling with a "black-box" likelihood function, for instance, the mean squared error between a model's predictions and observed values. This approach is popular for fitting a variety of models to observed data and is the focus of another Python library, SPOTPY [4]. However, in my experience, SPOTPY has some outstanding issues related to the proposal distribution, in which bounded priors are sampled in a way that fails to preserve balance in the Markov chain [5]. Moreover, it doesn't have the advantages that PyMC offers, with its rich diagnostics and performance enhancements. I spent some time figuring out how to do ABC in PyMC, so I think it's worth reporting here in case anyone else is struggling to navigate the vast and shifting sands of PyMC's API.

Lotka-Volterra Example

Here, I use the example of the Lotka-Volterra ordinary differential equations (ODEs), as presented in one of the deprecated PyMC examples [2]. These well-known ODEs are used to model the dynamics of a pair of interacting predator and prey populations. $N_t$ and $M_t$ describe the size of prey and predator populations, respectively, at time $t$. Change in the populations is given by:

$$ \frac{d N_t}{d t} = \alpha N_t - \beta N_t M_t $$

$$ \frac{d M_t}{d t} = -\gamma M_t + \delta N_t M_t $$

Where $\alpha$ is the prey growth rate (in the absence of predators), $\beta$ is the prey mortality rate, $\gamma$ is the predator mortality rate (in the absence of prey), and $\delta$ is the conversion efficiency by which prey are consumed and predators increased. A Python implementation of this model, postponing a solution, could be written as follows where, for simplicity, we set $\gamma = 1.5$ and $\delta = 0.75\beta$:

import numpy as np

# True parameter values (alpha, beta)
alpha = 1.0
beta = 0.1

# Initial population of rabbits and foxes
X0 = [10.0, 5.0]
size = 100 # Size of data
time = 15 # Time lapse
t = np.linspace(0, time, size)

# Lotka-Volterra equations
def dX_dt(X, t, a = alpha, b = beta):
    'Return the growth rate of fox and rabbit populations.'
    return np.array([
        a * X[0] - b * X[0] * X[1],
        -1.5 * X[1] + 0.75 * b * X[0] * X[1]
    ])
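As a quick sanity check on these equations, setting both derivatives to zero gives a non-trivial equilibrium at $N^* = \gamma/\delta$ and $M^* = \alpha/\beta$. The sketch below restates the right-hand side in plain Python, assuming the values hard-coded in the function above (a predator mortality rate of 1.5 and a conversion efficiency of 0.75 times beta); the names lotka_volterra_rhs, GAMMA, and DELTA are hypothetical and used only for illustration.

```python
# Parameter values assumed from the code above
ALPHA, BETA = 1.0, 0.1
GAMMA = 1.5          # Predator mortality rate
DELTA = 0.75 * BETA  # Conversion efficiency, as hard-coded above

def lotka_volterra_rhs(prey, predators):
    'Return (dN/dt, dM/dt) for the given population sizes.'
    d_prey = ALPHA * prey - BETA * prey * predators
    d_pred = -GAMMA * predators + DELTA * prey * predators
    return d_prey, d_pred

# Non-trivial equilibrium: N* = gamma / delta, M* = alpha / beta
prey_eq = GAMMA / DELTA
pred_eq = ALPHA / BETA
rates_at_equilibrium = lotka_volterra_rhs(prey_eq, pred_eq)
```

With these values the equilibrium sits at roughly 20 prey and 10 predators. The initial populations used here (10 and 5) are away from that point, so the simulated populations should cycle around it rather than stay constant.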

The solution to this system of ODEs might be written:

from scipy.integrate import odeint

def competition_model(params, x = None):
    'Simulator function (an ordinary differential equation to be integrated)'
    a, b, _ = params # The third parameter (e.g., sigma) is not used here
    return odeint(dX_dt, y0 = X0, t = t, rtol = 0.01, args = (a, b))

We generate some synthetic, "observed" data as:

# Observed data (with added random noise)
data = competition_model(
    params = (alpha, beta, None)) + np.random.normal(size = (size, 2))

The Black-Box Likelihood Operation

PyMC is built on top of Theano, a computational graph library that allows for symbolic computing and automatic differentiation. When trying to do ABC, we have to develop a way for Theano to sample the posterior likelihood even though we have no formal likelihood function that can be represented analytically. With ABC, there is no likelihood function that Theano can differentiate with respect to model parameters; no gradient to be calculated.

We can, however, create a custom Theano operator, tt.Op, that returns a quasi-likelihood value, allowing for a sampler (e.g., Metropolis-Hastings) to sample from the posterior likelihood even though a gradient is unavailable. Here, the quasi-likelihood function is the root-mean squared error (RMSE), though we could use any other goodness-of-fit metric. The perform() method represents the operation that is performed on Theano's computational graph: given a vector of proposed parameter values, calculate the posterior likelihood. For calculating the RMSE, it is necessary that this operation have the observed values available in-memory, so we set them as an instance attribute, self.observed. Because our model may require some additional driver data, we also set these data as an instance attribute, self.x.

import theano.tensor as tt

class BlackBoxLikelihood(tt.Op):
    itypes = [tt.dvector] # Expects a vector of parameter values when called
    otypes = [tt.dscalar] # Outputs a single scalar value (the log likelihood)

    def __init__(self, model, observed, x):
        '''
        Parameters
        ----------
        model : Callable
            An arbitrary "black box" function that takes two arguments: the
            model parameters ("params") and the forcing data ("x")
        observed : numpy.ndarray
            The "observed" data that our log-likelihood function takes in
        x:
            The forcing data (input drivers) that our model requires
        '''
        self.model = model
        self.observed = observed
        self.x = x

    def loglik(self, params, x, observed):
        # The root-mean squared error (RMSE)
        predicted = self.model(params, x)
        return -np.sqrt(np.nanmean((predicted - observed) ** 2))

    def perform(self, node, inputs, outputs):
        # The method that is used when calling the Op
        (params,) = inputs
        logl = self.loglik(params, self.x, self.observed)
        outputs[0][0] = np.array(logl) # Output the log-likelihood

Note that we return the negative RMSE, as PyMC samplers are used to working with log-likelihoods and we wish to maximize the likelihood function (i.e., obtain the smallest RMSE, which is the largest negative RMSE).
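To make it concrete why a sampler like Metropolis needs only likelihood values and no gradient, here is a minimal sketch of a random-walk Metropolis sampler in plain Python. This is not PyMC's implementation; the toy model (a line through the origin), the flat prior on positive values, and the names metropolis and toy_loglik are all assumptions made for illustration.

```python
import math
import random

def metropolis(loglik, start, n_draws, step_sd = 0.1, seed = 42):
    'Random-walk Metropolis for one positive parameter with a flat prior.'
    rng = random.Random(seed)
    current, current_ll = start, loglik(start)
    draws = []
    for _ in range(n_draws):
        proposal = current + rng.gauss(0, step_sd)
        if proposal > 0: # Zero prior density outside the bound: reject
            proposal_ll = loglik(proposal)
            # Accept with probability min(1, L(proposal) / L(current))
            if math.log(1.0 - rng.random()) < proposal_ll - current_ll:
                current, current_ll = proposal, proposal_ll
        draws.append(current)
    return draws

# Toy "black box": a line through the origin, y = a * x, with a = 2
rng = random.Random(0)
xs = [float(i) for i in range(10)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

def toy_loglik(a, sigma = 0.5):
    'Gaussian log-likelihood (up to a constant) with known sigma.'
    return -0.5 * sum((a * x - y) ** 2 for x, y in zip(xs, ys)) / sigma ** 2

draws = metropolis(toy_loglik, start = 0.5, n_draws = 6000)
posterior_mean = sum(draws[1000:]) / len(draws[1000:]) # Drop burn-in
least_squares = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```

The sampler only ever calls loglik() at proposed values and compares the results, which is all that a black-box operation can offer. With a flat prior, the posterior mean should land close to the least-squares estimate of the slope.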

For some applications, we may want to use the Gaussian log-likelihood or a similar exponential likelihood function.

$$ \mathrm{log}\,\mathcal{L}(\hat{\theta}, \hat{\sigma}) = -\frac{N}{2}\,\mathrm{log}(2\pi\hat{\sigma}^2) - \frac{1}{2\hat{\sigma}^2} \sum (\hat{y}(\hat{\theta}) - y)^2 $$

In this case, we need to include the error standard deviation, sigma, as a parameter in our model; it's most convenient to make this the last parameter in the parameters vector:

def loglik(self, params, x, observed):
    predicted = self.model(params, x)
    sigma = params[-1]
    resid = predicted - observed
    n = np.count_nonzero(~np.isnan(resid)) # N, the number of valid residuals
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - (0.5 / sigma**2) *\
        np.nansum(resid**2)
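As a check on the Gaussian log-likelihood equation above (including its factor of $N$), the following sketch compares the closed-form expression against the sum of the log-densities of the individual observations. It uses plain Python lists and made-up numbers; the function names gaussian_loglik and log_normal_pdf are hypothetical.

```python
import math

def gaussian_loglik(predicted, observed, sigma):
    'Gaussian log-likelihood of the residuals, following the equation above.'
    n = len(observed)
    sse = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - 0.5 * sse / sigma ** 2

def log_normal_pdf(y, mean, sigma):
    'Log-density of a single Normal(mean, sigma) observation.'
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - 0.5 * ((y - mean) / sigma) ** 2

predicted = [1.0, 2.0, 3.0, 4.0]
observed = [1.1, 1.8, 3.3, 3.9]
sigma = 0.5

total = gaussian_loglik(predicted, observed, sigma)
pointwise = sum(log_normal_pdf(o, p, sigma) for p, o in zip(predicted, observed))
```

The two quantities should agree to within floating-point error. Dropping the factor of $N$ would not change which model parameters fit best for a fixed sigma, but it would change how strongly the likelihood penalizes large values of sigma.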

Parameter Estimation with PyMC

Finally, we're ready to estimate our model's parameters using PyMC. If you're unfamiliar with PyMC and how models are defined, check out the tutorial, "Getting started with PyMC3" before trying to figure out what's going on below. If you are already familiar with PyMC models, hopefully there are no surprises here.

In this example, we use a Gaussian log-likelihood function, so we add the error standard deviation, sigma, as a model parameter.

import arviz as az
import pymc3 as pm
from matplotlib import pyplot

NUM_DRAWS = 50000 # Posterior samples to draw in each chain

loglik = BlackBoxLikelihood(competition_model, data, None)

with pm.Model() as my_model:
    # (Stochastic) Priors for unknown model parameters
    a = pm.HalfNormal('a', sd = 1)
    b = pm.HalfNormal('b', sd = 1)
    sigma = pm.HalfNormal('sigma', sd = 1)

    # Convert model parameters to a tensor vector
    params = tt.as_tensor_variable([a, b, sigma])

    # Define the likelihood as an arbitrary potential
    pm.Potential('likelihood', loglik(params))

    trace = pm.sample(
        draws = NUM_DRAWS, step = pm.Metropolis(),
        start = {'a': 0.5, 'b': 0.5, 'sigma': 10})
    az.plot_trace(trace)
    pyplot.show()

Because, in this toy example, we know the true parameter values, a look at the trace plot suggests that they are correctly identified. However, there is a lot of uncertainty in the $\beta$, or b, parameter.

Let's try it again but using the RMSE quasi-likelihood this time. Again, we'll take 50,000 samples of the posterior in each of four chains. At first glance, our posterior distributions seem narrower with the RMSE quasi-likelihood; however, if we look closely at the horizontal axes, we see that the RMSE quasi-likelihood is permitting a few larger jumps in the proposal distribution than we saw with the Gaussian likelihood.

Despite these differences, the choice of whether to use the RMSE or a formal likelihood function should be based on substantive modeling concerns. In this case, is the Gaussian likelihood appropriate, given that our observed data might not be independent and identically distributed?

Other Considerations

When fitting a model to data with MCMC, we gain the ability to estimate the uncertainty in our model parameters, which can be visually assessed from plots like the ones above. But how do we pick the best-fit parameters? In this toy example, the maximum posterior estimate obviously corresponds to the true parameter values, but in models with less well-identified parameters (or more process or instrument noise), it may not be obvious what to choose. We may want the parameters associated with the maximum model log-likelihood. You may have expected that PyMC would record the model's log-likelihood at each sampling step; I certainly did. But that's not the case, as can be seen from the InferenceData that were returned:

>>> trace
Inference data with groups:
    > posterior
    > sample_stats

The reason for this is complicated and related to the active development of PyMC. For now, you can store the model log-likelihood by explicitly defining it during the model's configuration:

with pm.Model() as my_model:
    ...
    # Store the model log-likelihood as well
    loglik = pm.Deterministic('log_likelihood', my_model.logpt)

References

1. "Using a 'black box' likelihood function" (Accessed: November 13, 2021.)
2. "Sequential Monte Carlo - Approximate Bayesian Computation" (Accessed: November 13, 2021.)
3. Sisson, S.A. & Y. Fan. 2011. "Likelihood-Free MCMC." Chapter 12, Handbook of Markov Chain Monte Carlo. CRC Press, LLC.
4. Houska, T., Kraft, P., Chamorro-Chavez, A. & Breuer, L. 2015. "SPOTting model parameters using a ready-made Python package." PLoS ONE, 10(12), e0145180.
5. Vrugt, J. A. 2016. "Markov chain Monte Carlo simulation using the DREAM software package: Theory, concepts, and MATLAB implementation." Environmental Modelling & Software, 75, 273–316.

Fast unpacking of QC bit-flags in Python

2021-03-13T11:30:00+01:00

Satellite remote sensing products often represent quality control (QC) information as bit-flags, a sequence of binary digits that individually (single digit) or collectively (multiple digits) convey additional information about a remote sensing measurement. This information may be related to cloud cover, atmospheric conditions in general, the status of the space-borne sensor, or the overall quality of the retrieval. Bit-flags are convenient because a sequence of varied true/false (binary) indicators can be represented as a decimal number. Here's an example from the Moderate Resolution Imaging Spectroradiometer (MODIS) MOD15 product User Guide.

Strictly speaking, the phrase "bit-word" is misappropriated here; computer scientists have told me it refers to a sequence of two bytes (16 bits total). It may be more appropriate to refer to them as bit-flags. The example above is the binary equivalent of the decimal number 64. For a given remote sensing image, in addition to the raster array of data values (e.g., remotely sensed leaf area index, as in the MOD15 example), there is a QC band that has the same size and shape but is full of decimal values. Where the QC band has the decimal number 64, based on the above example, we can infer that the corresponding pixel in the data band has "good quality" (bit number 0) and "significant clouds [were] NOT present" (bits numbered 3-4).
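For a single decimal value, the individual flags can be read off with a shift and a mask. The sketch below assumes that bits are numbered from the right (bit 0 is the least-significant bit), which is consistent with the interpretation of the value 64 given above; the helper name bit_field is hypothetical.

```python
qc = 64
bits = format(qc, '08b') # Zero-padded, 8-bit binary string

def bit_field(value, start, width = 1):
    'Return the integer stored in `width` bits, starting at bit `start`.'
    return (value >> start) & ((1 << width) - 1)

overall_quality = bit_field(qc, 0)        # Bit 0
cloud_state = bit_field(qc, 3, width = 2) # Bits 3-4
```

Both fields come out as zero for the value 64, matching the "good quality" and "significant clouds NOT present" reading; the only bit that is set is bit 6.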

Converting from Decimal to Binary Strings

The problem of trying to determine whether certain conditions apply to a given data pixel, then, is the problem of converting from decimal to binary strings. While this is relatively trivial for a single decimal number, we often need to do this conversion or "unpacking" of bit-flags for several thousands or millions of pixels at once, in the case of large or high-resolution raster images. Recently, I developed a pipeline for extracting MODIS MOD16 data in 500-m tiles (on the MODIS Sinusoidal grid), masking out low-quality pixels, resampling to 1000-m, and stitching into a global mosaic. Line profiling revealed that, by far, the most time-intensive operation was unpacking the QC bit-flags!

So, what's the fastest way to unpack QC bit-flags in Python? Equivalently, what's the fastest way to convert from decimal to binary strings for an arbitrary array of decimal numbers? There are several options, assuming we have 8-bit binary strings that we need to unpack and parse for bit-flags. For clarity, we're evaluating options that behave something like (to use the previous example yet again):

>>> my_function(64)
'01000000'

Option 1: NumPy's Binary Representation

NumPy's binary_repr() function is a straight-forward, high-level interface for converting decimal numbers to binary strings. It's not a ufunc, however, so if we want to apply it to an arbitrary array, we need to vectorize it. In addition, we need to use partial() to curry a version of this function that always returns 8-bit binary strings. Vectorizing functions makes them slow, so I don't expect this one to perform best.

import numpy as np
from functools import partial
dec2bin_numpy = np.vectorize(partial(np.binary_repr, width = 8))

In fact, this is the approach I was originally using, which accounted for the vast majority of the program's run-time.

Option 2: Python's Built-in Binary Conversion

Python, of course, has a built-in function bin() to do this conversion. The strings returned by bin() have a special format (e.g., 0b1000000 for 64) that's not exactly what we want when unpacking QC bit-flags, so we need to use additional string methods to get a compact, 8-bit string representation. This function must also be vectorized, so it may be slow.

import numpy as np
dec2bin_py = np.vectorize(lambda v: bin(v).lstrip('0b').rjust(8, '0'))

Option 3: Bitwise Operators

A common solution for decimal-to-binary conversion in any language is the use of bitwise operators. Below, we use the left shift operator, <<, to get an array of decimal numbers that represent every 8-bit string with exactly one 1. Then, the bitwise AND operator, &, compares each of these 8 options with the (implicitly binary) representation of the decimal number we want to convert to a binary string. In effect, this produces an array of powers of 2 (or zeros) that, when compared to 0, are converted to true/false flags that are equivalent to the binary representation of that number! It's a neat trick I found on multiple online forums.

import numpy as np

def dec2bin_bitwise(x):
    'For a 2D NumPy array as input'
    shp = x.shape
    return np.fliplr((x.ravel()[:,None] & (1 << np.arange(8))) > 0)\
        .astype(np.uint8).reshape((*shp, 8))

We have to know the shape of the input array because this approach does create a new axis along which the binary digits are enumerated; for example:

>>> dec2bin_bitwise(np.array((64,)))
array([[0, 1, 0, 0, 0, 0, 0, 0]], dtype=uint8)

The approach above requires an np.fliplr() operation because np.arange() produces a numeric sequence in ascending order, but bit strings on almost all computer systems are written from the most-significant bit (at the left) to the least-significant bit. This means that our binary strings must be flipped left-to-right. I'm not sure what impact this has on performance compared to simply reversing np.arange(), as below.

import numpy as np

def dec2bin_bitwise2(x):
    'For a 2D NumPy array as input'
    shp = x.shape
    return ((x.ravel()[:,None] & (1 << np.arange(7, -1, -1))) > 0)\
        .astype(np.uint8).reshape((*shp, 8))

An important thing to note about these two functions is that they require arguments that are numpy.ndarray instances, so, they don't strictly work like the example above, with an atomic decimal number, but they are well-suited to the actual problem of working with decimal arrays.
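The bitwise trick in Option 3 is easier to follow when traced for a single value. This is a minimal sketch using the same operators on the value 64; the intermediate variable names (masks, hits, flags, msb_first) are only for illustration.

```python
import numpy as np

value = np.array([64], dtype = np.uint8)
masks = 1 << np.arange(8)        # [1, 2, 4, 8, 16, 32, 64, 128]
hits = value[:,None] & masks     # Non-zero only where a bit is set
flags = (hits > 0).astype(np.uint8)
msb_first = flags[:,::-1]        # Reverse so the most-significant bit leads
```

Here, hits contains a single non-zero entry (64, in the seventh position), so flags has a single 1 and the reversed result reads 0, 1, 0, 0, 0, 0, 0, 0.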

Option 4: NumPy's `unpackbits()`

I wish I could remember where I first saw this function... It may have been from the NumPy documentation itself (in frustration, desperately seeking a faster solution). This approach does have stronger assumptions than the previous implementations; specifically, x must be typed as an 8-bit unsigned integer (np.uint8). Also, as in Option 3, this function needs to create a new axis along which the binary digits are enumerated. The axis argument provides some control over this, but in most cases you'd want the new axis to be last (hence, equal to the current number of dimensions, ndim, because Python counts axes from zero). Finally, unlike the first example in Option 3, no flip is needed: by default, np.unpackbits() returns the bits in big-endian order (most-significant bit first), which is the order we want. The trailing index, [...,-8:], simply keeps the 8 digits unpacked from each value.

import numpy as np

def dec2bin_unpack(x, axis = None):
    'For an arbitrary NumPy array as input'
    axis = x.ndim if axis is None else axis
    return np.unpackbits(x[...,None], axis = axis)[...,-8:]
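One way to convince ourselves that np.unpackbits() gives the digits in the order we want is to compare it against Python's own string formatting for every possible 8-bit value. This is a small check, not a benchmark; it relies on the default bit order of np.unpackbits(), which puts the most-significant bit first.

```python
import numpy as np

values = np.arange(256, dtype = np.uint8)
unpacked = np.unpackbits(values[:,None], axis = 1) # Shape: (256, 8)

# Join each row of digits and compare with Python's own formatting
as_strings = [''.join(map(str, row)) for row in unpacked.tolist()]
expected = [format(v, '08b') for v in range(256)]
```

If as_strings and expected are equal, the unpacked digits can be indexed directly; for example, the last column holds bit 0 of every value.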

Clocking In

For simplicity, in timing each approach, I create a 15-by-15 array, A, of the numbers 0 through 224 (225 unique values); this is the largest square matrix I can create using unique decimal numbers smaller than 255. Each function is timed using timeit with 100 loops, the timer repeated 30 times. Furthermore, I execute the timeit statement three times and take the average of each trial.

$ python -m timeit -n 100 -r 30 -s "import numpy as np;from dec2bin import dec2bin_numpy, A" \
  "dec2bin_numpy(A)"

Implementation	Time (usec)
`dec2bin_numpy()`	`211.0`
`dec2bin_py()`	`106.0`
`dec2bin_bitwise()`	`18.8`
`dec2bin_bitwise2()`	`13.7`
`dec2bin_unpack()`	`5.1`

We can see that it matters a lot which approach you choose! My first implementation, dec2bin_numpy(), was over 40 times slower than the best choice I found. I'm not sure what makes numpy.unpackbits() so fast; I need to look under the hood and report back as, incidentally, the new NumPy documentation makes it harder to find a function's source code.
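The same kind of measurement can be scripted from within Python using the standard library's timeit module, which is convenient when comparing several functions in one run. The sketch below times a stand-in, pure-Python conversion (dec2bin_format, a hypothetical function that is not one of the options above) with the same loop and repeat counts.

```python
import timeit

def dec2bin_format(values):
    'Stand-in conversion (hypothetical): 8-bit strings via format().'
    return [format(v, '08b') for v in values]

values = list(range(225))

# 30 repeats of 100 loops each, as with "-n 100 -r 30" on the command line
totals = timeit.repeat(lambda: dec2bin_format(values), number = 100, repeat = 30)
best_usec = min(totals) / 100 * 1e6 # Best time per loop, in microseconds
```

Taking the minimum over repeats mirrors what the command-line tool reports ("best of 30"); the absolute numbers will, of course, depend on the machine.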

Day length, sunrise, and sunset calculation for Earth system models

2021-02-10T15:45:00+01:00

In Earth system models that run on daily or weekly time steps, there are many quantities that we may wish to calculate only when the sun is in the sky. Vapor pressure deficit (VPD), as one important example, has strong diurnal variation, tending to decrease in the evening as the ambient temperature declines. As VPD is a key driver of transpiration and photosynthesis [1], if we want to correctly estimate the impact of VPD on these processes, we may need to estimate VPD as it is experienced by plants in the heat of the day. Consequently, we would want to integrate hourly VPD data (such as from a numerical weather model) only for those hours when the sun is up (the photoperiod) and we would therefore need to know the timing of sunrise and sunset.

An alternative approach to such day-length calculations, particularly if we're already integrating data from a weather model, would be to use solar irradiance: an empirical threshold on down-welling short-wave radiation, for instance, would be an acceptable proxy for whether the sun is in the sky. This is the approach used in photosynthesis models like MODIS MOD17 [2]. However, this empirical threshold might vary between sites and seasons; it's ultimately arbitrary. But the approach is attractive because the calculation of sunrise and sunset times is fairly elaborate. Further, if our model has a large spatial domain (say, continental to global scales) we inevitably need to calculate sunrise and sunset at different latitudes, across different longitudes, and possibly taking into account very different elevations.

I recently ran into this issue when developing a new photosynthesis model in Python. The Python developer community ostensibly has at least two choices of open-source libraries for calculating sunrise and sunset times (I did not look at Skyfield or AstroPy). Both libraries, pyephem and astral, are targeted at more general problems than calculating solar transits on Earth, however. Their generality leads to complexity and this translates to longer execution times when you have a global spatial domain and millions of data cells (pixels). Both libraries do offer greater precision and, in the case of the pyephem library, greater accuracy than the solution I've settled upon, which I describe below. pyephem also corrects for atmospheric refraction and, optionally, elevation effects. However, for determining sunrise and sunset times to the nearest hour, we can get by with less sophistication.

Descriptions of sunrise and sunset calculation on the internet are pretty lacking. NOAA's Earth System Research Laboratories (ESRL) offers a spreadsheet calculator but no details on how it works. This has led some developers to implement a version in code that calculates the value in each column, as in this Matlab example, as if the ESRL calculator were some sacred but indecipherable tome. I wanted to do better, at least in the documentation, verification, and wall time of my own sunrise and sunset calculation. Jean Meeus' "Astronomical Algorithms" [3] provides a calculation, but it relies on ephemeris tables to look up declination and right ascension. Here, I mostly follow the algorithm described in the U.S. Naval Observatory's Almanac for Computers [4], which a hobbyist named Ed Williams put on their website. I can't find a copy anywhere else, so I'm indebted to them. Both Meeus and the U.S. Naval Observatory describe the calculation in the equatorial coordinate system, in terms of hour angles and the declination of the sun. The approach in Almanac for Computers may be preferable from a standpoint of numerical stability, as Meeus' equations require a high number of significant digits in order to be accurate.

Day Length Algorithm Description

Jean Meeus [3] gives the equation for the approximate hour angle of sunrise and sunset:

$$ \mathrm{cos}(H_0) = \frac{\mathrm{sin}(h_0) - \mathrm{sin}(\phi)\mathrm{sin}(\delta)}{\mathrm{cos}(\phi)\mathrm{cos}(\delta)} $$

Where $H_0$ is the Greenwich hour angle of sunrise (or sunset), i.e., the local hour angle at the prime meridian; $h_0$ is the altitude of the sun above the horizon at sunrise or sunset (what Meeus calls the "standard altitude"); $\phi$ is the observer's latitude; and $\delta$ is the declination of the sun.

The general approach to calculating sunrise and sunset times treats the heavens as a clock. Celestial phenomena (including sunrise and sunsets) happen on a schedule, so the positions of celestial bodies, including the sun, can be described by periodic equations that take time as an argument. There are four basic steps to calculating sunrise and sunset times:

1. Finding the mean solar time that describes where (or "when") the current moment is in the celestial cycle;
2. Calculating the sun's position in the ecliptic coordinate system;
3. Converting the sun's ecliptic coordinates to coordinates in the equatorial coordinate system;
4. Obtaining the sunrise and sunset times at the Greenwich meridian, then adding or subtracting the offset for our longitude (or "time zone").

The Almanac for Computers algorithm, which I modify, is based on Meeus' equation above. The algorithm walks us through how to calculate $\delta$ and to convert between hour angles and clock time so as to obtain sunrise and sunset times. The calculations are first performed in the ecliptic coordinate system, where we pretend that the sun revolves around the Earth, and then we perform a coordinate transformation to the equatorial coordinate system. According to the U.S. Naval Observatory, the algorithm obtains an accuracy of $\pm 2$ minutes for any location between $\pm 65$ degrees latitude. The central weakness of this algorithm is that it uses constants that are estimated for the "latter half of the twentieth century" for the orbital elements (eccentricity, argument of the perihelion) and for converting between sidereal time and Universal Time. Some of these constants we can replace with mathematical models (e.g., VSOP87) like those described by Jean Meeus, but that's totally unnecessary if we simply wish to obtain the nearest hour of sunrise and sunset.

The procedure described below assumes that all angles are in degrees, not radians. To convert from radians to degrees, multiply by $180/\pi$; to convert from degrees to radians, multiply by $\pi/180$. Latitude and longitude should be in decimal degrees; latitude values south of the equator are negative and longitude values west of the prime meridian are negative. I refer to Greenwich mean time and Coordinated Universal Time (UTC) interchangeably.

Calculating the Mean Solar Time

Our approach to calculating sunrise and sunset times uses the mean solar clock—the average motion of the sun in the sky as seen from Earth. Naturally, we begin by figuring out where we are in this solar cycle based on the current date. We often work in hour angles, an angular distance expressed as the number of hours the Earth has rotated since (or must rotate until) the meridian plane (plane containing Earth's axis and the zenith) intersects the point of the body of interest (here, the sun) projected onto the celestial sphere. Essentially, the hour angle quantifies how many hours the Earth must rotate until (or has rotated since) the object is in the same place in the sky again.

Here, the solar hour angle (SHA) quantifies the angle between the sun and the observer expressed as the number of hours difference between solar noon at the observer's location and that of the Greenwich meridian. As the Earth spins on its axis at a rate of about 15 degrees longitude per hour, this calculation is simply longitude, $\lambda$, divided by 15:

$$ \mathrm{SHA} = \frac{\lambda}{15} $$

This is basically the offset from Greenwich mean time: the number of hours east or west of the prime meridian (negative values west). We can use the SHA to obtain an approximate time of sunrise and sunset, the mean solar time, expressed as a fractional day of the year (e.g., 30.5 is noon on DOY 30). This approximate time is in Greenwich mean time, and so we subtract the SHA from 12h00 as this is the middle of the solar day, which starts at midnight. The Almanac's algorithm calculates this separately for sunrise and sunset, but it's easier to calculate the approximate time of transit (when the sun is highest in the sky) rather than do twice the work.

$$ T = [\mathrm{DOY}] + \frac{12 - [\mathrm{SHA}]}{24} $$

$\mathrm{DOY}$ is the day of the year, counted from January 1 ($\mathrm{DOY} = 1$), on the closed interval $[1, 366]$ (on December 31 in leap years, $\mathrm{DOY} = 366$). This can be obtained easily in Python and other programming languages with date string formatting.
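The mean solar time calculation above can be sketched in a few lines of Python using only the standard library. The function name mean_solar_time is hypothetical; longitude is assumed to be in decimal degrees, negative west of the prime meridian.

```python
import datetime

def mean_solar_time(date, longitude):
    'Approximate time of solar transit, in fractional days (UTC).'
    doy = date.timetuple().tm_yday # Day of year, 1 through 366
    sha = longitude / 15.0         # Hours east (+) or west (-) of Greenwich
    return doy + (12 - sha) / 24.0

t_transit = mean_solar_time(datetime.date(2021, 2, 10), longitude = -90.0)
```

For February 10 (DOY 41) at 90 degrees west, the SHA is -6 hours, so transit is expected around 18h00 UTC and $T = 41.75$. The day of year could equally be obtained with int(date.strftime('%j')).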

Position of the Sun in Ecliptic Coordinates

In order to calculate the sun's declination, we must first calculate the solar mean anomaly and the equation of the center. The solar mean anomaly, $M$, is best described by first considering the true anomaly, $v$, which quantifies the position of the Earth in its elliptical orbit around the sun. If we imagine a line drawn through the points of aphelion and perihelion (farthest and closest points on our orbit), this is the angle between that line and the line connecting us to the sun (the radius vector), measured from perihelion. "It is the angle over which the object moved, as seen from the Sun, since the previous passage through the perihelion" [3]. The solar mean anomaly, then, is the equivalent angle for a circular orbit with the same period as the true object on its elliptical orbit. "The mean anomaly is the angular distance from perihelion which the planet would have moved if it moved around the Sun with a constant angular velocity" [3]. If $T$ is instead measured in Julian centuries, the mean anomaly is:

$$ M = [357.5291 + 35\,999.0503\,T - 0.000\,155\,9\,T^2 - 0.000\,000\,48\,T^3]\,\mathrm{mod}\, 360^{\circ} $$

Where mod refers to the modulus function, i.e., the above quantity should be on the interval $[0, 360)$.

The Almanac for Computers provides an approximation that is simpler and no less accurate for transit estimates to the nearest hour, where $T$ is measured in Julian days:

$$ M = 0.9856\,T - 3.289 $$

Note that $35\,999.0503$ divided by $36\,525$ (number of days in a Julian century) equals $0.9856$. In both cases, note that $M$ is expressed in degrees, not radians.
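The arithmetic relating the two formulas for the mean anomaly can be checked directly. This is a minimal sketch; the function name mean_anomaly is hypothetical and the example value of $T$ (41.75, i.e., 18h00 UTC on DOY 41) is arbitrary.

```python
def mean_anomaly(t_days):
    'Solar mean anomaly in degrees, for T in days (Almanac approximation).'
    return 0.9856 * t_days - 3.289

rate_per_day = 35999.0503 / 36525 # Degrees of mean anomaly per day
m = mean_anomaly(41.75)           # An arbitrary example value of T
```

The daily rate works out to about 0.9856 degrees, i.e., just under one degree per day, as we would expect for an orbit of roughly 365 days.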

Using a circular orbit is a simplification that is corrected in the next step, with the equation of the center, which describes the angular difference ($v - M$) between the position of the true body (in its elliptical orbit) and the hypothetical body with constant angular velocity (in its circular orbit). This equation can be used in place of Kepler's equation when the eccentricity, $e$, of the orbit is small, which holds in the case of Earth's orbit [5], $e = 0.016711$. The equation of the center is often approximated by a Taylor series expansion, with the accuracy of the estimate improving with the number of powers of $e$ employed. Meeus [3] provides the expansion for the first four terms, which is sufficient for our aim of day-length calculations on Earth with hourly precision:

$$ v - M = \left( 2e - \frac{e^3}{4} + \frac{5e^5}{96} \right)\mathrm{sin}(M) + \left( \frac{5}{4}e^2 - \frac{11}{24}e^4 \right)\mathrm{sin}(2M) $$

The above equation gives the angular difference in radians. If we plug in Earth's $e$, multiply the coefficients by $180/\pi$, and move $M$ to the right-hand side, we obtain the approximation for the true anomaly, $v$, used in the Almanac for Computers [4]:

$$ v = M + 1.915\,\mathrm{sin}(M) + 0.020\,\mathrm{sin}(2M) $$
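The step from the series expansion to the Almanac's coefficients can be reproduced numerically. The sketch below plugs Earth's eccentricity into the two bracketed terms of the expansion above and converts from radians to degrees; the names c1, c2, and true_anomaly are hypothetical.

```python
import math

E = 0.016711 # Eccentricity of Earth's orbit, as given above

# Coefficients of sin(M) and sin(2M), converted from radians to degrees
c1 = math.degrees(2 * E - E**3 / 4 + 5 * E**5 / 96)
c2 = math.degrees(5 * E**2 / 4 - 11 * E**4 / 24)

def true_anomaly(m):
    'True anomaly in degrees, given the mean anomaly in degrees.'
    m_rad = math.radians(m)
    return m + c1 * math.sin(m_rad) + c2 * math.sin(2 * m_rad)
```

Rounded to three decimal places, c1 and c2 should reproduce the Almanac's 1.915 and 0.020. Note that Python's math.sin() expects radians, so the mean anomaly is converted before the sines are taken.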

We then calculate the ecliptic longitude, $L$, the angular distance of the sun in the plane of the ecliptic, measured from the primary direction (a line pointing from Earth to the sun on the date of the vernal equinox), i.e., in the geocentric coordinate system. This is the first step of calculating the sun's position in the sky.

$$ L = (v + 180^{\circ} + \omega)\,\mathrm{mod}\,360^{\circ} $$

Where $\omega \approx 102.937\,348$ is the argument of the perihelion. $\omega$ can be calculated as: $\omega = \pi - \Omega$, where $\pi$ is the longitude of the perihelion and $\Omega$ is the longitude of the ascending node [6]. I could not find a reference for $\omega$ in the Earth-sun system so I calculated it by reading $\pi = 102.937\,348$ from Meeus' (1991) Table 30.A [3] and assuming that $\Omega = 0$, given that it does not appear in Table 30.A and at the "mean equinox of the date" the Earth is at the ascending node in its orbit around the sun. This may not be the correct interpretation of this calculation (I am not an astronomer), but it does produce the correct value of $L$, which can also be verified by comparing the value $180 + 102.937\,348$ to the value used in Almanac for Computers, $282.634$. The Almanac does provide a time-varying formula of the longitude of perihelion, $\ell$, that could be used in place of $180 + \omega$:

$$ \ell = 180^{\circ} + 100.460 + 36000.772\,T $$

Where $T$ is measured in Julian centuries and $\ell$ is obtained in degrees.

Position of the Sun in Equatorial Coordinates

So far, we've been working in the ecliptic coordinate system. Now, we want to convert these coordinates (ecliptic longitude and latitude of the sun, the latter assumed to be zero as it is always very small) to an equatorial coordinate system, expressed as right ascension and declination. This is a great introduction to the equatorial (or "celestial") coordinate system, using Earth's geographic coordinate system as an analog.

The declination of the sun, $\delta$, is obtained:

$$ \mathrm{sin}(\delta) = \mathrm{sin}(L)\times \mathrm{sin}(23.44^{\circ}) $$

Where 23.44 degrees is the maximum tilt of Earth's axis [5]. Note that all of these sines should accept arguments in degrees, not radians.

The right ascension of the sun, $\alpha$, is obtained:

$$ \mathrm{tan}(\alpha) = \mathrm{cos}(23.44^{\circ})\times \mathrm{tan}(L) $$

Where, again, we use the obliquity of the Earth (23.44 degrees). Pre-computing this cosine (and the sine of the obliquity, above) can speed things up.

We also need to check to make sure that the right ascension, $\alpha$, is in the same quadrant as the sun's longitude, $L$:

$$ \alpha^* = \alpha + 90\times \left[ f\left(\frac{L}{90}\right) - f\left(\frac{\alpha}{90}\right) \right] $$

Where $f()$ is the floor function (i.e., round down to the nearest integer).
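The declination, right ascension, and quadrant adjustment can be sketched together in a few lines of standard-library Python. This is only a minimal illustration: the function name `sun_equatorial` is hypothetical, the input longitude is assumed to lie on $[0, 360)$, and the raw right ascension is wrapped onto $[0, 360)$ before the quadrant correction is applied.

```python
import math

OBLIQUITY = 23.44  # Degrees, the value used in the equations above

def sun_equatorial(lng_sun):
    """
    Declination and quadrant-corrected right ascension (both in degrees)
    for an ecliptic longitude, L, given in degrees on [0, 360).
    """
    lng_rad = math.radians(lng_sun)
    eps = math.radians(OBLIQUITY)
    dec = math.degrees(math.asin(math.sin(lng_rad) * math.sin(eps)))
    # atan() returns values on (-90, 90); wrap onto [0, 360) first
    ra = math.degrees(math.atan(math.cos(eps) * math.tan(lng_rad))) % 360
    # Shift RA by whole quadrants so it shares a quadrant with L
    ra += 90 * (math.floor(lng_sun / 90) - math.floor(ra / 90))
    return dec, ra

for lng in (30, 120, 200, 300):
    dec, ra = sun_equatorial(lng)
    print(lng, round(dec, 2), round(ra, 2))
```

Without the last adjustment, a longitude of 120 degrees would yield a right ascension in the fourth quadrant (about 302 degrees) rather than the second (about 122 degrees).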

The sun's local hour angle can now be obtained using Meeus' equation from before. We define $h_0 \rightarrow 0$ as the zenith of the sun such that $h_0 = 90^{\circ}$ for solar noon, i.e., "when the sun is at its zenith." Note that Almanac for Computers defines solar zenith differently such that it equals zero at solar noon; consequently, they use $\mathrm{cos}(h_0)$ in place of $\mathrm{sin}(h_0)$, below. We might not set $h_0$ exactly equal to zero (or exactly equal to 90 if following Almanac for Computers) for sunrise or sunset because of the finite width of the sun's disc and the effect of atmospheric refraction. Elevation of the observer might also be considered. Thus, the definition of $h_0$ will depend on your application. Meeus writes that 34 minutes ($0.5\bar{6}$ degrees) "is generally adopted for the effect of refraction at the horizon" and that 16 minutes is added as an approximation of the radius of the sun (seen from Earth). Hence, Meeus recommends $h_0 = -0.8\bar{3}$ degrees (the negative sign indicates the center of the body is below the horizon).

$$ \mathrm{cos}(H_0) = \frac{\mathrm{sin}(h_0) - \mathrm{sin}(\phi)\mathrm{sin}(\delta)}{\mathrm{cos}(\phi)\mathrm{cos}(\delta)} $$

Recall that $\phi$ is the latitude of the observer.

Edge Cases Near the Poles

If we're above the Arctic Circle or below the Antarctic Circle, it's possible that the sun is always up (never sets) or never up (never rises). We can detect these conditions as follows. If $\mathrm{cos}(H_0) > 1$, the sun never rises at this location on this date. If $\mathrm{cos}(H_0) < -1$, the sun never sets at this location on this date.
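The hour-angle equation and the two polar checks might be combined as in the sketch below. The function name `hour_angle` and the status labels are assumptions made for illustration; the default $h_0$ is Meeus' recommended value, and all angles are in degrees.

```python
import math

def hour_angle(lat, dec, h0=-0.8333):
    """
    Returns a (status, H0) tuple, where H0 is the local hour angle in
    degrees at rising or setting. H0 is None when the sun never crosses
    the altitude h0 on this date.
    """
    lat_r, dec_r, h0_r = (math.radians(a) for a in (lat, dec, h0))
    cos_h = (math.sin(h0_r) - math.sin(lat_r) * math.sin(dec_r)) / (
        math.cos(lat_r) * math.cos(dec_r))
    if cos_h > 1:
        return ('never_rises', None)  # Polar night
    if cos_h < -1:
        return ('never_sets', None)  # Midnight sun
    return ('normal', math.degrees(math.acos(cos_h)))

print(hour_angle(0, 0))     # Equator at an equinox
print(hour_angle(80, 23))   # High Arctic near the June solstice
print(hour_angle(80, -23))  # High Arctic near the December solstice
```

At the equator on an equinox, the hour angle is slightly more than 90 degrees because of the refraction and solar-radius correction built into $h_0$.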

Putting it All Together

We can now calculate the local rise time or local setting time. We do this differently for sunrise ($m_{\uparrow}$) and sunset ($m_{\downarrow}$): to make sure that $H_0$ is in the right quadrant, we must add 360 degrees to the rising time. This allows us to obtain the Greenwich apparent sidereal time (GAST) of sunrise or sunset. Note that, below, we are converting both the local hour angle, $H_0$, and right ascension, $\alpha$, to hours by dividing by 15 degrees. The sum of $H_0$ and $\alpha^*$, both in hours, gives the GAST. We convert from GAST to the local apparent sidereal time by adding the solar hour angle (SHA). Then, we convert from sidereal time to Universal Time (for our purposes, equivalent to UTC) for a sunrise ($1/4$ of a day earlier than the transit) or sunset ($1/4$ of a day later).

$$ m_{\uparrow} = \frac{360 - H_0 + \alpha^*}{15} - [\mathrm{SHA}] - \left(0\rlap{.}^{\mathrm{h}} 06571\,(T - 0.25)\right) - 6\rlap{.}^{\mathrm{h}} 622 $$

$$ m_{\downarrow} = \frac{H_0 + \alpha^*}{15} - [\mathrm{SHA}] - \left(0\rlap{.}^{\mathrm{h}} 06571\,(T + 0.25)\right) - 6\rlap{.}^{\mathrm{h}}622 $$

Because $T$ is the transit time at the Greenwich meridian, as a fraction of the day, we want to subtract $1/4$ of a day when calculating sunrise and add $1/4$ of a day when calculating sunset. The scale factor $0\rlap{.}^{\mathrm{h}}06571$ and offset $6\rlap{.}^{\mathrm{h}}622$ are in units of hours and are essentially magic numbers.

Note that the sunrise time, $m_{\uparrow}$, or sunset time, $m_{\downarrow}$, calculated above may not lie on the interval $[0,24)$; in that case, you must add or subtract 24.
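In Python, the adjustment onto $[0,24)$ can be written with the modulo operator, which returns a non-negative result for a positive divisor. The helper name `wrap_hours` below is hypothetical.

```python
import math

def wrap_hours(hours):
    """Map a time in hours onto the interval [0, 24)."""
    return hours % 24

print(wrap_hours(-1.5))     # An hour and a half before midnight
print(wrap_hours(25.25))    # A quarter past one on the following day
print(math.fmod(-1.5, 24))  # C-style remainder keeps the sign of the dividend
```

Note that `math.fmod()` (like the `%` operator in C) keeps the sign of the dividend, so it would not do the wrapping for negative times on its own.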

Calculating Photoperiod

Although the pyephem implementation (see later in the article) is slow, I take it to be the most accurate source for sunrise and sunset calculation and, hence, photoperiod calculation. I therefore wanted to compare my photoperiod calculation to what pyephem comes up with. I calculated sunrise and sunset hours with both approaches for a latitude-longitude grid, with steps of 5/8° longitude and 1/2° latitude. This is the grid used by the NASA GMAO Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2). You can make a pretty picture of photoperiod for the entire globe! For example, on June 25, 2012, global photoperiod kind of looks like this, from short days (darker colors) to longer days (lighter colors):

Below, I visualize the difference between my approach and pyephem; differences are within 2 hours, with pyephem describing slightly shorter days (by 1 hour, shown in blue, or by 2 hours, shown in red) relative to my implementation. Where day length appears to differ by 2 hours, this is only due to a 1-hour rounding difference in both the sunrise and sunset hours. The bias in the direction of shorter days is just because of the offset of the photoperiod calculations and my reduction in precision by rounding both down to the nearest hour. Overall, they agree pretty well.

Python Implementation

I implemented the day-length algorithm described above in Cython—partly because I wanted the practice writing Cython extension modules for Python projects, but also because my experience with pyephem pointed to the need for better performance. A pure-Python implementation can be obtained simply by removing the lines that begin with cdef. Also, I'm rounding down to the nearest hour because I merely need to determine which hours to select from a re-analysis dataset.

I also had to make a decision about how to calculate photoperiod near the poles when the sun is always above or below the horizon. For the austral and boreal summers (sun is always up), I simply decided that the photoperiod must be 24 hours long. Therefore, in the implementation below, I set the sunrise hour to zero and the sunset hour to 23 (Python starts counting at zero). For the austral and boreal winters (sun is always down), however, the solution is not obvious.

For a model that needs to make daily calculations everywhere on the Earth, it seems that even if the sun never rises we need to calculate a daily temperature, daily VPD, et cetera. These are mean quantities over any number of hours, so it doesn't matter how long the day is as long as we have at least one data point to average. Therefore, even when the sun is always below the horizon, I decided that "photoperiod" should also be 24 hours long (same as when the sun is always up). For a photosynthesis model, GPP is likely very close to zero under these conditions, so it doesn't matter much. For other applications, you may want to determine photoperiod during the winter differently. In my implementation below, I emit -1 for the sunrise and sunset hours if the sun is always below the horizon so that we can decide what to do later.

import datetime
import numpy as np

cdef list coords
cdef int doy
cdef float x, zenith, lat, lng, lng_hour, tmean, anomaly
cdef float lng_sun, ra, ra_hours, dec_sin, dec_cos
cdef float hour_angle_cos, hour_angle, hour_rise, hour_sets

# The sunrise_sunset() algorithm was written for degrees, not radians
sine = lambda x: np.sin(np.deg2rad(x))
cosine = lambda x: np.cos(np.deg2rad(x))
tan = lambda x: np.tan(np.deg2rad(x))
arcsin = lambda x: np.rad2deg(np.arcsin(x))
arccos = lambda x: np.rad2deg(np.arccos(x))
arctan = lambda x: np.rad2deg(np.arctan(x))

def sunrise_sunset(coords, dt, zenith = -0.83):
    '''
    Returns the hour of sunrise and sunset for a given date. Hours are on the
    closed interval [0, 23] because Python starts counting at zero; i.e., if
    we want to index an array of hourly data, 23 is the last hour of the day.
    Recommended solar zenith angles for sunrise and sunset are -6 degrees for
    civil sunrise/ sunset; -0.5 degrees for "official" sunrise/sunset; and
    -0.83 degrees to account for the effects of refraction. A zenith angle of
    -0.5 degrees produces results closest to those of pyephem's
    Observer.next_rising() and Observer.next_setting(). This calculation does
    not include corrections for elevation or nutation nor does it explicitly
    correct for atmospheric refraction. Source:
        U.S. Naval Observatory. "Almanac for Computers." 1990.

    Parameters
    ----------
    coords : list or tuple
        The (latitude, longitude) coordinates of interest; coordinates can
        be scalars or arrays (for times at multiple locations on same date)
    dt : datetime.date
        The date on which sunrise and sunset times are desired
    zenith : float
        The sun zenith angle to use in calculation, i.e., the angle of the
        sun with respect to its highest point in the sky (90 is solar noon)
        (Default: -0.83)

    Returns
    -------
    tuple
        2-element tuple of (sunrise hour, sunset hour)
    '''
    lat, lng = coords
    assert -90 <= lat <= 90, 'Latitude error'
    assert -180 <= lng <= 180, 'Longitude error'
    doy = int(dt.strftime('%j'))
    # Calculate longitude hour (Earth turns 15 degrees longitude per hour)
    lng_hour = lng / 15.0
    # Approximate transit time (longitudinal average)
    tmean = doy + ((12 - lng_hour) / 24)
    # Solar mean anomaly at rising, setting time
    anomaly = (0.98560028 * tmean) - 3.289
    # Calculate sun's true longitude by calculating the true anomaly
    #   (anomaly + equation of the center), then add (180 + omega)
    #   where omega = 102.634 is the longitude of the perihelion
    lng_sun = (anomaly + (1.916 * sine(anomaly)) +\
        (0.02 * sine(2 * anomaly)) + 282.634) % 360
    # Sun's right ascension (by 0.91747 = cosine of Earth's obliquity)
    ra = arctan(0.91747 * tan(lng_sun)) % 360
    # Adjust RA to be in the same quadrant as the sun's true longitude, then
    #   convert to hours by dividing by 15 degrees
    ra += np.subtract(
        np.floor(lng_sun / 90) * 90, np.floor(ra / 90) * 90)
    ra_hours = ra / 15
    # Sun's declination (using 0.39782 = sine of Earth's obliquity)
    #   retained as sine and cosine
    dec_sin = 0.39782 * sine(lng_sun)
    dec_cos = cosine(arcsin(dec_sin))
    # Cosine of the sun's local hour angle
    hour_angle_cos = (
        sine(zenith) - (dec_sin * sine(lat))) / (dec_cos * cosine(lat))
    # Correct for polar summer or winter, i.e., when the sun is always
    #   above or below the horizon
    if hour_angle_cos > 1:
        return (-1, -1) # Sun is always down
    if hour_angle_cos < -1:
        return (0, 23) # Sun is always up
    hour_angle = arccos(hour_angle_cos)
    # Local mean time of rising or setting (converting hour angle to hours)
    hour_rise = ((360 - hour_angle) / 15) + ra_hours -\
    (0.06571 * (tmean - 0.25)) - 6.622
    hour_sets = (hour_angle / 15) + ra_hours -\
    (0.06571 * (tmean + 0.25)) - 6.622
    # Convert to UTC, then round down to the whole hour
    return (
        np.floor((hour_rise - lng_hour) % 24),
        np.floor((hour_sets - lng_hour) % 24))

Should we always round down to the nearest hour? Or should we round to the nearest whole hour (round, not floor), possibly rounding up? I find that rounding down (floor) produces evenly spaced photoperiod bands across longitudes. Any other rounding scheme produces a kind of aliasing that biases day length high or low at a certain longitude. Again, you should decide what is best for your application.

Using Photoperiod for Climatic Data Aggregation

What impact does a sun-up calculation of climatic variables have? Specifically, compared to a simple daily average over 24 hours, do we see an impact from an average calculated only over the photoperiod? If we look at the difference in VPD (on June 25, 2012) between a 24-hour calculation and a photoperiod calculation, we see that the atmospheric moisture demand over land is much higher when it is aggregated from hourly data when the sun is up. This makes sense (it's hotter during the day), but the sun-up calculation of VPD might also more accurately represent the atmospheric moisture demand experienced by plants during photosynthesis. The difference can be as much as 700 Pa! (I clipped the image below to 600 Pa for better visualization.)

Note that while there is a faint stamping pattern of the photoperiod on the above difference image, the daylight-aggregated VPD image does not show these artifacts.
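For reference, a sun-up aggregation of one day of hourly data might look like the sketch below. It is an illustration under stated assumptions, not the code used to produce the figures: the function name `daylight_mean` is hypothetical, the sunrise and sunset hours are treated as inclusive, a sunset hour earlier than the sunrise hour is taken to mean the photoperiod wraps past midnight UTC, and the -1 sentinel (sun never rises) falls back to the 24-hour mean, as discussed earlier.

```python
import numpy as np

def daylight_mean(hourly, hour_rise, hour_sets):
    """
    Mean of a 24-element hourly series (indexed by UTC hour) over the
    sun-up hours, inclusive of the sunrise and sunset hours.
    """
    hourly = np.asarray(hourly, dtype=float)
    hours = np.arange(24)
    if hour_rise < 0 or hour_sets < 0:
        # Sentinel value (-1): the sun never rises; use all 24 hours
        mask = np.ones(24, dtype=bool)
    elif hour_rise <= hour_sets:
        mask = (hours >= hour_rise) & (hours <= hour_sets)
    else:
        # In UTC, sunset can fall on an earlier clock hour than sunrise
        mask = (hours >= hour_rise) | (hours <= hour_sets)
    return hourly[mask].mean()

vpd = np.arange(24)  # Stand-in for 24 hourly VPD values
print(daylight_mean(vpd, 6, 18))
print(daylight_mean(vpd, 20, 1))
print(daylight_mean(vpd, -1, -1))
```

The midnight sun case, (0, 23), needs no special handling because it already selects all 24 hours.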

Competing Approaches

Before describing the pyephem and astral approaches, I want to further motivate my work here. I timed the pyephem and astral implementations (code below) using Python's timeit module. The wall times below (in microseconds) are the average of the per-loop times in three trials under similar CPU loads, for an Intel Core i7-10710U (1.10 GHz) CPU.

$ python -m timeit -n 100 -s "import datetime;from my_module import updown_func" \
    "updown_func((42, -83), datetime.datetime.today())"

Implementation	Time (usec)
`pyephem`	544
`astral`	125
Almanac for Computers (Python)	75
My algorithm (Cython)	63

The Cython implementation is the one I showed above. The Almanac for Computers (Python) implementation is similar but is in pure Python and calculates separate quantities throughout for sunrise and sunset—literally the way it is described in the Almanac, but ultimately unnecessary. The speed-up between the Python and Cython versions is probably due to the static typing of Cython, not that we discarded some redundant calculations.

The pyephem approach asks us to define an observer and a celestial body (in this case, the sun). I like their API, especially the custom error classes AlwaysUpError and NeverUpError. As with the Almanac for Computers implementation, dt is a datetime.date instance and coords is a 2-element sequence of latitude and longitude.

import ephem

SUN = ephem.Sun() # Module-level constant

def updown_pyephem(coords, dt): # Function name is illustrative
    obs = ephem.Observer()
    # Positions in degrees are expected to be str type
    obs.lat, obs.long = map(str, coords)
    obs.date = dt.strftime('%Y-%m-%d')
    obs.pressure = 0 # Do not calculate refraction
    try:
        rising = obs.next_rising(SUN).datetime().hour
    except ephem.AlwaysUpError:
        rising = 0 # Sun is always up
    except ephem.NeverUpError:
        rising = -1 # Sun is always down
    try:
        setting = obs.next_setting(SUN).datetime().hour
    except ephem.AlwaysUpError:
        setting = 23
    except ephem.NeverUpError:
        setting = -1
    return (rising, setting)

In the astral implementation, we don't have custom error classes differentiating cases where the sun is always above or below the horizon. There is a ValueError issued when one tries to observe the sun from north of (south of) the Arctic (Antarctic) Circle. Therefore, I had to implement a pretty poor hack; we check to see if we're above or below the Arctic or Antarctic circles, respectively, in the summer or winter.

from astral import LocationInfo
from astral.sun import sun

def updown_astral(coords, dt): # Function name is illustrative
    lat, lng = coords
    loc = LocationInfo()
    loc.latitude = lat
    loc.longitude = lng
    try:
        s = sun(loc.observer, date = dt)
    except ValueError:
        # Crude seasonal check; 60 degrees is a loose stand-in for the
        #   latitude of the polar circles
        if lat > 60 and dt.month in (4, 5, 6, 7, 8, 9):
            return (0, 23) # Sun always up above Arctic Circle
        if lat < -60 and dt.month in (1, 2, 3, 10, 11, 12):
            return (0, 23) # Sun always up below Antarctic Circle
        return (-1, -1) # Sun always down
    return (s['sunrise'].hour, s['sunset'].hour)

I can't recommend the astral implementation, above, because of this hack. But why is the pyephem approach so slow? The backbone of pyephem is implemented in C, but there may be some overhead in initialization of the Observer(). If you only need to calculate transit times for a single observer-body pair, this performance hit would go unnoticed. However, if you need to calculate transit times for all the pixels in a global Earth system model, I think the modified Almanac approach is the best choice.

References

1. Chapin, F. S., Matson, P. A., & Vitousek, P. M. (2011). Principles of Terrestrial Ecosystem Ecology. Springer New York.
2. Zhao, M., Heinsch, F. A., Nemani, R. R., & Running, S. W. (2005). Improvements of the MODIS terrestrial gross and net primary production global data set. Remote Sensing of Environment, 95, 164–176.
3. Meeus, J. (1991). Astronomical Algorithms. Willman-Bell Inc.
4. U.S. Naval Observatory. (1990). Almanac for Computers. The Nautical Almanac Office, U.S. Naval Observatory, Washington, D.C.
5. NASA. (2021). "Earth Fact Sheet." Accessed: February 10, 2021.
6. Andøya Space Center. (2021). "Introduction of the six basic parameters describing satellite orbits." Accessed: February 10, 2021.

Implementing fixed effects panel models in R

2019-03-30T13:51:00+01:00

Note: This post builds and improves upon an earlier one, where I introduce the Gapminder dataset and use it to explore how diagnostics for fixed effects panel models can be implemented.

Note (July 2019): I have since updated this article to add material on making partial effects plots and to simplify and clarify the example models.

My last post on this topic explored how to implement fixed effects panel models and diagnostic tests for those models in R, specifically because the two libraries I used for this at the time, plm and lfe, in different ways, weren't entirely compatible with R's built-in tools for evaluating linear models. Here, I want to write a much more general article on fixed effects regression and its implementation in R. Specifically, I'll write about:

Use and interpretation of fixed effects (FE) regression models in the context of repeat-measures or longitudinal data;
How to implement an FE model in R using either the built-in lm() function or those provided by plm or lfe;
Calculating variance inflation factors (VIF);
Assessing multi-collinearity among predictor variables before fitting an FE model;
FE model criticism, including whether or not the assumptions of the linear model are met;
Calculating and plotting the marginal effect of $X$ on $Y$, i.e., partial effects plots.

In this article, I'll be using the Gapminder dataset again; the previous article gives a description of the dataset and its contents.

Use and Interpretation of Fixed Effects Regression

I'm going to focus on fixed effects (FE) regression as it relates to time-series or longitudinal data, specifically, although FE regression is not limited to these kinds of data. In the social sciences, these models are often referred to as "panel" models (as they are applied to a panel study) and so I generally refer to them as "fixed effects panel models" to avoid ambiguity for any specific discipline. Longitudinal data are sometimes referred to as repeat measures, because we have multiple subjects observed over multiple periods, e.g., patients in a clinical trial or households in a study of spending habits throughout the year. You can think of multiple examples where repeat measures are relevant.

As I previously discussed, fixed effects regression originates in the social sciences, in particular in econometrics and, separately, in prospective clinical or social studies:

In these prospective studies, a panel of subjects (e.g., patients, children, families) are observed at multiple times (at least twice) over the study period. The chief premise behind fixed effects panel models is that each observational unit or individual (e.g., a patient) is used as its own control, exploiting powerful estimation techniques that remove the effects of any unobserved, time-invariant heterogeneity

The term "fixed effects" can be confusing, and is contested, particularly in situations where fixed effects can be replaced with random effects. Clark and Linzer (2014) provide a good discussion of the differences and trade-offs between fixed and random effects [1]. Gelman and Hill (2007) or Bolker et al. (2009) also provide good discussions of the differences between fixed and random effects [2,3].

Relevance of Fixed Effects Regression for Causal Inference

Repeat measures are commonly required for a particular type of causal inference. In these studies, the interpretation of a causal effect is that it occurs before or at the same time as the measured outcome (some causal effects appear to be simultaneous with the outcome, such as flipping on a light switch). In fact, FE regression models are often used to establish weak causal inference under certain circumstances; we'll soon see why.

But even where causal inference is not the goal, FE regression models allow us to control for omitted variables. In the context of a regression model, an omitted variable is any variable that explains some variation in our response or dependent variable and co-varies with one or more of the independent variables. It is something that we should be measuring and adding to our regression model because it predicts or explains our dependent variable but also because the relationship between one of our existing independent variables and the dependent variable may depend on that omitted variable. For example, if we're interested in measuring the effect of different amounts of a fertilizer on crop yield (i.e., the weight or biomass of the harvested crop) across a set of different crop types, omitted variables might include (if we failed to measure them) the crop type or the type of soil each plant is in. Crop type certainly affects crop yield, as certain crops will have different ranges of yields they can achieve, but also may affect the way that fertilizer drives yields; certain crops may be more or less sensitive to the fertilizer we're using. Soil type, too, will affect yields (without fertilizer, it is the only source of the crop's nutrients) and the properties of the soil may affect how fertilizer is retained and subsequently absorbed by a plant's roots. In our study, failing to account for either crop type or soil type would be a source of omitted variable bias in our study design and in our model.

FE regression models eliminate omitted variable bias with respect to potentially omitted variables that do not change over time. Such time-invariant variables, like crop type or soil type, from our previous example, will be the same for each subject in our model every time it is measured. In a clinical trial, patient sex, eye color, and height (in grown adults) are all examples of time-invariant variables. We'll soon see how the use of subject-level fixed effects controls for any and all time-invariant omitted variables. But first, let's appreciate the implications for causal inference.

Let's say we have repeat measures of $y$, some outcome of interest, and of multiple $x_i$ or independent variables. We have measured every relevant variable that varies over time and affects $y$ and/or the relationship between $y$ and other of the $x_i$. Furthermore, we have controlled for all sources of time-invariant differences between subjects [1]. That means the only variable(s) that can explain differences in $y$ are one or more of those time-varying $x_i$ we have measured. By estimating the effect of $x_i$ within an individual subject over time, relative to that subject's long-term average conditions, we eliminate the effects of all unobserved, time-invariant heterogeneity between the different subjects [4]. We can then argue that a change in the level of any particular $x_i$, if it has a sufficient mechanism we can explain, is a likely cause of a corresponding change in $y$. Much of this depends on the nature of your data, whether or not your proposed treatment variable is reasonable, whether or not you have actually controlled for everything relevant, and, no less important, the reception this type of model will receive from your intended audience (or field of study). In general, causal inference with panel models still requires an assumption of strong exogeneity (simply put: no hidden variables and no feedbacks).

General Specification of Fixed Effects Models

In general, for a sample of subjects indexed $i\in [0, 1, 2, \dots ]$, where each individual subject can be identified as part of a group, $j$, of other observations (on the same individual or on multiple other individuals), the outcome for an individual can be modeled as:

$$ y_{ij} = \alpha_j + X_i\beta + \varepsilon_i;\quad \varepsilon_i\sim N(0, \sigma_y^2) $$

In order for this model to be identified, it is essential that the rank of $i$ be larger than the rank of $j$, i.e., that there are more individual subjects than there are groups. Otherwise, the $\alpha_j$ terms would absorb all the degrees of freedom in the model.

Alternative Specifications for Longitudinal Data

Similarly, in a repeat-measures or longitudinal framework, where the "groups" of individuals are time periods, it is essential that each individual subject is observed more than once. Obviously, if the number of observations, $N$, were equal to the number of individuals, $M$, we would exhaust the degrees of freedom in our model simply by adding $M$ intercept terms, $\alpha_1, \dots, \alpha_M$. With as few as two observations $(t \in [1,2])$ of each subject, however, we've doubled the number of observations and the individual intercept terms now correspond to any time-invariant, idiosyncratic change between those two observations.

We can specify our model in two different ways; though very different, they have the same interpretation and will produce the same parameter estimates in a least-squares regression. Compared to the general specification, above, we exchange the index of groups, $j$, for an index of time periods, $t$. The first specification is an ordinary least squares (OLS) regression in which a fixed intercept, $\alpha_i$, is fit for every subject $i$.

$$ y_{it} = \alpha_i + X_{it}\beta + \varepsilon_{it} $$

The second specification subtracts the subject-specific mean values of our dependent variable, $y$, and independent variables, $X$, from the values at each period of observation, $t$, for every subject, $i$.

$$ y_{it} - \bar{y_i} = \left(X_{it} - \bar{X_i}\right)\beta + \varepsilon_{it} $$

These two specifications are equivalent because fitting a subject-specific intercept, $\alpha_i$, effectively reduces the variation in each subject's $y_i$ and $X_i$ to variation around its long-term mean. In the first, or fixed-intercept specification, $\alpha_i$ represents each subject's long-term mean. In the second, or demeaned specification, subtracting the subject-specific mean values of the dependent and independent variables is called centering the data within subjects. This is because the resulting values now have a mean value of zero.

As with everything in statistics, a diverse set of terms have been created to describe the same thing, and the terms used often depend on the lingua franca of a particular discipline. Subtracting the subject-specific means can be variously referred to as centering the data within subjects or time-demeaning the data (subtracting the long-term mean); the centered values themselves can also be referred to as deviations from the (subject-specific) mean.
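The equivalence of the two specifications can be checked numerically. The sketch below uses Python with NumPy rather than R, and simulated data rather than Gapminder; the variable names, the simulated panel, and the true slope of 2.0 are all assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_subjects, n_periods = 5, 8
subject = np.repeat(np.arange(n_subjects), n_periods)

# Simulated panel: subject-specific intercepts and one time-varying
#   predictor that is correlated with those intercepts
alpha = rng.normal(0, 5, n_subjects)
x = rng.normal(0, 1, subject.size) + alpha[subject]
y = alpha[subject] + 2.0 * x + rng.normal(0, 0.5, subject.size)

# Specification 1: one dummy variable (intercept) per subject, plus x
dummies = (subject[:, None] == np.arange(n_subjects)).astype(float)
design = np.column_stack([dummies, x])
beta_dummy = np.linalg.lstsq(design, y, rcond=None)[0][-1]

# Specification 2: center y and x within subjects, then regress the
#   centered y on the centered x without an intercept
def center_within(values, groups):
    means = np.bincount(groups, weights=values) / np.bincount(groups)
    return values - means[groups]

x_dm = center_within(x, subject)
y_dm = center_within(y, subject)
beta_within = (x_dm @ y_dm) / (x_dm @ x_dm)

print(beta_dummy, beta_within)
```

The two slope estimates should agree to within floating-point error, even though the second regression never estimates the subject-level intercepts.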

Interpretation

Setting aside issues of causal inference, how do we interpret a fixed effects regression? Because of the way the data have been transformed (into deviations from subject-specific means), we cannot interpret the coefficients in the same way as for a cross-sectional OLS regression. In the cross-sectional case, we interpret a regression coefficient, $\beta$, as the change in our dependent variable per unit change in the corresponding independent variable across or between subjects; in a sense, we are estimating the effect of a difference between two subjects, one average in every way, and the other different by one unit in the corresponding independent variable. In a one-way fixed effects regression, because the dependent and independent variables have been transformed to deviations from the subject-specific means, $\beta$ is instead interpreted as the change in our dependent variable, $y$, per unit change in the corresponding independent variable, $x$, within each subject. In this sense, the regression coefficients tell us about the relationship between $x$ and $y$ as the subject's $x$ changes over time. If we accept that weak causal inference is justified, the model can be interpreted as: a unit change in $x$ drives an estimated change in $y$.

Including Time Period Fixed Effects

This model can be extended further to include both individual fixed effects (as above) and time fixed effects (the "two-ways" model):

$$ y_{it} = \alpha_i + X_{it}\beta + \mu_t + \varepsilon_{it} $$

Here, $\mu_t$ is an intercept term specific to the time period of observation; it represents any change over time that affects all observational units in the same way (e.g., the weather or the news in an outpatient study). These effects can also be thought of as "transitory and idiosyncratic forces acting upon [observational] units (i.e., disturbances)" [5].

However, including time period fixed effects changes the interpretation of our model considerably. In the individual fixed effects (only) model, $\beta$ represented the "within" effect: the effect of a change in $X_i$ on $y$ within each individual $i$. Now, the time period fixed effect functions as an additional grouping in which the data are centered (in our time-demeaning framework, described above). With both time and individual fixed effects, $\beta$ essentially represents a weighted average between the pooled estimator, $\beta_{OLS}$ (from an OLS regression without fixed effects), the within estimator from our individual effects model, and a between effect from a model with time fixed effects (only) and no individual effects [6].

As Kropko and Kubinec (2018) write, regarding a similar econometric model to the one we investigate here:

This interpretation will often be difficult to communicate and to understand. The difficulty arises because the interpretation requires two dimensions of comparison, not just one. GDP per capita is negative relative to the country’s over-time average, so we compare a country to itself as it changes over time. But then, by regressing relative democracy on relative GDP per capita for the six countries, the two-way FE coefficient ultimately expresses how one country’s GDP per capita and democracy, relative to itself, compares to another country’s GDP per capita and democracy, relative to itself. If this interpretation does not match the question the model is intended to answer, then we suggest that applied researchers employ methods with interpretations that directly answer the research question.
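To make the "two dimensions of comparison" concrete, here is a minimal sketch (Python with NumPy, simulated data) of the two-way model. For a balanced panel, fitting both sets of intercepts is equivalent to centering each variable on its subject mean and its period mean and then adding back the grand mean. The variable names, the simulated panel, and the true slope of 1.5 are assumptions for illustration, and the double-centering shortcut as written does not carry over to unbalanced panels.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_periods = 6, 5
subject = np.repeat(np.arange(n_subjects), n_periods)
period = np.tile(np.arange(n_periods), n_subjects)

alpha = rng.normal(0, 3, n_subjects)  # Subject (individual) effects
mu = rng.normal(0, 3, n_periods)      # Time period effects
x = rng.normal(0, 1, subject.size) + alpha[subject] + mu[period]
y = alpha[subject] + mu[period] + 1.5 * x + rng.normal(0, 0.5, subject.size)

# Dummy-variable form: every subject dummy, and every period dummy
#   except the first (to avoid perfect collinearity)
d_subject = (subject[:, None] == np.arange(n_subjects)).astype(float)
d_period = (period[:, None] == np.arange(1, n_periods)).astype(float)
design = np.column_stack([d_subject, d_period, x])
beta_dummy = np.linalg.lstsq(design, y, rcond=None)[0][-1]

# Double-centered form: subtract subject means and period means, then
#   add back the grand mean (balanced panels only)
def two_way_center(values):
    grid = values.reshape(n_subjects, n_periods)  # Rows are subjects
    centered = (grid - grid.mean(axis=1, keepdims=True)
                - grid.mean(axis=0, keepdims=True) + grid.mean())
    return centered.ravel()

x_dm, y_dm = two_way_center(x), two_way_center(y)
beta_two_way = (x_dm @ y_dm) / (x_dm @ x_dm)

print(beta_dummy, beta_two_way)
```

Each centered value is now a deviation from both its subject's average and its period's average, which is the double comparison the quotation describes.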

Implementation in R

Let's load in the Gapminder dataset for the following examples. Since my previous article, I've discovered there is a gapminder package available for R that makes it easy to load these data into an R session.

library(gapminder)
data(gapminder)

Now let's take a brief look at the data. We're interested in modeling the effect of per-capita GDP on life expectancy. Let's first observe that per-capita GDP has a log-linear relationship with life expectancy. Taking a base-10 logarithm of right-skewed dollar values is generally good practice, and the plot below shows that doing so here improves the linear relationship with life expectancy.

with(gapminder, plot(log10(gdpPercap), lifeExp,
  main = 'Life Expectancy vs. Log10 Per-Capita GDP'))

However, for now, we'll model the effect of per-capita GDP without a log transformation because it is simpler.

There are at least three ways to run a fixed effects (FE) regression in R and it's important to be familiar with your options.

With R's Built-in Ordinary Least Squares Estimation

First, it's clear from the first specification above that an FE regression model can be implemented with R's OLS regression function, lm(), simply by fitting an intercept for each level of a factor that indexes each subject in the data.

m1.ols <- lm(lifeExp ~ country + gdpPercap + pop, data = gapminder)

One disadvantage of this approach becomes clear as soon as you call summary(m1.ols); the subject-specific (here, country-specific) intercepts are reported for 140+ countries in this dataset! That's a lot to scroll through to get to the coefficients we're actually interested in.

summary(m1.ols)$coefficients[c('gdpPercap', 'pop'),]

              Estimate   Std. Error  t value     Pr(>|t|)
gdpPercap 3.936623e-04 2.973936e-05 13.23708 5.512379e-38
pop       6.196916e-08 4.838246e-09 12.80819 8.746824e-36

We'll interpret these coefficients later. For now, let's convince ourselves that this model produces the same results if we use centered data and no country-level intercepts.

The initial challenge is in centering the data. I'm going to use a relatively sophisticated tool to do this, simply because I don't know of a reasonable way to do it with base R. Using the dplyr library's mutate_at() function, we'll calculate a new, centered variable for our dependent variable, lifeExp, and each of the three independent variables. This new variable has the suffix _dm at the end of its name, which is my abbreviation for "de-meaned" as in the mean has been subtracted from the variable; you can call it whatever you want.

library(dplyr)
gapminder.centered <- gapminder %>%
  group_by(country) %>%
  mutate_at(.vars = vars(year, lifeExp, pop, gdpPercap), .funs = funs('dm' = . - mean(.)))

summary(gapminder.centered$lifeExp_dm)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-20.8647  -4.2138   0.4733   0.0000   4.5696  17.1973

We can see from the above that the overall mean, across all subjects, is zero. This is a consequence of the fact that the mean of each subject's measures is zero, so the mean of means is also zero. To fit the second (demeaned) specification of the model using lm(), we plug in the centered versions of the independent variables.

m2.ols <- lm(lifeExp ~ gdpPercap_dm + pop_dm,
  data = gapminder.centered)

Now, let's compare the coefficients.

cbind(
  coef(m1.ols)[c('gdpPercap', 'pop')],
  coef(m2.ols)[c('gdpPercap_dm', 'pop_dm')]
)

                  [,1]         [,2]
gdpPercap 3.936623e-04 3.936623e-04
pop       6.196916e-08 6.196916e-08

As you can see, the specifications are equivalent. These coefficients are also correct point estimates for the "within" effect of each independent variable on the outcome. However, as we'll discuss, the standard errors are not correct; this OLS model fails to account for the fact that we have repeat measures of each subject. This is a violation of the assumption of independence of errors. With our data, where we have multiple measures for the same country, some elements of the error term, $\varepsilon$, are not independent. Clustering the standard errors within countries is one solution, which I won't detail here. The more sophisticated approaches to this model discussed below will deliver the correct standard errors.
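The equivalence of the two specifications is not specific to R or to the Gapminder data. Below is a minimal sketch in Python with NumPy, using a small synthetic panel (all of the names and numbers here are my own, for illustration only). Unlike m2.ols above, this sketch centers the response as well as the covariates and fits no intercept; under those assumptions, the demeaned model reproduces the slopes of the dummy-variable model, and its standard errors match once the residual degrees of freedom are reduced by the number of implicit intercepts.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic balanced panel (illustrative only): 5 subjects, 8 periods each
n_subjects, n_periods, k = 5, 8, 2
subject = np.repeat(np.arange(n_subjects), n_periods)
n_obs = subject.size
alpha = rng.normal(0, 5, n_subjects)                      # Subject-specific intercepts
x = rng.normal(0, 1, (n_obs, k)) + alpha[subject, None]   # Covariates correlated with them
y = alpha[subject] + x @ np.array([1.5, -0.7]) + rng.normal(0, 0.5, n_obs)

def demean(values, groups):
    """Subtract each group's mean from its rows (the "within" transformation)."""
    out = np.array(values, dtype=float)
    for g in np.unique(groups):
        out[groups == g] -= out[groups == g].mean(axis=0)
    return out

def ols(X, y, df_resid):
    """Return OLS coefficients and standard errors for a given residual d.f."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    se = np.sqrt(resid @ resid / df_resid * np.diag(np.linalg.inv(X.T @ X)))
    return beta, se

# Specification 1: one dummy (intercept) per subject, plus the covariates
dummies = (subject[:, None] == np.arange(n_subjects)).astype(float)
beta_full, se_full = ols(np.hstack([dummies, x]), y, n_obs - n_subjects - k)
beta_dummies, se_dummies = beta_full[-k:], se_full[-k:]

# Specification 2: demean y and x within each subject; no intercepts needed
xd, yd = demean(x, subject), demean(y, subject)
beta_within, se_naive = ols(xd, yd, n_obs - k)           # Ignores the implicit dummies
_, se_adjusted = ols(xd, yd, n_obs - n_subjects - k)     # Counts the implicit dummies

print(np.allclose(beta_dummies, beta_within))  # Same slopes
print(np.allclose(se_dummies, se_adjusted))    # Same standard errors after adjustment
```

The only difference between se_naive and se_adjusted is the divisor used for the residual variance, which is the kind of degrees-of-freedom correction that dedicated fixed effects packages apply for you.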

If you have a very large number of subjects, the lm() function will cease to work for the first specification of our model, with subject-specific intercepts; it simply wasn't designed to fit thousands of intercepts and it will either take a long time to compute or will fail utterly. This is where the centering approach comes in handy: it is much easier (on the computer) to work with deviations from the mean instead of computing all those subject-specific intercepts. The plm and lfe libraries, which we'll discuss next, have no issue with a large number of subjects in your data, and you don't need to think about the two specifications we discussed when you're using those libraries.

With Dedicated Approaches for Mean Deviations

Now let's see how the dedicated packages plm and lfe are used. I'll give more time to plm because it is my preferred tool, but they both work very well. Neither of these packages fits intercepts directly, because this doesn't scale well for a large number of subjects. In cases where there is a very large subject population, such an approach (which we tried with OLS, above) could lead to a failure to identify the model. Instead, these packages have tools to fit FE regression models to data that have been transformed into deviations from subject-specific means or from more complicated deviation measures (in the case of two-ways fixed effects models).

With the lfe package [7], our fixed effects regression of life expectancy on per-capita GDP and total population can be expressed with a syntax similar to that of the popular lme4 and nlme packages. The felm() function is what we want to use to fit fixed effects models with lfe.

library(lfe)
m1.lfe <- lfe::felm(lifeExp ~ gdpPercap + pop | country,
  data = gapminder)

The | country syntax indicates we wish to fit a fixed intercept for each level of country. If we compare the coefficient estimates of this model to those of both of our prior OLS models, we'll see that we are indeed fitting exactly the same mean structure in all three approaches.

cbind(
  coef(m1.ols)[c('gdpPercap', 'pop')],
  coef(m2.ols)[c('gdpPercap_dm', 'pop_dm')],
  coef(m1.lfe)
)

If we examine the standard errors, however, we'll see that they are different in the demeaned OLS (or "OLS on mean deviations") model.

std.errs <- cbind(
  summary(m1.ols)$coefficients[c('gdpPercap', 'pop'),2],
  summary(m2.ols)$coefficients[c('gdpPercap_dm', 'pop_dm'),2],
  summary(m1.lfe)$coefficients[,2]
)
colnames(std.errs) <- c('OLS w/ Intercepts', 'OLS on Mean Deviations', 'felm Model')
std.errs

          OLS w/ Intercepts OLS on Mean Deviations   felm Model
gdpPercap      2.973936e-05           6.052637e-05 2.973936e-05
pop            4.838246e-09           9.846931e-09 4.838246e-09

In general, you cannot rely on OLS to deliver the correct standard errors when you have a dependence structure like repeat measures in your data. In this case, the standard errors for the OLS model with intercepts (first column) are the same as estimated by lfe (third column), because our OLS model does have a dummy variable for each country-specific intercept. The correction for standard errors, in this case, is straightforward, as the felm() documentation describes:

The standard errors are adjusted for the reduced degrees of freedom coming from the dummies which are implicitly present.

In the second column, we can see that the standard errors for the demeaned OLS model are not correct. The advantage of lfe (and plm) is that it achieves the computational efficiency of a mean-deviations approach and is also able to estimate the correct standard errors. Now let's see how plm handles the same model. In plm, the function we'll use to fit FE regression models is also called plm [8]. Below, the index argument indicates which column has levels corresponding to the subjects, for which individual, subject-level intercepts will be fit implicitly.

library(plm)
m1.plm <- plm(lifeExp ~ gdpPercap + pop, data = gapminder,
  model = 'within', index = c('country'))

summary(m1.plm)

Oneway (individual) effect Within Model

Call:
plm(formula = lifeExp ~ gdpPercap + pop, data = gapminder, model = "within",
    index = c("country"))

Balanced Panel: n = 142, T = 12, N = 1704

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max.
-30.23943  -3.25287   0.31427   3.54819  19.85916

Coefficients:
            Estimate Std. Error t-value  Pr(>|t|)    
gdpPercap 3.9366e-04 2.9739e-05  13.237 < 2.2e-16 ***
pop       6.1969e-08 4.8382e-09  12.808 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    73973
Residual Sum of Squares: 59768
R-Squared:      0.19203
Adj. R-Squared: 0.11796
F-statistic: 185.378 on 2 and 1560 DF, p-value: < 2.22e-16

If you compare these coefficient estimates to all three previous models, you'll see they're the same. Other things to note in the summary of plm include:

We have fit a "Oneway (individual) effect Within Model;" that is, we only fit fixed effects for the individual subjects (countries). In plm, this is the default.
Our panel data are balanced, that is, every subject (country) has the same number of observations. Here, we have n = 142 countries observed in T = 12 time periods (12 different years) for a total number of N = 1704 country-year observations. In general, fixed effects regression models are better understood and more reliable for balanced panels.

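The R-squared and adjusted R-squared reported by plm for a "within" model are computed on the demeaned data; that is, they describe the variation within countries over time and do not count the variation absorbed by the country-level fixed effects. You can verify this from the sums of squares above: 1 - 59768 / 73973 is about 0.192. If you call summary() on our lfe model, you'll see that it reports both the "full" model R-squared, which includes the country-level fixed effects, and that of the "projected" model. The "projected" model R-squared refers to the within R-squared, or the proportion of the variation over time explained by the time-varying covariates.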

Diagnostics and Inference in R

Assessing Multicollinearity in Fixed Effects Regression Models

Multicollinearity arises when two or more independent variables are highly correlated with one another. It poses a serious problem for explanatory models of all kinds, including non-parametric and statistical learning approaches, because if the correlation between $x_i$ and $x_j$ is large, and both are similarly correlated with the outcome of interest, $y$, then the model cannot determine which of the two, $x_i$ or $x_j$, is explaining the observed variation in $y$. If multicollinearity exists, linear regression coefficients can be unstable or biased.

A popular way of diagnosing multicollinearity is through the calculation of variance inflation factors (VIFs). The VIF score indicates the factor by which the variance of a coefficient estimate is inflated due to that covariate's correlation with the other covariates. Calculating VIF scores with fitted models other than those produced by lm() can be tricky (it won't work with plm or lfe models), so the easiest way to calculate VIF scores for a one-way fixed effects regression model is to calculate them on the corresponding fitted OLS model. We saw that we could get the time-demeaned panel of the Gapminder data easily enough with dplyr and mutate_at(); below is a second way to do this using plm.

# Assuming we've already fit our plm() model...
design.matrix <- as.data.frame(model.matrix(m1.plm))

# Get the time-demeaned response variable, lifeExp
design.matrix$lifeExp <- plm::Within(
  plm::pdata.frame(gapminder, index = 'country')$lifeExp)

# Fit the OLS model on the demeaned dataset
m3.ols <- lm(lifeExp ~ gdpPercap + pop, data = design.matrix)

# Calculate VIF scores
car::vif(m3.ols)

Here, the VIF scores are all very low, so multicollinearity is not an issue.
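For reference, the VIF for the $j$-th covariate is conventionally defined in terms of $R_j^2$, the R-squared from an auxiliary regression of $x_j$ on all of the other covariates:

$$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2} $$

A VIF of 1 means $x_j$ is uncorrelated with the other covariates; the score grows without bound as $R_j^2$ approaches 1. With only two covariates, as in m3.ols, both VIF scores are equal, because each auxiliary regression involves the same pair of variables. A common (if rough) rule of thumb treats scores above 5 or 10 as a cause for concern.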

Linear Model Assumptions: Homoscedasticity

We can assess homoscedasticity, or constant variance in the residuals, by examining a plot of the model residuals against the fitted values. Once again, R's fitted() doesn't know how to work with plm model objects; however, we can calculate the fitted values as the difference between the observed values and the model residuals.

par(bg = '#eeeeee')
fitted.values <- as.numeric(gapminder$lifeExp - residuals(m1.plm))
plot(fitted.values, residuals(m1.plm),
  bty = 'n', xlab = 'Fitted Values', ylab = 'Residuals',
  main = 'Residuals vs. Fitted')
abline(h = 0, col = 'red', lty = 'dashed')

There certainly seems to be some heteroscedasticity present, particularly in the presence of relatively large, negative residuals or high fitted values. Studentized residuals are one way of assessing the magnitude of a residual in standardized units [8]. To get studentized residuals, we first have to derive the hat matrix (or "projection matrix") from our linear model. This is the matrix of the linear transformation that maps the observed response values, $y$, to the fitted values, $X\hat{\beta}$.

$$ X\hat{\beta} = X(X^T X)^{-1} X^Ty = Py $$

Where $X$ is the design matrix (matrix of explanatory variables) and $y$ is the vector of our observed response values. The hat (or projection) matrix is denoted by $P$. The diagonal of this $N\times N$ matrix (diag(P) in R) contains the leverages for each observation point. In R, we use matrix multiplication and the solve() function (to obtain the inverse of a matrix).

# Calculate projection matrix
X <- model.matrix(m1.plm)
P <- X %*% solve(t(X) %*% X) %*% t(X)

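# Internally studentized residuals: each residual divided by its
#   estimated standard deviation, sqrt(sigma^2 * (1 - h_ii))
sigma.sq <- (1 / m1.plm$df.residual) * sum(residuals(m1.plm)^2)
student.resids <- as.numeric(residuals(m1.plm)) / sqrt(sigma.sq * (1 - diag(P)))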

plot(fitted.values, student.resids, bty = 'n',
  xlab = 'Life Expectancy (Fitted Values)', ylab = 'Residuals',
  main = 'Studentized Model Residuals v. Fitted Values')
abline(h = 0, lty = 'dashed', col = 'red')

The apparent (perceived) distribution of the residuals is the same, but the y-axis now shows standardized units.
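For reference, the usual definition of the internally studentized residual for observation $i$ divides the raw residual by an estimate of its standard deviation:

$$ r_i = \frac{\hat{\varepsilon}_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}, \qquad \hat{\sigma}^2 = \frac{1}{\mathrm{df}}\sum_{i=1}^{N} \hat{\varepsilon}_i^2 $$

Here, $h_{ii}$ is the $i$-th diagonal element of the hat matrix, $P$, and $\mathrm{df}$ is the residual degrees of freedom (m1.plm$df.residual, or 1560 for this model). Because the $r_i$ are in units of standard deviations, observations with an absolute value larger than about 2 or 3 are usually the ones worth a closer look.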

Checking for Influential Observations

Sometimes, a linear relationship can be dominated by a small number of highly influential observations. One way this can happen is if the domain of a certain $X_i$, say, per-capita GDP, is relatively small for most observations (e.g., most countries in a given sample have per-capita GDP in the range of $1,000-2,000) but there are a few countries which have very high per-capita GDP, say, around $5,000. The relationship between per-capita GDP and some outcome like life expectancy, for the group of countries with per-capita GDP in the range $1,000-2,000 might be nothing: the slightly wealthier countries don't have significantly higher life expectancies. However, if the very wealthy countries, with per-capita GDP around $5,000, have considerably higher life expectancy, then a positive relationship will be found between the two even though, if the very wealthy countries were removed, no such relationship would be found.

We can calculate the leverage that a particular observation (country) exerts on a linear relationship; it is like a measure of how sensitive that relationship is to a particular observation. With the faraway package, we can draw a half-normal plot which sorts the observations by their leverage. The labs argument will ensure that they are labeled by their row index, and nlab indicates how many points to label (to avoid visual clutter), starting with the most highly influential observation.

X <- model.matrix(m1.plm)
P <- X %*% solve(t(X) %*% X) %*% t(X)

library(faraway)
# Create `labs` (labels) for 1 through 1704 observations
halfnorm(diag(P), labs = 1:1704, ylab = 'Leverages', nlab = 1)
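Note that we only ever use the diagonal of $P$, yet the code above builds the full $N\times N$ matrix, which gets expensive for large panels. As a hedged aside, here is a sketch in Python with NumPy (the leverages() function and the synthetic design matrix are my own, for illustration) showing that the leverages can be obtained from a thin QR decomposition of $X$ without ever forming $P$. It assumes $X$ has full column rank.

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix X (X'X)^{-1} X', without forming the N x N matrix."""
    Q, _ = np.linalg.qr(X)          # Thin QR: Q is N x K with orthonormal columns
    return np.sum(Q ** 2, axis=1)   # h_ii is the squared norm of the i-th row of Q

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))       # Synthetic design matrix, for illustration only
h = leverages(X)

# Compare against the explicit (and much larger) projection matrix
P = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(h, np.diag(P)))       # True
print(np.isclose(h.sum(), X.shape[1]))  # True: leverages sum to the number of columns
```

This works because $P = QQ^T$ when $X = QR$ with orthonormal columns in $Q$, so each leverage is just the sum of squares of one row of $Q$.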

It does seem like there are a few observations that may be driving the relationship. If we index the Gapminder data, we see that India's survey in 2007 is the most influential.

gapminder[708,]

# A tibble: 1 x 6
  country continent  year lifeExp        pop gdpPercap
  <fct>   <fct>     <int>   <dbl>      <int>     <dbl>
1 India   Asia       2007    64.7 1110396331     2452.

It's helpful to look at the data in all years to understand why. It seems that India's per-capita GDP rose quite fast from 2002 to 2007, along with its life expectancy. If we think India's change in this period is an outlier, we may want to remove India from our panel dataset and run the model again.

gapminder[gapminder$country == 'India',]

# A tibble: 12 x 6
   country continent  year lifeExp        pop gdpPercap
   <fct>   <fct>     <int>   <dbl>      <int>     <dbl>
 1 India   Asia       1952    37.4  372000000      547.
 2 India   Asia       1957    40.2  409000000      590.
 3 India   Asia       1962    43.6  454000000      658.
 4 India   Asia       1967    47.2  506000000      701.
 5 India   Asia       1972    50.7  567000000      724.
 6 India   Asia       1977    54.2  634000000      813.
 7 India   Asia       1982    56.6  708000000      856.
 8 India   Asia       1987    58.6  788000000      977.
 9 India   Asia       1992    60.2  872000000     1164.
10 India   Asia       1997    61.8  959000000     1459.
11 India   Asia       2002    62.9 1034172547     1747.
12 India   Asia       2007    64.7 1110396331     2452.

Partial Effects Plots

What is the marginal effect of $X$ on $Y$? This can be read directly from the summary() table, but sometimes it is nicer to visualize as a plot. Such plots are often referred to as partial effects plots [9].

First, let's switch to a different model of life expectancy. We recognized earlier that per-capita GDP really has a log-linear relationship with life expectancy.

m2.plm <- plm(lifeExp ~ I(log10(gdpPercap)) + pop, data = gapminder,
  model = 'within', index = c('country'))
summary(m2.plm)

...
Coefficients:
                      Estimate Std. Error t-value  Pr(>|t|)    
I(log10(gdpPercap)) 2.1008e+01 6.9613e-01 30.1779 < 2.2e-16 ***
pop                 3.2964e-08 4.1982e-09  7.8521 7.553e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    73973
Residual Sum of Squares: 41976
R-Squared:      0.43255
Adj. R-Squared: 0.38053
F-statistic: 594.56 on 2 and 1560 DF, p-value: < 2.22e-16

This model fits much better, which is obvious when we look at plots of the raw data but also when we examine how the goodness-of-fit score (R-squared) has changed.

First, let's decide what range of predictor values we want to test. We're interested in the effect of per-capita GDP on life expectancy. Let's look at rising per-capita GDP; this will be our parameter sweep. Recall that in our subject-centered model, these values are deviations from the mean, i.e., positive values represent per-capita GDP above the mean for the average subject.

# Create a test matrix
gdpPercap.sweep <- log10(seq(100, 10e3, 1e2)) # A parameter sweep of per-capita GDP change

Next, we want to get an empty design or model matrix. By "empty" I mean that the values in all columns are zero. This is because, in our subject-centered model, the mean value of any variable is going to be zero (because we subtracted the mean within each subject).

# Get the columns in the same order
new.X <- model.frame(m2.plm)
new.X <- new.X[,names(coef(m2.plm))]
stopifnot(colnames(new.X) == names(coef(m2.plm)))

# Get a short representation of design matrix
test.X <- colMeans(new.X)
test.X[1:length(test.X)] <- 0 # Mean value for subject-centered data is always zero

# Fill out an empty test matrix
test.X <- matrix(rep(test.X, each = length(gdpPercap.sweep)), nrow = length(gdpPercap.sweep))

Then, we need to insert our parameter sweep into this empty design matrix. All other variables will have zero everywhere, but the variable we're interested in (per-capita GDP) needs to meaningfully change so we can visualize its effect on the outcome (life expectancy).

# Insert the parameter sweep
test.X[,which(colnames(new.X) == 'I(log10(gdpPercap))')] <- gdpPercap.sweep

Finally, we're ready to calculate the partial effect and the confidence band. The partial effect is the predicted value of $Y$ (life expectancy) for a given value of $X$ (per-capita GDP). We want to visualize the uncertainty around this prediction, so we'll also calculate the standard error of the prediction and use this to derive a 95% confidence interval around the prediction.

level <- 0.95 # For a 95% confidence interval

# Get the covariance matrix of the predictions
vcov.prediction <- test.X %*% vcov(m2.plm) %*% t(test.X)

# The standard error of the prediction is then on the diagonal
se.prediction <- sqrt(diag(vcov.prediction))

# Get the predicted value by multiplying the design matrix
#   against our coefficient estimates
predicted <- (test.X %*% coef(m2.plm))

# Calculate the critical value of the t-distribution corresponding to the
#   chosen confidence level and the appropriate num. of degrees of freedom
t.stat <- qt(1 - (1 - level)/2, m2.plm$df.residual)

# Calculate the lower and upper bounds of the confidence interval
lower.bound <- as.numeric(predicted - t.stat * se.prediction)
upper.bound <- as.numeric(predicted + t.stat * se.prediction)
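In symbols, for each row $x_0$ of the test matrix, the interval computed above is:

$$ x_0^T\hat{\beta} \,\pm\, t_{1-\alpha/2,\,\mathrm{df}} \sqrt{x_0^T \hat{V} x_0} $$

Here, $\hat{V}$ is the estimated covariance matrix of the coefficients (vcov(m2.plm)), $\alpha = 0.05$ for a 95% interval, and $\mathrm{df}$ is the residual degrees of freedom (1560 for this model). Note that this is a confidence interval for the expected value of the prediction; it does not include the residual variance, so it is narrower than a prediction interval for a new observation would be.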

We can quickly verify we're getting reasonable results.

head(cbind(lower.bound, predicted, upper.bound))

     lower.bound          upper.bound
[1,]    39.28448 42.01537    44.74627
[2,]    45.19739 48.33932    51.48125
[3,]    48.65621 52.03859    55.42096
[4,]    51.11029 54.66326    58.21623
[5,]    53.01382 56.69912    60.38441
[6,]    54.56912 58.36253    62.15595

Recall that the first row corresponds to the first value in our sweep, $\log_{10}(100) = 2$; that is, a 2-unit increase in the base-10 logarithm of per-capita GDP within a given country. Our model predicts that life expectancy would increase by about 42 years for such an increase in per-capita GDP! This can also be derived from the slope of the gdpPercap coefficient:

coef(m2.plm)[1]

I(log10(gdpPercap))
           21.00769

A 2-unit increase in the base-10 logarithm of per-capita GDP corresponds to a 100-fold increase in per-capita GDP, so an increase in life expectancy of $2\times 21 = 42$ years is expected. From the 95% confidence interval we just calculated, we can see that a better estimate of the effect could be given as a range between 39.3 and 44.7 years.

This seems really large; likely our model is too simple. Our model controls for all time-invariant confounding factors, but there are likely other factors that change with time (i.e., not time-invariant) that we are missing. Thus, even though our current model is robust against heterogeneity, the results are likely still biased. Our results could also be affected by mixing very poor and very wealthy countries together; small increases in per-capita GDP may indeed have outsized effects in very poor countries, but certainly not in wealthy nations.
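To put the coefficient in more familiar terms, recall that, because the predictor is on a base-10 logarithmic scale, the predicted change in life expectancy depends on the ratio of the new per-capita GDP, $x_1$, to the old, $x_0$, holding population fixed:

$$ \Delta\,\widehat{\mathrm{lifeExp}} = \hat{\beta}\left[\log_{10}(x_1) - \log_{10}(x_0)\right] = \hat{\beta}\,\log_{10}\left(\frac{x_1}{x_0}\right) $$

With $\hat{\beta} \approx 21.008$, a doubling of per-capita GDP within a country corresponds to $21.008 \times \log_{10}(2) \approx 6.3$ years, and a 10% increase corresponds to $21.008 \times \log_{10}(1.1) \approx 0.87$ years.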

Without recourse to greater model realism, we can at least make the partial effects plot using ggplot2. We note that larger and larger increases in per-capita GDP do have a diminishing effect on life expectancy, which is reasonable.

library(ggplot2)
library(scales)
df <- data.frame(gdpPercap = gdpPercap.sweep,
  prediction = predicted,
  lower = lower.bound,
  upper = upper.bound)
ggplot(df, mapping = aes(x = 10^gdpPercap, y = prediction)) +
  geom_line() +
  geom_ribbon(mapping = aes(ymin = lower, ymax = upper), alpha = 0.4) +
  scale_x_continuous(labels = dollar) +
  labs(x = 'Per-Capita GDP', y = 'Predicted Increase in Life Expectancy') +
  theme_minimal()

References

Clark, T. S., & Linzer, D. A. (2014). Should I Use Fixed or Random Effects? Political Science Research and Methods, 3(02), 399–408.
Gelman, A., and J. Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. New York, New York, USA: Cambridge University Press.
Bolker, B. M., M. E. Brooks, C. J. Clark, S. W. Geange, J. R. Poulsen, M. H. H. Stevens, and J. S. S. White. 2009. Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology and Evolution 24(3):127–135.
Allison, P. D. 2009. Fixed Effects Regression Models ed. T. F. Liao. Thousand Oaks, California, U.S.A.: SAGE.
Halaby, C. N. 2004. Panel Models in Sociological Research: Theory into Practice. Annual Review of Sociology 30(1):507–544.
Kropko, J., & Kubinec, R. (2018). Why the Two-Way Fixed Effects Model Is Difficult to Interpret, and What to Do About It. SSRN, 1–27.
Gaure, S. (2013). lfe: Linear Group Fixed Effects. The R Journal, 5(2), 104–117. Retrieved from http://journal.r-project.org/archive/2013-2/gaure.pdf
Croissant, Y., & Millo, G. (2008). Panel data econometrics in R: The plm package. Journal of Statistical Software, 27(2), 1–43.
Faraway, J. J. (n.d.). Linear Models with R (2nd ed.). Boca Raton, U.S.A.; London, England; New York, U.S.A.: Chapman & Hall/CRC Texts in Statistical Science.

Parallel processing of raster arrays in Python with NumPy

2018-07-30T15:00:00+02:00

I've been using Google Earth Engine recently to scale up my remote sensing analyses, particularly by leveraging the full Landsat archive available on Google's servers. I've followed Earth Engine's development for years, and published results from the platform, but, before now, never had a compelling reason to use it. Now, without hubris, I can say that some of the methods I'm using (radiometric rectification of thousands of images in multi-decadal time series) are already straining the limits of the freely available computing resources on Google's platform. After an intensive pipeline that merely normalizes the time series data I want to work with, I don't seem to have the resources to perform, say, a pixel-level time series regression on my image stack. Whatever the underlying issue (it is never quite clear with Earth Engine), regressions at the scale of 30 meters (a Landsat pixel) for the study area I'm working on, following the necessary pre-processing, haven't been working.

I started wondering if I could calculate the regressions myself in (client-side) Python. Image exports from Earth Engine used to be infeasible but they have vastly improved and, recently, I've been able to schedule export "tasks," monitor them using a command-line interface, and download the results directly from Google Drive. With the pre-processed rasters downloaded to my computer, I turned to NumPy to develop a vectorized regression over each pixel in a time series image stack. Here, I describe the general procedure I used and how it can be scaled up using Python's concurrency support, pointing out some potential pitfalls associated with using multiple processes in Python.

A Note on Concurrency

I recently attended part of a workshop run by XSEDE, a collaborative organization funded by the National Science Foundation to further high performance computing (HPC) projects. The introductory material was very interesting to me, not because I was unfamiliar with HPC, but because of the many compelling reasons for learning and using concurrency in scientific computing applications.

First, and perhaps best known among these reasons, is that Moore's Law, which predicts a doubling in the number of transistors on a commercially available chip every two years, started to level off at around 2004. The rate is slowing, and in the graph at the top of Wikipedia's page on the Law, you can already see the right-hand turn this trajectory is making. This is largely due to real physical limitations that chip developers are starting to encounter. The gains in the number of transistors per chip have come from making transistors progressively smaller; the smaller the transistor, the more heat (density) that must be dissipated.

According to one of XSEDE's instructors during the recent "Summer Bootcamp" that I attended, computer chips today are tasked with dissipating heat, in terms of watts-per-square meter, on the order of a nuclear reactor! Keeping things from melting has become the main concern of chip design. But, as we reduce the clock rate and the voltage required to run the chip, we can reduce the amount of heat generated. There is a reduction in performance, but if we add a second chip and run both at a lower voltage, we can get more performance for the same total amount of voltage used by the single chip. Lower voltage means less power consumed and less heat generated [1].

In short, because of physical limitations to chip design and the demand for both power and heat dissipation, commercially available computers today almost exclusively ship with two or more central processing units (CPUs or "cores") running at a clock speed (measured in GHz) no higher than that of many chips sold just a few years ago. When I was building computers as a kid in the early 2000s, I could buy Intel chips with clock speeds in excess of 3 GHz. A quick search on any vendor's website (say, Dell Corporation's) today, however, reveals that clock speeds haven't really budged: their new desktops ship with cores clocked at 3-4 GHz. The new computers are still "faster" for many applications, however, for the reasons I just discussed: multiple cores running simultaneously.

Multiple Threads, Multiple Processes

However, if an application isn't designed to take advantage of multiple cores, you won't see that performance gain: you'll have a single-threaded or single-core (more on this) or, more generally, serial set of instructions (computer code) running on a single, slower core. So, how do we take advantage of multiple cores? First, Clay Breshears [1] helpfully disambiguates some terminology for us: the author argues that "parallel" programming should be considered a special case of "concurrent" programming. Whereas concurrent programming specifies any practice where multiple tasks can be "in progress" at the same time, parallel programming describes a practice where the tasks are proceeding simultaneously. More concretely, a single-core system can be said to allow concurrency if the concurrent code can queue tasks (e.g., as threads) but it cannot be said to allow parallel computation.

Threads? Suffice to say, whereas a program on your computer typically starts a single process, that process can spawn multiple threads, each representing a potentially concurrent task. To get at "parallel programming" and increase the performance of your application, you can leverage either threads or processes. Threads share resources, particularly memory, and thus are able to communicate with one another directly through that shared memory. Processes do not share memory; they must exchange data explicitly, for example, by passing messages. This sounds like a disadvantage, but it's actually easier to get started using multiple processes than multiple threads; this article will focus on using multiple processes. Another reason I favor multiple processes here is due to the nature of concurrency in Python...

On Concurrency in Python

Because threads are somewhat complicated and, perhaps, also because of historical developments I'm not aware of, CPython (the standard Python implementation) has in place a feature termed the Global Interpreter Lock (GIL). If your multi-threaded Python program were like a discussion group with multiple participants (threads), the GIL is like a ball, stone, or other totem that participants must be holding before they can speak. If you don't have the ball, you're not allowed to speak; the ball can be passed from one person to another to allow that person to speak. Similarly, a thread without the GIL cannot be executed [2].

In short, spawning multiple threads in Python does not improve performance for CPU-intensive tasks because only one thread can run in the interpreter at a time. Multiple processes, however, can be spun up from a single Python program and, by running simultaneously, get the total amount of work done in a shorter span of time. You can see examples both of using multiple threads and of using multiple cores in Python in this article by Brendan Fortuner [3]. In this example of linear regression at the pixel level of a raster image, our unit of parallel work is a linear regression on a single pixel: with more cores/processes, we can compute more linear regressions simultaneously, thereby more quickly exhausting the finite number of regressions (and pixels) we have to do.
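To make the idea of multiple processes concrete, here is a minimal sketch using the standard library's concurrent.futures module. It is not the pipeline developed in this article; the function names, the chunking scheme, and the toy workload (a sum of squares standing in for real CPU-bound work) are all assumptions of mine, chosen only to show the pattern of splitting data into chunks and handing each chunk to a separate process.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def sum_of_squares(chunk):
    """A stand-in for CPU-bound work on one chunk of data."""
    return float(np.sum(chunk ** 2))

def parallel_sum_of_squares(values, n_workers=2):
    """Split `values` into one chunk per worker; each chunk runs in its own process."""
    chunks = np.array_split(values, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        # map() sends each chunk to a worker process and collects the results in order
        return sum(pool.map(sum_of_squares, chunks))

def main():
    data = np.arange(100_000, dtype=np.float64)
    # The parallel and serial answers should agree
    print(np.isclose(parallel_sum_of_squares(data), sum_of_squares(data)))

if __name__ == '__main__':
    main()
```

The `if __name__ == '__main__'` guard matters: on platforms where new processes are started by re-importing the main module, leaving it out causes each worker to try to start workers of its own. Note also that each chunk is copied (pickled) to its worker, because processes do not share memory; for small workloads, that overhead can outweigh the gains.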

Example

To get to a pixel-wise regression, in either a serial or multi-process pipeline, we need to:

Read in the dependent variable data (here, maximum NDVI) from each file and combine them into a single array;
Get an ordered array of the dates (years) that are the independent variables in our regression;
Ravel both arrays into a shape that allows a vectorized function to be executed over each pixel's time series;
For each pixel's time series, calculate the slope of a regression line.
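As a preview of the last step, the slope of a simple least-squares regression line can be computed for every pixel at once with array operations. The sketch below is my own illustration on synthetic data, not necessarily the implementation used later; the function name pixelwise_slope() and the array shapes are assumptions. It uses the textbook formula for the slope: the sum of cross-products of deviations divided by the sum of squared deviations in $x$.

```python
import numpy as np

def pixelwise_slope(y, x):
    """
    Least-squares slope of y on x for every row (pixel).
    y: (P, N) array of outcomes, one row per pixel; x: (N,) array of years.
    """
    x_dev = x - x.mean()
    y_dev = y - y.mean(axis=1, keepdims=True)
    return (y_dev @ x_dev) / np.sum(x_dev ** 2)

years = np.arange(1995, 2016, dtype=np.float64)
rng = np.random.default_rng(1)
ndvi = rng.random((6, years.size))   # 6 synthetic "pixels", 21 years each

slopes = pixelwise_slope(ndvi, years)

# Check against NumPy's own line fit, one pixel at a time
expected = np.array([np.polyfit(years, row, 1)[0] for row in ndvi])
print(np.allclose(slopes, expected))
```

Because each row is handled independently, the rows of such an array can later be split into chunks and distributed across processes without changing the result.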

Getting Started

I have a time series of maximum NDVI from 1995 through 2015 (21 images) covering the City of Detroit. So, to start with, I need to read in each raster file (for each year) and concatenate them together into a single, large array with $N$ layers, where $N=21$ for 21 years, in this case. Strictly speaking, my array ends up being of a size $N\times Y\times X$ where $Y,X$ are the number of rows and columns in any given image, respectively (note: each image must have the same number of rows and columns). Below, I demonstrate this directly; I use the glob library to get a list of files that match a certain pattern: *.tiff.

import glob
import os
import sys
import numpy as np
from osgeo import gdal

# Create a sequence of the years, 1995-2015 (a range object in Python 3)
years_list = range(1995, 2016)
years_array = np.array(years_list)

# Make it a 2-dimensional array to start
years_array = years_array.reshape((1, years_array.shape[0]))

# Get a list of the relevant files; sort in place
ordered_files = glob.glob('*.tiff')
ordered_files.sort()

Note that calling the file list's sort() method is essential: if the files are not in the right order, our regression line won't be fit properly (we will be assigning dependent variables to the wrong independent variable/ wrong year).
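
If the filenames do not happen to sort chronologically as plain strings (for instance, because of mixed capitalization or a prefix that varies), sorting on the year itself is safer. Below is a minimal sketch, assuming hypothetical filenames that each contain exactly one four-digit year; the year_of() helper is my own name, not part of any library.

```python
import re

def year_of(filename):
    """Return the first four-digit year (19xx or 20xx) found in a filename."""
    match = re.search(r'(19|20)\d{2}', filename)
    if match is None:
        raise ValueError('No year found in %r' % filename)
    return int(match.group(0))

# Hypothetical filenames; a plain string sort puts the capitalized name first
files = ['detroit_2001_ndvi.tiff', 'Detroit_2010_ndvi.tiff', 'detroit_1995_ndvi.tiff']

# Sort on the year embedded in each name rather than on the raw string
ordered_files = sorted(files, key=year_of)
years = [year_of(f) for f in ordered_files]
```

The same key function can be passed to the list's own sort() method if you prefer to sort in place.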

Combining Array Data from Multiple Files

Our target array for regression over $N$ years (files) is a $P\times N$ array, where $P = Y\times X$. That is, $P$ is the product of the number of rows and the number of columns. This array can be thought of as a collection of 1-D subarrays with $N$ items: the measured outcome (here, maximum NDVI) in each year.

# Iterate through each file, combining them in order as a single array
for i, each_file in enumerate(ordered_files):
    year = years_list[i]

    # Open the file, read in as an array
    ds = gdal.Open(each_file)
    arr = ds.ReadAsArray()
    ds = None
    shp = arr.shape

    # Ravel the array to a 1-D shape
    arr_flat = arr.reshape((shp[0]*shp[1], 1))

    # For the very first array, use it as the base
    if i == 0:
        base_array = arr_flat
        continue # Skip to the next year

    # Stack the arrays from each year
    base_array = np.concatenate((base_array, arr_flat), axis = 1)

Now that we have a suitably shaped array of our dependent variable data, maximum NDVI, we want to generate an array of identical shape of our independent variable, the year.

# Create an array for the X data, or independent variable, i.e., the year
shp = base_array.shape
years_array = np.repeat(years_array, shp[0], axis = 0)\
    .reshape((shp[0], shp[1], 1))
base_array = base_array.reshape((shp[0], shp[1], 1))

Finally, we can combine both our dependent and independent variables into a single $(P\times N\times 2)$ array.

# Now, combine X and Y data
base_array = np.concatenate((years_array, base_array), axis = 2)
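
To make the shape bookkeeping easier to follow, here is a sketch of the same $(P\times N\times 2)$ construction on a small synthetic stack. It uses np.broadcast_to() and np.stack() in place of the loop; the sizes and variable names are assumptions chosen only for illustration.

```python
import numpy as np

# Synthetic stand-in for the raster stack: N years of Y-by-X images
N, Y, X = 21, 3, 4
rng = np.random.default_rng(0)
stack = rng.random((N, Y, X))           # (N x Y x X), one layer per year
years = np.arange(1995, 1995 + N)

# Ravel each image and put the pixels on the first axis: (P x N)
P = Y * X
y_data = stack.reshape((N, P)).T

# Repeat the years for every pixel without a loop: (P x N)
x_data = np.broadcast_to(years, (P, N))

# Pair them up along a new, last axis: (P x N x 2)
combined = np.stack((x_data, y_data), axis=2)
```

Pixel number p in the raveled array corresponds to row p // X and column p % X of the original image, which is what lets us reshape the results back into a raster later.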

Sidebar: Linear Regression in SciPy

Let's take a moment to examine how linear regression works in SciPy (the collection of scientific computing tools that extend from NumPy). The linear regression function is found as scipy.stats.linregress(); there are at least two ways to specify a linear regression and I opted for the approach that requires a single array argument to the function, e.g.:

from scipy import stats
stats.linregress(x) # e.g., x is a Nx2 array for N regression cases

In this approach, as the SciPy documentation states:

If only x is given (and y=None), then it must be a two-dimensional array where one dimension has length 2. The two sets of measurements are then found by splitting the array along the length-2 dimension.

This is why we created a combined $(P\times N\times 2)$ array; each pixel is then a $(N\times 2)$ subarray that is already set up for the stats.linregress() function. The last part, since we want to calculate regressions on every one of $P$ total pixels, is to map the stats.linregress() function over all pixels. We'll define a function to do just this, as below. The first (zeroth) element in the sequence that is returned is the slope, which is what we want.

def linear_trend(array):
    N = array.shape[0]
    result = [stats.linregress(array[i,...])[0] for i in range(0, N)]
    return result
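
Since only the slope is kept, it can also be computed for every pixel at once with the textbook least-squares formula, slope = sum(dx * dy) / sum(dx^2), where dx and dy are deviations from the mean. This is a sketch of that alternative, not the approach used in the rest of this post; it assumes the same $(P\times N\times 2)$ layout, and slopes_closed_form() is a name I made up.

```python
import numpy as np

def slopes_closed_form(array):
    """OLS slope for each pixel of a (P x N x 2) array of (x, y) pairs."""
    x = array[..., 0]
    y = array[..., 1]
    # Deviations from each pixel's own mean
    dx = x - x.mean(axis=1, keepdims=True)
    dy = y - y.mean(axis=1, keepdims=True)
    # slope = sum(dx * dy) / sum(dx^2), computed row-wise
    return (dx * dy).sum(axis=1) / (dx ** 2).sum(axis=1)

# Synthetic check: 5 pixels, 21 years, known slopes plus a constant offset
years = np.arange(1995, 2016, dtype=float)
true_slopes = np.array([-0.02, -0.01, 0.0, 0.01, 0.02])
y = true_slopes[:, None] * (years - 1995) + 0.5
x = np.broadcast_to(years, y.shape)
demo = np.stack((x, y), axis=2)
estimated = slopes_closed_form(demo)
```

The per-pixel function is still the more general pattern, because it works for calculations that have no closed form.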

Calculating Regressions on Subarrays across Multiple Processes

Potential pitfall: It may seem straightforward now to farm out a range of pixels, e.g., $[i,j] \in [0\cdots P]$ for $P$ total pixels. However, with multiple processes, each process gets a complete copy of the resources required to get the job done. For instance, if you spin up 2 processes asking one to take pixel indices $0\cdots P/2$ and the other to take pixel indices $P/2\cdots P$, then each process needs a complete copy of the master ($P\times N\times 2$) array. See the issue here? For large rasters, that's a huge duplication of working memory. A better practice is to literally divide the master array into chunks before farming out those pixel (ranges) to each process.

Dividing rectangular arrays in Python based on the number of processes you want to spin up may seem tricky at first. Below, I use some idiomatic Python to calculate the range of pixel indices each process would get based on P processes (note: here, P is the number of processes, whereas earlier I referred to $P$ as the number of pixels).

N = base_array.shape[0]
P = (NUM_PROCESSES + 1) # Number of breaks (number of partitions + 1)

# Break up the indices into (roughly) equal parts
partitions = list(zip(np.linspace(0, N, P, dtype=int)[:-1],
    np.linspace(0, N, P, dtype=int)[1:]))

# Final range of indices should end +1 past last index for completeness
work = partitions[:-1]
work.append((partitions[-1][0], partitions[-1][1] + 1))

It might be useful to see what work contains. In my example, I had 730331 total pixels and I wanted to farm them out, evenly, to 4 processes. Note that the last range ends on 730332, one past the total number of pixels. Because Python slices (like the range() function) do not include the ending number, an end of 730331 would already reach the last pixel (index 730330); the extra +1 is harmless, since NumPy clips a slice that runs past the end of an array, but it is not strictly required.

>>> work
[(0, 182582), (182582, 365165), (365165, 547748), (547748, 730332)]
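
The same break points can be computed with plain integer arithmetic, which makes it easy to check that the half-open ranges cover every pixel exactly once. This is a minimal sketch; partition_indices() is a hypothetical helper, not something from NumPy.

```python
def partition_indices(n_items, n_parts):
    """Split range(n_items) into n_parts contiguous (start, stop) pairs."""
    # n_parts + 1 break points, from 0 up to and including n_items
    breaks = [(k * n_items) // n_parts for k in range(n_parts + 1)]
    return list(zip(breaks[:-1], breaks[1:]))

work = partition_indices(730331, 4)

# Each stop is the next start, so the sizes add up to the total pixel count
covered = sum(stop - start for start, stop in work)
```

If you would rather not manage indices at all, np.array_split(base_array, NUM_PROCESSES) returns a list of roughly equal chunks along the first axis.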

Concurrency in Python 3

Finally, to farm out these subarrays to multiple processes, we need to use the ProcessPoolExecutor that ships with Python 3, available in the concurrent.futures module.

Potential pitfall: You might be tempted to use a lambda function in place of the linear_trend() function we defined above, for any similar pixel-wise calculation you want to perform. Because Python multi-process concurrency requires that every object farmed out to multiple processes be "picklable," you can't use lambda functions. Instead, you must define a global function, as we did, above, with linear_trend(). What does "picklable" mean? It means that the object can be pickled using Python's pickle library; pickled objects are binary representations of Python state, i.e., of Python data, functions, classes, etc. Why does each process' state need to be pickled? I'll let the Python concurrent.futures documentation answer that directly:

The ProcessPoolExecutor class is an Executor subclass that uses a pool of processes to execute calls asynchronously. ProcessPoolExecutor uses the multiprocessing module, which allows it to side-step the Global Interpreter Lock but also means that only picklable objects can be executed and returned.

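If you need a function that is customized at run time, which is what a lambda function is often used for, you can still use such a pattern with concurrency in Python; you just need to wrap a globally defined function with functools.partial(), which binds some of its arguments in advance. The resulting partial object can be pickled as long as the function it wraps and the bound arguments can be; wrapping a lambda in partial() does not make the lambda picklable.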
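
The difference is easy to check with the pickle module directly, without starting any processes. In this small sketch, operator.mul stands in for whatever globally defined function you would actually use.

```python
import pickle
from functools import partial
from operator import mul

# A lambda has no importable name, so pickling it fails
try:
    pickle.dumps(lambda x: 2 * x)
    lambda_ok = True
except (pickle.PicklingError, AttributeError):
    lambda_ok = False

# partial() binds arguments to an importable function; the result pickles
double = partial(mul, 2)
restored = pickle.loads(pickle.dumps(double))
```

Here restored behaves like the lambda would have, but it survives the round trip through pickle that ProcessPoolExecutor depends on.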

The ProcessPoolExecutor creates a context in which we can map a (globally defined, picklable) function over a subset of data. Because it creates a context, we invoke it using the with statement.

from concurrent.futures import ProcessPoolExecutor

# Split the master array, base_array, into subarrays defined by the
#   starting and ending, i and j, indices
with ProcessPoolExecutor(max_workers = NUM_PROCESSES) as executor:
    result = executor.map(linear_trend, [
        base_array[i:j,...] for i, j in work
    ])

After the processes terminate, their results are stored as a sequence which we can coerce to a list using the list() function. In our case, in particular, because we split up the $P$ pixels into 4 sets (for 4 processes), we want to concatenate() them back together as a single array.

regression = list(result)
result = np.concatenate(regression, axis = 0)

And, ultimately, if we want to write the pixel-wise regression out as a raster file, we need to reshape it to a 2-dimensional, $Y\times X$ raster, for $Y$ rows and $X$ columns.

# num_rows, num_cols are the dimensions of any one input raster (arr.shape)
output_array = np.array(result).reshape((num_rows, num_cols))

Performance Metrics

No discussion of concurrency would be complete without an analysis of the performance gain. If you're not already aware, Python's built-in timeit module is de rigueur for timing Python code; below, I use it to time our pixel-wise regression in both its serial and parallel (multiple-process) forms.

With 4 processes:

$ python -m timeit -s "from my_regression_example import main" -n 3 -r 3 "main('~/Desktop/*.tiff')"
3 loops, best of 3: 46.1 sec per loop

With 1 process (serial):

$ python -m timeit -s "from my_regression_example import main" -n 3 -r 3 "main('~/Desktop/*.tiff')"
3 loops, best of 3: 152 sec per loop
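
These two timings can be turned into the usual speed-up and parallel efficiency figures. A quick arithmetic check, using only the two numbers reported above:

```python
serial_time = 152.0     # seconds per loop, 1 process
parallel_time = 46.1    # seconds per loop, 4 processes
n_processes = 4

speedup = serial_time / parallel_time       # how many times faster
efficiency = speedup / n_processes          # fraction of the ideal speed-up
```

That works out to a speed-up of roughly 3.3 times, or a little over 80% of what a perfect four-way split would give.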

As you can see, with 4 processes we're finishing the work in about one-third of the time it takes with only one process. You might have expected us to finish in a quarter of the time, but because of the overhead associated with spinning up 4 processes and collecting their results, we never quite get a $1/P$ reduction in time for $P$ processes. This speed-up is still quite an achievement, however. No matter how many processes we use, the regression results are, of course, the same; below is the image we created, with colors mapped to regression slope quintiles.

In Summary

By raveling a 2-D raster array into a collection of pixel-level subarrays, we can easily pass any vectorized function over them, allowing us to do far more than regression. With a vectorized function, we can farm out the work to multiple processes to finish our total work faster. In addition, because we split our raster time series up into multiple chunks, the memory required is no more than we need to store the entire raster time series. Keep in mind these two pitfalls that are commonly encountered with multi-processing in Python:

DON'T use a lambda function as the function to map over the array; instead, use regular, globally defined Python functions, with or without functools.partial, as needed.
DO chunk up the array into subarrays, passing each process only its respective subarray.

In case my walkthrough above was overwhelming, a general pattern for parallel processing of raster array chunks is presented below.

import glob
import numpy as np
from osgeo import gdal
from concurrent.futures import ProcessPoolExecutor

# Example file list; filenames should have some numeric date/year
ordered_files = glob.glob('*.tiff')
ordered_files.sort()

# A function that maps whatever you want to do over each pixel;
#   needs to be a global function so it can be pickled
def do_something(array):
    N = array.shape[0]
    result = [my_function(array[i,...]) for i in range(0, N)]
    return result

# Iterate through each file, combining them in order as a single array
for i, each_file in enumerate(ordered_files):
    # Open the file, read in as an array
    ds = gdal.Open(each_file)
    arr = ds.ReadAsArray()
    ds = None
    shp = arr.shape
    arr_flat = arr.reshape((shp[0]*shp[1], 1)) # Ravel array to 1-D shape
    if i == 0:
        base_array = arr_flat # The very first array is the base
        continue # Skip to the next year

    # Stack the arrays from each year
    base_array = np.concatenate((base_array, arr_flat), axis = 1)


# Break up the indices into (roughly) equal parts, e.g.,
#   partitions = [(0, 1000), (1000, 2000), ..., (9000, 10001)]
partitions = [...]

# NUM_PROCESSES is however many cores you want to use
with ProcessPoolExecutor(max_workers = NUM_PROCESSES) as executor:
    result = executor.map(do_something, [
        base_array[i:j,...] for i, j in partitions
    ])

combined_results = list(result) # List of array chunks...
final = np.concatenate(combined_results, axis = 0) # ...Now a single array
num_rows, num_cols = shp # Dimensions of the last raster read in
final = np.array(final).reshape((num_rows, num_cols)) # ...In the original shape

References

Breshears, C. 2009. The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Applications. O'Reilly Media Inc. Sebastopol, CA, U.S.A.
Beazley, D. 2010. "Understanding the Python GIL." PyCon 2010. Atlanta, Georgia.
Fortuner, B. 2017. "Intro to Threads and Processes in Python."

Unsupervised learning for time series data: Singular spectrum versus principal components analysis

2017-09-19T10:00:00+02:00

Recently, I was working with a colleague on a project involving time series observations of neighborhoods in Los Angeles. We wanted to see if there were patterns in the time series data that described how similar neighborhoods evolved in time. For multivariate data, this is a great application for unsupervised learning: we wish to discover subgroups either among the variables (obtaining a more parsimonious description of the data) or among the observations (grouping similar samples together) [1]. My colleague described what we needed as a "PCA for time series."

Though I was familiar with principal components analysis (PCA), I didn't know what to expect from applying PCA to a time series.

How should the data be structured?
How does the approach compare to digital signal processing techniques like singular spectrum analysis (SSA)?
How can PCA and SSA be implemented in the R environment?

Here, I attempt to provide a brief introduction to PCA, SSA, and their implementations in R, along with the relevant considerations of their similarities and differences. It may seem that SSA is coming out of left field; what the hell is it? This investigation was prompted by a question about PCA, after all. However, SSA and PCA have interesting similarities and differences that I think merit their joint discussion, at least at the level I'm capable of here.

Principal Components Analysis

The objective of PCA is to "provide the best $m$-dimensional approximation (in terms of Euclidean distance)" [1] to each observation in a $p$-dimensional dataset, where $p > m$. This characterization places PCA in a list of other "dimensionality reduction" techniques that seek to describe a set of data using fewer variables (or dimensions/ degrees of freedom) than were measured. A lower-dimensional description of a dataset has obvious benefits for data compression---fewer variables used to describe the data means fewer columns in a table or fewer tables in a data cube---but it can also reveal hidden structure in the data.

What we obtain from PCA is a coordinate transformation; our data are projected from their original coordinate system (spanned by the variables we measured) onto a new, orthogonal basis. In this way, correlation that may have existed between the columns in our original data is eliminated. The new variables, referred to as principal components, in this new coordinate system can be hard to interpret and may not have a physical meaning. They are intrinsically ordered by the amount of variance in the data they explain.

PCA can be implemented in one of two ways:

Through a spectral decomposition of the covariance ("unstandardized" PCA) or correlation matrix ("standardized" PCA) [1];
Through a singular value decomposition (SVD) of the data matrix, $X$.

PCA is sometimes referred to as being "standardized" or "unstandardized" [2]. In standardized PCA, the correlation matrix is used in place of the covariance matrix of unstandardized PCA. If the variables in $X$ are measured on different scales, their corresponding variances will also be scaled differently, which can put unequal weight on one or more of our original variables. This weight is often unjustified. James et al. (2013), for example, describe a dataset where one of the variables in an urban crime dataset, the rate of criminal assaults, has a much larger variance than the other variables, including the rates of rape and murder, simply because assaults occur far more often than the other crimes. Figure 10.3 in their book is an excellent visual aid here. In general, we want to both mean-center and standardize the variables in our data matrix, $X$, prior to PCA. If we're using the SVD approach, mean-centering and standardizing our data matrix, $X$, is equivalent to using the correlation matrix, rather than the covariance matrix, in spectral decomposition.

It's important for both the description of PCA and, later, of SSA, to provide some background on matrix decomposition.

Spectral or Eigenvalue Decomposition

Spectral decomposition, also referred to as eigenvalue decomposition, factors any diagonalizable (square) matrix into a canonical form. The oft-described "canonical form" makes intuitive sense, here, in a larger discussion of PCA because a canonical form is precisely what we wish to find in our original, messy dataset.

Given a multivariate dataset $X$ as an $n\times p$ matrix, where the columns $X_i,\, i\in \{1,\cdots ,p\}$ represent distinct variables and the rows represent different observations or samples of each of the variables, the spectral decomposition of the covariance matrix can be written as:

$$ Q^{-1}AQ = \Lambda $$

Where $A$ is a real, symmetric matrix (here, the covariance matrix of $X$) and the columns of $Q$ are the orthonormal eigenvectors of $A$ [3]. $\Lambda$ is a diagonal matrix whose diagonal entries are the eigenvalues of $A$. Elsner and Tsonis (1996) provide a concise introduction to the intuition behind eigenvectors and eigenvalues.
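
For readers who want to see the identity hold numerically, here is a small NumPy sketch on synthetic data (the data and variable names are assumptions, chosen only for illustration). Because $Q$ is orthogonal, $Q^{-1}$ is simply $Q^T$.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))        # synthetic n x p data matrix
A = np.cov(X, rowvar=False)          # p x p covariance matrix (symmetric)

# eigh() is meant for symmetric matrices; eigenvalues come back ascending
eigenvalues, Q = np.linalg.eigh(A)

# Q is orthogonal, so Q^{-1} = Q^T and Q^T A Q should be diagonal
Lambda = Q.T @ A @ Q
```

Up to floating-point error, Lambda is a diagonal matrix with the eigenvalues on its diagonal.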

Singular Value Decomposition

Singular value decomposition is a generalization of spectral decomposition. Any $m\times n$ matrix $X$ can be factored into a product of two orthogonal matrices, $U$ and $V^T$, and a (rectangular) diagonal matrix $\Sigma$:

$$ X = U\Sigma V^T $$

The columns of the $m\times m$ matrix $U$ are the eigenvectors of $XX^T$ while the columns of the $n\times n$ matrix $V$ are the eigenvectors of $X^TX$. The columns of $U$ and $V$ are also called the "left" and "right" singular vectors. The singular values on the diagonal of $\Sigma$ are the square-roots of the non-zero eigenvalues of both the $XX^T$ and $X^TX$ matrices [5].

In PCA, the "right singular vectors," the columns of the $V$ matrix, of an SVD are equivalent to the eigenvectors of the covariance matrix [4]. Also, the eigenvalues of the covariance matrix correspond to the variance explained by each respective principal component.

PCA can be implemented in R in a few different ways for a data matrix X.

# SVD of the (scaled) data matrix; the name `v` is the matrix of PCs as column vectors
svd(scale(X))$v

# Spectral decomp. of the covariance/ correlation matrix;
# `vectors` has matrix of PCs as column vectors
eigen(cor(X))$vectors

# Built-in tool for PCA; `rotation` has matrix of PCs as column vectors
prcomp(X, scale. = T)$rotation
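
The equivalence of the first two routes can also be checked outside of R. Below is a sketch in NumPy, using synthetic data with variables on deliberately different scales; note that the standardization uses the sample standard deviation (ddof=1) so that it lines up with the correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: 200 observations of 3 variables on very different scales
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
X[:, 1] += 5.0 * X[:, 0]             # introduce some correlation

# Mean-center and scale each column
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Route 1: SVD of the standardized data; rows of Vt are the PCs
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
pcs_svd = Vt.T

# Route 2: spectral decomposition of the correlation matrix
evals, evecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(evals)[::-1]      # eigh() sorts ascending; PCA wants descending
pcs_eig = evecs[:, order]
```

The two sets of vectors agree up to the sign of each column, and the squared singular values divided by $n-1$ match the eigenvalues of the correlation matrix.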

Singular Spectrum Analysis

Singular spectrum analysis (SSA) is a technique used to discover oscillation series of any length within a longer (univariate) time series. Oscillations are of interest, generally, because they are associated with various signals of interest: in ecology, it could be seasonal/ phenological change; in physics or engineering, it could be a mechanical or electrical wave.

"An oscillatory series is a periodic or quasi-periodic series which can be either pure or amplitude-modulated. Noise is any aperiodic series. The trend of the series is, roughly speaking, a slowly varying additive component of the series with all the oscillations removed." - Golyandina et al. (2001)

Unlike PCA, SSA is generally performed on a univariate dataset: a single variable observed at multiple points in time. There is a multivariate form of SSA, sometimes called M-SSA, but it is beyond my current understanding. Univariate SSA is a lot like PCA for univariate time series data.

There are a couple of different approaches to setting up SSA. One way, described in Elsner and Tsonis' (1996) excellent and very accessible book, begins by constructing the trajectory matrix. The trajectory matrix is the $n\times m$ matrix whose row vectors are every consecutive $m$-tuple, or every window of length $m$, in the time series.

"By using lagged copies of a single time series, we can define the coordinates of the phase space that will approximate the dynamics of the system from which the time record was sampled. The number of lags is called the embedding dimension." - Elsner and Tsonis (1996)

A general formula for the number of rows in the trajectory matrix is $n = n_t-m+1$, where $n_t$ is the length of the original time series vector. For instance, a time series of 6 observations with an embedding dimension of $m=3$ will have $n=4$ possible combinations of 3 consecutive values from among 6 ordered values. This means there are 4 rows in the trajectory matrix. The rows of the trajectory matrix are definitely not independent of one another: each window shares all but one of its values with the window before it.

"[The trajectory matrix] contains the complete record of patterns that have occurred within a window of size [m]." - Elsner and Tsonis (1996)

The trajectory matrix is the matrix whose $(i,j)$ element is defined:

$$ [X]_{ij} = x_{i+j-1} $$

Where $x$ is some ordered, time series vector.
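
A short sketch of this construction in NumPy, using the 6-observation, $m=3$ example from above. The trajectory_matrix() helper is my own; with zero-based indexing the $(i,j)$ element is simply x[i + j].

```python
import numpy as np

def trajectory_matrix(x, m):
    """Return the (n_t - m + 1) x m matrix of all length-m windows of x."""
    x = np.asarray(x)
    n = len(x) - m + 1               # number of windows
    return np.array([x[i:i + m] for i in range(n)])

# Six made-up observations, with embedding dimension m = 3
series = np.array([10, 20, 30, 40, 50, 60])
X = trajectory_matrix(series, 3)
```

Each row is one window, and each row is the previous one shifted forward by a single time step.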

It is customary to normalize the elements of the trajectory matrix by $\sqrt{n}$, where $n = n_t-m+1$ is the number of windows. That is, for a single time series record $v = \{v_1, v_2, \cdots, v_{n_t}\}$, the trajectory matrix for an embedding dimension $m$ could be written as:

$$ X = \frac{1}{\sqrt{n}} \left[\begin{array}{ccc} v_1 & \cdots & v_{m}\\ v_2 & \cdots & v_{m+1}\\ & \ddots &\\ v_{n_t-m+1} & \cdots & v_{n_t}\\ \end{array}\right] $$

The lagged-covariance matrix is then defined:

$$ S = X^TX $$
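
To see why $S$ is called a lagged-covariance matrix, it helps to compute one entry by hand. The sketch below assumes a zero-mean synthetic record and the $1/\sqrt{n}$ normalization; the $(j,k)$ entry of $S$ is then the average product of values that are $|j-k|$ time steps apart.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=200)
x = x - x.mean()                     # assume a zero-mean record
m = 5
n = len(x) - m + 1                   # number of windows

# Normalized trajectory matrix, then S = X^T X
X = np.array([x[i:i + m] for i in range(n)]) / np.sqrt(n)
S = X.T @ X

# Entry (j, k) is the average product of values |j - k| steps apart
j, k = 0, 2
by_hand = np.mean([x[i + j] * x[i + k] for i in range(n)])
```

The result is an $m\times m$ symmetric matrix, regardless of how long the original record is.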

There are two ways to perform SSA as a matrix decomposition:

Through the spectral decomposition of the (normalized) lagged-covariance matrix, $S = X^TX$. Here, $X$ is not the data matrix, but the trajectory matrix; the matrix formed by all possible time windows of length $m$. The normalization factor used is $1/\sqrt{n}$, where $n = n_t-m+1$ is the number of time windows of length $m$ taken from a time series vector of length $n_t$.
Through the singular value decomposition (SVD) of the trajectory matrix [4], $X$, taking the right singular vectors, or $V$ in the SVD given by $X = U\Sigma V^T$.

It's easy to see why these two methods are equivalent, if we recall that the right singular vectors of an SVD on $X$ correspond to the eigenvectors of the matrix $X^TX$. The spectral decomposition approach is described in detail by Elsner and Tsonis (1996) while the SVD approach is described by Golyandina et al. (2001).

SSA, Stationarity, and Autocorrelation

If the underlying signal is contaminated only by white noise (AR0 noise), then the dominant eigenvalues will be associated with oscillations in the time series record. If the noise is autocorrelated (red or AR1 noise), however, "then dominant eigenvalues in the singular spectrum will characterize both the noise and signal components of the record" [3]. It should also be noted that higher-order autocorrelation structures, AR2 and AR3, do produce oscillations.
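
The white-noise case is easy to illustrate with a synthetic record. In this sketch, a sinusoid with an assumed period of 12 time steps is contaminated by a small amount of white noise; the trajectory matrix of a pure sinusoid has rank 2, so the oscillation should show up as a pair of dominant eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(240)
signal = np.sin(2 * np.pi * t / 12)              # period of 12 time steps
record = signal + rng.normal(scale=0.1, size=t.size)  # white-noise contamination

m = 10
n = record.size - m + 1
X = np.array([record[i:i + m] for i in range(n)]) / np.sqrt(n)
eigenvalues = np.linalg.eigvalsh(X.T @ X)[::-1]  # descending order

# Share of the total accounted for by the two largest eigenvalues
top_two_share = eigenvalues[:2].sum() / eigenvalues.sum()
```

With noise this weak, nearly all of the total is carried by the first two eigenvalues; the remaining eigenvalues form a flat noise floor.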

SSA versus PCA

You should immediately notice one similarity between PCA and SSA. They both can be computed through either a spectral decomposition of a covariance matrix or through an SVD of the data matrix, taking careful note that the data matrix in SSA is the trajectory matrix for a single time series record.

Elsner and Tsonis (1996) claim that aside from the difference between the composition of $X$, i.e., between the trajectory matrix (containing lagged windows of a univariate time series) in SSA and the data matrix of PCA (containing multivariate time series records), "there is no difference between the expansion [of the data set] used in classical PCA and the expansion [of the data set] in SSA." More specifically, PCA can be defined as the spectral decomposition of the covariance matrix $X^TX$ whereas SSA can be defined as the spectral decomposition of the (normalized) lagged-covariance matrix, which is also designated $X^TX$; however, in PCA, $X$ is the data matrix while in SSA $X$ is the trajectory matrix.

The difference between the structure of the matrix $X$ in PCA versus SSA is precisely what contributes to their different behaviors.

In PCA, the matrix consists of a single variable observed at multiple locations (a multivariate time series dataset, where the variables are different spatial locations) and the resulting components are termed spatial principal components [3]. In SSA, the matrix consists of a single time series (a single variable observed in a single location) "observed" at different time windows; the resulting components derived through SSA could be termed temporal principal components.

In the digital signal processing community, PCA (as a spectral decomposition of the correlation, not covariance, matrix) is also known as Karhunen–Loève (K-L) expansion.

Implementation in R

I'll demonstrate PCA and SSA in R using two different time series datasets, because the two methods make different central assumptions about time series data:

In PCA, we necessarily have one variable observed over time at multiple locations or among multiple sample units;
In (univariate) SSA, we necessarily have one variable observed over time as a single record, or in a single location.

For SSA, I'll use the LakeHuron dataset, bundled with R and described here; it consists of observations of the water level in Lake Huron (one place) over time. For PCA, I couldn't find a built-in dataset that was adequate, so I'm using a series of home sale price observations in multiple neighborhoods in Los Angeles from 1989 to 2010.

PCA for Time Series Data in R

The first thing we want to do with time series data in R is create a time plot to look at the (mean) behavior over time. Here, a time plot of the price-per-square foot data indicates there is an overall regional oscillation in prices. In Los Angeles, it appears that prices peak just before the subprime mortgage crisis of 2006-2007.

To set up our data for PCA, we need to make sure the data frame is in "wide format," i.e., the years span the columns.

$$ X = \left[\begin{array}{ccc} x_{1,1989} & \cdots & x_{1,2010}\\ \vdots & \ddots & \vdots\\ x_{N,1989} & \cdots & x_{N,2010} \end{array}\right] $$

More formally, the elements of the $X$ matrix are generated from sampling a time series $T$ at location $i$ and at time $j$:

$$ \left[X\right]_{ij} = T(\,\mathrm{Location}\,\, i , \mathrm{Time}\,\, j\,) $$

Deriving the Spatial Principal Components

The PCA here is performed using a singular value decomposition of the mean-centered and scaled data matrix, $X$. We then create a screeplot, which shows the proportion of variance explained by each principal component.

pca.price.only <- svd(scale(my.data))

plot((pca.price.only$d^2 / sum(pca.price.only$d^2))[1:10], type = 'b',
  log = 'y', main = 'Screeplot (Log10): PCA on Price Data Only',
  ylab = 'Proportion of Variance Explained', xlab = 'No. of Principal Components')

Recall that the variance explained is proportional to the corresponding eigenvalue, among $P$ total eigenvalues. The proportion of total variance, $d_i$, attributable to principal component $i$, can thus be calculated from the SVD, $X = U\Sigma V^T$, as:

$$ d_i = \frac{\mathrm{diag}(\Sigma)_i^2}{\sum_{j=1}^P\mathrm{diag}(\Sigma)_j^2} $$

The screeplot is helpful for identifying how many distinct components of the variance exist. If our aim with PCA is dimensionality reduction or compression, we can use this plot to decide how many principal components are needed to approximate the original data.
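
The quantities behind the screeplot take only a few lines to compute. Here is a NumPy sketch on a synthetic data matrix (an assumption, since the price data are not bundled with this post), including the cumulative proportion, which is often what you actually use to choose the number of components.

```python
import numpy as np

rng = np.random.default_rng(11)
X = rng.normal(size=(50, 6))                     # synthetic data matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Singular values come back sorted from largest to smallest
s = np.linalg.svd(Z, compute_uv=False)

# Proportion of variance for each component, and the running total
proportion = s ** 2 / np.sum(s ** 2)
cumulative = np.cumsum(proportion)
```

A common rule of thumb is to keep the smallest number of components whose cumulative proportion passes some threshold you are comfortable with.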

Visualizing the Spatial Principal Components

We'll examine the first four (4) principal components. By plotting the right singular vectors, we can visualize the loadings of each variable on each of the principal components. Recall that because the "variables" in this time-series dataset are different years of observation, what the loadings here represent is the contribution of each year to the spatial pattern of variance. That is why Elsner and Tsonis (1996) refer to these as spatial principal components.

pca.price.components <- data.frame(
  year = seq.int(1989, 2010),
  Price.PC1 = pca.price.only$v[1:22,1],
  Price.PC2 = pca.price.only$v[1:22,2],
  Price.PC3 = pca.price.only$v[1:22,3],
  Price.PC4 = pca.price.only$v[1:22,4])

require(dplyr)
require(tidyr)
require(ggplot2)
require(stringr)
pca.price.components %>%
  gather(key = 'PCs', value = 'loading', -year) %>%
  mutate(
    variable = substr(PCs, 1, 5),
    PCs = str_replace(PCs, '(Price|Loans)\\.', '')) %>%
  ggplot(mapping = aes(x = year, y = loading)) +
  geom_line(size = 1) +
  facet_wrap(~ PCs) +
  labs(title = 'Loadings on Principal Components: Log10 Price-per-Square Foot',
    x = '') +
  theme_linedraw()

Eastman and Fulk (1993) conducted a standardized PCA on vegetation in Africa and provide an excellent interpretation of the loadings on the spatial principal components:

"If a [year] shows a strong positive correlation with a specific component, it indicates that that [year] contains a latent (i.e., to some extent hidden or unapparent) spatial pattern that has strong similarity to the one depicted in the component image. Similarly, a strong negative correlation indicates that the monthly image has a latent pattern that is the inverse of that shown (i.e., with positive and negative anomalies reversed)." [2]

When the first principal component is essentially constant over time, as it is in the case of PC1 in this example, it indicates that the dominant variation in the data occurs over space. In this example, it means that there is more variation in sale prices between L.A. neighborhoods in any year than over time for all neighborhoods.

Mapping the spatial principal component scores, or the original values projected onto the principal components, might aid interpretation. The scores can be obtained for one or more principal components, up to $m$ total principal components, as the product of the mean-centered and scaled time-series data, $Z$, and a matrix $W$ formed from a subset of the columns of $V$:

$$ T = ZW\quad\mbox{where}\quad W = \left[\begin{array}{cccc} V_1 & V_2 & \cdots & V_p \end{array}\right]\quad\mbox{for}\quad V_i\in V,\, p \le m $$

# Mean-center and scale the original data values
var.price.scaled <- as.matrix(scale(select(var.by.year.clean, starts_with('price'))))

# Calculate the rotated values
pca.price.spatial <- matrix(nrow = nrow(var.price.scaled),
  ncol = ncol(pca.price.only$v))
for (i in 1:ncol(pca.price.spatial)) {
  pca.price.spatial[,i] <- var.price.scaled %*% as.matrix(pca.price.only$v[,i])
}
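
The projection $T = ZW$ is a single matrix product. The NumPy sketch below uses synthetic data (30 hypothetical locations by 8 years) and also shows a useful shortcut: because $Z = U\Sigma V^T$, the scores are just the left singular vectors scaled by the singular values.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 8))                     # 30 locations x 8 years (synthetic)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

U, s, Vt = np.linalg.svd(Z, full_matrices=False)
V = Vt.T

k = 2                                            # keep the first two components
W = V[:, :k]
scores = Z @ W                                   # T = ZW

# Because Z = U diag(s) V^T, the same scores are U[:, :k] scaled by s[:k]
scores_from_u = U[:, :k] * s[:k]
```

Each row of scores is one location, so a column of scores can be joined back to the neighborhood geometries for mapping.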

When we map the scores, here presented as the number of standard deviations around the mean score, for the first principal component, we see the dominant, time-invariant (or aggregate) spatial variation in price. If we make a similar map for PC2, we can compare it to the loadings plot above. The areas with positive correlations in the map follow the time trend indicated in the loadings for PC2; the areas with negative correlations in the map follow the inverse of that PC2 time trend.

SSA for Time Series Data in R

For SSA, which assumes weak stationarity, we want to look at the first differences of the LakeHuron data. Differencing is easy in R with the diff() function.

data(LakeHuron)
plot(LakeHuron, main = 'Lake Huron Water Levels', ylab = 'Lake Level (ft)')
plot(diff(LakeHuron), main = 'Lake Huron Water Levels: First Difference',
  ylab = 'Change in Lake Level (ft)')

We might take a moment to confirm that first-differencing of the water levels data is adequate to produce a time series that is first-order stationary. The acf() function in R is a good tool for visual inspection of time series data. The resulting plot shows very low correlation after the zeroth lag (lag $= 0$), which is encouraging.

acf(diff(LakeHuron), type = 'covariance')
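
The same check can be sketched in NumPy. Since the LakeHuron data are bundled with R rather than Python, this example substitutes a synthetic random walk, a record that is non-stationary by construction but whose first differences are white noise. The autocovariance() helper is my own and divides by the series length at every lag.

```python
import numpy as np

def autocovariance(x, max_lag):
    """Sample autocovariance of x at lags 0..max_lag (divides by n)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    return np.array([np.sum(d[:n - k] * d[k:]) / n for k in range(max_lag + 1)])

# Synthetic non-stationary record: a random walk (cumulative sum of white noise)
rng = np.random.default_rng(2)
steps = rng.normal(size=500)
walk = np.cumsum(steps)

first_diff = np.diff(walk)                       # analogous to diff() in R
acov = autocovariance(first_diff, max_lag=10)
acor = acov / acov[0]                            # normalize to autocorrelation
```

For a differenced series that behaves like white noise, the values at every lag after zero should sit close to zero.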

Construction of the Trajectory Matrix

We'll construct two different trajectory matrices, investigating 10- and 25-year windows. Recall that the trajectory matrix is an $(n_t-m+1)\times m$ matrix for an embedding dimension of $m$ and a time series of length $n_t$.

# Construct the trajectory matrix (Elsner and Tsonis 1996, p.44)
traj10 <- matrix(nrow = length(diff(LakeHuron)) - 10 + 1, ncol = 10)
traj25 <- matrix(nrow = length(diff(LakeHuron)) - 25 + 1, ncol = 25)

To populate the matrices, we can use a for loop to calculate all the windows of length $m$ in the dataset. Recall that the $(i,j)$ element of the trajectory matrix is given by $x_{i+j-1}$, where $x$ is our first-differenced LakeHuron time series.

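dx <- diff(LakeHuron) # Compute the first differences once
for (i in 1:nrow(traj10)) {
  for (j in 1:ncol(traj10)) {
    traj10[i, j] <- dx[i + j - 1]
  }
}
# Populate the 25-year trajectory matrix the same way, one window per row
for (i in 1:nrow(traj25)) traj25[i, ] <- dx[i:(i + ncol(traj25) - 1)]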
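The nested loop makes the indexing explicit, but the same matrix can be built without a loop. Here is a minimal sketch, assuming base R's embed() function; the helper name trajectory.matrix is my own and not part of any package. Because embed() puts the most recent value first in each row, the columns are reversed so that element $(i,j)$ is $x_{i+j-1}$.

```r
# Hypothetical helper: trajectory matrix for an embedding dimension m
trajectory.matrix <- function (x, m) {
  x <- as.numeric(x)
  # embed() returns rows of (x[t], x[t-1], ..., x[t-m+1]); reverse the columns
  embed(x, m)[, m:1, drop = FALSE]
}

dx <- diff(LakeHuron)
traj10.alt <- trajectory.matrix(dx, 10)
traj25.alt <- trajectory.matrix(dx, 25)
dim(traj10.alt) # (n - m + 1) rows and m columns
```

This should give the same matrices as the loop, which is easy to confirm with all.equal().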

The Lagged-Covariance Matrix

We next construct the lagged-covariance matrix, which we will decompose in the following step. Recall that this is formed from the trajectory matrix $X$ as $X^TX$. Note that we normalize each trajectory matrix by one over the square root of the number of time windows (the number of rows in the trajectory matrix), so that the product is scaled by one over the number of windows.

S.traj10 <- (t(traj10) * 1/sqrt(nrow(traj10))) %*% (traj10 * 1/sqrt(nrow(traj10)))
S.traj25 <- (t(traj25) * 1/sqrt(nrow(traj25))) %*% (traj25 * 1/sqrt(nrow(traj25)))
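As a quick check on that normalization, here is a sketch showing that scaling each factor by one over the square root of the row count is the same as dividing $X^TX$ by the number of rows. The trajectory matrix is rebuilt with embed() so the snippet runs on its own; the names X, S.a, and S.b are only for illustration.

```r
# Rebuild the 10-year trajectory matrix from the first-differenced series
X <- embed(as.numeric(diff(LakeHuron)), 10)[, 10:1]

# The normalization used above...
S.a <- (t(X) * 1/sqrt(nrow(X))) %*% (X * 1/sqrt(nrow(X)))
# ...and the equivalent, more compact form
S.b <- crossprod(X) / nrow(X)

all.equal(S.a, S.b)
isSymmetric(S.b)
```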

Derivation of the Eigenvectors

Recall that we can get the eigenvectors in one of two ways. The first way, perhaps more straightforward, is through a spectral (eigenvalue) decomposition of the lagged-covariance matrix.

# Spectral decomposition of the lagged covariance matrix (columns are eigenvectors)
S.traj10.eigen <- eigen(S.traj10, symmetric = T)$vectors

Alternatively, we could take an SVD of the trajectory matrix, keeping the right singular vectors.

# SVD of the trajectory matrix
S.traj10.by.svd <- svd(traj10)

We can confirm these are equivalent up to a sign change as follows.

all.equal(sapply(S.traj10.by.svd$v, abs), sapply(S.traj10.eigen, abs))
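The same comparison can be extended to the eigenvalues. In this sketch, which rebuilds the trajectory matrix so that it is self-contained, the eigenvalues of the normalized lagged-covariance matrix should match the squared singular values of the trajectory matrix divided by the number of rows. The object names are my own.

```r
X <- embed(as.numeric(diff(LakeHuron)), 10)[, 10:1]
S <- crossprod(X) / nrow(X)

ev <- eigen(S, symmetric = TRUE)
sv <- svd(X)

# Eigenvalues of S versus squared singular values of X, rescaled
all.equal(ev$values, sv$d^2 / nrow(X))
# Eigenvectors versus right singular vectors, ignoring sign
all.equal(abs(ev$vectors), abs(sv$v))
```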

The two approaches may show sign differences in the resulting eigenvectors because the definition of the direction of the coordinate system is arbitrary. The same is true in PCA [1].

"Each principal component loading vector is unique, up to a sign flip...The signs may differ because each principal component loading vector specifies a direction in p-dimensional space: flipping the sign has no effect as the direction does not change." - James et al. (2013)

Visualizing the Temporal Principal Components

The temporal principal components [3], which correspond to the eigenvectors of the lagged-covariance matrix, are more easily visualized if we wrap up our results in an R data frame.

S.traj10.eigen <- eigen(S.traj10, symmetric = T)$vectors
dat.traj10 <- as.data.frame(S.traj10.eigen)
colnames(dat.traj10) <- 1:ncol(dat.traj10)
dat.traj10$time <- 1:nrow(dat.traj10)

require(dplyr)
require(tidyr)
require(ggplot2)
dat.traj10 %>%
  gather(key = 'eigenvector', value = 'value', -time) %>%
  mutate(eigenvector = ordered(eigenvector, levels = 1:1000)) %>%
  ggplot(mapping = aes(x = time, y = value)) +
  geom_line(size = 0.8) +
  facet_wrap(~ eigenvector) +
  theme_linedraw()

Like PCA, interpretation of SSA results can be subjective. SSA results may be even harder to interpret because every temporal principal component is some oscillation. A more straightforward goal with SSA is smoothing, achieved in the reconstruction of the original signal using a subset of the components.
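To make the smoothing idea concrete, here is a rough sketch of a reconstruction from the leading components. It projects the trajectory matrix onto the first k right singular vectors and then averages along the anti-diagonals, since element $(i,j)$ of the trajectory matrix corresponds to $x_{i+j-1}$. The helper ssa.smooth and the choice of k = 3 are my own assumptions for illustration, not a recommendation.

```r
# Hypothetical helper: reconstruct a series from the leading k SSA components
ssa.smooth <- function (x, m, k) {
  x <- as.numeric(x)
  X <- embed(x, m)[, m:1, drop = FALSE] # Trajectory matrix, (n - m + 1) x m
  V <- svd(X)$v[, 1:k, drop = FALSE]    # Leading right singular vectors
  X.k <- X %*% V %*% t(V)               # Projection onto those k components
  # Diagonal averaging: element (i, j) of X.k is an estimate of x[i + j - 1]
  idx <- row(X.k) + col(X.k) - 1
  as.numeric(tapply(X.k, idx, mean))
}

dx <- diff(LakeHuron)
dx.smooth <- ssa.smooth(dx, m = 10, k = 3)

plot(as.numeric(dx), type = 'l', col = 'gray',
  xlab = 'Time Step', ylab = 'Change in Lake Level (ft)')
lines(dx.smooth, lwd = 2)
```

With k equal to m, the reconstruction returns the original series; smaller values of k give progressively smoother results.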

Conclusion

This is my take on SSA, PCA, and how they compare for different applications.

PCA is a well-established tool for exploratory visualization and analysis of multivariate data. It's particularly valuable for high-dimensional data (lots of columns). For time series data, it may be less useful if there is more variation between spatial units (or sample units) than over time.

SSA is a neat technique for discovering oscillations in time series data but it is tricky to get right. Oscillations may correspond either to signal or to noise and you need to know more about the data generating mechanism in order to distinguish the two. The assumption of first-order stationarity might also pose a problem for certain time series datasets. As a result, SSA may be better for smoothing and forecasting time series data than discovering canonical trajectories. It's also clear that SSA requires longer time series than PCA for practical use.

References

James, G., D. Witten, R. Tibshirani, and T. Hastie. 2013. An Introduction to Statistical Learning with Applications in R. New York, New York, USA: Springer Texts in Statistics.
Eastman, R., and M. Fulk. 1993. Long sequence time series evaluation using standardized principal components. Photogrammetric Engineering & Remote Sensing 59(6):991–996.
Elsner, J., and A. Tsonis. 1996. Singular spectrum analysis: a new tool in time series analysis. New York and London: Plenum Press.
Golyandina, N., V. Nekrutkin, and A. Zhigljavsky. 2001. Analysis of Time Series Structure: SSA and Related Techniques. Washington, D.C., U.S.A.: Chapman and Hall/CRC.
Strang, G. 1988. Linear Algebra and its Applications. 3rd ed. Orlando, Florida, U.S.A.: Harcourt Brace Jovanovich, Inc.

A visual tool for analyzing trends among group means in R

2017-07-19T17:00:00+02:00

Exploratory data analysis is a topic that doesn't get enough attention in courses, formal or otherwise, on statistical analysis or so-called "data science." While some scientists or data professionals may be approaching a problem with what they feel is a great amount of well-established theory behind them, in many cases an a priori understanding of the system under consideration is lacking. In other cases, what is thought to be well-understood about a system is actually ripe for skepticism and further inquiry.

Either way, approaching a dataset with an open mind is a good thing. While any system of study with two or more variables can exhibit complex behavior, including thresholds, non-linearity, and feedbacks, problems in both the physical and social sciences are frequently modeled as purely linear relationships. There are a couple of reasons for this and both are understandable. First, the general linear model (and generalized linear models) are relatively uncomplicated. A second, less compelling reason, is that linear relationships may be thought to be more useful, or at least easier to understand. In exploratory data analysis (EDA), we want to get a sense of the range of possibilities in the system under study, to the extent that our available data accurately represent it.

Here, I present a visual tool and R code to help guide the detection of bivariate trends among group means (or among quantiles of one continuous variable). The tool applies and displays an analysis of variance (ANOVA) test while controlling for the influence of a third variable. I've found this script useful for approaching a variety of datasets and, in the interest of cleaning up my code when I have multiple relationships to test, turned it into a re-usable R function.

Example Using American Community Survey

For this example, I'll use data from the 2012 5-Year American Community Survey (ACS) for the Detroit-area counties of Wayne, Oakland, and Macomb; survey data are at the block-group level. I obtained the ACS data from SocialExplorer.com, which is a lot more convenient than going to the U.S. Census Bureau website. With these data, we might ask questions similar to those posed for a general, tabular dataset:

What is the relationship between housing vacancy and median household income, controlling for housing density?
Is there a relationship between median household income and white population proportion, controlling for population density?

First, I need to process the ACS 2012 data such that:

Variables have meaningful names;
Only block groups with non-zero population and housing totals are considered;
Housing density and population density are calculated for every block group;
Median household income is log-transformed;
My other variables of interest are appropriately normalized;
Quantiles are calculated for my variables of interest.

I present this example using the R programming environment (version 3.3.2). For most of the processing, I'll use dplyr and pipes. I'm going to choose to cut my data into quintiles ($n=5$ quantiles) but this approach works for any kind of discretization.

library(dplyr)

# Load the 2012 ACS data
acs2012.raw <- read.csv('acs2012_5year_by_block_groups.csv',
  header = T, skip = 1, colClasses = c('Geo_FIPS' = 'character'))

quintile <- function (v) {
  # Also, rename levels so they fit inside plot labels
  cut(v, breaks = quantile(v, probs = seq(0, 1, 0.2), na.rm = T),
    include.lowest = T, # Must include for cases of zero vacant housing, etc.
    labels = c('1st', '2nd', '3rd', '4th', '5th'))
}

acs2012 <- acs2012.raw %>%
  select(
    FIPS = Geo_FIPS,
    county.code = Geo_COUNTY,
    area.land.sq.m = Geo_AREALAND,
    Pop.Total = SE_T001_001,
    Pop.White = SE_T013_002,
    Pop.Black = SE_T013_003,
    Median.Hhold.Income = SE_T057_001,
    Housing.Units.Total = SE_T093_001,
    Housing.Units.Vacant = SE_T096_001) %>%
  # Consider only block groups with non-zero housing, population
  filter(Pop.Total > 0) %>%
  filter(Housing.Units.Total > 0) %>%
  filter(!is.na(Median.Hhold.Income)) %>%
  # Normalize variables
  mutate(
    Log10.Median.Hhold.Income = log10(Median.Hhold.Income),
    Housing.Density = Housing.Units.Total / area.land.sq.m,
    Pop.Density = Pop.Total / area.land.sq.m,
    Prop.White = Pop.White / Pop.Total,
    Prop.Black = Pop.Black / Pop.Total,
    Prop.Vacant = Housing.Units.Vacant / Housing.Units.Total) %>%
  mutate_each(funs('quintile'), Pop.Density, Housing.Density,
    Prop.White, Prop.Black, Prop.Vacant)

We lost just 16 block groups by our decision to exclude those block groups with zero population or zero housing. It turns out that median household income (Median.Hhold.Income) is missing for 5 more block groups, so I also remove these cases.
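The mutate_each() and funs() calls above reflect the version of dplyr I was using at the time; both have since been deprecated. As a sketch of the same step in more recent versions (assuming dplyr 1.0 or later, where across() is available), here is the quintile function applied to a small synthetic data frame. The toy data are random numbers, not ACS values.

```r
library(dplyr)

quintile <- function (v) {
  cut(v, breaks = quantile(v, probs = seq(0, 1, 0.2), na.rm = TRUE),
    include.lowest = TRUE,
    labels = c('1st', '2nd', '3rd', '4th', '5th'))
}

# Synthetic stand-in for the ACS variables (hypothetical values)
set.seed(1)
toy <- data.frame(Prop.Vacant = runif(100), Housing.Density = rexp(100))

# Replace each selected column with its quintile, as mutate_each() did
toy.q <- toy %>%
  mutate(across(c(Prop.Vacant, Housing.Density), quintile))

table(toy.q$Prop.Vacant)
```

Each quintile should hold the same number of observations when there are no ties, which is a handy sanity check on the breaks.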

The annotated source code for the levels.plot function seen in subsequent code snippets is at the bottom of this post.

First Plot: Income versus Vacant Housing

Let's answer the first question with data: What is the relationship between housing vacancy and median household income, controlling for housing density? Here, it makes more sense (to me) to have median household income on the y-axis, as a continuous variable; thus, it should be the first variable name passed to levels.plot. The (proportion of) vacant housing should then be our second variable. Finally, we want to control for the effect of housing density; that is, we want to examine this relationship at different levels of housing density.

levels.plot(acs2012, 'Log10.Median.Hhold.Income', 'Prop.Vacant',
  'Housing.Density', y1 = 5.6)

The function will automatically determine where to place both the p-value statistic associated with the ANOVA (within each density level) and the numbers of observations in each bin. However, I felt the bin counts were getting overplotted by the high-end outliers for the boxplots, so I added a custom value for the y1 argument, the position of these labels; it's a little higher, now, than the maximum value for Log10.Median.Hhold.Income (5.60 instead of 5.37).

How do we read this plot? The subplots with gray boxes as titles (the facets in ggplot2 parlance) correspond to each of the five density quintiles (from lowest to highest housing density, in this case). Within each of those subplots, the five quintiles of the second variable (var2), the proportion of vacant housing (Prop.Vacant), in this case, are each plotted against the log of median household income.

We can see from this plot that there is a clear, statistically significant negative relationship between vacant housing and median household income at all density levels. This is just as expected; presumably, people with higher incomes can avoid living in neighborhoods with high vacancy rates. Conversely, high vacancy rates do not tend to occur where people have higher incomes (because they are more likely to be able to afford to pay their mortgage and property taxes).
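The p-value printed in each facet comes from a separate one-way ANOVA fit within that density level. A minimal sketch of that step, using synthetic data in place of the ACS table (the column names income, vacancy, and density are made up for this example):

```r
set.seed(42)
toy <- data.frame(
  # Two vacancy groups with different mean (log) incomes
  income = c(rnorm(50, mean = 5.0, sd = 0.2), rnorm(50, mean = 4.6, sd = 0.2)),
  vacancy = factor(rep(c('1st', '5th'), each = 50)),
  density = factor(rep(c('1st', '2nd'), times = 50)))

# One ANOVA per density level; keep the p-value for the vacancy term
p.values <- sapply(levels(toy$density), function (q) {
  fit <- aov(income ~ vacancy, data = toy[toy$density == q,])
  summary(fit)[[1]][['Pr(>F)']][[1]]
})
p.values
```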

We also see that, with the logarithmic transformation of median household income, there is a slight saturation effect at low levels of vacant housing. That is, when the proportion of housing that is vacant is small, we see very small differences in median household income. But at the highest three quintiles of vacant housing, the difference is much larger.

Second Plot: Income and White Population

Our second question: Is there a relationship between median household income and white population proportion, controlling for population density? The call to levels.plot is very similar, but we're controlling for population density this time.

We see a clear, statistically significant, positive trend in the log of median household income as the white proportion rises. At the highest population density level, the relationship is linear; at lower population densities, it exhibits the exponential behavior we saw in the relationship between vacant housing and income.

levels.plot(acs2012, 'Log10.Median.Hhold.Income', 'Prop.White',
  'Pop.Density', y1 = 5.6)

The Levels Plot Function

There are many other examples I could provide, but you've seen enough of what this function does. It is a fairly simple, yet effective, visualization tool. The R code I provide below demonstrates how simple it is. Because R is not a very well-designed language (e.g., awful tracebacks, pollution of the global namespace, proliferation of global functions that are very similar, factors), I don't profess to be any good at writing base R code; it is quite likely this function definition could be written with more sophistication. But I think this attempt is more than adequate. I welcome any suggested changes.

# Levels plot function
# -- Presents a plot of one continuous variable (`var1`) across quintiles of
#     a discrete variable (`var2`), faceted by a third density variable (`dens`)

require(ggplot2)
require(grid)
require(reshape2)
levels.plot <- function (q.dat, var1, var2, dens, yaxis = 'fixed', y0 = NULL, y1 = NULL, y.label = NULL) {
  # Calculate cross-tabulation
  ctab <- reshape2::melt(table(subset(q.dat, select = c(var2, dens))), id.vars = var2)

  # Set the y-axis position of quantile N labels
  ctab$y <- rep(ifelse(is.null(y1), max(q.dat[,var1]), y1), dim(ctab)[1])

  # var1 must be a continuous variable
  stopifnot(class(q.dat[,var1]) %in% c('integer', 'numeric'))

  # Iterate over the density levels...
  tests <- c()
  for (q in levels(q.dat[,dens])) {
    test <- aov(as.formula(paste0(var1, ' ~ ', var2)),
                data = q.dat[q.dat[,dens] == q,])
    tests <- c(tests, sprintf('p-value: ~%.4f', summary(test)[[1]][['Pr(>F)']][[1]]))
  }
  # Set the y-position of the p-value text
  tests <- data.frame(p.value = tests, dens = levels(q.dat[,dens]),
    x = 1, y = ifelse(is.null(y0), min(q.dat[,var1]), y0))
  # MUST rename the "dens" column to the variable name so it can be found by ggplot2
  .names <- names(tests)
  .names[2] <- dens
  names(tests) <- .names

  # The initial plot object
  ggplot(q.dat, aes_string(y = var1)) +
    geom_boxplot(mapping = aes_string(x = var2)) +
    geom_text(aes_string(x = var2, y = 'y', label = 'value'),
      data = ctab, vjust = 0.7, size = 4.5) +
    geom_text(aes(x = x, y = y, label = p.value), data = tests,
      hjust = 0.1, vjust = 0.1, size = 4.5) +
    xlab(paste0('2012 ACS ', gsub('\\.', ' ', var2), ' Quintile')) +
    ylab(ifelse(is.null(y.label), gsub('\\.', ' ', var1), y.label)) +
    labs(title = paste0(gsub('\\.', ' ', var2), ' by ', gsub('\\.', ' ', dens), ' Quintiles')) +
    facet_wrap(as.formula(paste0('~ ', dens)), scales = yaxis) +
    theme_bw() +
    theme(text = element_text(size = 16),
      plot.margin = unit(c(0.5, 0.2, 0.5, 0), 'cm'))
}

Teaching the Q Method in a class on urban sustainability

2016-09-21T10:30:00+02:00

On Deciding to Teach the Q Method

While my background is in the natural sciences, I have a tendency for discovering and diving into new methods, including those that originate or are typically practiced in the social sciences. As I'm helping to design and teach a course on urban sustainability this semester, a course that must cater to students in a professional Master's program who will go on to careers as sustainability practitioners, I have struggled to devise skill-based instruction that serves their needs. The Q Method is a relatively obscure approach to analyzing qualitative data on human subjectivity (although there is an active research community promoting its wider use). Our students are learning that urban sustainability is not an exact science; it is a confluence of discourses and untested proposals for how to make our cities more efficient, healthy, and just. Can the Q Method help them to understand the diverse perspectives on contemporary sustainability issues?

The Q Method

The Q Method is a mixed method that combines a survey of individuals with factor analysis to determine what distinct perspectives are embedded in a population. In the words of van Exel and de Graaf [1], who paraphrase Brown [2]:

Q methodology provides a foundation for the systematic study of subjectivity, a person's viewpoint, opinion, beliefs, attitude, and the like...By Q sorting people give their subjective meaning to the statements, and by doing so reveal their subjective viewpoint...or personal profile.

It is a useful tool for analyzing human subjectivity on a variety of social or technical issues, whether the respondents are experts in a particular field or are drawn from a more general population.

Q Method was devised by the psychologist William Stephenson, who was very critical of the classic statistical analysis advanced by Karl Pearson. In particular, Stephenson's Q Method questions the single, objective reality that is assumed in classical statistics, where the goal is to develop a best-fitting model that tests one of a prescribed set of hypotheses. With the Q Method, no a priori hypotheses are specified. Its practitioners assume the existence of multiple subjective realities. The origin story for the Q Method holds that because classical statistics was a discipline associated with Pearson and, in particular, Pearson's correlation coefficient, denoted $r$, Stephenson decided that his method should be called "Q" as Q comes before R in the alphabet.

Terminology and Methods

We've all seen surveys that ask us to rate statements on a scale from "Agree" to "Disagree," perhaps from "Strongly Agree" to "Strongly Disagree." These surveys have usually irritated me because this spectrum seems rather arbitrary, the number of gradations from one end to the other too numerous. Am I a "7" or an "8" on this 10-point scale? Do I agree or agree strongly?

The Q Method begins with such a survey but puts more thought into its design and execution. When we survey a group of respondents, also called the P sample (or P set), each is asked to sort a collection of statements, photographs, or other discrete messages along an axis. This axis, however, may be described in different terms from simple agreement and disagreement. We might ask the respondents to sort the statements from "less like how I think" to "more like how I think," or from "less likely to motivate me" to "more likely to motivate me." The statements (or photographs, audio clips, etc.) to be sorted constitute the concourse of communication. The sorting of these statements is also typically constrained by a matrix that approximates a quasi-normal distribution (see below). Each of the responses constitutes a Q sort and the collection of all Q sorts is referred to as the Q sample.

Above, a Q sort in progress is depicted (from Ellingsen et al., 2014); the respondent is sorting the statements into a quasi-normal distribution according to her subjective viewpoint on each.

The "objective" part of the Q Method comes after the Q sample is generated. It involves the application of factor analysis to the Q sample, with the goal of inducing factors that correspond to shared, subjective worldviews related to the concourse of communication. The use of factor analysis belies the subjectivity that enters this part of the method, however. The number of factors to induce is not easily determined. Moreover, the resulting factors require interpretation, which is often highly subjective and even confusing. Consequently, much of what is to be learned from the Q Method comes from the process of conducting the analysis and sharing the results with the P sample that produced the data. When the P sample consists of practitioners, community members, or outside experts, then the Q Method becomes a tool for the co-production of knowledge, which is increasingly important in natural resource issues.

The Q Method in Practice

The Q Method has been used in sustainability research before. Zeemering (2009) surveyed San Francisco city officials as to what aspects of "sustainability" are most important to them and their work [4]. The statements for the concourse were drawn from a local non-profit group's report and were ranked by city officials from "least" to "most important in my community." The Q Method has also been used with a concourse of photos, which participants were asked to sort based on two prompts: 1) as to how the picture makes them feel climate change is important or unimportant; 2) as to how the picture makes them feel they can or cannot do something about climate change [5].

Applying the Q Method to Urban Sustainability

I assigned my students a Q sort exercise with statements that described multiple perspectives on sustainability. The concourse was drawn from two sources: John Dryzek's The Politics of the Earth and another book, Confronting Consumption, edited by Thomas Princen, Michael Maniates, and Ken Conca. These books, particularly the former, explore a number of ways of characterizing sustainability challenges and natural resource issues. Students were asked to sort 21 statements into a matrix based on their agreement with the statement. A quasi-normal distribution was enforced. What follows is a description of the analysis in R.

Organizing the Q Sample

The data are organized as a CSV file with the respondents along the columns and the statements along the rows, as in the example below, where r1 through r15 are each of the respondents.

q.sample <- read.csv('QMethod_results.csv')
head(q.sample)

##   r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11 r12 r13 r14 r15
## 1 -1  0  1  2  2  0  0  1 -1   0   0   2   3   3   2
## 2  2  0  1  1  1  2 -1  0  0   2   3   2   2  -3   2
## 3  1  2  2  1  1  1  1 -3  0   1   2  -2   1   1   1
## 4 -1  2 -2  0 -2 -3 -1  1 -2  -2   0  -2  -3  -2  -2
## 5  0  0  0  0  1  1 -2  1  1  -1   1   0   0   0   1
## 6  0 -1  1 -1 -1 -1  0  0  1   0   0   3  -2   1  -1

Because we used a forced (quasi-normal) distribution (i.e., respondents were constrained to placing statements inside a bell-shaped distribution), we want to make sure that the columns sum to zero (i.e., same number of statements on each side of the curve).

apply(q.sample, 2, sum)

##  r1  r2  r3  r4  r5  r6  r7  r8  r9 r10 r11 r12 r13 r14 r15
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
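A zero column sum is a necessary check but not a complete one, because a respondent could misplace statements symmetrically and still sum to zero. A stricter check is that every respondent used each score the same number of times. Here is a sketch with a simulated Q sample; the slot counts below are a hypothetical 21-statement distribution, not necessarily the matrix my students used.

```r
set.seed(7)
# Hypothetical forced distribution: how many statements fit under each score
slots <- rep(-3:3, times = c(1, 3, 4, 5, 4, 3, 1))

# Simulated Q sample: 21 statements (rows) sorted by 5 respondents (columns)
toy.q <- as.data.frame(replicate(5, sample(slots)))
names(toy.q) <- paste0('r', 1:5)

# Every column should sum to zero...
colSums(toy.q)

# ...and every respondent should have used each score equally often
counts <- sapply(toy.q, function (v) table(factor(v, levels = -3:3)))
counts
all(counts == counts[, 1])
```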

Exploratory Analysis

One question we can ask of the Q sample is: which respondents are highly correlated with one another? We can calculate Pearson's correlation between each pair of respondents using the cor() function.

cor(q.sample)

However, it's more effective to visualize these correlations as a heatmap.

# Calculate the correlations
cc <- as.data.frame(cor(q.sample, method = 'pearson'))
cc$X1 <- factor(colnames(cc), ordered = TRUE, levels = rev(names(q.sample)))

require(reshape2)
# Reorganize the table for ggplot2
cc <- melt(cc, id.vars = 'X1', variable.name = 'X2', value.name = 'Correlation')

require(ggplot2)
require(RColorBrewer)
pal <- brewer.pal(9, 'RdYlBu')
ggplot(cc,
  mapping=aes(x = X1, y = X2, fill = Correlation)) +
  geom_tile() +
  scale_fill_gradientn(colours = pal, limits = c(-1, 1)) +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0)) +
  coord_equal() +
  xlab('') + ylab('') +
  labs(title = "Pearson's Correlation Coefficients") +
  theme_bw() +
  theme(axis.text = element_text(size = 16),
    axis.text.x = element_text(angle = 90, hjust = 1),
    plot.title = element_text(vjust = 1, size = 16),
    legend.title = element_text(size = 12, vjust = 2),
    legend.text = element_text(size = 12),
    legend.text.align = 1)

Choosing the Number of Factors to Induce

When using the Q method, we have to decide a priori how many factors to extract from the dataset. Recall that the factors correspond to the distinct perspectives or worldviews that the population's Q sorts represent. We may have some prior information or expert opinion as to the number of perspectives that exist. For example, in a Q sort of statements related to immigration, we might expect that there would be both pro-immigration and anti-immigration perspectives; therefore, we would try to extract at least 2 factors.

If we don't have any prior information, we can attempt to determine the number of factors from the data alone. This approach, letting the dataset speak for itself, might begin with a screeplot, as below. The screeplot shows the amount of variance (which we can think of as information content) that is explained by an increasing number of factors. Think about this plot in terms of moving to the right along the horizontal axis. As we continue to add factors, we will come to a point where very little information is gained by adding an additional factor. We want to choose as few factors as possible while explaining as much of the data as possible.

screeplot(prcomp(q.sample), main = 'Screeplot of unrotated factors',
  type = 'l', lwd = 2, cex = 1.5, cex.lab = 1.4, cex.axis = 1.5,
  cex.main = 1.5)

Based on this screeplot, I would estimate there are about 4 or 5 factors in the data.
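Reading a screeplot by eye can be supplemented with the numbers behind it. This sketch computes the cumulative proportion of variance explained from a prcomp() fit. It uses a simulated Q sample (random sorts under a hypothetical 21-statement distribution), so the values it prints say nothing about my students' data; only the calculation carries over.

```r
set.seed(7)
# Simulated Q sample: 21 statements (rows) by 15 respondents (columns)
slots <- rep(-3:3, times = c(1, 3, 4, 5, 4, 3, 1))
toy.q <- as.data.frame(replicate(15, sample(slots)))

pc <- prcomp(toy.q)
# Proportion of total variance explained by each component
var.explained <- pc$sdev^2 / sum(pc$sdev^2)
round(cumsum(var.explained), 2)
```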

Performing the Factor Analysis

Once we've decided on how many factors to induce in the data, we can run the final analysis. I'm using the qmethod package for this analysis.

library(qmethod)
results <- qmethod(q.sample, nfactors = 4)

We can see a summary of the analysis, including the factor scores for each statement, with:

summary(results)

## Q-method analysis.
## Finished on:             Tue Sep 20 10:07:54 2016
## Original data:           21 statements, 15 Q-sorts
## Forced distribution:     TRUE
## Number of factors:       4
## Rotation:                varimax
## Flagging:                automatic
## Correlation coefficient: pearson
##
## Factor scores
##    fsc_f1 fsc_f2 fsc_f3 fsc_f4
## 1       1      0      1      1
## 2       3      2      0      0
## 3       2      1     -3      2
## 4      -2     -2     -1     -2
...

We can also visualize the results as a plot, where the statements are on the y-axis, ordered from highest consensus (at bottom) to highest disagreement (at top).

require(RColorBrewer)
colours = brewer.pal(4, 'Set1')
plot(results,
  ylab = 'Statements', colours = colours,
  main = 'Factor Loadings by Concourse Statement')

We can see that Statement 1 ("A sustainable society is one that has in place informational, social, and institutional mechanisms to keep in check the positive feedback loops that cause exponential population and capital growth") has the second-highest consensus among the perspectives; that is, no matter what worldview an individual respondent most closely aligns with, he or she was very likely to rate this statement the same way as everyone else. What are the statements with the highest disagreement? Statement 14 ("Growth has no set limits in terms of population or resource use beyond which lies ecological disaster") and Statement 18 ("There is no single pathway to a sustainable future; local experimentation that is pluralistic, incremental, and piecemeal must be allowed so that every possibility for enduring prosperity is considered") saw wide disagreement, indicating that people disagree about limits to growth and whether or not local control is an important component of achieving sustainability.

Interpreting the Factors

Using the summary and the plot outputs in the previous section, we can try to interpret the factors as follows.

Factor 1: Market Reformers

Factor 1 is distinguished by:

A negative response to Statement 14 ("Growth has no set limits...");
A negative response to Statement 6 ("Resource exhaustion and environmental degradation are largely a matter of individuals and other actors pursuing their material interests in uncoordinated and decentralized systems").

This suggests that this perspective is skeptical of growth and also about individual responsibility for natural resource problems.

Factor 2: Radical Political Economists

Factor 2 is distinguished by:

A positive response to Statement 21 ("A focus on individual responses to environmental problems, such as planting trees or recycling, distracts us from the structural and institutional barriers to achieving sustainability");
A negative response to Statement 12 ("Environmental sustainability can be achieved largely through smart consumers making smart choices about where their food is sourced, how their clothing is made, where they live, and how much they buy");
A negative response to Statement 9 ("Environmental conservation is not just ecologically sound, it is good for business and economic growth everywhere").

These results suggest that respondents aligned with Factor 2 are conscious of how a broader political economy hampers sustainability and are not convinced that sustainability can be achieved through individual consumer choices.

Factor 3: Pro-Market Eco-Modernists Acknowledging Sacrifice Zones

Factor 3 is distinguished by two contradictory statements:

A positive response to Statement 14 ("Growth has no set limits...");
A positive response to Statement 15 ("Humans have already appropriated more than our fair share of the earth's finite resource base and further economic growth is impossible").

And also:

A negative response to Statement 18 ("There is no single pathway to a sustainable future...");
A negative response to Statement 3 ("Advances in science and technology can enhance the carrying capacity of the earth and our resource base");
A positive response to Statement 7 ("A government-led sustainable development initiative is just another futile attempt to replace markets with political management; instead of trying to impose discipline on people's decisions, we should adjust the market's price system");
A negative response to Statement 10 ("Sustainability planning can work through government's creation and enforcement of environmental quality standards");
A negative response to Statement 13 ("Economic growth in all parts of the world is essential to improve the livelihoods of the poor, to sustain growing populations, and eventually to stabilize population levels");
A positive response to Statement 8 ("Inevitable population growth and desirable economic growth can never be accommodated by the earth's resources").

This is a confusing array of positions. The only sense I can make of this is that respondents aligned with this view are market boosters who think that Western industrial society represents the peak of human efficiency and yet acknowledge that this reality has created ecological sacrifice zones in the rest of the world, which will never catch up.

Factor 4: Pragmatic Social Engineers

Factor 4 is distinguished by:

A negative response to Statement 21 ("A focus on individual responses to environmental problems...distracts us from the structural and institutional barriers to achieving sustainability");
A positive response to Statement 17 ("In order to avoid the disruption of the earth's life support functions, a centrally coordinated plan for action is needed with enforcement administrated by a global governing body");
A positive response to Statement 10 ("Sustainability planning can work through government's creation and enforcement of environmental quality standards").

These statements distinguish an interesting composite view of sustainability issues both at local and global scales. This view suggests that individual behaviors should conform to a global development plan, enforced by central governments, for the proper management and conservation of natural resources; it reflects confidence in the ability of governments to change human behavior. They are pragmatic because they don't seem to believe that personal choice alone is enough to achieve sustainability.

Checking the Factor Loadings

Finally, we can see how each respondent loads onto each factor (i.e., how they align with each perspective).

results$loa

             f1         f2           f3          f4
r1   0.07740761  0.8594596  0.005231051 0.096248126
r2   0.11055604  0.2496759 -0.825336473 0.320306693
r3   0.35672108  0.2638716 -0.057400350 0.771305939
...

Concluding Remarks

Three of the four factors identified represent interesting and probably real world views. Moreover, the loadings on these factors persist when I induce 5 factors instead of 4, suggesting they are stable. Factor 3, however, is hard to interpret. It likely indicates that the students did not understand Statements 14 and 15 (or, more precisely, did not understand them the same way I did when I chose them).

There are two significant areas for improvement in this example that would be essential next steps were this a real study. First, the concourse of communication was drawn from books the students have likely never read; they may not understand what the authors of these statements mean by them. In general, the concourse of communication should be drawn from written or spoken statements by the individuals forming the P sample. For urban sustainability studies, this might involve interviewing sustainability practitioners and selecting statements from across multiple individuals, who later will each conduct a Q sort. Second, the Q sort matrix was probably not well suited for this study. A larger body of statements and more slots to accommodate them in the tails of the distribution would likely have improved the results. While there is little advice that I've been able to find on designing these matrices or selecting the elements of the concourse of communication, there is promise to this method, at least in the classroom.

References

Exel, J. van, & Graaf, G. de. (2005). Q methodology: A sneak preview.
Brown, S. R. (1993). A primer on Q methodology. Operant Subjectivity, 16(3/4), 91–138.
Ellingsen, I. T., A. A. Thorsen, I. Størksen. (2014) "Revealing Children's Experiences and Emotions through Q Methodology." Child Development Research. 2014. Article ID 910529.
Zeemering, E. S. (2009). What Does Sustainability Mean to City Officials? Urban Affairs Review, 45(2), 247–273.
O’Neill, S. J., M. Boykoff, S Niemeyer, S. A. Day. (2013) "On the use of imagery for climate change engagement." Global Environmental Change 23, 413–421.

Diagnostics for fixed effects panel models in R

2016-04-14T15:00:00+02:00

Note: This post has been updated for clarity and to use the Gapminder dataset instead of my old, proprietary example.

I've recently been working with linear fixed-effects panel models for my research. This class of models is a special case of more general multi-level or hierarchical models, which have wide applicability for a number of problems. In hierarchical models, there may be fixed effects, random effects, or both (so-called mixed models); a discussion of the multiple definitions of "fixed effects" is beyond the scope of this post, but Gelman and Hill (2007) or Bolker et al. (2009) are good references for this [4,7]. Fixed effects, in the sense of fixed-effects or panel regression, are basically just categorical indicators for each subject or individual in the model. The way this works without exhausting all of our degrees of freedom is that we have at least two observations over time for each subject (hence: a panel dataset). One further tweak that leads to the "within" estimator discussed in this post is that each subject's panel data are time-demeaned; that is, the long-term average within each subject is subtracted from all measurements for that subject.

Although these models can be fit in R using the built-in lm() function most users are familiar with, there are good reasons to use one of the two dedicated libraries discussed here:

For large numbers of fixed effects, fitting the model with lm() can be intractable and return poor results.
In addition, the fixed effects clutter the statistical summary() of your model because they're reported alongside any covariates of interest.
The "fixed-effects transformation" (time-demeaning) is applied automatically (and correctly) without you having to transform your data.

In my work, I have about 4000-6000 fixed effects and, fortunately, the R community has delivered two excellent libraries for working with these models: lfe and plm. A more detailed introduction to these packages can be found in [1] and [2], respectively. Here, I'll summarize how to fit these models with each of these packages and how to develop goodness-of-fit tests and tests for the linear model assumptions, which are trickier when working with these packages (as of this writing).

I should state up front that I am going to gloss over much of the statistical red meat, writing, as I usually do, for practitioners rather than statisticians. Also, there are a variety of flavors of models that can be estimated with this framework. I'm going to focus on just one type of model, the panel model by the "within" estimator.

Introduction to Fixed-Effects Panel Models

Fixed-effects panel models have several salient features for investigating drivers of change. They originate from the social sciences, where experimental setups allow for intervention-based prospective studies, and from economics, where intervention is typically impossible but inference is needed on observational data alone. In these prospective studies, a panel of subjects (e.g., patients, children, families) are observed at multiple times (at least twice) over the study period. The chief premise behind fixed effects panel models is that each observational unit or individual (e.g., a patient) is used as its own control, exploiting powerful estimation techniques that remove the effects of any unobserved, time-invariant heterogeneity [3,4]. By estimating the effects of parameters of interest within an individual over time, we can eliminate the effects of all unobserved, time-invariant heterogeneity between the observational units [5]. This feature has led some investigators to propose fixed-effects panel models for weak causal inference [3] as the common problem of omitted variable bias (or "latent variable bias") is removed through differencing. Causal inference with panel models still requires an assumption of strong exogeneity (simply put: no hidden variables and no feedbacks).

The linear fixed-effects panel model extends the well-known linear model, below. The response of individual $i$, denoted $y_i$, is a function of some group mean effect or intercept, $\alpha$, one or more predictors, $\beta x_i$, and an error term, $\varepsilon_i$.

$$ y_i = \alpha + \beta x_i + \varepsilon_i $$

The basic linear fixed-effect panel model can be formulated as follows, where we add an intercept term for each of the individual units of observation, $i$, which are observed at two or more times, $t$:

$$ y_{it} = \alpha_i + \beta x_{it} + \varepsilon_{it} $$

It's important to note that this approach requires multiple observations of each individual. Obviously, if the number of observations $N$ were equal to the number of individuals $M$, we would exhaust the degrees of freedom in our model simply by adding $M$ intercept terms, $\alpha_1, \ldots, \alpha_M$. With as few as two observations ($t \in \{1, 2\}$) of each subject, however, we've doubled the number of observations, and the individual intercept terms now absorb any time-invariant, idiosyncratic differences between subjects.
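As a minimal sketch of why the individual intercepts and the "within" (time-demeaned) estimator amount to the same thing, the example below fits both to a small simulated panel. It is written in Python with NumPy rather than R, and the data, the sizes `M` and `T`, and the true slope of 2.0 are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical balanced panel: M individuals, each observed T times
M, T = 5, 4
ids = np.repeat(np.arange(M), T)           # individual index for each row
alpha = rng.normal(0, 5, size=M)           # true individual intercepts
x = rng.normal(size=M * T) + alpha[ids]    # predictor correlated with the intercepts
y = alpha[ids] + 2.0 * x + rng.normal(scale=0.1, size=M * T)

# (a) Dummy-variable regression: one indicator column per individual
D = (ids[:, None] == np.arange(M)).astype(float)
coefs, *_ = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)
beta_lsdv = coefs[0]

# (b) "Within" estimator: subtract each individual's mean, then regress
def demean(v):
    means = np.bincount(ids, weights=v) / np.bincount(ids)
    return v - means[ids]

x_w, y_w = demean(x), demean(y)
beta_within = (x_w @ y_w) / (x_w @ x_w)

print(beta_lsdv, beta_within)
```

The two slope estimates agree to numerical precision, but the demeaned version never has to build the indicator columns. That matters when there are thousands of fixed effects.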

This model can be extended further to include both individual fixed effects (as above) and time fixed effects (the "two-ways" model):

$$ y_{it} = \alpha_i + \beta x_{it} + \mu_t + \varepsilon_{it} $$

Thus, there are two basic kinds of fixed-effects panel models that can be estimated using the "within" estimator. The "unobserved effects" or individual model accounts for unobserved heterogeneity between individuals by partitioning the error term into two parts. One part is specific to the individual unit of observation and doesn't change over time while the other is the idiosyncratic error term, $\varepsilon$, we are familiar with from basic linear models [2]. The second type is an extension of the first, the "two-ways" panel model that includes both individual and time fixed effects.

A further extension of the panel model, one often seen in the literature, is given below.

$$ y_{it} = \alpha_i + \beta x_{it} + \phi z_i + \mu_t + \varepsilon_{it} $$

While $\beta$ and $\varepsilon$ do not differ from the meanings in the basic linear model, $\alpha_i$ is the individual fixed effect and $\phi$ is a vector of coefficients for time-invariant, unit-specific effects. These effects can be estimated in a linear model but are removed in some kinds of estimation of panel models ($\phi \equiv 0$). They are removed in estimation through differencing; since each observational unit is used as its own control, we are unable to distinguish between heterogeneity that we didn't observe and can't account for (the kind we wish to remove from our model) and known, observed differences between observational units (e.g., race or sex in patients) [3].
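To see why the time-invariant terms drop out, here is a short sketch for the model without time effects. The bars denote each unit's average over time; this notation is introduced here for illustration and isn't used elsewhere in the post. Averaging the model within unit $i$ gives:

$$ \bar{y}_i = \alpha_i + \beta \bar{x}_i + \phi z_i + \bar{\varepsilon}_i $$

Subtracting this from the original equation, observation by observation, leaves:

$$ y_{it} - \bar{y}_i = \beta (x_{it} - \bar{x}_i) + (\varepsilon_{it} - \bar{\varepsilon}_i) $$

Both $\alpha_i$ and $\phi z_i$ are constant over time, so they cancel. This removes the unobserved heterogeneity, but it also means $\phi$ cannot be estimated from the demeaned data.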

Fitting Fixed-Effects Panel Models in R

Let's look at the Gapminder dataset, a somewhat well-known dataset (owing to the TED talk on the subject) on global development indicators, including life expectancy and per-capita gross domestic product (GDP). You can download the data I used for this example as a CSV file from here. Below is a sample of the Gapminder data.

> head(panel.data)
      country year lifeExp      pop gdpPercap
1 Afghanistan 1952  28.801  8425333  779.4453
2 Afghanistan 1957  30.332  9240934  820.8530
3 Afghanistan 1962  31.997 10267083  853.1007
4 Afghanistan 1967  34.020 11537966  836.1971
5 Afghanistan 1972  36.088 13079460  739.9811
6 Afghanistan 1977  38.438 14880372  786.1134

In this example, the observational units are countries. Here, the country name is the unique identifier for individual subjects (countries) and the year is the identifier for the time period; these are the individual and time fixed effects, respectively. Both the individual and time fixed effects, country and year, must be factors where the levels correspond to the individual and time period identifiers, respectively. I'll investigate to what extent change in life expectancy (lifeExp) is predicted by change in per-capita GDP (gdpPercap).

Let's start with lfe. A basic panel model can be fit with lfe using the provided felm() function. This approach exploits the vertical bar in the Wilkinson-Rogers syntax (R formulas) to specify the "levels" by which our panel data are organized. Here, we fit the individual fixed effects model, with country as the grouping factor.

library(Matrix)
library(lfe)
m1a <- felm(lifeExp ~ gdpPercap | country, data = panel.data)

We can be more explicit by specifying the contrasts of our model but the result is the same.

m1b <- felm(lifeExp ~ gdpPercap | country, data = panel.data,
  contrasts = c('country', 'year'))

We can see that our predictor, (change in) per-capita GDP, is highly significant and we are given two pairs of goodness-of-fit statistics: the multiple and adjusted R-squared for the "full" and "projected" models. The full model is our model with the individual fixed effects included; the projected model is the estimated model where our fixed effects are not included. The full model always performs better than the projected model because the individual fixed effects always explain additional variation in the response: they account for any idiosyncratic differences between each observational unit.

> summary(m1a)

Call:
   felm(formula = lifeExp ~ gdpPercap | country, data = panel.data)

Residuals:
    Min      1Q  Median      3Q     Max
-31.697  -3.424   0.321   3.684  21.106

Coefficients:
           Estimate Std. Error t value Pr(>|t|)    
gdpPercap 4.260e-04  3.114e-05   13.68   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.505 on 1561 degrees of freedom
Multiple R-squared(full model): 0.7675   Adjusted R-squared: 0.7464
Multiple R-squared(proj model): 0.1071   Adjusted R-squared: 0.02583
F-statistic(full model): 36.3 on 142 and 1561 DF, p-value: < 2.2e-16
F-statistic(proj model): 187.2 on 1 and 1561 DF, p-value: < 2.2e-16

We can fit the two-ways fixed effects model in lfe simply by adding a second fixed effect, year, after the vertical bar.

m2 <- felm(lifeExp ~ gdpPercap | country + year, data = panel.data)

Switching to plm, we can fit the two-ways fixed effects model using the plm() function. The plm library doesn't use the vertical bar to specify fixed effects; rather, it requires us to specify the index argument as a character vector naming the individual and time identifiers (in that order). We also indicate that the model we want to estimate is the within model and that we are estimating twoways effects.

library(Matrix)
library(plm)
m2 <- plm(lifeExp ~ gdpPercap, data = panel.data, model = 'within',
  effect = 'twoways', index = c('country', 'year'))

The output of summary() for plm is different and we get a little more detail in some areas. We see that we have a balanced panel (same number of observations for each individual) over 142 subjects (countries) and 12 time periods (for a total of 1,704 observations).

> summary(m2)

Twoways effects Within Model

Call:
plm(formula = lifeExp ~ gdpPercap, data = panel.data, effect = "twoways",
    model = "within", index = c("country", "year"))

Balanced Panel: n = 142, T = 12, N = 1704

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max.
-22.63728  -1.69344  -0.04944   2.00422  10.11005

Coefficients:
             Estimate  Std. Error t-value  Pr(>|t|)    
gdpPercap -7.8017e-05  1.8431e-05 -4.2329 2.442e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    18529
Residual Sum of Squares: 18317
R-Squared:      0.011428
Adj. R-Squared: -0.086154
F-statistic: 17.9177 on 1 and 1550 DF, p-value: 2.442e-05

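What we don't see is a goodness-of-fit statistic for the full model. The small R-Squared value presented here, like the "proj model" statistics in the lfe output, is evidently that of the projected model (though note that the lfe summary shown above is for the individual effects model, not the two-ways model).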

Note that the individual effects model can be fit using plm by removing the second variable from the vector provided to index and specifying an individual effects model:

m1 <- plm(lifeExp ~ gdpPercap, data = panel.data, model = 'within',
  effect = 'individual', index = c('country'))

Goodness-of-Fit for Panel Models

To get a goodness-of-fit metric for the full model, we have to calculate various sums-of-squares. Returning to our basic statistics, we note that:

$$ R^2 = \frac{SSR}{SST} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} $$

Where $\hat{y}_i$ and $\bar{y}$ are the estimated and mean values, respectively, of $y_i$, and $SSR$ and $SST$ are, respectively, the regression (explained) sum-of-squares and the total sum-of-squares. They are related by the following formula, in which $SSE$ is the residual (error) sum-of-squares:

$$ \sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 $$

$$ SST = SSE + SSR $$

Thus, if we can calculate $SSR$ and $SST$, we can calculate R-squared. $SST$ is easily obtained from our model data, as it depends only on the observed values and mean value of our response:

$$ SST = \sum_{i=1}^n (y_i - \bar{y})^2 $$

In R, this is:

sst <- with(panel.data, sum((lifeExp - mean(lifeExp))^2))

Without deriving the fitted values, $\hat{y}$, we can't calculate $SSR$ directly. I'll get into deriving fitted values later. For now, we can exploit the fact that $SST = SSE + SSR$ so as to derive $SSR$ as the difference between $SST$ and $SSE$, the sum of squared errors. We can calculate $SSE$ from the model residuals, following the least-squares criterion:

$$ SSE = \epsilon '\epsilon = (y - X\beta )' (y - X\beta ) $$

In R, this is:

m1.sse <- t(residuals(m1)) %*% residuals(m1)

Putting this all together, in R, we can derive R-squared as follows. Recall that, here, lifeExp signifies my dependent or response variable.

> sst <- with(panel.data, sum((lifeExp - mean(lifeExp))^2))
> m1.sse <- t(residuals(m1)) %*% residuals(m1)
> (sst - m1.sse) / sst

          [,1]
[1,] 0.7675404

We obtain $R^2 = 0.7675404$, which matches the multiple R-squared that lfe reports for the full (individual fixed effects) model. Why bother calculating this when lfe does it for free? In my work, I found that lfe and felm() choked on some two-ways panel models I was fitting but, if that's not a problem for you, just use lfe. In addition, plm's summary() doesn't report an adjusted R-squared for the full model, which is a better basis for comparison between models with differing numbers of parameters.

Adjusted R-squared is defined as:

$$ \bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1} = 1 - \frac{SSE\,(n-p-1)^{-1}}{SST\,(n-1)^{-1}} $$

Provided we have $R^2$, here denoted m1.r2, we can calculate $n$, the number of observations, and the adjusted R-squared statistic in R as follows:

m1.r2 <- (sst - m1.sse) / sst
N <- dim(panel.data)[1]
1 - (1 - m1.r2)*((N - 1)/(N - length(coef(m1)) - 1))
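One thing to watch in the calculation above is what counts toward $p$. As a hedged arithmetic check (in Python rather than R, using only numbers printed earlier in this post), the sketch below compares two choices: counting only the slope, which is what length(coef(m1)) returns, and also counting the 142 country fixed effects, one of which plays the role of the intercept. The function name `adjusted_r2` is made up for this example.

```python
n = 1704            # observations (142 countries x 12 years)
n_groups = 142      # individual fixed effects (countries)
n_slopes = 1        # slope coefficients (gdpPercap)
r2 = 0.7675404      # full-model R-squared computed above

def adjusted_r2(r2, n, p):
    """Adjusted R-squared for p predictors, not counting the intercept."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Counting only the slope, as length(coef(m1)) does
adj_slopes_only = adjusted_r2(r2, n, n_slopes)

# Counting the fixed effects too: one acts as the intercept, so the
# remaining n_groups - 1 are added to the slopes
adj_full = adjusted_r2(r2, n, n_slopes + n_groups - 1)

print(round(adj_slopes_only, 4), round(adj_full, 4))
```

This prints 0.7674 and 0.7464. The second value agrees with the adjusted R-squared that lfe reports for the full model, and its denominator, $n - p - 1 = 1561$, matches the residual degrees of freedom in that summary. So counting the fixed effects appears to be what lfe does.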

Calculating Fitted Values

You might think of extracting the fitted values with the fitted() function in R. It's not clear to me that this works; the values I get from fitted() for any of the models I've worked with are too small. Unfortunately, there's no documentation I can find as to how the fitted() function performs on plm() model instances; i.e., ?fitted.plm returns nothing, nor does a quick search online.

Luckily, we can recall from elementary statistics that the fitted values can also be calculated as the difference between our observed values and the residuals.

panel.data$lifeExp - residuals(m1)

A quick fitted-versus-observed plot shows we're not doing too badly with this model.

# An example fitted-vs-observed plot
plot(panel.data$lifeExp - residuals(m1), panel.data$lifeExp, asp = 1)
abline(0, 1, col = 'red', lty = 'dashed', lwd = 2)

Calculating Fitted Values for Hypothesis Testing

What if you want to conduct hypothesis testing using proposed values for one or more main effects? For instance, within the Gapminder example, we might ask the question: what if a certain country's per-capita GDP grew at a different rate than it actually did? This is harder to do because the predict() function in R doesn't work out-of-the-box with felm() or plm() models.

This is a clearly made-up example, but let's say that per-capita GDP world-wide was 10% lower in 2002 and 2007 than we actually observed. Below, I use the dplyr library to transform the data this way. If you're unfamiliar with dplyr, just know that, below, I am scaling back per-capita GDP estimates in 2002 and 2007 by 10%, merging those years with the rest of the data, and then formatting my data frame so it is arranged the same way as the original.

library(dplyr)
panel.data2 <- panel.data %>%
  filter(year >= 2002) %>%
  mutate(gdpPercap = gdpPercap - (gdpPercap * 0.10)) %>%
  rbind(filter(panel.data, year < 2002)) %>%
  arrange(country, year)

Now, let's test this hypothesis with plm(). Let's also develop a slightly more complex fixed effects regression model. We have data on population in each year, and this might have something to do with life expectancy; it's a time-varying estimate of our system, so let's include it in the model.

> m3 <- plm(lifeExp ~ gdpPercap + pop, data = panel.data, model = 'within',
  effect = 'individual', index = c('country'))
> coef(m3)
   gdpPercap          pop
3.936623e-04 6.196916e-08

We get a similar, though slightly smaller estimate of the effect of (change in) per-capita GDP on (change in) life expectancy. Now, how can we obtain the fitted values using the existing model, but new data? Again, linear algebra helps us find the answer. Recall the formula for the individual fixed effects model we saw earlier.

$$ y_{it} = \alpha_i + \beta x_{it} + \phi z_i + \varepsilon_{it} $$

Because $z_i$ is really just a repeated index for each subject (observed multiple times), every term is additive other than the main effects $\beta x_{it}$. This means it is easy to calculate the fitted values for a new set of values (for the same subjects). In the case of the individual fixed effects model, we compute the product of the main effects and the new observations of our $X$ matrix and add it to the fixed effects, which need to be repeated in the order they appear in the original design matrix. Because our Gapminder data are ordered by country name, then year, for 12 years, we need to repeat each country fixed effect 12 times. Below, the function fixef() allows us to extract the country fixed effects from a given model.

# Get the new X matrix with our new (hypothetical) values
new.X <- as.matrix(select(panel.data2, gdpPercap, pop))

# Repeat each fixed effect 12 times (12 observations for each subject)
fe <- rep(as.numeric(fixef(m3)), each = 12)

# Compute our predictions
preds <- (new.X %*% coef(m3)) + fe

It's important to note that this is not "out-of-sample prediction." That wouldn't make sense for a fixed effects regression and would, in fact, be misleading. We have fit fixed effects for each of our subjects (here, countries) and it wouldn't make sense to use this model for a different set of subjects. What we're doing here is testing a counterfactual, e.g., suppose that $X$ has a different value, what would be the effect on $Y$?

We can confirm we've calculated the fitted values correctly by returning to the original dataset and adding the residuals to our fitted values. The residuals add the random variation of our original, observed response back into our model; the result is a perfect fit, as seen in how the plot points line up perfectly along the 1:1 line.

old.X <- as.matrix(select(panel.data, gdpPercap, pop))

# Repeat each fixed effect 12 times (12 observations for each subject)
fe <- rep(as.numeric(fixef(m3)), each = 12)

# Compute our predictions THIS TIME with the residuals
preds <- (old.X %*% coef(m3)) + fe + residuals(m3)

plot(preds, panel.data$lifeExp, asp = 1)
abline(0, 1, col = 'red', lty = 'dashed', lwd = 2)

I haven't been able to figure out how to do this for the two-ways model (with time fixed effects) yet. The result should require the addition of the time fixed effects, but I'm not getting the right result.
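One way to reason about the two-ways case is through the equivalent dummy-variable regression, sketched below in Python with NumPy on a small simulated, balanced panel. This is not how plm or lfe compute their estimates; the data and the names `alpha`, `mu`, `ind`, and `per` are made up for illustration. The point is the normalization: with a full set of individual effects, the time effects are only identified relative to a reference period, so the overall level must not be counted twice. Whether the effects returned by fixef() follow this same normalization is something to verify.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical balanced panel: M individuals x T periods,
# sorted by individual and then by time
M, T = 6, 4
ind = np.repeat(np.arange(M), T)    # individual index for each row
per = np.tile(np.arange(T), M)      # time period index for each row
x = rng.normal(size=M * T)
true_alpha = rng.normal(size=M)
true_mu = rng.normal(size=T)
y = true_alpha[ind] + true_mu[per] + 1.5 * x + rng.normal(scale=0.1, size=M * T)

# Dummy-variable fit: slope, M individual effects, and T - 1 time effects
# (the first period is the reference level, so its effect is fixed at zero)
D_ind = (ind[:, None] == np.arange(M)).astype(float)
D_per = (per[:, None] == np.arange(1, T)).astype(float)
coefs, *_ = np.linalg.lstsq(np.column_stack([x, D_ind, D_per]), y, rcond=None)
beta = coefs[0]
alpha = coefs[1:M + 1]
mu = np.concatenate([[0.0], coefs[M + 1:]])

# Fitted values: main effect plus both sets of fixed effects, indexed by row
fitted = beta * x + alpha[ind] + mu[per]
resid = y - fitted
```

For new (counterfactual) values of `x`, the same expression applies with `beta`, `alpha`, and `mu` held fixed.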

The above examples work for lfe::felm() models, too, as fixef() can also extract the fixed effects from models fit with the lfe package.

Model Diagnostics

Two critical assumptions of any linear model, including linear fixed-effects panel models, are constant variance (homoskedasticity) and normally distributed errors [6]. We might also want to determine the leverage of our observations to see if there are any highly influential points (which might be outliers). In addition, since we're working with spatial data (in this case), we'll do a crude check for spatial autocorrelation in the residuals, which, if present, would be problematic for inference.

Normally Distributed Errors

This is a quick check using a couple of built-in functions. We want to determine that the curve of our residuals in a QQ-Normal plot follows a straight line. Some curvature in the tails is to be expected; it is a somewhat subjective test but this assumption is also commonly satisfied with observational data.

qqnorm(residuals(m3), ylab = 'Residuals')
qqline(residuals(m3))

We might also examine a histogram of the residuals.

hist(residuals(m3), xlab = 'Residuals')

Constant Variance

Here, a plot of the residuals against the fitted values should reveal whether or not there is non-constant variance (heteroskedasticity) in the residuals, which would violate one of the assumptions of our linear model. We look for whether or not the spread of points is uniform in the y-direction as we move along the x-axis (constant variance along the x-axis).

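# Fitted values are the observed values minus the residuals
res <- as.numeric(residuals(m3))
plot(panel.data$lifeExp - res, res, xlab = 'Fitted values', ylab = 'Residuals')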

Checking for Influential Observations

To derive the leverage of the observations, we wish to derive the hat matrix (or "projection matrix") from our linear model. This is the matrix given by the linear transformation by which we obtained the estimated coefficients for our model, $\hat{\beta}$.

$$ X\hat{\beta} = X(X^T X)^{-1} X^Ty = Py $$

In R, we use matrix multiplication and the solve() function (to obtain the inverse of a matrix). With the faraway package, we can draw a half-normal plot which sorts the observations by their leverage.

# Design matrix: an intercept column plus the predictor (not the response)
X <- cbind(1, as.matrix(panel.data[, 'gdpPercap', drop = FALSE]))
P <- X %*% solve(t(X) %*% X) %*% t(X)

require(faraway)
# Create `labs` (labels) for 1 through 1704 observations
halfnorm(diag(P), labs = 1:1704, ylab = 'Leverages', nlab = 1)

This theory and approach are based on the general linear model and ordinary least squares (OLS) regression, corresponding to models built with lm() in R. Some adjustment may be necessary to calculate leverage correctly for fixed effects models. However, this approach based on OLS seems to work pretty well; the points with the most leverage I find correspond to Kuwait in the 1950s through 1970s, a period when the country's per-capita GDP went on a roller coaster, reaching the highest value seen in the dataset.
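Only the diagonal of the hat matrix is needed for the leverages, so the full $n \times n$ matrix never has to be formed. The sketch below (Python with NumPy, toy data, and a hypothetical helper named `leverages`) gets the diagonal from a thin QR decomposition of the design matrix and checks it against the explicit projection matrix. It assumes plain OLS with an intercept, as in the paragraph above, not a fixed-effects design.

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix X (X'X)^{-1} X' via a thin QR decomposition."""
    Q, _ = np.linalg.qr(X)           # reduced QR: Q has shape (n, p)
    return np.sum(Q**2, axis=1)      # h_ii is the squared norm of row i of Q

# Hypothetical predictor with one extreme value
x = np.array([1.0, 2.0, 3.0, 4.0, 50.0])
X = np.column_stack([np.ones_like(x), x])    # intercept + predictor

h = leverages(X)

# Same result from the explicit n x n projection matrix
P = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(h, np.diag(P)), h.argmax())
```

The leverages sum to the number of columns in the design matrix, so the average leverage is $p/n$. That gives a rough baseline against which a threshold can be judged.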

If we need to remove influential observations, in order to maintain a balanced panel, it's important to remove all of the observations for that individual; that is, to remove that individual entirely from the panel, rather than just the observation(s) at those time period(s). In R, we do this by querying those points that have a leverage above a certain threshold, say 0.004; here, the individual units of observation are denoted by unique country names.

# Index rows by position so this works whether or not P has row names
panel.data.mod <- subset(panel.data,
  !(country %in% panel.data[which(diag(P) > 0.004), 'country']))

References

Gaure, S. 2013. lfe: Linear Group Fixed Effects. The R Journal 5(2):104–117.
Croissant, Y., and G. Millo. 2008. Panel data econometrics in R: The plm package. Journal of Statistical Software 27(2):1–43.
Halaby, C. N. 2004. Panel Models in Sociological Research: Theory into Practice. Annual Review of Sociology 30(1):507–544.
Gelman, A., and J. Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. New York, New York, USA: Cambridge University Press.
Allison, P. D. 2009. Fixed Effects Regression Models ed. T. F. Liao. Thousand Oaks, California, U.S.A.: SAGE.
Crawley, M. J. 2015. Statistics: An Introduction Using R 2nd ed. Chichester, West Sussex, United Kingdom: John Wiley & Sons.
Bolker, B. M., M. E. Brooks, C. J. Clark, S. W. Geange, J. R. Poulsen, M. H. H. Stevens, and J. S. S. White. 2009. Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology and Evolution 24(3):127–135.

The Tasseled Cap transformation and band ratios: Applications for urban studies

2015-08-03T09:00:00+02:00

Urban environments are heterogeneous at relatively small scales and composed of a variety of land covers, chiefly including impervious surface, green vegetation (e.g., the urban canopy), and soil. Ridd (1995) demonstrated that urban land cover was dominated by these three types, summarized as the Vegetation-Impervious-Soil (VIS) model. The model suffers due to the vagaries of any urban land cover analysis: There are multiple subtypes of each land cover with different spectral characteristics (e.g., multiple impervious surface colors, different vegetation species) and impervious surface and soil are easily confused. Nonetheless, the VIS model is still useful and remains one of the most popular, widely applicable descriptions of urban land cover composition (e.g., Phinn et al. 2002; Small 2003; Kärdi 2007).

How can we identify VIS areas with remote sensing? If we can spectrally separate vegetation, impervious surface, and soil then we may be able to classify a multispectral image on that basis. At the resolution of many earth-observing sensors (e.g., Landsat MSS, Landsat TM/ETM+, SPOT), however, the ground sample distance of a pixel is much larger than the minimum mapping unit of urban areas (Small 2003). That is, many if not most of the pixels in the scene are mixtures of two or more VIS components (e.g., vegetation and impervious surface co-located in a suburban development). Linear spectral mixture analysis (LSMA) is one way of modeling land cover mixtures to estimate the abundance of VIS components on the ground. LSMA can be challenging and time-consuming to implement, however.

As an alternative, some investigators have introduced land cover indices based on band ratios that attempt to capture the spatial variability in the three VIS components. The Normalized Difference Vegetation Index (NDVI) is probably the most popular example of a land cover index based on band ratios. NDVI wasn't expressly intended for use in urban areas but many more recent developments are. These include the Biophysical Composition Index (BCI; Deng and Wu 2012), the Normalized Difference Impervious Surface Index (NDISI; Xu 2010), the Modified NDISI (MNDISI; Liu et al. 2013), and the Ratio Normalized Difference Soil Index (RNDSI; Deng et al. 2015).

Here, I'll focus on only the BCI as it is based on the well-known tasseled cap transformation (Kauth and Thomas 1976), a coordinate transformation that is similar to principal components analysis (PCA) and related dimension reduction techniques that I've used in spectral mixture analysis. I'll present an overview of BCI, the tasseled cap, and examples of implementing both in Python.

The Tasseled Cap Transformation

Calculating the BCI requires the tasseled cap transformation, a coordinate transformation of multispectral remote sensing data. Tasseled cap was originally developed for Landsat MSS 4-band data. As such, the tasseled cap transformation, as initially described, produced just two principal components termed "brightness" and "greenness." These principal components have physical meanings, however, and Kauth and Thomas (1976) and later investigators (Crist and Kauth 1986) demonstrated how it could be used to track changes in the phenology of vegetation, particularly agriculture.

The lifecycle of agricultural land proceeds from bare soil to emerging vegetation, to mature green vegetation that completely covers the visible ground area, to senescent yellow vegetation, and finally to bare soil again (Kauth and Thomas 1976, Figure 2). This lifecycle is accompanied by changes in spectral reflectance of the agricultural land and if pixels in each stage are plotted in multispectral feature space, they trace a line through the space that corresponds to the temporal evolution of the landscape. They also clearly fall into separate regions. If a new coordinate system is aligned with these features it is possible to compress over 95% of the information stored in four (4) MSS bands into just two bands ("brightness" and "greenness"). The tasseled cap's name is derived from its appearance in feature space once the coordinate transformation is performed.

The tasseled cap transformation has a compact mathematical notation but requires a table of rotation coefficients (see Kauth and Thomas 1976) for the $p\times p$ rotation matrix, $R$. A $p \times n$ matrix $x$ of multispectral data (comprising $p$ spectral bands across $n$ pixels) is transformed to a matrix $u$ by means of the following rotation, where $e$ is an optional translation that can be used to avoid negative values in the transformed data:

$$ u = Rx + e $$

Kauth and Thomas provided the coefficients in $R$ for the Landsat MSS sensor in their 1976 paper that introduced the transformation. Those Landsat MSS coefficients were based on MSS data provided in counts. Since then, others have presented versions of the tasseled cap transformation for other platforms and sensors such as Landsat TM surface reflectance (Crist and Cicone 1984), Landsat ETM+ top-of-atmosphere (TOA) reflectance (Huang et al. 2002), MODIS (Lobser et al. 2007), WorldView-2 (Yarbrough et al. 2014), and Landsat 8 TOA reflectance (Baig et al. 2014).

Below is a simple Python implementation of the tasseled cap transformation, without support for the optional translation. The coefficients are those for Landsat TM reflectance factor data (via Crist 1985, as noted in the function's docstring) but could be replaced with the relevant coefficients for any other sensor data. Landsat surface reflectance data can be obtained easily from the USGS EROS Science Processing Architecture (ESPA). The function accepts a Landsat TM surface reflectance raster as either a GDAL data source or a $p\times m\times n$ NumPy array where $p$ is the number of bands, and $m, n$ refer to the number of rows and columns, respectively.

import numpy as np

def tasseled_cap_tm(rast, reflectance=True):
    '''
    Applies the Tasseled Cap transformation for TM data. Assumes that the TM
    data are TM reflectance data (i.e., Landsat Surface Reflectance). The
    coefficients for reflectance factor data are taken from Crist (1985) in
    Remote Sensing of Environment 17:302.
    '''
    if reflectance:
        # Reflectance factor coefficients for TM bands 1-5 and 7; they are
        #   entered here in tabular form so they are already transposed with
        #   respect to the form suggested by Kauth and Thomas (1976)
        r = np.array([
            ( 0.2043, 0.4158, 0.5524, 0.5741, 0.3124, 0.2303),
            (-0.1603,-0.2819,-0.4934, 0.7940,-0.0002,-0.1446),
            ( 0.0315, 0.2021, 0.3102, 0.1594,-0.6806,-0.6109),
            (-0.2117,-0.0284, 0.1302,-0.1007, 0.6529,-0.7078),
            (-0.8669,-0.1835, 0.3856, 0.0408,-0.1132, 0.2272),
            ( 0.3677,-0.8200, 0.4354, 0.0518,-0.0066,-0.0104)
        ], dtype=np.float32)

    else:
        raise NotImplementedError('Only support for Landsat TM reflectance has been implemented')

    # Can accept either a gdal.Dataset or numpy.array instance; a
    #   gdal.Dataset has no "shape" so it must be read into an array first
    if not isinstance(rast, np.ndarray):
        rast = rast.ReadAsArray()

    shp = rast.shape
    x = rast.reshape(shp[0], shp[1]*shp[2])

    return np.dot(r, x).reshape(shp)

When I first implemented this in Python, I naturally wanted to check to make sure I had it right. So, I plotted the first three tasseled cap (TC) components, Brightness, Greenness, and Wetness, in feature space. We can compare these images to the figures in Kauth and Thomas (1976) or Crist and Cicone (1984) to make sure we're on the right track.

The left image shows the Brightness-Greenness plane, where we can see the eponymous tasseled cap itself. The middle image is the Brightness-Wetness plane and while it doesn't look exactly like the figure in Crist and Cicone (1984), it's not wildly different. The right image is the so-called "transition zone" view; the Wetness-Greenness plane. The transition zone view looks correct although it has a very interesting fork at the top part of the volume.

The Biophysical Composition Index (BCI)

Now that I can calculate the tasseled cap components I can also calculate the BCI. The BCI was developed as a one-dimensional measure of urban land cover composition based on the observation that the average values of the Brightness and Wetness tasseled cap components are much higher than the average value of the Greenness component. Deng and Wu (2012) proposed a normalized difference approach that compresses the information in the first three tasseled cap components into a measure defined on the interval $[-1, 1]$. It is defined:

$$ BCI = \frac{0.5(H + L) - V}{0.5(H + L) + V} $$

Where $H, V, L$ refer to the normalized first three tasseled cap components, $TC1, TC2, TC3$, normalized such that:

$$ H = \frac{\mathrm{TC1} - \mathrm{TC1}_{min}}{\mathrm{TC1}_{max} - \mathrm{TC1}_{min}} \quad V = \frac{\mathrm{TC2} - \mathrm{TC2}_{min}}{\mathrm{TC2}_{max} - \mathrm{TC2}_{min}} \quad L = \frac{\mathrm{TC3} - \mathrm{TC3}_{min}}{\mathrm{TC3}_{max} - \mathrm{TC3}_{min}} $$

Below is a simple Python implementation of the BCI.

def biophysical_composition_index(rast, nodata=-9999):
    '''
    Calculates the biophysical composition index (BCI) of Deng and Wu (2012)
    in Remote Sensing of Environment 127. The input raster is expected to be
    a tasseled cap-transformed raster. The NoData value is assumed to be
    negative (could never be the maximum value in a band).
    '''
    # Can accept either a gdal.Dataset or numpy.array instance; a
    #   gdal.Dataset has no "shape" so it must be read into an array first
    if not isinstance(rast, np.ndarray):
        rast = rast.ReadAsArray()

    shp = rast.shape
    x = rast.reshape(shp[0], shp[1]*shp[2])

    unit = np.ones((1, shp[1] * shp[2]))

    stack = []
    for i in range(0, 3):
        # Calculate the minimum values after excluding NoData values
        tcmin = np.setdiff1d(x[i, ...].ravel(), np.array([nodata])).min()

        # Calculate the normalized TC component TC_i for i in {1, 2, 3}
        stack.append(np.divide(np.subtract(x[i, ...], unit * tcmin),
            unit * (x[i, ...].max() - tcmin)))

    # Unpack the High-albedo, Vegetation, and Low-albedo components
    h, v, l = stack

    return np.divide(
        np.subtract(np.divide(np.add(h, l), unit * 2), v),
        np.add(np.divide(np.add(h, l), unit * 2), v))\
        .reshape((1, shp[1], shp[2]))
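For comparison, the same calculation can be written more compactly with NumPy masked arrays, which keep NoData out of both the minimum and the maximum. This is a sketch rather than a drop-in replacement: the function name `bci_masked` is hypothetical and it assumes a $p\times m\times n$ NumPy array whose first three bands are Brightness, Greenness, and Wetness.

```python
import numpy as np

def bci_masked(tc, nodata=-9999):
    '''
    Sketch of the BCI for a (p x m x n) tasseled cap array; NoData pixels
    are masked in the input and filled with NoData in the output.
    '''
    x = np.ma.masked_equal(
        tc[0:3, ...].reshape(3, -1).astype(np.float64), nodata)
    # Per-band minimum and maximum over the valid pixels only
    tcmin = x.min(axis=1).reshape(3, 1)
    tcmax = x.max(axis=1).reshape(3, 1)
    # Normalized High-albedo, Vegetation, and Low-albedo components
    h, v, l = (x - tcmin) / (tcmax - tcmin)
    mid = 0.5 * (h + l)
    bci = (mid - v) / (mid + v)
    return bci.filled(nodata).reshape((1,) + tc.shape[1:])
```

A pixel at the Greenness maximum with minimal Brightness and Wetness gets a BCI of -1, while a pixel at the Greenness minimum with maximal Brightness and Wetness gets +1, which matches the interval given above.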

Deng and Wu found a significant positive correlation between the BCI and impervious surface and a significant negative correlation between the BCI and vegetation cover. I can confirm that BCI seems to be an excellent indicator of these land covers, as it is positively correlated with the National Land Cover Dataset (NLCD) impervious surface fraction. The 2001 NLCD impervious layer (left) and a BCI image calculated from a July 2002 Landsat TM image (right) are shown below; Pearson's correlation coefficient between the datasets is $r=0.624$.
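A comparison like this one can be sketched in a few lines of NumPy. The function below is illustrative only (the name `raster_correlation` is my own) and assumes the two rasters are already co-registered on the same grid and share a NoData value.

```python
import numpy as np

def raster_correlation(a, b, nodata=-9999):
    '''
    Pearson's r between two co-registered rasters of the same shape,
    using only the pixels that are valid in both.
    '''
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    valid = (a != nodata) & (b != nodata)
    # np.corrcoef returns the 2x2 correlation matrix; r is off-diagonal
    return np.corrcoef(a[valid], b[valid])[0, 1]
```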

I also compared the BCI to measures of vegetation, impervious surface, and soil derived through spectral mixture analysis of a 1998 Landsat TM image. The correlation matrix is below and indicates that these independently derived land covers are strongly correlated with the BCI. In this case, however, the soil fraction from spectral mixture analysis is likely a mixed endmember with non-photosynthetic vegetation and therefore cannot be readily interpreted in this context.

Pearson's Correlation Coefficients (r)

              BCI   Veg.    Imp.   Soil
----------  ----- ------  ------  -----
BCI         1.000 -0.877   0.838  0.538
Vegetation         1.000  -0.724 -0.810
Impervious                 1.000  0.182
Soil                              1.000

References

Ridd, M. 1995. Exploring a VIS (vegetation-impervious surface-soil) model for urban ecosystem analysis through remote sensing: comparative anatomy for cities. International Journal of Remote Sensing 16 (12):2165–2185.
Phinn, S., M. Stanford, P. Scarth, A. T. Murray, and P. Shyy. 2002. Monitoring the composition of urban environments based on the vegetation-impervious surface-soil (VIS) model by subpixel analysis techniques. International Journal of Remote Sensing 23 (20):4131–4153.
Small, C. 2003. High spatial resolution spectral mixture analysis of urban reflectance. Remote Sensing of Environment 88 (1-2):170–186.
Kärdi, T. 2007. Remote sensing of urban areas: Linear spectral unmixing of Landsat Thematic Mapper images acquired over Tartu (Estonia). Proc. Estonian Acad. Sci. Biol. Ecol 56 (1):19–32.
Deng, C., and C. Wu. 2012. BCI: A biophysical composition index for remote sensing of urban environments. Remote Sensing of Environment 127:247–259.
Xu, H. 2010. Analysis of Impervious Surface and its Impact on Urban Heat Environment using the Normalized Difference Impervious Surface Index (NDISI). Photogrammetric Engineering & Remote Sensing 76 (5):557–565.
Liu, C., Z. Shao, M. Chen, and H. Luo. 2013. MNDISI: A multi-source composition index for impervious surface area estimation at the individual city scale. Remote Sensing Letters 4 (8):803–812.
Deng, Y., C. Wu, M. Li, and R. Chen. 2015. RNDSI: A ratio normalized difference soil index for remote sensing of urban/suburban environments. International Journal of Applied Earth Observation and Geoinformation 39:40–48.
Kauth, R. J., and G. S. Thomas. 1976. The tasseled cap - A graphic description of the spectral-temporal development of agricultural crops as seen by Landsat. In Proceedings of the Symposium on Machine Processing of Remotely Sensed Data, West Lafayette, Indiana, U.S.A, 29 June-1 July 1976, 41–51.
Crist, E. P., and R. J. Kauth. 1986. The Tasseled Cap de-mystified. Photogrammetric Engineering and Remote Sensing 52 (1):81–86.
Crist, E. P., and R. C. Cicone. 1984. A Physically-Based Transformation of Thematic Mapper Data---The TM Tasseled Cap. IEEE Transactions on Geoscience and Remote Sensing GE-22 (3):256–263.
Huang, C., B. Wylie, L. Yang, C. Homer, and G. Zylstra. 2002. Derivation of a tasseled cap transformation based on Landsat 7 at-satellite reflectance. International Journal of Remote Sensing 23 (8):1741–1748.
Lobser, S. E., and W. B. Cohen. 2007. MODIS tasseled cap: land cover characteristics expressed through transformed MODIS data. International Journal of Remote Sensing 28 (22):5079–5101.
Yarbrough, L. D., K. Navulur, and R. Ravi. 2014. Presentation of the Kauth–Thomas transform for WorldView-2 reflectance data. Remote Sensing Letters 5 (2):131–138.
Baig, M. H. A., L. Zhang, T. Shuai, and Q. Tong. 2014. Derivation of a tasseled cap transformation based on Landsat 8 at-satellite reflectance. Remote Sensing Letters 5 (5):423–431.

Masking saturated pixels may improve spectral mixture analysis

2015-07-28T15:05:00+02:00

The Landsat Surface Reflectance (SR) product sometimes contains saturation in one or more bands (a value of 16,000 reflectance units or 160% reflectance). Presumably, these correspond to saturation at the detector; the same kind of saturation that is likely to occur over clouds or snow-covered areas. Such clouds and snow can be masked out with a provided masking layer (like CFMask) but there are other land covers like bright soil, deserts, water, and impervious surface that can saturate one or more bands (e.g., Asner and Heidebrecht 2010; Rashed et al. 2001).

In the summer ("leaf-on" season), we can be confident that, in most areas, snow simply isn't a possible land cover type. Urban areas are full of bright targets, however, and it seems that almost every surface reflectance image contains a few saturated pixels not due to clouds or snow. The saturation value (16,000 for Landsat SR; 20,000 for Landsat top-of-atmosphere reflectance) may not be present in all bands; some targets are only thermally bright while others can produce specular reflections in the optical bands (e.g., sun glint). Thus, it seems that these bright targets aren't always masked by the included quality assurance (QA) layer; there is still spectral reflectance information in the bands that aren't saturated.

Saturated pixels are easily masked; I'll provide some sample Python code that shows how. But can saturated pixels be ignored or are they a problem for certain types of analyses? I'm currently in the midst of a study of urban reflectance in Southeast Michigan, using spectral mixture analysis to estimate the fractional abundance of certain land cover types. To improve the abundance estimates (by improving the signal-to-noise ratio of the input) and to reduce throughput, I use a minimum noise fraction (MNF) transformation, akin to a principal components rotation that projects the Landsat TM/ETM+ reflectance to an orthogonal basis. Although I had masked these saturated pixels from the start I began to wonder what the difference would be if I had left them in.

The example I present here is for a Landsat 7 ETM+ SR image that has been clipped to Oakland County, Michigan (from within WRS-2 path 20, row 30). The Landsat 7 ETM+ image was captured in July 1999 (ordinal day 196); the image identifier is LE70200301999196EDC00. The original SR image had the QA mask, provided by the USGS, applied to the image in both cases: where saturation was masked and where it was not.

Visualizing Saturation in the Mixing Space

If saturated pixels are left in place prior to the transformation, the transformed values are scaled differently than when those pixels are removed first. This can be seen "heads-up" in a GIS by toggling between layers: the MNF-transformed image with saturated values (below, left) and the one without (below, right).

It can also be seen in the histograms (left, with saturation; right, without).

It's interesting to note that the raster image with saturation shows much brighter soil/cropland areas (yellow in the image) than the raster image where saturated values were masked. This would seem to suggest that leaving the fully (and partially) saturated pixels in place prior to transformation could improve the accuracy of the abundance estimation. The change in the histograms is not so easy to interpret, however.

Another way of visualizing the difference between these two images is to examine their feature spaces. Below is an example of the mixing space of the image where saturated pixels were not masked; a 2D slice of the feature space showing the first three MNF components. These are the three components with the highest signal-to-noise ratio (SNR) where the SNR of the first component is greater than that of the second, and so on. We see that there are several pixels in this mixing space that are far-flung from the main volume.

If you're unaccustomed to looking at images like the one above, take my word for it: This looks wrong. Or don't take my word for it; see Tompkins et al. (1997), Wu and Murray (2003) or Small (2004) for examples of mixture spaces. But then what does the feature space look like for the transformed image where saturation was masked? Below, right, is the feature space for that image. On the left, again, is the image where saturation persists, yet I've "zoomed in" to the main mixture volume to facilitate a fair comparison between these images.

Now these mixing spaces look like what we see in the literature: They should be well-defined convex volumes. They should resemble a ternary diagram. We can see that in the left-hand plot (with saturation), the third MNF component (represented in color) has such a wide range of values that the mixture volume is all one color; there are extreme values in this component that are not in view. Those must be the pixels we initially saw scattered throughout the space. The right-hand plot (without saturation) seems properly scaled. Note that I've indicated the locations of pixels with known ground cover types in both plots.

Masking Saturation in Python with GDAL and NumPy

How did I mask the saturated values, by the way? It can be tricky when one or more but not all of the bands are saturated at a given pixel. Below is a Python function, utilizing GDAL and NumPy, that demonstrates how this might be done.

from osgeo import gdal
import numpy as np

def mask_saturation(rast, nodata=-9999):
    '''
    Masks out saturated values (surface reflectances of 16000). Arguments:
        rast    A gdal.Dataset or NumPy array
        nodata  The NoData value; defaults to -9999.
    '''
    # Can accept either a gdal.Dataset or numpy.array instance
    if not isinstance(rast, np.ndarray):
        rast = rast.ReadAsArray()

    # Create a baseline "nothing is saturated in any band" raster
    mask = np.zeros((1, rast.shape[1], rast.shape[2]), dtype=bool)

    # Update the mask for saturation in any band
    for i in range(rast.shape[0]):
        np.logical_or(mask, rast[i,...] == 16000, out=mask)

    # Repeat the NoData value across the bands
    np.place(rast, mask.repeat(rast.shape[0], axis=0), (nodata,))
    return rast
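The same idea can also be expressed without a loop. The sketch below is an alternative, not the function I used; the name `mask_saturation_any` and its `saturated` argument are hypothetical, and unlike the function above it returns a new array instead of modifying the input in place.

```python
import numpy as np

def mask_saturation_any(rast, saturated=16000, nodata=-9999):
    '''
    Returns a copy of a (p x m x n) array in which every band is set to
    NoData wherever at least one band holds the saturation value.
    '''
    # True for each pixel that is saturated in one or more bands
    any_saturated = (rast == saturated).any(axis=0)
    # The (m x n) mask broadcasts across the band axis
    return np.where(any_saturated, nodata, rast)
```

For a two-band image in which only the first band is saturated at a pixel, both bands come back as NoData at that pixel, which is the behavior wanted for partially saturated pixels.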

What Leads to Better Estimates?

The differences I've shown above are interesting but what actually leads to more accurate estimation of fractional land cover abundances in spectral mixture analysis? I performed a linear spectral mixture analysis on the MNF-transformed images using the same four (4) image endmembers: Impervious surface, Green vegetation, Soil, and Shade. The shade endmember is actually photometric shade: zero reflectance in all bands. Both water and "NoData" pixels are masked to this value as the linear algebra libraries used cannot work with "NoData" values and masked arrays in NumPy have serious performance drawbacks.

One measure of the "fit" between our abundance estimates and reality is to compare the observed to predicted Landsat ETM+ reflectance, where predicted reflectance is computed by a forward model using the endmember spectra and their estimated fractional abundance in a given pixel. It is standard practice (e.g., Rashed et al. 2003; Wu and Murray, 2003) to calculate this "fit" as the root mean squared error (RMSE) between the observed reflectance and the predicted reflectance. Powell et al. (2007) present a formula for the RMSE of a pixel $i$ that is normalized by the number of endmembers, $M$:

$$ RMSE_i = \left(M^{-1} \sum_{k=1}^M e_{(i,k)}^2 \right)^{1/2} $$

I then normalized the sum of the RMSE values across a large, random subset of the pixels (for performance considerations) by the range in reflectances, $r$:

$$ \% RMSE = \frac{\sum_{i=0}^N RMSE_i}{r_{max} - r_{min}} $$

As the minimum reflectance is always zero (given the presence of shade), $r_{min} \equiv 0$. The abundance images are each compared to their respective, original Landsat ETM+ image; with or without saturated pixels masked out, as the case may be.
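The two formulas above can be sketched in NumPy as follows. The function names and array layouts are assumptions made for illustration: I assume the residuals $e_{(i,k)}$ come from a linear forward model and are stored with one row per term $k$ in the sum and one column per pixel $i$.

```python
import numpy as np

def pixel_rmse(observed, endmembers, abundances):
    '''
    observed:    (k x n) array, k values for each of n pixels
    endmembers:  (k x q) array, one column per endmember spectrum
    abundances:  (q x n) array of estimated fractional abundances
    '''
    # Forward model: each pixel is a linear mixture of endmember spectra
    predicted = np.dot(endmembers, abundances)
    residuals = observed - predicted
    # Root of the mean squared residual for each pixel
    return np.sqrt((residuals ** 2).mean(axis=0))

def percent_rmse(rmse, r_max, r_min=0):
    # Sum of the per-pixel RMSE normalized by the range in reflectance,
    #   following the second formula above
    return rmse.sum() / (r_max - r_min)
```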

Percent RMSE was calculated for both abundance estimates (with and without saturation) where abundance was estimated two different ways each time: Through a non-negative least squares (NNLS) estimation and through a fully-constrained least squares (FCLS) estimation. With FCLS, abundance estimates are constrained to be non-negative and to sum to one, which places each estimate on the interval $[0,1]$. With NNLS, the sum-to-one constraint is dropped. The final abundance images, with or without saturated pixels, look very similar. The FCLS image from the estimate without saturated pixels is below. Impervious surface is mapped to red, vegetation to green, and soil to blue; each pixel's abundance estimates were re-summed to one after shade was subtracted.
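The re-summing step used for display can be sketched as below. This is an illustration under my own assumptions, not the code used for the figure: the function name `shade_normalize` is hypothetical and abundances are assumed to be stored as a $q \times n$ array with shade as the last endmember.

```python
import numpy as np

def shade_normalize(abundances, shade_index=-1):
    '''
    Drops the shade endmember from a (q x n) abundance array and rescales
    the remaining fractions in each pixel so that they sum to one.
    '''
    others = np.delete(abundances, shade_index, axis=0)
    total = others.sum(axis=0, keepdims=True)
    # Pixels that are entirely shade have nothing to rescale; they are
    #   left at zero instead of dividing by zero
    return np.divide(others, total,
        out=np.zeros(others.shape, dtype=np.float64), where=total > 0)
```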

Results from the validation are in the table below.

=============================
Saturation?   Approach  %RMSE
-----------   --------  -----
Yes           NNLS       7.9%
No            NNLS       9.2%
Yes           FCLS       1.2%
No            FCLS       3.2%
=============================

It would seem that leaving saturated pixels unmasked results in improved modeling of the ETM+ reflectance. This is only a first approximation of the accuracy, however; we don't really care about the modeled reflectance. The true validation is in comparing the abundance estimates to ground data.

Validation of Abundance Estimates

Shade and soil were subtracted from the abundance images before validation against manually interpreted land cover from aerial photographs in 90-meter windows centered on random sampling points. 169 sample points were located in Oakland County and the coefficient of determination $\left( R^2 \right)$, RMSE, and mean absolute error (MAE) were calculated for each of the abundance images.
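For reference, the three statistics can be computed as in the sketch below. The function name is hypothetical and $R^2$ is taken here to be the squared Pearson correlation between observed and predicted values, which is an assumption on my part; other definitions of $R^2$ exist.

```python
import numpy as np

def validation_stats(observed, predicted):
    '''
    Returns (N, R^2, MAE, RMSE) for paired observed and predicted values.
    R^2 is computed as the squared Pearson correlation (an assumption).
    '''
    obs = np.asarray(observed, dtype=np.float64).ravel()
    pred = np.asarray(predicted, dtype=np.float64).ravel()
    err = pred - obs
    r2 = np.corrcoef(obs, pred)[0, 1] ** 2
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    return (obs.size, r2, mae, rmse)
```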

Without saturation:

============================
N     R^2     MAE     RMSE
---   ------  ------  ------
169   0.4742  0.1534  0.1998
============================

With saturated pixels retained:

============================
N     R^2     MAE     RMSE
---   ------  ------  ------
169   0.4360  0.1557  0.2069
============================

Conclusion

Despite the better performance in the forward model, the image in which saturated pixels were retained provided worse estimates than the image in which they were masked. Removing the saturated pixels seems to improve the accuracy of abundance estimates. The magnitude of the improvement is small and we cannot be sure, from this test case alone, that the improvement is not simply a coincidence. That is, while removing these saturated pixels improved the results in this case, it may be that the saturated pixels in another image conspire to produce results equal to or better than those obtained when they are removed.

However, given the topology of the feature space, the most likely explanation for this marginal improvement in accuracy is the removal of saturated pixels. And perhaps the most important reason for masking saturated pixels is that it makes interpretation of the mixing space much easier.

References

Asner, G. P., and K. Heidebrecht. 2010. Spectral unmixing of vegetation, soil and dry carbon cover in arid regions: Comparing multispectral and hyperspectral observations. International Journal of Remote Sensing 23 (19):3939–3958.
Rashed, T., J. R. Weeks, M. S. Gadalla, and A. G. Hill. 2001. Revealing the Anatomy of Cities through Spectral Mixture Analysis of Multispectral Satellite Imagery: A Case Study of the Greater Cairo Region, Egypt. Geocarto International 16 (4):7–18.
Rashed, T., J. R. Weeks, D. A. Roberts, J. Rogan, and R. L. Powell. 2003. Measuring the Physical Composition of Urban Morphology Using Multiple Endmember Spectral Mixture Models. Photogrammetric Engineering & Remote Sensing 69 (9):1011–1020.
Green, A. A., M. Berman, P. Switzer, and M. D. Craig. 1988. A transformation for ordering multispectral data in terms of image quality with implications for noise removal. IEEE Transactions on Geoscience and Remote Sensing 26 (1):65–74.
Tompkins, S., J. F. Mustard, C. M. Pieters, and D. W. Forsyth. 1997. Optimization of Endmembers for Spectral Mixture Analysis. Remote Sensing of Environment 59 (3):472–489.
Wu, C., and A. T. Murray. 2003. Estimating impervious surface distribution by spectral mixture analysis. Remote Sensing of Environment 84 (4):493–505.
Small, C. 2004. The Landsat ETM+ spectral mixing space. Remote Sensing of Environment 93 (1-2):1–17.
Powell, R. L., D. A. Roberts, P. E. Dennison, and L. L. Hess. 2007. Sub-pixel mapping of urban land cover using multiple endmember spectral mixture analysis: Manaus, Brazil. Remote Sensing of Environment 106 (2):253–267.

Clipping rasters in Python

2015-07-17T12:30:00+02:00

Clipping rasters can be trivial with a desktop GIS like QGIS or with command line tools like GDAL. However, I recently ran into a situation where I needed to clip large rasters in an automated, online Python process. It simply wouldn't do to interrupt the procedure and clip them myself. The Python bindings for GDAL/OGR are pretty neat but they are very low-level; how could I use Python to clip the rasters as part of a continuous Python session?

Jared Erickson posted an excellent tutorial on this topic, one of many in his "Python GDAL/OGR Cookbook". Even this simple example hints at how complicated something like clipping can be with the low-level GDAL/OGR API. However, there are still some limitations to the example he provided:

It only works for single clipping features.
It only works for multiple-band rasters.
The clipping features must be bounded within the raster.

Here I'll present a solution to the second and third points: Clipping a raster, stacked or single-band, with clipping features that extend beyond its bounds.

Improvement: Support for Single-Band Images

This is an easy fix. We simply need to use a try...except sequence of statements to anticipate the error in our NumPy indexing when a single-band raster is presented.

# Multi-band image?
try:
    clip = rast[:, ulY:lrY, ulX:lrX]

# Nope: Must be single-band
except IndexError:
    clip = rast[ulY:lrY, ulX:lrX]
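An alternative to catching the exception is to check the number of array dimensions up front. The helper below is a hypothetical sketch of that design choice, not part of the original function; it assumes band-first arrays like those returned by GDAL's `ReadAsArray()`.

```python
import numpy as np

def slice_window(rast, ulX, ulY, lrX, lrY):
    '''
    Slices a pixel window from a single-band (m x n) or multi-band
    (p x m x n) array by checking the number of dimensions.
    '''
    if rast.ndim == 3:
        return rast[:, ulY:lrY, ulX:lrX]

    return rast[ulY:lrY, ulX:lrX]
```

Either approach works; the explicit check makes the two cases visible at a glance, while try...except keeps the common multi-band path first.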

Improvement: Out-of-Bounds Clipping Features

The second part is not so easy. The clipping in Jared's example is based on NumPy arrays and therefore must be conformal. One thing that can happen when the clipping features extend "above" a raster's extent is that the estimated pixel coordinate for the upper-left corner becomes negative--and negative pixels don't make sense! We have to catch this case, remember the negative pixel coordinate, and set it to zero.

# If the clipping features extend out-of-bounds and ABOVE the raster...
if gt[3] < maxY:
    # In such a case... ulY ends up being negative--can't have that!
    iY = ulY
    ulY = 0
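To see why a negative pixel coordinate is a problem in practice, consider how NumPy treats a negative slice index. The short sketch below is illustrative; `clamp_offset` is a hypothetical helper that does the same "remember it, then set it to zero" step for a single offset.

```python
import numpy as np

def clamp_offset(offset):
    '''
    Clamps a pixel offset to zero; returns the clamped offset and the
    number of pixels that fell outside the raster.
    '''
    return (max(0, offset), max(0, -offset))

rows = np.arange(10)
# A negative start index counts back from the END of the array, so this
#   does not mean "three rows above the top"; here the slice is empty
empty = rows[-3:5]

ulY, cut = clamp_offset(-3)
window = rows[ulY:5] # Rows 0 through 4
```

The number of rows that were cut off plays the same role as `abs(iY)` in the snippets here: it records how far the clipping features extend beyond the top of the raster.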

However, in doing this, the clipping features are effectively "pushed down" (southwards) because we prevented the upper-left corner coordinate from being negative. To compensate, we must pull the clipping features "back up" (northwards). Note that here we need to add an else clause for the case when the clipping features don't extend above the raster.

# If the clipping features extend out-of-bounds and ABOVE the raster...
if gt[3] < maxY:
    # The clip features were "pushed down" to match the bounds of the
    #   raster; this step "pulls" them back up
    premask = image_to_array(raster_poly)
    # We slice out the piece of our clip features that are "off the map"
    mask = np.ndarray((premask.shape[-2] - abs(iY), premask.shape[-1]), premask.dtype)
    mask[:] = premask[abs(iY):, :]
    mask.resize(premask.shape) # Then fill in from the bottom

    # Most importantly, push the clipped piece down
    gt2[3] = maxY - (maxY - gt[3])

else:
    mask = image_to_array(raster_poly)

Finally, we have to deal with the converse problem: clipping features that extend below the raster's bounds. This is trickier. What I've done is created a larger NumPy array onto which the clipping features are projected.

# Clip the image using the mask
try:
    clip = gdalnumeric.choose(mask, (clip, nodata))

# If the clipping features extend out-of-bounds and BELOW the raster...
except ValueError:
    # We have to cut the clipping features to the raster!
    rshp = list(mask.shape)
    if mask.shape[-2] != clip.shape[-2]:
        rshp[0] = clip.shape[-2]

    if mask.shape[-1] != clip.shape[-1]:
        rshp[1] = clip.shape[-1]

    mask.resize(*rshp, refcheck=False)

    clip = gdalnumeric.choose(mask, (clip, nodata))

The Complete Source Code

Here's a complete Python function for clipping a raster. It follows Jared's example but expands on the comments throughout and includes the two improvements I've described. It does not yet support clipping by multiple features at once. Also, I haven't evaluated (nor come across the problem of) its performance with clipping features that extend only to the side of the raster; not sure if this would be a problem.

The function's API consists of two required arguments: The raster to clip and the file path to the clipping features (e.g., a Shapefile). An optional GDAL GeoTransform can be provided. Also, the NoData value can be set.

from osgeo import gdal, gdalnumeric, ogr
from PIL import Image, ImageDraw
import os
import numpy as np

def clip_raster(rast, features_path, gt=None, nodata=-9999):
    '''
    Clips a raster (given as either a gdal.Dataset or as a numpy.array
    instance) to a polygon layer provided by a Shapefile (or other vector
    layer). If a numpy.array is given, a "GeoTransform" must be provided
    (via dataset.GetGeoTransform() in GDAL). Returns an array. Clip features
    must be a dissolved, single-part geometry (not multi-part). Modified from:

    http://pcjericks.github.io/py-gdalogr-cookbook/raster_layers.html
    #clip-a-geotiff-with-shapefile

    Arguments:
        rast            A gdal.Dataset or a NumPy array
        features_path   The path to the clipping features
        gt              An optional GDAL GeoTransform to use instead
        nodata          The NoData value; defaults to -9999.
    '''
    def array_to_image(a):
        '''
        Converts a gdalnumeric array to a Python Imaging Library (PIL) Image.
        '''
        i = Image.frombytes('L',(a.shape[1], a.shape[0]),
            (a.astype('b')).tobytes())
        return i

    def image_to_array(i):
        '''
        Converts a Python Imaging Library (PIL) image to a gdalnumeric array.
        '''
        a = gdalnumeric.fromstring(i.tobytes(), 'b')
        a.shape = i.im.size[1], i.im.size[0]
        return a

    def world_to_pixel(geo_matrix, x, y):
        '''
        Uses a gdal geomatrix (gdal.GetGeoTransform()) to calculate
        the pixel location of a geospatial coordinate; from:
        http://pcjericks.github.io/py-gdalogr-cookbook/raster_layers.html#clip-a-geotiff-with-shapefile
        '''
        ulX = geo_matrix[0]
        ulY = geo_matrix[3]
        xDist = geo_matrix[1]
        yDist = geo_matrix[5]
        rtnX = geo_matrix[2]
        rtnY = geo_matrix[4]
        pixel = int((x - ulX) / xDist)
        # Use the row (Y) resolution, which is negative for north-up images
        line = int((y - ulY) / yDist)
        return (pixel, line)

    # Can accept either a gdal.Dataset or numpy.array instance
    if not isinstance(rast, np.ndarray):
        gt = rast.GetGeoTransform()
        rast = rast.ReadAsArray()

    # Create an OGR layer from a boundary shapefile
    features = ogr.Open(features_path)
    if features.GetDriver().GetName() == 'ESRI Shapefile':
        lyr = features.GetLayer(os.path.split(os.path.splitext(features_path)[0])[1])

    else:
        lyr = features.GetLayer()

    # Get the first feature
    poly = lyr.GetNextFeature()

    # Convert the layer extent to image pixel coordinates
    minX, maxX, minY, maxY = lyr.GetExtent()
    ulX, ulY = world_to_pixel(gt, minX, maxY)
    lrX, lrY = world_to_pixel(gt, maxX, minY)

    # Calculate the pixel size of the new image
    pxWidth = int(lrX - ulX)
    pxHeight = int(lrY - ulY)

    # If the clipping features extend out-of-bounds and ABOVE the raster...
    if gt[3] < maxY:
        # In such a case... ulY ends up being negative--can't have that!
        iY = ulY
        ulY = 0

    # Multi-band image?
    try:
        clip = rast[:, ulY:lrY, ulX:lrX]

    except IndexError:
        clip = rast[ulY:lrY, ulX:lrX]

    # Create a new geomatrix for the image
    gt2 = list(gt)
    gt2[0] = minX
    gt2[3] = maxY

    # Map points to pixels for drawing the boundary on a blank 8-bit,
    #   black and white, mask image.
    points = []
    pixels = []
    geom = poly.GetGeometryRef()
    pts = geom.GetGeometryRef(0)

    for p in range(pts.GetPointCount()):
        points.append((pts.GetX(p), pts.GetY(p)))

    for p in points:
        pixels.append(world_to_pixel(gt2, p[0], p[1]))

    raster_poly = Image.new('L', (pxWidth, pxHeight), 1)
    rasterize = ImageDraw.Draw(raster_poly)
    rasterize.polygon(pixels, 0) # Fill with zeroes

    # If the clipping features extend out-of-bounds and ABOVE the raster...
    if gt[3] < maxY:
        # The clip features were "pushed down" to match the bounds of the
        #   raster; this step "pulls" them back up
        premask = image_to_array(raster_poly)
        # We slice out the piece of our clip features that are "off the map"
        mask = np.ndarray((premask.shape[-2] - abs(iY), premask.shape[-1]), premask.dtype)
        mask[:] = premask[abs(iY):, :]
        mask.resize(premask.shape) # Then fill in from the bottom

        # Most importantly, push the clipped piece down
        gt2[3] = maxY - (maxY - gt[3])

    else:
        mask = image_to_array(raster_poly)

    # Clip the image using the mask
    try:
        clip = gdalnumeric.choose(mask, (clip, nodata))

    # If the clipping features extend out-of-bounds and BELOW the raster...
    except ValueError:
        # We have to cut the clipping features to the raster!
        rshp = list(mask.shape)
        if mask.shape[-2] != clip.shape[-2]:
            rshp[0] = clip.shape[-2]

        if mask.shape[-1] != clip.shape[-1]:
            rshp[1] = clip.shape[-1]

        mask.resize(*rshp, refcheck=False)

        clip = gdalnumeric.choose(mask, (clip, nodata))

    return (clip, ulX, ulY, gt2)

Identifying water bodies from Landsat TM/ETM+ with density slicing, machine learning

2015-06-22T09:45:00+02:00

I recently found myself in need of a comprehensive water body mask I could generate from and apply to Landsat TM/ETM+ images. Based on my past experience and the available literature (e.g., Frazier et al. 2000), I knew that any solution was best sourced from Landsat's thermal bands. I wasn't entirely sure what to expect from a solution based only on spectral data, but Frazier et al. (2000) presented a very robust solution using only density slicing of Landsat TM Band 5 (1.55-1.75 μm).

I decided I would test density slicing and compare the results to a simple supervised classification, just as Frazier et al. (2000) had done. They selected a maximum likelihood classifier; I selected a naive Bayes classifier. The study area is the Oakland County, Michigan subset of a Landsat 5 image from July 1999. Oakland County has several small and mid-size lakes throughout its extent that make for a compelling example. The examples below require, among other tools, the GDAL and OGR command line utilities.
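Density slicing itself amounts to thresholding a single band: water absorbs strongly in the short-wave infrared, so water pixels are dark in Band 5. A minimal NumPy sketch is below; the function name and the threshold used in any call are hypothetical, and a suitable threshold has to be chosen for each image (for example, from the band's histogram).

```python
import numpy as np

def density_slice(band, threshold, nodata=-9999):
    '''
    Returns a boolean water mask: True where the band is valid and its
    value is below the threshold.
    '''
    band = np.asarray(band)
    return (band != nodata) & (band < threshold)
```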

Data Preparation

Supervised classification requires examples of the classes we wish to identify; in this case, water bodies and non-water areas. Example areas of water and non-water were generated in two different ways:

In one iteration, the water areas were generated directly from hydrology data provided by the Michigan Geographic Data Library (MiGDL).
In the second iteration, I prepared examples of water and non-water areas myself in random samples from the study area.

For the second part, I wrote a previous article on how to generate sampling areas on the same grid as a Landsat image; see that article for details. Here, I proceed as if the human-produced training data are already prepared.

Projection Issues

Use of MiGDL data can be frustrating because the data are stored in the Michigan GeoRef projection. In clipping the lake and stream data to Oakland County, for instance, I had to ensure my clipping features were in the same projection. I re-projected the clipped data to UTM 17N (WGS84) afterwards.

MICHIGAN_GEOREF='PROJCS["NAD_1983_Michigan_GeoRef_Meters",GEOGCS["GCS_North_American_1983",DATUM["North_American_Datum_1983",SPHEROID["GRS_1980",6378137,298.257222101]],PRIMEM["Greenwich",0],UNIT["Degree",0.017453292519943295]],PROJECTION["Hotine_Oblique_Mercator"],PARAMETER["False_Easting",2546731.496],PARAMETER["False_Northing",-4354009.816],PARAMETER["Scale_Factor",0.9996],PARAMETER["Azimuth",337.255555555556],PARAMETER["Longitude_Of_Center",-86],PARAMETER["Latitude_Of_Center",45.30916666666666],UNIT["Meter",1],AUTHORITY["EPSG","102123"]]'

# Create clipping features in Michigan GeoRef (quote the WKT so the shell passes it as one argument)
ogr2ogr clip_features.shp Oakland_county.shp -t_srs "$MICHIGAN_GEOREF"

# Perform the clip
ogr2ogr -f "ESRI Shapefile" -s_srs "$MICHIGAN_GEOREF" \
 hydropoly_miv14a_clip.shp hydropoly_miv14a.shp \
 -t_srs "EPSG:32617" -clipsrc clip_features.shp

Cleaning the Hydrology Exemplar Data

After reprojection, the lake and stream areas delineated in the MiGDL dataset were filtered, discarding water bodies smaller than 20 acres, which might be ephemeral.

ogr2ogr -where "ACRES >= 20" hydropoly_miv14a_clip_gte_20ac.shp hydropoly_miv14a_clip.shp

Next, the filtered water bodies were shrunk by moving their outer edges inward by 90 meters (a reverse buffer). This accounts for the lack of spatial fit between the MiGDL data and the Landsat data and also reduces uncertainty in our water area model due to nearshore vegetation cover.

ogr2ogr hydropoly_miv14a_clip_gte_20ac_shrink_90m.shp hydropoly_miv14a_clip_gte_20ac.shp -dialect sqlite -sql "SELECT GID, ACRES, SQKM, ST_Buffer(Geometry, -90) FROM hydropoly_miv14a_clip_gte_20ac"

The final step in preparing known examples of water cover is to rasterize the layer of known water bodies and clip it to the area of analysis. For this, I used my own gdal_extent.py utility and the gdal_rasterize command line utility. In this iteration, all Landsat pixels not intersected by the water area examples are considered to be examples of land areas.

EXTENT=$(gdal_extent.py LandsatTM_raster.tiff)
SIZE=$(gdal_extent.py -size LandsatTM_raster.tiff)
gdal_rasterize -burn 1 -ts $SIZE -tr 30 30 -te $EXTENT hydropoly_miv14a_clip_gte_20ac_shrink_90m.shp hydropoly_miv14a_clip_gte_20ac_shrink_90m.tiff

In the second iteration, the water bodies generated above are validated in Google Earth by a human analyst. Those water bodies that intersect land are removed from the water examples layer. After applying machine learning in the first iteration, it was discovered that some man-made reservoirs were incorrectly identified as land areas. These reservoir areas were added to the water examples layer for the second iteration.

In contrast to the previous approach, land areas were explicitly identified for this iteration. Rectangular areas of 300 meters squared were generated on a grid aligned with the Landsat data, randomly sampled, and then validated in Google Earth by a human analyst (me), discarding any areas that intersected a water body. The remaining areas were used as exemplars of land areas.

Results: Supervised Classification

A Gaussian naive Bayes estimator was applied to known water and non-water areas. A stratified 10-fold cross-validation was applied to the water examples generated from the MiGDL dataset (without human validation). The mean precision and mean recall of the cross-validation folds are presented below. The results are very good excepting the precision with which we can detect water, which is poor.

----------------------------
Class     Precision   Recall
--------- ---------   ------
Not water 0.9997      0.9787
Water     0.3090      0.9763
----------------------------
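
To make the two measures concrete, here is a minimal sketch in plain Python of how precision and recall are computed for one class. The function name and the toy labels are made up for illustration; they are not the cross-validation data summarized above. The toy case shows how a rare class can have perfect recall and poor precision at once, which is the pattern in the water row of the table.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Toy labels (hypothetical): 1 = water, 0 = not water; water is the rare class
y_true = [1] * 10 + [0] * 90
# Every water pixel is found, but 20 land pixels are also called water
y_pred = [1] * 10 + [1] * 20 + [0] * 70

water_precision, water_recall = precision_recall(y_true, y_pred, positive=1)
land_precision, land_recall = precision_recall(y_true, y_pred, positive=0)
```

In this toy case the classifier never misses water, yet only a third of the pixels it labels as water really are water.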

After human validation was used to better discriminate land and water in the training data, naive Bayes was applied again and the precision for the water class is much improved.

----------------------------
Class     Precision   Recall
--------- ---------   ------
Not water 0.9675      1.0000
Water     1.0000      0.9664
----------------------------

These outstanding results are the first indication that a water discrimination technique based solely on spectral information might be very effective. In the last step, we'll apply density slicing of Landsat TM Band 5.

Results: Density Slicing

If we examine histograms of segregated water and land areas (below), we can see that there is a sharp divide between the spectral responses of these land cover types in Band 5.

There are two cutoffs we might choose: less than or equal to 500 or 1000 reflectance units. In addition, we might want to sieve the results to reduce commission error. Thus, I experimented with four combinations of thresholds and sieving: a cutoff at either 500 or 1000 reflectance units with or without sieving. We find that sieving had no effect on the performance as measured by precision and recall:

------------------------------------------
Approach      Class     Precision   Recall
-----------   --------- ---------   ------
B5 </= 500    Not water 0.94        1.00
B5 </= 500    Water     1.00        0.96
B5 </= 1000   Not water 0.96        1.00
B5 </= 1000   Water     1.00        0.98
------------------------------------------

And that, on average, a cutoff of 1000 reflectance units performs slightly better:

------------------------------
Approach    Precision   Recall
----------- ---------   ------
B5 </= 500  0.98        0.98
B5 </= 1000 0.99        0.99
------------------------------

Density slicing performs exceptionally well! It performs even better than the naive Bayes estimator. The performance of simple density slicing seems surprising but, then again, we looked at the histogram for Band 5 when choosing the thresholds for slicing and it was very apparent that the land and water pixels were separated at somewhere between 500 and 1000 reflectance units. The real question, then, is whether this relationship holds up across Landsat images (across time and space). In fact, it works very well for Landsat surface reflectance (SR) data from other dates, though I haven't tried this yet outside of southeast Michigan.
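
Density slicing itself is nothing more than a threshold on a single band. As a rough sketch, assuming the Band 5 values have already been read into a nested list (in practice this would more likely be a NumPy array read with GDAL), the two cutoffs could be applied as below. The function name and the pixel values are hypothetical, and sieving is not shown.

```python
def density_slice(band, cutoff=1000):
    """Label a pixel as water (1) where its value is <= cutoff, else land (0)."""
    return [[1 if value <= cutoff else 0 for value in row] for row in band]


# A made-up 3 x 3 window of Band 5 values, for illustration only
band5 = [
    [250, 480, 1900],
    [730, 2400, 3100],
    [2650, 990, 1000],
]

mask_500 = density_slice(band5, cutoff=500)
mask_1000 = density_slice(band5, cutoff=1000)
```

The higher cutoff can only add water pixels to the mask, never remove them, so comparing the two masks shows exactly which pixels the choice of threshold affects.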

References

Frazier, P. S., and K. J. Page. 2000. Water Body Detection and Delineation with Landsat TM Data. Photogrammetric Engineering & Remote Sensing 66 (12):1461–1467.

Generating sample validation points with the Unix Shell and QGIS

2015-06-09T18:30:00+02:00

For ongoing work I'm doing with Landsat data, I recently needed to generate some quick validation points against high-resolution aerial photography. I wanted to generate a fixed number of random squares, 90 meters on a side (3 by 3 Landsat pixels), within the extent of a high-resolution aerial photograph I was using as "ground data." More specifically, we want to generate a grid of 90-meter square polygons, distributed throughout the photograph's extent, that align with our Landsat image.

The steps involved are:

Clip the Landsat image to the bounds of (the intersection with) my aerial photograph.
Resample this clip to the desired resolution of my validation window (e.g., 90 meters).
Use QGIS to generate a grid from the coarsened Landsat clip.
Randomly sample from the generated grid.

Clipping the Landsat Image to the Airphoto Reference

So, we want, e.g., 100 validation points within the bounds of the aerial photograph. To create the validation points within a defined extent, we first need to measure the extent of the reference image. The gdalinfo utility is useful for this (as is the image properties dialog in QGIS). I recently created a GDAL-esque tool called gdal_extent.py that can help automate this process. After using it to extract the bounds of the aerial photograph, airphoto.tiff, as a GeoJSON file, I can convert it to a Shapefile, which is the expected format for cutline features when clipping with GDAL. As the input file to ogr2ogr is GeoJSON, I have to tell it the spatial reference system (SRS) it should expect from the source file with the -s_srs switch.

python gdal_extent.py -geojson airphoto.tiff > extent.json
ogr2ogr -f "ESRI Shapefile" -s_srs "EPSG:32617" extent.shp extent.json
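
If you don't have gdal_extent.py at hand, a bounding-box GeoJSON file is easy to write yourself once you know the image bounds (from gdalinfo, for instance). The sketch below is only an illustration: the function name and the bounds are hypothetical, and the exact structure that gdal_extent.py emits may differ.

```python
import json


def extent_to_geojson(xmin, ymin, xmax, ymax):
    """Build a GeoJSON FeatureCollection holding one rectangular polygon."""
    # Exterior ring, counterclockwise, closed by repeating the first vertex
    ring = [
        [xmin, ymin],
        [xmax, ymin],
        [xmax, ymax],
        [xmin, ymax],
        [xmin, ymin],
    ]
    feature = {
        "type": "Feature",
        "properties": {},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }
    return {"type": "FeatureCollection", "features": [feature]}


# Hypothetical bounds in UTM 17N meters; substitute the real image bounds
extent = extent_to_geojson(310000.0, 4700000.0, 319000.0, 4709000.0)
extent_json = json.dumps(extent)
```

Writing extent_json to extent.json gives a file that can be passed to the ogr2ogr command above in the same way.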

Alternatively, I could have just created a Shapefile that represents the bounds of my aerial photograph any other way. Also, my approach generates a rectangular bounding box whereas we could use any more complex polygon. So, assuming that you have a file extent.shp that somehow represents the bounds of your image, it's time to clip our Landsat image (L7Image.tiff) to this extent. I'll use gdalwarp; note that I specify the output resolution as 90 meters in both the horizontal and vertical directions.

gdalwarp -cutline extent.shp -cl extent -crop_to_cutline -tr 90 90 L7Image.tiff grid_90m.tiff

Generating a Sampling Grid in QGIS

This is where QGIS comes in. A close cousin of ArcGIS' Fishnet tool, QGIS' "Vector grid" tool (under "Vector, Research Tools") is what I'll use to convert the pixels of my Landsat image to polygons.

The output Shapefile should resemble a grid of pixels.

Selecting a Subset of Random Validation points

In the last step, I must downselect from this grid of pixels a random subset. Here, I'll present two techniques for introducing a stochastic selection process.

Random Selection within the Database

The simplest way would be to let a file database randomly select features from my polygon grid. In this example, I'll generate KML output because, for another application, I want to validate my points in Google Earth. The output Shapefile from the Vector grid tool is vector_grid.shp and we're downselecting 100 validation points:

ogr2ogr -f "KML" -dialect "sqlite" -sql "SELECT * FROM vector_grid ORDER BY RANDOM() LIMIT 100" validation_points.kml vector_grid.shp

Random Selection with Bash

If you don't trust the randomization capabilities of a database or have your own reason for generating random features, you might prefer this approach. In this example, the output Shapefile from the Vector grid tool is vector_grid.shp; we'll use a combination of Bash built-ins and GDAL tools here:

# Count the features (grid cells) in the Shapefile
PIXELS=$(ogrinfo vector_grid.shp -al | grep "Feature Count: " | cut -c 16-)
# Generate 100 randoms as a comma-separated string; to drop the trailing comma,
# reverse the string, cut the first character, reverse again
RANDS="$(shuf -i 0-$PIXELS -n 100 | tr '\n' ',' | rev | cut -c 2- | rev)"
ogr2ogr -f "ESRI Shapefile" -where "ID IN ($RANDS)" validation_points.shp vector_grid.shp
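
The same selection can be sketched in Python, which may be handy where shuf isn't available. This is only an illustration: the function name is made up, and it assumes the ID field runs from 0 to one less than the feature count, which you should check against your own grid.

```python
import random


def random_ids_clause(feature_count, n=100, seed=None):
    """Return an OGR -where clause selecting n distinct random feature IDs."""
    rng = random.Random(seed)
    # Sampling without replacement guarantees n distinct IDs
    ids = rng.sample(range(feature_count), n)
    return "ID IN ({})".format(",".join(str(i) for i in sorted(ids)))


# Hypothetical grid of 2500 cells; a fixed seed makes the draw repeatable
clause = random_ids_clause(2500, n=100, seed=42)
```

The resulting string can be passed to ogr2ogr with the -where switch, just like the Bash version. Fixing the seed makes it possible to regenerate the same validation sample later.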

As an added bonus, this approach allows us to create fields for validating certain quantities directly in our Shapefile. This is a great way to automate the generation of validation points for the worthwhile consideration of the undergraduate interns, graduate students, and other valuable research assistants working with you. For example, if I want to validate the fraction of vegetation cover (VegFrac) and impervious surface cover (ImpFrac) and record that information directly in my Shapefile (while editing it in QGIS, later), I simply add those fields as part of a SELECT statement that includes my previous WHERE condition:

ogr2ogr -f "ESRI Shapefile" -sql "SELECT ID, 0.0 AS VegFrac, 0.0 AS ImpFrac FROM vector_grid WHERE ID IN ($RANDS)" validation_points.shp vector_grid.shp

In Conclusion

Now I'm ready to analyze the areas specified in validation_points.shp by going through them, one at a time, in QGIS or another desktop GIS tool. Technically, I used more than Bash and QGIS but GDAL really is the most fundamental tool in Bash, right?

LEDAPS installation on Ubuntu GNU/Linux

2015-05-06T11:30:00+02:00

Update: The Landsat Ecosystem Disturbance Adaptive Processing System (LEDAPS) is a software system for generating surface reflectance data for Landsat 4, 5, and 7 TM or ETM+ sensors. The installation can be difficult, so I've prepared a guide based on my last successful installation of version 2.2.0. Since I first authored this article, I've heard reports of various difficulties I didn't encounter, including that the auxiliary files are no longer available. On the other hand, new releases are now on GitHub and the documentation seems to have improved (over the original on Google Code). I am now downloading surface reflectance data directly and in bulk from the USGS EROS Science Products Architecture (ESPA), so I haven't needed to use LEDAPS myself in a while. In short, this article may need to be updated, but what follows is a guide for installing LEDAPS 2.2.0 on Ubuntu GNU/Linux 14.04 or higher.

LEDAPS Dependencies

The Google Code Wiki for LEDAPS lists the following dependencies:

HDF-EOS2 libraries (hdf-eos)
ESPA common libraries (espa-common)
GCTP libraries (Bundled with HDF-EOS2)
TIFF libraries (libtiff5)
GeoTIFF libraries (libgeotiff2)
HDF4 libraries (libhdf4)
libxml2

Some of these dependencies have their own dependencies, which adds "JPEG support" (for espa-common) and zlib (for HDF4) to the list.

A Walkthrough

This walkthrough is written and tested for Ubuntu GNU/Linux 14.04 ("trusty"). It is available in full as a public Gist. You'll want to make sure you have the basic dependencies installed first.

sudo apt-get install zlib1g zlib1g-dev libtiff5 libtiff5-dev \
libgeotiff2 libgeotiff-dev libxml2 libxml2-dev

I could not figure out which JPEG library is needed by espa-common but it is likely installed by default on Ubuntu. For the following examples, I assume you want to build the libraries in /usr/local/ and you have sudo privileges. Don't forget to set a USERNAME variable.

USERNAME=heyyouguys

HDF4 Library

First, we install HDF4 support. You could install this as a package with sudo apt-get install libhdf4-0 libhdf4-0-alt libhdf4-alt-dev. I prefer to build it from source in this case. This also ensures that the paths I've provided for shared libraries later on in the walkthrough will match.

sudo mkdir /usr/local/hdf4
sudo chown $USERNAME /usr/local/hdf4
cd /usr/local/hdf4

# Download the latest HDF4 source code
wget http://www.hdfgroup.org/ftp/HDF/HDF_Current/src/hdf-4.2.11.tar.gz
tar -xzvf hdf-4.2.11.tar.gz
cd hdf-4.2.11
./configure
make
make check
make install

# Update shared links
sudo ldconfig

HDF-EOS2 Library

Next, we install the HDF-EOS2 library. The maintainers actually do provide some fairly helpful instructions for building from source, from which I've adapted these instructions. Note that your path to the h4cc compiler may differ, especially if you did not build HDF4 from source. HDF-EOS2 is available as a package (sudo apt-get install libhdfeos0 libhdfeos-dev libgctp0d libgctp-dev) but I could not get LEDAPS to build off of this package.

sudo mkdir /usr/local/hdf-eos
sudo chown $USERNAME /usr/local/hdf-eos
cd /usr/local/hdf-eos

# Download latest HDF-EOS source code
wget ftp://edhs1.gsfc.nasa.gov/edhs/hdfeos/latest_release/HDF-EOS2.19v1.00.tar.Z
tar -xzvf HDF-EOS2.19v1.00.tar.Z
cd hdfeos
./configure --enable-install-include \
CC=/usr/local/hdf4/hdf-4.2.11/hdf4/bin/h4cc
make
make check
make install

# Update shared links
sudo ldconfig

Pay attention to this message after make install, which might be slightly different on your machine. You'll want to remember this path for later.

Libraries have been installed in:
   /usr/local/hdf-eos/hdfeos/hdfeos2/lib

ESPA Common Library

Next is the ESPA Common Library; its documentation is here. The original documentation consists of only one instruction: "Goto the src/raw_binary directory and build the source code there." Anyway, the first thing we have to do is get the source code. Note that while I've linked to the 1.1.0 version documentation, the instructions are (apparently) the same for version 1.3.1, which is what we're installing.

sudo svn checkout http://espa-common.googlecode.com/svn/releases/version_1.3.1 \
/usr/local/espa-common/version_1.3.1
sudo chown -R $USERNAME /usr/local/espa-common

To tell espa-common and, later, LEDAPS where to find shared libraries, we have to set a number of environment variables. At least one of the following paths will likely not match your setup. The only hints I can give you are that the *INC paths should point to the respective library's includes while the *LIB paths should point to the exported library.

export HDFEOS_GCTPINC="/usr/include/gctp/"
export HDFEOS_GCTPLIB="/usr/local/hdf-eos/hdfeos/hdfeos2/lib"
export TIFFINC="/usr/include/x86_64-linux-gnu/"
export TIFFLIB="/usr/lib/x86_64-linux-gnu/"
export GEOTIFF_INC="/usr/include/geotiff/"
export GEOTIFF_LIB="/usr/lib/"
export HDFINC="/usr/local/hdf4/hdf-4.2.11/hdf4/include/"
export HDFLIB="/usr/local/hdf4/hdf-4.2.11/hdf4/lib/"
export HDFEOS_INC="/usr/local/hdf-eos/hdfeos/include/"
export HDFEOS_LIB="/usr/local/hdf-eos/hdfeos/hdfeos2/lib"
export JPEGINC="/usr/include/"
export JPEGLIB="/usr/lib/x86_64-linux-gnu/"
export XML2INC="/usr/include/libxml2/"
export XML2LIB="/usr/lib/x86_64-linux-gnu/"

It's easy to mistake where HDF-EOS2 keeps its lib files. Remember the path we wrote down after make install? You have to carefully read the output from make install to note that it installs the lib files in a subfolder called hdfeos2 (by default, I assume). If the above path doesn't work for you, just run make install on HDF-EOS2 again and look for the message that indicates the proper path.
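
Since a wrong path only shows up later as a build error, it can save time to check the variables before running make. Below is a small sketch in Python that reports any of the variables exported above that are unset or that don't point to an existing directory. The function name is my own invention, and the check is only for existence; it can't tell you whether the right headers or lib files are inside.

```python
import os

# The variables exported above
REQUIRED_VARS = [
    "HDFEOS_GCTPINC", "HDFEOS_GCTPLIB", "TIFFINC", "TIFFLIB",
    "GEOTIFF_INC", "GEOTIFF_LIB", "HDFINC", "HDFLIB",
    "HDFEOS_INC", "HDFEOS_LIB", "JPEGINC", "JPEGLIB",
    "XML2INC", "XML2LIB",
]


def missing_paths(environ=None, names=REQUIRED_VARS):
    """Return (name, value) pairs that are unset or not existing directories."""
    if environ is None:
        environ = os.environ
    problems = []
    for name in names:
        path = environ.get(name)
        if not path or not os.path.isdir(path):
            problems.append((name, path))
    return problems


# Report anything that looks wrong in the current shell environment
for name, path in missing_paths():
    print("{} is unset or not a directory: {!r}".format(name, path))
```

Run it from the same shell session in which the variables were exported; otherwise every variable will be reported as unset.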

We next cd to the src/raw_binary subfolder. From there, with a little luck, building is straightforward.

make
make install
sudo ldconfig

If you encounter errors about the -lGctp switch, then make is unable to find the GCTP lib files; you specified the wrong path under HDFEOS_GCTPLIB.

LEDAPS

Finally, we're ready to install LEDAPS. Again, though I will link you to the version 2.2.0 documentation, we're going to install version 2.2.1. This bug report might also be helpful if you run into errors. Although the GitHub site (formerly on the Google Code Wiki) indicates that you should download the LEDAPS auxiliary files (ledaps_aux.1978-2014.tar.gz) before building LEDAPS, there is no reason to do this before LEDAPS has been successfully installed (it's a ~2.9 GB file). We first have to export the paths to the includes and lib files we just built in espa-common.

export ESPAINC="/usr/local/espa-common/version_1.3.1/src/raw_binary/include/"
export ESPALIB="/usr/local/espa-common/version_1.3.1/src/raw_binary/lib/"

Now you're ready to build LEDAPS! Good luck.

sudo svn checkout http://ledaps.googlecode.com/svn/releases/version_2.2.1 \
/usr/local/ledaps/version_2.2.1
sudo chown -R $USERNAME /usr/local/ledaps
cd /usr/local/ledaps/version_2.2.1/ledapsSrc/src/
make
make install
make clean
sudo ldconfig

Clean-Up

A description of the LEDAPS outputs, which you won't find linked to from the USGS or the Google Code site, is available here.

Holism versus reductionism: The holy war in ecology

2015-03-18T11:00:00+01:00

The philosophical tension between the worldviews of holism and reductionism persists in today's ecology classroom. This debate traces roots to the "individualistic" versus "organismal" debate at the beginning of the twentieth century between the population and community ecology schools [1]. The chief actors in this debate were Henry Gleason, proponent of the individualistic view, and Frederic Clements, who argued that plant communities function as "complex organisms." Though not cast in the same terms, Gleason's point of view has come to be associated with reductionism while Clements' point of view, including his analogy relating plant communities to the human body, was soon attached to the doctrine of holism.

Odenbaugh (2007) provides an excellent summary of the debate between Gleason and Clements:

Suppose a set of species in a particular place and time is disturbed by some exogenous process like a forest fire from a lightning strike. Clements argued that communities in response to such disturbances follow a very specific sequence of stages called "seres" and that there is a single self‐perpetuating and tightly integrated climax community. Clements considered communities to be "superorganisms"...Gleason considered Clements' views to be without empirical support and argued that succession results from individual species' physiological requirements and local meteorological conditions.

In general, holism is the view that an integrated whole has a reality independent of and greater than the sum of its parts. It is marked, particularly, by the belief in "emergent properties" which are only observed at the system-level. Reductionism, in contrast, posits that all phenomena are at all times physically realized and therefore system-level behavior is determined by and can be derived from the constituent parts [3].

Reductionism is clearly useful as a foundation on which concepts in ecosystem ecology can easily be built. For example, in examining ecosystems as functional units of nature, biochemical pathways in photosynthesis are often discussed as determining the spatial distribution of certain plant communities. We observe that C4 plants have higher photosynthetic and water use efficiency and are stimulated by higher temperatures than C3 plants. We also observe that C3 and C4 plants are found in different patterns on the landscape, with C4 plants found predominantly in drier ecoregions than C3 plants (and CAM plants in ecoregions drier still). The reductionist argument is that these individual differences in plant life histories determine their spatial distribution and, thus, the plant communities associated with a particular ecosystem.

There are also problems with Clements' view. His perspective was too broad-scale to appreciate real, fine-grained changes in species turnover along environmental gradients. However, rejecting Clements' view does not necessitate rejecting holism. There are very real examples of emergent phenomena at work. Conway's oft-cited Game of Life is the premier example but there are examples from the physical world as well (beyond cellular automata). While the spatial distribution of an animal species might be modeled effectively in a vacuum, when we consider multiple species interactions it becomes increasingly difficult if not impossible to predict their spatial distributions with any appreciable accuracy while restricting ourselves to thinking of them as linear combinations.

But is this merely "pragmatic anti-reductionism?" [3]. I think not. As Joe Faith (1998) argues, there are two problems with the reductionist view of some physical systems. One is that our understanding of some physical systems is limited to conceptual and mathematical models (e.g., ideal bodies in physics). Another is that properties of the constituent elements of a system are often determined by system-level behavior (while an insistence that it is exclusively the reverse is reductionism at its purest). An example of this from physics, provided by Faith, is the compression of an ideal gas, which results in an increased mean momentum per unit volume (and increased momentum of the constituent particles) [3].

Many ecologists seem to agree there are merits to both views. My favorite postulation comes from Sierra et al. (2015):

Where these ecosystem manipulation experiments have included the interaction with one or two additional factors, results suggest that effects are not additive or predictable from individual variables alone.

While Currie (2011) and others push back on this view by distinguishing between holism and complexity, I would argue that the defense of reductionism by appealing to complexity is only as tenable as pragmatic anti-reductionism. In the end, the reality of complexity, seen either way, has led to more ideological defensiveness than scientific advancement. That is, the debate tends to generate more heat than light (or more heat than work).

References

Dalgaard, T., N.J. Hutchings, J.R. Porter. 2003. "Agroecology, scaling and interdisciplinarity." Agriculture, Ecosystems & Environment. 100 (1):39-51.
Odenbaugh, J. 2007. Seeing the Forest and the Trees: Realism about Communities and Ecosystems. Philosophy of Science 74 (5):628–641.
Faith, J. 1998. Why gliders don’t exist: Anti-reductionism and emergence. In Artificial Life VI: Proceedings of the Sixth International Conference on Artificial Life, eds. C. Adami, R. K. Belew, H. Kitano, and C. Taylor.
Sierra, C. A., S. E. Trumbore, E. A. Davidson, S. Vicca, and I. Janssens. 2015. Sensitivity of decomposition rates of soil organic matter with respect to simultaneous changes in temperature and moisture. Journal of Advances in Modeling Earth Systems 7:355–356.
Currie, W. S. 2011. Units of nature or processes across scales? The ecosystem concept at age 75. New Phytologist 190 (1):21–34.

Custom citation styles in LaTeX with CSL and Pandoc

2015-01-28T15:00:00+01:00

I love to use LaTeX for typesetting my papers. The flexibility of the environment and the crisp beauty of the final product... I think anyone who uses it regularly knows what I'm gushing about. While working on a recent paper, however, I was frustrated by the prospect of customizing bibliography and citation styles in LaTeX. I care about fine-grained control of my bibliography. I wondered if there was another sensitive soul out there who felt the same way and decided to create a better solution.

Here, I'll describe an alternative to the standard bibliography environment that I found for style customization without sacrificing the raw power of LaTeX. I'll also comment briefly on some alternatives I've seen and tried. I take for granted that a BibTeX library is already available, as creating one is outside the scope of this article. I will mention that my favorite reference manager, Mendeley, can export its database as a BibTeX library.

The Problem

LaTeX's default bibliography environment is simple enough to invoke. For a complete description, see Stefan Kottwitz's book [1]. In-line citations are written with the \cite{} command, as in the example below.

Previous studies have demonstrated that collective efficacy \cite{Morenoff2001} and prejudice \cite{Sampson2004} shape neighborhoods more strongly than their physical make-up and condition.

The final and only other requirement is to add the bibliography and point to a BibTeX reference database. In the example below, it would be named myrefs.bib.

\bibliographystyle{alpha}
\bibliography{myrefs}

It is necessary to invoke the bibtex program. This translates the references from our BibTeX library (myrefs.bib) that we have cited in our document into a thebibliography environment which is put in place where we invoked the command, \bibliography. In short, we typeset once so that the citations are written to the .aux file, call bibtex on the document's base name (without the .tex extension), and then typeset twice more.

bibtex my_document
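
BibTeX works from the .aux file written by a typesetting pass, so the order of the steps matters and bibtex is given the document's base name rather than the .tex file. If you'd rather not run the sequence by hand, it can be scripted. The sketch below is a minimal illustration in Python; the function names are made up and pdflatex is an assumption, so swap in whichever typesetting program you use.

```python
import subprocess


def build_steps(basename, engine="pdflatex"):
    """Return the commands, in order, for a document with a BibTeX bibliography."""
    return [
        [engine, basename],    # first pass records the citations in basename.aux
        ["bibtex", basename],  # reads basename.aux and writes basename.bbl
        [engine, basename],    # pulls the formatted bibliography into the document
        [engine, basename],    # resolves the in-line citation labels
    ]


def build(basename, engine="pdflatex"):
    """Run each step, stopping at the first command that fails."""
    for step in build_steps(basename, engine):
        subprocess.run(step, check=True)


steps = build_steps("my_document")
```

Calling build("my_document") in the folder that holds my_document.tex and myrefs.bib would run all four commands in sequence.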

After typesetting (twice), we see in-line citations and our bibliography. The bibliography style we specified, alpha, is formatted such that the in-line citation labels are a combination of a shortened author name and publication year, the bibliography is sorted by author name, and square brackets surround the labels. But what if you want something different?

Well, there are four default styles. The other three, as described by Kottwitz [1], are listed below. ShareLaTeX has a list of eight (8), including the four discussed here, with example output.

plain: Arabic numbers for the labels, sorted according to the names of the authors. The number is written in square brackets which also appear with \cite.
unsrt: No sorting. All entries appear like they were cited in the text, otherwise it looks like plain.
abbrv: Like plain, but first names and other field entries are abbreviated.

The bibtex program figures out how to style your citations and bibliography from a specification that resides in a *.bst file (e.g., there is a plain.bst for the built-in plain style). You could write one of these yourself, perhaps using one of the built-ins as a template, but the postfix language they are written in can be very difficult to read and write.

It seems that others have recognized the need for more flexible customization and, to be fair, there are other options out there in the form of preprocessors and LaTeX packages. This TeX StackExchange article does a good job of summarizing their trade-offs. The only alternative among the extant packages and programs that I've tried is natbib which doesn't completely solve the problem of fully customizable citations and bibliographies (though it's a good start). In the end, one still has to provide a *.bst file with natbib.

A Solution

I should mention that my high expectations of full customization were shaped by my previous experience with R Markdown and knitr. With R Markdown, formatting for bibliographies and in-line citations can be specified by the Citation Style Language (CSL); see also this reference. I have a vivid memory of first seeing the Zotero repository of CSL stylesheets: all 7,438 of them. Clearly, 7,438 options are better than 4 (or 8), right? If the numbers don't convince you, just open up a CSL stylesheet; it's an XML variant, making it much easier to read and write than *.bst files.

So, CSL works out-of-the-box in R Markdown—great! And knitr, with help from Pandoc, enables R Markdown documents to be serialized to a wide variety of formats (e.g., PDF, HTML, Microsoft Word). But what if you want to use the full range of TeX commands and environments available through third-party packages? Some people might also object to using R Markdown to write their paper, anyway; particularly if they don't use R or markdown.

I'm not one of those people but I do want to use raw TeX sometimes. So, I started looking into Pandoc to preprocess my TeX documents. Pandoc supports CSL and can defer raw TeX input to the LaTeX typesetting program. By chaining Pandoc and LaTeX, with a custom CSL file to my liking, I can fully customize my bibliography and citation styles without sacrificing the full range of TeX features available in third-party libraries.

As an example, here is the References section of a paper I wrote. I wanted hanging indents in my bibliography—one last, obsessive detail to achieve my vision for a bibliography—so I have invoked \setlength and \hangparas which require the setspace and hanging packages, respectively. The last line looks a lot like it did before, right? I just point to my BibTeX bibliography database.

\section{References}
\setlength{\parindent}{0pt} % Reset indentation for references...
\hangparas{32pt}{1}
\bibliography{/home/arthur/library.bib}

To typeset this with Pandoc, I have a little shell script that encapsulates the variety of options available with that program. I locate my BibTeX library for Pandoc with the --bibliography option and I tell it how I'd like my bibliography and citations formatted, in CSL, with the --csl option.

pandoc \
-M author="K. Arthur Endsley" \
-M date="February 2, 2015" \
-f latex+raw_tex -N -R \
--smart \
--include-in-header=header.tex \
--bibliography=/home/arthur/library.bib \
--csl=citation_style.csl \
--template=template.latex \
-o MyPaper.pdf MyPaper.tex

Note that I have a custom LaTeX template and a header TeX file that load some packages and set up my document. These are important, as \usepackage and some other commands can only be used in the preamble—they can't go inside your input TeX file to Pandoc. To get an idea of what I mean, see my input TeX file (MyPaper.tex) below:

\title{Assessment of Urban Change through a Land-Cover Change Proxy at the Neighborhood Scale with Subpixel Measurements from Satellite Remote Sensing}
\author{K. Arthur Endsley}

\begin{document}

\maketitle

\section{Background}

Neighborhood change manifests in changes in the physical environment...

Anything else has to go in the template or in the header.

References

LaTeX: Beginner's Guide by Stefan Kottwitz

Bayesian networks for land cover classification

2014-12-04T13:00:00+01:00

For a term project in my first semester as a PhD student at the University of Michigan, in a class on landscape modeling, I wanted to investigate the relationship between socioeconomic or demographic change and land cover changes in urban areas. My intent was to produce a model sensitive to neighborhood change—particularly new development or abandonment as signaled by census measures—and to explain that change in terms of physical changes on the landscape: changes in vegetation, impervious surface, or soil cover (i.e., the VIS model) [1]. My choice of Bayesian networks was inspired by a study [2] in which they were used to determine land cover transition probabilities and, in turn, drive a cellular automata model for predicting urbanization (new urban development).

I found that while (static) Bayesian networks are very good at representing complex conditional probability distributions and reproducing static patterns, they aren't useful for predicting relatively rare events like slow and sparse land cover changes. Furthermore, the interpretation of the conditional probabilities can be somewhat subjective. Nonetheless, Bayesian networks could be a powerful tool for generalizing from a sparse dataset and may perform well on classification problems such as land cover classification.

Background

There are many motivations for studying urban environments and urban change in terms of land use or land cover changes. Studies include mitigating the impacts of urban sprawl [3], generating urban development scenarios [2], or monitoring rates of urban growth and impervious surface increase [4]. To these ends, previous studies have employed a variety of models that attempt either to predict future states of the landscape or to explain the drivers of urban change.

Some ostensibly explanatory models are not easily interpretable despite the modeler's intentions. With some cellular automata models, a failure to reproduce fine-scale patterns and accurately locate urban growth may indicate a problem with the model's structure (transition rules, namely) but identifying which parts of the structure are at fault can be challenging, as it requires many different measures of model outcomes, both quantitative and qualitative [5]. Are Bayesian networks an interpretable (explanatory) and yet spatially accurate modeling approach for investigating changes in urban land cover?

Detroit is a particularly interesting case study for urban change because of its considerable economic and population decline [6]. The "greening" of Detroit, an increase in vegetation cover within the urban core due to abandonment and revegetation, is a well-documented phenomenon [7,8] that further implicates land cover change as a signal of socioeconomic and demographic change. From a modeling perspective, boom-and-bust dynamics are more challenging than constant growth and some landscape models are ill-equipped for such purposes [3].

Bayesian Networks

Bayesian networks (BNs)—also known variously as belief networks, Bayesian belief networks, Bayes nets, and causal probabilistic networks—are a relatively recent [9] tool for estimating probabilities of occurrence given sparse observations and have been demonstrated to be useful for land cover modeling. They are directed, acyclic graphs where each node is a variable and the probabilities of both "predictor" and "response" variables can be queried [10]. Nodes connected to one another imply a conditional dependence in a certain direction (hence, the graph is directed) and the links between them cannot form cycles (hence, the graph is acyclic). BNs must also exhibit the Markov property; that is, the conditional probability of any node must depend only on its immediate parents [10,11].
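To make the Markov property concrete, here is a minimal Python sketch (the analysis in this post was done in R; this is only an illustration). The three-node network, its node names, and all probability values are invented. The point is that the joint probability factorizes into one conditional probability per node, each conditioned only on that node's immediate parents.

```python
from itertools import product

# Hypothetical network: income -> old -> new (all names and numbers are made up)
parents = {"income": (), "old": ("income",), "new": ("old",)}

# Conditional probability tables: cpt[node][parent_values][value]
cpt = {
    "income": {(): {"low": 0.6, "high": 0.4}},
    "old": {("low",): {"veg": 0.7, "urban": 0.3},
            ("high",): {"veg": 0.2, "urban": 0.8}},
    "new": {("veg",): {"veg": 0.9, "urban": 0.1},
            ("urban",): {"veg": 0.05, "urban": 0.95}},
}

def joint_probability(assignment):
    """P(assignment) as the product of P(node | parents of node)."""
    p = 1.0
    for node, pars in parents.items():
        parent_values = tuple(assignment[q] for q in pars)
        p *= cpt[node][parent_values][assignment[node]]
    return p

# Enumerate every combination of states to check the factorization
states = {"income": ["low", "high"], "old": ["veg", "urban"], "new": ["veg", "urban"]}
total = sum(
    joint_probability(dict(zip(states, values)))
    for values in product(*states.values())
)
print(round(total, 10))  # a valid joint distribution sums to 1
```

Because the graph is acyclic, every node can be evaluated once its parents are known, which is what keeps this product well defined.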

In general, BNs are either discrete or continuous; discrete and continuous variables are usually not mixed and software tools that support mixed types in the network are not common [10,12]. This is to facilitate calculating the joint probability distribution, which is either a multinomial distribution in the case of discrete-valued variables or a Gaussian distribution in the case of continuous values. In the case of continuous variables, the parameters are just regression coefficients. Because the nodes of a Bayesian network are linked, multivariate regression is performed to predict the distribution arising at each node in the network, providing regression coefficients for each pairwise interaction between a node and its connections [10].

Training a Bayesian network generally consists of two steps: learning the network structure and then fitting the parameters. In some studies, the network structure may be known or specified by an expert. The conditional probability tables (CPTs) for some or all of the variables might also be specified by an expert [2].

Structure learning is computationally intensive but many different algorithms are available that are all tractable on end-user hardware. The second step, fitting parameters, is generally done through a maximum likelihood approach (whereby the best fit parameters are estimated) or a Bayesian approach (whereby the posterior distribution of the parameters for a discrete distribution is estimated). The Bayesian approach is preferred as it provides more robust estimates and guarantees the conditional probability tables will be complete [10].
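The difference between the two fitting approaches can be illustrated for a single row of a conditional probability table. The sketch below is a simplified Python illustration, not bnlearn's implementation: it assumes a uniform Dirichlet prior expressed as a pseudo-count per class, and the class names and counts are made up. With maximum likelihood, a class that was never observed gets a probability of exactly zero; with the prior, every cell of the table receives some probability mass.

```python
from collections import Counter

def fit_cpt_row(observations, levels, prior_count=0.0):
    """Estimate P(level) from observations.

    prior_count=0 gives the maximum likelihood estimate; prior_count > 0
    adds that many pseudo-observations per level (uniform Dirichlet prior)
    and returns the posterior mean.
    """
    counts = Counter(observations)
    total = len(observations) + prior_count * len(levels)
    if total == 0:
        # No data and no prior: the probabilities are undefined
        return {level: float("nan") for level in levels}
    return {level: (counts[level] + prior_count) / total for level in levels}

levels = ["undeveloped", "low", "high"]
observed = ["undeveloped"] * 8 + ["low"] * 2  # "high" was never observed

mle = fit_cpt_row(observed, levels)                     # "high" gets probability 0
bayes = fit_cpt_row(observed, levels, prior_count=1.0)  # "high" gets 1/13
print(mle["high"], round(bayes["high"], 4))
```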

Bayesian Networks in R

The book Bayesian Networks in R by Nagarajan et al. [10] mentions a number of different R packages for investigating Bayesian networks. I'll speak only about the bnlearn package [13], which I've found to be the easiest to use and yet quite robust.

I used three classes of network learning algorithms—all available in the bnlearn package for R—to investigate possible network structures and to select a stable (consistent) structure for modeling based on a random sample of my discretized training data. Most of the algorithms I tried produced extremely dense graphs including many complete graphs (i.e., every node is connected to every other), which generally do not perform well for prediction. Ultimately, two hybrid algorithms, General 2-Phase Restricted Maximization (RSMAX2) and Max-Min Hill Climbing (MMHC), agreed upon the same network structure. The hybrid score and constraint-based class of algorithms is considered to produce more reliable networks than either score-based or constraint-based algorithms alone [10].

Learning Network Structure in R

The bnlearn package makes learning network structure a one-liner for any algorithm of choice. Some algorithms have more options than others. For instance, Incremental Association Markov Blanket (IAMB or iamb) doesn't require any parameters. The Hill Climbing algorithm (hc) optionally allows the user to specify how many times it will randomly restart to avoid local maxima and how many times it will add, delete, or reverse arcs after a restart.

iamb(training.sample)
hc(training.sample, restart=10, perturb=5)

One particularly nice feature of the bnlearn package is that it has graph plotting built right in so that you can visualize the structure of the network that was learned.

mmhc.dag <- mmhc(training.sample)
plot(mmhc.dag)

We can manipulate the graphs post-hoc by setting arcs ourselves. For instance, we might want to insist that there is an arc pointing from "old" land cover to "new" land cover.

mmhc.dag <- set.arc(mmhc.dag, 'old', 'new')

Fitting Network Parameters in R

Fitting networks in bnlearn is also short and sweet. Note that we're using a different set of training data to fit the model parameters. The method argument is where one specifies whether to use maximum likelihood estimation (mle) or Bayesian parameter estimation (bayes); the latter is currently only implemented for discrete data sets.

fitted.network <- bn.fit(mmhc.dag, data=training.sample2, method='bayes')

An Example

The BNs were trained from high-resolution, 30-meter land cover data from 2001 and landscape measures (distance to parks and distance to roads) joined to the coarse-resolution census data for 2006. A three-fold random sample of the combined predictors was created so that the samples used to learn the network structure, score the network structure, and fit the model were disjoint. Each disjoint sample contains less than 4% of the complete dataset. The predictor variables were then aggregated to 300 meters using nearest neighbor resampling to reduce the computational complexity of prediction (classification).
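One way to draw disjoint samples like those described above is to sample pixel indices once, without replacement, and then split the result. This is a minimal Python sketch under assumptions of my own: the function name, the 3% fraction (the text only says "less than 4%"), and the pixel count are all hypothetical.

```python
import random

def disjoint_samples(n_pixels, n_samples=3, fraction=0.03, seed=42):
    """Draw n_samples non-overlapping random sets of pixel indices."""
    size = round(n_pixels * fraction)
    rng = random.Random(seed)
    # Sample once without replacement, then split into equal chunks;
    # chunks of a sample without replacement cannot share an index
    drawn = rng.sample(range(n_pixels), size * n_samples)
    return [drawn[i * size:(i + 1) * size] for i in range(n_samples)]

# Hypothetical scene of 100,000 pixels
structure_idx, scoring_idx, fitting_idx = disjoint_samples(100_000)
print(len(structure_idx), len(scoring_idx), len(fitting_idx))
```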

Land cover classification with Bayesian networks consists of the following general steps:

For each pixel, get the available evidence (e.g., census measures, proximities, and land cover observations).
Obtain the posterior probability distributions given the evidence.
Considering the "new" land cover variable, choose the outcome (e.g., land cover) from the posterior probability distribution.

I'll consider in detail each of these three steps in the following sections.

Step 1: Get the Evidence

One of the virtues of BNs is that predictions do not require a simultaneous observation of all predictor variables; even just one predictor variable can be used as evidence. Below is a striking example of how this works.

In the lower right portion of this Detroit metro area land cover image and around the bottom and left edges there is what looks like random noise. This is an area of the scene where we have no predictor variables; it's an area consisting chiefly of the Detroit River and Windsor, Canada and, thus, was masked out in our dataset. Without any evidence to show to our model, land cover predictions are pulled from the prior distribution only. When predictions are made in areas where evidence exists, however, the posterior distribution is obtained and structure emerges.
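The behavior described above, where a query with no evidence returns the prior and a query with partial evidence returns a posterior, can be sketched in Python by brute-force enumeration. This is not how gRain works internally (it uses the junction tree algorithm) and the two-predictor network and its probabilities are invented for illustration only.

```python
from itertools import product

# Hypothetical network: old -> new <- near_road (all numbers are made up)
p_old = {"veg": 0.5, "urban": 0.5}
p_road = {"yes": 0.3, "no": 0.7}
p_new_urban = {  # P(new = "urban" | old, near_road)
    ("veg", "yes"): 0.4, ("veg", "no"): 0.1,
    ("urban", "yes"): 0.99, ("urban", "no"): 0.9,
}

def query_new(evidence=None):
    """Return P(new | evidence) by summing the joint over unobserved variables."""
    evidence = evidence or {}
    scores = {"veg": 0.0, "urban": 0.0}
    for old, road in product(p_old, p_road):
        # Skip combinations that contradict the evidence we were shown
        if evidence.get("old", old) != old or evidence.get("near_road", road) != road:
            continue
        weight = p_old[old] * p_road[road]
        scores["urban"] += weight * p_new_urban[(old, road)]
        scores["veg"] += weight * (1.0 - p_new_urban[(old, road)])
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}

prior = query_new()                      # no evidence: the prior for "new"
posterior = query_new({"old": "urban"})  # a single predictor is enough
print(round(prior["urban"], 4), round(posterior["urban"], 4))
```

Any subset of the predictors can be passed as evidence; the unobserved ones are simply summed out.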

In this step, for each pixel we want to show the network the evidence (the pixel's predictor variables or the values across all image bands). We first need to create an independence network so that we can query our network's conditional probabilities. This requires the gRain package. The bnlearn package knows how to manipulate gRain data structures, provided the package is available; it provides an as.grain function to return our trained network as a gRain object. Then, gRain compiles our network as an independence network using the junction tree algorithm.

require(gRain)

# We use the junction tree algorithm to create an independence network that we can query
prior <- compile(as.grain(fitted.network))

We call the output junction tree by the name prior as querying it will basically provide us with the prior distribution for land cover (before any evidence is shown).

# Get the prior probabilities for new land cover
querygrain(prior, nodes='new')$new

In the next step, we'll show the evidence to this independence network in order to obtain the posterior distribution.

Step 2: Show Evidence and Get the Posterior Distribution

This step is a little trickier. I had to write my own function to update the junction tree with the evidence, in particular because of the unique application to land cover classification. There's additional complexity due to the fact that our network has nodes (variables) with human-readable character strings (e.g. "med.hhold.income") but the raster layers that contain our discrete training data are integer-valued. As a result, we have to translate the raster layer class identifiers (e.g., 0, 1, 2...) into their class labels (e.g., "med.hhold.income", ...).

In the update.network function, we take in the junction tree to be modified (shown evidence) and an associative array of the evidence: in R, a named list of vectors such as a data frame (e.g., "med.hhold.income"=1, ...).

VARS <- c(...) # Some list of variable names as character strings

# A function to update the posterior probability distribution with evidence
update.network <- function (jtree, states) {
  states <- na.omit(states)

  # Do not do anything if the input data are all NA
  if (dim(states)[1] == 0) {
    return(jtree)
  }

  # Translate the raster classes [0, 1, 2, ...] into factors
  evidence <- data.frame(matrix(nrow=dim(states)[1], ncol=length(VARS)))
  names(evidence) <- VARS
  for (var in VARS) {
    evidence[,var] <- t(factors[var,][states[,var] + 1])
  }

  for (i in seq(1, dim(evidence)[1])) {
    jtree <- setEvidence(jtree, nodes=names(evidence),
                         nslist=mapply(list, evidence[i,]))
  }

  return(jtree)
}

To obtain the posterior distribution, then, looks something like this.

# Update the posterior
# The evidence must be a data frame, because update.network() calls dim() on it
posterior <- update.network(prior, data.frame(med.hhold.income=1, ...))

# Get the posterior probabilities for new land cover
querygrain(posterior, nodes='new')$new

Step 3: Predict the Outcome from the Posterior Distribution

Here, we need another convenience function; one to pick from the posterior distribution. The choose.outcome function does this by creating a cumulative probability distribution, generating a random deviate on the uniform interval between 0 and 1, and then choosing the class that covers the interval in which the deviate is found.

For example, if there are two classes with posterior probabilities of 44% for class 0 and 56% for class 1, then a random deviate generated on [0, 0.44] will cause the pixel to be assigned to class 0 (44% of the time) while a random deviate generated on (0.44, 1.0] will cause the pixel to be assigned to class 1 (56% of the time).

# A function to choose outcomes, one at a time, with the same probability as the given posterior distribution
choose.outcome <- function (posterior) {
  posterior <- sort(posterior)

  # Sort the posterior probabilities by factors, e.g. "1=0.56,0=0.44" becomes "0=0.44,1=0.56"
  post <- numeric()
  for (i in seq.int(1, length(posterior))) {
    post[i] <- posterior[as.character(i - 1)]
  }

  # Generate a vector of cumulative probability thresholds e.g. [0.0, 0.44] for transitions to [0, 1]; upper bound of p=1.0 is implied.
  # Using cumsum() keeps the thresholds correct for any number of classes
  prob <- c(0, cumsum(post))[seq_along(post)]

  # Generate a random uniform deviate on [0, 1] to determine which factor to output
  r <- runif(1)
  for (i in seq.int(0, length(prob) - 2)) {
    if (r < prob[i + 2]) {
      return(i) # p < threshold in e.g. [0, 0.44]? Output that factor
    }
  }

  (length(prob) - 1) # p < implied upper bound of 1.0? Output last factor
}
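For readers more comfortable with Python, the same sampling scheme (build the cumulative distribution, draw one uniform deviate, return the class whose interval contains it) can be sketched as follows. The function name and the dictionary representation of the posterior are my own choices for illustration, not part of the R workflow above.

```python
import random
from itertools import accumulate

def choose_outcome(posterior, rng=random):
    """Pick a class label with the probability given by the posterior (a dict)."""
    labels = sorted(posterior)  # fixed class order, e.g. 0, 1, 2
    # Cumulative upper bounds, e.g. [0.44, 1.0] for probabilities [0.44, 0.56]
    upper = list(accumulate(posterior[k] for k in labels))
    r = rng.random()  # uniform deviate on [0, 1)
    for label, bound in zip(labels, upper):
        if r < bound:
            return label
    return labels[-1]  # guard against floating-point rounding in the last bound

# Over many draws, class 0 should be chosen about 44% of the time
rng = random.Random(0)
draws = [choose_outcome({0: 0.44, 1: 0.56}, rng) for _ in range(20000)]
print(draws.count(0) / len(draws))
```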

Finally, we're ready to make some predictions, which, in this case, means showing evidence to the network, obtaining the posterior distribution, and making a prediction by sampling from the posterior distribution. For our land cover classification, we can use the stackApply function from the raster package to operate on a stack of raster layers, each corresponding to a predictor variable, for an efficient way of generating a vector of evidence.

require(gRain)
require(raster)

# Use the junction tree algorithm to create an independence network to query
prior <- compile(as.grain(fitted.network))

# A function to operate on each vector of predictors (vector of pixels across bands)
func <- function (r, ...) {
  choose.outcome(querygrain(update.network(prior, as.data.frame(t(r))),
                            nodes='new')$new)
}

expert.prediction <- stackApply(layers,
                                rep(1, length(names(layers))), func)

In using stackApply, we need to convert the vector of discrete raster values into a data frame with as.data.frame(t(r)) where r is the vector of raster values; we take the transpose, t(r), before turning it into a data frame so it has the right shape expected by the update.network function.

Transition Probabilities

Another neat feature of Bayesian networks is that we can easily obtain transition probabilities for our outcomes. In this case, transition probability refers to the probability that a given pixel will be assigned a certain land cover class by our model.

Recall that in the choose.outcome function we were sampling a single outcome from the posterior distribution. To calculate transition probabilities, we instead want to assign the probability of a specific outcome for a given pixel as the value of that pixel. This time, we use the raster calculator, calc, in the raster package to apply an arbitrary function over the pixels of our raster stack.

require(gRain)
require(raster)

# In our case, we have 3 possible outcomes
no.outcomes <- 3

# Find transition probabilities for the expert graph
trans.probs.expert <- raster::calc(layers, function (states) {
  trans <- matrix(nrow=dim(states)[1], ncol=no.outcomes)

  for (i in seq(1, dim(states)[1])) {
    # Query the network for the posterior probability of a certain "new" outcome
    trans[i,] <- querygrain(update.network(prior, as.data.frame(t(states[i,]))),
                            nodes='new')$new
  }

  return(trans)
}, forcefun=TRUE)

Below are images of the transition probabilities for the Detroit metro area land cover in 2011 as predicted from 2010 U.S. Census and landscape measures (click on each for their full resolution).

We can see that the transition probabilities for all three of the predicted land cover classes (undeveloped, low development, and high development) are fairly high but spatially distinct. Here, "development" refers to the proportion of impervious surface area as indicated by the National Land Cover Database (NLCD). Undeveloped areas are thought to have less than 20% impervious surface cover, low-development areas between 20% and 80%, and high-development areas more than 80%. Thus, in the "undeveloped" transition probabilities, we see very high (≥ 0.8) probabilities in the outlying suburban and exurban areas, whereas the urban core has basically a 0% chance of transitioning to undeveloped. The urban core is easily resolved in the "low development" transition probabilities, and main road and highway corridors are seen in the "high development" transition probabilities, as expected.
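The development thresholds quoted above amount to a simple lookup. As a small Python sketch (the function name is hypothetical, and assigning exactly 20% and exactly 80% to the "low development" class is my assumption, since the text does not say which class the boundary values fall into):

```python
def development_class(impervious_pct):
    """Map percent impervious surface (0-100) to a development class."""
    if not 0 <= impervious_pct <= 100:
        raise ValueError("impervious_pct must be between 0 and 100")
    if impervious_pct < 20:
        return "undeveloped"
    if impervious_pct <= 80:  # boundary handling is an assumption
        return "low development"
    return "high development"

print([development_class(p) for p in (5, 50, 95)])
```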

How does our final land cover classification look? Land cover data from 2006, as the "new" or "outcome" land cover, were used to train the Bayesian network so it's no surprise the 2006 classification looks very good. The classification from 2011 is just slightly worse. The images below show the difference between the classifier's prediction and the observed NLCD land cover.

To quantify the agreement, we can use Cohen's kappa, a measure of the agreement between two sets of categorical labels that corrects for the agreement expected by chance; here, the sets are the scene-wide predicted and observed land cover values, pixel for pixel. For the 2006 prediction, Cohen's kappa is 0.97, which is very high given that 1.0 would represent perfect agreement. In 2011, this agreement drops to 0.92. Different data structures are required in the training and validation process and there isn't an efficient way to remove the training pixels from the output in R. Thus, these kappas are also inflated slightly due to the inclusion of training data in the validation set, which is the entire image. However, the training data constitute less than 4% of the dataset.
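For reference, Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed fraction of matching labels and p_e is the agreement expected by chance from the marginal class frequencies. Below is a minimal Python sketch of that formula; it is not the code used to produce the numbers above, and the toy label vectors are invented.

```python
from collections import Counter

def cohens_kappa(predicted, observed):
    """Cohen's kappa for two equal-length sequences of class labels."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predicted)
    # Observed agreement: fraction of pixels with identical labels
    p_o = sum(p == o for p, o in zip(predicted, observed)) / n
    # Chance agreement, from the marginal class frequencies of each map
    pred_counts, obs_counts = Counter(predicted), Counter(observed)
    p_e = sum(pred_counts[c] * obs_counts[c] for c in pred_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # both maps contain a single, identical class
    return (p_o - p_e) / (1.0 - p_e)

# Toy example with made-up labels for eight pixels
predicted = [0, 0, 1, 1, 2, 2, 0, 1]
observed = [0, 0, 1, 1, 2, 2, 1, 1]
print(round(cohens_kappa(predicted, observed), 4))
```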

Concluding Remarks

While the classification accuracies as indicated by Cohen's kappa are quite high, there are three important considerations that should temper our enthusiasm:

The model completely fails to predict rare events, in this case, new urban development.
The model included "old" land cover as a predictor, which is a considerable advantage as most pixels don't change their land cover from year-to-year.
This classification was based in part on another classification, the NLCD "development" land cover classification.

The failure to predict rare events is related to our inclusion of "old" land cover. In a sense, there is a considerable inertia to land cover change; extant land cover rarely does change. It would be interesting to run the model again without the "old" land cover. It would also be interesting to investigate the model's performance given a remote sensing dataset rather than a previous classification. And we haven't even begun to look at the conditional probability tables! In summary, I think it's fair to say that Bayesian networks are promising for reproducing static patterns given sparse evidence and deserve attention in future land cover classification applications that are considering machine learning approaches.

References

Ridd, M. 1995. Exploring a VIS (vegetation-impervious surface-soil) model for urban ecosystem analysis through remote sensing: comparative anatomy for cities. International Journal of Remote Sensing 16 (12):2165–2185.
Kocabas, V., and S. Dragicevic. 2007. Enhancing a GIS Cellular Automata Model of Land Use Change: Bayesian Networks, Influence Diagrams and Causality. Transactions in GIS 11 (5):681–702.
Jantz, C. A., S. J. Goetz, and M. K. Shelley. 2004. Using the SLEUTH urban growth model to simulate the impacts of future policy scenarios on urban land use in the Baltimore-Washington metropolitan area. Environment and Planning B: Planning and Design 31 (2):251–271.
Sexton, J. O., X.-P. Song, C. Huang, S. Channan, M. E. Baker, and J. R. Townshend. 2013. Urban growth of the Washington, D.C.–Baltimore, MD metropolitan region from 1984 to 2010 by annual, Landsat-based estimates of impervious cover. Remote Sensing of Environment 129:42–53.
Brown, D. G., P. H. Verburg, R. G. Pontius, and M. D. Lange. 2013. Opportunities to improve impact, integration, and evaluation of land change models. Current Opinion in Environmental Sustainability 5 (5):452–457.
Hoalst-Pullen, N., M. W. Patterson, and J. D. Gatrell. 2011. Empty spaces: neighbourhood change and the greening of Detroit, 1975–2005. Geocarto International 26 (6):417–434.
Emmanuel, R. 1997. Urban vegetational change as an indicator of demographic trends in cities: the case of Detroit. Environment and Planning B: Planning and Design 24:415–426.
Ryznar, R. M., and T. W. Wagner. 2001. Using Remotely Sensed Imagery to Detect Urban Change: Viewing Detroit from Space. Journal of the American Planning Association 67 (3):327–336.
Pearl, J. 1985. Bayesian networks: A model of self-activated memory for evidential reasoning. In Seventh Annual Conference of the Cognitive Science Society.
Nagarajan, R., M. Scutari, and S. Lebre. 2013. Bayesian Networks in R. New York, New York, USA: Springer.
Charniak, E. 1991. Bayesian Networks without Tears. AI Magazine 12 (4).
Uusitalo, L. 2007. Advantages and challenges of Bayesian networks in environmental modelling. Ecological Modelling 203 (3-4):312–318.
Scutari, M. 2014. bnlearn - an R package for Bayesian network learning and inference. http://www.bnlearn.com

PostGIS 2.x: Getting raster data out of the database

2013-08-21T12:00:00+02:00

This article was originally posted on the AmericaView Blog. You can also fork the Gist of this article. The Python and raw SQL examples are taken from my work on the Burned Area Emergency Response Spatial WEPP Model Inputs Generator.

PostGIS 2.x (latest release, 2.1) enables users to do fairly sophisticated raster processing directly in a database. For many applications, these data can stay in the database; it's the insight into spatial phenomena that comes out. Sometimes, however, you need to get file data (e.g. a GeoTIFF) out of PostGIS. It isn't immediately obvious how to do this efficiently, despite the number of helpful functions that serialize a raster field to Well-Known Binary (WKB) or other "flat" formats.

Background

In particular, I recently needed to create a web service that delivers PostGIS raster outputs as file data. The queries that we needed to support were well suited for PostGIS and sometimes one query would consume another (one or more) as subquer(ies). These and other considerations led me to decide to implement the service layer in Python using either GeoDjango or GeoAlchemy. More on that later. Suffice to say, a robust and stable solution for exporting and attaching file data from PostGIS to an HTTP response was needed. I found at least six (6) different ways of doing this; there may be more:

Export an ASCII grid ("AAIGrid")
Connect to the database using a desktop client (e.g. QGIS) [1]
Use a procedural language (like PLPGSQL or PLPython) [2]
Use the COPY declaration to get a hex dump out, then convert to binary
Fill a 2D NumPy array with a byte array and serialize it to a binary file using GDAL or psycopg2 [3, 4]
Use a raster output function to get a byte array, which can be written to a binary field

It's nice to have options. But what's the most appropriate? If that's a difficult question to answer, what's the easiest option? I'll explore some of them in detail.

Export An ASCII Grid

This works great! Because an ASCII grid file (or "ESRI Grid" file, with the *.asc or *.grd extension, typically) is just plain text, you can directly export it from the database. The GDAL driver name is "AAIGrid" which should be the second argument to ST_AsGDALRaster(). Be sure to remove the column header from your export (see image below).

Here's a contrived example:

SELECT ST_AsGDALRaster(mytable.rast, 'AAIGrid') AS rast
  FROM mytable

This approach has a downside, however. What you get is a file with no projection information, and you may still need to convert it to another format. This can present problems for your workflow, especially if you're trying to automate the production of raster files, say, through a web API.

Connecting Using the QGIS Desktop Client

There is a plug-in for QGIS that promises to allow you to load raster data from PostGIS directly into a QGIS workspace. I used the Plugins Manager ("Plugins" > "Fetch Python Plugins...") in QGIS to get this plug-in package. The first time I selected the "Load PostGIS Raster to QGIS" plug-in and tried to install it, I found that I couldn't write to the plug-ins directory (this with a relatively fresh installation of QGIS). After creating and setting myself as the owner of the python/plugins directory, I was able to install the plug-in without any further trouble. Connecting to the database and viewing the available relations was also no trouble at all. One minor irritation is that you need to enter your password every time the plug-in interfaces with the database, which can be very often: every time the list of available relations needs to be updated.

There are a few options available to you in displaying raster data from the database: "Read table's vector representation," "Read one table as a raster," "Read one row as a raster," or "Read the dataset as a raster." It's not clear what the second and last choices are, but "Reading the table as a raster" did not work for me where my table has one raster field and a couple of non-raster, non-geometry/geography fields; QGIS hung for a few seconds then said it "Could not load PG..." Reading one row worked; however, you have to select the row by its primary key (or by its row number in a random selection; I'm not sure which it is returning). In summary, this might work for a single raster of interest but it is very awkward and time-consuming.

Using the COPY Declaration in SQL

My colleague suggested this method, demonstrated in Python, which requires the pygresql module to be installed; easy enough with pip:

pip install psycopg2 pygresql

The basic idea is to use the COPY declaration in SQL to export the raster to a hexadecimal file, then to convert that file to a binary file using xxd. The following is an implementation in Python:

import os, stat, pg
# See: http://www.pygresql.org/install.html
# pip install psycopg2 pygresql

# Designate path to output file
outfile = '/home/myself/temp.tiff'

# Name of PostgreSQL table to export
pg_table = 'geowepp_soil'

# PostgreSQL connection parameters
pg_server = 'my_server'
pg_database = 'my_database'
pg_user = 'myself'

# Designate a file to receive the hex data; make sure it exists with the right permissions
pg_hex = '/home/myself/temp.hex'
os.mknod(pg_hex, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP | stat.S_IWGRP)

conn = pg.connect(pg_database, pg_server, 5432, None, None, pg_user)
sql = "COPY (SELECT encode(ST_AsTIFF(ST_Union(" + pg_table + ".rast)), 'hex') FROM " + pg_table + ") TO '" + pg_hex + "'"

# You can check it with: print sql
conn.query(sql)

cmd = 'xxd -p -r ' + pg_hex + ' > ' + outfile
os.system(cmd)

This needs to be done on the file system of the database server, which is where PostgreSQL will write.
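If xxd isn't available, the hex-to-binary step can also be done in Python itself with the standard library. This is a minimal sketch that assumes the hex file contains only hexadecimal digits and whitespace, which is what encode(..., 'hex') should produce; the function name is hypothetical.

```python
import binascii

def hex_dump_to_binary(hex_path, out_path):
    """Convert a plain hexadecimal text file to raw bytes, like `xxd -p -r`."""
    with open(hex_path, "r") as src:
        # Drop newlines and any other whitespace before decoding
        hex_text = "".join(src.read().split())
    with open(out_path, "wb") as dst:
        dst.write(binascii.unhexlify(hex_text))
```

With the variables from the script above, this would be called as hex_dump_to_binary(pg_hex, outfile) in place of the os.system() call.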

Serializing from a Byte Array

Despite the seeming complexity of this option (then again, compare it to the above), I think it is the most flexible approach. I'll provide two examples here, with code: using GeoDjango to execute a raw query and using GeoAlchemy2's object-relational model to execute the query. Finally, I'll show an example of writing the output to a file or to a Django HttpResponse() instance.

Using GeoDjango

First, some setup. We'll define a RasterQuery class to help with handling the details. While a new class isn't exactly an idiomatic example, I'm hoping it will succinctly illustrate the considerations involved in performing raw SQL queries with Django.

import os
from django.db import connection

class RasterQuery:
    '''
    Assumes some global FORMATS dictionary describes the valid file formats,
    their file extensions and MIME type strings.
    '''
    def __init__(self, qs, params=None, file_format='geotiff'):
        assert file_format in FORMATS.keys(), 'Not a valid file format'

        self.cursor = connection.cursor()
        self.params = params
        self.query_string = qs
        self.file_format = file_format
        self.file_extension = FORMATS[file_format]['file_extension']

    def execute(self, params=None):
        '''Execute the stored query string with the given parameters'''
        self.params = params

        if self.params is not None:
            self.cursor.execute(self.query_string, params)

        else:
            self.cursor.execute(self.query_string)

    def fetch_all(self):
        '''Return all results in a List; a List of buffers is returned'''
        return [
            row[0] for row in self.cursor.fetchall()
        ]

    def write_all(self, path, name=None):
        '''For each raster in the query, writes it to a file on the given path'''
        name = name or 'raster_query'

        # Build a fresh file name for each raster so that the numbered
        # suffixes don't accumulate from one iteration to the next
        for i, each in enumerate(self.fetch_all()):
            filename = name + str(i + 1) + self.file_extension
            with open(os.path.join(path, filename), 'wb') as stream:
                stream.write(each)

With the RasterQuery class available to us, we can more cleanly execute our raw SQL queries and serialize the response to a file attachment in a Django view:

from django.http import HttpResponse

def clip_one_raster_by_another(request):

    # Our raw SQL query, with parameter strings
    query_string = '''
    SELECT ST_AsGDALRaster(ST_Clip(landcover.rast,
        ST_Buffer(ST_Envelope(burnedarea.rast), %s)), %s) AS rast
      FROM landcover, burnedarea
     WHERE ST_Intersects(landcover.rast, burnedarea.rast)
       AND burnedarea.rid = %s'''

    # Create a RasterQuery instance; apply the parameters
    query = RasterQuery(query_string)
    query.execute([1000, 'GTiff', 2])

    filename = 'blah.tiff'

    # Outputs:
    # [(<read-only buffer for 0x2613470, size 110173, offset 0 at 0x26a05b0>),
    #  (<read-only buffer for 0x26134b0, size 142794, offset 0 at 0x26a01f0>)]

    # Return only the first item
    response = HttpResponse(query.fetch_all()[0], content_type=FORMATS[query.file_format]['mime'])
    response['Content-Disposition'] = 'attachment; filename="%s"' % filename
    return response

Seem simple enough? To write to a file instead, see the write_all() method of the RasterQuery class. The query.fetch_all()[0] at the end is contrived. I'll show a better way of getting to a nested buffer in the next example.
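As a rough sketch of the file-writing path, the snippet below mimics what write_all() does without needing a database connection. It is written for Python 3, and FakeCursor, write_buffers(), and the '.tiff' extension are hypothetical stand-ins of mine, not part of the RasterQuery class:

```python
import os
import tempfile


class FakeCursor(object):
    '''Hypothetical stand-in for a database cursor; returns one-column rows'''
    def __init__(self, rows):
        self._rows = rows

    def fetchall(self):
        return list(self._rows)


def write_buffers(cursor, path, name='raster_query', file_extension='.tiff'):
    '''Writes the first column of each row to its own numbered file'''
    written = []
    for i, row in enumerate(cursor.fetchall()):
        # Each file name is built fresh from the base name
        filename = '%s%d%s' % (name, i + 1, file_extension)
        full_path = os.path.join(path, filename)
        with open(full_path, 'wb') as stream:
            stream.write(row[0])

        written.append(full_path)

    return written


# Two fake "rasters" standing in for the buffers returned by ST_AsGDALRaster
cursor = FakeCursor([(b'first raster',), (b'second raster',)])
output_dir = tempfile.mkdtemp()
paths = write_buffers(cursor, output_dir)
```

Each row gets its own file, numbered from 1, in the temporary directory.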

Using GeoAlchemy2

GeoAlchemy2's object-relational mapper (ORM) allows tables to be represented as classes, just like in Django.

class LandCover(DeclarativeBase):
    __tablename__ = 'landcover'
    rid = Column(Integer, primary_key=True)
    rast = Column(ga2.types.Raster)
    filename = Column(String(255))

class BurnedArea(DeclarativeBase):
    __tablename__ = 'burnedarea'
    rid = Column(Integer, primary_key=True)
    rast = Column(ga2.types.Raster)
    filename = Column(String(255))
    burndate = Column(Date)
    burnname = Column(String(255))

Assuming that SESSION and ENGINE global variables are available, the gist of this approach can be seen in this example:

def clip_fccs_by_mtbs_id2(request):

    query = SESSION.query(LandCover.rast\
        .ST_AsGDALRaster(LandCover.rast\
        .ST_Clip(LandCover.rast, BurnedArea.rast\
        .ST_Envelope()\
        .ST_Buffer(1000)), 'GTiff').label('rast'))\
        .filter(LandCover.rast.ST_Intersects(BurnedArea.rast), BurnedArea.rid==2)

    filename = 'blah.tiff'
    _format = 'GTiff' # Same format name passed to ST_AsGDALRaster above

    # query.all() outputs:
    # [(<read-only buffer for 0x2613470, size 110173, offset 0 at 0x26a05b0>),
    #  (<read-only buffer for 0x26134b0, size 142794, offset 0 at 0x26a01f0>)]

    result = query.all()
    while type(result) != buffer:
        result = result[0] # Unwrap until a buffer is found

    # Consequently, it returns only the first item
    response = HttpResponse(result, content_type=FORMATS[_format]['mime'])
    response['Content-Disposition'] = 'attachment; filename="%s"' % filename
    return response

Here we see a better way of getting at a nested buffer. If we wanted all of the rasters that were returned (all of the buffers), we could call ST_Union on our final raster selection before passing it to ST_AsGDALRaster.
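The buffer type exists only in Python 2. As a hedged sketch of the same unwrapping idea for Python 3, where database drivers such as psycopg2 typically hand back memoryview or bytes objects instead, something like the following could be used. The function name unwrap_first_binary() and the sample rows are my own illustration:

```python
def unwrap_first_binary(result):
    '''Follows the first element of nested sequences down to a binary object'''
    binary_types = (bytes, bytearray, memoryview)
    while not isinstance(result, binary_types):
        # Guard against looping forever on strings or non-sequences
        if isinstance(result, str) or not hasattr(result, '__getitem__'):
            raise TypeError('No binary object found; got %r' % type(result))

        if len(result) == 0:
            raise ValueError('Query returned no rows')

        result = result[0] # Unwrap one level

    return result


# Fake query results: a list of one-column rows, as query.all() might return
rows = [(memoryview(b'raster one'),), (memoryview(b'raster two'),)]
first = unwrap_first_binary(rows)
```

Unlike the bare while loop, this version fails loudly on an empty result set rather than raising an IndexError from deep inside the view.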

In Summary...

After considering all my (apparent) options, I found this last technique, using the PostGIS raster output function(s) and writing the byte array to a file-attachment in an HTTP response, to be best suited for my application. I'd be interested in hearing about other techniques not described here.



Option 4: NumPy's `unpackbits()`