<h1>Approximate Bayesian computation in Python</h1>
<p>K. Arthur Endsley, karthur.org, 2021-11-13</p>
<p>The <a href="https://docs.pymc.io/en/v3/">PyMC</a> library offers a solid foundation for probabilistic programming and Bayesian inference in Python, but it has some drawbacks. Although the API is robust, it has changed frequently along with the shifting momentum of the entire PyMC project (formerly "PyMC3"). This is most evident in the <a href="https://github.com/pymc-devs/pymc4">abandoned "PyMC4" project</a> and the <a href="https://github.com/Theano/Theano">stranding of Theano</a>. The PyMC developers have picked up where the developers of Theano left off, introducing "Aesara" as a replacement; but, as it stands, little has been written about how to migrate from Theano to Aesara, especially for PyMC users who don't have prior experience with computational graphs or tensor libraries.</p>
<p>PyMC is actually quite expressive and there are <a href="https://docs.pymc.io/en/v3/nb_examples/index.html">rich examples</a> of how to use it for more than just Bayesian inference. These community-contributed examples are not always updated, however, to reflect the changes to PyMC over the years. <strong>In particular, there are few examples of how to use PyMC for approximate Bayesian computation (ABC), and those I've seen [<a href="#refs">1,2</a>] are <a href="https://discourse.pymc.io/t/blackbox-likelihood-example-doesnt-work/5378">functionally deprecated</a> given the current state of the library.</strong></p>
<p>ABC, also referred to as <em>likelihood-free inference</em> [<a href="#refs">3</a>], involves Bayesian simulation—primarily through Markov Chain Monte Carlo (MCMC)—for models that have intractable or analytically unavailable likelihood functions. This includes models that have no formal likelihood, such as physical models or other simulations that have no analytical representation. In such cases, we require an approach to modeling with a "black-box" likelihood function, for instance, the mean squared error between a model's predictions and observed values. This approach is popular for fitting a variety of models to observed data and is the focus of another Python library, SPOTPY [<a href="#refs">4</a>]. However, in my experience, SPOTPY has some outstanding issues related to the proposal distribution, in which bounded priors are sampled in a way that fails to preserve balance in the Markov chain [<a href="#refs">5</a>]. Moreover, it doesn't have the advantages that PyMC offers, with its rich diagnostics and performance enhancements. I spent some time figuring out how to do ABC in PyMC, so I think it's worth reporting here in case anyone else is struggling to navigate the vast and shifting sands of PyMC's API.</p>
<h2>Lotka-Volterra Example</h2>
<p>Here, I use the example of the Lotka-Volterra ordinary differential equations (ODEs), as presented in one of the deprecated PyMC examples [<a href="#refs">2</a>]. These well-known ODEs are used to model the dynamics of a pair of interacting predator and prey populations. <span class="math">\(N_t\)</span> and <span class="math">\(M_t\)</span> describe the size of prey and predator populations, respectively, at time <span class="math">\(t\)</span>. Change in the populations is given by:
</p>
<div class="math">$$
\frac{d N_t}{d t} = \alpha N_t - \beta N_t M_t
$$</div>
<div class="math">$$
\frac{d M_t}{d t} = -\gamma M_t + \delta N_t M_t
$$</div>
<p>Where <span class="math">\(\alpha\)</span> is the prey growth rate (in the absence of predators), <span class="math">\(\beta\)</span> is the prey mortality rate, <span class="math">\(\gamma\)</span> is the predator mortality rate (in the absence of prey), and <span class="math">\(\delta\)</span> is the efficiency with which consumed prey are converted into new predators. A Python implementation of this model, postponing a solution, could be written as follows where, for simplicity, we set <span class="math">\(\gamma = 1.5\)</span> and <span class="math">\(\delta = 0.75\beta\)</span>:</p>
<div class="highlight"><pre><span></span><code>import numpy as np

# True parameter values (alpha, beta)
alpha = 1.0
beta = 0.1

# Initial population of rabbits and foxes
X0 = [10.0, 5.0]
size = 100  # Size of data
time = 15   # Time lapse
t = np.linspace(0, time, size)

# Lotka-Volterra equations
def dX_dt(X, t, a = alpha, b = beta):
    'Return the growth rate of fox and rabbit populations.'
    return np.array([
        a * X[0] - b * X[0] * X[1],
        -1.5 * X[1] + 0.75 * b * X[0] * X[1]
    ])
</code></pre></div>
<p>The <em>solution</em> to this system of ODEs might be written:</p>
<div class="highlight"><pre><span></span><code>from scipy.integrate import odeint

def competition_model(params, x = None):
    'Simulator function (an ordinary differential equation to be integrated)'
    a, b, _ = params
    return odeint(dX_dt, y0 = X0, t = t, rtol = 0.01, args = (a, b))
</code></pre></div>
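<p>As a quick sanity check (not in the original post), the simulator's output should have one row per time step and one column per population, matching the shape of the synthetic data generated next. A condensed, self-contained restatement:</p>

```python
import numpy as np
from scipy.integrate import odeint

# Condensed restatement of the setup above
alpha, beta = 1.0, 0.1
X0 = [10.0, 5.0]  # Initial rabbit and fox populations
t = np.linspace(0, 15, 100)

def dX_dt(X, t, a = alpha, b = beta):
    'Return the growth rate of fox and rabbit populations.'
    return np.array([
        a * X[0] - b * X[0] * X[1],
        -1.5 * X[1] + 0.75 * b * X[0] * X[1]
    ])

out = odeint(dX_dt, y0 = X0, t = t, rtol = 0.01, args = (alpha, beta))
print(out.shape)  # (100, 2): one row per time step, one column per population
```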
<p>We generate some synthetic, "observed" data as:</p>
<div class="highlight"><pre><span></span><code># Observed data (with added random noise)
data = competition_model(
    params = (alpha, beta, None)) + np.random.normal(size = (size, 2))
</code></pre></div>
<h2>The Black-Box Likelihood Operation</h2>
<p>PyMC is built on top of Theano, a computational graph library that allows for symbolic computing and automatic differentiation. When trying to do ABC, we have to develop a way for Theano to sample the posterior likelihood even though we have no formal likelihood function that can be represented analytically. With ABC, there is no likelihood function that Theano can differentiate with respect to model parameters; no gradient to be calculated.</p>
<p>We can, however, create a custom Theano operator (a subclass of <code>tt.Op</code>) that returns a quasi-likelihood value, allowing a sampler (e.g., Metropolis-Hastings) to sample from the posterior likelihood even though a gradient is unavailable. Here, the quasi-likelihood function is the root-mean-square error (RMSE), though we could use any other goodness-of-fit metric. The <code>perform()</code> method represents the operation that is performed on Theano's computational graph: given a vector of proposed parameter values, calculate the posterior likelihood. For calculating the RMSE, it is necessary that this operation have the observed values available in-memory, so we set them as an instance attribute, <code>self.observed</code>. Because our model may require some additional driver data, we also set these data as an instance attribute, <code>self.x</code>.</p>
<div class="highlight"><pre><span></span><code>import theano.tensor as tt

class BlackBoxLikelihood(tt.Op):
    itypes = [tt.dvector]  # Expects a vector of parameter values when called
    otypes = [tt.dscalar]  # Outputs a single scalar value (the log-likelihood)

    def __init__(self, model, observed, x):
        '''
        Parameters
        ----------
        model : Callable
            An arbitrary "black box" function that takes two arguments: the
            model parameters ("params") and the forcing data ("x")
        observed : numpy.ndarray
            The "observed" data that our log-likelihood function takes in
        x : numpy.ndarray or None
            The forcing data (input drivers) that our model requires
        '''
        self.model = model
        self.observed = observed
        self.x = x

    def loglik(self, params, x, observed):
        # The (negative) root-mean-square error (RMSE)
        predicted = self.model(params, x)
        return -np.sqrt(np.nanmean((predicted - observed) ** 2))

    def perform(self, node, inputs, outputs):
        # The method that is called when the Op is evaluated
        (params,) = inputs
        logl = self.loglik(params, self.x, self.observed)
        outputs[0][0] = np.array(logl)  # Output the log-likelihood
</code></pre></div>
<p><strong>Note that we return the <em>negative</em> RMSE, as PyMC samplers are used to working with log-likelihoods and we wish to maximize the likelihood function</strong> (i.e., obtain the smallest RMSE).</p>
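<p>To see why negating the error metric matters, consider a toy example with a simple linear model (made up here for illustration; it is not the Lotka-Volterra system): the quasi-likelihood is larger, i.e., closer to zero, for parameter values that fit the data better, which is what a sampler needs in order to "maximize" it.</p>

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
observed = 2.0 * x + rng.normal(scale = 0.5, size = x.size)  # True slope is 2.0

def neg_rmse(slope):
    # Negative RMSE, so that a better fit yields a larger value
    predicted = slope * x
    return -np.sqrt(np.nanmean((predicted - observed) ** 2))

# Higher quasi-likelihood near the true parameter value
print(neg_rmse(2.0) > neg_rmse(3.0))  # True
```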
<p>For some applications, we may want to use the Gaussian log-likelihood or a similar exponential likelihood function.
</p>
<div class="math">$$
\mathrm{log}\,\mathcal{L}(\hat{\theta}, \hat{\sigma}) = -\frac{N}{2}\,\mathrm{log}(2\pi\hat{\sigma}^2)
- \frac{1}{2\hat{\sigma}^2} \sum (\hat{y}(\hat{\theta}) - y)^2
$$</div>
<p>In this case, we need to include the standard deviation of the errors, <code>sigma</code>, as a parameter in our model; it's most convenient to make it the last element of the parameter vector:</p>
<div class="highlight"><pre><span></span><code>def loglik(self, params, x, observed):
    predicted = self.model(params, x)
    sigma = params[-1]
    # Gaussian log-likelihood, matching the formula above (note the
    # leading factor of N, the number of observations)
    n = observed.size
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - (0.5 / sigma**2) *\
        np.nansum((predicted - observed)**2)
</code></pre></div>
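<p>As a sanity check (not part of the original example), the closed form above, with its leading factor of <span class="math">\(N\)</span>, should agree with summing the Gaussian log-density over all observations; here SciPy's <code>stats.norm.logpdf</code> serves as the reference:</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
predicted = rng.normal(size = 20)
observed = predicted + rng.normal(scale = 0.3, size = 20)
sigma = 0.3
n = observed.size

# The closed-form Gaussian log-likelihood
loglik = -0.5 * n * np.log(2 * np.pi * sigma**2) \
    - (0.5 / sigma**2) * np.sum((predicted - observed)**2)

# Summing the per-observation Gaussian log-density gives the same value
reference = stats.norm.logpdf(observed, loc = predicted, scale = sigma).sum()
print(np.isclose(loglik, reference))  # True
```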
<h2>Parameter Estimation with PyMC</h2>
<p>Finally, we're ready to estimate our model's parameters using PyMC. If you're unfamiliar with PyMC and how models are defined, check out the tutorial, <a href="https://docs.pymc.io/en/v3/pymc-examples/examples/getting_started.html">"Getting started with PyMC3"</a> before trying to figure out what's going on below. If you are already familiar with PyMC models, hopefully there are no surprises here.</p>
<p>In this example, we use a Gaussian log-likelihood function, so we add the error standard deviation, <code>sigma</code>, as a model parameter.</p>
<div class="highlight"><pre><span></span><code>import arviz as az
import pymc3 as pm
from matplotlib import pyplot

NUM_DRAWS = 50000  # Number of posterior draws in each chain

loglik = BlackBoxLikelihood(competition_model, data, None)

with pm.Model() as my_model:
    # (Stochastic) Priors for unknown model parameters
    a = pm.HalfNormal('a', sd = 1)
    b = pm.HalfNormal('b', sd = 1)
    sigma = pm.HalfNormal('sigma', sd = 1)
    # Convert model parameters to a tensor vector
    params = tt.as_tensor_variable([a, b, sigma])
    # Define the likelihood as an arbitrary potential
    pm.Potential('likelihood', loglik(params))
    trace = pm.sample(
        draws = NUM_DRAWS, step = pm.Metropolis(), chains = 4,
        start = {'a': 0.5, 'b': 0.5, 'sigma': 10})

az.plot_trace(trace)
pyplot.show()
</code></pre></div>
<p>Because, in this toy example, we know the true parameter values, a look at the trace plot suggests that they are correctly identified. However, there is a lot of uncertainty in the <span class="math">\(\beta\)</span>, or <code>b</code>, parameter.</p>
<p><a href="/images/20211113_trace_plot_gaussian.png"><img style="float:left;" src="/images/thumbs/20211113_trace_plot_gaussian_thumbnail_square.png" /></a></p>
<div style="clear:both;"></div>
<p><strong>Let's try it again but using the RMSE quasi-likelihood this time.</strong> Again, we'll take 50,000 samples of the posterior in each of four chains. At first glance, our posterior distributions seem narrower with the RMSE quasi-likelihood; however, if we look closely at the horizontal axes, we see that the RMSE quasi-likelihood is permitting a few larger jumps in the proposal distribution than we saw with the Gaussian likelihood.</p>
<p><a href="/images/20211113_trace_plot_rmsd.png"><img style="float:left;" src="/images/thumbs/20211113_trace_plot_rmsd_thumbnail_square.png" /></a></p>
<div style="clear:both;"></div>
<p>Despite these differences, the choice of whether to use the RMSE or a formal likelihood function should be based on substantive modeling concerns. In this case, is the Gaussian likelihood appropriate, given that our observed data might not be independent and identically distributed?</p>
<h2>Other Considerations</h2>
<p>When fitting a model to data with MCMC, we gain the ability to estimate the uncertainty in our model parameters, which can be visually assessed from plots like the ones above. But how do we pick the best-fit parameters? In this toy example, the maximum posterior estimate obviously corresponds to the true parameter values, but in models with less well-identified parameters (or more process or instrument noise), it may not be obvious what to choose. We may want the parameters associated with the maximum model log-likelihood. You may have expected that PyMC would record the model's log-likelihood at each sampling step—I certainly did. But that's not the case, as can be seen from the <code>InferenceData</code> that were returned:</p>
<div class="highlight"><pre><span></span><code>>>> trace
Inference data with groups:
> posterior
> sample_stats
</code></pre></div>
<p><a href="https://discourse.pymc.io/t/log-likelihood-not-found-in-inferencedata/8169/3">The reason for this is complicated and related to the active development of PyMC.</a> For now, you can store the model log-likelihood by explicitly defining it during the model's configuration:</p>
<div class="highlight"><pre><span></span><code>with pm.Model() as my_model:
    ...
    # Store the model log-likelihood as well
    loglik = pm.Deterministic('log_likelihood', my_model.logpt)
</code></pre></div>
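<p>With the log-likelihood stored, the best-fit parameters can be pulled out by finding the draw that maximizes it. A sketch with made-up NumPy arrays standing in for the flattened posterior draws (in a real run, something like <code>trace.posterior['a'].values.ravel()</code>):</p>

```python
import numpy as np

# Hypothetical flattened posterior draws and their stored log-likelihoods
a_draws = np.array([0.4, 0.9, 1.1, 1.0])
b_draws = np.array([0.30, 0.12, 0.09, 0.11])
log_likelihood = np.array([-50.0, -12.0, -9.5, -10.2])

best = np.argmax(log_likelihood)  # Index of the highest log-likelihood draw
print(a_draws[best], b_draws[best])  # 1.1 0.09
```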
<h2 id="refs">References</h2>
<ol>
<li><a href="https://docs.pymc.io/en/v3/pymc-examples/examples/case_studies/blackbox_external_likelihood.html">"Using a 'black box' likelihood function"</a> (Accessed: November 13, 2021.)</li>
<li><a href="https://docs.pymc.io/en/v3/pymc-examples/examples/samplers/SMC-ABC_Lotka-Volterra_example.html">"Sequential Monte Carlo - Approximate Bayesian Computation"</a> (Accessed: November 13, 2021.)</li>
<li>Sisson, S.A. & Y. Fan. 2011. "Likelihood-Free MCMC." Chapter 12, <u>Handbook of Markov Chain Monte Carlo.</u> CRC Press, LLC.</li>
<li>Houska, T., Kraft, P., Chamorro-Chavez, A. & Breuer, L. 2015. <a href="http://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0145180">"SPOTting model parameters using a ready-made Python package."</a> <em>PLoS ONE,</em> <strong>10</strong>(12), e0145180.</li>
<li>Vrugt, J. A. 2016. <a href="https://doi.org/10.1016/j.envsoft.2015.08.013">Markov chain Monte Carlo simulation using the DREAM software package: Theory, concepts, and MATLAB implementation.</a> <em>Environmental Modelling & Software,</em> 75, 273–316.</li>
</ol>
<h1>Fast unpacking of QC bit-flags in Python</h1>
<p>K. Arthur Endsley, karthur.org, 2021-03-13</p>
<p>Satellite remote sensing products often represent quality control (QC) information as <em>bit-flags</em>, a sequence of binary digits that individually (single digit) or collectively (multiple digits) convey additional information about a remote sensing measurement. This information may be related to cloud cover, atmospheric conditions in general, the status of the space-borne sensor, or the overall quality of the retrieval. Bit-flags are convenient because a sequence of varied true/false (binary) indicators can be represented as a single decimal number. Here's an example from the Moderate Resolution Imaging Spectroradiometer (MODIS) MOD15 product User Guide.</p>
<p><img alt="Example from MOD15 of QC bitflags" src="http://karthur.org/images/20210313_MOD15_QC_bitflag_example.png"></p>
<p>Strictly speaking, the term "bit-word" that is sometimes used for these flags is misappropriated; computer scientists have told me that a "word" refers to a sequence of two bytes (16 bits total). It may be more appropriate to refer to them as bit-flags. The example above is the binary equivalent of the decimal number 64. For a given remote sensing image, in addition to the raster array of data values (e.g., remotely sensed leaf area index, as in the MOD15 example), there is a QC band that has the same size and shape but is full of decimal values. Where the QC band has the decimal number 64, based on the above example, we can infer that the corresponding pixel in the data band has "good quality" (bit number 0) and "significant clouds [were] NOT present" (bits numbered 3-4).</p>
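<p>This reading can be reproduced with plain bit arithmetic, shifting and masking from the least-significant bit; a minimal sketch using the decimal value from the example:</p>

```python
qc = 64  # Decimal QC value from the MOD15 example

# Bit number 0 (the least-significant bit): 0 means "good quality"
print((qc >> 0) & 1)      # 0

# Bits 3-4, read together as a two-bit field: 0 here, i.e.,
# "significant clouds NOT present"
print((qc >> 3) & 0b11)   # 0

# The full 8-bit binary string, most-significant bit first
print(format(qc, '08b'))  # 01000000
```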
<h2>Converting from Decimal to Binary Strings</h2>
<p><strong>The problem of trying to determine whether certain conditions apply to a given data pixel, then, is the problem of converting from decimal to binary strings.</strong> While this is relatively trivial for a single decimal number, we often need to do this conversion or "unpacking" of bit-flags for several thousands or millions of pixels at once, in the case of a large or high-resolution raster image. Recently, I developed a pipeline for extracting MODIS MOD16 data in 500-m tiles (on the MODIS Sinusoidal grid), masking out low-quality pixels, resampling to 1000-m, and stitching the results into a global mosaic. Line profiling revealed that, by far, the most time-intensive operation was unpacking the QC bit-flags!</p>
<p>So, what's the fastest way to unpack QC bit-flags in Python? Equivalently, what's the fastest way to convert from decimal to binary strings for an arbitrary array of decimal numbers? There are several options, assuming we have 8-bit binary strings that we need to unpack and parse for bit-flags. For clarity, we're evaluating options that behave something like (to use the previous example yet again):</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="n">my_function</span><span class="p">(</span><span class="mi">64</span><span class="p">)</span>
<span class="s1">'01000000'</span>
</code></pre></div>
<h2>Option 1: NumPy's Binary Representation</h2>
<p>NumPy's <code>binary_repr()</code> function is a straightforward, high-level interface for converting decimal numbers to binary strings. It's not a <code>ufunc</code>, however, so if we want to apply it to an arbitrary array, we need to vectorize it. In addition, we need to use <code>partial()</code> to curry a version of this function that always returns 8-bit binary strings. Vectorizing functions makes them slow, so I don't expect this one to perform best.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">functools</span> <span class="kn">import</span> <span class="n">partial</span>
<span class="n">dec2bin_numpy</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vectorize</span><span class="p">(</span><span class="n">partial</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">binary_repr</span><span class="p">,</span> <span class="n">width</span> <span class="o">=</span> <span class="mi">8</span><span class="p">))</span>
</code></pre></div>
<p>In fact, this is the approach I was originally using, which accounted for the vast majority of the program's run-time.</p>
<h2>Option 2: Python's Built-in Binary Conversion</h2>
<p>Python, of course, has a built-in function <code>bin()</code> to do this conversion. The strings returned by <code>bin()</code> have a special format (e.g., <code>0b1000000</code> for 64) that's not exactly what we want when unpacking QC bit-flags, so we need to use additional string methods to get a compact, 8-bit string representation. This function must also be vectorized, so it may be slow.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="n">dec2bin_py</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vectorize</span><span class="p">(</span><span class="k">lambda</span> <span class="n">v</span><span class="p">:</span> <span class="nb">bin</span><span class="p">(</span><span class="n">v</span><span class="p">)</span><span class="o">.</span><span class="n">lstrip</span><span class="p">(</span><span class="s1">'0b'</span><span class="p">)</span><span class="o">.</span><span class="n">rjust</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="s1">'0'</span><span class="p">))</span>
</code></pre></div>
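<p>As a quick sanity check (my own, not part of the original benchmark), both vectorized functions should agree on the running example:</p>

```python
import numpy as np
from functools import partial

# The two vectorized implementations defined above
dec2bin_numpy = np.vectorize(partial(np.binary_repr, width=8))
dec2bin_py = np.vectorize(lambda v: bin(v).lstrip('0b').rjust(8, '0'))

A = np.array([[0, 64], [128, 254]])
assert (dec2bin_numpy(A) == dec2bin_py(A)).all()
print(dec2bin_numpy(A)[0, 1])  # 01000000
```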
<h2>Option 3: Bitwise Operators</h2>
<p>A common solution for decimal-to-binary conversion in any language is the use of bitwise operators. Below, we use the left shift operator, <code><<</code>, to get an array of decimal numbers that represent every 8-bit string with exactly one <code>1</code>. Then, the bitwise AND operator <code>&</code> compares each of these 8 options with the (implicitly binary) representation of the decimal number we want to convert to a binary string. In effect, this updates the array with powers of 2 that, when compared to 0, are converted to true/false flags that are equivalent to the binary representation of that number! It's a neat trick I found in multiple online forums.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="k">def</span> <span class="nf">dec2bin_bitwise</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="s1">'For a 2D NumPy array as input'</span>
<span class="n">shp</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">shape</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">fliplr</span><span class="p">((</span><span class="n">x</span><span class="o">.</span><span class="n">ravel</span><span class="p">()[:,</span><span class="kc">None</span><span class="p">]</span> <span class="o">&</span> <span class="p">(</span><span class="mi">1</span> <span class="o"><<</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">8</span><span class="p">)))</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span>\
<span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">uint8</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="o">*</span><span class="n">shp</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>
</code></pre></div>
<p>We have to know the shape of the input array because this approach creates a new axis along which the binary digits are enumerated; for example:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="n">dec2bin_bitwise</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">((</span><span class="mi">64</span><span class="p">,)))</span>
<span class="n">array</span><span class="p">([[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">uint8</span><span class="p">)</span>
</code></pre></div>
<p>The approach above requires an <code>np.fliplr()</code> operation because <code>np.arange()</code> produces a numeric sequence in ascending order, but bit strings in almost all computer systems are ordered from the most-significant bit (at the left) to the least-significant bit (at the right). This means that our binary strings are flipped left-to-right. I'm not sure what impact this has on performance compared to simply reversing <code>np.arange()</code>, as below.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="k">def</span> <span class="nf">dec2bin_bitwise2</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="s1">'For a 2D NumPy array as input'</span>
<span class="n">shp</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">shape</span>
<span class="k">return</span> <span class="p">((</span><span class="n">x</span><span class="o">.</span><span class="n">ravel</span><span class="p">()[:,</span><span class="kc">None</span><span class="p">]</span> <span class="o">&</span> <span class="p">(</span><span class="mi">1</span> <span class="o"><<</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">7</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)))</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span>\
<span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">uint8</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="o">*</span><span class="n">shp</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>
</code></pre></div>
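<p>A quick check (mine, for illustration) confirms that reversing <code>np.arange()</code> yields exactly the same result as the flipped version:</p>

```python
import numpy as np

def dec2bin_bitwise(x):
    'For a 2D NumPy array as input'
    shp = x.shape
    return np.fliplr((x.ravel()[:,None] & (1 << np.arange(8))) > 0)\
        .astype(np.uint8).reshape((*shp, 8))

def dec2bin_bitwise2(x):
    'For a 2D NumPy array as input'
    shp = x.shape
    return ((x.ravel()[:,None] & (1 << np.arange(7, -1, -1))) > 0)\
        .astype(np.uint8).reshape((*shp, 8))

A = np.array([[64, 3]])
assert (dec2bin_bitwise(A) == dec2bin_bitwise2(A)).all()
print(dec2bin_bitwise2(A)[0, 0])  # [0 1 0 0 0 0 0 0]
```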
<p><strong>An important thing to note about these two functions is that they require arguments that are <code>numpy.ndarray</code> instances,</strong> so they don't strictly work like the example above, with an atomic decimal number, but they are well-suited to the actual problem of working with decimal arrays.</p>
<h2>Option 4: NumPy's <code>unpackbits()</code></h2>
<p>I wish I could remember where I first saw this function... It may have been from the NumPy documentation itself (in frustration, desperately seeking a faster solution). This approach does have a stronger assumption than the previous implementations; specifically, <code>x</code> must be typed as an unsigned 8-bit integer (<code>np.uint8</code>). Also, as in Option 3, this function needs to create a third axis along which the binary digits are enumerated. The <code>axis</code> argument provides some control over this, but in most cases you'd want the new axis to be last (hence, equal to the current number of dimensions, <code>ndim</code>, given Python's zero-based axis numbering). Finally, like the flipped examples in Option 3, this function returns the binary digits in big-endian order (most-significant bit first), which is exactly the order we want, so no further flipping is needed; the trailing <code>[...,-8:]</code> simply keeps the eight digits along the new axis.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="k">def</span> <span class="nf">dec2bin_unpack</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="kc">None</span><span class="p">):</span>
<span class="s1">'For an arbitrary NumPy array a input'</span>
<span class="n">axis</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">ndim</span> <span class="k">if</span> <span class="n">axis</span> <span class="ow">is</span> <span class="kc">None</span> <span class="k">else</span> <span class="n">axis</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">unpackbits</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="o">...</span><span class="p">,</span><span class="kc">None</span><span class="p">],</span> <span class="n">axis</span> <span class="o">=</span> <span class="n">axis</span><span class="p">)[</span><span class="o">...</span><span class="p">,</span><span class="o">-</span><span class="mi">8</span><span class="p">:]</span>
</code></pre></div>
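<p>For example, with an unsigned 8-bit input (note again that the dtype requirement is strict):</p>

```python
import numpy as np

def dec2bin_unpack(x, axis = None):
    'For an arbitrary NumPy array as input'
    axis = x.ndim if axis is None else axis
    return np.unpackbits(x[...,None], axis = axis)[...,-8:]

A = np.array([[64]], dtype = np.uint8)  # Must be np.uint8
print(dec2bin_unpack(A))  # [[[0 1 0 0 0 0 0 0]]]
```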
<h2>Clocking In</h2>
<p>For simplicity, in timing each approach, I create a 15-by-15 array, <code>A</code>, containing each of the numbers 0 through 224; the largest square matrix I can create using unique decimal numbers smaller than 255. Each function is timed using <code>timeit</code> with 100 loops, the timer repeated 30 times. Furthermore, I execute the <code>timeit</code> statement three times and take the average of the three trials.</p>
<div class="highlight"><pre><span></span><code>$ python -m timeit -n <span class="m">100</span> -r <span class="m">30</span> -s <span class="s2">"import numpy as np;from dec2bin import dec2bin_numpy, A"</span>
<span class="s2">"dec2bin_numpy(A)"</span>
</code></pre></div>
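<p>The <code>dec2bin</code> module imported by that command isn't shown here; it might look something like the following sketch (the construction of <code>A</code> is my reconstruction based on the description above):</p>

```python
# dec2bin.py -- a sketch of the module imported by the timeit command;
# A is the 15-by-15 test array of the unique decimal values 0 through 224
import numpy as np
from functools import partial

A = np.arange(225, dtype = np.uint8).reshape((15, 15))

# e.g., Option 1, as defined earlier in the post
dec2bin_numpy = np.vectorize(partial(np.binary_repr, width = 8))
```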
<table>
<thead>
<tr>
<th>Implementation</th>
<th style="text-align: right;">Time (usec)</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>dec2bin_numpy()</code></td>
<td style="text-align: right;"><code>211.0</code></td>
</tr>
<tr>
<td><code>dec2bin_py()</code></td>
<td style="text-align: right;"><code>106.0</code></td>
</tr>
<tr>
<td><code>dec2bin_bitwise()</code></td>
<td style="text-align: right;"><code>18.8</code></td>
</tr>
<tr>
<td><code>dec2bin_bitwise2()</code></td>
<td style="text-align: right;"><code>13.7</code></td>
</tr>
<tr>
<td><code>dec2bin_unpack()</code></td>
<td style="text-align: right;"><code>5.1</code></td>
</tr>
</tbody>
</table>
<p><strong>We can see that it matters a lot which approach you choose!</strong> My first implementation, <code>dec2bin_numpy()</code>, was over 40 times slower than the best choice I found. I'm not sure what makes <code>numpy.unpackbits()</code> so fast; I need to look under the hood and report back (incidentally, the new NumPy documentation makes it harder to find a function's source code).</p>Day length, sunrise, and sunset calculation for Earth system models2021-02-10T15:45:00+01:002021-02-10T15:45:00+01:00K. Arthur Endsleytag:karthur.org,2021-02-10:/2021/day-length-calculation-earth-system-models.html<!--TODO Consult: https://www.google.com/books/edition/Almanac_for_Computers/YfWgsxEqzZMC?hl=en&gbpv=1&dq=%226.622%22+sunrise+OR+sunset&pg=SL2-PA7&printsec=frontcover Page B-6-->
<p>In Earth system models that run on daily or weekly time steps, there are many quantities that we may wish to calculate only when the sun is in the sky. Vapor pressure deficit (VPD), as one important example, has strong diurnal variation, tending to decrease in the evening as …</p><!--TODO Consult: https://www.google.com/books/edition/Almanac_for_Computers/YfWgsxEqzZMC?hl=en&gbpv=1&dq=%226.622%22+sunrise+OR+sunset&pg=SL2-PA7&printsec=frontcover Page B-6-->
<p>In Earth system models that run on daily or weekly time steps, there are many quantities that we may wish to calculate only when the sun is in the sky. Vapor pressure deficit (VPD), as one important example, has strong diurnal variation, tending to decrease in the evening as the ambient temperature declines. As VPD is a key driver of transpiration and photosynthesis [<a href="#refs">1</a>], if we want to correctly estimate the impact of VPD on these processes, we may need to estimate VPD as it is experienced by plants in the heat of the day. Consequently, we would want to integrate hourly VPD data (such as from a numerical weather model) only for those hours when the sun is up (the <em>photoperiod</em>) and we would therefore need to know the timing of sunrise and sunset.</p>
<p>An alternative approach to such day-length calculations, particularly if we're already integrating data from a weather model, would be to use solar irradiance—an empirical threshold on down-welling short-wave radiation, for instance, would be an acceptable proxy for whether the sun is in the sky. This is the approach used in photosynthesis models like MODIS MOD17 [<a href="#refs">2</a>]. However, this empirical threshold might vary between sites and seasons; it's ultimately arbitrary. But the approach is attractive because the calculation of sunrise and sunset times is fairly elaborate. Further, if our model has a large spatial domain—say, continental to global scales—we inevitably need to calculate sunrise and sunset at different latitudes, across different longitudes, and possibly taking into account very different elevations.</p>
<p><strong>I recently ran into this issue when developing a new photosynthesis model in Python.</strong> The Python developer community ostensibly has at least two choices of open-source libraries for calculating sunrise and sunset times (I did not look at Skyfield or AstroPy). Both libraries, <a href="https://rhodesmill.org/pyephem/"><code>pyephem</code></a> and <a href="https://astral.readthedocs.io/en/latest/"><code>astral</code></a>, are targeted at more general problems than calculating solar transits on Earth, however. Their generality leads to complexity and this translates to longer execution times when you have a global spatial domain and millions of data cells (pixels). Both libraries do offer greater precision and, in the case of the <code>pyephem</code> library, greater accuracy than the solution I've settled upon, which I describe below. <code>pyephem</code> also corrects for atmospheric refraction and, optionally, elevation effects. However, for determining sunrise and sunset times to the nearest hour, we can get by with less sophistication.</p>
<!--TODO Show wall times for various approaches-->
<p><strong>Descriptions of sunrise and sunset calculation on the internet are pretty lacking.</strong> NOAA's Earth System Research Laboratories (ESRL) <a href="https://www.esrl.noaa.gov/gmd/grad/solcalc/calcdetails.html">offers a spreadsheet calculator</a> but no details on how it works. This has led some developers to implement a version in code that calculates the value in each column, <a href="https://www.mathworks.com/matlabcentral/fileexchange/62180-sunriseset-lat-lng-utcoff-date-plot">as in this Matlab example,</a> as if the ESRL calculator was some sacred but indecipherable tome. I wanted to do better, at least in the documentation, verification, and wall time of my own sunrise and sunset calculation. Jean Meeus' "Astronomical Algorithms" [<a href="#refs">3</a>] provides a calculation, but it relies on ephemeris tables to look up declination and right ascension. Here, I mostly follow the algorithm described in the U.S. Naval Observatory's <em>Almanac for Computers</em> [<a href="#refs">4</a>], which <a href="https://www.edwilliams.org/sunrise_sunset_algorithm.htm">a hobbyist named Ed Williams put on their website.</a> I can't find a copy anywhere else, so I'm indebted to them. Both Meeus and the U.S. Naval Observatory describe the calculation in the <em>equatorial coordinate system,</em> in terms of hour angles and the declination of the sun. The approach in <em>Almanac for Computers</em> may be preferable from a standpoint of numeric stability, as Meeus' equations require a high number of significant digits in order to be accurate.</p>
<!--TODO Discussion of how pyephem does this--Jean Meeus uses ephemeris tables, after all; perhaps table look-ups explain why pyephem is so slow?-->
<h2>Day Length Algorithm Description</h2>
<!--NOTE this is Equation 14.1 in Meeus' "Astronomical Algorithms"-->
<p>Jean Meeus [<a href="#refs">3</a>] gives the equation for the approximate hour angle of sunrise and sunset:
</p>
<div class="math">$$
\mathrm{cos}(H_0) = \frac{\mathrm{sin}(h_0) - \mathrm{sin}(\phi)\mathrm{sin}(\delta)}{\mathrm{cos}(\phi)\mathrm{cos}(\delta)}
$$</div>
<p>Where <span class="math">\(H_0\)</span> is the Greenwich hour angle of sunrise (or sunset), i.e., the local hour angle at the prime meridian; <span class="math">\(h_0\)</span> is the zenith angle of the sun; <span class="math">\(\phi\)</span> is the observer's latitude; and <span class="math">\(\delta\)</span> is the declination of the sun.</p>
<p><strong>The general approach to calculating sunrise and sunset times treats the heavens as a clock.</strong> Celestial phenomena (including sunrise and sunsets) happen on a schedule, so the positions of celestial bodies, including the sun, can be described by periodic equations that take time as an argument. <strong>There are four basic steps to calculating sunrise and sunset times:</strong></p>
<ol>
<li>Finding the <em>mean solar time</em> that describes where (or "when") the current moment is in the celestial cycle;</li>
<li>Calculating the sun's position in the <em>ecliptic coordinate system;</em></li>
<li>Converting the sun's ecliptic coordinates to coordinates in the <em>equatorial coordinate system;</em></li>
<li>Obtaining the sunrise and sunset times at the Greenwich meridian, then adding or subtracting the offset for our longitude (or "time zone").</li>
</ol>
<p>The <em>Almanac for Computers</em> algorithm, which I modify, is based on Meeus' equation above. The algorithm walks us through how to calculate <span class="math">\(\delta\)</span> and to convert between hour angles and clock time so as to obtain sunrise and sunset times. The calculations are first performed in the ecliptic coordinate system, where we pretend that the sun revolves around the Earth, and then we perform a coordinate transformation to the equatorial coordinate system. According to the U.S. Naval Observatory, the algorithm obtains an accuracy of <span class="math">\(\pm 2\)</span> minutes for any location between <span class="math">\(\pm 65\)</span> degrees latitude. The central weakness of this algorithm is that it uses constants that are estimated for the "latter half of the twentieth century" for the orbital elements (eccentricity, argument of the perihelion) and for converting between sidereal time and Universal Time. Some of these constants we can replace with mathematical models (e.g., VSOP87) like those described by Jean Meeus, but that's totally unnecessary if we simply wish to obtain the nearest hour of sunrise and sunset.</p>
<p><strong>The procedure described below assumes that all angles are in degrees, not radians.</strong> To convert from radians to degrees, multiply by <span class="math">\(180/\pi\)</span>; to convert from degrees to radians, multiply by <span class="math">\(\pi/180\)</span>. Latitude and longitude should be in decimal degrees; latitude values south of the equator are negative and longitude values west of the prime meridian are negative. I refer to Greenwich mean time and Coordinated Universal Time (UTC) interchangeably.</p>
<h3>Calculating the Mean Solar Time</h3>
<p>Our approach to calculating sunrise and sunset times uses the mean solar clock—the average motion of the sun in the sky as seen from Earth. Naturally, we begin by figuring out where we are in this solar cycle based on the current date. We often work in <em>hour angles,</em> an angular distance expressed as the number of hours the Earth has rotated since (or must rotate until) the meridian plane (plane containing Earth's axis and the zenith) intersects the point of the body of interest (here, the sun) projected onto the celestial sphere. Essentially, the hour angle quantifies how many hours the Earth must rotate until (or has rotated since) the object is in the same place in the sky again.</p>
<p>Here, the solar hour angle (SHA) quantifies the angle between the sun and the observer expressed as the number of hours difference between solar noon at the observer's location and that of the Greenwich meridian. As the Earth spins on its axis at a rate of about 15 degrees longitude per hour, this calculation is simply the longitude, <span class="math">\(\lambda\)</span>, divided by 15:
</p>
<div class="math">$$
\mathrm{SHA} = \frac{\lambda}{15}
$$</div>
<p>This is basically the offset from Greenwich mean time: the number of hours east or west of the prime meridian (negative values west). We can use the SHA to obtain an <em>approximate</em> time of sunrise and sunset, <strong>the mean solar time,</strong> expressed as a fraction of a day (e.g., 30.5 is noon on DOY 30). This approximate time is in Greenwich mean time, and so we subtract the SHA from 12h00 as this is the middle of the solar day, which starts at midnight. <strong>The <em>Almanac</em> calculates this separately for sunrise and sunset, but it's easier to calculate the approximate time of <em>transit</em> (when the sun is highest in the sky) rather than do twice the work.</strong>
</p>
<div class="math">$$
T = [\mathrm{DOY}] + \frac{12 - [\mathrm{SHA}]}{24}
$$</div>
<p><span class="math">\(DOY\)</span> is the number of days since January 1 of that year on the closed interval <span class="math">\([1, 366]\)</span> (on December 31 in leap years, DOY<span class="math">\(=366\)</span>). This can be obtained easily in Python and other programming languages <a href="https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior">with date string formatting.</a></p>
<!-- One approach is to find the **Julian ephemeris day,** $T$, which is the number of days before or after the epoch January 1, 2000 at noon, measured in Julian centuries (i.e., divided by 36,525 days, or the number of days in a century). It is the difference between the Julian day, $J$, and the epoch Julian day, then divided by 36,525. The epoch's Julian ephemeris day is 2,451,545 days, so $J$ can be found by counting the number of days before or after this date (January 1, 2000). Adding 0.5 will obtain solar noon on that day, which gives the exact same $J$ value of Meeus' approach.
$$
T = \frac{J - 2\,451\,545}{36\,525}
$$
As Meeus [<a href="#refs">3</a>] discusses, in order to be accurate this quantity should be calculated with high precision, as it is units of Julian centuries.
The approach below obtains sunrise and sunset times as *hour angles,* an angular distance expressed as the number of hours the Earth has rotated since (or must rotate until) the meridian plane (plane containing Earth's axis and the zenith) intersects the point of the body of interest (here, the sun) projected onto the celestial sphere. Essentially, the hour angle quantifies how many hours the Earth must rotate until (or has rotated since) the object is in the same place in the sky again. So, to convert from hour angles to clock time, we also need to calculate the **mean sidereal time** at Greenwich; this formula provides the answer as the hour angle of the mean vernal point:
$$
\theta_0 = 280.460\,618 + 360.985\,647\,366(J - 2451\,545) +
0.000\,387\,933\,T^2 - \frac{T^3}{38\,710\,000}
$$ -->
<h3>Position of the Sun in Ecliptic Coordinates</h3>
<p><strong>In order to calculate the sun's declination, we must first calculate the solar mean anomaly and the equation of the center.</strong> The solar mean anomaly, <span class="math">\(M\)</span>, is best described by first considering the <em>true anomaly;</em> the true anomaly, <span class="math">\(v\)</span>, quantifies the position of the Earth in its elliptical orbit around the sun. If we imagine a line drawn through the points of aphelion and perihelion (farthest and closest points on our orbit), this is the angle between that line and the line connecting us to the sun (the radius vector), measured from perihelion. "It is the angle over which the object moved, as seen from the Sun, since the previous passage through the perihelion" [<a href="#refs">3</a>]. The <em>solar mean anomaly,</em> then, is the equivalent angle for a circular orbit with the same period as the true object on its elliptical orbit. "The mean anomaly is the angular distance from perihelion which the planet would have moved if it moved around the Sun with a constant angular velocity" [<a href="#refs">3</a>]. <strong>If <span class="math">\(T\)</span> is measured in Julian centuries, instead:</strong>
</p>
<div class="math">$$
M = [357.5291 + 35\,999.0503\,T - 0.000\,155\,9\,T^2 - 0.000\,000\,48\,T^3]\,\mathrm{mod}\, 360^{\circ}
$$</div>
<!--Meeus (1991), page 151, Equation 24.3 -->
<p>Where mod refers to the modulus function, i.e., the above quantity should be on the interval <span class="math">\([0, 360)\)</span>.</p>
<p>The <em>Almanac for Computers</em> provides an approximation that is simpler and no less accurate for transit estimates to the nearest hour, <strong>where <span class="math">\(T\)</span> is measured in Julian days:</strong>
</p>
<div class="math">$$
M = 0.9856\,T - 3.289
$$</div>
<p>Note that <span class="math">\(35\,999.0503\)</span> divided by <span class="math">\(36\,525\)</span> (number of days in a Julian century) equals <span class="math">\(0.9856\)</span>. <strong>In both cases, note that <span class="math">\(M\)</span> is expressed in degrees, not radians.</strong></p>
<p>Using a circular orbit is a simplification that is corrected in the next step, with <em>the equation of the center,</em> which describes the angular difference (<span class="math">\(v - M\)</span>) between the position of the true body (in its elliptical orbit) and the hypothetical body with constant angular velocity (in its circular orbit). This equation can be used in place of Kepler's equation when the eccentricity, <span class="math">\(e\)</span> of the orbit is small, which holds in the case of Earth's orbit [<a href="#refs">5</a>], <span class="math">\(e = 0.016711\)</span>. The equation of the center is often approximated by a Taylor series expansion, with the accuracy of the estimate improving with the number of powers of <span class="math">\(e\)</span> employed. Meeus [<a href="#refs">3</a>] provides the expansion for the first four terms, which is sufficient for our aim of day-length calculations on Earth with hourly precision:
</p>
<div class="math">$$
v - M = \left(
2e - \frac{e^3}{4} + \frac{5e^5}{96}
\right)\mathrm{sin}(M) + \left(
\frac{5}{4}e^2 - \frac{11}{24}e^4
\right)\mathrm{sin}(2M)
$$</div>
<!--See also Meeus (1991) page 152 for another form of the equation of center-->
<!--TODO Investigate the sensitivity of the equation below to the precision of the coefficients; to number of terms.-->
<p><strong>The above equation gives the angular difference in radians.</strong> If we plug in Earth's <span class="math">\(e\)</span>, multiply the coefficients by <span class="math">\(180/\pi\)</span>, and move <span class="math">\(M\)</span> to the right-hand side, we obtain the approximation for the true anomaly, <span class="math">\(v\)</span>, used in the <em>Almanac for Computers</em> [<a href="#refs">4</a>]:
</p>
<div class="math">$$
v = M + 1.915\,\mathrm{sin}(M) + 0.020\,\mathrm{sin}(2M)
$$</div>
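<p>In code, this might look like the following sketch (the function name is mine); note that Python's <code>math</code> functions expect radians, hence the conversions:</p>

```python
import math

def true_anomaly(t):
    'Approximate solar true anomaly (degrees), for t in (Julian) days'
    m = 0.9856 * t - 3.289  # Solar mean anomaly, in degrees
    return m + 1.915 * math.sin(math.radians(m)) \
        + 0.020 * math.sin(math.radians(2 * m))
```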
<p><strong>We then calculate the ecliptic longitude,</strong> <span class="math">\(L\)</span>, the angular distance of the sun in the plane of the ecliptic, measured from the primary direction (a line pointing from Earth to the sun on the date of the vernal equinox), i.e., in the geocentric coordinate system. This is the first step of calculating the sun's position in the sky.
</p>
<div class="math">$$
L = (v + 180^{\circ} + \omega)\,\mathrm{mod}\,360^{\circ}
$$</div>
<p>Where <span class="math">\(\omega \approx 102.937\,348\)</span> is the argument of the perihelion. <span class="math">\(\omega\)</span> can be calculated as: <span class="math">\(\omega = \pi - \Omega\)</span>, where <span class="math">\(\pi\)</span> is the longitude of the perihelion and <span class="math">\(\Omega\)</span> is the longitude of the ascending node [<a href="#refs">6</a>]. I could not find a reference for <span class="math">\(\omega\)</span> in the Earth-sun system so I calculated it by reading <span class="math">\(\pi = 102.937\,348\)</span> from Meeus' (1991) Table 30.A [<a href="#refs">3</a>] and assuming that <span class="math">\(\Omega = 0\)</span>, given that it does not appear in Table 30.A and at the "mean equinox of the date" the Earth is at the ascending node in its orbit around the sun. This may not be the correct interpretation of this calculation (I am not an astronomer), but it does produce the correct value of <span class="math">\(L\)</span>, which can also be verified by comparing the value <span class="math">\(180 + 102.937\,348\)</span> to the value used in <em>Almanac for Computers</em>, <span class="math">\(282.634\)</span>. The <em>Almanac</em> does provide a time-varying formula of the longitude of perihelion, <span class="math">\(\ell\)</span>, that could be used in place of <span class="math">\(180 + \omega\)</span>:
</p>
<div class="math">$$
\ell = 180^{\circ} + 100.460 + 36000.772\,T
$$</div>
<p>Where <span class="math">\(T\)</span> is measured in Julian centuries and <span class="math">\(\ell\)</span> is obtained in degrees.</p>
<h3>Position of the Sun in Equatorial Coordinates</h3>
<!--Surprisingly accurate: https://en.wikipedia.org/wiki/Position_of_the_Sun -->
<!--Note that Chapter 24 in Jean Meeus (1991) describes the position of the sun in the ecliptic coordinate system.-->
<p>So far, we've been working in the ecliptic coordinate system. Now, we want to convert these coordinates (ecliptic longitude and latitude of the sun, the latter assumed to be zero as it is always very small) to an equatorial coordinate system, expressed as right ascension and declination. <a href="https://skyandtelescope.org/astronomy-resources/right-ascension-declination-celestial-coordinates/">This is a great introduction to the equatorial (or "celestial") coordinate system,</a> using Earth's geographic coordinate system as an analog.</p>
<p><strong>The declination of the sun, <span class="math">\(\delta\)</span>, is obtained:</strong>
</p>
<div class="math">$$
\mathrm{sin}(\delta) = \mathrm{sin}(L)\times \mathrm{sin}(23.44^{\circ})
$$</div>
<p>Where 23.44 degrees is the maximum tilt of Earth's axis [<a href="#refs">5</a>]. Note that all of these sines should accept arguments in degrees, not radians.</p>
<!--TODO Let's introduce the time-varying obliquity equation from Meeus and examine its impact on sunrise, sunset times at different latitudes -->
<p><strong>The right ascension of the sun, <span class="math">\(\alpha\)</span>, is obtained:</strong>
</p>
<div class="math">$$
\mathrm{tan}(\alpha) = \mathrm{cos}(23.44^{\circ})\times \mathrm{tan}(L)
$$</div>
<p>Where, again, we use the obliquity of the Earth (23.44 degrees). Pre-computing this cosine (and the sine of the obliquity, above) can speed things up.</p>
<p>We also need to check to make sure that the right ascension, <span class="math">\(\alpha\)</span>, is in the same quadrant as the sun's longitude, <span class="math">\(L\)</span>:
</p>
<div class="math">$$
\alpha^* = \alpha + 90\times \left[
f\left(\frac{L}{90}\right) - f\left(\frac{\alpha}{90}\right)
\right]
$$</div>
<p>Where <span class="math">\(f()\)</span> is the floor function (i.e., round down to the nearest integer).</p>
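<p>A minimal sketch of these steps (declination, right ascension, and the quadrant correction), assuming the sun's true longitude is already known in degrees; the function names here are my own:</p>

```python
import numpy as np

# 23.44 degrees is the obliquity of the Earth, per the text
OBLIQUITY = 23.44

def sun_declination(lng_sun):
    '''The sun's declination (degrees) from its true longitude (degrees)'''
    return np.rad2deg(np.arcsin(
        np.sin(np.deg2rad(lng_sun)) * np.sin(np.deg2rad(OBLIQUITY))))

def sun_right_ascension(lng_sun):
    '''Quadrant-corrected right ascension (degrees) from true longitude'''
    alpha = np.rad2deg(np.arctan(
        np.cos(np.deg2rad(OBLIQUITY)) * np.tan(np.deg2rad(lng_sun)))) % 360
    # Shift alpha into the same quadrant as the sun's true longitude
    return alpha + 90 * (np.floor(lng_sun / 90) - np.floor(alpha / 90))
```

<p>For example, at a true longitude of 135 degrees, the raw arctangent falls in the wrong quadrant and the correction shifts it back into the 90&ndash;180 degree range.</p>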
<p><strong>The sun's local hour angle can now be obtained using Meeus' equation from before.</strong> Here, <span class="math">\(h_0\)</span> is the altitude of the sun above the horizon, so that <span class="math">\(h_0 = 90^{\circ}\)</span> at solar noon, "when the sun is at its zenith." Note that <em>Almanac for Computers</em> instead works with the solar zenith angle, which equals zero at solar noon; consequently, they use <span class="math">\(\mathrm{cos}(h_0)\)</span> in place of <span class="math">\(\mathrm{sin}(h_0)\)</span>, below. We might not set <span class="math">\(h_0\)</span> exactly equal to zero (or exactly equal to 90 if following <em>Almanac for Computers</em>) for sunrise or sunset because of the finite width of the sun's disc and the effect of atmospheric refraction. The elevation of the observer might also be considered. Thus, the definition of <span class="math">\(h_0\)</span> will depend on your application. Meeus writes that 34 arc-minutes (<span class="math">\(0.5\bar{6}\)</span> degrees) "is generally adopted for the effect of refraction at the horizon" and that 16 arc-minutes are added as an approximation of the apparent radius of the sun (as seen from Earth). Hence, Meeus recommends <span class="math">\(h_0 = -0.8\bar{3}\)</span> degrees (the negative sign indicates the center of the body is below the horizon).
</p>
<div class="math">$$
\mathrm{cos}(H_0) = \frac{\mathrm{sin}(h_0) -
\mathrm{sin}(\phi)\mathrm{sin}(\delta)}{\mathrm{cos}(\phi)\mathrm{cos}(\delta)}
$$</div>
<p>Recall that <span class="math">\(\phi\)</span> is the latitude of the observer.</p>
<h3>Edge Cases Near the Poles</h3>
<p>If we're above the Arctic Circle or below the Antarctic Circle, it's possible that the sun is <em>always up</em> (never sets) or <em>never up</em> (never rises). We can detect these conditions as follows. If <span class="math">\(\mathrm{cos}(H_0) > 1\)</span>, the sun never rises at this location on this date. If <span class="math">\(\mathrm{cos}(H_0) < -1\)</span>, the sun never sets at this location on this date.</p>
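<p>A sketch of this check (the function name and the string flags are my own inventions):</p>

```python
import numpy as np

def local_hour_angle(h0, lat, dec):
    '''
    The sun's local hour angle H0 (degrees) given the solar altitude h0,
    observer latitude, and solar declination (all in degrees); returns a
    string flag instead when the sun never rises or never sets
    '''
    rad = np.deg2rad
    cos_H0 = (np.sin(rad(h0)) - np.sin(rad(lat)) * np.sin(rad(dec))) / (
        np.cos(rad(lat)) * np.cos(rad(dec)))
    if cos_H0 > 1:
        return 'never rises'  # Polar night
    if cos_H0 < -1:
        return 'never sets'  # Polar day (midnight sun)
    return np.rad2deg(np.arccos(cos_H0))
```

<p>At 80 degrees North with a declination of &plusmn;23 degrees, this correctly reports polar day or polar night.</p>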
<h3>Putting it All Together</h3>
<p><strong>We can now calculate the local rise time or local setting time.</strong> We do this differently for sunrise (<span class="math">\(m_{\uparrow}\)</span>) and sunset (<span class="math">\(m_{\downarrow}\)</span>): to make sure that <span class="math">\(H_0\)</span> is in the right quadrant, we use <span class="math">\(360 - H_0\)</span> in place of <span class="math">\(H_0\)</span> for the rising time. This allows us to obtain the <em>Greenwich apparent sidereal time</em> (GAST) of sunrise or sunset. Note that, below, we are converting both the local hour angle, <span class="math">\(H_0\)</span>, and the right ascension, <span class="math">\(\alpha\)</span>, to hours by dividing by 15 degrees. The sum of <span class="math">\(H_0\)</span> and <span class="math">\(\alpha^*\)</span>, both in hours, gives the GAST. We convert from GAST to the <em>local apparent sidereal time</em> by subtracting the solar hour angle (SHA), i.e., the observer's longitude divided by 15 degrees. Then, we convert from sidereal time to Universal Time (for our purposes, equivalent to UTC) for a sunrise (<span class="math">\(1/4\)</span> of a day earlier than the transit) or sunset (<span class="math">\(1/4\)</span> of a day later).
</p>
<div class="math">$$
m_{\uparrow} = \frac{360 - H_0 + \alpha^*}{15} - [\mathrm{SHA}] -
\left(0\rlap{.}^{\mathrm{h}} 06571\,(T - 0.25)\right) - 6\rlap{.}^{\mathrm{h}} 622
$$</div>
<div class="math">$$
m_{\downarrow} = \frac{H_0 + \alpha^*}{15} - [\mathrm{SHA}] -
\left(0\rlap{.}^{\mathrm{h}} 06571\,(T + 0.25)\right) - 6\rlap{.}^{\mathrm{h}}622
$$</div>
<p>Because <span class="math">\(T\)</span> is the approximate transit time at the Greenwich meridian, expressed as a fractional day of the year, we subtract <span class="math">\(1/4\)</span> of a day when calculating sunrise and add <span class="math">\(1/4\)</span> of a day when calculating sunset. The scale factor <span class="math">\(0\rlap{.}^{\mathrm{h}}06571\)</span> and offset <span class="math">\(6\rlap{.}^{\mathrm{h}}622\)</span> are in units of hours and are essentially magic numbers.</p>
<!--TODO I think m_0 should equal exactly 0.5 on the equinoxes for...the prime meridian?-->
<p>Note that the sunrise time, <span class="math">\(m_{\uparrow}\)</span>, or sunset time <span class="math">\(m_{\downarrow}\)</span> calculated above may not lie on the interval <span class="math">\([0,24)\)</span> and in that case you must add or subtract 24.</p>
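<p>As a sketch (the function and variable names are mine), the two formulas and the wrap onto the interval [0, 24) might look like the following, where <code>sha</code> is the observer's longitude divided by 15 degrees:</p>

```python
def rise_and_set(H0, alpha_star, T, sha):
    '''
    Rising and setting times (UTC hours) from the local hour angle H0 and
    quadrant-corrected right ascension alpha_star (both in degrees), the
    approximate transit time T (fractional day of the year), and the solar
    hour angle sha (longitude / 15, in hours); results are wrapped onto
    the interval [0, 24)
    '''
    rise = (360 - H0 + alpha_star) / 15 - sha - (0.06571 * (T - 0.25)) - 6.622
    sets = (H0 + alpha_star) / 15 - sha - (0.06571 * (T + 0.25)) - 6.622
    # The raw times may fall outside [0, 24); the modulo handles both
    # adding and subtracting 24
    return (rise % 24, sets % 24)
```
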
<h3>Calculating Photoperiod</h3>
<p>Although the <code>pyephem</code> implementation (see later in the article) is slow, I take it to be the most accurate source for sunrise and sunset calculation and, hence, photoperiod calculation. I therefore wanted to compare my photoperiod calculation to what <code>pyephem</code> comes up with. I calculated sunrise and sunset hours with both approaches for a latitude-longitude grid, with steps of 5/8° longitude and 1/2° latitude. This is the grid used by the NASA GMAO Modern-Era Retrospective analysis for Research and Applications, version 2 (MERRA-2). You can make a pretty picture of photoperiod for the entire globe! For example, on June 25, 2012, global photoperiod kind of looks like this, from short days (darker colors) to longer days (lighter colors):</p>
<p><a href="/images/20210217_daylight_hrs.png"><img style="float:left;" src="/images/thumbs/20210217_daylight_hrs_thumbnail_square.png" /></a></p>
<!--Meeus (1991) page 84, Equation 11.4--><div style="clear:both;"></div>
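<p>For reference, such a grid might be built as follows; the exact registration (grid-cell centers starting at -90° and -180°) is my assumption, not taken from the MERRA-2 specification:</p>

```python
import numpy as np

# 1/2-degree latitude steps and 5/8-degree longitude steps, as on the
# MERRA-2 grid; the cell registration is assumed
lats = np.arange(-90, 90.01, 0.5)    # 361 latitudes, -90 to 90
lngs = np.arange(-180, 180, 0.625)   # 576 longitudes, -180 to 179.375
coords = [(lat, lng) for lat in lats for lng in lngs]
```
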
<p>Below, I visualize the difference between my approach and <code>pyephem</code>; differences are within 2 hours, with <code>pyephem</code> describing slightly shorter days (by 1 hour, shown in blue, or by 2 hours, shown in red) relative to my implementation. Where day length appears to differ by 2 hours, this is only due to a 1-hour rounding difference in both the sunrise and sunset hours. The bias toward shorter days is simply an artifact of the offset between the two photoperiod calculations and of my reduced precision from rounding both down to the nearest hour. Overall, they agree pretty well.</p>
<p><a href="/images/20210217_diff_daylight_hrs_pyephem_minus_mine.png"><img style="float:left;" src="/images/thumbs/20210217_diff_daylight_hrs_pyephem_minus_mine_thumbnail_square.png" /></a></p>
<div style="clear:both;"></div>
<h2>Python Implementation</h2>
<p>I implemented the day-length algorithm described above in Cython—partly because I wanted the practice writing Cython extension modules for Python projects, but also because my experience with <code>pyephem</code> pointed to the need for better performance. A pure-Python implementation can be obtained simply by removing the lines that begin with <code>cdef</code>. Also, I'm rounding down to the nearest hour because I merely need to determine which hours to select from a re-analysis dataset.</p>
<p><strong>I also had to make a decision about how to calculate photoperiod near the poles when the sun is always above or below the horizon.</strong> For the austral and boreal summers (sun is always up), I simply decided that the photoperiod must be 24 hours long. Therefore, in the implementation below, I set the sunrise hour to zero and the sunset hour to 23 (Python starts counting at zero). For the austral and boreal winters (sun is always down), however, the solution is not obvious.</p>
<p>For a model that needs to make daily calculations everywhere on the Earth, it seems that even if the sun never rises we need to calculate a daily temperature, daily VPD, <em>et cetera.</em> These are <em>mean</em> quantities over any number of hours, so it doesn't matter how long the day is as long as we have at least one data point to average. Therefore, even when the sun is always below the horizon, I decided that "photoperiod" should also be 24 hours long (same as when the sun is always up). For a photosynthesis model, gross primary productivity (GPP) is likely very close to zero under these conditions, so it doesn't matter much. For other applications, you may want to determine photoperiod during the winter differently. In my implementation below, I emit <code>-1</code> for the sunrise and sunset hours if the sun is always below the horizon so that we can decide what to do later.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">datetime</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="n">cdef</span> <span class="nb">list</span> <span class="n">coords</span>
<span class="n">cdef</span> <span class="nb">int</span> <span class="n">doy</span>
<span class="n">cdef</span> <span class="nb">float</span> <span class="n">x</span><span class="p">,</span> <span class="n">zenith</span><span class="p">,</span> <span class="n">lat</span><span class="p">,</span> <span class="n">lng</span><span class="p">,</span> <span class="n">lng_hour</span><span class="p">,</span> <span class="n">tmean</span><span class="p">,</span> <span class="n">anomaly</span>
<span class="n">cdef</span> <span class="nb">float</span> <span class="n">lng_sun</span><span class="p">,</span> <span class="n">ra</span><span class="p">,</span> <span class="n">ra_hours</span><span class="p">,</span> <span class="n">dec_sin</span><span class="p">,</span> <span class="n">dec_cos</span>
<span class="n">cdef</span> <span class="nb">float</span> <span class="n">hour_angle_cos</span><span class="p">,</span> <span class="n">hour_angle</span><span class="p">,</span> <span class="n">hour_rise</span><span class="p">,</span> <span class="n">hour_sets</span>
<span class="c1"># The sunrise_sunset() algorithm was written for degrees, not radians</span>
<span class="n">sine</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">deg2rad</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">cosine</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">deg2rad</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">tan</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">tan</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">deg2rad</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">arcsin</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">rad2deg</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arcsin</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">arccos</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">rad2deg</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arccos</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">arctan</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">rad2deg</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arctan</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">sunrise_sunset</span><span class="p">(</span><span class="n">coords</span><span class="p">,</span> <span class="n">dt</span><span class="p">,</span> <span class="n">zenith</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.83</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Returns the hour of sunrise and sunset for a given date. Hours are on the</span>
<span class="sd"> closed interval [0, 23] because Python starts counting at zero; i.e., if</span>
<span class="sd"> we want to index an array of hourly data, 23 is the last hour of the day.</span>
<span class="sd"> Recommended solar zenith angles for sunrise and sunset are -6 degrees for</span>
<span class="sd"> civil sunrise/sunset; -0.5 degrees for "official" sunrise/sunset; and</span>
<span class="sd"> -0.83 degrees to account for the effects of refraction. A zenith angle of</span>
<span class="sd"> -0.5 degrees produces results closest to those of pyephem's</span>
<span class="sd"> Observer.next_rising() and Observer.next_setting(). This calculation does</span>
<span class="sd"> not include corrections for elevation or nutation nor does it explicitly</span>
<span class="sd"> correct for atmospheric refraction. Source:</span>
<span class="sd"> U.S. Naval Observatory. "Almanac for Computers." 1990.</span>
<span class="sd"> Parameters</span>
<span class="sd"> ----------</span>
<span class="sd"> coords : list or tuple</span>
<span class="sd"> The (latitude, longitude) coordinates of interest; coordinates can</span>
<span class="sd"> be scalars or arrays (for times at multiple locations on same date)</span>
<span class="sd"> dt : datetime.date</span>
<span class="sd"> The date on which sunrise and sunset times are desired</span>
<span class="sd"> zenith : float</span>
<span class="sd"> The sun zenith angle to use in calculation, i.e., the angle of the</span>
<span class="sd"> sun with respect to its highest point in the sky (90 is solar noon)</span>
<span class="sd"> (Default: -0.83)</span>
<span class="sd"> Returns</span>
<span class="sd"> -------</span>
<span class="sd"> tuple</span>
<span class="sd"> 2-element tuple of (sunrise hour, sunset hour)</span>
<span class="sd"> '''</span>
<span class="n">lat</span><span class="p">,</span> <span class="n">lng</span> <span class="o">=</span> <span class="n">coords</span>
<span class="k">assert</span> <span class="o">-</span><span class="mi">90</span> <span class="o"><=</span> <span class="n">lat</span> <span class="o"><=</span> <span class="mi">90</span><span class="p">,</span> <span class="s1">'Latitude error'</span>
<span class="k">assert</span> <span class="o">-</span><span class="mi">180</span> <span class="o"><=</span> <span class="n">lng</span> <span class="o"><=</span> <span class="mi">180</span><span class="p">,</span> <span class="s1">'Longitude error'</span>
<span class="n">doy</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">dt</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">'%j'</span><span class="p">))</span>
<span class="c1"># Calculate longitude hour (Earth turns 15 degrees longitude per hour)</span>
<span class="n">lng_hour</span> <span class="o">=</span> <span class="n">lng</span> <span class="o">/</span> <span class="mf">15.0</span>
<span class="c1"># Approximate transit time (longitudinal average)</span>
<span class="n">tmean</span> <span class="o">=</span> <span class="n">doy</span> <span class="o">+</span> <span class="p">((</span><span class="mi">12</span> <span class="o">-</span> <span class="n">lng_hour</span><span class="p">)</span> <span class="o">/</span> <span class="mi">24</span><span class="p">)</span>
<span class="c1"># Solar mean anomaly at rising, setting time</span>
<span class="n">anomaly</span> <span class="o">=</span> <span class="p">(</span><span class="mf">0.98560028</span> <span class="o">*</span> <span class="n">tmean</span><span class="p">)</span> <span class="o">-</span> <span class="mf">3.289</span>
<span class="c1"># Calculate sun's true longitude by calculating the true anomaly</span>
<span class="c1"># (anomaly + equation of the center), then add (180 + omega)</span>
<span class="c1"># where omega = 102.634 is the argument of the perihelion</span>
<span class="n">lng_sun</span> <span class="o">=</span> <span class="p">(</span><span class="n">anomaly</span> <span class="o">+</span> <span class="p">(</span><span class="mf">1.916</span> <span class="o">*</span> <span class="n">sine</span><span class="p">(</span><span class="n">anomaly</span><span class="p">))</span> <span class="o">+</span>\
<span class="p">(</span><span class="mf">0.02</span> <span class="o">*</span> <span class="n">sine</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">anomaly</span><span class="p">))</span> <span class="o">+</span> <span class="mf">282.634</span><span class="p">)</span> <span class="o">%</span> <span class="mi">360</span>
<span class="c1"># Sun's right ascension (by 0.91747 = cosine of Earth's obliquity)</span>
<span class="n">ra</span> <span class="o">=</span> <span class="n">arctan</span><span class="p">(</span><span class="mf">0.91747</span> <span class="o">*</span> <span class="n">tan</span><span class="p">(</span><span class="n">lng_sun</span><span class="p">))</span> <span class="o">%</span> <span class="mi">360</span>
<span class="c1"># Adjust RA to be in the same quadrant as the sun's true longitude, then</span>
<span class="c1"># convert to hours by dividing by 15 degrees</span>
<span class="n">ra</span> <span class="o">+=</span> <span class="n">np</span><span class="o">.</span><span class="n">subtract</span><span class="p">(</span>
<span class="n">np</span><span class="o">.</span><span class="n">floor</span><span class="p">(</span><span class="n">lng_sun</span> <span class="o">/</span> <span class="mi">90</span><span class="p">)</span> <span class="o">*</span> <span class="mi">90</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">floor</span><span class="p">(</span><span class="n">ra</span> <span class="o">/</span> <span class="mi">90</span><span class="p">)</span> <span class="o">*</span> <span class="mi">90</span><span class="p">)</span>
<span class="n">ra_hours</span> <span class="o">=</span> <span class="n">ra</span> <span class="o">/</span> <span class="mi">15</span>
<span class="c1"># Sun's declination (using 0.39782 = sine of Earth's obliquity)</span>
<span class="c1"># retained as sine and cosine</span>
<span class="n">dec_sin</span> <span class="o">=</span> <span class="mf">0.39782</span> <span class="o">*</span> <span class="n">sine</span><span class="p">(</span><span class="n">lng_sun</span><span class="p">)</span>
<span class="n">dec_cos</span> <span class="o">=</span> <span class="n">cosine</span><span class="p">(</span><span class="n">arcsin</span><span class="p">(</span><span class="n">dec_sin</span><span class="p">))</span>
<span class="c1"># Cosine of the sun's local hour angle</span>
<span class="n">hour_angle_cos</span> <span class="o">=</span> <span class="p">(</span>
<span class="n">sine</span><span class="p">(</span><span class="n">zenith</span><span class="p">)</span> <span class="o">-</span> <span class="p">(</span><span class="n">dec_sin</span> <span class="o">*</span> <span class="n">sine</span><span class="p">(</span><span class="n">lat</span><span class="p">)))</span> <span class="o">/</span> <span class="p">(</span><span class="n">dec_cos</span> <span class="o">*</span> <span class="n">cosine</span><span class="p">(</span><span class="n">lat</span><span class="p">))</span>
<span class="c1"># Correct for polar summer or winter, i.e., when the sun is always</span>
<span class="c1"># above or below the horizon</span>
<span class="k">if</span> <span class="n">hour_angle_cos</span> <span class="o">></span> <span class="mi">1</span> <span class="ow">or</span> <span class="n">hour_angle_cos</span> <span class="o"><</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
<span class="k">if</span> <span class="n">hour_angle_cos</span> <span class="o">></span> <span class="mi">1</span><span class="p">:</span>
<span class="k">return</span> <span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># Sun is always down</span>
<span class="k">elif</span> <span class="n">hour_angle_cos</span> <span class="o"><</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
<span class="k">return</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">23</span><span class="p">)</span> <span class="c1"># Sun is always up</span>
<span class="n">hour_angle</span> <span class="o">=</span> <span class="n">arccos</span><span class="p">(</span><span class="n">hour_angle_cos</span><span class="p">)</span>
<span class="c1"># Local mean time of rising or setting (converting hour angle to hours)</span>
<span class="n">hour_rise</span> <span class="o">=</span> <span class="p">((</span><span class="mi">360</span> <span class="o">-</span> <span class="n">hour_angle</span><span class="p">)</span> <span class="o">/</span> <span class="mi">15</span><span class="p">)</span> <span class="o">+</span> <span class="n">ra_hours</span> <span class="o">-</span>\
<span class="p">(</span><span class="mf">0.06571</span> <span class="o">*</span> <span class="p">(</span><span class="n">tmean</span> <span class="o">-</span> <span class="mf">0.25</span><span class="p">))</span> <span class="o">-</span> <span class="mf">6.622</span>
<span class="n">hour_sets</span> <span class="o">=</span> <span class="p">(</span><span class="n">hour_angle</span> <span class="o">/</span> <span class="mi">15</span><span class="p">)</span> <span class="o">+</span> <span class="n">ra_hours</span> <span class="o">-</span>\
<span class="p">(</span><span class="mf">0.06571</span> <span class="o">*</span> <span class="p">(</span><span class="n">tmean</span> <span class="o">+</span> <span class="mf">0.25</span><span class="p">))</span> <span class="o">-</span> <span class="mf">6.622</span>
<span class="c1"># Round to nearest hour, convert to UTC</span>
<span class="k">return</span> <span class="p">(</span>
<span class="n">np</span><span class="o">.</span><span class="n">floor</span><span class="p">((</span><span class="n">hour_rise</span> <span class="o">-</span> <span class="n">lng_hour</span><span class="p">)</span> <span class="o">%</span> <span class="mi">24</span><span class="p">),</span>
<span class="n">np</span><span class="o">.</span><span class="n">floor</span><span class="p">((</span><span class="n">hour_sets</span> <span class="o">-</span> <span class="n">lng_hour</span><span class="p">)</span> <span class="o">%</span> <span class="mi">24</span><span class="p">))</span>
</code></pre></div>
<p><strong>Should we always round down to the nearest hour?</strong> Or should we round to the nearest whole hour (round, not floor), possibly rounding up? I find that rounding down (floor) produces evenly spaced photoperiod bands across longitudes. Any other rounding scheme produces a kind of aliasing that biases day length high or low at a certain longitude. Again, you should decide what is best for your application.</p>
<h2>Using Photoperiod for Climatic Data Aggregation</h2>
<p>What impact does a sun-up calculation of climatic variables have? Specifically, compared to a simple daily average over 24 hours, do we see an impact from an average calculated only over the photoperiod? If we look at the difference in VPD (on June 25, 2012) between a 24-hour calculation and a photoperiod calculation, we see that the atmospheric moisture demand over land is much higher when it is aggregated only from the hours when the sun is up. This makes sense—it's hotter during the day—but the sun-up calculation of VPD might also more accurately represent the atmospheric moisture demand experienced by plants during photosynthesis. The difference can be as much as 700 Pa! (I clipped the image below to 600 Pa for better visualization.)</p>
<p><a href="/images/20210217_diff_VPD.png"><img style="float:left;" src="/images/thumbs/20210217_diff_VPD_thumbnail_square.png" /></a></p>
<div style="clear:both;"></div>
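<p>A sun-up aggregation over one day of hourly data might be sketched like this, where <code>hourly</code> is a 24-element array and the sunrise and sunset hours come from a function like <code>sunrise_sunset()</code>, above; the handling of days that wrap past midnight (in UTC) is my own assumption:</p>

```python
import numpy as np

def daylight_mean(hourly, hour_rise, hour_sets):
    '''
    Mean of 24 hourly values, taken over the photoperiod only; a sunrise
    hour of -1 (sun never rises) falls back to a full 24-hour mean
    '''
    hourly = np.asarray(hourly)
    if hour_rise == -1:
        return hourly.mean()  # Polar night: average all hours
    if hour_sets >= hour_rise:
        return hourly[hour_rise:hour_sets + 1].mean()
    # Sunset hour (UTC) precedes sunrise hour: day wraps past midnight
    return np.concatenate(
        (hourly[hour_rise:], hourly[:hour_sets + 1])).mean()
```
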
<p>Note that while there is a faint stamping pattern of the photoperiod on the above difference image, the daylight-aggregated VPD image does not show these artifacts.</p>
<h2>Competing Approaches</h2>
<p>Before describing the <code>pyephem</code> and <code>astral</code> approaches, I want to further motivate my work here. I timed the <code>pyephem</code> and <code>astral</code> implementations (code below) using Python's <code>timeit</code> module. The wall times below (in microseconds) are the average of the per-loop times in three trials under similar CPU loads, for an Intel Core i7-10710U (1.10 GHz) CPU.</p>
<div class="highlight"><pre><span></span><code>$ python -m timeit -n <span class="m">100</span> -s <span class="s2">"import datetime;from my_module import updown_func"</span>
<span class="s2">"updown_func((42, -83), datetime.datetime.today())"</span>
</code></pre></div>
<table>
<thead>
<tr>
<th>Implementation</th>
<th style="text-align: right;">Time (usec)</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>pyephem</code></td>
<td style="text-align: right;">544</td>
</tr>
<tr>
<td><code>astral</code></td>
<td style="text-align: right;">125</td>
</tr>
<tr>
<td><em>Almanac for Computers</em> (Python)</td>
<td style="text-align: right;">75</td>
</tr>
<tr>
<td>My algorithm (Cython)</td>
<td style="text-align: right;">63</td>
</tr>
</tbody>
</table>
<p>The Cython implementation is the one I showed above. The <em>Almanac for Computers</em> (Python) implementation is similar but is in pure Python and calculates separate quantities throughout for sunrise and sunset—literally the way it is described in the <em>Almanac</em>, but ultimately unnecessary. The speed-up between the Python and Cython versions is probably due to Cython's static typing rather than to the redundant calculations we discarded.</p>
<p><strong>The <code>pyephem</code> approach asks us to define an observer and a celestial body (in this case, the sun).</strong> I like their API, especially the custom error classes <code>AlwaysUpError</code> and <code>NeverUpError</code>. As with the <em>Almanac for Computers</em> implementation, <code>dt</code> is a <code>datetime.date</code> instance and <code>coords</code> is a 2-element sequence of latitude and longitude; the snippet below is the body of a comparable function, which is why it ends with a bare <code>return</code>.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">ephem</span>
<span class="n">SUN</span> <span class="o">=</span> <span class="n">ephem</span><span class="o">.</span><span class="n">Sun</span><span class="p">()</span> <span class="c1"># Module-level constant</span>
<span class="n">obs</span> <span class="o">=</span> <span class="n">ephem</span><span class="o">.</span><span class="n">Observer</span><span class="p">()</span>
<span class="c1"># Positions in degrees are expected to be str type</span>
<span class="n">obs</span><span class="o">.</span><span class="n">lat</span><span class="p">,</span> <span class="n">obs</span><span class="o">.</span><span class="n">long</span> <span class="o">=</span> <span class="nb">map</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">coords</span><span class="p">)</span>
<span class="n">obs</span><span class="o">.</span><span class="n">date</span> <span class="o">=</span> <span class="n">dt</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">'%Y-%m-</span><span class="si">%d</span><span class="s1">'</span><span class="p">)</span>
<span class="n">obs</span><span class="o">.</span><span class="n">pressure</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1"># Do not calculate refraction</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">rising</span> <span class="o">=</span> <span class="n">obs</span><span class="o">.</span><span class="n">next_rising</span><span class="p">(</span><span class="n">SUN</span><span class="p">)</span><span class="o">.</span><span class="n">datetime</span><span class="p">()</span><span class="o">.</span><span class="n">hour</span>
<span class="k">except</span> <span class="p">(</span><span class="n">ephem</span><span class="o">.</span><span class="n">AlwaysUpError</span><span class="p">,</span> <span class="n">ephem</span><span class="o">.</span><span class="n">NeverUpError</span><span class="p">):</span>
<span class="n">rising</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">setting</span> <span class="o">=</span> <span class="n">obs</span><span class="o">.</span><span class="n">next_setting</span><span class="p">(</span><span class="n">SUN</span><span class="p">)</span><span class="o">.</span><span class="n">datetime</span><span class="p">()</span><span class="o">.</span><span class="n">hour</span>
<span class="k">except</span> <span class="n">ephem</span><span class="o">.</span><span class="n">AlwaysUpError</span><span class="p">:</span>
<span class="n">setting</span> <span class="o">=</span> <span class="mi">23</span>
<span class="k">except</span> <span class="n">ephem</span><span class="o">.</span><span class="n">NeverUpError</span><span class="p">:</span>
<span class="n">setting</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span>
<span class="k">return</span> <span class="p">(</span><span class="n">rising</span><span class="p">,</span> <span class="n">setting</span><span class="p">)</span>
</code></pre></div>
<p><strong>In the <code>astral</code> implementation, we don't have custom error classes differentiating cases where the sun is always above or below the horizon.</strong> Instead, a generic <code>ValueError</code> is raised when one tries to observe the sun from north of (south of) the Arctic (Antarctic) Circle on a date when it never rises or never sets. Therefore, I had to implement a pretty poor hack: we check to see whether we're above the Arctic Circle or below the Antarctic Circle during that hemisphere's summer or winter months.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">astral</span> <span class="kn">import</span> <span class="n">LocationInfo</span>
<span class="kn">from</span> <span class="nn">astral.sun</span> <span class="kn">import</span> <span class="n">sun</span>
<span class="n">lat</span><span class="p">,</span> <span class="n">lng</span> <span class="o">=</span> <span class="n">coords</span>
<span class="n">loc</span> <span class="o">=</span> <span class="n">LocationInfo</span><span class="p">()</span>
<span class="n">loc</span><span class="o">.</span><span class="n">latitude</span> <span class="o">=</span> <span class="n">lat</span>
<span class="n">loc</span><span class="o">.</span><span class="n">longitude</span> <span class="o">=</span> <span class="n">lng</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">sun</span><span class="p">(</span><span class="n">loc</span><span class="o">.</span><span class="n">observer</span><span class="p">,</span> <span class="n">date</span> <span class="o">=</span> <span class="n">dt</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">ValueError</span> <span class="k">as</span> <span class="n">err</span><span class="p">:</span>
<span class="k">if</span> <span class="n">lat</span> <span class="o">></span> <span class="mi">60</span> <span class="ow">and</span> <span class="n">dt</span><span class="o">.</span><span class="n">month</span> <span class="ow">in</span> <span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">9</span><span class="p">):</span>
<span class="k">return</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">23</span><span class="p">)</span> <span class="c1"># Sun always up above Arctic Circle</span>
<span class="k">if</span> <span class="n">lat</span> <span class="o"><</span> <span class="o">-</span><span class="mi">60</span> <span class="ow">and</span> <span class="n">dt</span><span class="o">.</span><span class="n">month</span> <span class="ow">in</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">12</span><span class="p">):</span>
<span class="k">return</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">23</span><span class="p">)</span> <span class="c1"># Sun always up below Antarctic Circle</span>
<span class="k">return</span> <span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># Sun always down</span>
<span class="k">return</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="s1">'sunrise'</span><span class="p">]</span><span class="o">.</span><span class="n">hour</span><span class="p">,</span> <span class="n">s</span><span class="p">[</span><span class="s1">'sunset'</span><span class="p">]</span><span class="o">.</span><span class="n">hour</span><span class="p">)</span>
</code></pre></div>
<p>I can't recommend the <code>astral</code> implementation, above, because of this hack. <strong>But why is the <code>pyephem</code> approach so slow?</strong> The backbone of <code>pyephem</code> is implemented in C, but there may be some overhead in initialization of the <code>Observer()</code>. If you only need to calculate transit times for a single observer-body pair, this performance hit would go unnoticed. However, if you need to calculate transit times for all the pixels in a global Earth system model, I think the modified <em>Almanac</em> approach is the best choice.</p>
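<p>To see why the <em>Almanac</em>-style calculation is so cheap, consider its core: a closed-form hour-angle formula, with no observer object to initialize. Below is a minimal sketch of my own (not the implementation used in this post); it ignores atmospheric refraction and the equation of time, and <code>day_length_hours</code> is a hypothetical helper name.</p>

```python
from math import acos, degrees, radians, tan

def day_length_hours(lat_deg, decl_deg):
    """Approximate hours of daylight from the sunrise hour angle:
    cos(omega0) = -tan(latitude) * tan(solar declination).
    Clamped to handle polar day and polar night, analogous to the
    AlwaysUpError/NeverUpError handling above."""
    x = -tan(radians(lat_deg)) * tan(radians(decl_deg))
    if x <= -1:
        return 24.0  # Sun never sets (polar day)
    if x >= 1:
        return 0.0   # Sun never rises (polar night)
    omega0 = degrees(acos(x))  # Sunrise hour angle, in degrees
    return 2 * omega0 / 15.0   # Earth rotates 15 degrees per hour
```

<p>At the equinoxes (solar declination near zero), this returns 12 hours of daylight at any latitude outside the polar regions, which is a quick sanity check.</p>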
<h2 id="refs">References</h2>
<ol>
<li>Chapin, F. S., Matson, P. A., & Vitousek, P. M. (2011). <u>Principles of Terrestrial Ecosystem Ecology.</u> Springer New York.</li>
<li>Zhao, M., Heinsch, F. A., Nemani, R. R., & Running, S. W. (2005). Improvements of the MODIS terrestrial gross and net primary production global data set. <em>Remote Sensing of Environment,</em> 95, 164–176.</li>
<li>Meeus, J. (1991). <u>Astronomical Algorithms.</u> Willman-Bell Inc.</li>
<li>U.S. Naval Observatory. (1990). <u>Almanac for Computers.</u> The Nautical Almanac Office, U.S. Naval Observatory, Washington, D.C.</li>
<li>NASA (2021). <a href="https://nssdc.gsfc.nasa.gov/planetary/factsheet/earthfact.html">"Earth Fact Sheet."</a> Accessed: February 10, 2021.</li>
<li>Andøya Space Center (2021). <a href="https://www.narom.no/undervisningsressurser/sarepta/rocket-theory/satellite-orbits/introduction-of-the-six-basic-parameters-describing-satellite-orbits/">"Introduction of the six basic parameters describing satellite orbits."</a> Accessed: February 10, 2021.</li>
</ol>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Implementing fixed effects panel models in R2019-03-30T13:51:00+01:002019-03-30T13:51:00+01:00K. Arthur Endsleytag:karthur.org,2019-03-30:/2019/implementing-fixed-effects-panel-models-in-r.html<p><strong>Note:</strong> This post builds and improves upon <a href="/2016/fixed-effects-panel-models-in-r.html">an earlier one</a>, where I introduce the Gapminder dataset and use it to explore how diagnostics for fixed effects panel models can be implemented.</p>
<p><strong>Note (July 2019):</strong> I have since updated this article to add material on making partial effects plots and to simplify and clarify the example models.</p>
<p><a href="/2016/fixed-effects-panel-models-in-r.html">My last post on this topic</a> explored how to implement fixed effects panel models and diagnostic tests for those models in R, specifically because the two libraries I used for this at the time, <code>plm</code> and <code>lfe</code>, in different ways, weren't entirely compatible with R's built-in tools for evaluating linear models.
<strong>Here, I want to write a much more general article on fixed effects regression and its implementation in R.
Specifically, I'll write about:</strong></p>
<ul>
<li>Use and interpretation of fixed effects (FE) regression models in the context of repeat-measures or longitudinal data;</li>
<li>How to implement an FE model in R using either the built-in <code>lm()</code> function or those provided by <code>plm</code> or <code>lfe</code>;</li>
<li>Calculating variance inflation factors (VIF);</li>
<li>Assessing multi-collinearity among predictor variables before fitting an FE model;</li>
<li>FE model criticism, including whether or not the assumptions of the linear model are met;</li>
<li>Calculating and plotting the marginal effect of <span class="math">\(X\)</span> on <span class="math">\(Y\)</span>, i.e., partial effects plots.</li>
</ul>
<p>In this article, I'll be using the Gapminder dataset again; the previous article gives a description of the dataset and its contents.</p>
<h2>Use and Interpretation of Fixed Effects Regression</h2>
<p><strong>I'm going to focus on fixed effects (FE) regression as it relates to time-series or longitudinal data, specifically, although FE regression is not limited to these kinds of data.</strong>
In the social sciences, these models are often referred to as "panel" models (as they are applied to a panel study) and so I generally refer to them as "fixed effects panel models" to avoid ambiguity for any specific discipline.
Longitudinal data are sometimes referred to as <em>repeat measures,</em> because we have multiple subjects observed over multiple periods, e.g., patients in a clinical trial or households in a study of spending habits throughout the year.
You can think of multiple examples where repeat measures are relevant.</p>
<p><a href="/2016/fixed-effects-panel-models-in-r.html">As I previously discussed</a>, fixed effects regression originates in the social sciences, in particular in econometrics and, separately, in prospective clinical or social studies:</p>
<blockquote>
<p>In these prospective studies, a panel of subjects (e.g., patients, children, families) are observed at multiple times (at least twice) over the study period. The chief premise behind fixed effects panel models is that each observational unit or individual (e.g., a patient) is used as its own control, exploiting powerful estimation techniques that remove the effects of any unobserved, time-invariant heterogeneity</p>
</blockquote>
<p>The term "fixed effects" can be confusing, and is contested, particularly in situations where fixed effects can be replaced with random effects.
Clark and Linzer (2014) provide a good discussion of the differences and trade-offs between fixed and random effects [<a href="#refs">1</a>].
Gelman and Hill (2007) or Bolker et al. (2009) also provide good discussions of the differences between fixed and random effects [<a href="#refs">2,3</a>].</p>
<h3>Relevance of Fixed Effects Regression for Causal Inference</h3>
<p>Repeat measures are commonly required for a particular type of causal inference.
In these studies, the interpretation of a causal effect is that it occurs before or at the same time as the measured outcome (some causal effects appear to be simultaneous with the outcome, such as flipping on a light switch).
In fact, FE regression models are often used to establish <em>weak causal inference</em> under certain circumstances; we'll soon see why.</p>
<p>But even where causal inference is not the goal, FE regression models allow us to control for omitted variables.
In the context of a regression model, an <em>omitted variable</em> is any variable that explains some variation in our response or dependent variable <em>and</em> co-varies with one or more of the independent variables.
It is something that we should be measuring and adding to our regression model because it predicts or explains our dependent variable but also because the relationship between one of our existing independent variables may depend on that omitted variable.
For example, if we're interested in measuring the effect of different amounts of a fertilizer on crop yield (i.e., the weight or biomass of the harvested crop) across a set of different crop types, omitted variables might include (if we failed to measure them) the crop type or the type of soil each plant is in.
Crop type certainly affects crop yield, as certain crops will have different ranges of yields they can achieve, but also may affect the way that fertilizer drives yields; certain crops may be more or less sensitive to the fertilizer we're using.
Soil type, too, will affect yields (without fertilizer, it is the only source of the crop's nutrients) and the properties of the soil may affect how fertilizer is retained and subsequently absorbed by a plant's roots.
In our study, failing to account for either crop type or soil type would be a source of <em>omitted variable bias</em> in our study design and in our model.</p>
<p><strong>FE regression models eliminate omitted variable bias with respect to potentially omitted variables that do not change over time.</strong>
Such time-invariant variables, like crop type or soil type, from our previous example, will be the same for each subject in our model every time it is measured.
In a clinical trial, patient sex, eye color, and height (in grown adults) are all examples of time-invariant variables.
We'll soon see how the use of subject-level fixed effects controls for any and all time-invariant omitted variables.
But first, let's appreciate the implications for causal inference.</p>
<p>Let's say we have repeat measures of <span class="math">\(y\)</span>, some outcome of interest, and of multiple <span class="math">\(x_i\)</span> or independent variables.
We have measured every relevant variable that varies over time and affects <span class="math">\(y\)</span> and/or the relationship between <span class="math">\(y\)</span> and the other <span class="math">\(x_i\)</span>.
Furthermore, we have controlled for all sources of time-invariant differences between subjects [<a href="#refs">1</a>].
That means the only variable(s) that can explain differences in <span class="math">\(y\)</span> are one or more of those time-varying <span class="math">\(x_i\)</span> we have measured.
By estimating the effect of <span class="math">\(x_i\)</span> within an individual subject over time, relative to that subject's long-term average conditions, we eliminate the effects of all unobserved, time-invariant heterogeneity between the different subjects [<a href="#refs">4</a>].
We can then argue that a change in the level of any particular <span class="math">\(x_i\)</span> (if there is a sufficient mechanism we can explain) is a likely <em>cause</em> of a corresponding change in <span class="math">\(y\)</span>.
Much of this depends on the nature of your data, whether or not your proposed <em>treatment</em> variable is reasonable, whether or not you have actually controlled for everything relevant, and, no less important, the reception this type of model will receive from your intended audience (or field of study).
In general, causal inference with panel models still requires an assumption of strong exogeneity (simply put: no hidden variables and no feedbacks).</p>
<h3>General Specification of Fixed Effects Models</h3>
<p>In general, for a sample of subjects indexed <span class="math">\(i\in [0, 1, 2, \dots ]\)</span>, where each individual subject can be identified as part of a group, <span class="math">\(j\)</span>, of other observations (on the same individual or on multiple other individuals), the outcome for an individual can be modeled as:
</p>
<div class="math">$$
y_{ij} = \alpha_j + X_i\beta + \varepsilon_i;\quad \varepsilon_i\sim N(0, \sigma_y^2)
$$</div>
<p>In order for this model to be identified, it is essential that the rank of <span class="math">\(i\)</span> be larger than the rank of <span class="math">\(j\)</span>, i.e., that there are more individual subjects than there are groups.
Otherwise, the <span class="math">\(\alpha_j\)</span> terms would absorb all the degrees of freedom in the model.</p>
<h3>Alternative Specifications for Longitudinal Data</h3>
<p>Similarly, in a repeat-measures or longitudinal framework, where the "groups" of individuals are time periods, <strong>it is essential each individual subject is observed more than once.</strong>
Obviously, if the number of observations <span class="math">\(N\)</span> were equal to the number of individuals, <span class="math">\(M\)</span>, we would exhaust the degrees of freedom in our model simply by adding <span class="math">\(M\)</span> intercept terms, <span class="math">\(\alpha_1, \dots, \alpha_M\)</span>.
With as few as two observations <span class="math">\((t \in [1,2])\)</span> of each subject, however, we've doubled the number of observations and the individual intercept terms now correspond to any time-invariant, idiosyncratic change between those two observations.</p>
<p>We can specify our model in two different ways; though very different, they have the same interpretation and will produce the same parameter estimates in a least-squares regression.
Compared to the general specification, above, we exchange the index of groups, <span class="math">\(j\)</span>, for an index of time periods, <span class="math">\(t\)</span>.
The first specification is an ordinary least squares (OLS) regression in which a fixed intercept, <span class="math">\(\alpha_i\)</span> is fit for every subject <span class="math">\(i\)</span>.
</p>
<div class="math">$$
y_{it} = \alpha_i + X_{it}\beta + \varepsilon_{it}
$$</div>
<p>The second specification subtracts the subject-specific mean values of our dependent variable, <span class="math">\(y\)</span>, and independent variables, <span class="math">\(X\)</span>, from the values at each period of observation, <span class="math">\(t\)</span>, for every subject, <span class="math">\(i\)</span>.
</p>
<div class="math">$$
y_{it} - \bar{y_i} = \left(X_{it} - \bar{X_i}\right)\beta + \varepsilon_{it}
$$</div>
<p><strong>These two specifications are equivalent because fitting a subject-specific intercept, <span class="math">\(\alpha_i\)</span>, effectively reduces the variation in each subject's <span class="math">\(y_i\)</span> and <span class="math">\(X_i\)</span> to variation around its long-term mean.</strong>
In the first, or fixed-intercept specification, <span class="math">\(\alpha_i\)</span> represents each subject's long-term mean.
In the second, or demeaned specification, subtracting the subject-specific mean values of the dependent and independent variables is called <em>centering</em> the data within subjects.
This is because the resulting values now have a mean value of zero.</p>
<p>As with everything in statistics, a diverse set of terms have been created to describe the same thing, and the terms used often depend on the <em>lingua franca</em> of a particular discipline.
Subtracting the subject-specific means can be variously referred to as <em>centering the data within subjects</em> or <em>time-demeaning the data</em> (subtracting the long-term mean); the centered values themselves can also be referred to as <em>deviations from the (subject-specific) mean.</em></p>
<h3>Interpretation</h3>
<p>Setting aside issues of causal inference, how do we interpret a fixed effects regression?
Because of the way the data have been transformed (into deviations from subject-specific means), we cannot interpret the coefficients in the same way as for a cross-sectional OLS regression.
In the cross-sectional case, we interpret a regression coefficient, <span class="math">\(\beta\)</span>, as the change in our dependent variable per unit change in the corresponding independent variable <em>across</em> or <em>between subjects;</em> in a sense, we are estimating the effect of a difference between two subjects, one average in every way, and the other different by one unit in the corresponding independent variable.
In a one-way fixed effects regression, because the dependent and independent variables have been transformed to deviations from the subject-specific means, <span class="math">\(\beta\)</span> is instead interpreted as the change in our dependent variable, <span class="math">\(y\)</span>, per unit change in the corresponding independent variable, <span class="math">\(x\)</span>, <em>within each subject.</em>
In this sense, the regression coefficients tell us about the relationship between <span class="math">\(x\)</span> and <span class="math">\(y\)</span> as the subject's <span class="math">\(x\)</span> changes over time.
If we accept weak causal inference is justified, the model can be interpreted as: a unit change in <span class="math">\(x\)</span> <em>drives</em> an estimated change in <span class="math">\(y\)</span>.</p>
<h3>Including Time Period Fixed Effects</h3>
<p>This model can be extended further to include both <em>individual</em> fixed effects (as above) and <em>time</em> fixed effects (the "two-ways" model):</p>
<div class="math">$$
y_{it} = \alpha_i + X_{it}\beta + \mu_t + \varepsilon_{it}
$$</div>
<p>Here, <span class="math">\(\mu_t\)</span> is an intercept term specific to the time period of observation; it represents any change over time that affects all observational units in the same way (e.g., the weather or the news in an outpatient study).
These effects can also be thought of as "transitory and idiosyncratic forces acting upon [observational] units (i.e., disturbances)" [<a href="#refs">5</a>].</p>
<p><strong>However, including time period fixed effects changes the interpretation of our model considerably.</strong>
In the individual fixed effects (only) model, <span class="math">\(\beta\)</span> represented the "within" effect: the effect of a change in <span class="math">\(X_i\)</span> on <span class="math">\(y\)</span> <em>within</em> each individual <span class="math">\(i\)</span>.
Now, the time period fixed effect functions as an additional grouping in which the data are centered (in our time-demeaning framework, described above).
With both time and individual fixed effects, <span class="math">\(\beta\)</span> essentially represents a weighted average between the pooled estimator, <span class="math">\(\beta_{OLS}\)</span> (from an OLS regression without fixed effects), the within estimator from our individual effects model, and a <em>between</em> effect from a model with time fixed effects (only) and no individual effects [<a href="#refs">6</a>].</p>
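<p>For concreteness, both the individual (one-way) and two-ways specifications can be fit with the <code>plm</code> package, which we turn to below. A sketch, using the same Gapminder variables as the examples later in this article:</p>

```r
library(plm)
library(gapminder)
data(gapminder)

# One-way model: country (individual) fixed effects only
m.fe1 <- plm(lifeExp ~ gdpPercap + pop, data = gapminder,
  index = c('country', 'year'), model = 'within', effect = 'individual')

# Two-ways model: country AND time-period fixed effects
m.fe2 <- plm(lifeExp ~ gdpPercap + pop, data = gapminder,
  index = c('country', 'year'), model = 'within', effect = 'twoways')
```

<p>Only the <code>effect</code> argument changes between the two calls, but, as discussed above, the interpretation of the coefficients changes considerably.</p>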
<p>As Kropko and Kubinec (2018) write, regarding a similar econometric model to the one we investigate here:</p>
<blockquote>
<p>This interpretation will often be difficult to communicate and to understand. The difficulty arises because the interpretation requires two dimensions of comparison, not just one. GDP per capita is negative relative to the country’s over-time average, so we compare a country to itself as it changes over time. But then, by regressing relative democracy on relative GDP per capita for the six countries, the two-way FE coefficient ultimately expresses how one country’s GDP per capita and democracy, relative to itself, compares to another country’s GDP per capita and democracy, relative to itself. If this interpretation does not match the question the model is intended to answer, then we suggest that applied researchers employ methods with interpretations that directly answer the research question.</p>
</blockquote>
<h2>Implementation in R</h2>
<p>Let's load in the Gapminder dataset for the following examples.
Since my previous article, I've discovered there is a <code>gapminder</code> package available for R that makes it easy to load these data into an R session.</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">gapminder</span><span class="p">)</span>
<span class="nf">data</span><span class="p">(</span><span class="n">gapminder</span><span class="p">)</span>
</code></pre></div>
<p>Now let's take a brief look at the data. We're interested in modeling the effect of per-capita GDP on life expectancy. Let's first observe that per-capita GDP has a log-linear relationship with life expectancy. Taking a base-10 logarithm of right-skewed dollar values is generally good practice, and the plot below shows that doing so here improves the linear relationship with life expectancy.</p>
<div class="highlight"><pre><span></span><code><span class="nf">with</span><span class="p">(</span><span class="n">gapminder</span><span class="p">,</span> <span class="nf">plot</span><span class="p">(</span><span class="nf">log10</span><span class="p">(</span><span class="n">gdpPercap</span><span class="p">),</span> <span class="n">lifeExp</span><span class="p">,</span>
<span class="n">main</span> <span class="o">=</span> <span class="s">'Life Expectancy vs. Log10 Per-Capita GDP'</span><span class="p">))</span>
</code></pre></div>
<p><img alt="Life expectancy vs. log10 per-capita GDP" src="http://karthur.org/images/20190330_lifeExp_vs_gdpPercap.png"></p>
<p>However, for now, we'll model the effect of per-capita GDP without a log transformation because it is simpler.</p>
<p><strong>There are at least three ways to run a fixed effects (FE) regression in R and it's important to be familiar with your options.</strong></p>
<h3>With R's Built-in Ordinary Least Squares Estimation</h3>
<p>First, it's clear from the first specification above that an FE regression model can be implemented with R's OLS regression function, <code>lm()</code>, simply by fitting an intercept for each level of a factor that indexes each subject in the data.</p>
<div class="highlight"><pre><span></span><code><span class="n">m1.ols</span> <span class="o"><-</span> <span class="nf">lm</span><span class="p">(</span><span class="n">lifeExp</span> <span class="o">~</span> <span class="n">country</span> <span class="o">+</span> <span class="n">gdpPercap</span> <span class="o">+</span> <span class="n">pop</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">gapminder</span><span class="p">)</span>
</code></pre></div>
<p>One disadvantage of this approach becomes clear as soon as you call <code>summary(m1.ols)</code>; the subject-specific (here, country-specific) intercepts are reported for 140+ countries in this dataset!
That's a lot to scroll through to get to the coefficients we're actually interested in.</p>
<div class="highlight"><pre><span></span><code><span class="nf">summary</span><span class="p">(</span><span class="n">m1.ols</span><span class="p">)</span><span class="o">$</span><span class="n">coefficients</span><span class="p">[</span><span class="nf">c</span><span class="p">(</span><span class="s">'gdpPercap'</span><span class="p">,</span> <span class="s">'pop'</span><span class="p">),]</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code> Estimate Std. Error t value Pr(>|t|)
gdpPercap 3.936623e-04 2.973936e-05 13.23708 5.512379e-38
pop 6.196916e-08 4.838246e-09 12.80819 8.746824e-36
</code></pre></div>
<p>We'll interpret these coefficients later.
For now, let's convince ourselves that this model produces the same results if we use centered data and no country-level intercepts.</p>
<p>The initial challenge is in centering the data.
I'm going to use a relatively sophisticated tool to do this, simply because I don't know of a reasonable way to do it with base R.
Using the <code>dplyr</code> library's <code>mutate_at()</code> function, we'll calculate a new, centered variable for our dependent variable, <code>lifeExp</code>, and each of the three independent variables.
Each new variable gets the suffix <code>_dm</code>, my abbreviation for "de-meaned," as in: the mean has been subtracted from the variable; you can call it whatever you want.</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span>
<span class="n">gapminder.centered</span> <span class="o"><-</span> <span class="n">gapminder</span> <span class="o">%>%</span>
<span class="nf">group_by</span><span class="p">(</span><span class="n">country</span><span class="p">)</span> <span class="o">%>%</span>
<span class="nf">mutate_at</span><span class="p">(</span><span class="n">.vars</span> <span class="o">=</span> <span class="nf">vars</span><span class="p">(</span><span class="n">year</span><span class="p">,</span> <span class="n">lifeExp</span><span class="p">,</span> <span class="n">pop</span><span class="p">,</span> <span class="n">gdpPercap</span><span class="p">),</span> <span class="n">.funs</span> <span class="o">=</span> <span class="nf">funs</span><span class="p">(</span><span class="s">'dm'</span> <span class="o">=</span> <span class="n">.</span> <span class="o">-</span> <span class="nf">mean</span><span class="p">(</span><span class="n">.</span><span class="p">)))</span>
<span class="nf">summary</span><span class="p">(</span><span class="n">gapminder.centered</span><span class="o">$</span><span class="n">lifeExp_dm</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code> Min. 1st Qu. Median Mean 3rd Qu. Max.
-20.8647 -4.2138 0.4733 0.0000 4.5696 17.1973
</code></pre></div>
<p>We can see from the above that the overall mean, across all subjects, is zero.
This is a consequence of the fact that the mean of each subject's measures is zero, so the mean of means is also zero.
To fit the second or demeaned specification of the model using <code>lm()</code>, we plug in each of these centered or demeaned variables.</p>
<div class="highlight"><pre><span></span><code><span class="n">m2.ols</span> <span class="o"><-</span> <span class="nf">lm</span><span class="p">(</span><span class="n">lifeExp</span> <span class="o">~</span> <span class="n">gdpPercap_dm</span> <span class="o">+</span> <span class="n">pop_dm</span><span class="p">,</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">gapminder.centered</span><span class="p">)</span>
</code></pre></div>
<p>Now, let's compare the coefficients.</p>
<div class="highlight"><pre><span></span><code><span class="nf">cbind</span><span class="p">(</span>
<span class="nf">coef</span><span class="p">(</span><span class="n">m1.ols</span><span class="p">)[</span><span class="nf">c</span><span class="p">(</span><span class="s">'gdpPercap'</span><span class="p">,</span> <span class="s">'pop'</span><span class="p">)],</span>
<span class="nf">coef</span><span class="p">(</span><span class="n">m2.ols</span><span class="p">)[</span><span class="nf">c</span><span class="p">(</span><span class="s">'gdpPercap_dm'</span><span class="p">,</span> <span class="s">'pop_dm'</span><span class="p">)]</span>
<span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code> [,1] [,2]
gdpPercap 3.936623e-04 3.936623e-04
pop 6.196916e-08 6.196916e-08
</code></pre></div>
<p>As you can see, the specifications are equivalent.
These coefficients are also correct point estimates for the "within" effect of each independent variable on the outcome.
<strong>However, as we'll discuss, the standard errors are not correct; this OLS model fails to account for the fact that we have repeat measures of each subject.</strong>
This is a violation of the assumption of independence of errors.
With our data, where we have multiple measures for the same country, some elements of the error term, <span class="math">\(\varepsilon\)</span>, are not independent.
Clustering the standard errors within countries is one solution, which I won't detail here.
The more sophisticated approaches to this model discussed below will deliver the correct standard errors.</p>
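<p>As an aside, here is a minimal sketch of that clustering correction, assuming the <code>sandwich</code> and <code>lmtest</code> packages are available (neither is used elsewhere in this post) and that <code>m1.ols</code> was fit with country dummies:</p>

```r
library(gapminder)  # The same panel dataset used throughout
library(sandwich)   # Cluster-robust covariance estimators
library(lmtest)     # coeftest(), for re-testing coefficients

# The OLS model with subject-specific (country) intercepts
m1.ols <- lm(lifeExp ~ gdpPercap + pop + country, data = gapminder)

# Cluster the standard errors within countries
vcov.cluster <- vcovCL(m1.ols, cluster = gapminder$country)
coeftest(m1.ols, vcov. = vcov.cluster)[c('gdpPercap', 'pop'),]
```

<p>The point estimates are unchanged; only the standard errors (and hence the t-values) differ from the plain <code>summary(m1.ols)</code> table.</p>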
<p>If you have a very large number of subjects, the <code>lm()</code> function will cease to work for the first specification of our model, with subject-specific intercepts; it simply wasn't designed to fit thousands of intercepts and it will either take a long time to compute or will fail utterly.
This is where the centering approach comes in handy: it is much easier (on the computer) to work with deviations from the mean instead of computing all those subject-specific intercepts.
The <code>plm</code> and <code>lfe</code> libraries, which we'll discuss next, have no issue with a large number of subjects in your data, and you don't need to think about the two specifications we discussed when you're using those libraries.</p>
<h3>With Dedicated Approaches for Mean Deviations</h3>
<p>Now let's see how the dedicated packages <code>plm</code> and <code>lfe</code> are used.
I'll give more time to <code>plm</code> because it is my preferred tool, but they both work very well.
Neither of these packages fits intercepts directly, because this doesn't scale well for a large number of subjects.
In cases where there is a very large subject population, such an approach (which we tried with OLS, above) could lead to a failure to identify the model.
Instead, these packages have tools to fit FE regression models to data that have been transformed into deviations from subject-specific means or from more complicated deviation measures (in the case of two-ways fixed effects models).</p>
<p>With the <code>lfe</code> package [<a href="#refs">7</a>], our fixed effects regression of life expectancy on per-capita GDP and total population can be expressed with a syntax similar to that of the popular <code>lme4</code> and <code>nlme</code> packages.
The <code>felm()</code> function is what we want to use to fit fixed effects models with <code>lfe</code>.</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">lfe</span><span class="p">)</span>
<span class="n">m1.lfe</span> <span class="o"><-</span> <span class="n">lfe</span><span class="o">::</span><span class="nf">felm</span><span class="p">(</span><span class="n">lifeExp</span> <span class="o">~</span> <span class="n">gdpPercap</span> <span class="o">+</span> <span class="n">pop</span> <span class="o">|</span> <span class="n">country</span><span class="p">,</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">gapminder</span><span class="p">)</span>
</code></pre></div>
<p>The <code>| country</code> syntax indicates we wish to fit a fixed intercept for each level of <code>country</code>.
If we compare the coefficient estimates of this model to those of both of our prior OLS models, we'll see that we are indeed fitting exactly the same mean structure in all three approaches.</p>
<div class="highlight"><pre><span></span><code><span class="nf">cbind</span><span class="p">(</span>
<span class="nf">coef</span><span class="p">(</span><span class="n">m1.ols</span><span class="p">)[</span><span class="nf">c</span><span class="p">(</span><span class="s">'gdpPercap'</span><span class="p">,</span> <span class="s">'pop'</span><span class="p">)],</span>
<span class="nf">coef</span><span class="p">(</span><span class="n">m2.ols</span><span class="p">)[</span><span class="nf">c</span><span class="p">(</span><span class="s">'gdpPercap_dm'</span><span class="p">,</span> <span class="s">'pop_dm'</span><span class="p">)],</span>
<span class="nf">coef</span><span class="p">(</span><span class="n">m1.lfe</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div>
<p>If we examine the standard errors, however, we'll see that they are different in the demeaned OLS (or "OLS on mean deviations") model.</p>
<div class="highlight"><pre><span></span><code><span class="n">std.errs</span> <span class="o"><-</span> <span class="nf">cbind</span><span class="p">(</span>
<span class="nf">summary</span><span class="p">(</span><span class="n">m1.ols</span><span class="p">)</span><span class="o">$</span><span class="n">coefficients</span><span class="p">[</span><span class="nf">c</span><span class="p">(</span><span class="s">'gdpPercap'</span><span class="p">,</span> <span class="s">'pop'</span><span class="p">),</span><span class="m">2</span><span class="p">],</span>
<span class="nf">summary</span><span class="p">(</span><span class="n">m2.ols</span><span class="p">)</span><span class="o">$</span><span class="n">coefficients</span><span class="p">[</span><span class="nf">c</span><span class="p">(</span><span class="s">'gdpPercap_dm'</span><span class="p">,</span> <span class="s">'pop_dm'</span><span class="p">),</span><span class="m">2</span><span class="p">],</span>
<span class="nf">summary</span><span class="p">(</span><span class="n">m1.lfe</span><span class="p">)</span><span class="o">$</span><span class="n">coefficients</span><span class="p">[,</span><span class="m">2</span><span class="p">]</span>
<span class="p">)</span>
<span class="nf">colnames</span><span class="p">(</span><span class="n">std.errs</span><span class="p">)</span> <span class="o"><-</span> <span class="nf">c</span><span class="p">(</span><span class="s">'OLS w/ Intercepts'</span><span class="p">,</span> <span class="s">'OLS on Mean Deviations'</span><span class="p">,</span> <span class="s">'felm Model'</span><span class="p">)</span>
<span class="n">std.errs</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code> OLS w/ Intercepts OLS on Mean Deviations felm Model
gdpPercap 2.973936e-05 6.052637e-05 2.973936e-05
pop 4.838246e-09 9.846931e-09 4.838246e-09
</code></pre></div>
<p>In general, you cannot rely on OLS to deliver the correct standard errors when you have a dependence structure like repeat measures in your data.
In this case, the standard errors for the OLS model with intercepts (first column) are the same as estimated by <code>lfe</code> (third column), because our OLS model does have a dummy variable for each country-specific intercept.
The correction for standard errors, in this case, is straightforward, as the <code>felm()</code> documentation describes:</p>
<blockquote>
<p>The standard errors are adjusted for the reduced degrees of freedom coming from the dummies which are implicitly present.</p>
</blockquote>
<p>In the second column, we can see that the standard errors for the demeaned OLS model are not correct.
The advantage of <code>lfe</code> (and <code>plm</code>) is that it achieves the computational efficiency of a mean-deviations approach and is also able to estimate the correct standard errors.
<strong>Now let's see how <code>plm</code> handles the same model.</strong>
In <code>plm</code>, the function we'll use to fit FE regression models is also called <code>plm</code> [<a href="#refs">8</a>].
Below, the <code>index</code> argument indicates which column has levels corresponding to the subjects, for which individual, subject-level intercepts will be fit implicitly.</p>
<div class="highlight"><pre><span></span><code><span class="n">m1.plm</span> <span class="o"><-</span> <span class="nf">plm</span><span class="p">(</span><span class="n">lifeExp</span> <span class="o">~</span> <span class="n">gdpPercap</span> <span class="o">+</span> <span class="n">pop</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">gapminder</span><span class="p">,</span>
<span class="n">model</span> <span class="o">=</span> <span class="s">'within'</span><span class="p">,</span> <span class="n">index</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="s">'country'</span><span class="p">))</span>
<span class="nf">summary</span><span class="p">(</span><span class="n">m1.plm</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>Oneway (individual) effect Within Model
Call:
plm(formula = lifeExp ~ gdpPercap + pop, data = gapminder, model = "within",
index = c("country"))
Balanced Panel: n = 142, T = 12, N = 1704
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-30.23943 -3.25287 0.31427 3.54819 19.85916
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
gdpPercap 3.9366e-04 2.9739e-05 13.237 < 2.2e-16 ***
pop 6.1969e-08 4.8382e-09 12.808 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 73973
Residual Sum of Squares: 59768
R-Squared: 0.19203
Adj. R-Squared: 0.11796
F-statistic: 185.378 on 2 and 1560 DF, p-value: < 2.22e-16
</code></pre></div>
<p>If you compare these coefficient estimates to all three previous models, you'll see they're the same.
Other things to note in the summary of <code>plm</code> include:</p>
<ul>
<li>We have fit a "Oneway (individual) effect Within Model;" that is, we only fit fixed effects for the individual subjects (countries). In <code>plm</code>, this is the default.</li>
<li>Our panel data are <em>balanced,</em> that is, every subject (country) has the same number of observations. Here, we have <code>n = 142</code> countries observed in <code>T = 12</code> time periods (12 different years) for a total number of <code>N = 1704</code> country-year observations. In general, fixed effects regression models are better understood and more reliable for balanced panels.</li>
</ul>
<p>The R-squared and adjusted R-squared estimated by <code>plm</code> for a "within" model are computed on the demeaned (transformed) data.
If you call <code>summary()</code> on our <code>lfe</code> model, you'll see that it reports both the full model R-squared (which credits the country-level fixed effects) and that of the "projected" model.
The "projected" model R-squared is the <em>within</em> R-squared, or the proportion of the variation <em>over time</em> explained by the time-varying covariates; it corresponds to the R-squared reported by <code>plm</code>.</p>
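<p>To make the distinction concrete, the within R-squared can be computed by hand from the demeaned data; this is a sketch, with variable names that are mine rather than from either package:</p>

```r
library(gapminder)
library(plm)

# The one-way fixed effects model from above
m1.plm <- plm(lifeExp ~ gdpPercap + pop, data = gapminder,
              model = 'within', index = c('country'))

# Demean the response within countries, then compare the residual sum of
# squares to the total sum of squares of the demeaned response
lifeExp.dm <- plm::Within(plm::pdata.frame(gapminder, index = 'country')$lifeExp)
rss <- sum(residuals(m1.plm)^2)
tss <- sum(as.numeric(lifeExp.dm)^2)
r2.within <- 1 - rss / tss  # The within R-squared
```

<p>Note that <code>rss</code> and <code>tss</code> here match the "Residual Sum of Squares" and "Total Sum of Squares" lines printed by <code>summary(m1.plm)</code>.</p>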
<h2>Diagnostics and Inference in R</h2>
<h3>Assessing Multicollinearity in Fixed Effects Regression Models</h3>
<p>Multicollinearity arises when two or more independent variables are highly correlated with one another.
It poses a serious problem for explanatory models of all kinds, including non-parametric and statistical learning approaches, because if the correlation between <span class="math">\(x_i\)</span> and <span class="math">\(x_j\)</span> is large, and both are similarly correlated with the outcome of interest, <span class="math">\(y\)</span>, then the model cannot determine which of the two, <span class="math">\(x_i\)</span> or <span class="math">\(x_j\)</span>, is explaining the observed variation in <span class="math">\(y\)</span>.
If multicollinearity exists, linear regression coefficients can be unstable, with inflated standard errors.</p>
<p>A popular way of diagnosing multicollinearity is through the calculation of variance inflation factors (VIFs).
The VIF score indicates the factor by which the variance of a coefficient estimate is inflated due to the inclusion of a particular covariate.
Calculating VIF scores with fitted models other than those produced by <code>lm()</code> can be tricky (it won't work with <code>plm</code> or <code>lfe</code> models), so the easiest way to calculate VIF scores for a one-way fixed effects regression model is to calculate them over the corresponding fitted OLS model.
We saw that we could get the time-demeaned panel of the Gapminder data easily enough with <code>dplyr</code> and <code>mutate_at()</code>; below is a second way to do this using <code>plm</code>.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Assuming we've already fit our plm() model...</span>
<span class="n">design.matrix</span> <span class="o"><-</span> <span class="nf">as.data.frame</span><span class="p">(</span><span class="nf">model.matrix</span><span class="p">(</span><span class="n">m1.plm</span><span class="p">))</span>
<span class="c1"># Get the time-demeaned response variable, lifeExp</span>
<span class="n">design.matrix</span><span class="o">$</span><span class="n">lifeExp</span> <span class="o"><-</span> <span class="n">plm</span><span class="o">::</span><span class="nf">Within</span><span class="p">(</span>
<span class="n">plm</span><span class="o">::</span><span class="nf">pdata.frame</span><span class="p">(</span><span class="n">gapminder</span><span class="p">,</span> <span class="n">index</span> <span class="o">=</span> <span class="s">'country'</span><span class="p">)</span><span class="o">$</span><span class="n">lifeExp</span><span class="p">)</span>
<span class="c1"># Fit the OLS model on the demeaned dataset</span>
<span class="n">m3.ols</span> <span class="o"><-</span> <span class="nf">lm</span><span class="p">(</span><span class="n">lifeExp</span> <span class="o">~</span> <span class="n">gdpPercap</span> <span class="o">+</span> <span class="n">pop</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">design.matrix</span><span class="p">)</span>
<span class="c1"># Calculate VIF scores</span>
<span class="n">car</span><span class="o">::</span><span class="nf">vif</span><span class="p">(</span><span class="n">m3.ols</span><span class="p">)</span>
</code></pre></div>
<p>Here, the VIF scores are all very low, so multicollinearity is not an issue.</p>
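<p>For intuition, a VIF can also be computed by hand as <span class="math">\(1 / (1 - R_j^2)\)</span>, where <span class="math">\(R_j^2\)</span> is the R-squared from regressing the <span class="math">\(j\)</span>-th covariate on the remaining covariates. Here is a sketch using the demeaned variables; the auxiliary model name is mine:</p>

```r
library(gapminder)
library(plm)

m1.plm <- plm(lifeExp ~ gdpPercap + pop, data = gapminder,
              model = 'within', index = c('country'))
design.matrix <- as.data.frame(model.matrix(m1.plm))

# Auxiliary regression of one (demeaned) covariate on the other;
# a VIF near 1 indicates the covariates are nearly uncorrelated
aux <- lm(gdpPercap ~ pop, data = design.matrix)
vif.gdpPercap <- 1 / (1 - summary(aux)$r.squared)
```
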
<h3>Linear Model Assumptions: Homoscedasticity</h3>
<p>We can assess homoscedasticity, or constant variance in the residuals, by examining a plot of the model residuals against the fitted values.
Once again, R's <code>fitted()</code> doesn't know how to work with <code>plm</code> model objects; however, we can calculate the fitted values as the difference between the observed values and the model residuals.</p>
<div class="highlight"><pre><span></span><code><span class="nf">par</span><span class="p">(</span><span class="n">bg</span> <span class="o">=</span> <span class="s">'#eeeeee'</span><span class="p">)</span>
<span class="n">fitted.values</span> <span class="o"><-</span> <span class="nf">as.numeric</span><span class="p">(</span><span class="n">gapminder</span><span class="o">$</span><span class="n">lifeExp</span> <span class="o">-</span> <span class="nf">residuals</span><span class="p">(</span><span class="n">m1.plm</span><span class="p">))</span>
<span class="nf">plot</span><span class="p">(</span><span class="n">fitted.values</span><span class="p">,</span> <span class="nf">residuals</span><span class="p">(</span><span class="n">m1.plm</span><span class="p">),</span>
<span class="n">bty</span> <span class="o">=</span> <span class="s">'n'</span><span class="p">,</span> <span class="n">xlab</span> <span class="o">=</span> <span class="s">'Fitted Values'</span><span class="p">,</span> <span class="n">ylab</span> <span class="o">=</span> <span class="s">'Residuals'</span><span class="p">,</span>
<span class="n">main</span> <span class="o">=</span> <span class="s">'Residuals vs. Fitted'</span><span class="p">)</span>
<span class="nf">abline</span><span class="p">(</span><span class="n">h</span> <span class="o">=</span> <span class="m">0</span><span class="p">,</span> <span class="n">col</span> <span class="o">=</span> <span class="s">'red'</span><span class="p">,</span> <span class="n">lty</span> <span class="o">=</span> <span class="s">'dashed'</span><span class="p">)</span>
</code></pre></div>
<p><img alt="Model residuals vs. fitted values" src="http://karthur.org/images/20190330_residuals_v_fitted.png"></p>
<p>There certainly seems to be some heteroscedasticity present, particularly in the presence of relatively large, negative residuals or high fitted values.
Studentized residuals are one way of assessing the magnitude of a residual in standardized units [<a href="#refs">8</a>].
To get studentized residuals, we first have to derive the <em>hat matrix</em> (or "projection matrix") from our linear model.
This is the matrix given by the linear transformation by which we obtained the estimated coefficients for our model, <span class="math">\(\beta\)</span>.</p>
<div class="math">$$
X\hat{\beta} = X(X^T X)^{-1} X^Ty = Py
$$</div>
<p>Where <span class="math">\(X\)</span> is the design matrix (matrix of explanatory variables) and <span class="math">\(y\)</span> is the vector of our observed response values.
The hat (or projection) matrix is denoted by <span class="math">\(P\)</span>.
The diagonal of this <span class="math">\(N\times N\)</span> matrix (<code>diag(P)</code> in R) contains the leverages for each observation point.
In R, we use matrix multiplication and the <code>solve()</code> function (to obtain the inverse of a matrix).</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Calculate projection matrix</span>
<span class="n">X</span> <span class="o"><-</span> <span class="nf">model.matrix</span><span class="p">(</span><span class="n">m1.plm</span><span class="p">)</span>
<span class="n">P</span> <span class="o"><-</span> <span class="n">X</span> <span class="o">%*%</span> <span class="nf">solve</span><span class="p">(</span><span class="nf">t</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="o">%*%</span> <span class="n">X</span><span class="p">)</span> <span class="o">%*%</span> <span class="nf">t</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="c1"># Internally studentized residuals</span>
<span class="n">sigma.sq</span> <span class="o"><-</span> <span class="p">(</span><span class="m">1</span> <span class="o">/</span> <span class="n">m1.plm</span><span class="o">$</span><span class="n">df.residual</span><span class="p">)</span> <span class="o">*</span> <span class="nf">sum</span><span class="p">(</span><span class="nf">residuals</span><span class="p">(</span><span class="n">m1.plm</span><span class="p">)</span><span class="o">^</span><span class="m">2</span><span class="p">)</span>
<span class="n">student.resids</span> <span class="o"><-</span> <span class="nf">residuals</span><span class="p">(</span><span class="n">m1.plm</span><span class="p">)</span> <span class="o">/</span> <span class="nf">sqrt</span><span class="p">(</span><span class="n">sigma.sq</span> <span class="o">*</span> <span class="p">(</span><span class="m">1</span> <span class="o">-</span> <span class="nf">diag</span><span class="p">(</span><span class="n">P</span><span class="p">)))</span>
<span class="nf">plot</span><span class="p">(</span><span class="n">fitted.values</span><span class="p">,</span> <span class="n">student.resids</span><span class="p">,</span> <span class="n">bty</span> <span class="o">=</span> <span class="s">'n'</span><span class="p">,</span>
<span class="n">xlab</span> <span class="o">=</span> <span class="s">'Life Expectancy (Fitted Values)'</span><span class="p">,</span> <span class="n">ylab</span> <span class="o">=</span> <span class="s">'Residuals'</span><span class="p">,</span>
<span class="n">main</span> <span class="o">=</span> <span class="s">'Studentized Model Residuals v. Fitted Values'</span><span class="p">)</span>
<span class="nf">abline</span><span class="p">(</span><span class="n">h</span> <span class="o">=</span> <span class="m">0</span><span class="p">,</span> <span class="n">lty</span> <span class="o">=</span> <span class="s">'dashed'</span><span class="p">,</span> <span class="n">col</span> <span class="o">=</span> <span class="s">'red'</span><span class="p">)</span>
</code></pre></div>
<p>The overall shape of the residual distribution is the same, but the y-axis is now in standardized units.</p>
<h3>Checking for Influential Observations</h3>
<p>Sometimes, a linear relationship can be dominated by a small number of highly influential observations.
One way this can happen is if the domain of a certain <span class="math">\(X_i\)</span>, say, per-capita GDP, is relatively small for most observations (e.g., most countries in a given sample have per-capita GDP in the range of $1,000-2,000) but there are a few countries which have very high per-capita GDP, say, around $5,000.
The relationship between per-capita GDP and some outcome like life expectancy, for the group of countries with per-capita GDP in the range $1,000-2,000 might be nothing: the slightly wealthier countries don't have significantly higher life expectancies.
However, if the <em>very</em> wealthy countries, with per-capita GDP around $5,000, have considerably higher life expectancy, then a positive relationship will be found between the two even though, if the very wealthy countries were removed, no such relationship would be found.</p>
<p>We can calculate the <em>leverage</em> that a particular observation (country) exerts on a linear relationship; it is like a measure of how sensitive that relationship is to a particular observation.
With the <code>faraway</code> package, we can draw a half-normal plot which sorts the observations by their leverage.
The <code>labs</code> argument will ensure that they are labeled by their row index, and <code>nlab</code> indicates how many points to label (to avoid visual clutter), starting with the most highly influential observation.</p>
<div class="highlight"><pre><span></span><code><span class="nf">require</span><span class="p">(</span><span class="n">faraway</span><span class="p">)</span>
<span class="n">X</span> <span class="o"><-</span> <span class="nf">model.matrix</span><span class="p">(</span><span class="n">m1.plm</span><span class="p">)</span>
<span class="n">P</span> <span class="o"><-</span> <span class="n">X</span> <span class="o">%*%</span> <span class="nf">solve</span><span class="p">(</span><span class="nf">t</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="o">%*%</span> <span class="n">X</span><span class="p">)</span> <span class="o">%*%</span> <span class="nf">t</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="c1"># Label observations by row index (1 through 1704); label only the single most influential</span>
<span class="nf">halfnorm</span><span class="p">(</span><span class="nf">diag</span><span class="p">(</span><span class="n">P</span><span class="p">),</span> <span class="n">labs</span> <span class="o">=</span> <span class="m">1</span><span class="o">:</span><span class="m">1704</span><span class="p">,</span> <span class="n">ylab</span> <span class="o">=</span> <span class="s">'Leverages'</span><span class="p">,</span> <span class="n">nlab</span> <span class="o">=</span> <span class="m">1</span><span class="p">)</span>
</code></pre></div>
<p><img alt="Checking for influential observations" src="http://karthur.org/images/20190330_leverage.png"></p>
<p>It does seem like there are a few observations that may be driving the relationship.
If we index the Gapminder data, we see that India's survey in 2007 is the most influential.</p>
<div class="highlight"><pre><span></span><code><span class="n">gapminder</span><span class="p">[</span><span class="m">708</span><span class="p">,]</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code># A tibble: 1 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 India Asia 2007 64.7 1110396331 2452.
</code></pre></div>
<p>It's helpful to look at the data in all years to understand why.
It seems that India's per-capita GDP rose quite fast from 2002 to 2007, along with its life expectancy.
If we think India's change in this period is an outlier, we may want to remove India from our panel dataset and run the model again.</p>
<div class="highlight"><pre><span></span><code><span class="n">gapminder</span><span class="p">[</span><span class="n">gapminder</span><span class="o">$</span><span class="n">country</span> <span class="o">==</span> <span class="s">'India'</span><span class="p">,]</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code># A tibble: 12 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 India Asia 1952 37.4 372000000 547.
2 India Asia 1957 40.2 409000000 590.
3 India Asia 1962 43.6 454000000 658.
4 India Asia 1967 47.2 506000000 701.
5 India Asia 1972 50.7 567000000 724.
6 India Asia 1977 54.2 634000000 813.
7 India Asia 1982 56.6 708000000 856.
8 India Asia 1987 58.6 788000000 977.
9 India Asia 1992 60.2 872000000 1164.
10 India Asia 1997 61.8 959000000 1459.
11 India Asia 2002 62.9 1034172547 1747.
12 India Asia 2007 64.7 1110396331 2452.
</code></pre></div>
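<p>A re-run without India can be sketched as follows; this is a robustness check of my own devising, not something done in the original analysis:</p>

```r
library(gapminder)
library(plm)

# Drop India's observations; droplevels() also removes 'India' from the
# factor levels, so no empty group remains in the panel index
gapminder.sub <- droplevels(subset(gapminder, country != 'India'))
m1.sub <- plm(lifeExp ~ gdpPercap + pop, data = gapminder.sub,
              model = 'within', index = c('country'))
coef(m1.sub)  # Compare to the full-panel coefficients to gauge India's influence
```
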
<h3>Partial Effects Plots</h3>
<p>What is the marginal effect of <span class="math">\(X\)</span> on <span class="math">\(Y\)</span>? This can be read directly from the <code>summary()</code> table, but sometimes it is nicer to visualize as a plot. Such plots are often referred to as <em>partial effects plots</em> [<a href="#refs">9</a>].</p>
<p><strong>First, let's switch to a different model of life expectancy.</strong> We recognized earlier that per-capita GDP really has a log-linear relationship with life expectancy.</p>
<div class="highlight"><pre><span></span><code><span class="n">m2.plm</span> <span class="o"><-</span> <span class="nf">plm</span><span class="p">(</span><span class="n">lifeExp</span> <span class="o">~</span> <span class="nf">I</span><span class="p">(</span><span class="nf">log10</span><span class="p">(</span><span class="n">gdpPercap</span><span class="p">))</span> <span class="o">+</span> <span class="n">pop</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">gapminder</span><span class="p">,</span>
<span class="n">model</span> <span class="o">=</span> <span class="s">'within'</span><span class="p">,</span> <span class="n">index</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="s">'country'</span><span class="p">))</span>
<span class="nf">summary</span><span class="p">(</span><span class="n">m2.plm</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>...
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
I(log10(gdpPercap)) 2.1008e+01 6.9613e-01 30.1779 < 2.2e-16 ***
pop 3.2964e-08 4.1982e-09 7.8521 7.553e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 73973
Residual Sum of Squares: 41976
R-Squared: 0.43255
Adj. R-Squared: 0.38053
F-statistic: 594.56 on 2 and 1560 DF, p-value: < 2.22e-16
</code></pre></div>
<p>This model fits much better, which is obvious both from plots of the raw data and from the improvement in the goodness-of-fit score (R-squared).</p>
<p><strong>First, let's decide what range of predictor values we want to test.</strong> We're interested in the effect of per-capita GDP on life expectancy. Let's look at rising per-capita GDP; this will be our parameter sweep. <strong>Recall that in our subject-centered model, these values are <em>deviations</em> from the mean,</strong> i.e., positive values represent per-capita GDP above the mean for the average subject.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Create a test matrix</span>
<span class="n">gdpPercap.sweep</span> <span class="o"><-</span> <span class="nf">log10</span><span class="p">(</span><span class="nf">seq</span><span class="p">(</span><span class="m">100</span><span class="p">,</span> <span class="m">10e3</span><span class="p">,</span> <span class="m">1e2</span><span class="p">))</span> <span class="c1"># A parameter sweep of per-capita GDP change</span>
</code></pre></div>
<p><strong>Next, we want to get an empty design or model matrix.</strong> By "empty" I mean that the values in all columns are zero. This is because, in our subject-centered model, the mean value of any variable is going to be zero (because we subtracted the mean within each subject).</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Get the columns in the same order</span>
<span class="n">new.X</span> <span class="o"><-</span> <span class="nf">model.frame</span><span class="p">(</span><span class="n">m2.plm</span><span class="p">)</span>
<span class="n">new.X</span> <span class="o"><-</span> <span class="n">new.X</span><span class="p">[,</span><span class="nf">names</span><span class="p">(</span><span class="nf">coef</span><span class="p">(</span><span class="n">m2.plm</span><span class="p">))]</span>
<span class="nf">stopifnot</span><span class="p">(</span><span class="nf">colnames</span><span class="p">(</span><span class="n">new.X</span><span class="p">)</span> <span class="o">==</span> <span class="nf">names</span><span class="p">(</span><span class="nf">coef</span><span class="p">(</span><span class="n">m2.plm</span><span class="p">)))</span>
<span class="c1"># Get a short representation of design matrix</span>
<span class="n">test.X</span> <span class="o"><-</span> <span class="nf">colMeans</span><span class="p">(</span><span class="n">new.X</span><span class="p">)</span>
<span class="n">test.X</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">test.X</span><span class="p">)]</span> <span class="o"><-</span> <span class="m">0</span> <span class="c1"># Mean value for subject-centered data is always zero</span>
<span class="c1"># Fill out an empty test matrix</span>
<span class="n">test.X</span> <span class="o"><-</span> <span class="nf">matrix</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="n">test.X</span><span class="p">,</span> <span class="n">each</span> <span class="o">=</span> <span class="nf">length</span><span class="p">(</span><span class="n">gdpPercap.sweep</span><span class="p">)),</span> <span class="n">nrow</span> <span class="o">=</span> <span class="nf">length</span><span class="p">(</span><span class="n">gdpPercap.sweep</span><span class="p">))</span>
</code></pre></div>
<p><strong>Then, we need to insert our parameter sweep into this empty design matrix.</strong> All other variables will have zero everywhere, but the variable we're interested in (per-capita GDP) needs to meaningfully change so we can visualize its effect on the outcome (life expectancy).</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Insert the parameter sweep</span>
<span class="n">test.X</span><span class="p">[,</span><span class="nf">which</span><span class="p">(</span><span class="nf">colnames</span><span class="p">(</span><span class="n">new.X</span><span class="p">)</span> <span class="o">==</span> <span class="s">'I(log10(gdpPercap))'</span><span class="p">)]</span> <span class="o"><-</span> <span class="n">gdpPercap.sweep</span>
</code></pre></div>
<p><strong>Finally, we're ready to calculate the partial effect and the confidence band.</strong> The partial effect is the predicted value of <span class="math">\(Y\)</span> (life expectancy) for a given value of <span class="math">\(X\)</span> (per-capita GDP). We want to visualize the uncertainty around this prediction, so we'll also calculate the standard error of the prediction and use this to derive a 95% confidence interval around the prediction.</p>
<div class="highlight"><pre><span></span><code><span class="n">level</span> <span class="o"><-</span> <span class="m">0.95</span> <span class="c1"># For a 95% confidence interval</span>
<span class="c1"># Get the covariance matrix of the predictions</span>
<span class="n">vcov.prediction</span> <span class="o"><-</span> <span class="n">test.X</span> <span class="o">%*%</span> <span class="nf">vcov</span><span class="p">(</span><span class="n">m2.plm</span><span class="p">)</span> <span class="o">%*%</span> <span class="nf">t</span><span class="p">(</span><span class="n">test.X</span><span class="p">)</span>
<span class="c1"># The standard error of the prediction is then on the diagonal</span>
<span class="n">se.prediction</span> <span class="o"><-</span> <span class="nf">sqrt</span><span class="p">(</span><span class="nf">diag</span><span class="p">(</span><span class="n">vcov.prediction</span><span class="p">))</span>
<span class="c1"># Get the predicted value by multiplying the design matrix</span>
<span class="c1"># against our coefficient estimates</span>
<span class="n">predicted</span> <span class="o"><-</span> <span class="p">(</span><span class="n">test.X</span> <span class="o">%*%</span> <span class="nf">coef</span><span class="p">(</span><span class="n">m2.plm</span><span class="p">))</span>
<span class="c1"># Calculate the t-statistic corresponding to a 95% confidence level and</span>
<span class="c1"># the appropriate num. of degrees of freedom</span>
<span class="n">t.stat</span> <span class="o"><-</span> <span class="nf">qt</span><span class="p">(</span><span class="m">1</span> <span class="o">-</span> <span class="p">(</span><span class="m">1</span> <span class="o">-</span> <span class="n">level</span><span class="p">)</span><span class="o">/</span><span class="m">2</span><span class="p">,</span> <span class="n">m2.plm</span><span class="o">$</span><span class="n">df.residual</span><span class="p">)</span>
<span class="c1"># Calculate the lower and upper bounds of the confidence interval</span>
<span class="n">lower.bound</span> <span class="o"><-</span> <span class="nf">as.numeric</span><span class="p">(</span><span class="n">predicted</span> <span class="o">-</span> <span class="n">t.stat</span> <span class="o">*</span> <span class="n">se.prediction</span><span class="p">)</span>
<span class="n">upper.bound</span> <span class="o"><-</span> <span class="nf">as.numeric</span><span class="p">(</span><span class="n">predicted</span> <span class="o">+</span> <span class="n">t.stat</span> <span class="o">*</span> <span class="n">se.prediction</span><span class="p">)</span>
</code></pre></div>
<p>We can quickly verify we're getting reasonable results.</p>
<div class="highlight"><pre><span></span><code><span class="nf">head</span><span class="p">(</span><span class="nf">cbind</span><span class="p">(</span><span class="n">lower.bound</span><span class="p">,</span> <span class="n">predicted</span><span class="p">,</span> <span class="n">upper.bound</span><span class="p">))</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>     lower.bound predicted upper.bound
[1,] 39.28448 42.01537 44.74627
[2,] 45.19739 48.33932 51.48125
[3,] 48.65621 52.03859 55.42096
[4,] 51.11029 54.66326 58.21623
[5,] 53.01382 56.69912 60.38441
[6,] 54.56912 58.36253 62.15595
</code></pre></div>
<p>Recall that the first row corresponds to a $100 increase in per-capita GDP <em>within a given country;</em> our model predicts that life expectancy would increase by about 42 years for such an increase in per-capita GDP! This can also be derived directly from the <code>gdpPercap</code> coefficient:</p>
<div class="highlight"><pre><span></span><code><span class="nf">coef</span><span class="p">(</span><span class="n">m2.plm</span><span class="p">)[</span><span class="m">1</span><span class="p">]</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>I(log10(gdpPercap))
21.00769
</code></pre></div>
<p>A 2-unit increase in the base-10 log of per-capita GDP corresponds to this $100 increase (because <span class="math">\(\log_{10}(100) = 2\)</span>), so an increase in life expectancy of <span class="math">\(2\times 21 = 42\)</span> years is expected. From the 95% confidence interval we just calculated, a better estimate of the effect could be given as a range between 39.3 and 44.7 years.</p>
<p>This effect seems implausibly large; our model is likely too simple. It controls for all time-invariant confounding factors, but there are likely other factors that change over time which we are missing. Thus, even though our current model is robust against time-invariant heterogeneity, the results are likely still biased. Our results could also be affected by mixing very poor and very wealthy countries together; small increases in per-capita GDP may indeed have outsized effects in very poor countries, but certainly not in wealthy nations.</p>
<p>Without recourse to greater model realism, we can at least make the partial effects plot using <code>ggplot2</code>. We note that larger and larger increases in per-capita GDP do have a diminishing effect on life expectancy, which is reasonable.</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span>
<span class="nf">library</span><span class="p">(</span><span class="n">scales</span><span class="p">)</span>
<span class="n">df</span> <span class="o"><-</span> <span class="nf">data.frame</span><span class="p">(</span><span class="n">gdpPercap</span> <span class="o">=</span> <span class="n">gdpPercap.sweep</span><span class="p">,</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">predicted</span><span class="p">,</span>
<span class="n">lower</span> <span class="o">=</span> <span class="n">lower.bound</span><span class="p">,</span>
<span class="n">upper</span> <span class="o">=</span> <span class="n">upper.bound</span><span class="p">)</span>
<span class="nf">ggplot</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">mapping</span> <span class="o">=</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="m">10</span><span class="o">^</span><span class="n">gdpPercap</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">prediction</span><span class="p">))</span> <span class="o">+</span>
<span class="nf">geom_line</span><span class="p">()</span> <span class="o">+</span>
<span class="nf">geom_ribbon</span><span class="p">(</span><span class="n">mapping</span> <span class="o">=</span> <span class="nf">aes</span><span class="p">(</span><span class="n">ymin</span> <span class="o">=</span> <span class="n">lower</span><span class="p">,</span> <span class="n">ymax</span> <span class="o">=</span> <span class="n">upper</span><span class="p">),</span> <span class="n">alpha</span> <span class="o">=</span> <span class="m">0.4</span><span class="p">)</span> <span class="o">+</span>
<span class="nf">scale_x_continuous</span><span class="p">(</span><span class="n">labels</span> <span class="o">=</span> <span class="n">dollar</span><span class="p">)</span> <span class="o">+</span>
<span class="nf">labs</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="s">'Per-Capita GDP'</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="s">'Predicted Increase in Life Expectancy'</span><span class="p">)</span> <span class="o">+</span>
<span class="nf">theme_minimal</span><span class="p">()</span>
</code></pre></div>
<p><img alt="Partial effects plot" src="http://karthur.org/images/20190330_partial_effects_plot.png"></p>
<!--TODO Goodness of fit-->
<!--Cross-sectional dependence-->
<h2 id="refs">References</h2>
<ol>
<li>Clark, T. S., & Linzer, D. A. (2014). Should I Use Fixed or Random Effects? <em>Political Science Research and Methods,</em> <strong>3</strong>(2), 399–408.</li>
<li>Gelman, A., & Hill, J. (2007). <span style="text-decoration:underline">Data Analysis Using Regression and Multilevel/Hierarchical Models.</span> New York, New York, USA: Cambridge University Press.</li>
<li>Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., & White, J. S. S. (2009). Generalized linear mixed models: a practical guide for ecology and evolution. <em>Trends in Ecology and Evolution,</em> <strong>24</strong>(3), 127–135.</li>
<li>Allison, P. D. (2009). <span style="text-decoration:underline">Fixed Effects Regression Models</span> (T. F. Liao, Ed.). Thousand Oaks, California, U.S.A.: SAGE.</li>
<li>Halaby, C. N. (2004). Panel Models in Sociological Research: Theory into Practice. <em>Annual Review of Sociology,</em> <strong>30</strong>(1), 507–544.</li>
<li>Kropko, J., & Kubinec, R. (2018). <a href="https://doi.org/10.2139/ssrn.3062619">Why the Two-Way Fixed Effects Model Is Difficult to Interpret, and What to Do About It.</a> <em>SSRN,</em> 1–27.</li>
<li>Gaure, S. (2013). lfe: Linear Group Fixed Effects. <em>The R Journal,</em> <strong>5</strong>(2), 104–117. Retrieved from <a href="http://journal.r-project.org/archive/2013-2/gaure.pdf">http://journal.r-project.org/archive/2013-2/gaure.pdf</a></li>
<li>Croissant, Y., & Millo, G. (2008). Panel data econometrics in R: The plm package. <em>Journal of Statistical Software,</em> <strong>27</strong>(2), 1–43.</li>
<li>Faraway, J. J. (n.d.). <span style="text-decoration:underline">Linear Models with R</span> (2nd ed.). Boca Raton, U.S.A.; London, England; New York, U.S.A.: Chapman & Hall/CRC Texts in Statistical Science.</li>
</ol>
Parallel processing of raster arrays in Python with NumPy
2018-07-30T15:00:00+02:00
K. Arthur Endsley
tag:karthur.org,2018-07-30:/2018/parallel-processing-rasters-python.html
<p>I've been using <a href="https://earthengine.google.com/">Google Earth Engine</a> recently to scale up my remote sensing analyses, particularly by leveraging the full Landsat archive available on Google's servers.
I've followed Earth Engine's development for years, <a href="http://onlinelibrary.wiley.com/doi/10.1890/15-1759.1/full">and published results from the platform</a>, but, before now, never had a compelling reason to use it.
Now, without hubris, I can say that some of the methods I'm using (radiometric rectification of thousands of images in multi-decadal time series) are already straining the limits of the freely available computing resources on Google's platform.
After an intensive pipeline that merely normalizes the time series data I want to work with, I don't seem to have the resources to perform, say, <a href="https://developers.google.com/earth-engine/reducers_regression">a pixel-level time series regression</a> on my image stack.
Whatever the underlying issue (it is never quite clear with Earth Engine), regressions at the scale of 30 meters (a Landsat pixel) for the study area I'm working on, following the necessary pre-processing, haven't been working.</p>
<p>I started wondering if I could calculate the regressions myself in (client-side) Python.
Image exports from Earth Engine used to be infeasible but they have vastly improved and, recently, I've been able <a href="https://developers.google.com/earth-engine/exporting">to schedule export "tasks,"</a> monitor them <a href="https://developers.google.com/earth-engine/command_line">using a command-line interface</a>, and download the results directly from Google Drive.
With the pre-processed rasters downloaded to my computer, I turned to <a href="http://www.numpy.org/">NumPy</a> to develop a vectorized regression over each pixel in a time series image stack.
<strong>Here, I describe the general procedure I used and how it can be scaled up using Python's concurrency support, pointing out some potential pitfalls associated with using multiple processes in Python.</strong></p>
<h2>A Note on Concurrency</h2>
<p>I recently attended part of a workshop run by <a href="https://www.xsede.org/about/what-we-do">XSEDE</a>, a collaborative organization funded by the National Science Foundation to further high performance computing (HPC) projects.
The introductory material was very interesting to me, not because I was unfamiliar with HPC, but because of the many compelling reasons for learning and using concurrency in scientific computing applications.</p>
<p>First among these reasons, and perhaps the best known, is that <a href="https://en.wikipedia.org/wiki/Moore%27s_law">Moore's Law</a>, which predicts a doubling in the number of transistors on a commercially available chip every two years, started to level off around 2004.
The rate is slowing, and <a href="https://en.wikipedia.org/wiki/Moore%27s_law">in the graph at the top of Wikipedia's page</a> on the Law, you can already see the right-hand turn this trajectory is making.
This is largely due to real physical limitations that chip developers are starting to encounter.
The gains in the number of transistors per chip have come from making transistors progressively smaller; the smaller the transistor, the more heat (density) that must be dissipated.</p>
<p><img alt="Slide from XSEDE's 2018 Summer Bootcamp presentation." src="/images/20180730_Moores_Law.png"></p>
<p>According to one of XSEDE's instructors during <a href="https://www.psc.edu/hpc-workshop-series/summer-bootcamp-2018">the recent "Summer Bootcamp"</a> that I attended, computer chips today are tasked with dissipating heat, in watts per square meter, on the order of a nuclear reactor!
<strong>Keeping things from melting has become the main concern of chip design.</strong>
But, as we reduce the clock rate and the voltage required to run the chip, we can reduce the amount of heat generated.
There is a reduction in performance, <strong>but if we add a second chip and run both at a lower voltage, we can get more performance for the same total power drawn by the single chip.</strong>
Lower voltage means less power consumed and less heat generated [1].</p>
<p>In short, because of physical limitations on chip design and the competing demands of performance and heat dissipation, commercially available computers today almost exclusively ship with two or more central processing units (CPUs or "cores") running at clock speeds (measured in GHz) no higher than many chips sold just a few years ago.
When I was building computers as a kid in the early 2000s, I could buy Intel chips with clock speeds in excess of 3 GHz.
A quick search on any vendor's website (say, Dell Corporation's) today, however, reveals that clock speeds haven't really budged: their new desktops ship with cores clocked at 3-4 GHz.
The new computers are still "faster" for many applications, however, for the reasons I just discussed: multiple cores running simultaneously.</p>
<h3>Multiple Threads, Multiple Processes</h3>
<p>However, if an application isn't designed to take advantage of multiple cores, you won't see that performance gain: you'll have a <em>single-threaded</em>, <em>single-core</em> (more on this below) or, more generally, <em>serial</em> set of instructions (computer code) running on a single, slower core.
<strong>So, how do we take advantage of multiple cores?</strong>
First, Clay Breshears [1] helpfully disambiguates some terminology for us: the author argues that "parallel" programming should be considered a special case of "concurrent" programming.
Whereas concurrent programming specifies any practice where multiple tasks can be "in progress" at the same time, parallel programming describes a practice where the tasks are <em>proceeding simultaneously.</em>
More concretely, a single-core system can be said to allow concurrency if the concurrent code can queue tasks (e.g., as threads) but it cannot be said to allow parallel computation.</p>
<p>Threads?
Suffice it to say: whereas a program on your computer typically starts a single <em>process,</em> that process can spawn multiple <em>threads,</em> each representing a potentially concurrent task.
To get at "parallel programming" and increase the performance of your application, you can leverage <em>either</em> threads or processes.
Threads share resources, particularly memory, and thus are able to communicate through that shared memory.
<strong>Processes do not share memory. This sounds like a disadvantage, but it's actually easier to get started using multiple processes than multiple threads; this article will focus on using multiple processes.</strong>
Another reason I favor multiple processes here is due to the nature of concurrency in Python...</p>
<h3>On Concurrency in Python</h3>
<p>Because threads are somewhat complicated and, perhaps, also because of historical developments I'm not aware of, CPython (the standard Python implementation) has in place a feature termed the <strong>Global Interpreter Lock (GIL).</strong>
If your multi-threaded Python program were like a discussion group with multiple participants (threads), the GIL is like a ball, stone, or other totem that participants <em>must</em> be holding before they can speak.
If you don't have the ball, you're not allowed to speak; the ball can be passed from one person to another to allow that person to speak.
Similarly, a thread without the GIL cannot be executed [2].</p>
<!--Sounds polite.
The problem occurs when your discussion (program) spills over to multiple rooms (cores).
When a participant (thread) tries to pass the ball (the GIL) to another, a participant in the adjacent room might want to pick it up.
However, when the ball's already in Room A, it takes longer for a participant in Room B to find out that the ball is available than for a different participant in Room A.
I think the analogy is breaking down here, but if you're interested in the details you can read David Beazley's excellent and thorough study.-->
<p><strong>In short, spawning multiple threads in Python does not improve performance for CPU-intensive tasks because only one thread can run in the interpreter at a time.</strong>
Multiple processes, however, can be spun up from a single Python program and, by running simultaneously, get the total amount of work done in a shorter span of time.
You can see examples both of using multiple threads and of using multiple cores in Python in <a href="https://medium.com/@bfortuner/python-multithreading-vs-multiprocessing-73072ce5600b">this article by Brendan Fortuner</a> [3].
In this example of linear regression at the pixel level of a raster image, the unit of parallel work is the regression on a single pixel: with more cores/processes, we can compute more regressions simultaneously, thereby working through the finite number of regressions (and pixels) more quickly.</p>
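<p>To make the GIL's effect concrete, here is a minimal sketch (not from the original post) using the standard library's <code>threading</code> module. The two threads below run correctly and concurrently, but because only one of them can hold the GIL at a time, this pure-Python, CPU-bound work finishes no faster than it would in a single thread:</p>

```python
import threading

def count_down(n, results, idx):
    # Pure-Python, CPU-bound loop; the GIL allows only one thread
    # to execute Python bytecode at any instant
    total = 0
    while n > 0:
        total += n
        n -= 1
    results[idx] = total

# Two threads, each doing a (hypothetical) unit of work; they take
# turns holding the GIL rather than running truly in parallel
results = [None, None]
threads = [
    threading.Thread(target = count_down, args = (500000, results, i))
    for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [125000250000, 125000250000]
```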
<h2>Example</h2>
<p>To get to a pixel-wise regression, in either a serial or multi-process pipeline, we need to:</p>
<ol>
<li>Read in the dependent variable data (here, maximum NDVI) from each file and combine them into a single array;</li>
<li>Get an ordered array of the dates (years) that are the independent variables in our regression;</li>
<li>Ravel both arrays into a shape that allows a vectorized function to be executed over each pixel's time series;</li>
<li>For each pixel's time series, calculate the slope of a regression line.</li>
</ol>
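<p>For orientation before walking through the steps in detail, the whole pipeline can be sketched in a few lines of vectorized NumPy. The synthetic <code>stack</code> below (a made-up stand-in, so there are no files to read) plays the role of the raster time series, and the per-pixel slope is computed for every pixel at once as cov(x, y) / var(x):</p>

```python
import numpy as np

# Synthetic stand-in for the raster stack: N years x Y rows x X columns
rng = np.random.default_rng(42)
years = np.arange(1995, 2016)  # N = 21
N, Y, X = years.size, 4, 5
# Each pixel trends upward at 0.01 NDVI/year, plus a little noise
stack = 0.01 * (years - 1995)[:, None, None] + rng.normal(0, 0.001, (N, Y, X))

# Vectorized per-pixel slope: cov(x, y) / var(x), broadcast over all pixels
x = years - years.mean()
slopes = (x[:, None, None] * (stack - stack.mean(axis = 0))).sum(axis = 0) / (x * x).sum()
print(slopes.shape)  # (4, 5): one slope estimate per pixel
```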
<h3>Getting Started</h3>
<p>I have a time series of maximum NDVI from 1995 through 2015 (21 images) covering the City of Detroit.
So, to start with, I need to read in each raster file (for each year) and concatenate them together into a single, large, <span class="math">\(N\)</span>-dimensional array, where <span class="math">\(N=21\)</span> for 21 years, in this case.
Strictly speaking, my array ends up being of a size <span class="math">\(N\times Y\times X\)</span> where <span class="math">\(Y,X\)</span> are the number of rows and columns in any given image, respectively (note: each image must have the same number of rows and columns).
Below, I demonstrate this directly; I use the <code>glob</code> library to get a list of files that match a certain pattern: <code>*.tiff</code>.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">glob</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">osgeo</span> <span class="kn">import</span> <span class="n">gdal</span>
<span class="c1"># Create a list (generator) of the years, 1995-2015</span>
<span class="n">years_list</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1995</span><span class="p">,</span> <span class="mi">2016</span><span class="p">)</span>
<span class="n">years_array</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">years_list</span><span class="p">)</span>
<span class="c1"># Make it a 2-dimensional array to start</span>
<span class="n">years_array</span> <span class="o">=</span> <span class="n">years_array</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="n">years_array</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
<span class="c1"># Get a list of the relevant files; sort in place</span>
<span class="n">ordered_files</span> <span class="o">=</span> <span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">'*.tiff'</span><span class="p">)</span>
<span class="n">ordered_files</span><span class="o">.</span><span class="n">sort</span><span class="p">()</span>
</code></pre></div>
<p><strong>Note that calling the file list's <code>sort()</code> method is essential: if the files are not in the right order, our regression line won't be fit properly</strong> (we will be assigning dependent variables to the wrong independent variable/ wrong year).</p>
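<p>A quick illustration with hypothetical filenames. Note that the lexicographic sort matches chronological order here only because the four-digit year gives every filename the same fixed width:</p>

```python
# Hypothetical filenames; glob.glob() returns them in arbitrary order
files = ['detroit_ndvi_2001.tiff', 'detroit_ndvi_1995.tiff', 'detroit_ndvi_1999.tiff']
# Lexicographic sort; chronological here because the years are fixed-width
files.sort()
print(files)  # ['detroit_ndvi_1995.tiff', 'detroit_ndvi_1999.tiff', 'detroit_ndvi_2001.tiff']
```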
<h3>Combining Array Data from Multiple Files</h3>
<p>Our target array from regression over <span class="math">\(N\)</span> years (files) is a <span class="math">\(P\times N\)</span> array, where <span class="math">\(P = Y\times X\)</span>.
That is, <span class="math">\(P\)</span> is the product of the number of rows and the number of columns.
This array can be thought of as a collection of 1-D subarrays with <span class="math">\(N\)</span> items: the measured outcome (here, maximum NDVI) in each year.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Iterate through each file, combining them in order as a single array</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">each_file</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">ordered_files</span><span class="p">):</span>
<span class="n">year</span> <span class="o">=</span> <span class="n">years_list</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="c1"># Open the file, read in as an array</span>
<span class="n">ds</span> <span class="o">=</span> <span class="n">gdal</span><span class="o">.</span><span class="n">Open</span><span class="p">(</span><span class="n">each_file</span><span class="p">)</span>
<span class="n">arr</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">ReadAsArray</span><span class="p">()</span>
<span class="n">ds</span> <span class="o">=</span> <span class="kc">None</span>
<span class="n">shp</span> <span class="o">=</span> <span class="n">arr</span><span class="o">.</span><span class="n">shape</span>
<span class="c1"># Ravel the array to a 1-D shape</span>
<span class="n">arr_flat</span> <span class="o">=</span> <span class="n">arr</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">shp</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">shp</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="mi">1</span><span class="p">))</span>
<span class="c1"># For the very first array, use it as the base</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">base_array</span> <span class="o">=</span> <span class="n">arr_flat</span>
<span class="k">continue</span> <span class="c1"># Skip to the next year</span>
<span class="c1"># Stack the arrays from each year</span>
<span class="n">base_array</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">((</span><span class="n">base_array</span><span class="p">,</span> <span class="n">arr_flat</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div>
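<p>As an aside, repeatedly calling <code>np.concatenate()</code> inside a loop reallocates the growing array on every iteration. A lighter-weight pattern, sketched here with dummy arrays standing in for the raveled rasters, is to collect the per-year arrays in a list and stack them once at the end:</p>

```python
import numpy as np

# Dummy stand-ins for the raveled (1-D) per-year rasters
flat_arrays = [np.full(6, float(year)) for year in range(1995, 2016)]
# One allocation instead of 20 concatenations; each column is a year
base_array = np.stack(flat_arrays, axis = 1)
print(base_array.shape)  # (6, 21)
```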
<p>Now that we have a suitably shaped array of our dependent variable data, maximum NDVI, we want to generate an array of identical shape of our independent variable, the year.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Create an array for the X data, or independent variable, i.e., the year</span>
<span class="n">shp</span> <span class="o">=</span> <span class="n">base_array</span><span class="o">.</span><span class="n">shape</span>
<span class="n">years_array</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">repeat</span><span class="p">(</span><span class="n">years_array</span><span class="p">,</span> <span class="n">shp</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>\
<span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">shp</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">shp</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="mi">1</span><span class="p">))</span>
<span class="n">base_array</span> <span class="o">=</span> <span class="n">base_array</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">shp</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">shp</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="mi">1</span><span class="p">))</span>
</code></pre></div>
<p>Finally, we can combine both our dependent and independent variables into a single <span class="math">\((P\times N\times 2)\)</span> array.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Now, combine X and Y data</span>
<span class="n">base_array</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">((</span><span class="n">years_array</span><span class="p">,</span> <span class="n">base_array</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">2</span><span class="p">)</span>
</code></pre></div>
<h3>Sidebar: Linear Regression in SciPy</h3>
<p>Let's take a moment to examine how linear regression works in SciPy (the collection of scientific computing tools that extend from NumPy).
The linear regression function is found as <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html"><code>scipy.stats.linregress()</code></a>; there are at least two ways to specify a linear regression and I opted for the approach that requires a single array argument to the function, e.g.:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">scipy</span> <span class="kn">import</span> <span class="n">stats</span>
<span class="n">stats</span><span class="o">.</span><span class="n">linregress</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="c1"># e.g., x is a Nx2 array for N regression cases</span>
</code></pre></div>
<p>In this approach, <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html">as the SciPy documentation states:</a></p>
<blockquote>
<p>If only x is given (and y=None), then it must be a two-dimensional array where one dimension has length 2. The two sets of measurements are then found by splitting the array along the length-2 dimension.</p>
</blockquote>
<p>This is why we created a combined <span class="math">\((P\times N\times 2)\)</span> array; each pixel is then a <span class="math">\((N\times 2)\)</span> subarray that is already set up for the <code>stats.linregress()</code> function.
The last step, since we want to calculate regressions on every one of <span class="math">\(P\)</span> total pixels, is to map the <code>stats.linregress()</code> function over all pixels.
We'll define a function to do just this, as below.
The first (zeroth) element in the sequence that is returned is the slope, which is what we want.</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">linear_trend</span><span class="p">(</span><span class="n">array</span><span class="p">):</span>
    <span class="n">N</span> <span class="o">=</span> <span class="n">array</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">result</span> <span class="o">=</span> <span class="p">[</span><span class="n">stats</span><span class="o">.</span><span class="n">linregress</span><span class="p">(</span><span class="n">array</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="o">...</span><span class="p">])[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">N</span><span class="p">)]</span>
    <span class="k">return</span> <span class="n">result</span>
</code></pre></div>
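<p>As a quick sanity check, here is a minimal, self-contained sketch of the function above applied to synthetic data (the three-pixel array with a slope of exactly 2.0 per year is hypothetical, for illustration only):</p>

```python
import numpy as np
from scipy import stats

def linear_trend(array):
    N = array.shape[0]
    return [stats.linregress(array[i, ...])[0] for i in range(0, N)]

# Hypothetical data: 3 pixels observed over 5 years
years = np.arange(2000, 2005)
# Each pixel's value increases by exactly 2.0 per year
values = np.stack([2.0 * (years - 2000) + k for k in range(3)])
# Combine into a (3 x 5 x 2) array: (pixels, years, [x, y])
combined = np.stack([np.stack([years, v], axis=-1) for v in values])
slopes = linear_trend(combined)  # every fitted slope is (close to) 2.0
```

Each <code>combined[i]</code> is a <span class="math">\((5\times 2)\)</span> subarray whose length-2 dimension supplies the <code>x</code> and <code>y</code> measurements, exactly as the SciPy documentation describes.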
<h3>Calculating Regressions on Subarrays across Multiple Processes</h3>
<p><strong>Potential pitfall:</strong> It may seem straightforward now to farm out a range of pixels, e.g., <span class="math">\([i,j] \in [0\cdots P]\)</span> for <span class="math">\(P\)</span> total pixels. However, with multiple processes, each process gets a complete copy of the resources required to get the job done. For instance, if you spin up 2 processes asking one to take pixel indices <span class="math">\(0\cdots P/2\)</span> and the other to take pixel indices <span class="math">\(P/2\cdots P\)</span>, then each process needs a complete copy of the master (<span class="math">\(P\times N\times 2\)</span>) array. See the issue here? For large rasters, that's a huge duplication of working memory. <strong>A better practice is to literally divide the master array into chunks <em>before</em> farming out those pixel (ranges) to each process.</strong></p>
<p>Dividing rectangular arrays in Python based on the number of processes you want to spin up may seem tricky at first.
Below, I use some idiomatic Python to calculate the range of pixel indices each process would get based on <code>P</code> processes (note: here, <code>P</code> is the number of processes, whereas earlier I referred to <span class="math">\(P\)</span> as the number of pixels).</p>
<div class="highlight"><pre><span></span><code><span class="n">N</span> <span class="o">=</span> <span class="n">base_array</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">P</span> <span class="o">=</span> <span class="p">(</span><span class="n">NUM_PROCESSES</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="c1"># Number of breaks (number of partitions + 1)</span>
<span class="c1"># Break up the indices into (roughly) equal parts</span>
<span class="n">partitions</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">int</span><span class="p">)[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span>
                      <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">int</span><span class="p">)[</span><span class="mi">1</span><span class="p">:]))</span>
<span class="c1"># Final range of indices should end +1 past last index for completeness</span>
<span class="n">work</span> <span class="o">=</span> <span class="n">partitions</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">work</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">partitions</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">],</span> <span class="n">partitions</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>
</code></pre></div>
<p>It might be useful to see what <code>work</code> contains.
In my example, I had <code>730331</code> total pixels and I wanted to farm them out, evenly, to 4 processes.
Note that the last range ends on <code>730332</code>: Python slice (and <code>range()</code>) end points are exclusive, so the final bound is extended by one as a safeguard to guarantee the last pixel is included (slicing past the end of a NumPy array is harmless).</p>
<div class="highlight"><pre><span></span><code>>>> work
[(0, 182582), (182582, 365165), (365165, 547748), (547748, 730332)]
</code></pre></div>
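<p>The same partitioning logic can be wrapped in a small, reusable helper; the function name <code>partition_indices</code> below is my own (hypothetical), and it omits the final <code>+1</code> safeguard since exclusive slice end points already cover the last index:</p>

```python
import numpy as np

def partition_indices(n_items, n_chunks):
    # n_chunks + 1 evenly spaced break points yield n_chunks (start, end) pairs
    breaks = np.linspace(0, n_items, n_chunks + 1, dtype=int)
    return list(zip(breaks[:-1], breaks[1:]))

work = partition_indices(730331, 4)
# [(0, 182582), (182582, 365165), (365165, 547748), (547748, 730331)]
```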
<h3>Concurrency in Python 3</h3>
<p>Finally, to farm out these subarrays to multiple processes, we need to use the <code>ProcessPoolExecutor</code> that ships with Python 3, available in the <code>concurrent.futures</code> module.</p>
<p><strong>Potential pitfall:</strong> You might be tempted to use a <code>lambda</code> function in place of the <code>linear_trend()</code> function we defined above, for any similar pixel-wise calculation you want to perform.
<a href="https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor">Because Python multi-process concurrency requires that every object farmed out to multiple processes be "picklable,"</a> you can't use <code>lambda</code> functions. Instead, you must define a global function, as we did above with <code>linear_trend()</code>.
<strong>What does "picklable" mean?</strong>
It means that the object can be pickled using Python's <code>pickle</code> library; pickled objects are binary representations of Python state, i.e., of Python data, functions, classes, etc.
<strong>Why does each process' state need to be pickled?</strong>
I'll let the Python <code>concurrent.futures</code> library answer that directly:</p>
<blockquote>
<p>The <code>ProcessPoolExecutor</code> class is an <code>Executor</code> subclass that uses a pool of processes to execute calls asynchronously. <code>ProcessPoolExecutor</code> uses the <code>multiprocessing</code> module, which allows it to side-step the Global Interpreter Lock but also means that only picklable objects can be executed and returned.</p>
</blockquote>
<p><strong>If you do need an <em>anonymous</em> or <em>dynamically created</em> function,</strong> like a <code>lambda</code> function, you can still use such a pattern with concurrency in Python; <a href="https://stackoverflow.com/questions/49899351/restrictions-on-dynamically-created-functions-with-concurrent-futures-processpoo">you just need to use the <code>partial()</code> function as a wrapper</a>.</p>
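<p>For instance, here is a minimal sketch of that pattern (the function <code>scaled_sums</code> and its <code>factor</code> parameter are hypothetical): instead of a <code>lambda</code>, define a regular, module-level function and bind its extra argument with <code>partial()</code>, which yields a picklable, single-argument callable that <code>executor.map()</code> can ship to worker processes.</p>

```python
from functools import partial
import numpy as np

def scaled_sums(array, factor):
    # Sum each row, then scale; a stand-in for any parameterized pixel-wise task
    return [factor * float(row.sum()) for row in array]

# partial() binds factor=2, producing a picklable single-argument callable
doubled_sums = partial(scaled_sums, factor=2)
doubled_sums(np.ones((3, 4)))  # [8.0, 8.0, 8.0]
```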
<p><strong>The <code>ProcessPoolExecutor</code> creates a <em>context</em> in which we can map a (globally defined, picklable) function over a subset of data.</strong>
Because it creates a context, we invoke it using the <code>with</code> statement.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">concurrent.futures</span> <span class="kn">import</span> <span class="n">ProcessPoolExecutor</span>
<span class="c1"># Split the master array, base_array, into subarrays defined by the</span>
<span class="c1"># starting and ending, i and j, indices</span>
<span class="k">with</span> <span class="n">ProcessPoolExecutor</span><span class="p">(</span><span class="n">max_workers</span> <span class="o">=</span> <span class="n">NUM_PROCESSES</span><span class="p">)</span> <span class="k">as</span> <span class="n">executor</span><span class="p">:</span>
    <span class="n">result</span> <span class="o">=</span> <span class="n">executor</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">linear_trend</span><span class="p">,</span> <span class="p">[</span>
        <span class="n">base_array</span><span class="p">[</span><span class="n">i</span><span class="p">:</span><span class="n">j</span><span class="p">,</span><span class="o">...</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">work</span>
    <span class="p">])</span>
</code></pre></div>
<p>After the processes terminate, their results are stored as a sequence, which we can coerce to a <code>list</code> using the <code>list()</code> function.
In our case, because we split the <span class="math">\(P\)</span> pixels into 4 sets (one for each process), we want to <code>concatenate()</code> the results back together as a single array.</p>
<div class="highlight"><pre><span></span><code><span class="n">regression</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">(</span><span class="n">regression</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
</code></pre></div>
<p>And, ultimately, if we want to write the pixel-wise regression out as a raster file, we need to reshape it to a 2-dimensional, <span class="math">\(Y\times X\)</span> raster, for <span class="math">\(Y\)</span> rows and <span class="math">\(X\)</span> columns.</p>
<div class="highlight"><pre><span></span><code><span class="n">output_array</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">result</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">num_rows</span><span class="p">,</span> <span class="n">num_cols</span><span class="p">))</span>
</code></pre></div>
<h2>Performance Metrics</h2>
<p>No discussion of concurrency would be complete without an analysis of the performance gain.
If you're not already aware, <a href="https://docs.python.org/3/library/timeit.html">Python's built-in <code>timeit</code> module</a> is <em>de rigueur</em> for timing Python code; below, I use it to time our pixel-wise regression in both its serial and parallel (multiple-process) forms.</p>
<p><strong>With 4 processes:</strong></p>
<div class="highlight"><pre><span></span><code>$ python -m timeit -s <span class="s2">"from my_regression_example import main"</span> -n <span class="m">3</span> -r <span class="m">3</span> <span class="s2">"main('~/Desktop/*.tiff')"</span>
<span class="m">3</span> loops, best of <span class="m">3</span>: <span class="m">46</span>.1 sec per loop
</code></pre></div>
<p><strong>With 1 process (serial):</strong></p>
<div class="highlight"><pre><span></span><code>$ python -m timeit -s <span class="s2">"from my_regression_example import main"</span> -n <span class="m">3</span> -r <span class="m">3</span> <span class="s2">"main('~/Desktop/*.tiff')"</span>
<span class="m">3</span> loops, best of <span class="m">3</span>: <span class="m">152</span> sec per loop
</code></pre></div>
<p>As you can see, with 4 processes we finish the work in about one-third of the time it takes with only one process.
You might have expected us to finish in a quarter of the time, but because of the overhead associated with spinning up 4 processes and collecting their results, we never quite get a <span class="math">\(1/P\)</span> reduction in time for <span class="math">\(P\)</span> processes.
This speed-up is still quite an achievement, however.
No matter how many processes we use, the regression results are, of course, the same; below is the image we created, with colors mapped to regression slope quintiles.</p>
<p><a href="/images/20180730_Detroit_NDVI_regression_map.png"><img style="float:left;" src="/images/thumbs/20180730_Detroit_NDVI_regression_map_thumbnail_square.png" /></a></p>
<div style="clear:both;"></div>
<h2>In Summary</h2>
<p>By raveling a 2-D raster array into a collection of pixel-level subarrays, we can easily map any pixel-wise function over them, allowing us to do far more than regression.
With such a function, we can farm out the work to multiple processes and finish the total job faster.
In addition, because we split our raster time series into chunks, passing each process only its own subarray, the total working memory needed is about that of a single copy of the raster time series, not one copy per process.
<strong>Keep in mind these two pitfalls that are commonly encountered with multi-processing in Python:</strong></p>
<ul>
<li><strong>DON'T</strong> use a <code>lambda</code> function as the function to map over the array; instead, use regular, globally defined Python functions, with or without <code>functools.partial</code>, as needed.</li>
<li><strong>DO</strong> chunk up the array into subarrays, passing each process only its respective subarray.</li>
</ul>
<p>In case my walkthrough above was overwhelming, <strong>a general pattern for parallel processing of raster array chunks is presented below.</strong></p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">glob</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">osgeo</span> <span class="kn">import</span> <span class="n">gdal</span>
<span class="kn">from</span> <span class="nn">concurrent.futures</span> <span class="kn">import</span> <span class="n">ProcessPoolExecutor</span>
<span class="c1"># Example file list; filenames should have some numeric date/year</span>
<span class="n">ordered_files</span> <span class="o">=</span> <span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">'*.tiff'</span><span class="p">)</span>
<span class="n">ordered_files</span><span class="o">.</span><span class="n">sort</span><span class="p">()</span>
<span class="c1"># A function that maps whatever you want to do over each pixel;</span>
<span class="c1"># needs to be a global function so it can be pickled</span>
<span class="k">def</span> <span class="nf">do_something</span><span class="p">(</span><span class="n">array</span><span class="p">):</span>
    <span class="n">N</span> <span class="o">=</span> <span class="n">array</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">result</span> <span class="o">=</span> <span class="p">[</span><span class="n">my_function</span><span class="p">(</span><span class="n">array</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="o">...</span><span class="p">])</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">N</span><span class="p">)]</span>
    <span class="k">return</span> <span class="n">result</span>
<span class="c1"># Iterate through each file, combining them in order as a single array</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">each_file</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">ordered_files</span><span class="p">):</span>
    <span class="c1"># Open the file, read in as an array</span>
    <span class="n">ds</span> <span class="o">=</span> <span class="n">gdal</span><span class="o">.</span><span class="n">Open</span><span class="p">(</span><span class="n">each_file</span><span class="p">)</span>
    <span class="n">arr</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">ReadAsArray</span><span class="p">()</span>
    <span class="n">ds</span> <span class="o">=</span> <span class="kc">None</span>
    <span class="n">shp</span> <span class="o">=</span> <span class="n">arr</span><span class="o">.</span><span class="n">shape</span>
    <span class="n">arr_flat</span> <span class="o">=</span> <span class="n">arr</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">shp</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">shp</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="mi">1</span><span class="p">))</span> <span class="c1"># Ravel array to 1-D shape</span>
    <span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="n">base_array</span> <span class="o">=</span> <span class="n">arr_flat</span> <span class="c1"># The very first array is the base</span>
        <span class="k">continue</span> <span class="c1"># Skip to the next year</span>
    <span class="c1"># Stack the arrays from each year</span>
    <span class="n">base_array</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">((</span><span class="n">base_array</span><span class="p">,</span> <span class="n">arr_flat</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="c1"># Break up the indices into (roughly) equal parts, e.g.,</span>
<span class="c1"># partitions = [(0, 1000), (1000, 2000), ..., (9000, 10001)]</span>
<span class="n">partitions</span> <span class="o">=</span> <span class="p">[</span><span class="o">...</span><span class="p">]</span>
<span class="c1"># NUM_PROCESSES is however many cores you want to use</span>
<span class="k">with</span> <span class="n">ProcessPoolExecutor</span><span class="p">(</span><span class="n">max_workers</span> <span class="o">=</span> <span class="n">NUM_PROCESSES</span><span class="p">)</span> <span class="k">as</span> <span class="n">executor</span><span class="p">:</span>
    <span class="n">result</span> <span class="o">=</span> <span class="n">executor</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">do_something</span><span class="p">,</span> <span class="p">[</span>
        <span class="n">base_array</span><span class="p">[</span><span class="n">i</span><span class="p">:</span><span class="n">j</span><span class="p">,</span><span class="o">...</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">partitions</span>
    <span class="p">])</span>
<span class="n">combined_results</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">result</span><span class="p">)</span> <span class="c1"># List of array chunks...</span>
<span class="n">final</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">(</span><span class="n">combined_results</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="c1"># ...Now a single array</span>
<span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">final</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">num_rows</span><span class="p">,</span> <span class="n">num_cols</span><span class="p">))</span> <span class="c1"># ...In the original shape</span>
</code></pre></div>
<h2>References</h2>
<ol>
<li>Breshears, C. 2009. <u>The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Applications.</u> O'Reilly Media Inc. Sebastopol, CA, U.S.A.</li>
<li>Beazley, D. 2010. <a href="https://www.dabeaz.com/python/UnderstandingGIL.pdf">"Understanding the Python GIL."</a> PyCon 2010. Atlanta, Georgia.</li>
<li>Fortuner, B. 2017. <a href="https://medium.com/@bfortuner/python-multithreading-vs-multiprocessing-73072ce5600b">"Intro to Threads and Processes in Python."</a></li>
</ol>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Unsupervised learning for time series data: Singular spectrum versus principal components analysis2017-09-19T10:00:00+02:002017-09-19T10:00:00+02:00K. Arthur Endsleytag:karthur.org,2017-09-19:/2017/learning-for-time-series-ssa-vs-pca.html<p>Recently, I was working with a colleague on a project involving time series observations of neighborhoods in Los Angeles.
We wanted to see if there were patterns in the time series data that described how similar neighborhoods evolved in time.
For multivariate data, this is a great application for unsupervised …</p><p>Recently, I was working with a colleague on a project involving time series observations of neighborhoods in Los Angeles.
We wanted to see if there were patterns in the time series data that described how similar neighborhoods evolved in time.
For multivariate data, this is a great application for unsupervised learning: we wish to discover subgroups either among the variables (obtaining a more parsimonious description of the data) or among the observations (grouping similar samples together) [1].
My colleague described what we needed as a "PCA for time series."</p>
<p>Though I was familiar with principal components analysis (PCA), I didn't know what to expect from applying PCA to a time series.</p>
<ul>
<li>How should the data be structured?</li>
<li>How does the approach compare to digital signal processing techniques like singular spectrum analysis (SSA)?</li>
<li>How can PCA and SSA be implemented in the R environment?</li>
</ul>
<p><strong>Here, I attempt to provide a brief introduction to PCA, SSA, and their implementations in R, along with the relevant considerations of their similarities and differences.</strong>
It may seem that SSA is coming out of left field; what the hell is it?
This investigation was prompted by a question about PCA, after all.
However, SSA and PCA have interesting similarities and differences that I think merit their joint discussion, at least at the level I'm capable of here.</p>
<h2>Principal Components Analysis</h2>
<p>The objective of PCA is to "provide the best <span class="math">\(m\)</span>-dimensional approximation (in terms of Euclidean distance)" [1] to each observation in a <span class="math">\(p\)</span>-dimensional dataset, where <span class="math">\(p > m\)</span>.
This characterization places PCA among other "dimensionality reduction" techniques that seek to describe a set of data using fewer variables (or dimensions/degrees of freedom) than were measured.
A lower-dimensional description of a dataset has obvious benefits for data compression---fewer variables used to describe the data means fewer columns in a table or fewer tables in a data cube---but it can also reveal hidden structure in the data.</p>
<p>What we obtain from PCA is a <em>coordinate transformation;</em> our data are projected from their original coordinate system (spanned by the variables we measured) onto a new, orthogonal basis.
In this way, correlation that may have existed between the columns in our original data is eliminated.
The new variables in this new coordinate system, referred to as principal components, can be hard to interpret and may not have a physical meaning.
They are intrinsically ordered by the amount of variance in the data they explain.</p>
<p>PCA can be implemented in one of two ways:</p>
<ul>
<li>Through a <strong>spectral decomposition</strong> of the covariance ("unstandardized" PCA) or correlation matrix ("standardized" PCA) [1];</li>
<li>Through a <strong>singular value decomposition (SVD)</strong> of the data matrix, <span class="math">\(X\)</span>.</li>
</ul>
<p>PCA is sometimes referred to as being "standardized" or "unstandardized" [2].
In standardized PCA, the correlation matrix is used in place of the covariance matrix of unstandardized PCA.
If the variables in <span class="math">\(X\)</span> are measured on different scales, their corresponding variances will also be scaled differently, which can put unequal weight on one or more of our original variables.
This weight is often unjustified.
James et al. (2013), for example, describe a dataset where one of the variables in an urban crime dataset, the rate of criminal assaults, has a much larger variance than the other variables, including the rates of rape and murder, simply because they occur far more often than other crimes.
Figure 10.3 in their book is an excellent visual aid here.
In general, we want to both mean-center and standardize the variables in our data matrix, <span class="math">\(X\)</span>, prior to PCA.
<strong>If we're using the SVD approach, mean-centering and standardizing our data matrix, <span class="math">\(X\)</span>, is equivalent to using the correlation matrix, rather than the covariance matrix, in spectral decomposition.</strong></p>
<p>It's important for both the description of PCA and, later, of SSA, to provide some background on matrix decomposition.</p>
<h3>Spectral or Eigenvalue Decomposition</h3>
<p>Spectral decomposition, also referred to as eigenvalue decomposition, factors any diagonalizable (square) matrix into a canonical form.
The oft-described "canonical form" makes intuitive sense, here, in a larger discussion of PCA because a canonical form is precisely what we wish to find in our original, messy dataset.</p>
<p>Given a multivariate dataset <span class="math">\(X\)</span> as an <span class="math">\(n\times p\)</span> matrix, where the columns <span class="math">\(X_i,\, i\in \{1,\cdots ,p\}\)</span> represent distinct variables and the rows represent different observations or samples of each of the variables, the spectral decomposition of the covariance matrix can be written as:
</p>
<div class="math">$$
Q^{-1}AQ = \Lambda
$$</div>
<p>Where <span class="math">\(A\)</span> is a real, symmetric matrix and the columns of <span class="math">\(Q\)</span> are the orthonormal eigenvectors of <span class="math">\(A\)</span> [3]. <span class="math">\(\Lambda\)</span> is a diagonal matrix and the non-zero values correspond to the eigenvalues. Elsner and Tsonis (1996) provide a concise introduction to the intuition behind eigenvectors and eigenvalues.</p>
<h3>Singular Value Decomposition</h3>
<p>Singular value decomposition is a generalization of spectral decomposition.
Any <span class="math">\(m\times n\)</span> matrix <span class="math">\(X\)</span> can be factored into a composition of orthogonal matrices, <span class="math">\(U\)</span> and <span class="math">\(V^T\)</span>, and a diagonal matrix <span class="math">\(\Sigma\)</span>:
</p>
<div class="math">$$
X = U\Sigma V^T
$$</div>
<p>The columns of the <span class="math">\(m\times m\)</span> matrix <span class="math">\(U\)</span> are the eigenvectors of <span class="math">\(XX^T\)</span> while the columns of the <span class="math">\(n\times n\)</span> matrix <span class="math">\(V\)</span> are the eigenvectors of <span class="math">\(X^TX\)</span>.
The columns of <span class="math">\(U\)</span> and <span class="math">\(V\)</span> are also called the "left" and "right" singular vectors.
The singular values on the diagonal of <span class="math">\(\Sigma\)</span> are the square-roots of the non-zero eigenvalues of both the <span class="math">\(XX^T\)</span> and <span class="math">\(X^TX\)</span> matrices [5].</p>
<!-- Justify!
For positive, semi-definite normal matrices, the eigenvectors of a spectral decomposition and the left-singular vectors of SVD are identical.
The covariance matrix of a multivariate dataset is is always positive (all elements are greater than zero), is always semi-definite, and is always normal (because it always symmetric).
-->
<p>In PCA, the "right singular vectors," the columns of the <span class="math">\(V\)</span> matrix, of an SVD are equivalent to the eigenvectors of the covariance matrix [4].
Also, the eigenvalues of the covariance matrix correspond to the variance explained by each respective principal component.</p>
<p>PCA can be implemented in R in a few different ways for a data matrix <code>X</code>.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># SVD of the (scaled) data matrix; the name `v` is the matrix of PCs as column vectors</span>
<span class="nf">svd</span><span class="p">(</span><span class="nf">scale</span><span class="p">(</span><span class="n">X</span><span class="p">))</span><span class="o">$</span><span class="n">v</span>
<span class="c1"># Spectral decomp. of the covariance/ correlation matrix;</span>
<span class="c1"># `vectors` has matrix of PCs as column vectors</span>
<span class="nf">eigen</span><span class="p">(</span><span class="nf">cor</span><span class="p">(</span><span class="n">X</span><span class="p">))</span><span class="o">$</span><span class="n">vectors</span>
<span class="c1"># Built-in tool for PCA; `rotation` has matrix of PCs as column vectors</span>
<span class="nf">prcomp</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">scale.</span> <span class="o">=</span> <span class="bp">T</span><span class="p">)</span><span class="o">$</span><span class="n">rotation</span>
</code></pre></div>
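<p>On synthetic data, we can confirm that all three routes return the same principal components. A quick sketch, assuming nothing beyond base R:</p>

```r
set.seed(2)
X <- matrix(rnorm(60), nrow = 20, ncol = 3)
v1 <- svd(scale(X))$v                    # SVD of the scaled data
v2 <- eigen(cor(X))$vectors              # spectral decomposition
v3 <- prcomp(X, scale. = TRUE)$rotation  # built-in PCA
# Equal up to a sign flip of individual columns
all.equal(abs(v1), abs(v2), check.attributes = FALSE)
all.equal(abs(v1), abs(v3), check.attributes = FALSE)
```

<p>The comparison is on absolute values because each principal component is only determined up to a sign flip.</p>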
<h2>Singular Spectrum Analysis</h2>
<p>Singular spectrum analysis (SSA) is a technique used to discover oscillation series of any length within a longer (univariate) time series.
Oscillations are generally of interest because they are associated with meaningful signals: in ecology, seasonal or phenological change; in physics or engineering, a mechanical or electrical wave.</p>
<blockquote>
<p>"An oscillatory series is a periodic or quasi-periodic series which can be either pure or amplitude-modulated. Noise is any aperiodic series. The trend of the series is, roughly speaking, a slowly varying additive component of the series with all the oscillations removed." - Golyandina et al. (2001)</p>
</blockquote>
<p>Unlike PCA, SSA is generally performed on a univariate dataset: a single variable observed at multiple points in time.
There is a multivariate form of SSA, sometimes called M-SSA, but it is beyond my current understanding.
Univariate SSA is a lot like PCA for univariate time series data.</p>
<p>There are a couple of different approaches to setting up SSA.
One way, described in Elsner and Tsonis' (1996) <em>excellent</em> and very accessible book, begins by constructing the <em>trajectory matrix.</em>
The trajectory matrix is the <span class="math">\(n\times m\)</span> matrix whose row vectors are every consecutive <span class="math">\(m\)</span>-tuple, or every window of length <span class="math">\(m\)</span>, in the time series.</p>
<blockquote>
<p>"By using lagged copies of a single time series, we can define the coordinates of the phase space that will approximate the dynamics of the system from which the time record was sampled. The number of lags is called the embedding dimension." - Elsner and Tsonis (1996)</p>
</blockquote>
<p>A general formula for the number of rows in the trajectory matrix is <span class="math">\(n = n_t-m+1\)</span>, where <span class="math">\(n_t\)</span> is the length of the original time series vector. For instance, a time series of 6 observations with an <em>embedding dimension</em> of <span class="math">\(m=3\)</span> will have <span class="math">\(n=4\)</span> possible windows of 3 consecutive values from among 6 ordered values. This means there are 4 rows in the trajectory matrix. <strong>The rows of the trajectory matrix are certainly not linearly independent, as consecutive windows overlap in all but one element.</strong></p>
<blockquote>
<p>"[The trajectory matrix] contains the complete record of patterns that have occurred within a window of size [m]." - Elsner and Tsonis (1996)</p>
</blockquote>
<p>The trajectory matrix is the matrix whose <span class="math">\((i,j)\)</span> element is defined:
</p>
<div class="math">$$
[X]_{ij} = x_{i+j-1}
$$</div>
<p>Where <span class="math">\(x\)</span> is some ordered, time series vector.</p>
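<p>As a concrete check of the indexing, here is a tiny R sketch building the trajectory matrix for a length-6 series and <code>m = 3</code> (the variable names are mine):</p>

```r
x <- 1:6
m <- 3
n <- length(x) - m + 1  # n = 4 windows
# Row i holds the window x[i], x[i+1], ..., x[i+m-1]
traj <- t(sapply(1:n, function(i) x[i:(i + m - 1)]))
traj  # 4 rows, 3 columns; element [i,j] is x[i+j-1]
```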
<p>It is customary to normalize the elements of the trajectory matrix by <span class="math">\(\sqrt{n}\)</span>, the square root of the number of windows. That is, for a single time series record <span class="math">\(v = \{v_1, v_2, \cdots, v_{n_t}\}\)</span>, the trajectory matrix for an embedding dimension <span class="math">\(m\)</span> could be written as:
</p>
<div class="math">$$
X = \frac{1}{\sqrt{n}} \left[\begin{array}{ccc}
v_1 & \cdots & v_{m}\\
v_2 & \cdots & v_{m+1}\\
& \ddots &\\
v_{n_t-m+1} & \cdots & v_{n_t}\\
\end{array}\right]
$$</div>
<p>The <strong>lagged-covariance matrix</strong> is then defined:
</p>
<div class="math">$$
S = X^TX
$$</div>
<p><strong>There are two ways to perform SSA as a matrix decomposition:</strong></p>
<ul>
<li>Through the spectral decomposition of the (normalized) lagged-covariance matrix, <span class="math">\(S = X^TX\)</span>. Here, <span class="math">\(X\)</span> is not the data matrix, but the trajectory matrix; the matrix formed by all possible time windows of length <span class="math">\(m\)</span>. The normalization factor used is <span class="math">\(1/\sqrt{n}\)</span>, where <span class="math">\(n = n_t-m+1\)</span> is the number of time windows of length <span class="math">\(m\)</span> taken from a time series vector of length <span class="math">\(n_t\)</span>.</li>
<li>Through the singular value decomposition (SVD) of the trajectory matrix [4], <span class="math">\(X\)</span>, taking the right singular vectors, <span class="math">\(V\)</span>, in the SVD given by <span class="math">\(X = U\Sigma V^T\)</span>.</li>
</ul>
<p>It's easy to see why these two methods are equivalent, if we recall that the right singular vectors of an SVD on <span class="math">\(X\)</span> correspond to the eigenvectors of the matrix <span class="math">\(X^TX\)</span>. The spectral decomposition approach is described in detail by Elsner and Tsonis (1996) while the SVD approach is described by Golyandina et al. (2001).</p>
<h3>SSA, Stationarity, and Autocorrelation</h3>
<p>If the underlying signal is contaminated only by white noise (AR0 noise), then the dominant eigenvalues will be associated with oscillations in the time series record. If the noise is autocorrelated (red or AR1 noise), however, "then dominant eigenvalues in the singular spectrum will characterize both the noise and signal components of the record" [3]. It should also be noted that higher-order autocorrelation structures, AR2 and AR3, can themselves produce oscillations.</p>
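<p>To illustrate, a rough simulation comparing the eigenvalue spectra of the lagged-covariance matrix for white versus AR(1) ("red") noise; the <code>lag.cov()</code> helper and all names here are my own, not from the references:</p>

```r
set.seed(7)
white <- rnorm(200)
red <- as.numeric(arima.sim(list(ar = 0.8), n = 200))
# Normalized lagged-covariance matrix for embedding dimension m
lag.cov <- function(x, m) {
  X <- embed(x, m)[, m:1] / sqrt(length(x) - m + 1)
  crossprod(X)
}
# White noise: a relatively flat eigenvalue spectrum
eigen(lag.cov(white, 10))$values
# Red noise: leading eigenvalues dominate, even with no true oscillation
eigen(lag.cov(red, 10))$values
```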
<h2>SSA versus PCA</h2>
<p>You should immediately notice one similarity between PCA and SSA.
They both can be computed through either a spectral decomposition of a covariance matrix or through an SVD of the data matrix, taking careful note that the data matrix in SSA is the trajectory matrix for a single time series record.</p>
<p>Elsner and Tsonis (1996) claim that aside from the difference between the composition of <span class="math">\(X\)</span>, i.e., between the trajectory matrix (containing lagged windows of a univariate time series) in SSA and the data matrix of PCA (containing multivariate time series records), "there is no difference between the expansion [of the data set] used in classical PCA and the expansion [of the data set] in SSA."
More specifically, <strong>PCA can be defined as the spectral decomposition of the covariance matrix <span class="math">\(X^TX\)</span> whereas SSA can be defined as the spectral decomposition of the (normalized) lagged-covariance matrix,</strong> which is also designated <span class="math">\(X^TX\)</span>; however, in PCA, <span class="math">\(X\)</span> is the data matrix while in SSA <span class="math">\(X\)</span> is the trajectory matrix.</p>
<p><strong>The difference between the structure of the matrix <span class="math">\(X\)</span> in PCA versus SSA is precisely what contributes to their different behaviors.</strong></p>
<p>In PCA, the matrix consists of a single variable observed at multiple locations (a multivariate time series dataset, where the variables are different spatial locations) and the resulting components are termed <em>spatial principal components</em> [3].
In SSA, the matrix consists of a single time series (a single variable observed in a single location) "observed" at different time windows; the resulting components derived through SSA could be termed <em>temporal principal components.</em></p>
<p>In the digital signal processing community, PCA (as a spectral decomposition of the <em>correlation,</em> not covariance, matrix) is also known as Karhunen–Loève (K-L) expansion.</p>
<h2>Implementation in R</h2>
<p>I'll demonstrate PCA and SSA in R using two different time series datasets, because the two methods make different central assumptions about time series data:</p>
<ul>
<li>In PCA, we necessarily have one variable observed over time <em>at multiple locations</em> or among <em>multiple sample units;</em></li>
<li>In (univariate) SSA, we necessarily have one variable observed over time as a single record, or in a single location.</li>
</ul>
<p>For SSA, I'll use the <code>LakeHuron</code> dataset, bundled with R and <a href="https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/LakeHuron.html">described here</a>; it consists of observations of the water level in Lake Huron (one place) over time.
For PCA, I couldn't find a built-in dataset that was adequate, so I'm using a series of home sale price observations in multiple neighborhoods in Los Angeles from 1989 to 2010.</p>
<h3>PCA for Time Series Data in R</h3>
<p>The first thing we want to do with time series data in R is create a time plot to look at the (mean) behavior over time.
Here, a time plot of the price-per-square-foot data indicates there is an overall regional oscillation in prices.
In Los Angeles, it appears that prices peak in 2006-2007, just before the subprime mortgage crisis.</p>
<p><img alt="Time plot of price-per-square-foot data for Los Angeles neighborhoods" src="http://karthur.org/images/20170919_LA_prices_time_plot.png"></p>
<p>To set up our data for PCA, we need to make sure the data frame is in "wide format," i.e., the years span the columns.
</p>
<div class="math">$$
X = \left[\begin{array}{ccc}
x_{1,1989} & \cdots & x_{1,2010}\\
& \ddots & \\
& & x_{N,2010}
\end{array}\right]
$$</div>
<p>More formally, the elements of the <span class="math">\(X\)</span> matrix are generated from sampling a time series <span class="math">\(T\)</span> at location <span class="math">\(i\)</span> and at time <span class="math">\(j\)</span>:
</p>
<div class="math">$$
\left[X\right]_{ij} = T(\,\mathrm{Location}\,\, i , \mathrm{Time}\,\, j\,)
$$</div>
<h4>Deriving the Spatial Principal Components</h4>
<p>The PCA here is performed using a singular value decomposition of the mean-centered and scaled data matrix, <span class="math">\(X\)</span>.
We then create a <em>screeplot,</em> which shows the proportion of variance explained by each principal component.</p>
<div class="highlight"><pre><span></span><code><span class="n">pca.price.only</span> <span class="o"><-</span> <span class="nf">svd</span><span class="p">(</span><span class="nf">scale</span><span class="p">(</span><span class="n">my.data</span><span class="p">)))</span>
<span class="nf">plot</span><span class="p">((</span><span class="n">pca.price.only</span><span class="o">$</span><span class="n">d</span><span class="o">^</span><span class="m">2</span> <span class="o">/</span> <span class="nf">sum</span><span class="p">(</span><span class="n">pca.price.only</span><span class="o">$</span><span class="n">d</span><span class="o">^</span><span class="m">2</span><span class="p">))[</span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">],</span> <span class="n">type</span> <span class="o">=</span> <span class="s">'b'</span><span class="p">,</span>
<span class="n">log</span> <span class="o">=</span> <span class="s">'y'</span><span class="p">,</span> <span class="n">main</span> <span class="o">=</span> <span class="s">'Screeplot (Log10): PCA on Price Data Only'</span><span class="p">,</span>
<span class="n">ylab</span> <span class="o">=</span> <span class="s">'Proportion of Variance Explained'</span><span class="p">,</span> <span class="n">xlab</span> <span class="o">=</span> <span class="s">'No. of Principal Components'</span><span class="p">)</span>
</code></pre></div>
<p><img alt="Screeplot of the proportion of variance explained by each principal component of the L.A. price data" src="http://karthur.org/images/20170919_LA_prices_screeplot.png"></p>
<p>Recall that the variance explained is proportional to the corresponding eigenvalue, among <span class="math">\(P\)</span> total eigenvalues.
The proportion of total variance, <span class="math">\(d_i\)</span>, attributable to principal component <span class="math">\(i\)</span>, can thus be calculated from the SVD, <span class="math">\(X = U\Sigma V^T\)</span>, as:
</p>
<div class="math">$$
d_i = \frac{\mathrm{diag}(\Sigma)_i^2}{\sum_{i=1}^P\mathrm{diag}(\Sigma)_i^2}
$$</div>
<p><strong>The screeplot is helpful for identifying how many distinct components to the variance exist.</strong>
If our aim with PCA is dimensionality reduction or compression, we can use this plot to decide how many principal components are needed to approximate the original data.</p>
<h4>Visualizing the Spatial Principal Components</h4>
<p>We'll examine the first four (4) principal components.
By plotting the right singular vectors, we can visualize the <em>loadings</em> of each variable on each of the principal components.
Recall that because the "variables" in this time-series dataset are different years of observation, what the loadings here represent is the contribution of each year to the <em>spatial</em> pattern of variance.
That is why Elsner and Tsonis (1996) refer to these as <em>spatial principal components.</em></p>
<div class="highlight"><pre><span></span><code><span class="n">pca.price.components</span> <span class="o"><-</span> <span class="nf">data.frame</span><span class="p">(</span>
<span class="n">year</span> <span class="o">=</span> <span class="nf">seq.int</span><span class="p">(</span><span class="m">1989</span><span class="p">,</span> <span class="m">2010</span><span class="p">),</span>
<span class="n">Price.PC1</span> <span class="o">=</span> <span class="n">pca.price.only</span><span class="o">$</span><span class="n">v</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">22</span><span class="p">,</span><span class="m">1</span><span class="p">],</span>
<span class="n">Price.PC2</span> <span class="o">=</span> <span class="n">pca.price.only</span><span class="o">$</span><span class="n">v</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">22</span><span class="p">,</span><span class="m">2</span><span class="p">],</span>
<span class="n">Price.PC3</span> <span class="o">=</span> <span class="n">pca.price.only</span><span class="o">$</span><span class="n">v</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">22</span><span class="p">,</span><span class="m">3</span><span class="p">],</span>
<span class="n">Price.PC4</span> <span class="o">=</span> <span class="n">pca.price.only</span><span class="o">$</span><span class="n">v</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">22</span><span class="p">,</span><span class="m">4</span><span class="p">])</span>
<span class="nf">require</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span>
<span class="nf">require</span><span class="p">(</span><span class="n">tidyr</span><span class="p">)</span>
<span class="nf">require</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span>
<span class="nf">require</span><span class="p">(</span><span class="n">stringr</span><span class="p">)</span>
<span class="n">pca.price.components</span> <span class="o">%>%</span>
<span class="nf">gather</span><span class="p">(</span><span class="n">key</span> <span class="o">=</span> <span class="s">'PCs'</span><span class="p">,</span> <span class="n">value</span> <span class="o">=</span> <span class="s">'loading'</span><span class="p">,</span> <span class="o">-</span><span class="n">year</span><span class="p">)</span> <span class="o">%>%</span>
<span class="nf">mutate</span><span class="p">(</span>
<span class="n">variable</span> <span class="o">=</span> <span class="nf">substr</span><span class="p">(</span><span class="n">PCs</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">5</span><span class="p">),</span>
<span class="n">PCs</span> <span class="o">=</span> <span class="nf">str_replace</span><span class="p">(</span><span class="n">PCs</span><span class="p">,</span> <span class="s">'(Price|Loans)\\.'</span><span class="p">,</span> <span class="s">''</span><span class="p">))</span> <span class="o">%>%</span>
<span class="nf">ggplot</span><span class="p">(</span><span class="n">mapping</span> <span class="o">=</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">year</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">loading</span><span class="p">))</span> <span class="o">+</span>
<span class="nf">geom_line</span><span class="p">(</span><span class="n">size</span> <span class="o">=</span> <span class="m">1</span><span class="p">)</span> <span class="o">+</span>
<span class="nf">facet_wrap</span><span class="p">(</span><span class="o">~</span> <span class="n">PCs</span><span class="p">)</span> <span class="o">+</span>
<span class="nf">labs</span><span class="p">(</span><span class="n">title</span> <span class="o">=</span> <span class="s">'Loadings on Principal Components: Log10 Price-per-Square Foot'</span><span class="p">,</span>
<span class="n">x</span> <span class="o">=</span> <span class="s">''</span><span class="p">)</span> <span class="o">+</span>
<span class="nf">theme_linedraw</span><span class="p">()</span>
</code></pre></div>
<p><a href="/images/20170919_LA_prices_PCs_plot.png"><img style="float:left;margin-right:20px;" src="/images/thumbs/20170919_LA_prices_PCs_plot_thumbnail_tall.png" /></a></p>
<div style="clear:both;"></div>
<p>Eastman and Fulk (1993) conducted a standardized PCA on vegetation in Africa and provide an excellent interpretation of the loadings on the spatial principal components:</p>
<blockquote>
<p>"If a [year] shows a strong positive correlation with a specific component, it indicates that that [year] contains a latent (i.e., to some extent hidden or unapparent) spatial pattern that has strong similarity to the one depicted in the component image. Similarly, a strong negative correlation indicates that the monthly image has a latent pattern that is the inverse of that shown (i.e., with positive and negative anomalies reversed)." [2]</p>
</blockquote>
<p>When the first principal component is essentially constant over time, as it is in the case of PC1 in this example, it indicates that the dominant variation in the data occurs over space.
In this example, it means that there is more variation in sale prices between L.A. neighborhoods in any year than over time for all neighborhoods.</p>
<p>Mapping the spatial principal component <em>scores,</em> or the original values projected onto the principal components, might aid interpretation.
The scores can be obtained for one or more principal components, up to <span class="math">\(m\)</span> total principal components, as the product of a subset of the columns, <span class="math">\(W\)</span>, and the mean-centered and scaled time-series data, <span class="math">\(Z\)</span>:
</p>
<div class="math">$$
T = ZW\quad\mbox{where}\quad W = \left[\begin{array}{cccc}
V_1 & V_2 & \cdots & V_p
\end{array}\right]\quad\mbox{for}\quad V_i\in V,\, p \le m
$$</div>
<div class="highlight"><pre><span></span><code><span class="c1"># Mean-center and scale the original data values</span>
<span class="n">var.price.scaled</span> <span class="o"><-</span> <span class="nf">as.matrix</span><span class="p">(</span><span class="nf">scale</span><span class="p">(</span><span class="nf">select</span><span class="p">(</span><span class="n">var.by.year.clean</span><span class="p">,</span> <span class="nf">starts_with</span><span class="p">(</span><span class="s">'price'</span><span class="p">))))</span>
<span class="c1"># Calculate the rotated values</span>
<span class="n">pca.price.spatial</span> <span class="o"><-</span> <span class="nf">matrix</span><span class="p">(</span><span class="n">nrow</span> <span class="o">=</span> <span class="nf">nrow</span><span class="p">(</span><span class="n">var.price.scaled</span><span class="p">),</span>
<span class="n">ncol</span> <span class="o">=</span> <span class="nf">ncol</span><span class="p">(</span><span class="n">pca.price.only</span><span class="o">$</span><span class="n">v</span><span class="p">))</span>
<span class="nf">for </span><span class="p">(</span><span class="n">i</span> <span class="n">in</span> <span class="m">1</span><span class="o">:</span><span class="nf">ncol</span><span class="p">(</span><span class="n">pca.price.spatial</span><span class="p">))</span> <span class="p">{</span>
<span class="n">pca.price.spatial</span><span class="p">[,</span><span class="n">i</span><span class="p">]</span> <span class="o"><-</span> <span class="n">var.price.scaled</span> <span class="o">%*%</span> <span class="nf">as.matrix</span><span class="p">(</span><span class="n">pca.price.only</span><span class="o">$</span><span class="n">v</span><span class="p">[,</span><span class="n">i</span><span class="p">])</span>
<span class="p">}</span>
</code></pre></div>
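<p>Note that the loop above is just the matrix product <span class="math">\(T = ZW\)</span> computed one column at a time, so all the scores can be obtained at once. A sketch on a small synthetic matrix (the data here are made up purely for illustration):</p>

```r
set.seed(1)
Z <- scale(matrix(rnorm(20), nrow = 5, ncol = 4))  # centered, scaled data
V <- svd(Z)$v
scores <- Z %*% V           # all principal component scores at once
scores.pc1 <- Z %*% V[, 1]  # scores on the first PC only
```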
<p>When we map the scores, here presented as the number of standard deviations around the mean score, for the first principal component, we see the dominant, time-invariant (or aggregate) spatial variation in price.
If we make a similar map for PC2, we can compare it to the loadings plot above.
The areas with positive correlations in the map follow the time trend indicated in the loadings for PC2; the areas with negative correlations in the map follow the inverse of that PC2 time trend.</p>
<p><a href="/images/20170919_LA_prices_scores_map.png"><img style="float:left;margin-right:20px;" src="/images/thumbs/20170919_LA_prices_scores_map_thumbnail_wide.png" /></a></p>
<div style="clear:both;"></div>
<h3>SSA for Time Series Data in R</h3>
<p>For SSA, which assumes weak stationarity, we want to look at the <em>first differences</em> of the <code>LakeHuron</code> data.
Differencing is easy in R with the <code>diff()</code> function.</p>
<div class="highlight"><pre><span></span><code><span class="nf">data</span><span class="p">(</span><span class="n">LakeHuron</span><span class="p">)</span>
<span class="nf">plot</span><span class="p">(</span><span class="n">LakeHuron</span><span class="p">,</span> <span class="n">main</span> <span class="o">=</span> <span class="s">'Lake Huron Water Levels'</span><span class="p">,</span> <span class="n">ylab</span> <span class="o">=</span> <span class="s">'Lake Level (ft)'</span><span class="p">)</span>
<span class="nf">plot</span><span class="p">(</span><span class="nf">diff</span><span class="p">(</span><span class="n">LakeHuron</span><span class="p">),</span> <span class="n">main</span> <span class="o">=</span> <span class="s">'Lake Huron Water Levels: First Difference'</span><span class="p">,</span>
<span class="n">ylab</span> <span class="o">=</span> <span class="s">'Lake Level (ft)'</span><span class="p">)</span>
</code></pre></div>
<p><img alt="Time plot of the first-differenced water level data for Lake Huron" src="http://karthur.org/images/20170919_LakeHuron_water_levels.png"></p>
<p>We might take a moment to confirm that first-differencing of the water levels data is adequate to produce a time series that is (weakly) stationary.
The <code>acf()</code> function in R is a good tool for visual inspection of time series data.
The resulting plot shows very low correlation after the zeroth lag (lag <span class="math">\(= 0\)</span>), which is encouraging.</p>
<div class="highlight"><pre><span></span><code><span class="nf">acf</span><span class="p">(</span><span class="nf">diff</span><span class="p">(</span><span class="n">LakeHuron</span><span class="p">),</span> <span class="n">type</span> <span class="o">=</span> <span class="s">'covariance'</span><span class="p">)</span>
</code></pre></div>
<h4>Construction of the Trajectory Matrix</h4>
<p>We'll construct two different trajectory matrices, investigating 10- and 25-year windows.
Recall that the trajectory matrix is an <span class="math">\((n-m+1)\times m\)</span> matrix for an embedding dimension of <span class="math">\(m\)</span>.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Construct the trajectory matrix (Elsner and Tsonis 1996, p.44)</span>
<span class="n">traj10</span> <span class="o"><-</span> <span class="nf">matrix</span><span class="p">(</span><span class="n">nrow</span> <span class="o">=</span> <span class="nf">length</span><span class="p">(</span><span class="nf">diff</span><span class="p">(</span><span class="n">LakeHuron</span><span class="p">))</span> <span class="o">-</span> <span class="m">10</span> <span class="o">+</span> <span class="m">1</span><span class="p">,</span> <span class="n">ncol</span> <span class="o">=</span> <span class="m">10</span><span class="p">)</span>
<span class="n">traj25</span> <span class="o"><-</span> <span class="nf">matrix</span><span class="p">(</span><span class="n">nrow</span> <span class="o">=</span> <span class="nf">length</span><span class="p">(</span><span class="nf">diff</span><span class="p">(</span><span class="n">LakeHuron</span><span class="p">))</span> <span class="o">-</span> <span class="m">25</span> <span class="o">+</span> <span class="m">1</span><span class="p">,</span> <span class="n">ncol</span> <span class="o">=</span> <span class="m">25</span><span class="p">)</span>
</code></pre></div>
<p>To populate the matrices, we can use a <code>for</code> loop to calculate all the windows of length <span class="math">\(m\)</span> in the dataset.
Recall that the <span class="math">\((i,j)\)</span> element of the trajectory matrix is given by <span class="math">\(x_{i+j-1}\)</span>, where <span class="math">\(x\)</span> is our first-differenced <code>LakeHuron</code> time series.</p>
<div class="highlight"><pre><span></span><code><span class="nf">for </span><span class="p">(</span><span class="n">i</span> <span class="n">in</span> <span class="m">1</span><span class="o">:</span><span class="nf">nrow</span><span class="p">(</span><span class="n">traj10</span><span class="p">))</span> <span class="p">{</span>
<span class="nf">for </span><span class="p">(</span><span class="n">j</span> <span class="n">in</span> <span class="m">1</span><span class="o">:</span><span class="nf">ncol</span><span class="p">(</span><span class="n">traj10</span><span class="p">))</span> <span class="p">{</span>
<span class="n">traj10</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o"><-</span> <span class="nf">diff</span><span class="p">(</span><span class="n">LakeHuron</span><span class="p">)[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span> <span class="o">-</span> <span class="m">1</span><span class="p">]</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
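<p>Base R's <code>embed()</code> builds the same matrix without an explicit loop: it returns the lagged copies with the most recent value first, so reversing the columns recovers the trajectory ordering. A sketch (the same approach with <code>m = 25</code> would fill <code>traj25</code>):</p>

```r
x <- as.numeric(diff(LakeHuron))
traj10.alt <- embed(x, 10)[, 10:1]
dim(traj10.alt)  # (length(x) - 10 + 1) rows by 10 columns
all(traj10.alt[1, ] == x[1:10])  # first row is the first window
```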
<h4>The Lagged-Covariance Matrix</h4>
<p>We next construct the lagged-covariance matrix from the trajectory matrix.
Recall that this is formed from the trajectory matrix <span class="math">\(X\)</span> as <span class="math">\(X^TX\)</span>.
Note that we normalize each matrix by one over the square-root of the number of time windows (the number of rows in the trajectory matrix).</p>
<div class="highlight"><pre><span></span><code><span class="n">S.traj10</span> <span class="o"><-</span> <span class="p">(</span><span class="nf">t</span><span class="p">(</span><span class="n">traj10</span><span class="p">)</span> <span class="o">*</span> <span class="m">1</span><span class="o">/</span><span class="nf">sqrt</span><span class="p">(</span><span class="nf">nrow</span><span class="p">(</span><span class="n">traj10</span><span class="p">)))</span> <span class="o">%*%</span> <span class="p">(</span><span class="n">traj10</span> <span class="o">*</span> <span class="m">1</span><span class="o">/</span><span class="nf">sqrt</span><span class="p">(</span><span class="nf">nrow</span><span class="p">(</span><span class="n">traj10</span><span class="p">)))</span>
<span class="n">S.traj25</span> <span class="o"><-</span> <span class="p">(</span><span class="nf">t</span><span class="p">(</span><span class="n">traj25</span><span class="p">)</span> <span class="o">*</span> <span class="m">1</span><span class="o">/</span><span class="nf">sqrt</span><span class="p">(</span><span class="nf">nrow</span><span class="p">(</span><span class="n">traj25</span><span class="p">)))</span> <span class="o">%*%</span> <span class="p">(</span><span class="n">traj25</span> <span class="o">*</span> <span class="m">1</span><span class="o">/</span><span class="nf">sqrt</span><span class="p">(</span><span class="nf">nrow</span><span class="p">(</span><span class="n">traj25</span><span class="p">)))</span>
</code></pre></div>
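<p>Equivalently, since <span class="math">\(\left(X/\sqrt{n}\right)^T\left(X/\sqrt{n}\right) = X^TX/n\)</span>, the normalized lagged-covariance matrix can be computed with <code>crossprod()</code>. A sketch, assuming <code>traj10</code> and <code>S.traj10</code> from the blocks above:</p>

```r
# crossprod() computes t(X) %*% X directly; dividing by the number of
# windows applies the 1/sqrt(n) normalization to both factors at once
S.traj10.alt <- crossprod(traj10) / nrow(traj10)
all.equal(S.traj10.alt, S.traj10)  # should be TRUE
```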
<h4>Derivation of the Eigenvectors</h4>
<p>Recall that we can get the eigenvectors in one of two ways.
The first way, perhaps more straightforward, is through a spectral (eigenvalue) decomposition of the lagged-covariance matrix.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Spectral decomposition of the lagged covariance matrix (columns are eigenvectors)</span>
<span class="n">S.traj10.eigen</span> <span class="o"><-</span> <span class="nf">eigen</span><span class="p">(</span><span class="n">S.traj10</span><span class="p">,</span> <span class="n">symmetric</span> <span class="o">=</span> <span class="bp">T</span><span class="p">)</span><span class="o">$</span><span class="n">vectors</span>
</code></pre></div>
<p>Alternatively, we could take an SVD of the trajectory matrix, keeping the right singular vectors.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># SVD of the trajectory matrix</span>
<span class="n">S.traj10.by.svd</span> <span class="o"><-</span> <span class="nf">svd</span><span class="p">(</span><span class="n">traj10</span><span class="p">)</span>
</code></pre></div>
<p>We can confirm these are equivalent <em>up to a sign change</em> as follows.</p>
<div class="highlight"><pre><span></span><code><span class="nf">all.equal</span><span class="p">(</span><span class="nf">sapply</span><span class="p">(</span><span class="n">S.traj10.by.svd</span><span class="o">$</span><span class="n">v</span><span class="p">,</span> <span class="n">abs</span><span class="p">),</span> <span class="nf">sapply</span><span class="p">(</span><span class="n">S.traj10.eigen</span><span class="p">,</span> <span class="n">abs</span><span class="p">))</span>
</code></pre></div>
<p>The two approaches may show sign differences in the resulting eigenvectors because the definition of the <em>direction</em> of the coordinate system is arbitrary.
The same is true in PCA [1].</p>
<blockquote>
<p>"Each principal component loading vector is unique, up to a sign flip...The signs may differ because each principal component loading vector specifies a direction in p-dimensional space: flipping the sign has no effect as the direction does not change." - James et al. (2013)</p>
</blockquote>
<h4>Visualizing the Temporal Principal Components</h4>
<p>The temporal principal components [3], which correspond to the eigenvectors of the lagged-covariance matrix, are more easily visualized if we wrap up our results in an R data frame.</p>
<div class="highlight"><pre><span></span><code><span class="n">S.traj10.eigen</span> <span class="o"><-</span> <span class="nf">eigen</span><span class="p">(</span><span class="n">S.traj10</span><span class="p">,</span> <span class="n">symmetric</span> <span class="o">=</span> <span class="bp">T</span><span class="p">)</span><span class="o">$</span><span class="n">vectors</span>
<span class="n">dat.traj10</span> <span class="o"><-</span> <span class="nf">as.data.frame</span><span class="p">(</span><span class="n">S.traj10.eigen</span><span class="p">)</span>
<span class="nf">colnames</span><span class="p">(</span><span class="n">dat.traj10</span><span class="p">)</span> <span class="o"><-</span> <span class="m">1</span><span class="o">:</span><span class="nf">ncol</span><span class="p">(</span><span class="n">dat.traj10</span><span class="p">)</span>
<span class="n">dat.traj10</span><span class="o">$</span><span class="n">time</span> <span class="o"><-</span> <span class="m">1</span><span class="o">:</span><span class="nf">nrow</span><span class="p">(</span><span class="n">dat.traj10</span><span class="p">)</span>
<span class="nf">require</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span>
<span class="nf">require</span><span class="p">(</span><span class="n">tidyr</span><span class="p">)</span>
<span class="nf">require</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span>
<span class="n">dat.traj10</span> <span class="o">%>%</span>
<span class="nf">gather</span><span class="p">(</span><span class="n">key</span> <span class="o">=</span> <span class="s">'eigenvector'</span><span class="p">,</span> <span class="n">value</span> <span class="o">=</span> <span class="s">'value'</span><span class="p">,</span> <span class="o">-</span><span class="n">time</span><span class="p">)</span> <span class="o">%>%</span>
<span class="nf">mutate</span><span class="p">(</span><span class="n">eigenvector</span> <span class="o">=</span> <span class="nf">ordered</span><span class="p">(</span><span class="n">eigenvector</span><span class="p">,</span> <span class="n">levels</span> <span class="o">=</span> <span class="m">1</span><span class="o">:</span><span class="m">1000</span><span class="p">))</span> <span class="o">%>%</span>
<span class="nf">ggplot</span><span class="p">(</span><span class="n">mapping</span> <span class="o">=</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">time</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">value</span><span class="p">))</span> <span class="o">+</span>
<span class="nf">geom_line</span><span class="p">(</span><span class="n">size</span> <span class="o">=</span> <span class="m">0.8</span><span class="p">)</span> <span class="o">+</span>
<span class="nf">facet_wrap</span><span class="p">(</span><span class="o">~</span> <span class="n">eigenvector</span><span class="p">)</span> <span class="o">+</span>
<span class="nf">theme_linedraw</span><span class="p">()</span>
</code></pre></div>
<p><a href="/images/20170919_LakeHuron_SSA.png"><img style="float:left;margin-right:20px;" src="/images/thumbs/20170919_LakeHuron_SSA_thumbnail_tall.png" /></a></p>
<div style="clear:both;"></div>
<p>Like PCA, interpretation of SSA results can be subjective.
SSA results may be even harder to interpret because every temporal principal component is some oscillation.
A more straightforward goal with SSA is smoothing, achieved in the <em>reconstruction</em> of the original signal using a subset of the components.</p>
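<p>That smoothing-by-reconstruction step can be sketched as follows: truncate the SVD of the trajectory matrix to its leading components, then map the low-rank matrix back to a series by averaging its anti-diagonals. This is a toy example on a noisy sine wave, not the Lake Huron series; the window length <code>L = 10</code> and the choice to keep two components are assumptions for illustration.</p>

```r
# SSA-style smoothing: low-rank reconstruction plus diagonal averaging.
set.seed(7)
truth <- sin(seq(0, 4 * pi, length.out = 60))
x <- truth + rnorm(60, sd = 0.3)    # noisy observations
L <- 10                             # window length (embedding dimension)
K <- length(x) - L + 1
traj <- sapply(1:K, function (k) x[k:(k + L - 1)])  # L x K trajectory matrix
s <- svd(traj)
# Keep only the two leading components (the sine's oscillatory pair)
recon <- s$u[, 1:2] %*% diag(s$d[1:2]) %*% t(s$v[, 1:2])
# Averaging each anti-diagonal (Hankelization) recovers a smoothed series
smoothed <- sapply(seq_along(x), function (i) {
  mean(recon[row(recon) + col(recon) - 1 == i])
})
cor(smoothed, truth)  # close to 1: most of the noise is removed
```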
<h2>Conclusion</h2>
<p>This is my take on SSA, PCA, and how they compare for different applications.</p>
<p>PCA is a well-established tool for exploratory visualization and analysis of multivariate data. It's particularly valuable for high-dimensional data (lots of columns). For time series data, it may be less useful if there is more variation between spatial units/sample units than over time.</p>
<p>SSA is a neat technique for discovering oscillations in time series data but it is tricky to get right. Oscillations may correspond either to signal or to noise and you need to know more about the data generating mechanism in order to distinguish the two. The assumption of first-order stationarity might also pose a problem for certain time series datasets. As a result, SSA may be better for smoothing and forecasting time series data than discovering canonical trajectories. It's also clear that SSA requires longer time series than PCA for practical use.</p>
<h2>References</h2>
<ol>
<li>James, G., D. Witten, R. Tibshirani, and T. Hastie. 2013. <u>An Introduction to Statistical Learning with Applications in R.</u> New York, New York, USA: Springer Texts in Statistics.</li>
<li>Eastman, R., and M. Fulk. 1993. Long sequence time series evaluation using standardized principal components. <em>Photogrammetric Engineering &amp; Remote Sensing</em> <strong>59</strong>(6):991–996.</li>
<li>Elsner, J., and A. Tsonis. 1996. <u>Singular Spectrum Analysis: A New Tool in Time Series Analysis.</u> New York and London: Plenum Press.</li>
<li>Golyandina, N., V. Nekrutkin, and A. Zhigljavsky. 2001. <u>Analysis of Time Series Structure: SSA and Related Techniques.</u> Washington, D.C., U.S.A.: Chapman and Hall/CRC.</li>
<li>Strang, G. 1988. <u>Linear Algebra and its Applications.</u> Orlando, Florida. Harcourt Brace Jovanovich, Inc. 3rd ed.</li>
</ol>
<h1>A visual tool for analyzing trends among group means in R</h1>
<p><em>2017-07-19, K. Arthur Endsley</em></p>
<p>Here, I present a simple graphical tool in R for visual detection and statistical testing of a trend among group means, where the groups are (usually) quantiles. I argue this is a good tool for exploratory data analysis as it allows non-linear trends to be detected.</p><p>Exploratory data analysis is a topic that doesn't get enough attention in courses, formal or otherwise, on statistical analysis or so-called "data science."
While some scientists or data professionals may be approaching a problem with what they feel is a great amount of well-established theory behind them, in many cases an <em>a priori</em> understanding of the system under consideration is lacking.
In other cases, what is thought to be well-understood about a system is actually ripe for skepticism and further inquiry.</p>
<p>Either way, approaching a dataset with an open mind is a good thing.
While any system of study with two or more variables can exhibit complex behavior, including thresholds, non-linearity, and feedbacks, problems in both the physical and social sciences are frequently modeled as purely linear relationships.
There are a couple of reasons for this and both are understandable.
First, the general linear model (and generalized linear models) is relatively uncomplicated.
A second, less compelling reason is that linear relationships may be thought to be more useful, or at least easier to understand.
In exploratory data analysis (EDA), we want to get a sense of the range of possibilities in the system under study, to the extent that our available data accurately represent it.</p>
<p><strong>Here, I present a visual tool and R code to help guide the detection of bivariate trends among groups means (or among quantiles of one continuous variable),</strong> which also applies and displays an analysis of variance (ANOVA) test, <strong>while also controlling for the influence of a third variable.</strong>
I've found this script useful for approaching a variety of datasets and, in the interest of cleaning up my code when I have multiple relationships to test, turned it into a re-usable R function.</p>
<h2>Example Using American Community Survey</h2>
<p>For this example, I'll use data from the 2012 5-Year American Community Survey (ACS) for the Detroit-area counties of Wayne, Oakland, and Macomb; survey data are at the block-group level.
I obtained the ACS data from <a href="https://www.socialexplorer.com/">SocialExplorer.com</a>, which is a lot more convenient than going to the U.S. Census Bureau website.
With these data, we might ask questions similar to those posed for a general, tabular dataset:</p>
<ul>
<li>What is the relationship between housing vacancy and median household income, controlling for housing density?</li>
<li>Is there a relationship between median household income and white population proportion, controlling for population density?</li>
</ul>
<p>First, I need to process the ACS 2012 data such that:</p>
<ul>
<li>Variables have meaningful names;</li>
<li>Only block groups with non-zero population and housing totals are considered;</li>
<li>Housing density and population density are calculated for every block group;</li>
<li>Median household income is log-transformed;</li>
<li>My other variables of interest are appropriately normalized;</li>
<li>Quantiles are calculated for my variables of interest.</li>
</ul>
<p>I present this example using the R programming environment (version 3.3.2).
For most of the processing, I'll use <code>dplyr</code> and pipes.
I'm going to choose to cut my data into quintiles (<span class="math">\(n=5\)</span> quantiles) but this approach works for any kind of discretization.</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span>
<span class="c1"># Load the 2012 ACS data</span>
<span class="n">acs2012.raw</span> <span class="o"><-</span> <span class="nf">read.csv</span><span class="p">(</span><span class="s">'acs2012_5year_by_block_groups.csv'</span><span class="p">,</span>
<span class="n">header</span> <span class="o">=</span> <span class="bp">T</span><span class="p">,</span> <span class="n">skip</span> <span class="o">=</span> <span class="m">1</span><span class="p">,</span> <span class="n">colClasses</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="s">'Geo_FIPS'</span> <span class="o">=</span> <span class="s">'character'</span><span class="p">))</span>
<span class="n">quintile</span> <span class="o"><-</span> <span class="nf">function </span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="p">{</span>
<span class="c1"># Also, rename levels so they fit inside plot labels</span>
<span class="nf">cut</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">breaks</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">probs</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">0.2</span><span class="p">),</span> <span class="n">na.rm</span> <span class="o">=</span> <span class="bp">T</span><span class="p">),</span>
<span class="n">include.lowest</span> <span class="o">=</span> <span class="bp">T</span><span class="p">,</span> <span class="c1"># Must include for cases of zero vacant housing, etc.</span>
<span class="n">labels</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="s">'1st'</span><span class="p">,</span> <span class="s">'2nd'</span><span class="p">,</span> <span class="s">'3rd'</span><span class="p">,</span> <span class="s">'4th'</span><span class="p">,</span> <span class="s">'5th'</span><span class="p">))</span>
<span class="p">}</span>
<span class="n">acs2012</span> <span class="o"><-</span> <span class="n">acs2012.raw</span> <span class="o">%>%</span>
<span class="nf">select</span><span class="p">(</span>
<span class="n">FIPS</span> <span class="o">=</span> <span class="n">Geo_FIPS</span><span class="p">,</span>
<span class="n">county.code</span> <span class="o">=</span> <span class="n">Geo_COUNTY</span><span class="p">,</span>
<span class="n">area.land.sq.m</span> <span class="o">=</span> <span class="n">Geo_AREALAND</span><span class="p">,</span>
<span class="n">Pop.Total</span> <span class="o">=</span> <span class="n">SE_T001_001</span><span class="p">,</span>
<span class="n">Pop.White</span> <span class="o">=</span> <span class="n">SE_T013_002</span><span class="p">,</span>
<span class="n">Pop.Black</span> <span class="o">=</span> <span class="n">SE_T013_003</span><span class="p">,</span>
<span class="n">Median.Hhold.Income</span> <span class="o">=</span> <span class="n">SE_T057_001</span><span class="p">,</span>
<span class="n">Housing.Units.Total</span> <span class="o">=</span> <span class="n">SE_T093_001</span><span class="p">,</span>
<span class="n">Housing.Units.Vacant</span> <span class="o">=</span> <span class="n">SE_T096_001</span><span class="p">)</span> <span class="o">%>%</span>
<span class="c1"># Consider only block groups with non-zero housing, population</span>
<span class="nf">filter</span><span class="p">(</span><span class="n">Pop.Total</span> <span class="o">></span> <span class="m">0</span><span class="p">)</span> <span class="o">%>%</span>
<span class="nf">filter</span><span class="p">(</span><span class="n">Housing.Units.Total</span> <span class="o">></span> <span class="m">0</span><span class="p">)</span> <span class="o">%>%</span>
<span class="nf">filter</span><span class="p">(</span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">Median.Hhold.Income</span><span class="p">))</span> <span class="o">%>%</span>
<span class="c1"># Normalize variables</span>
<span class="nf">mutate</span><span class="p">(</span>
<span class="n">Log10.Median.Hhold.Income</span> <span class="o">=</span> <span class="nf">log10</span><span class="p">(</span><span class="n">Median.Hhold.Income</span><span class="p">),</span>
<span class="n">Housing.Density</span> <span class="o">=</span> <span class="n">Housing.Units.Total</span> <span class="o">/</span> <span class="n">area.land.sq.m</span><span class="p">,</span>
<span class="n">Pop.Density</span> <span class="o">=</span> <span class="n">Pop.Total</span> <span class="o">/</span> <span class="n">area.land.sq.m</span><span class="p">,</span>
<span class="n">Prop.White</span> <span class="o">=</span> <span class="n">Pop.White</span> <span class="o">/</span> <span class="n">Pop.Total</span><span class="p">,</span>
<span class="n">Prop.Black</span> <span class="o">=</span> <span class="n">Pop.Black</span> <span class="o">/</span> <span class="n">Pop.Total</span><span class="p">,</span>
<span class="n">Prop.Vacant</span> <span class="o">=</span> <span class="n">Housing.Units.Vacant</span> <span class="o">/</span> <span class="n">Housing.Units.Total</span><span class="p">)</span> <span class="o">%>%</span>
<span class="nf">mutate_each</span><span class="p">(</span><span class="nf">funs</span><span class="p">(</span><span class="s">'quintile'</span><span class="p">),</span> <span class="n">Pop.Density</span><span class="p">,</span> <span class="n">Housing.Density</span><span class="p">,</span>
<span class="n">Prop.White</span><span class="p">,</span> <span class="n">Prop.Black</span><span class="p">,</span> <span class="n">Prop.Vacant</span><span class="p">)</span>
</code></pre></div>
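<p>The <code>quintile</code> helper above can be sanity-checked on synthetic data: with 100 evenly spaced values, each of the five bins should hold exactly 20 observations.</p>

```r
# Same helper as in the processing pipeline above, applied to 1:100
quintile <- function (v) {
  cut(v, breaks = quantile(v, probs = seq(0, 1, 0.2), na.rm = TRUE),
      include.lowest = TRUE, labels = c('1st', '2nd', '3rd', '4th', '5th'))
}
counts <- table(quintile(1:100))
counts  # 20 observations in each of the five labeled bins
```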
<p>We lost just 16 block groups by our decision to exclude those block groups with zero population or zero housing.
It turns out that median household income (<code>Median.Hhold.Income</code>) is missing for 5 more block groups, so I also remove these cases.</p>
<p><strong>The annotated source code for the <code>levels.plot</code> function seen in subsequent code snippets is at the bottom of this post.</strong></p>
<h3>First Plot: Income versus Vacant Housing</h3>
<p>Let's answer the first question with data: What is the relationship between housing vacancy and median household income, controlling for housing density?
Here, it makes more sense (to me) to have median household income on the y-axis, as a continuous variable; thus, it should be the first variable name passed to <code>levels.plot</code>.
The (proportion of) vacant housing should then be our second variable.
Finally, we want to control for the effect of housing density; that is, we want to examine this relationship <em>at different levels of housing density.</em></p>
<div class="highlight"><pre><span></span><code><span class="nf">levels.plot</span><span class="p">(</span><span class="n">acs2012</span><span class="p">,</span> <span class="s">'Log10.Median.Hhold.Income'</span><span class="p">,</span> <span class="s">'Prop.Vacant'</span><span class="p">,</span>
<span class="s">'Housing.Density'</span><span class="p">,</span> <span class="n">y1</span> <span class="o">=</span> <span class="m">5.6</span><span class="p">)</span>
</code></pre></div>
<p>The function will automatically determine where to place both the p-value statistic associated with the ANOVA (within each density level) and the numbers of observations in each bin.
However, I felt the bin counts were getting overplotted by the high-end outliers for the boxplots, so I added a custom value for the <code>y1</code> argument, the position of these labels; it's a little higher, now, than the maximum value for <code>Log10.Median.Hhold.Income</code> (5.60 instead of 5.37).</p>
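<p>The p-value displayed in each facet comes from a one-way ANOVA fit within that density level. The extraction idiom is the same one used inside <code>levels.plot</code> (shown at the bottom of this post); here it is demonstrated on a toy data frame rather than the ACS data.</p>

```r
# One-way ANOVA on a toy data frame with two clearly separated groups
set.seed(1)
toy <- data.frame(
  y = c(rnorm(30, mean = 0), rnorm(30, mean = 3)),
  g = factor(rep(c('low', 'high'), each = 30)))
test <- aov(y ~ g, data = toy)
# summary(aov(...)) returns a one-element list; 'Pr(>F)' holds the p-value
p.value <- summary(test)[[1]][['Pr(>F)']][[1]]
sprintf('p-value: ~%.4f', p.value)  # very small: the group means differ
```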
<p><img alt="The levels plot of median household income versus vacant housing proportion, controlling for housing density" src="/images/20170725_income_vs_vacant.png"></p>
<p><strong>How do we read this plot?</strong>
The subplots with gray boxes as titles (the <em>facets</em> in <code>ggplot2</code> parlance) correspond to each of the five density quintiles (from lowest to highest housing density, in this case).
Within each of those subplots, the five quintiles of the second variable (<code>var2</code>), the proportion of vacant housing (<code>Prop.Vacant</code>), in this case, are each plotted against the log of median household income.</p>
<p>We can see from this plot that there is a clear, statistically significant negative relationship between vacant housing and median household income at all density levels.
This is just as expected; presumably, people with higher incomes can avoid living in neighborhoods with high vacancy rates.
Conversely, high vacancy rates do not tend to occur where people have higher incomes (because they are more likely to be able to afford to pay their mortgage and property taxes).</p>
<p>We also see that, with the logarithmic transformation of median household income, there is a slight saturation effect at low levels of vacant housing.
That is, when the proportion of housing that is vacant is small, we see very small differences in median household income.
But at the highest three quintiles of vacant housing density, the difference is much larger.</p>
<h3>Second Plot: Income and White Population</h3>
<p>Our second question: Is there a relationship between median household income and white population proportion, controlling for population density?
The call to <code>levels.plot</code> is very similar, but we're controlling for population density this time.</p>
<p>We see a clear, statistically significant, positive trend in the log of median household income as the white proportion rises.
At the highest population density level, the relationship is linear; at lower population densities, it exhibits the exponential behavior we saw in the relationship between vacant housing and income.</p>
<div class="highlight"><pre><span></span><code><span class="nf">levels.plot</span><span class="p">(</span><span class="n">acs2012</span><span class="p">,</span> <span class="s">'Log10.Median.Hhold.Income'</span><span class="p">,</span> <span class="s">'Prop.White'</span><span class="p">,</span>
<span class="s">'Pop.Density'</span><span class="p">,</span> <span class="n">y1</span> <span class="o">=</span> <span class="m">5.6</span><span class="p">)</span>
</code></pre></div>
<p><img alt="The levels plot of median household income versus white population proportion, controlling for population density" src="/images/20170725_income_vs_white.png"></p>
<h2>The Levels Plot Function</h2>
<p>There are many other examples I could provide, but you've seen enough of what this function does.
It is a fairly simple, yet effective, visualization tool.
The R code I provide below demonstrates how simple it is.
Because R is not a very well-designed language (e.g., awful tracebacks, pollution of the global namespace, proliferation of global functions that are very similar, factors), I don't profess to be any good at writing base R code; it is quite likely this function definition could be written with more sophistication.
But I think this attempt is more than adequate.
I welcome any suggested changes.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Levels plot function</span>
<span class="c1"># -- Presents a plot of one continuous variable (`var1`) across quintiles of</span>
<span class="c1"># a discrete variable (`var2`), faceted by a third density variable (`dens`)</span>
<span class="nf">require</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span>
<span class="nf">require</span><span class="p">(</span><span class="n">grid</span><span class="p">)</span>
<span class="nf">require</span><span class="p">(</span><span class="n">reshape2</span><span class="p">)</span>
<span class="n">levels.plot</span> <span class="o"><-</span> <span class="nf">function </span><span class="p">(</span><span class="n">q.dat</span><span class="p">,</span> <span class="n">var1</span><span class="p">,</span> <span class="n">var2</span><span class="p">,</span> <span class="n">dens</span><span class="p">,</span> <span class="n">yaxis</span> <span class="o">=</span> <span class="s">'fixed'</span><span class="p">,</span> <span class="n">y0</span> <span class="o">=</span> <span class="kc">NULL</span><span class="p">,</span> <span class="n">y1</span> <span class="o">=</span> <span class="kc">NULL</span><span class="p">,</span> <span class="n">y.label</span> <span class="o">=</span> <span class="kc">NULL</span><span class="p">)</span> <span class="p">{</span>
<span class="c1"># Calculate cross-tabulation</span>
<span class="n">ctab</span> <span class="o"><-</span> <span class="n">reshape2</span><span class="o">::</span><span class="nf">melt</span><span class="p">(</span><span class="nf">table</span><span class="p">(</span><span class="nf">subset</span><span class="p">(</span><span class="n">q.dat</span><span class="p">,</span> <span class="n">select</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="n">var2</span><span class="p">,</span> <span class="n">dens</span><span class="p">))),</span> <span class="n">id.vars</span> <span class="o">=</span> <span class="n">var2</span><span class="p">)</span>
<span class="c1"># Set the y-axis position of quantile N labels</span>
<span class="n">ctab</span><span class="o">$</span><span class="n">y</span> <span class="o"><-</span> <span class="nf">rep</span><span class="p">(</span><span class="nf">ifelse</span><span class="p">(</span><span class="nf">is.null</span><span class="p">(</span><span class="n">y1</span><span class="p">),</span> <span class="nf">max</span><span class="p">(</span><span class="n">q.dat</span><span class="p">[,</span><span class="n">var1</span><span class="p">]),</span> <span class="n">y1</span><span class="p">),</span> <span class="nf">dim</span><span class="p">(</span><span class="n">ctab</span><span class="p">)[</span><span class="m">1</span><span class="p">])</span>
<span class="c1"># var1 must be a continuous variable</span>
<span class="nf">stopifnot</span><span class="p">(</span><span class="nf">class</span><span class="p">(</span><span class="n">q.dat</span><span class="p">[,</span><span class="n">var1</span><span class="p">])</span> <span class="o">%in%</span> <span class="nf">c</span><span class="p">(</span><span class="s">'integer'</span><span class="p">,</span> <span class="s">'numeric'</span><span class="p">))</span>
<span class="c1"># Iterate over the density levels...</span>
<span class="n">tests</span> <span class="o"><-</span> <span class="nf">c</span><span class="p">()</span>
<span class="nf">for </span><span class="p">(</span><span class="n">q</span> <span class="n">in</span> <span class="nf">levels</span><span class="p">(</span><span class="n">q.dat</span><span class="p">[,</span><span class="n">dens</span><span class="p">]))</span> <span class="p">{</span>
<span class="n">test</span> <span class="o"><-</span> <span class="nf">aov</span><span class="p">(</span><span class="nf">as.formula</span><span class="p">(</span><span class="nf">paste0</span><span class="p">(</span><span class="n">var1</span><span class="p">,</span> <span class="s">' ~ '</span><span class="p">,</span> <span class="n">var2</span><span class="p">)),</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">q.dat</span><span class="p">[</span><span class="n">q.dat</span><span class="p">[,</span><span class="n">dens</span><span class="p">]</span> <span class="o">==</span> <span class="n">q</span><span class="p">,])</span>
<span class="n">tests</span> <span class="o"><-</span> <span class="nf">c</span><span class="p">(</span><span class="n">tests</span><span class="p">,</span> <span class="nf">sprintf</span><span class="p">(</span><span class="s">'p-value: ~%.4f'</span><span class="p">,</span> <span class="nf">summary</span><span class="p">(</span><span class="n">test</span><span class="p">)[[</span><span class="m">1</span><span class="p">]][[</span><span class="s">'Pr(>F)'</span><span class="p">]][[</span><span class="m">1</span><span class="p">]]))</span>
<span class="p">}</span>
<span class="c1"># Set the y-position of the p-value text</span>
<span class="n">tests</span> <span class="o"><-</span> <span class="nf">data.frame</span><span class="p">(</span><span class="n">p.value</span> <span class="o">=</span> <span class="n">tests</span><span class="p">,</span> <span class="n">dens</span> <span class="o">=</span> <span class="nf">levels</span><span class="p">(</span><span class="n">q.dat</span><span class="p">[,</span><span class="n">dens</span><span class="p">]),</span>
<span class="n">x</span> <span class="o">=</span> <span class="m">1</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="nf">ifelse</span><span class="p">(</span><span class="nf">is.null</span><span class="p">(</span><span class="n">y0</span><span class="p">),</span> <span class="nf">min</span><span class="p">(</span><span class="n">q.dat</span><span class="p">[,</span><span class="n">var1</span><span class="p">]),</span> <span class="n">y0</span><span class="p">))</span>
<span class="c1"># MUST rename the "dens" column to the variable name so it can be found by ggplot2</span>
<span class="n">.names</span> <span class="o"><-</span> <span class="nf">names</span><span class="p">(</span><span class="n">tests</span><span class="p">)</span>
<span class="n">.names</span><span class="p">[</span><span class="m">2</span><span class="p">]</span> <span class="o"><-</span> <span class="n">dens</span>
<span class="nf">names</span><span class="p">(</span><span class="n">tests</span><span class="p">)</span> <span class="o"><-</span> <span class="n">.names</span>
<span class="c1"># The initial plot object</span>
<span class="nf">ggplot</span><span class="p">(</span><span class="n">q.dat</span><span class="p">,</span> <span class="nf">aes_string</span><span class="p">(</span><span class="n">y</span> <span class="o">=</span> <span class="n">var1</span><span class="p">))</span> <span class="o">+</span>
<span class="nf">geom_boxplot</span><span class="p">(</span><span class="n">mapping</span> <span class="o">=</span> <span class="nf">aes_string</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">var2</span><span class="p">))</span> <span class="o">+</span>
<span class="nf">geom_text</span><span class="p">(</span><span class="nf">aes_string</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">var2</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="s">'y'</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'value'</span><span class="p">),</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">ctab</span><span class="p">,</span> <span class="n">vjust</span> <span class="o">=</span> <span class="m">0.7</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="m">4.5</span><span class="p">)</span> <span class="o">+</span>
<span class="nf">geom_text</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">y</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="n">p.value</span><span class="p">),</span> <span class="n">data</span> <span class="o">=</span> <span class="n">tests</span><span class="p">,</span>
<span class="n">hjust</span> <span class="o">=</span> <span class="m">0.1</span><span class="p">,</span> <span class="n">vjust</span> <span class="o">=</span> <span class="m">0.1</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="m">4.5</span><span class="p">)</span> <span class="o">+</span>
<span class="nf">xlab</span><span class="p">(</span><span class="nf">paste0</span><span class="p">(</span><span class="s">'2012 ACS '</span><span class="p">,</span> <span class="nf">gsub</span><span class="p">(</span><span class="s">'\\.'</span><span class="p">,</span> <span class="s">' '</span><span class="p">,</span> <span class="n">var2</span><span class="p">),</span> <span class="s">' Quintile'</span><span class="p">))</span> <span class="o">+</span>
<span class="nf">ylab</span><span class="p">(</span><span class="nf">ifelse</span><span class="p">(</span><span class="nf">is.null</span><span class="p">(</span><span class="n">y.label</span><span class="p">),</span> <span class="nf">gsub</span><span class="p">(</span><span class="s">'\\.'</span><span class="p">,</span> <span class="s">' '</span><span class="p">,</span> <span class="n">var1</span><span class="p">),</span> <span class="n">y.label</span><span class="p">))</span> <span class="o">+</span>
<span class="nf">labs</span><span class="p">(</span><span class="n">title</span> <span class="o">=</span> <span class="nf">paste0</span><span class="p">(</span><span class="nf">gsub</span><span class="p">(</span><span class="s">'\\.'</span><span class="p">,</span> <span class="s">' '</span><span class="p">,</span> <span class="n">var2</span><span class="p">),</span> <span class="s">' by '</span><span class="p">,</span> <span class="nf">gsub</span><span class="p">(</span><span class="s">'\\.'</span><span class="p">,</span> <span class="s">' '</span><span class="p">,</span> <span class="n">dens</span><span class="p">),</span> <span class="s">' Quintiles'</span><span class="p">))</span> <span class="o">+</span>
<span class="nf">facet_wrap</span><span class="p">(</span><span class="nf">as.formula</span><span class="p">(</span><span class="nf">paste0</span><span class="p">(</span><span class="s">'~ '</span><span class="p">,</span> <span class="n">dens</span><span class="p">)),</span> <span class="n">scales</span> <span class="o">=</span> <span class="n">yaxis</span><span class="p">)</span> <span class="o">+</span>
<span class="nf">theme_bw</span><span class="p">()</span> <span class="o">+</span>
<span class="nf">theme</span><span class="p">(</span><span class="n">text</span> <span class="o">=</span> <span class="nf">element_text</span><span class="p">(</span><span class="n">size</span> <span class="o">=</span> <span class="m">16</span><span class="p">),</span>
<span class="n">plot.margin</span> <span class="o">=</span> <span class="nf">unit</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0.5</span><span class="p">,</span> <span class="m">0.2</span><span class="p">,</span> <span class="m">0.5</span><span class="p">,</span> <span class="m">0</span><span class="p">),</span> <span class="s">'cm'</span><span class="p">))</span>
<span class="p">}</span>
</code></pre></div>
<h1>Teaching the Q Method in a class on urban sustainability</h1>
<p>2016-09-21 · K. Arthur Endsley · <a href="http://karthur.org/2016/teaching-the-q-method-urban-sustainability.html">karthur.org/2016/teaching-the-q-method-urban-sustainability.html</a></p><p>The Q Method is a mixed method that combines a survey of individuals with factor analysis to determine what distinct perspectives are embedded in a population. In a class on urban sustainability, I demonstrated how this method can be used to reveal students' diverse perspectives on issues about which we assume they mostly agree.</p><h2>On Deciding to Teach the Q Method</h2>
<p>While my background is in the natural sciences, I have a tendency to discover and dive into new methods, including those that originate in, or are typically practiced in, the social sciences.
This semester, I am helping to design and teach a course on urban sustainability that must cater to students in a professional Master's program, students who will go on to careers as sustainability practitioners, and I have struggled to devise skill-based instruction that serves their needs.
The Q Method is a relatively obscure approach to analyzing qualitative data on human subjectivity (although <a href="https://qmethod.org/resources/how2q/">there is an active research community</a> promoting its wider use).
Our students are learning that urban sustainability is not an exact science; it is a confluence of discourses and untested proposals for how to make our cities more efficient, healthy, and just.
Can the Q Method help them to understand the diverse perspectives on contemporary sustainability issues?</p>
<h2>The Q Method</h2>
<p><strong>The Q Method is a mixed method that combines a survey of individuals with factor analysis to determine what distinct perspectives are embedded in a population.</strong>
In the words of van Exel and de Graaf [1], who paraphrase Brown [2]:</p>
<blockquote>
<p>Q methodology provides a foundation for the systematic study of subjectivity, a person's viewpoint, opinion, beliefs, attitude, and the like...By Q sorting people give their subjective meaning to the statements, and by doing so reveal their subjective viewpoint...or personal profile.</p>
</blockquote>
<p>It is a useful tool for analyzing human subjectivity on a variety of social or technical issues, whether the respondents are experts in a particular field or are drawn from a more general population.</p>
<p>Q Method was devised by the psychologist William Stephenson, who was very critical of the classic statistical analysis advanced by Karl Pearson.
In particular, Stephenson's Q Method questions the single, objective reality that is assumed in classical statistics, where the goal is to develop a best-fitting model that tests one of a prescribed set of hypotheses.
With the Q Method, no <em>a priori</em> hypotheses are specified.
Its practitioners assume the existence of multiple subjective realities.
The origin story for the Q Method holds that because classical statistics was a discipline associated with Pearson and, in particular, Pearson's correlation coefficient, denoted <span class="math">\(r\)</span>, Stephenson decided that his method should be called "Q" as Q comes before R in the alphabet.</p>
<h3>Terminology and Methods</h3>
<p>We've all seen surveys that ask us to rate statements on a scale from "Agree" to "Disagree," perhaps from "Strongly Agree" to "Strongly Disagree."
These surveys have usually irritated me because this spectrum seems rather arbitrary, the number of gradations from one end to the other too numerous.
Am I a "7" or an "8" on this 10-point scale?
Do I agree or agree <em>strongly</em>?</p>
<p>The Q Method begins with such a survey but puts more thought into its design and execution.
When we survey a group of respondents, also called the <strong>P sample</strong> (or <strong>P set</strong>), each is asked to sort a collection of statements, photographs, or other discrete messages along an axis.
This axis, however, may be described in different terms from simple agreement and disagreement.
We might ask the respondents to sort the statements from "less like how I think" to "more like how I think," or from "less likely to motivate me" to "more likely to motivate me."
The statements (or photographs, audio clips, etc.) to be sorted constitute the <strong>concourse of communication.</strong>
The sorting of these statements is also typically constrained by a matrix that approximates a quasi-normal distribution (see below).
Each of the responses constitutes a <strong>Q sort</strong> and the collection of all Q sorts is referred to as the <strong>Q sample.</strong></p>
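<p>As a concrete illustration (a hypothetical allocation, not the matrix used in the exercise below), a forced distribution for 21 statements scored from -3 to +3 might assign each score a fixed, roughly bell-shaped number of slots; in R:</p>
<div class="highlight"><pre><code># Hypothetical forced-distribution matrix for 21 statements scored -3..+3;
# each score receives a fixed, roughly bell-shaped number of slots
scores <- -3:3
slots <- c(1, 3, 4, 5, 4, 3, 1)  # must sum to the number of statements (21)
stopifnot(sum(slots) == 21)

# Because the allocation is symmetric, every completed Q sort sums to zero
sum(scores * slots)  # 0
</code></pre></div>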
<p><a href="/images/20160921_q_sort.png"><img style="float:left;margin-right:20px;" src="/images/thumbs/20160921_q_sort_thumbnail_wide.png" /></a></p>
<div style="clear:both;"></div>
<p>Above, a Q sort in progress is depicted (from Ellingsen et al., 2014); the respondent is sorting the statements into a quasi-normal distribution according to her subjective viewpoint on each.</p>
<p>The "objective" part of the Q Method comes after the Q sample is generated.
It involves the application of factor analysis to the Q sample, with the goal of inducing factors that correspond to shared, subjective worldviews related to the concourse of communication.
The use of factor analysis belies the subjectivity that enters this part of the method, however.
The number of factors to induce is not easily determined.
Moreover, the resulting factors require interpretation, which is often highly subjective and even confusing.
Consequently, much of what is to be learned from the Q Method comes from the <em>process</em> of conducting the analysis and sharing the results with the P sample that produced the data.
When the P sample consists of practitioners, community members, or outside experts, then the Q Method becomes a tool for the <strong>co-production</strong> of knowledge, which is increasingly important in natural resource issues.</p>
<h3>The Q Method in Practice</h3>
<p>The Q Method has been used in sustainability research before.
Zeemering (2009) surveyed San Francisco city officials as to which aspects of "sustainability" are most important to them and their work [4].
The statements for the concourse were drawn from a local non-profit group's report and were ranked by city officials from "least" to "most important in my community."
The Q Method has also been used with a concourse of photos, which participants were asked to sort based on two prompts: 1) whether the picture makes them feel that climate change is important or unimportant; and 2) whether the picture makes them feel that they can or cannot do something about climate change [5].</p>
<h2>Applying the Q Method to Urban Sustainability</h2>
<p>I assigned my students a Q sort exercise with statements that described multiple perspectives on sustainability.
The concourse was drawn from two sources: John Dryzek's <em>The Politics of the Earth</em> and another book, <em>Confronting Consumption</em>, edited by Thomas Princen, Michael Maniates, and Ken Conca.
These books, particularly the former, explore a number of ways of characterizing sustainability challenges and natural resource issues.
Students were asked <a href="http://karthur.org/static/docs/20160921_Q_Sort_Matrix_for_Urban_Sustainability.xlsx">to sort 21 statements into a matrix</a> based on their agreement with the statement.
A quasi-normal distribution was enforced.
<strong>What follows is a description of the analysis in R.</strong></p>
<h3>Organizing the Q Sample</h3>
<p>The data are organized as a CSV file with the respondents along the columns and the statements along the rows, as in the example below, where <code>r1</code> through <code>r15</code> are each of the respondents.</p>
<div class="highlight"><pre><span></span><code><span class="n">q.sample</span> <span class="o"><-</span> <span class="nf">read.csv</span><span class="p">(</span><span class="s">'QMethod_results.csv'</span><span class="p">)</span>
<span class="nf">head</span><span class="p">(</span><span class="n">q.sample</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>## r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11 r12 r13 r14 r15
## 1 -1 0 1 2 2 0 0 1 -1 0 0 2 3 3 2
## 2 2 0 1 1 1 2 -1 0 0 2 3 2 2 -3 2
## 3 1 2 2 1 1 1 1 -3 0 1 2 -2 1 1 1
## 4 -1 2 -2 0 -2 -3 -1 1 -2 -2 0 -2 -3 -2 -2
## 5 0 0 0 0 1 1 -2 1 1 -1 1 0 0 0 1
## 6 0 -1 1 -1 -1 -1 0 0 1 0 0 3 -2 1 -1
</code></pre></div>
<p>Because we used a forced (quasi-normal) distribution (i.e., respondents were constrained to placing statements inside a bell-shaped distribution), we want to make sure that the columns sum to zero (i.e., same number of statements on each side of the curve).</p>
<div class="highlight"><pre><span></span><code><span class="nf">apply</span><span class="p">(</span><span class="n">q.sample</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="n">sum</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>## r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11 r12 r13 r14 r15
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
</code></pre></div>
<h3>Exploratory Analysis</h3>
<p>One question we can ask of the Q sample is which respondents are highly correlated with one another?
We can calculate the Pearson's correlations between each pair of respondents using the <code>cor()</code> function.</p>
<div class="highlight"><pre><span></span><code><span class="nf">cor</span><span class="p">(</span><span class="n">q.sample</span><span class="p">)</span>
</code></pre></div>
<p>However, it's more effective to visualize these correlations as a heatmap.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Calculate the correlations</span>
<span class="n">cc</span> <span class="o"><-</span> <span class="nf">as.data.frame</span><span class="p">(</span><span class="nf">cor</span><span class="p">(</span><span class="n">q.sample</span><span class="p">,</span> <span class="n">method</span> <span class="o">=</span> <span class="s">'pearson'</span><span class="p">))</span>
<span class="n">cc</span><span class="o">$</span><span class="n">X1</span> <span class="o"><-</span> <span class="nf">factor</span><span class="p">(</span><span class="nf">colnames</span><span class="p">(</span><span class="n">cc</span><span class="p">),</span> <span class="n">ordered</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> <span class="n">levels</span> <span class="o">=</span> <span class="nf">rev</span><span class="p">(</span><span class="nf">names</span><span class="p">(</span><span class="n">q.sample</span><span class="p">)))</span>
<span class="nf">require</span><span class="p">(</span><span class="n">reshape2</span><span class="p">)</span>
<span class="c1"># Reorganize the table for ggplot2</span>
<span class="n">cc</span> <span class="o"><-</span> <span class="nf">melt</span><span class="p">(</span><span class="n">cc</span><span class="p">,</span> <span class="n">id.var</span> <span class="o">=</span> <span class="s">'X1'</span><span class="p">,</span> <span class="n">variable.name</span> <span class="o">=</span> <span class="s">'X2'</span><span class="p">,</span> <span class="n">value.name</span> <span class="o">=</span> <span class="s">'Correlation'</span><span class="p">)</span>
<span class="nf">require</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span>
<span class="nf">require</span><span class="p">(</span><span class="n">RColorBrewer</span><span class="p">)</span>
<span class="n">pal</span> <span class="o"><-</span> <span class="nf">brewer.pal</span><span class="p">(</span><span class="m">9</span><span class="p">,</span> <span class="s">'RdYlBu'</span><span class="p">)</span>
<span class="nf">ggplot</span><span class="p">(</span><span class="n">cc</span><span class="p">,</span>
<span class="n">mapping</span><span class="o">=</span><span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">X1</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">X2</span><span class="p">,</span> <span class="n">fill</span> <span class="o">=</span> <span class="n">Correlation</span><span class="p">))</span> <span class="o">+</span>
<span class="nf">geom_tile</span><span class="p">()</span> <span class="o">+</span>
<span class="nf">scale_fill_gradientn</span><span class="p">(</span><span class="n">colours</span> <span class="o">=</span> <span class="n">pal</span><span class="p">,</span> <span class="n">limits</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">-1</span><span class="p">,</span> <span class="m">1</span><span class="p">))</span> <span class="o">+</span>
<span class="nf">scale_x_discrete</span><span class="p">(</span><span class="n">expand</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">))</span> <span class="o">+</span>
<span class="nf">scale_y_discrete</span><span class="p">(</span><span class="n">expand</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">))</span> <span class="o">+</span>
<span class="nf">coord_equal</span><span class="p">()</span> <span class="o">+</span>
<span class="nf">xlab</span><span class="p">(</span><span class="s">''</span><span class="p">)</span> <span class="o">+</span> <span class="nf">ylab</span><span class="p">(</span><span class="s">''</span><span class="p">)</span> <span class="o">+</span>
<span class="nf">labs</span><span class="p">(</span><span class="n">title</span> <span class="o">=</span> <span class="s">"Pearson's Correlation Coefficients"</span><span class="p">)</span> <span class="o">+</span>
<span class="nf">theme_bw</span><span class="p">()</span> <span class="o">+</span>
<span class="nf">theme</span><span class="p">(</span><span class="n">axis.text</span> <span class="o">=</span> <span class="nf">element_text</span><span class="p">(</span><span class="n">size</span> <span class="o">=</span> <span class="m">16</span><span class="p">),</span>
<span class="n">axis.text.x</span> <span class="o">=</span> <span class="nf">element_text</span><span class="p">(</span><span class="n">angle</span> <span class="o">=</span> <span class="m">90</span><span class="p">,</span> <span class="n">hjust</span> <span class="o">=</span> <span class="m">1</span><span class="p">),</span>
<span class="n">plot.title</span> <span class="o">=</span> <span class="nf">element_text</span><span class="p">(</span><span class="n">vjust</span> <span class="o">=</span> <span class="m">1</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="m">16</span><span class="p">),</span>
<span class="n">legend.title</span> <span class="o">=</span> <span class="nf">element_text</span><span class="p">(</span><span class="n">size</span> <span class="o">=</span> <span class="m">12</span><span class="p">,</span> <span class="n">vjust</span> <span class="o">=</span> <span class="m">2</span><span class="p">),</span>
<span class="n">legend.text</span> <span class="o">=</span> <span class="nf">element_text</span><span class="p">(</span><span class="n">size</span> <span class="o">=</span> <span class="m">12</span><span class="p">),</span>
<span class="n">legend.text.align</span> <span class="o">=</span> <span class="m">1</span><span class="p">)</span>
</code></pre></div>
<p><img alt="The Pearson's correlation coefficients between each respondent." src="http://karthur.org/images/20160921_q_method_heatmap.png"></p>
<h3>Choosing the Number of Factors to Induce</h3>
<p>When using the Q Method, we have to decide <em>a priori</em> how many factors to extract from the dataset.
Recall that the factors correspond to the distinct perspectives or worldviews that the population's Q sorts represent.
We may have some prior information or expert opinion as to the number of perspectives that exist.
For example, in a Q sort of statements related to immigration, we might expect that there would be both pro-immigration and anti-immigration perspectives; therefore, we would try to extract at least 2 factors.</p>
<p>If we don't have any prior information, we can attempt to determine the number of factors from the data alone.
This approach, letting the dataset speak for itself, might begin with a <strong>screeplot,</strong> as below.
The screeplot shows the amount of variance (which we can think of as information content) that is explained by an increasing number of factors.
Think about this plot in terms of moving to the right along the horizontal axis.
As we continue to add factors, we will come to a point where very little information is gained by adding an additional factor.
We want to choose as few factors as possible while still explaining as much of the variance in the data as possible.</p>
<div class="highlight"><pre><span></span><code><span class="nf">screeplot</span><span class="p">(</span><span class="nf">prcomp</span><span class="p">(</span><span class="n">q.sample</span><span class="p">),</span> <span class="n">main</span> <span class="o">=</span> <span class="s">'Screeplot of unrotated factors'</span><span class="p">,</span>
<span class="n">type</span> <span class="o">=</span> <span class="s">'l'</span><span class="p">,</span> <span class="n">lwd</span> <span class="o">=</span> <span class="m">2</span><span class="p">,</span> <span class="n">cex</span> <span class="o">=</span> <span class="m">1.5</span><span class="p">,</span> <span class="n">cex.lab</span> <span class="o">=</span> <span class="m">1.4</span><span class="p">,</span> <span class="n">cex.axis</span> <span class="o">=</span> <span class="m">1.5</span><span class="p">,</span>
<span class="n">cex.main</span> <span class="o">=</span> <span class="m">1.5</span><span class="p">)</span>
</code></pre></div>
<p><img alt="The screeplot shows how much variance is explained (as a drop in total variance) by an increasing number of factors." src="http://karthur.org/images/20160921_q_method_screeplot.png"></p>
<p>Based on this screeplot, I would estimate there are about 4 or 5 factors in the data.</p>
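<p>Another common heuristic, not used above, is the Kaiser criterion: retain only those factors whose eigenvalues of the correlation matrix exceed 1. A minimal sketch, assuming <code>q.sample</code> has been loaded as before:</p>
<div class="highlight"><pre><code># Kaiser criterion: eigenvalues of the respondent correlation matrix
# greater than 1 suggest factors worth retaining
ev <- eigen(cor(q.sample))$values
sum(ev > 1)
</code></pre></div>
<p>The Kaiser criterion is known to over- or under-extract with small samples, so it is best treated as a cross-check on the screeplot rather than a definitive answer.</p>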
<h3>Performing the Factor Analysis</h3>
<p>Once we've decided on how many factors to induce in the data, we can run the final analysis.
I'm using the <code>qmethod</code> package for this analysis.</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">qmethod</span><span class="p">)</span>
<span class="n">results</span> <span class="o"><-</span> <span class="nf">qmethod</span><span class="p">(</span><span class="n">q.sample</span><span class="p">,</span> <span class="n">nfactors</span> <span class="o">=</span> <span class="m">4</span><span class="p">)</span>
</code></pre></div>
<p>We can see the factor loadings with:</p>
<div class="highlight"><pre><span></span><code><span class="nf">summary</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="c1">## Q-method analysis.</span><span class="w"></span>
<span class="c1">## Finished on: Tue Sep 20 10:07:54 2016</span><span class="w"></span>
<span class="c1">## Original data: 21 statements, 15 Q-sorts</span><span class="w"></span>
<span class="c1">## Forced distribution: TRUE</span><span class="w"></span>
<span class="c1">## Number of factors: 4</span><span class="w"></span>
<span class="c1">## Rotation: varimax</span><span class="w"></span>
<span class="c1">## Flagging: automatic</span><span class="w"></span>
<span class="c1">## Correlation coefficient: pearson</span><span class="w"></span>
<span class="c1">##</span><span class="w"></span>
<span class="c1">## Factor scores</span><span class="w"></span>
<span class="c1">## fsc_f1 fsc_f2 fsc_f3 fsc_f4</span><span class="w"></span>
<span class="c1">## 1 1 0 1 1</span><span class="w"></span>
<span class="c1">## 2 3 2 0 0</span><span class="w"></span>
<span class="c1">## 3 2 1 -3 2</span><span class="w"></span>
<span class="c1">## 4 -2 -2 -1 -2</span><span class="w"></span>
<span class="o">...</span><span class="w"></span>
</code></pre></div>
<p>We can also visualize the results as a plot, where the statements are on the y-axis, ordered from highest consensus (at bottom) to highest disagreement (at top).</p>
<div class="highlight"><pre><span></span><code><span class="nf">require</span><span class="p">(</span><span class="n">RColorBrewer</span><span class="p">)</span>
<span class="n">colours</span> <span class="o">=</span> <span class="nf">brewer.pal</span><span class="p">(</span><span class="m">4</span><span class="p">,</span> <span class="s">'Set1'</span><span class="p">)</span>
<span class="nf">plot</span><span class="p">(</span><span class="n">results</span><span class="p">,</span>
<span class="n">ylab</span> <span class="o">=</span> <span class="s">'Statements'</span><span class="p">,</span> <span class="n">colours</span> <span class="o">=</span> <span class="n">colours</span><span class="p">,</span>
<span class="n">main</span> <span class="o">=</span> <span class="s">'Factor Loadings by Concourse Statement'</span><span class="p">)</span>
</code></pre></div>
<p><a href="/images/20160921_q_method_factor_loadings.png"><img style="float:left;margin-right:20px;" src="/images/thumbs/20160921_q_method_factor_loadings_thumbnail_wide.png" /></a></p>
<div style="clear:both;"></div>
<p>We can see that Statement 1 ("A sustainable society is one that has in place informational, social, and institutional mechanisms to keep in check the positive feedback loops that cause exponential population and capital growth") has the second-highest consensus among the perspectives; that is, no matter what worldview an individual respondent most closely aligns with, he or she was very likely to rate this statement the same way as everyone else.
<strong>What are the statements with the highest disagreement?</strong>
Statement 14 ("Growth has no set limits in terms of population or resource use beyond which lies ecological disaster") and Statement 18 ("There is no single pathway to a sustainable future; local experimentation that is pluralistic, incremental, and piecemeal must be allowed so that every possibility for enduring prosperity is considered") saw wide disagreement, indicating that people disagree about limits to growth and whether or not local control is an important component of achieving sustainability.</p>
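<p>The <code>qmethod</code> output can report this directly: if I am reading the package's interface correctly, the <code>qdc</code> element tabulates distinguishing and consensus statements (treat the element name as an assumption to verify against the package documentation for your version):</p>
<div class="highlight"><pre><code># Distinguishing and consensus statements as computed by qmethod;
# 'qdc' is the relevant element of the results list in recent versions
results$qdc
</code></pre></div>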
<h3>Interpreting the Factors</h3>
<p>Using the summary and the plot outputs in the previous section, we can try to interpret the factors as follows.</p>
<h4>Factor 1: Market Reformers</h4>
<p>Factor 1 is distinguished by:</p>
<ul>
<li>A negative response to Statement 14 ("Growth has no set limits...");</li>
<li>A negative response to Statement 6 ("Resource exhaustion and environmental degradation are largely a matter of individuals and other actors pursuing their material interests in uncoordinated and decentralized systems").</li>
</ul>
<p><strong>This suggests that this perspective is skeptical of growth and also about individual responsibility for natural resource problems.</strong></p>
<h4>Factor 2: Radical Political Economists</h4>
<p>Factor 2 is distinguished by:</p>
<ul>
<li>A positive response to Statement 21 ("A focus on individual responses to environmental problems, such as planting trees or recycling, distracts us from the structural and institutional barriers to achieving sustainability");</li>
<li>A negative response to Statement 12 ("Environmental sustainability can be achieved largely through smart consumers making smart choices about where their food is sourced, how their clothing is made, where they live, and how much they buy");</li>
<li>A negative response to Statement 9 ("Environmental conservation is not just ecologically sound, it is good for business and economic growth everywhere").</li>
</ul>
<p><strong>These results suggest that respondents aligned with Factor 2 are conscious of how a broader political economy hampers sustainability and are not convinced that sustainability can be achieved through individual consumer choices.</strong></p>
<h4>Factor 3: Pro-Market Eco-Modernists Acknowledging Sacrifice Zones</h4>
<p>Factor 3 is distinguished by two contradictory statements:</p>
<ul>
<li>A positive response to Statement 14 ("Growth has no set limits...");</li>
<li>A positive response to Statement 15 ("Humans have already appropriated more than our fair share of the earth's finite resource base and further economic growth is impossible").</li>
</ul>
<p>And also:</p>
<ul>
<li>A negative response to Statement 18 ("There is no single pathway to a sustainable future...");</li>
<li>A negative response to Statement 3 ("Advances in science and technology can enhance the carrying capacity of the earth and our resource base");</li>
<li>A positive response to Statement 7 ("A government-led sustainable development initiative is just another futile attempt to replace markets with political management; instead of trying to impose discipline on people's decisions, we should adjust the market's price system");</li>
<li>A negative response to Statement 10 ("Sustainability planning can work through government's creation and enforcement of environmental quality standards");</li>
<li>A negative response to Statement 13 ("Economic growth in all parts of the world is essential to improve the livelihoods of the poor, to sustain growing populations, and eventually to stabilize population levels");</li>
<li>A positive response to Statement 8 ("Inevitable population growth and desirable economic growth can never be accommodated by the earth's resources").</li>
</ul>
<p><strong>This is a confusing array of positions.</strong>
The only sense I can make of this is that respondents aligned with this view are market boosters who think that Western industrial society represents the peak of human efficiency and yet acknowledge that this reality has created ecological sacrifice zones in the rest of the world, which will never catch up.</p>
<h4>Factor 4: Pragmatic Social Engineers</h4>
<p>Factor 4 is distinguished by:</p>
<ul>
<li>A negative response to Statement 21 ("A focus on individual responses to environmental problems...distracts us from the structural and institutional barriers to achieving sustainability");</li>
<li>A positive response to Statement 17 ("In order to avoid the disruption of the earth's life support functions, a centrally coordinated plan for action is needed with enforcement administrated by a global governing body")</li>
<li>A positive response to Statement 10 ("Sustainability planning can work through government's creation and enforcement of environmental quality standards").</li>
</ul>
<p>These statements distinguish an interesting composite view of sustainability issues at both local and global scales.
<strong>This view suggests that individual behaviors should conform to a global development plan, enforced by central governments, for the proper management and conservation of natural resources; it reflects a confidence in the ability of governments to change human behavior.</strong>
These respondents are pragmatic because they don't seem to believe that personal choice alone is enough to achieve sustainability.</p>
<h3>Checking the Factor Loadings</h3>
<p>Finally, we can see how each respondent loads onto each factor (i.e., how they align with each perspective).</p>
<div class="highlight"><pre><span></span><code><span class="n">results</span><span class="o">$</span><span class="n">loa</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code> f1 f2 f3 f4
r1 0.07740761 0.8594596 0.005231051 0.096248126
r2 0.11055604 0.2496759 -0.825336473 0.320306693
r3 0.35672108 0.2638716 -0.057400350 0.771305939
...
</code></pre></div>
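<p>As an aside, each respondent's dominant perspective is just the largest-magnitude loading in their row. Below is a minimal base-R sketch using a stand-in matrix built from the <code>r1</code>–<code>r3</code> values shown above (note that r2's alignment with Factor 3 is <em>negative</em>, i.e., opposed to that view):</p>

```r
# Stand-in for results$loa, copied from the first three rows above
loa <- rbind(
  r1 = c(0.0774, 0.8595,  0.0052, 0.0962),
  r2 = c(0.1106, 0.2497, -0.8253, 0.3203),
  r3 = c(0.3567, 0.2639, -0.0574, 0.7713)
)
colnames(loa) <- paste0("f", 1:4)

# Which factor does each respondent load on most strongly (by magnitude)?
apply(abs(loa), 1, which.max)
```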
<h2>Concluding Remarks</h2>
<p>Three of the four factors identified represent interesting and probably real worldviews.
Moreover, the loadings on these factors persist when I extract 5 factors instead of 4, suggesting they are stable.
Factor 3, however, is hard to interpret.
It likely indicates that the students did not understand Statements 14 and 15 (or, more precisely, did not understand them the same way I did when I chose them).</p>
<p><strong>There are two significant areas for improvement in this example that would be essential next steps were this a real study.</strong>
First, the concourse of communication was drawn from books the students have likely never read; they may not understand what the authors of these statements mean by them.
In general, the concourse of communication should be drawn from written or spoken statements by the individuals forming the P sample.
For urban sustainability studies, this might involve interviewing sustainability practitioners and selecting statements from across multiple individuals, who later will each conduct a Q sort.
Second, the Q sort matrix was probably not well suited for this study.
A larger body of statements and more slots to accommodate them in the tails of the distribution would likely have improved the results.
While there is little advice that I have been able to find on designing these matrices or on selecting the elements of the concourse of communication, there is promise to this method, at least in the classroom.</p>
<h2>References</h2>
<ol>
<li>Exel, J. van, & Graaf, G. de. (2005). Q methodology: A sneak preview.</li>
<li>Brown, S. R. (1993). A primer on Q methodology. <em>Operant Subjectivity</em>, 16(3/4), 91–138.</li>
<li>Ellingsen, I. T., A. A. Thorsen, &amp; I. Størksen. (2014). "Revealing Children's Experiences and Emotions through Q Methodology." <em>Child Development Research</em>, 2014, Article ID 910529.</li>
<li>Zeemering, E. S. (2009). What Does Sustainability Mean to City Officials? <em>Urban Affairs Review</em>, 45(2), 247–273.</li>
<li>O’Neill, S. J., M. Boykoff, S. Niemeyer, &amp; S. A. Day. (2013). "On the use of imagery for climate change engagement." <em>Global Environmental Change</em>, 23, 413–421.</li>
</ol>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
<h1>Diagnostics for fixed effects panel models in R</h1>
<p><em>K. Arthur Endsley, 2016-04-14</em></p>
<p>In working with linear fixed-effects panel models, I discovered that I had to develop goodness-of-fit tests and diagnostics on my own, as the libraries for working with these kinds of models haven't progressed that far yet.</p>
<p><strong>Note:</strong> This post has been updated for clarity and to use the Gapminder dataset instead of my old, proprietary example.</p>
<p>I've recently been working with linear fixed-effects panel models for my research.
This class of models is a special case of more general multi-level or hierarchical models, which have wide applicability for a number of problems.
In hierarchical models, there may be fixed effects, random effects, or both (so-called <em>mixed models</em>); a discussion of the multiple definitions of "fixed effects" is beyond the scope of this post, but Gelman and Hill (2007) or Bolker et al. (2009) are good references for this [<a href="#refs">4,7</a>].
Fixed effects, in the sense of fixed-effects or panel regression, are basically just categorical indicators for each subject or individual in the model.
The way this works without exhausting all of our degrees of freedom is that we have at least two observations <em>over time</em> for each subject (hence: a panel dataset).
One further tweak that leads to the "within" estimator discussed in this post is that each subject's panel data are time-demeaned; that is, the long-term average within each subject is subtracted from all measurements for that subject.</p>
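<p>The time-demeaning step is easy to sketch in base R. Here's a toy three-country panel of my own (not the real data), using <code>ave()</code> to compute each country's long-term mean:</p>

```r
# Toy panel: 3 countries, each observed in 2 years
df <- data.frame(
  country = rep(c("A", "B", "C"), each = 2),
  lifeExp = c(50, 52, 60, 63, 70, 71)
)

# Subtract each country's long-term average from its own observations;
# ave() returns the group mean, repeated for every row in that group
df$lifeExp.demeaned <- df$lifeExp - ave(df$lifeExp, df$country)

# Each country's demeaned values now sum to zero
tapply(df$lifeExp.demeaned, df$country, sum)
```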
<p><strong>Although these models can be fit in R using the built-in <code>lm()</code> function most users are familiar with, there are good reasons to use one of the two dedicated libraries discussed here:</strong></p>
<ul>
<li>For large numbers of fixed effects, <code>lm()</code> can be intractable and return poor results.</li>
<li>The fixed effects also clutter the statistical <code>summary()</code> of your model because they're reported alongside any covariates of interest.</li>
<li>The "fixed-effects transformation" (time-demeaning) is applied automatically (and correctly) without you having to transform your data.</li>
</ul>
<p>In my work, I have about 4000-6000 fixed effects and, fortunately, the R community has delivered two excellent libraries for working with these models: <code>lfe</code> and <code>plm</code>.
A more detailed introduction to these packages can be found in [<a href="#refs">1</a>] and [<a href="#refs">2</a>], respectively.
Here, I'll summarize how to fit these models with each of these packages and how to develop goodness-of-fit tests and tests for the linear model assumptions, which are trickier when working with these packages (as of this writing).</p>
<p>I should state up front that I am going to gloss over much of the statistical red meat, writing, as I usually do, for practitioners rather than statisticians.
Also, there are a variety of flavors of models that can be estimated with this framework.
I'm going to focus on just one type of model, the panel model by the "within" estimator.</p>
<h2>Introduction to Fixed-Effects Panel Models</h2>
<p>Fixed-effects panel models have several salient features for investigating drivers of change.
They originate from the social sciences, where experimental setups allow for intervention-based prospective studies, and from economics, where intervention is typically impossible but inference is needed on observational data alone.
In these <em>prospective</em> studies, a panel of subjects (e.g., patients, children, families) are observed at multiple times (at least twice) over the study period.
<strong>The chief premise behind fixed effects panel models is that each observational unit or individual (e.g., a patient) is used as its own control,</strong> exploiting powerful estimation techniques that remove the effects of any unobserved, time-invariant heterogeneity [<a href="#refs">3,4</a>].
By estimating the effects of parameters of interest <em>within</em> an individual over time, we can eliminate the effects of all unobserved, time-invariant heterogeneity <em>between</em> the observational units [<a href="#refs">5</a>].
This feature has led some investigators to propose fixed-effects panel models for weak causal inference [<a href="#refs">3</a>] as the common problem of <em>omitted variable bias</em> (or "latent variable bias") is removed through differencing.
Causal inference with panel models still requires an assumption of <em>strong exogeneity</em> (simply put: no hidden variables and no feedbacks).</p>
<p>The linear fixed-effects panel model extends the well-known linear model, below.
The response of individual <span class="math">\(i\)</span>, denoted <span class="math">\(y_i\)</span>, is a function of some group mean effect or intercept, <span class="math">\(\alpha\)</span>, one or more predictors, <span class="math">\(\beta x_i\)</span>, and an error term, <span class="math">\(\varepsilon_i\)</span>.</p>
<div class="math">$$
y_i = \alpha + \beta x_i + \varepsilon_i
$$</div>
<p>The basic linear fixed-effect panel model can be formulated as follows, where we add an intercept term for each of the individual units of observation, <span class="math">\(i\)</span>, which are observed at two or more times, <span class="math">\(t\)</span>:</p>
<div class="math">$$
y_{it} = \alpha_i + \beta x_{it} + \varepsilon_{it}
$$</div>
<p><strong>It's important to note that this approach requires multiple observations of each individual.</strong>
Obviously, if the number of observations <span class="math">\(N\)</span> were equal to the number of individuals <span class="math">\(i \in M\)</span>, we would exhaust the degrees of freedom in our model simply by adding <span class="math">\(M\)</span> intercept terms, <span class="math">\(\alpha_1, \ldots, \alpha_M\)</span>.
With as few as two observations <span class="math">\((t \in [1,2])\)</span> of each subject, however, we've doubled the number of observations and the individual intercept terms now correspond to any time-invariant, idiosyncratic change between those two observations.</p>
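<p>To see why this works, here's a small simulated check (my own example, with made-up data) that adding a per-subject intercept and regressing on time-demeaned data yield exactly the same slope estimate:</p>

```r
set.seed(42)
# Simulate a small panel: 10 subjects, 4 time periods each
id <- rep(1:10, each = 4)
x  <- rnorm(40)
a  <- rep(rnorm(10, sd = 5), each = 4)  # subject-specific intercepts
y  <- a + 2 * x + rnorm(40)

# (1) Least-squares dummy variables: one intercept per subject
m.lsdv <- lm(y ~ x + factor(id))

# (2) "Within" estimator: regress time-demeaned y on time-demeaned x
m.within <- lm(I(y - ave(y, id)) ~ I(x - ave(x, id)) - 1)

unname(coef(m.lsdv)["x"])  # identical slope from both approaches
unname(coef(m.within))
```

<p>The standard errors do differ between the two fits (the by-hand demeaned regression doesn't account for the degrees of freedom spent on the intercepts), which is one reason to use a dedicated library rather than demeaning manually.</p>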
<p>This model can be extended further to include both <em>individual</em> fixed effects (as above) and <em>time</em> fixed effects (the "two-ways" model):</p>
<div class="math">$$
y_{it} = \alpha_i + \beta x_{it} + \mu_t + \varepsilon_{it}
$$</div>
<p>Here, <span class="math">\(\mu_t\)</span> is an intercept term specific to the time period of observation; it represents any change over time that affects all observational units in the same way (e.g., the weather or the news in an outpatient study).
These effects can also be thought of as "transitory and idiosyncratic forces acting upon [observational] units (i.e., disturbances)" [<a href="#refs">3</a>].</p>
<p><strong>Thus, there are two basic kinds of fixed-effects panel models that can be estimated using the "within" estimator.</strong>
The "unobserved effects" or <em>individual</em> model accounts for unobserved heterogeneity between individuals by partitioning the error term into two parts.
One part is specific to the individual unit of observation and doesn't change over time while the other is the idiosyncratic error term, <span class="math">\(\varepsilon\)</span>, we are familiar with from basic linear models [<a href="#refs">2</a>].
The second type is an extension of the first, the <em>"two-ways"</em> panel model that includes both individual and time fixed effects.</p>
<p>A further extension of the panel model, one often seen in the literature, is given below.</p>
<div class="math">$$
y_{it} = \alpha_i + \beta x_{it} + \phi z_i + \mu_t + \varepsilon_{it}
$$</div>
<p>While <span class="math">\(\beta\)</span> and <span class="math">\(\varepsilon\)</span> do not differ from their meanings in the basic linear model, <span class="math">\(\alpha_i\)</span> is the individual fixed effect and <span class="math">\(\phi\)</span> is a vector of coefficients for time-invariant, unit-specific effects.
These effects can be estimated in a linear model but are removed in some kinds of estimation of panel models (<span class="math">\(\phi \equiv 0\)</span>).
They are removed in estimation through differencing; since each observational unit is used as its own control, we are unable to distinguish between heterogeneity that we didn't observe (and can't account for)—the kind we wish to remove from our model—and known (observed) differences between observational units (e.g., race or sex in patients) [<a href="#refs">3</a>].</p>
<h2>Fitting Fixed-Effects Panel Models in R</h2>
<p>Let's look at <a href="http://www.gapminder.org/">the Gapminder dataset</a>, a somewhat well-known dataset (owing to <a href="http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen">the TED talk</a> on the subject) on global development indicators, including life expectancy and per-capita gross domestic product (GDP).
<a href="https://github.com/arthur-e/swc-workshop/blob/master/data/gapminder-surveys.csv">You can download the data I used for this example as a CSV file from here.</a>
Below is a sample of the Gapminder data.</p>
<div class="highlight"><pre><span></span><code>> head(panel.data)
country year lifeExp pop gdpPercap
1 Afghanistan 1952 28.801 8425333 779.4453
2 Afghanistan 1957 30.332 9240934 820.8530
3 Afghanistan 1962 31.997 10267083 853.1007
4 Afghanistan 1967 34.020 11537966 836.1971
5 Afghanistan 1972 36.088 13079460 739.9811
6 Afghanistan 1977 38.438 14880372 786.1134
</code></pre></div>
<p>In this example, the observational units are countries.
Here, the <code>country</code> name is the unique identifier for individual subjects (countries) and the <code>year</code> is the identifier for the time period; these are the individual and time fixed effects, respectively.
Both the individual and time fixed effects, <code>country</code> and <code>year</code>, <strong>must be factors where the levels correspond to the individual and time period identifiers, respectively.</strong>
I'll investigate to what extent change in life expectancy (<code>lifeExp</code>) is predicted by change in per-capita GDP (<code>gdpPercap</code>).</p>
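<p>As a sketch of getting the data into this shape (using a two-row stand-in for the CSV so the example is self-contained; with the real file, just pass its path to <code>read.csv()</code>):</p>

```r
# Write a two-row stand-in for the Gapminder CSV to a temporary file
csv <- tempfile(fileext = ".csv")
writeLines(c("country,year,lifeExp,pop,gdpPercap",
             "Afghanistan,1952,28.801,8425333,779.4453",
             "Afghanistan,1957,30.332,9240934,820.8530"), csv)

panel.data <- read.csv(csv)
# The individual and time identifiers must be factors
panel.data$country <- factor(panel.data$country)
panel.data$year <- factor(panel.data$year)
str(panel.data)
```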
<p><strong>Let's start with <code>lfe</code>.</strong>
A basic panel model can be fit using <code>lfe</code> with the provided <code>felm()</code> function.
This approach exploits the vertical bar in the Wilkinson-Rogers syntax (R formulas) to specify the "levels" by which our panel data are organized.
Here, we specify <code>country</code> as the individual fixed effect:</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">Matrix</span><span class="p">)</span>
<span class="nf">library</span><span class="p">(</span><span class="n">lfe</span><span class="p">)</span>
<span class="n">m1a</span> <span class="o"><-</span> <span class="nf">felm</span><span class="p">(</span><span class="n">lifeExp</span> <span class="o">~</span> <span class="n">gdpPercap</span> <span class="o">|</span> <span class="n">country</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">panel.data</span><span class="p">)</span>
</code></pre></div>
<p>We can be more explicit by specifying the <code>contrasts</code> of our model but the result is the same.</p>
<div class="highlight"><pre><span></span><code><span class="n">m1b</span> <span class="o"><-</span> <span class="nf">felm</span><span class="p">(</span><span class="n">lifeExp</span> <span class="o">~</span> <span class="n">gdpPercap</span> <span class="o">|</span> <span class="n">country</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">panel.data</span><span class="p">,</span>
<span class="n">contrasts</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="s">'country'</span><span class="p">,</span> <span class="s">'year'</span><span class="p">))</span>
</code></pre></div>
<p>We can see that our predictor, (change in) per-capita GDP, is highly significant, and we are given two pairs of goodness-of-fit statistics: the multiple and adjusted R-squared for the "full" and "projected" models.
The full model is our model with the individual fixed effects included; the projected model is the estimated model where our fixed effects are not included.
The full model always performs better than the projected model because the individual fixed effects always explain additional variation in the response: they account for any idiosyncratic differences between each observational unit.</p>
<div class="highlight"><pre><span></span><code>> summary(m1a)
Call:
felm(formula = lifeExp ~ gdpPercap | country, data = panel.data)
Residuals:
Min 1Q Median 3Q Max
-31.697 -3.424 0.321 3.684 21.106
Coefficients:
Estimate Std. Error t value Pr(>|t|)
gdpPercap 4.260e-04 3.114e-05 13.68 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.505 on 1561 degrees of freedom
Multiple R-squared(full model): 0.7675 Adjusted R-squared: 0.7464
Multiple R-squared(proj model): 0.1071 Adjusted R-squared: 0.02583
F-statistic(full model): 36.3 on 142 and 1561 DF, p-value: < 2.2e-16
F-statistic(proj model): 187.2 on 1 and 1561 DF, p-value: < 2.2e-16
</code></pre></div>
<p>We can fit the two-ways fixed effects model in <code>lfe</code> simply by adding an additional contrast.</p>
<div class="highlight"><pre><span></span><code><span class="n">m2</span> <span class="o"><-</span> <span class="nf">felm</span><span class="p">(</span><span class="n">lifeExp</span> <span class="o">~</span> <span class="n">gdpPercap</span> <span class="o">|</span> <span class="n">country</span> <span class="o">+</span> <span class="n">year</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">panel.data</span><span class="p">)</span>
</code></pre></div>
<p><strong>Switching to <code>plm</code></strong>, we can fit the two-ways fixed effects model using the <code>plm()</code> function.
The <code>plm</code> library doesn't use the vertical bar to specify fixed effects; rather, it requires us to specify the <code>index</code> argument with the variable names of the individual and time fixed effects, as a character vector (in that order).
We also indicate that the <code>model</code> we want to estimate is the <code>within</code> model and that we are estimating <code>twoways</code> effects.</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">Matrix</span><span class="p">)</span>
<span class="nf">library</span><span class="p">(</span><span class="n">plm</span><span class="p">)</span>
<span class="n">m2</span> <span class="o"><-</span> <span class="nf">plm</span><span class="p">(</span><span class="n">lifeExp</span> <span class="o">~</span> <span class="n">gdpPercap</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">panel.data</span><span class="p">,</span> <span class="n">model</span> <span class="o">=</span> <span class="s">'within'</span><span class="p">,</span>
<span class="n">effect</span> <span class="o">=</span> <span class="s">'twoways'</span><span class="p">,</span> <span class="n">index</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="s">'country'</span><span class="p">,</span> <span class="s">'year'</span><span class="p">))</span>
</code></pre></div>
<p>The output of <code>summary()</code> for <code>plm</code> is different and we get a little more detail in some areas.
We see that we have a balanced panel (same number of observations for each individual) over 142 subjects (countries) and 12 time periods (for a total of 1,704 observations).</p>
<div class="highlight"><pre><span></span><code>> summary(m2)
Twoways effects Within Model
Call:
plm(formula = lifeExp ~ gdpPercap, data = panel.data, effect = "twoways",
model = "within", index = c("country", "year"))
Balanced Panel: n = 142, T = 12, N = 1704
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-22.63728 -1.69344 -0.04944 2.00422 10.11005
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
gdpPercap -7.8017e-05 1.8431e-05 -4.2329 2.442e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 18529
Residual Sum of Squares: 18317
R-Squared: 0.011428
Adj. R-Squared: -0.086154
F-statistic: 17.9177 on 1 and 1550 DF, p-value: 2.442e-05
</code></pre></div>
<p>What we don't see is a goodness-of-fit statistic for the full model.
The value of the <code>R-Squared</code> statistic presented here, when compared to the <code>lfe</code> output, is obviously that of the projected model.</p>
<p><strong>Note that the individual effects model can be fit using <code>plm</code></strong> by removing the second variable from the vector provided to <code>index</code> and specifying an <code>individual</code> effects model:</p>
<div class="highlight"><pre><span></span><code><span class="n">m2</span> <span class="o"><-</span> <span class="nf">plm</span><span class="p">(</span><span class="n">lifeExp</span> <span class="o">~</span> <span class="n">gdpPercap</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">panel.data</span><span class="p">,</span> <span class="n">model</span> <span class="o">=</span> <span class="s">'within'</span><span class="p">,</span>
<span class="n">effect</span> <span class="o">=</span> <span class="s">'individual'</span><span class="p">,</span> <span class="n">index</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="s">'country'</span><span class="p">))</span>
</code></pre></div>
<h2>Goodness-of-Fit for Panel Models</h2>
<p><strong>To get a goodness-of-fit metric for the full model,</strong> we have to calculate various sums-of-squares.
Returning to our basic statistics, we note that:</p>
<div class="math">$$
R^2 = \frac{SSR}{SST} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}
$$</div>
<p>Where <span class="math">\(\hat{y}_i\)</span> and <span class="math">\(\bar{y}\)</span> are the estimated and mean values, respectively, of <span class="math">\(y_i\)</span>, and <span class="math">\(SSR\)</span> and <span class="math">\(SST\)</span> are, respectively, the regression sum-of-squares and the total sum-of-squares, related by the following formula:</p>
<div class="math">$$
\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2\\
SST = SSE + SSR
$$</div>
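<p>This decomposition is easy to verify numerically on an ordinary <code>lm()</code> fit; here's a quick check using R's built-in <code>mtcars</code> data (not the panel model):</p>

```r
# Fit a simple linear model on a built-in dataset
fit <- lm(mpg ~ wt, data = mtcars)

sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum-of-squares
sse <- sum(residuals(fit)^2)                   # error (residual) sum-of-squares
ssr <- sum((fitted(fit) - mean(mtcars$mpg))^2) # regression sum-of-squares

all.equal(sst, sse + ssr)               # the decomposition holds
all.equal(summary(fit)$r.squared, ssr / sst)  # and R-squared is SSR/SST
```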
<p>Thus, if we can calculate <span class="math">\(SSR\)</span> and <span class="math">\(SST\)</span>, we can calculate R-squared.
<span class="math">\(SST\)</span> is easily obtained from our model data, as it depends only on the observed values and mean value of our response:</p>
<div class="math">$$
SST = \sum_{i=1}^n (y_i - \bar{y})^2
$$</div>
<p>In R, this is:</p>
<div class="highlight"><pre><span></span><code><span class="n">sst</span> <span class="o"><-</span> <span class="nf">with</span><span class="p">(</span><span class="n">panel.data</span><span class="p">,</span> <span class="nf">sum</span><span class="p">((</span><span class="n">lifeExp</span> <span class="o">-</span> <span class="nf">mean</span><span class="p">(</span><span class="n">lifeExp</span><span class="p">))</span><span class="o">^</span><span class="m">2</span><span class="p">))</span>
</code></pre></div>
<p>Without deriving the fitted values, <span class="math">\(\hat{y}\)</span>, we can't calculate <span class="math">\(SSR\)</span> or <span class="math">\(SSE\)</span> directly.
I'll get into deriving fitted values later.
For now, we can exploit the fact that <span class="math">\(SST = SSE + SSR\)</span> so as to derive <span class="math">\(SSR\)</span> as the difference between <span class="math">\(SST\)</span> and <span class="math">\(SSE\)</span>, the sum of squared errors.
We can then calculate <span class="math">\(SSE\)</span> from the least-squares criterion:</p>
<div class="math">$$
SSE = \epsilon '\epsilon = (y - X\beta )' (y - X\beta )
$$</div>
<p>In R, this is:</p>
<div class="highlight"><pre><span></span><code><span class="n">m1.sse</span> <span class="o"><-</span> <span class="nf">t</span><span class="p">(</span><span class="nf">residuals</span><span class="p">(</span><span class="n">m1</span><span class="p">))</span> <span class="o">%*%</span> <span class="nf">residuals</span><span class="p">(</span><span class="n">m1</span><span class="p">)</span>
</code></pre></div>
<p>Putting this all together in R, we can derive R-squared as follows.
Recall that, here, <code>lifeExp</code> is my dependent or response variable.</p>
<div class="highlight"><pre><span></span><code><span class="o">></span> <span class="n">sst</span> <span class="o"><-</span> <span class="nf">with</span><span class="p">(</span><span class="n">panel.data</span><span class="p">,</span> <span class="nf">sum</span><span class="p">((</span><span class="n">lifeExp</span> <span class="o">-</span> <span class="nf">mean</span><span class="p">(</span><span class="n">lifeExp</span><span class="p">))</span><span class="o">^</span><span class="m">2</span><span class="p">))</span>
<span class="o">></span> <span class="n">m1.sse</span> <span class="o"><-</span> <span class="nf">t</span><span class="p">(</span><span class="nf">residuals</span><span class="p">(</span><span class="n">m1</span><span class="p">))</span> <span class="o">%*%</span> <span class="nf">residuals</span><span class="p">(</span><span class="n">m1</span><span class="p">)</span>
<span class="o">></span> <span class="p">(</span><span class="n">sst</span> <span class="o">-</span> <span class="n">m1.sse</span><span class="p">)</span> <span class="o">/</span> <span class="n">sst</span>
<span class="p">[,</span><span class="m">1</span><span class="p">]</span>
<span class="p">[</span><span class="m">1</span><span class="p">,]</span> <span class="m">0.7675404</span>
</code></pre></div>
<p>We obtain <span class="math">\(R^2 = 0.7675404\)</span>, which matches the R-squared estimated for the full (individual fixed effects) model by <code>lfe</code>.
<strong>Why bother calculating this when <code>lfe</code> does it for free?</strong>
In my work, I found that <code>lfe</code> and <code>felm()</code> choked on some two-ways panel models I was fitting but, if that's not a problem for you, just use <code>lfe</code>.
<strong>In addition, <code>plm</code> doesn't report goodness-of-fit statistics for the full model at all, so this approach also lets us calculate an adjusted R-squared, which is a better statistic for comparing models with differing numbers of parameters.</strong></p>
<p>Adjusted R-squared is defined as:</p>
<div class="math">$$
\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1} = 1 - \frac{SSE(n-p-1)^{-1}}{SST(n-1)^{-1}}
$$</div>
<p>Provided we have <span class="math">\(R^2\)</span>, here denoted <code>m1.r2</code>, we can calculate <span class="math">\(n\)</span>, the number of observations, and the adjusted R-squared statistic in R as follows:</p>
<div class="highlight"><pre><span></span><code><span class="n">m1.r2</span> <span class="o"><-</span> <span class="p">(</span><span class="n">sst</span> <span class="o">-</span> <span class="n">m1.sse</span><span class="p">)</span> <span class="o">/</span> <span class="n">sst</span>
<span class="n">N</span> <span class="o"><-</span> <span class="nf">dim</span><span class="p">(</span><span class="n">panel.data</span><span class="p">)[</span><span class="m">1</span><span class="p">]</span>
<span class="m">1</span> <span class="o">-</span> <span class="p">(</span><span class="m">1</span> <span class="o">-</span> <span class="n">m1.r2</span><span class="p">)</span><span class="o">*</span><span class="p">((</span><span class="n">N</span> <span class="o">-</span> <span class="m">1</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">N</span> <span class="o">-</span> <span class="nf">length</span><span class="p">(</span><span class="nf">coef</span><span class="p">(</span><span class="n">m1</span><span class="p">))</span> <span class="o">-</span> <span class="m">1</span><span class="p">))</span>
</code></pre></div>
<h3>Calculating Fitted Values</h3>
<p>You might think of extracting the fitted values with the <code>fitted()</code> function in R.
It's not clear to me that this works; the values I get from <code>fitted()</code> for any of the models I've worked with are too small.
Unfortunately, there's no documentation I can find as to how the <code>fitted()</code> function performs on <code>plm()</code> model instances; i.e., <code>?fitted.plm</code> returns nothing, nor does a quick search online.</p>
<p>Luckily, we can recall from elementary statistics that the fitted values can also be calculated as the difference between our observed values and the residuals.</p>
<div class="highlight"><pre><span></span><code>panel.data$lifeExp - residuals(m1)
</code></pre></div>
<p>A quick fitted-versus-observed plot shows we're not doing too badly with this model.</p>
<div class="highlight"><pre><span></span><code># An example fitted-vs-observed plot
plot(panel.data$lifeExp - residuals(m1), panel.data$lifeExp, asp = 1)
abline(0, 1, col = 'red', lty = 'dashed', lwd = 2)
</code></pre></div>
<h3>Calculating Fitted Values for Hypothesis Testing</h3>
<p><strong>What if you want to conduct hypothesis testing using proposed values for one or more main effects?</strong>
For instance, within the Gapminder example, we might ask: what if a certain country's per-capita GDP had grown at a different rate than it actually did?
This is harder to do because the <code>predict()</code> function in R doesn't work out-of-the-box with <code>felm()</code> or <code>plm()</code> models.</p>
<p>This is a clearly made-up example, but let's say that per-capita GDP world-wide was 10% lower in 2002 and 2007 than we actually observed.
Below, I use the <code>dplyr</code> library to transform the data this way.
If you're unfamiliar with <code>dplyr</code>, just know that, below, I am scaling back per-capita GDP estimates in 2002 and 2007 by 10%, merging those years with the rest of the data, and then formatting my data frame so it is arranged the same way as the original.</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span>
<span class="n">panel.data2</span> <span class="o"><-</span> <span class="n">panel.data</span> <span class="o">%>%</span>
<span class="nf">filter</span><span class="p">(</span><span class="n">year</span> <span class="o">>=</span> <span class="m">2002</span><span class="p">)</span> <span class="o">%>%</span>
<span class="nf">mutate</span><span class="p">(</span><span class="n">gdpPercap</span> <span class="o">=</span> <span class="n">gdpPercap</span> <span class="o">-</span> <span class="p">(</span><span class="n">gdpPercap</span> <span class="o">*</span> <span class="m">0.10</span><span class="p">))</span> <span class="o">%>%</span>
<span class="nf">rbind</span><span class="p">(</span><span class="nf">filter</span><span class="p">(</span><span class="n">panel.data</span><span class="p">,</span> <span class="n">year</span> <span class="o"><</span> <span class="m">2002</span><span class="p">))</span> <span class="o">%>%</span>
<span class="nf">arrange</span><span class="p">(</span><span class="n">country</span><span class="p">,</span> <span class="n">year</span><span class="p">)</span>
</code></pre></div>
<p><strong>Now, let's test this hypothesis with <code>plm()</code>.</strong>
Let's also develop a slightly more complex fixed effects regression model.
We have data on population in each year, and this might have something to do with life expectancy; it's a time-varying characteristic of our system, so let's include it in the model.</p>
<div class="highlight"><pre><span></span><code><span class="o">></span> <span class="n">m3</span> <span class="o"><-</span> <span class="nf">plm</span><span class="p">(</span><span class="n">lifeExp</span> <span class="o">~</span> <span class="n">gdpPercap</span> <span class="o">+</span> <span class="n">pop</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">panel.data</span><span class="p">,</span> <span class="n">model</span> <span class="o">=</span> <span class="s">'within'</span><span class="p">,</span>
<span class="n">effect</span> <span class="o">=</span> <span class="s">'individual'</span><span class="p">,</span> <span class="n">index</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="s">'country'</span><span class="p">))</span>
<span class="o">></span> <span class="nf">coef</span><span class="p">(</span><span class="n">m3</span><span class="p">)</span>
<span class="n">gdpPercap</span> <span class="n">pop</span>
<span class="m">3.936623e-04</span> <span class="m">6.196916e-08</span>
</code></pre></div>
<p>We get a similar, though slightly smaller estimate of the effect of (change in) per-capita GDP on (change in) life expectancy.
<strong>Now, how can we obtain the fitted values using the existing model, but new data?</strong>
Again, linear algebra helps us find the answer.
Recall the formula for the individual fixed effects model we saw earlier.</p>
<div class="math">$$
y_{it} = \alpha_i + \beta x_{it} + \phi z_i + \varepsilon_{it}
$$</div>
<p>Because <span class="math">\(z_i\)</span> is really just a repeated index for each subject (observed multiple times), every term is additive other than the main effects <span class="math">\(\beta x_{it}\)</span>.
This means it is easy to calculate the fitted values for a new set of values (for the same subjects).
<strong>In the case of the individual fixed effects model,</strong> we compute the product of the main effects and the new observations of our <span class="math">\(X\)</span> matrix and add it to the fixed effects, which need to be repeated in the order they appear in the original design matrix.
Because our Gapminder data are ordered by country name, then year, for 12 years, we need to repeat each country fixed effect 12 times.
Below, the function <code>fixef()</code> allows us to extract the country fixed effects from a given model.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Get the new X matrix with our new (hypothetical) values</span>
<span class="nb">new</span>.<span class="o">X</span> <- <span class="n">as</span>.<span class="n">matrix</span>(<span class="n">select</span>(<span class="n">panel</span>.<span class="n">data2</span>, <span class="n">gdpPercap</span>, <span class="nb">pop</span>))
<span class="c1"># Repeat each fixed effect 12 times (12 observations for each subject)</span>
<span class="n">fe</span> <- <span class="n">rep</span>(<span class="n">as</span>.<span class="n">numeric</span>(<span class="n">fixef</span>(<span class="n">m3</span>)), <span class="n">each</span> = <span class="mi">12</span>)
<span class="c1"># Compute our predictions</span>
<span class="n">preds</span> <- (<span class="nb">new</span>.<span class="o">X</span> %*% <span class="n">coef</span>(<span class="n">m3</span>)) + <span class="n">fe</span>
</code></pre></div>
<p>It's important to note that this is not "out-of-sample prediction." That wouldn't make sense for a fixed effects regression and would, in fact, be misleading. We have fit fixed effects for each of our subjects (here, countries) and it wouldn't make sense to use this model for a different set of subjects. What we're doing here is testing a counterfactual, e.g., suppose that <span class="math">\(X\)</span> has a different value, what would be the effect on <span class="math">\(Y\)</span>?</p>
<p><strong>We can confirm we've calculated the fitted values correctly by returning to the original dataset and adding the residuals to our fitted values.</strong>
The residuals add the random variation of our original, observed response back into our model; the result is a perfect fit, as seen in how the plot points line up perfectly along the 1:1 line.</p>
<div class="highlight"><pre><span></span><code><span class="n">old.X</span> <span class="o"><-</span> <span class="nf">as.matrix</span><span class="p">(</span><span class="nf">select</span><span class="p">(</span><span class="n">panel.data</span><span class="p">,</span> <span class="n">gdpPercap</span><span class="p">,</span> <span class="n">pop</span><span class="p">))</span>
<span class="c1"># Repeat each fixed effect 12 times (12 observations for each subject)</span>
<span class="n">fe</span> <span class="o"><-</span> <span class="nf">rep</span><span class="p">(</span><span class="nf">as.numeric</span><span class="p">(</span><span class="nf">fixef</span><span class="p">(</span><span class="n">m3</span><span class="p">)),</span> <span class="n">each</span> <span class="o">=</span> <span class="m">12</span><span class="p">)</span>
<span class="c1"># Compute our predictions THIS TIME with the residuals</span>
<span class="n">preds</span> <span class="o"><-</span> <span class="p">(</span><span class="n">new.X</span> <span class="o">%*%</span> <span class="nf">coef</span><span class="p">(</span><span class="n">m3</span><span class="p">))</span> <span class="o">+</span> <span class="n">fe</span> <span class="o">+</span> <span class="nf">residuals</span><span class="p">(</span><span class="n">m3</span><span class="p">)</span>
<span class="nf">plot</span><span class="p">(</span><span class="n">preds</span><span class="p">,</span> <span class="n">panel.data</span><span class="o">$</span><span class="n">lifeExp</span><span class="p">,</span> <span class="n">asp</span> <span class="o">=</span> <span class="m">1</span><span class="p">)</span>
<span class="nf">abline</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="n">col</span> <span class="o">=</span> <span class="s">'red'</span><span class="p">,</span> <span class="n">lty</span> <span class="o">=</span> <span class="s">'dashed'</span><span class="p">,</span> <span class="n">lwd</span> <span class="o">=</span> <span class="m">2</span><span class="p">)</span>
</code></pre></div>
<p><a href="/images/20160416_obs_v_pred.png"><img src="/images/thumbs/20160416_obs_v_pred_thumbnail_wide.png" /></a></p>
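<p>The linear algebra behind these fitted values is easy to sanity-check outside of R as well. Below is a minimal NumPy sketch of the same recipe: the new design matrix times the coefficients, plus each subject's fixed effect repeated once per time period. All of the numbers are made up for illustration and merely stand in for <code>coef(m3)</code> and <code>fixef(m3)</code>.</p>

```python
import numpy as np

# A made-up miniature panel: 3 subjects observed for 4 periods each,
# with 2 regressors (standing in for gdpPercap and pop)
n_subjects, n_periods = 3, 4
rng = np.random.default_rng(42)
X_new = rng.random((n_subjects * n_periods, 2))

beta = np.array([3.9e-4, 6.2e-8])     # stand-in for coef(m3)
alpha = np.array([50.0, 60.0, 70.0])  # stand-in for fixef(m3)

# Repeat each fixed effect once per period, matching the row order of X_new
fe = np.repeat(alpha, n_periods)

# Fitted values under the counterfactual design matrix
preds = X_new @ beta + fe
```

<p>Adding the model residuals back onto <code>preds</code>, as in the R example above, would reproduce the observed response exactly.</p>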
<p><strong>I haven't been able to figure out how to do this for the two-ways model (with time fixed effects) yet.</strong> The result should require the addition of the time fixed effects, but I'm not getting the right result.</p>
<p><strong>The above examples work for <code>lfe::felm()</code> models, too, as <code>fixef()</code> can also extract the fixed effects from models fit with the <code>lfe</code> package.</strong></p>
<h2>Model Diagnostics</h2>
<p>Two critical assumptions of any linear model, including linear fixed-effects panel models, are <em>constant variance</em> (homoskedasticity) and <em>normally distributed errors</em> [6].
We might also want to determine the leverage of our observations to see if there are any highly influential points (which might be outliers).
In addition, since we're working with spatial data (in this case), we'll do a crude check for spatial autocorrelation in the residuals, which, if present, would be problematic for inference.</p>
<h3>Normally Distributed Errors</h3>
<p>This is a quick check using a couple of built-in functions.
We want to see that the residuals in a QQ-normal plot follow a straight line.
Some curvature in the tails is to be expected; this is a somewhat subjective test, but the assumption is also commonly satisfied with observational data.</p>
<div class="highlight"><pre><span></span><code>qqnorm(residuals(m3), ylab = 'Residuals')
qqline(residuals(m3))
</code></pre></div>
<p>We might also examine a histogram of the residuals.</p>
<div class="highlight"><pre><span></span><code>hist(residuals(m1), xlab = 'Residuals')
</code></pre></div>
<h3>Constant Variance</h3>
<p>Here, a plot of the residuals against the fitted values should reveal whether or not there is non-constant variance (heteroskedasticity) in the residuals, which would violate one of the assumptions of our linear model.
We look for whether or not the spread of points is uniform in the y-direction as we move along the x-axis (constant variance along the x-axis).</p>
<div class="highlight"><pre><span></span><code>plot(preds, residuals(m3))
</code></pre></div>
<h3>Checking for Influential Observations</h3>
<p>To derive the <em>leverage</em> of the observations, we wish to derive the <em>hat matrix</em> (or "projection matrix") from our linear model.
This is the matrix representing the linear transformation that maps our observed response values onto the fitted values, <span class="math">\(X\hat{\beta}\)</span>.</p>
<div class="math">$$
X\hat{\beta} = X(X^T X)^{-1} X^Ty = Py
$$</div>
<p>Where <span class="math">\(X\)</span> is the design matrix (matrix of explanatory variables) and <span class="math">\(y\)</span> is the vector of our observed response values.
The hat (or projection) matrix is denoted by <span class="math">\(P\)</span>.
The diagonal of this <span class="math">\(N\times N\)</span> matrix (<code>diag(P)</code> in R) contains the leverages for each observation point.</p>
<p>In R, we use matrix multiplication and the <code>solve()</code> function (to obtain the inverse of a matrix).
With the <code>faraway</code> package, we can draw a half-normal plot which sorts the observations by their leverage.</p>
<div class="highlight"><pre><span></span><code><span class="n">X</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">panel</span><span class="p">.</span><span class="k">data</span><span class="err">[</span><span class="p">,</span><span class="n">c</span><span class="p">(</span><span class="s1">'lifeExp'</span><span class="p">,</span><span class="w"> </span><span class="s1">'gdpPercap'</span><span class="p">)</span><span class="err">]</span><span class="w"></span>
<span class="n">X</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">as</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w"></span>
<span class="n">P</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">X</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w"></span>
<span class="k">require</span><span class="p">(</span><span class="n">faraway</span><span class="p">)</span><span class="w"></span>
<span class="c1"># Create `labs` (labels) for 1 through 1704 observations</span><span class="w"></span>
<span class="n">halfnorm</span><span class="p">(</span><span class="n">diag</span><span class="p">(</span><span class="n">P</span><span class="p">),</span><span class="w"> </span><span class="n">labs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="o">:</span><span class="mi">1704</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Leverages'</span><span class="p">,</span><span class="w"> </span><span class="n">nlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">)</span><span class="w"></span>
</code></pre></div>
<p><strong>This theory and approach are based on the general linear model and ordinary least squares (OLS) regression,</strong> corresponding to models built with <code>lm()</code> in R. Some adjustment may be necessary to calculate leverage correctly for fixed effects models. However, this approach based on OLS seems to work pretty well; the points with the most leverage I find correspond to Kuwait in the 1950s through 1970s, a period when the country's per-capita GDP went on a roller coaster, reaching the highest value seen in the dataset.</p>
<p>If we need to remove influential observations, in order to maintain a balanced panel, it's important to remove all of the observations for that individual; that is, to remove that individual entirely from the panel, rather than just the observation(s) at those time period(s).
In R, we do this by querying those points that have an influence above a certain threshold, say 0.004; here, the individual units of observation are denoted by unique <code>country</code> names.</p>
<div class="highlight"><pre><span></span><code><span class="n">panel.data.mod</span> <span class="o"><-</span> <span class="nf">subset</span><span class="p">(</span><span class="n">panel.data</span><span class="p">,</span>
<span class="o">!</span><span class="p">(</span><span class="n">country</span> <span class="o">%in%</span> <span class="n">panel.data</span><span class="p">[</span><span class="nf">names</span><span class="p">(</span><span class="nf">diag</span><span class="p">(</span><span class="n">P</span><span class="p">)[</span><span class="nf">diag</span><span class="p">(</span><span class="n">P</span><span class="p">)</span> <span class="o">></span> <span class="m">0.004</span><span class="p">]),</span> <span class="s">'country'</span><span class="p">]))</span>
</code></pre></div>
<p><a name='refs'></a></p>
<h2>References</h2>
<ol>
<li>Gaure, S. 2013. lfe: Linear Group Fixed Effects. <em>The R Journal</em> <strong>5</strong>(2):104–117.</li>
<li>Croissant, Y., and G. Millo. 2008. Panel data econometrics in R: The plm package. <em>Journal of Statistical Software</em> <strong>27</strong>(2):1–43.</li>
<li>Halaby, C. N. 2004. Panel Models in Sociological Research: Theory into Practice. <em>Annual Review of Sociology</em> <strong>30</strong>(1):507–544.</li>
<li>Gelman, A., and J. Hill. 2007. <span style="text-decoration:underline">Data Analysis Using Regression and Multilevel/Hierarchical Models.</span> New York, New York, USA: Cambridge University Press.</li>
<li>Allison, P. D. 2009. <span style="text-decoration:underline">Fixed Effects Regression Models</span> ed. T. F. Liao. Thousand Oaks, California, U.S.A.: SAGE.</li>
<li>Crawley, M. J. 2015. <span style="text-decoration:underline">Statistics: An Introduction Using R</span> 2nd ed. Chichester, West Sussex, United Kingdom: John Wiley & Sons.</li>
<li>Bolker, B. M., M. E. Brooks, C. J. Clark, S. W. Geange, J. R. Poulsen, M. H. H. Stevens, and J. S. S. White. 2009. Generalized linear mixed models: a practical guide for ecology and evolution. <em>Trends in Ecology and Evolution</em> <strong>24</strong>(3):127–135.</li>
</ol>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>The Tasseled Cap transformation and band ratios: Applications for urban studies2015-08-03T09:00:00+02:002015-08-03T09:00:00+02:00K. Arthur Endsleytag:karthur.org,2015-08-03:/2015/tasseled-cap-and-band-ratios-for-urban-studies.html<p>Urban environments are heterogeneous at relatively small scales and composed of multiple land cover types but are dominated by vegetation, impervious surface, and soil (V-I-S). Land cover indices such as the Biophysical Composition Index (BCI), based on the tasseled cap transformation, attempt to capture the spatial pattern of these three broad classes of urban land cover. Here I present Python examples for applying the tasseled cap transformation and for calculating the BCI.</p><p>Urban environments are heterogeneous at relatively small scales and composed of a variety of land covers, chiefly including impervious surface, green vegetation (e.g., the urban canopy), and soil.
<strong>Ridd (1995) demonstrated that urban land cover was dominated by these three types, summarized as the Vegetation-Impervious-Soil (VIS) model.</strong>
The model suffers from the difficulties inherent in any urban land cover analysis: there are multiple subtypes of each land cover with different spectral characteristics (e.g., impervious surfaces of various colors, different vegetation species), and impervious surface and soil are easily confused.
Nonetheless, the VIS model is still useful and remains one of the most popular, widely applicable descriptions of urban land cover composition (e.g., Phinn et al. 2002; Small 2003; Kärdi 2007).</p>
<p><strong>How can we identify VIS areas with remote sensing?</strong>
If we can spectrally separate vegetation, impervious surface, and soil, then we may be able to classify a multispectral image on that basis.
At the resolution of many earth-observing sensors (e.g., Landsat MSS, Landsat TM/ETM+, SPOT), however, the ground sample distance of a pixel is much larger than the minimum mapping unit of urban areas (Small 2003).
That is, many if not most of the pixels in the scene are mixtures of two or more VIS components (e.g., vegetation and impervious surface co-located in a suburban development).
Linear spectral mixture analysis (LSMA) is one way of modeling land cover mixtures to estimate the abundance of VIS components on the ground.
LSMA can be challenging and time-consuming to implement, however.</p>
<p>As an alternative, <strong>some investigators have introduced land cover indices based on band ratios that attempt to capture the spatial variability in the three VIS components.</strong>
The Normalized Difference Vegetation Index (NDVI) is probably the most popular example of a land cover index based on band ratios.
NDVI wasn't expressly intended for use in urban areas but many more recent developments are.
These include the Biophysical Composition Index (BCI; Deng and Wu 2012), the Normalized Difference Impervious Surface Index (NDISI; Xu 2010), the Modified NDISI (MNDISI; Liu et al. 2013), and the Ratio Normalized Difference Soil Index (RNDSI; Deng et al. 2015).</p>
<p>Here, I'll focus on only the BCI as it is based on the well-known tasseled cap transformation (Kauth and Thomas 1976), a coordinate transformation akin to principal components analysis (PCA) and other dimension reduction techniques that I've used in spectral mixture analysis.
<strong>I'll present an overview of BCI, the tasseled cap, and examples of implementing both in Python.</strong></p>
<h2>The Tasseled Cap Transformation</h2>
<p>Calculating the BCI requires the tasseled cap transformation, a coordinate transformation of multispectral remote sensing data.
Tasseled cap was originally developed for Landsat MSS 4-band data.
As such, the tasseled cap transformation, as initially described, produced just two principal components termed "brightness" and "greenness."
These principal components have physical meanings, however, and Kauth and Thomas (1976) and later investigators (Crist and Kauth 1986) demonstrated how it could be used to track changes in the phenology of vegetation, particularly agriculture.</p>
<p>The lifecycle of agricultural land proceeds from bare soil to emerging vegetation, to mature green vegetation that completely covers the visible ground area, to senescent yellow vegetation, and finally to bare soil again (<a href="http://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1160&context=lars_symp">Kauth and Thomas 1976, Figure 2</a>).
This lifecycle is accompanied by changes in spectral reflectance of the agricultural land and if pixels in each stage are plotted in multispectral feature space, they trace a line through the space that corresponds to the temporal evolution of the landscape.
They also clearly fall into separate regions.
If a new coordinate system is aligned with these features it is possible to compress over 95% of the information stored in four (4) MSS bands into just two bands ("brightness" and "greenness").
The tasseled cap's name is derived from its appearance in feature space once the coordinate transformation is performed.</p>
<p>The tasseled cap transformation has a compact mathematical notation but requires a table of rotation coefficients (see Kauth and Thomas 1976) for the <span class="math">\(p\times p\)</span> rotation matrix, <span class="math">\(R\)</span>.
A <span class="math">\(p \times n\)</span> vector <span class="math">\(x\)</span> of multispectral data (comprised of <span class="math">\(p\)</span> spectral bands across <span class="math">\(n\)</span> pixels) is transformed to a <span class="math">\(u\)</span> vector by means of the following rotation, where <span class="math">\(e\)</span> is an optional translation that can be used to avoid negative values in the transformed data:</p>
<div class="math">$$
u = Rx + e
$$</div>
<p>Kauth and Thomas provided the coefficients in <span class="math">\(R\)</span> for the Landsat MSS sensor in their 1976 paper that introduced the transformation.
Those Landsat MSS coefficients were based on MSS data provided in counts.
Since then, others have presented versions of the tasseled cap transformation for other platforms and sensors such as Landsat TM surface reflectance (Crist and Cicone 1984), Landsat ETM+ top-of-atmosphere (TOA) reflectance (Huang et al. 2002), MODIS (Lobser et al. 2007), WorldView-2 (Yarbrough et al. 2014), and Landsat 8 TOA reflectance (Baig et al. 2014).</p>
<p><strong>Below is a simple Python implementation of the tasseled cap transformation,</strong> without support for the optional translation.
The coefficients are those for Landsat TM surface reflectance (via Crist and Cicone 1984) but could be replaced with the relevant coefficients for any other sensor data.
Landsat surface reflectance data can be obtained easily from the <a href="http://espa.cr.usgs.gov/index/">USGS EROS Science Processing Architecture (ESPA)</a>.
The function accepts a Landsat TM surface reflectance raster as either a GDAL data source or a <span class="math">\(p\times m\times n\)</span> NumPy array, where <span class="math">\(p\)</span> is the number of bands and <span class="math">\(m, n\)</span> are the number of rows and columns, respectively.</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">tasseled_cap_tm</span><span class="p">(</span><span class="n">rast</span><span class="p">,</span> <span class="n">reflectance</span><span class="o">=</span><span class="kc">True</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Applies the Tasseled Cap transformation for TM data. Assumes that the TM</span>
<span class="sd"> data are TM reflectance data (i.e., Landsat Surface Reflectance). The</span>
<span class="sd"> coefficients for reflectance factor data are taken from Crist (1985) in</span>
<span class="sd"> Remote Sensing of Environment 17:302.</span>
<span class="sd"> '''</span>
<span class="k">if</span> <span class="n">reflectance</span><span class="p">:</span>
<span class="c1"># Reflectance factor coefficients for TM bands 1-5 and 7; they are</span>
<span class="c1"># entered here in tabular form so they are already transposed with</span>
<span class="c1"># respect to the form suggested by Kauth and Thomas (1976)</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span>
<span class="p">(</span> <span class="mf">0.2043</span><span class="p">,</span> <span class="mf">0.4158</span><span class="p">,</span> <span class="mf">0.5524</span><span class="p">,</span> <span class="mf">0.5741</span><span class="p">,</span> <span class="mf">0.3124</span><span class="p">,</span> <span class="mf">0.2303</span><span class="p">),</span>
<span class="p">(</span><span class="o">-</span><span class="mf">0.1603</span><span class="p">,</span><span class="o">-</span><span class="mf">0.2819</span><span class="p">,</span><span class="o">-</span><span class="mf">0.4934</span><span class="p">,</span> <span class="mf">0.7940</span><span class="p">,</span><span class="o">-</span><span class="mf">0.0002</span><span class="p">,</span><span class="o">-</span><span class="mf">0.1446</span><span class="p">),</span>
<span class="p">(</span> <span class="mf">0.0315</span><span class="p">,</span> <span class="mf">0.2021</span><span class="p">,</span> <span class="mf">0.3102</span><span class="p">,</span> <span class="mf">0.1594</span><span class="p">,</span><span class="o">-</span><span class="mf">0.6806</span><span class="p">,</span><span class="o">-</span><span class="mf">0.6109</span><span class="p">),</span>
<span class="p">(</span><span class="o">-</span><span class="mf">0.2117</span><span class="p">,</span><span class="o">-</span><span class="mf">0.0284</span><span class="p">,</span> <span class="mf">0.1302</span><span class="p">,</span><span class="o">-</span><span class="mf">0.1007</span><span class="p">,</span> <span class="mf">0.6529</span><span class="p">,</span><span class="o">-</span><span class="mf">0.7078</span><span class="p">),</span>
<span class="p">(</span><span class="o">-</span><span class="mf">0.8669</span><span class="p">,</span><span class="o">-</span><span class="mf">0.1835</span><span class="p">,</span> <span class="mf">0.3856</span><span class="p">,</span> <span class="mf">0.0408</span><span class="p">,</span><span class="o">-</span><span class="mf">0.1132</span><span class="p">,</span> <span class="mf">0.2272</span><span class="p">),</span>
<span class="p">(</span> <span class="mf">0.3677</span><span class="p">,</span><span class="o">-</span><span class="mf">0.8200</span><span class="p">,</span> <span class="mf">0.4354</span><span class="p">,</span> <span class="mf">0.0518</span><span class="p">,</span><span class="o">-</span><span class="mf">0.0066</span><span class="p">,</span><span class="o">-</span><span class="mf">0.0104</span><span class="p">)</span>
<span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">raise</span> <span class="bp">NotImplemented</span><span class="p">(</span><span class="s1">'Only support for Landsat TM reflectance has been implemented'</span><span class="p">)</span>
<span class="n">shp</span> <span class="o">=</span> <span class="n">rast</span><span class="o">.</span><span class="n">shape</span>
<span class="c1"># Can accept either a gdal.Dataset or numpy.array instance</span>
<span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">rast</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">rast</span><span class="o">.</span><span class="n">ReadAsArray</span><span class="p">()</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">shp</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">shp</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">*</span><span class="n">shp</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">rast</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">shp</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">shp</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">*</span><span class="n">shp</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">shp</span><span class="p">)</span>
</code></pre></div>
<p>When I first implemented this in Python, I naturally wanted to check to make sure I had it right.
So, I plotted the first three tasseled cap (TC) components, Brightness, Greenness, and Wetness, in feature space.
We can compare these images to the figures in Kauth and Thomas (1976) or Crist and Cicone (1984) to make sure we're on the right track.</p>
<p><a href="/images/20150803_tc_feature_space_ax1-2.png"><img style="float:left;margin-right:20px;" src="/images/thumbs/20150803_tc_feature_space_ax1-2_thumbnail_square.png" /></a>
<a href="/images/20150803_tc_feature_space_ax1-3.png"><img style="float:left;margin-right:20px;" src="/images/thumbs/20150803_tc_feature_space_ax1-3_thumbnail_square.png" /></a>
<a href="/images/20150803_tc_feature_space_ax3-2.png"><img style="float:left;" src="/images/thumbs/20150803_tc_feature_space_ax3-2_thumbnail_square.png" /></a></p>
<div style="clear:both;"></div>
<p>The left image shows the Brightness-Greenness plane, where we can see the eponymous tasseled cap itself.
The middle image is the Brightness-Wetness plane and, while it doesn't look exactly like the figure in Crist and Cicone (1984), it's not wildly different.
The right image is the so-called "transition zone" view: the Wetness-Greenness plane.
The transition zone view looks correct, although it has a very interesting fork at the top of the volume.</p>
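<p>Feature-space plots like these can be generated as 2D density plots; below is a minimal sketch using matplotlib's <code>hexbin</code>, where the random array merely stands in for a real tasseled cap raster:</p>

```python
import os
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Placeholder for a (bands, rows, cols) tasseled cap array; in practice this
# would come from the tasseled cap transformation above
tc = np.random.default_rng(0).normal(size=(3, 100, 100))

# The Brightness-Greenness plane as a density plot
plt.hexbin(tc[0].ravel(), tc[1].ravel(), gridsize=50, cmap='YlGnBu')
plt.xlabel('TC1 (Brightness)')
plt.ylabel('TC2 (Greenness)')
plt.savefig('tc_feature_space_bg.png')
saved = os.path.exists('tc_feature_space_bg.png')
```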
<h2>The Biophysical Composition Index (BCI)</h2>
<p>Now that I can calculate the tasseled cap components I can also calculate the BCI.
The BCI was developed as a one-dimensional measure of urban land cover composition based on the observation that the average values of the Brightness and Wetness tasseled cap components are much higher than the average value of the Greenness component.
Deng and Wu (2012) proposed a normalized difference approach that compresses the information in the first three tasseled cap components into a measure defined on the interval <span class="math">\([-1, 1]\)</span>.
It is defined:</p>
<div class="math">$$
BCI = \frac{0.5(H + L) - V}{0.5(H + L) + V}
$$</div>
<p>where <span class="math">\(H, V, L\)</span> refer to the first three tasseled cap components, <span class="math">\(TC1, TC2, TC3\)</span>, normalized such that:</p>
<div class="math">$$
H = \frac{\mathrm{TC1} - \mathrm{TC1}_{min}}{\mathrm{TC1}_{max} - \mathrm{TC1}_{min}}
\quad
V = \frac{\mathrm{TC2} - \mathrm{TC2}_{min}}{\mathrm{TC2}_{max} - \mathrm{TC2}_{min}}
\quad
L = \frac{\mathrm{TC3} - \mathrm{TC3}_{min}}{\mathrm{TC3}_{max} - \mathrm{TC3}_{min}}
$$</div>
<p><strong>Below is a simple Python implementation of the BCI.</strong></p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">biophysical_composition_index</span><span class="p">(</span><span class="n">rast</span><span class="p">,</span> <span class="n">nodata</span><span class="o">=-</span><span class="mi">9999</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Calculates the biophysical composition index (BCI) of Deng and Wu (2012)</span>
<span class="sd"> in Remote Sensing of Environment 127. The input raster is expected to be</span>
<span class="sd"> a tasseled cap-transformed raster. The NoData value is assumed to be</span>
<span class="sd"> negative (could never be the maximum value in a band).</span>
<span class="sd"> '''</span>
<span class="n">shp</span> <span class="o">=</span> <span class="n">rast</span><span class="o">.</span><span class="n">shape</span>
<span class="c1"># Can accept either a gdal.Dataset or numpy.array instance</span>
<span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">rast</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">rast</span><span class="o">.</span><span class="n">ReadAsArray</span><span class="p">()</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">shp</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">shp</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">*</span><span class="n">shp</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">rast</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">shp</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">shp</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">*</span><span class="n">shp</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="n">unit</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="n">shp</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">shp</span><span class="p">[</span><span class="mi">2</span><span class="p">]))</span>
<span class="n">stack</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">):</span>
<span class="c1"># Calculate the minimum values after excluding NoData values</span>
<span class="n">tcmin</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">setdiff1d</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="o">...</span><span class="p">]</span><span class="o">.</span><span class="n">ravel</span><span class="p">(),</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">nodata</span><span class="p">]))</span><span class="o">.</span><span class="n">min</span><span class="p">()</span>
<span class="c1"># Calculate the normalized TC component TC_i for i in {1, 2, 3}</span>
<span class="n">stack</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">divide</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">subtract</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="o">...</span><span class="p">],</span> <span class="n">unit</span> <span class="o">*</span> <span class="n">tcmin</span><span class="p">),</span>
<span class="n">unit</span> <span class="o">*</span> <span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="o">...</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span> <span class="o">-</span> <span class="n">tcmin</span><span class="p">)))</span>
<span class="c1"># Unpack the High-albedo, Vegetation, and Low-albedo components</span>
<span class="n">h</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="n">l</span> <span class="o">=</span> <span class="n">stack</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">divide</span><span class="p">(</span>
<span class="n">np</span><span class="o">.</span><span class="n">subtract</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">divide</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">l</span><span class="p">),</span> <span class="n">unit</span> <span class="o">*</span> <span class="mi">2</span><span class="p">),</span> <span class="n">v</span><span class="p">),</span>
<span class="n">np</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">divide</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">l</span><span class="p">),</span> <span class="n">unit</span> <span class="o">*</span> <span class="mi">2</span><span class="p">),</span> <span class="n">v</span><span class="p">))</span>\
<span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="n">shp</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">shp</span><span class="p">[</span><span class="mi">2</span><span class="p">]))</span>
</code></pre></div>
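<p>As a sanity check, the BCI can also be written directly from the formulas above; the 2-by-2 arrays below are made-up values, purely for illustration:</p>

```python
import numpy as np

def normalize(band, nodata=-9999):
    # Min-max normalize a band, ignoring the NoData value
    valid = band[band != nodata]
    return (band - valid.min()) / (valid.max() - valid.min())

# Hypothetical first three tasseled cap components for a 2x2 scene
tc1 = np.array([[0.40, 0.55], [0.30, 0.20]])  # Brightness
tc2 = np.array([[0.10, 0.60], [0.20, 0.30]])  # Greenness
tc3 = np.array([[0.00, 0.10], [0.05, 0.25]])  # Wetness

h, v, l = normalize(tc1), normalize(tc2), normalize(tc3)
bci = (0.5 * (h + l) - v) / (0.5 * (h + l) + v)
```

Because the numerator can never exceed the denominator in magnitude (all of <code>h</code>, <code>v</code>, <code>l</code> are non-negative after normalization), the result is bounded on <span class="math">\([-1, 1]\)</span> wherever the denominator is nonzero.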
<p>Deng and Wu found a significant positive correlation between the BCI and impervious surface cover and a significant negative correlation between the BCI and vegetation cover.
I can confirm that BCI seems to be an excellent indicator of these land covers, as it is positively correlated with the National Land Cover Dataset (NLCD) impervious surface fraction.
The 2001 NLCD impervious layer (left) and a BCI image calculated from a July 2002 Landsat TM image (right) are shown below; the Pearson's correlation coefficient between the datasets is <span class="math">\(r=0.624\)</span>.</p>
<p><a href="/images/20150803_nlcd_2001_impervious.png"><img style="float:left;" src="/images/thumbs/20150803_nlcd_2001_impervious_thumbnail_square.png" /></a>
<a href="/images/20150803_bci_2002.png"><img style="float:left;" src="/images/thumbs/20150803_bci_2002_thumbnail_square.png" /></a></p>
<div style="clear:both;"></div>
<p>I also compared the BCI to measures of vegetation, impervious surface, and soil derived through spectral mixture analysis of a 1998 Landsat TM image.
The correlation matrix is below and indicates that these independently derived land covers are strongly correlated with the BCI.
In this case, however, the soil fraction from spectral mixture analysis is likely a mixed endmember with non-photosynthetic vegetation and therefore cannot be readily interpreted in this context.</p>
<div class="highlight"><pre><span></span><code>Pearson's Correlation Coefficients (r)
              BCI    Veg.    Imp.    Soil
---------- ------ ------- ------- -------
BCI         1.000  -0.877   0.838   0.538
Vegetation           1.000  -0.724  -0.810
Impervious                   1.000   0.182
Soil                                 1.000
</code></pre></div>
<!--
## Urban Land Cover Composition Needs a Consistent Measure
As an extension of our initial question: **How can we monitor changes in VIS over time?**
Is the BCI a consistent measure of urban land cover composition?
For longitudinal studies of urban areas we wish to derive a metric of land cover composition that can be consistently applied across time and across climatic gradients and ecoregions.
-->
<h2>References</h2>
<ol>
<li>Ridd, M. 1995. Exploring a VIS (vegetation-impervious surface-soil) model for urban ecosystem analysis through remote sensing: comparative anatomy for cities. <em>International Journal of Remote Sensing</em> 16 (12):2165–2185.</li>
<li>Phinn, S., M. Stanford, P. Scarth, A. T. Murray, and P. Shyy. 2002. Monitoring the composition of urban environments based on the vegetation-impervious surface-soil (VIS) model by subpixel analysis techniques. <em>International Journal of Remote Sensing</em> 23 (20):4131–4153.</li>
<li>Small, C. 2003. High spatial resolution spectral mixture analysis of urban reflectance. <em>Remote Sensing of Environment</em> 88 (1-2):170–186.</li>
<li>Kärdi, T. 2007. Remote sensing of urban areas: Linear spectral unmixing of Landsat Thematic Mapper images acquired over Tartu (Estonia). Proc. Estonian Acad. Sci. Biol. Ecol 56 (1):19–32.</li>
<li>Deng, C., and C. Wu. 2012. BCI: A biophysical composition index for remote sensing of urban environments. <em>Remote Sensing of Environment</em> 127:247–259.</li>
<li>Xu, H. 2010. Analysis of Impervious Surface and its Impact on Urban Heat Environment using the Normalized Difference Impervious Surface Index (NDISI). <em>Photogrammetric Engineering & Remote Sensing</em> 76 (5):557–565.</li>
<li>Liu, C., Z. Shao, M. Chen, and H. Luo. 2013. MNDISI: A multi-source composition index for impervious surface area estimation at the individual city scale. <em>Remote Sensing Letters</em> 4 (8):803–812.</li>
<li>Deng, Y., C. Wu, M. Li, and R. Chen. 2015. RNDSI: A ratio normalized difference soil index for remote sensing of urban/suburban environments. <em>International Journal of Applied Earth Observation and Geoinformation</em> 39:40–48.</li>
<li>Kauth, R. J., and G. S. Thomas. 1976. The tasseled cap - A graphic description of the spectral-temporal development of agricultural crops as seen by Landsat. In Proceedings of the Symposium on Machine Processing of Remotely Sensed Data, West Lafayette, Indiana, U.S.A, 29 June-1 July 1976, 41–51.</li>
<li>Crist, E. P., and R. J. Kauth. 1986. The Tasseled Cap de-mystified. <em>Photogrammetric Engineering and Remote Sensing</em> 52 (1):81–86.</li>
<li>Crist, E. P., and R. C. Cicone. 1984. A Physically-Based Transformation of Thematic Mapper Data---The TM Tasseled Cap. <em>IEEE Transactions on Geoscience and Remote Sensing</em> GE-22 (3):256–263.</li>
<li>Huang, C., B. Wylie, L. Yang, C. Homer, and G. Zylstra. 2002. Derivation of a tasseled cap transformation based on Landsat 7 at-satellite reflectance. <em>International Journal of Remote Sensing</em> 23 (8):1741–1748.</li>
<li>Lobser, S. E., and W. B. Cohen. 2007. MODIS tasseled cap: land cover characteristics expressed through transformed MODIS data. <em>International Journal of Remote Sensing</em> 28 (22):5079–5101.</li>
<li>Yarbrough, L. D., K. Navulur, and R. Ravi. 2014. Presentation of the Kauth–Thomas transform for WorldView-2 reflectance data. <em>Remote Sensing Letters</em> 5 (2):131–138.</li>
<li>Baig, M. H. A., L. Zhang, T. Shuai, and Q. Tong. 2014. Derivation of a tasseled cap transformation based on Landsat 8 at-satellite reflectance. <em>Remote Sensing Letters</em> 5 (5):423–431.</li>
</ol>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Masking saturated pixels may improve spectral mixture analysis2015-07-28T15:05:00+02:002015-07-28T15:05:00+02:00K. Arthur Endsleytag:karthur.org,2015-07-28:/2015/masking-saturated-pixels-spectral-mixture-analysis.html<p>The Landsat Surface Reflectance (SR) product sometimes contains saturation in one or more bands (a value of 16,000 reflectance units or 160% reflectance).
Presumably, these correspond to saturation at the detector; the same kind of saturation that is likely to occur over clouds or snow-covered areas.
Such clouds and …</p><p>The Landsat Surface Reflectance (SR) product sometimes contains saturation in one or more bands (a value of 16,000 reflectance units or 160% reflectance).
Presumably, these correspond to saturation at the detector; the same kind of saturation that is likely to occur over clouds or snow-covered areas.
Such clouds and snow can be masked out with a provided masking layer (like CFMask) but there are other land covers like bright soil, deserts, water, and impervious surface that can saturate one or more bands (e.g., Asner et al. 2010; Rashed et al. 2001).</p>
<p>In the summer ("leaf-on" season), we can be confident that, in most areas, snow simply isn't a possible land cover type.
Urban areas are full of bright targets, however, and it seems that almost every surface reflectance image contains a few saturated pixels not due to clouds or snow.
The saturation value (16,000 for Landsat SR; 20,000 for Landsat top-of-atmosphere reflectance) may not be present in all bands—some targets are only thermally bright while others can produce specular reflections in the optical bands (e.g., sun glint).
Thus, it seems that these bright targets aren't always masked by the included quality assurance (QA) layer; there is still spectral reflectance information in the bands that aren't saturated.</p>
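<p>A quick way to see which bands are affected in a given scene is to count saturated pixels per band; a sketch with a made-up two-band array:</p>

```python
import numpy as np

# Hypothetical (bands, rows, cols) surface reflectance stack; 16,000 marks
# saturation in the Landsat SR product
rast = np.array([
    [[5000, 16000], [1200,  3000]],
    [[4000,  2500], [1200, 16000]],
])

# Count of saturated pixels in each band
saturated_per_band = (rast == 16000).sum(axis=(1, 2))
```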
<p>Saturated pixels are easily masked; I'll provide some sample Python code that shows how.
<strong>But can saturated pixels be ignored or are they a problem for certain types of analyses?</strong>
I'm currently in the midst of a study of urban reflectance in Southeast Michigan, using spectral mixture analysis to estimate the fractional abundance of certain land cover types.
To improve the abundance estimates (by improving the signal-to-noise ratio of the input) and to reduce the dimensionality of the data, I use a minimum noise fraction (MNF) transformation, akin to a principal components rotation, which projects the Landsat TM/ETM+ reflectance onto an orthogonal basis.
Although I had masked these saturated pixels from the start I began to wonder what the difference would be if I had left them in.</p>
<p>The example I present here is for a Landsat 7 ETM+ SR image that has been clipped to Oakland County, Michigan (from within WRS-2 row 20, path 30).
The Landsat 7 ETM+ image was captured in July 1999 (ordinal day 196); the image identifier is LE70200301999196EDC00.
The original SR image had the QA mask, provided by the USGS, applied to the image in both cases; where saturation was masked and where it was not.</p>
<h2>Visualizing Saturation in the Mixing Space</h2>
<p>If saturation persists prior to the transformation, the transformed values are scaled differently than when they are first removed.
This can be seen "heads-up" in a GIS by toggling between layers: The MNF-transformed image with saturated values (below, left) and the one without (below, right).</p>
<p><a href="/images/20150728_mnf_with_saturation.png"><img style="float:left;margin-right:20px;" src="/images/thumbs/20150728_mnf_with_saturation_thumbnail_square.png" /></a>
<a href="/images/20150728_mnf_without_saturation.png"><img style="float:left;" src="/images/thumbs/20150728_mnf_without_saturation_thumbnail_square.png" /></a></p>
<div style="clear:both;"></div>
<p>It can also be seen in the histograms (left, with saturation; right, without).</p>
<p><a href="/images/20150728_histogram_with_saturation.png"><img style="float:left;margin-right:20px;" src="/images/thumbs/20150728_histogram_with_saturation_thumbnail_square.png" /></a>
<a href="/images/20150728_histogram_without_saturation.png"><img style="float:left;" src="/images/thumbs/20150728_histogram_without_saturation_thumbnail_square.png" /></a></p>
<div style="clear:both;"></div>
<p>It's interesting to note that the raster image with saturation shows much brighter soil/ cropland areas (yellow in the image) than in the raster image where saturated values were masked.
This would seem to suggest that leaving the fully (and partially) saturated pixels in place prior to transformation could improve the accuracy of the abundance estimation.
The change in the histograms is not so easy to interpret, however.</p>
<p>Another way of visualizing the difference between these two images is to examine their feature spaces.
Below is an example of the mixing space of the image where saturated pixels were not masked; a 2D slice of the feature space showing the first three MNF components.
These are the three components with the highest signal-to-noise ratio (SNR) where the SNR of the first component is greater than that of the second, and so on.
We see that there are several pixels in this mixing space that are far-flung from the main volume.</p>
<p><a href="/images/20150728_mnf_space_with_saturation_example.png"><img style="float:left;margin-right:20px;" src="/images/thumbs/20150728_mnf_space_with_saturation_example_thumbnail_square.png" /></a></p>
<div style="clear:both;"></div>
<p>If you're unaccustomed to looking at images like the one above, take my word for it: This looks wrong.
Or don't take my word for it; see Tompkins et al. (1997), Wu and Murray (2003) or Small (2004) for examples of mixture spaces.
But then what does the feature space look like for the transformed image where saturation was masked?
Below, right, is the feature space for that image.
On the left, again, is the image where saturation persists, yet I've "zoomed in" to the main mixture volume to facilitate a fair comparison between these images.</p>
<p><a href="/images/20150728_mnf_space_with_saturation_ax1-2.png"><img style="float:left;margin-right:20px;" src="/images/thumbs/20150728_mnf_space_with_saturation_ax1-2_thumbnail_square.png" /></a>
<a href="/images/20150728_mnf_space_without_saturation_ax1-2.png"><img style="float:left;margin-right:20px;" src="/images/thumbs/20150728_mnf_space_without_saturation_ax1-2_thumbnail_square.png" /></a></p>
<div style="clear:both;"></div>
<p>Now these mixing spaces look like what we see in the literature: They should be well-defined convex volumes.
They should resemble a ternary diagram.
We can see that in the left-hand plot (with saturation), the third MNF component (represented in color) has such a wide range of values that the mixture volume is all one color; there are extreme values in this component that are not in view.
Those must be the pixels we initially saw scattered throughout the space.
The right-hand plot (without saturation) seems properly scaled.
Note that I've indicated the locations of pixels with known ground cover types in both plots.</p>
<h2>Masking Saturation in Python with GDAL and NumPy</h2>
<p>How did I mask the saturated values, by the way?
It can be tricky when one or more but not all of the bands are saturated at a given pixel.
Below is a Python function, utilizing GDAL and NumPy, that demonstrates how this might be done.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">osgeo</span> <span class="kn">import</span> <span class="n">gdal</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="k">def</span> <span class="nf">mask_saturation</span><span class="p">(</span><span class="n">rast</span><span class="p">,</span> <span class="n">nodata</span><span class="o">=-</span><span class="mi">9999</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Masks out saturated values (surface reflectances of 16000). Arguments:</span>
<span class="sd"> rast A gdal.Dataset or NumPy array</span>
<span class="sd"> nodata The NoData value; defaults to -9999.</span>
<span class="sd"> '''</span>
<span class="c1"># Can accept either a gdal.Dataset or numpy.array instance</span>
<span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">rast</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">):</span>
<span class="n">rast</span> <span class="o">=</span> <span class="n">rast</span><span class="o">.</span><span class="n">ReadAsArray</span><span class="p">()</span>
<span class="c1"># Create a baseline "nothing is saturated in any band" raster</span>
<span class="n">mask</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="n">rast</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">rast</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">]),</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">bool</span><span class="p">)</span>
<span class="c1"># Update the mask for saturation in any band</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">rast</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]):</span>
<span class="n">np</span><span class="o">.</span><span class="n">logical_or</span><span class="p">(</span><span class="n">mask</span><span class="p">,</span>
<span class="n">np</span><span class="o">.</span><span class="n">in1d</span><span class="p">(</span><span class="n">rast</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="o">...</span><span class="p">]</span><span class="o">.</span><span class="n">ravel</span><span class="p">(),</span> <span class="p">(</span><span class="mi">16000</span><span class="p">,))</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">rast</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="o">...</span><span class="p">]</span><span class="o">.</span><span class="n">shape</span><span class="p">),</span>
<span class="n">out</span><span class="o">=</span><span class="n">mask</span><span class="p">)</span>
<span class="c1"># Repeat the NoData value across the bands</span>
<span class="n">np</span><span class="o">.</span><span class="n">place</span><span class="p">(</span><span class="n">rast</span><span class="p">,</span> <span class="n">mask</span><span class="o">.</span><span class="n">repeat</span><span class="p">(</span><span class="n">rast</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">),</span> <span class="p">(</span><span class="n">nodata</span><span class="p">,))</span>
<span class="k">return</span> <span class="n">rast</span>
</code></pre></div>
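<p>With NumPy's boolean reductions, the same masking logic can be written even more compactly; this sketch (with made-up values) is equivalent in effect to the function above for array input:</p>

```python
import numpy as np

nodata = -9999

# Hypothetical (bands, rows, cols) reflectance stack
rast = np.array([
    [[5000, 16000], [1200,  3000]],
    [[4000,  2500], [1200, 16000]],
])

# True wherever ANY band is saturated at that pixel...
mask = np.any(rast == 16000, axis=0)
# ...then mask the pixel in ALL bands
rast[:, mask] = nodata
```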
<h2>What Leads to Better Estimates?</h2>
<p>The differences I've shown above are interesting but what actually leads to more accurate estimation of fractional land cover abundances in spectral mixture analysis?
I performed a linear spectral mixture analysis on the MNF-transformed images using the same four (4) image endmembers: Impervious surface, Green vegetation, Soil, and Shade.
The shade endmember is actually photometric shade: zero reflectance in all bands.
Both water and "NoData" pixels are masked to this value as the linear algebra libraries used cannot work with "NoData" values and masked arrays in NumPy have serious performance drawbacks.</p>
<p>One measure of the "fit" between our abundance estimates and reality is to compare the observed to predicted Landsat ETM+ reflectance, where predicted reflectance is computed by a forward model using the endmember spectra and their estimated fractional abundance in a given pixel.
It is standard practice (e.g., Rashed et al. 2003; Wu and Murray, 2003) to calculate this "fit" as the root mean squared error (RMSE) between the observed reflectance and the predicted reflectance.
Powell et al. (2007) present a formula for the RMSE of a pixel <span class="math">\(i\)</span> that is normalized by the number of endmembers, <span class="math">\(M\)</span>:</p>
<div class="math">$$
RMSE_i = \left(M^{-1} \sum_{k=1}^M e_{(i,k)}^2 \right)^{1/2}
$$</div>
<p>I then normalized the sum of the RMSE values across a large, random subset of the pixels (for performance considerations) by the range in reflectances, <span class="math">\(r\)</span>:</p>
<div class="math">$$
\% RMSE = \frac{\sum_{i=0}^N RMSE_i}{r_{max} - r_{min}}
$$</div>
<p>As the minimum reflectance is always zero (given the presence of shade), <span class="math">\(r_{min} \equiv 0\)</span>.
The abundance images are each compared to their respective, original Landsat ETM+ image; with or without saturated pixels masked out, as the case may be.</p>
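<p>Both formulas translate directly into NumPy; below is a generic sketch (not the exact code used in this analysis), with made-up residuals for a single pixel:</p>

```python
import numpy as np

def pixel_rmse(errors):
    # RMSE for one pixel, normalized by the number of terms M
    e = np.asarray(errors, dtype=float)
    return np.sqrt(np.mean(e ** 2))

def percent_rmse(rmse_values, r_max, r_min=0.0):
    # Sum of per-pixel RMSE values, normalized by the reflectance range;
    # r_min defaults to zero because of the shade endmember
    return np.sum(rmse_values) / (r_max - r_min)

# Made-up residuals for one pixel across M = 4 terms
rmse_0 = pixel_rmse([0.01, -0.02, 0.03, 0.0])
```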
<p>Percent RMSE was calculated for both abundance estimates (with and without saturation) where abundance was estimated two different ways each time: Through a non-negative least squares (NNLS) estimation and through a fully-constrained least squares (FCLS) estimation.
With FCLS, abundance estimates are constrained to be both non-negative and on the interval <span class="math">\([0,1]\)</span>.
With NNLS, the second constraint is dropped.
The final abundance images, with or without saturated pixels, look very similar.
The FCLS image from the estimate without saturated pixels is below.
Impervious surface is mapped to red, vegetation to green, and soil to blue; each pixel's abundance estimates were re-summed to one after shade was subtracted.</p>
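<p>The two estimators can be sketched with SciPy's <code>nnls</code> solver; note that the sum-to-one constraint in FCLS is approximated below by a heavily weighted augmented row, which is a common trick and not necessarily the implementation used in this analysis:</p>

```python
import numpy as np
from scipy.optimize import nnls

def unmix_nnls(endmembers, pixel):
    # endmembers: (bands, M) spectra matrix; pixel: (bands,) reflectance
    # vector; abundances are constrained only to be non-negative
    abundances, _ = nnls(endmembers, pixel)
    return abundances

def unmix_fcls(endmembers, pixel, weight=100.0):
    # Append a heavily weighted row of ones so the least-squares solution
    # approximately sums to one; combined with non-negativity, this also
    # keeps each abundance on the interval [0, 1]
    A = np.vstack([endmembers, weight * np.ones(endmembers.shape[1])])
    b = np.append(pixel, weight)
    abundances, _ = nnls(A, b)
    return abundances
```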
<p><a href="/images/20150728_fcls_abundance_without_saturation.png"><img style="float:left;margin-right:20px;" src="/images/thumbs/20150728_fcls_abundance_without_saturation_thumbnail_square.png" /></a></p>
<div style="clear:both;"></div>
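<p>The re-summing step described above (dropping shade and rescaling each pixel's remaining fractions so they sum to one) might look like this in NumPy; the array layout is my assumption:</p>

```python
import numpy as np

def shade_normalize(abundances, shade_index):
    # abundances: (M, pixels) fractional abundances; drop the shade
    # fraction and rescale each pixel's remaining fractions to sum to one
    keep = np.delete(abundances, shade_index, axis=0)
    totals = keep.sum(axis=0)
    totals[totals == 0] = 1.0  # guard against all-shade pixels
    return keep / totals
```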
<p>Results from the validation are in the table below.</p>
<div class="highlight"><pre><span></span><code>=============================
Saturation?  Approach  %RMSE
-----------  --------  -----
Yes          NNLS       7.9%
No           NNLS       9.2%
Yes          FCLS       1.2%
No           FCLS       3.2%
=============================
</code></pre></div>
<p>It would seem that leaving saturated pixels unmasked results in improved modeling of the ETM+ reflectance.
This is only a first approximation of the accuracy, however; we don't really care about the modeled reflectance.
The true validation is in comparing the abundance estimates to ground data.</p>
<h2>Validation of Abundance Estimates</h2>
<p>Shade and soil were subtracted from the abundance images before validation against manually interpreted land cover from aerial photographs in 90-meter windows centered on random sampling points.
169 sample points were located in Oakland County and the coefficient of determination <span class="math">\(\left( R^2 \right)\)</span>, RMSE, and mean absolute error (MAE) were calculated for each of the abundance images.</p>
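<p>For reference, the three statistics can be computed from paired samples as below (a sketch; here <code>r_squared</code> is the coefficient of determination computed from residuals, which can differ slightly from a squared correlation coefficient):</p>

```python
import numpy as np

def validation_stats(estimated, reference):
    # estimated, reference: 1-D arrays of fractional cover at sample points
    err = estimated - reference
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    ss_res = (err ** 2).sum()
    ss_tot = ((reference - reference.mean()) ** 2).sum()
    r_squared = 1.0 - ss_res / ss_tot
    return r_squared, mae, rmse
```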
<p>Without saturation:</p>
<div class="highlight"><pre><span></span><code>============================
  N     R^2     MAE    RMSE
---  ------  ------  ------
169  0.4742  0.1534  0.1998
============================
</code></pre></div>
<p>With saturated pixels retained:</p>
<div class="highlight"><pre><span></span><code>============================
  N     R^2     MAE    RMSE
---  ------  ------  ------
169  0.4360  0.1557  0.2069
============================
</code></pre></div>
<h2>Conclusion</h2>
<p>Despite the better performance in the forward model, the image in which saturated pixels were retained provided worse estimates than the image in which they were masked.
<strong>Removing the saturated pixels seems to improve the accuracy of abundance estimates.</strong>
The magnitude of the improvement is small, and we cannot be sure from this test case alone that the improvement is not simply a coincidence&mdash;that is, while removing these saturated pixels improved the results in this case, the saturated pixels in another image might conspire to produce results equal or superior to estimates where they are removed.</p>
<p>However, given the topology of the feature space, the most likely explanation for this marginal improvement in accuracy is the removal of saturated pixels.
And perhaps the most important reason for masking saturated pixels is that it makes interpretation of the mixing space much easier.</p>
<h2>References</h2>
<ol>
<li>Asner, G. P., and K. Heidebrecht. 2002. Spectral unmixing of vegetation, soil and dry carbon cover in arid regions: Comparing multispectral and hyperspectral observations. International Journal of Remote Sensing 23 (19):3939–3958.</li>
<li>Rashed, T., J. R. Weeks, M. S. Gadalla, and A. G. Hill. 2001. Revealing the Anatomy of Cities through Spectral Mixture Analysis of Multispectral Satellite Imagery: A Case Study of the Greater Cairo Region, Egypt. Geocarto International 16 (4):7–18.</li>
<li>Rashed, T., J. R. Weeks, D. A. Roberts, J. Rogan, and R. L. Powell. 2003. Measuring the Physical Composition of Urban Morphology Using Multiple Endmember Spectral Mixture Models. Photogrammetric Engineering &amp; Remote Sensing 69 (9):1011–1020.</li>
<li>Green, A. A., M. Berman, P. Switzer, and M. D. Craig. 1988. A transformation for ordering multispectral data in terms of image quality with implications for noise removal. IEEE Transactions on Geoscience and Remote Sensing 26 (1):65–74.</li>
<li>Tompkins, S., J. F. Mustard, C. M. Pieters, and D. W. Forsyth. 1997. Optimization of Endmembers for Spectral Mixture Analysis. Remote Sensing of Environment 59 (3):472–489.</li>
<li>Wu, C., and A. T. Murray. 2003. Estimating impervious surface distribution by spectral mixture analysis. Remote Sensing of Environment 84 (4):493–505.</li>
<li>Small, C. 2004. The Landsat ETM+ spectral mixing space. Remote Sensing of Environment 93 (1-2):1–17.</li>
<li>Powell, R. L., D. A. Roberts, P. E. Dennison, and L. L. Hess. 2007. Sub-pixel mapping of urban land cover using multiple endmember spectral mixture analysis: Manaus, Brazil. Remote Sensing of Environment 106 (2):253–267.</li>
</ol>
<h1>Clipping rasters in Python</h1>
<p><em>K. Arthur Endsley, 2015-07-17</em></p>
<p>Clipping rasters can be trivial with a desktop GIS like QGIS or with command line tools like GDAL.
However, I recently ran into a situation where I needed to clip large rasters in an automated, online Python process.
It simply wouldn't do to interrupt the procedure and clip them myself.
The Python bindings for GDAL/OGR are pretty neat but they are very low-level; how could I use Python to clip the rasters as part of a continuous Python session?</p>
<p>Jared Erickson posted <a href="http://pcjericks.github.io/py-gdalogr-cookbook/raster_layers.html#clip-a-geotiff-with-shapefile">an excellent tutorial on this topic</a>, one of many in his <a href="http://pcjericks.github.io/py-gdalogr-cookbook/index.html">"Python GDAL/OGR Cookbook"</a>.
Even this simple example hints at how complicated something like clipping can be with the low-level GDAL/OGR API.
However, there are still some limitations to the example he provided:</p>
<ul>
<li>It only works for single clipping features.</li>
<li>It only works for multiple-band rasters.</li>
<li>The clipping features must lie entirely within the raster's bounds.</li>
</ul>
<p>Here I'll present a solution to the second and third points: <strong>Clipping a raster, stacked or single-band, with clipping features that extend beyond its bounds.</strong></p>
<h2>Improvement: Support for Single-Band Images</h2>
<p>This is an easy fix.
We simply need a <code>try...except</code> statement to anticipate the <code>IndexError</code> raised by our NumPy indexing when a single-band raster is presented.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Multi-band image?</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">clip</span> <span class="o">=</span> <span class="n">rast</span><span class="p">[:,</span> <span class="n">ulY</span><span class="p">:</span><span class="n">lrY</span><span class="p">,</span> <span class="n">ulX</span><span class="p">:</span><span class="n">lrX</span><span class="p">]</span>
<span class="c1"># Nope: Must be single-band</span>
<span class="k">except</span> <span class="ne">IndexError</span><span class="p">:</span>
<span class="n">clip</span> <span class="o">=</span> <span class="n">rast</span><span class="p">[</span><span class="n">ulY</span><span class="p">:</span><span class="n">lrY</span><span class="p">,</span> <span class="n">ulX</span><span class="p">:</span><span class="n">lrX</span><span class="p">]</span>
</code></pre></div>
<h2>Improvement: Out-of-Bounds Clipping Features</h2>
<p>The second part is not so easy.
The clipping in Jared's example is based on NumPy arrays, so the mask and raster arrays must be conformal (have the same shape).
One thing that can happen when the clipping features extend "above" a raster's extent is that the computed pixel coordinate for the upper-left corner becomes negative&mdash;and negative pixel coordinates don't make sense!
We have to catch this case, remember the negative pixel coordinate, and set it to zero.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># If the clipping features extend out-of-bounds and ABOVE the raster...</span>
<span class="k">if</span> <span class="n">gt</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">maxY</span><span class="p">:</span>
<span class="c1"># In such a case... ulY ends up being negative--can't have that!</span>
<span class="n">iY</span> <span class="o">=</span> <span class="n">ulY</span>
<span class="n">ulY</span> <span class="o">=</span> <span class="mi">0</span>
</code></pre></div>
<p>However, in doing this, the clipping features are effectively "pushed down" (southwards) because we prevented the upper-left corner coordinate from being negative.
To compensate, we must pull the clipping features "back up" (northwards).
Note that here we need to add an <code>else</code> clause for the case when the clipping features don't extend above the raster.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># If the clipping features extend out-of-bounds and ABOVE the raster...</span>
<span class="k">if</span> <span class="n">gt</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">maxY</span><span class="p">:</span>
<span class="c1"># The clip features were "pushed down" to match the bounds of the</span>
<span class="c1"># raster; this step "pulls" them back up</span>
<span class="n">premask</span> <span class="o">=</span> <span class="n">image_to_array</span><span class="p">(</span><span class="n">raster_poly</span><span class="p">)</span>
<span class="c1"># We slice out the piece of our clip features that are "off the map"</span>
<span class="n">mask</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">((</span><span class="n">premask</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="o">-</span> <span class="nb">abs</span><span class="p">(</span><span class="n">iY</span><span class="p">),</span> <span class="n">premask</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]),</span> <span class="n">premask</span><span class="o">.</span><span class="n">dtype</span><span class="p">)</span>
<span class="n">mask</span><span class="p">[:]</span> <span class="o">=</span> <span class="n">premask</span><span class="p">[</span><span class="nb">abs</span><span class="p">(</span><span class="n">iY</span><span class="p">):,</span> <span class="p">:]</span>
<span class="n">mask</span><span class="o">.</span><span class="n">resize</span><span class="p">(</span><span class="n">premask</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="c1"># Then fill in from the bottom</span>
<span class="c1"># Most importantly, push the clipped piece down</span>
<span class="n">gt2</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">maxY</span> <span class="o">-</span> <span class="p">(</span><span class="n">maxY</span> <span class="o">-</span> <span class="n">gt</span><span class="p">[</span><span class="mi">3</span><span class="p">])</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">mask</span> <span class="o">=</span> <span class="n">image_to_array</span><span class="p">(</span><span class="n">raster_poly</span><span class="p">)</span>
</code></pre></div>
<p>Finally, we have to deal with the converse problem: clipping features that extend below the raster's bounds.
This is trickier.
What I've done is created a larger NumPy array onto which the clipping features are projected.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Clip the image using the mask</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">clip</span> <span class="o">=</span> <span class="n">gdalnumeric</span><span class="o">.</span><span class="n">choose</span><span class="p">(</span><span class="n">mask</span><span class="p">,</span> <span class="p">(</span><span class="n">clip</span><span class="p">,</span> <span class="n">nodata</span><span class="p">))</span>
<span class="c1"># If the clipping features extend out-of-bounds and BELOW the raster...</span>
<span class="k">except</span> <span class="ne">ValueError</span><span class="p">:</span>
<span class="c1"># We have to cut the clipping features to the raster!</span>
<span class="n">rshp</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">mask</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">if</span> <span class="n">mask</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="o">!=</span> <span class="n">clip</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]:</span>
<span class="n">rshp</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">clip</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span>
<span class="k">if</span> <span class="n">mask</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="n">clip</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]:</span>
<span class="n">rshp</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">clip</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">mask</span><span class="o">.</span><span class="n">resize</span><span class="p">(</span><span class="o">*</span><span class="n">rshp</span><span class="p">,</span> <span class="n">refcheck</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">clip</span> <span class="o">=</span> <span class="n">gdalnumeric</span><span class="o">.</span><span class="n">choose</span><span class="p">(</span><span class="n">mask</span><span class="p">,</span> <span class="p">(</span><span class="n">clip</span><span class="p">,</span> <span class="n">nodata</span><span class="p">))</span>
</code></pre></div>
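<p>The same idea can be expressed as a small, self-contained helper that pads or crops the rasterized mask until it is conformal with the clipped array (a plain-NumPy sketch; recall that, in the mask's convention, 1 marks pixels outside the clipping features):</p>

```python
import numpy as np

def conform(mask, shape):
    # Pad (with 1 = outside the clip features) or crop the mask so its
    # shape matches that of the clipped raster array
    out = np.ones(shape, dtype=mask.dtype)
    rows = min(shape[0], mask.shape[0])
    cols = min(shape[1], mask.shape[1])
    out[:rows, :cols] = mask[:rows, :cols]
    return out
```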
<h2>The Complete Source Code</h2>
<p>Here's a complete Python function for clipping a raster.
It follows Jared's example but expands on the comments throughout and includes the two improvements I've described.
It does not yet support clipping by multiple features at once.
Also, I haven't evaluated its behavior with clipping features that extend beyond only the left or right edge of the raster (I haven't come across that case), so I'm not sure whether it would be a problem.</p>
<p>The function's API consists of two required arguments: The raster to clip and the file path to the clipping features (e.g., a Shapefile).
An optional GDAL GeoTransform can be provided.
Also, the NoData value can be set.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">osgeo</span> <span class="kn">import</span> <span class="n">gdal</span><span class="p">,</span> <span class="n">gdalnumeric</span><span class="p">,</span> <span class="n">ogr</span>
<span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span><span class="p">,</span> <span class="n">ImageDraw</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="k">def</span> <span class="nf">clip_raster</span><span class="p">(</span><span class="n">rast</span><span class="p">,</span> <span class="n">features_path</span><span class="p">,</span> <span class="n">gt</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">nodata</span><span class="o">=-</span><span class="mi">9999</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Clips a raster (given as either a gdal.Dataset or as a numpy.array</span>
<span class="sd"> instance) to a polygon layer provided by a Shapefile (or other vector</span>
<span class="sd"> layer). If a numpy.array is given, a "GeoTransform" must be provided</span>
<span class="sd"> (via dataset.GetGeoTransform() in GDAL). Returns an array. Clip features</span>
<span class="sd"> must be a dissolved, single-part geometry (not multi-part). Modified from:</span>
<span class="sd"> http://pcjericks.github.io/py-gdalogr-cookbook/raster_layers.html</span>
<span class="sd"> #clip-a-geotiff-with-shapefile</span>
<span class="sd"> Arguments:</span>
<span class="sd"> rast A gdal.Dataset or a NumPy array</span>
<span class="sd"> features_path The path to the clipping features</span>
<span class="sd"> gt An optional GDAL GeoTransform to use instead</span>
<span class="sd"> nodata The NoData value; defaults to -9999.</span>
<span class="sd"> '''</span>
<span class="k">def</span> <span class="nf">array_to_image</span><span class="p">(</span><span class="n">a</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Converts a gdalnumeric array to a Python Imaging Library (PIL) Image.</span>
<span class="sd"> '''</span>
<span class="n">i</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">frombytes</span><span class="p">(</span><span class="s1">'L'</span><span class="p">,(</span><span class="n">a</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">a</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span>
<span class="p">(</span><span class="n">a</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s1">'b'</span><span class="p">))</span><span class="o">.</span><span class="n">tobytes</span><span class="p">())</span>
<span class="k">return</span> <span class="n">i</span>
<span class="k">def</span> <span class="nf">image_to_array</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Converts a Python Imaging Library (PIL) array to a gdalnumeric image.</span>
<span class="sd"> '''</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">gdalnumeric</span><span class="o">.</span><span class="n">fromstring</span><span class="p">(</span><span class="n">i</span><span class="o">.</span><span class="n">tobytes</span><span class="p">(),</span> <span class="s1">'b'</span><span class="p">)</span>
<span class="n">a</span><span class="o">.</span><span class="n">shape</span> <span class="o">=</span> <span class="n">i</span><span class="o">.</span><span class="n">im</span><span class="o">.</span><span class="n">size</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">i</span><span class="o">.</span><span class="n">im</span><span class="o">.</span><span class="n">size</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">return</span> <span class="n">a</span>
<span class="k">def</span> <span class="nf">world_to_pixel</span><span class="p">(</span><span class="n">geo_matrix</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Uses a gdal geomatrix (gdal.GetGeoTransform()) to calculate</span>
<span class="sd"> the pixel location of a geospatial coordinate; from:</span>
<span class="sd"> http://pcjericks.github.io/py-gdalogr-cookbook/raster_layers.html#clip-a-geotiff-with-shapefile</span>
<span class="sd"> '''</span>
<span class="n">ulX</span> <span class="o">=</span> <span class="n">geo_matrix</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">ulY</span> <span class="o">=</span> <span class="n">geo_matrix</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span>
<span class="n">xDist</span> <span class="o">=</span> <span class="n">geo_matrix</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">yDist</span> <span class="o">=</span> <span class="n">geo_matrix</span><span class="p">[</span><span class="mi">5</span><span class="p">]</span>
<span class="n">rtnX</span> <span class="o">=</span> <span class="n">geo_matrix</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span>
<span class="n">rtnY</span> <span class="o">=</span> <span class="n">geo_matrix</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span>
<span class="n">pixel</span> <span class="o">=</span> <span class="nb">int</span><span class="p">((</span><span class="n">x</span> <span class="o">-</span> <span class="n">ulX</span><span class="p">)</span> <span class="o">/</span> <span class="n">xDist</span><span class="p">)</span>
<span class="n">line</span> <span class="o">=</span> <span class="nb">int</span><span class="p">((</span><span class="n">ulY</span> <span class="o">-</span> <span class="n">y</span><span class="p">)</span> <span class="o">/</span> <span class="n">xDist</span><span class="p">)</span> <span class="c1"># Assumes square pixels (xDist == |yDist|)</span>
<span class="k">return</span> <span class="p">(</span><span class="n">pixel</span><span class="p">,</span> <span class="n">line</span><span class="p">)</span>
<span class="c1"># Can accept either a gdal.Dataset or numpy.array instance</span>
<span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">rast</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">):</span>
<span class="n">gt</span> <span class="o">=</span> <span class="n">rast</span><span class="o">.</span><span class="n">GetGeoTransform</span><span class="p">()</span>
<span class="n">rast</span> <span class="o">=</span> <span class="n">rast</span><span class="o">.</span><span class="n">ReadAsArray</span><span class="p">()</span>
<span class="c1"># Create an OGR layer from a boundary shapefile</span>
<span class="n">features</span> <span class="o">=</span> <span class="n">ogr</span><span class="o">.</span><span class="n">Open</span><span class="p">(</span><span class="n">features_path</span><span class="p">)</span>
<span class="k">if</span> <span class="n">features</span><span class="o">.</span><span class="n">GetDriver</span><span class="p">()</span><span class="o">.</span><span class="n">GetName</span><span class="p">()</span> <span class="o">==</span> <span class="s1">'ESRI Shapefile'</span><span class="p">:</span>
<span class="n">lyr</span> <span class="o">=</span> <span class="n">features</span><span class="o">.</span><span class="n">GetLayer</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">splitext</span><span class="p">(</span><span class="n">features_path</span><span class="p">)[</span><span class="mi">0</span><span class="p">])[</span><span class="mi">1</span><span class="p">])</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">lyr</span> <span class="o">=</span> <span class="n">features</span><span class="o">.</span><span class="n">GetLayer</span><span class="p">()</span>
<span class="c1"># Get the first feature</span>
<span class="n">poly</span> <span class="o">=</span> <span class="n">lyr</span><span class="o">.</span><span class="n">GetNextFeature</span><span class="p">()</span>
<span class="c1"># Convert the layer extent to image pixel coordinates</span>
<span class="n">minX</span><span class="p">,</span> <span class="n">maxX</span><span class="p">,</span> <span class="n">minY</span><span class="p">,</span> <span class="n">maxY</span> <span class="o">=</span> <span class="n">lyr</span><span class="o">.</span><span class="n">GetExtent</span><span class="p">()</span>
<span class="n">ulX</span><span class="p">,</span> <span class="n">ulY</span> <span class="o">=</span> <span class="n">world_to_pixel</span><span class="p">(</span><span class="n">gt</span><span class="p">,</span> <span class="n">minX</span><span class="p">,</span> <span class="n">maxY</span><span class="p">)</span>
<span class="n">lrX</span><span class="p">,</span> <span class="n">lrY</span> <span class="o">=</span> <span class="n">world_to_pixel</span><span class="p">(</span><span class="n">gt</span><span class="p">,</span> <span class="n">maxX</span><span class="p">,</span> <span class="n">minY</span><span class="p">)</span>
<span class="c1"># Calculate the pixel size of the new image</span>
<span class="n">pxWidth</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">lrX</span> <span class="o">-</span> <span class="n">ulX</span><span class="p">)</span>
<span class="n">pxHeight</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">lrY</span> <span class="o">-</span> <span class="n">ulY</span><span class="p">)</span>
<span class="c1"># If the clipping features extend out-of-bounds and ABOVE the raster...</span>
<span class="k">if</span> <span class="n">gt</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">maxY</span><span class="p">:</span>
<span class="c1"># In such a case... ulY ends up being negative--can't have that!</span>
<span class="n">iY</span> <span class="o">=</span> <span class="n">ulY</span>
<span class="n">ulY</span> <span class="o">=</span> <span class="mi">0</span>
<span class="c1"># Multi-band image?</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">clip</span> <span class="o">=</span> <span class="n">rast</span><span class="p">[:,</span> <span class="n">ulY</span><span class="p">:</span><span class="n">lrY</span><span class="p">,</span> <span class="n">ulX</span><span class="p">:</span><span class="n">lrX</span><span class="p">]</span>
<span class="k">except</span> <span class="ne">IndexError</span><span class="p">:</span>
<span class="n">clip</span> <span class="o">=</span> <span class="n">rast</span><span class="p">[</span><span class="n">ulY</span><span class="p">:</span><span class="n">lrY</span><span class="p">,</span> <span class="n">ulX</span><span class="p">:</span><span class="n">lrX</span><span class="p">]</span>
<span class="c1"># Create a new geomatrix for the image</span>
<span class="n">gt2</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">gt</span><span class="p">)</span>
<span class="n">gt2</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">minX</span>
<span class="n">gt2</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">maxY</span>
<span class="c1"># Map points to pixels for drawing the boundary on a blank 8-bit,</span>
<span class="c1"># black and white, mask image.</span>
<span class="n">points</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">pixels</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">geom</span> <span class="o">=</span> <span class="n">poly</span><span class="o">.</span><span class="n">GetGeometryRef</span><span class="p">()</span>
<span class="n">pts</span> <span class="o">=</span> <span class="n">geom</span><span class="o">.</span><span class="n">GetGeometryRef</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">pts</span><span class="o">.</span><span class="n">GetPointCount</span><span class="p">()):</span>
<span class="n">points</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">pts</span><span class="o">.</span><span class="n">GetX</span><span class="p">(</span><span class="n">p</span><span class="p">),</span> <span class="n">pts</span><span class="o">.</span><span class="n">GetY</span><span class="p">(</span><span class="n">p</span><span class="p">)))</span>
<span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">points</span><span class="p">:</span>
<span class="n">pixels</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">world_to_pixel</span><span class="p">(</span><span class="n">gt2</span><span class="p">,</span> <span class="n">p</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">p</span><span class="p">[</span><span class="mi">1</span><span class="p">]))</span>
<span class="n">raster_poly</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="s1">'L'</span><span class="p">,</span> <span class="p">(</span><span class="n">pxWidth</span><span class="p">,</span> <span class="n">pxHeight</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">rasterize</span> <span class="o">=</span> <span class="n">ImageDraw</span><span class="o">.</span><span class="n">Draw</span><span class="p">(</span><span class="n">raster_poly</span><span class="p">)</span>
<span class="n">rasterize</span><span class="o">.</span><span class="n">polygon</span><span class="p">(</span><span class="n">pixels</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="c1"># Fill with zeroes</span>
<span class="c1"># If the clipping features extend out-of-bounds and ABOVE the raster...</span>
<span class="k">if</span> <span class="n">gt</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">maxY</span><span class="p">:</span>
<span class="c1"># The clip features were "pushed down" to match the bounds of the</span>
<span class="c1"># raster; this step "pulls" them back up</span>
<span class="n">premask</span> <span class="o">=</span> <span class="n">image_to_array</span><span class="p">(</span><span class="n">raster_poly</span><span class="p">)</span>
<span class="c1"># We slice out the piece of our clip features that are "off the map"</span>
<span class="n">mask</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">((</span><span class="n">premask</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="o">-</span> <span class="nb">abs</span><span class="p">(</span><span class="n">iY</span><span class="p">),</span> <span class="n">premask</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]),</span> <span class="n">premask</span><span class="o">.</span><span class="n">dtype</span><span class="p">)</span>
<span class="n">mask</span><span class="p">[:]</span> <span class="o">=</span> <span class="n">premask</span><span class="p">[</span><span class="nb">abs</span><span class="p">(</span><span class="n">iY</span><span class="p">):,</span> <span class="p">:]</span>
<span class="n">mask</span><span class="o">.</span><span class="n">resize</span><span class="p">(</span><span class="n">premask</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="c1"># Then fill in from the bottom</span>
<span class="c1"># Most importantly, push the clipped piece down</span>
<span class="n">gt2</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">maxY</span> <span class="o">-</span> <span class="p">(</span><span class="n">maxY</span> <span class="o">-</span> <span class="n">gt</span><span class="p">[</span><span class="mi">3</span><span class="p">])</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">mask</span> <span class="o">=</span> <span class="n">image_to_array</span><span class="p">(</span><span class="n">raster_poly</span><span class="p">)</span>
<span class="c1"># Clip the image using the mask</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">clip</span> <span class="o">=</span> <span class="n">gdalnumeric</span><span class="o">.</span><span class="n">choose</span><span class="p">(</span><span class="n">mask</span><span class="p">,</span> <span class="p">(</span><span class="n">clip</span><span class="p">,</span> <span class="n">nodata</span><span class="p">))</span>
<span class="c1"># If the clipping features extend out-of-bounds and BELOW the raster...</span>
<span class="k">except</span> <span class="ne">ValueError</span><span class="p">:</span>
<span class="c1"># We have to cut the clipping features to the raster!</span>
<span class="n">rshp</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">mask</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">if</span> <span class="n">mask</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="o">!=</span> <span class="n">clip</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]:</span>
<span class="n">rshp</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">clip</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span>
<span class="k">if</span> <span class="n">mask</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="n">clip</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]:</span>
<span class="n">rshp</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">clip</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">mask</span><span class="o">.</span><span class="n">resize</span><span class="p">(</span><span class="o">*</span><span class="n">rshp</span><span class="p">,</span> <span class="n">refcheck</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">clip</span> <span class="o">=</span> <span class="n">gdalnumeric</span><span class="o">.</span><span class="n">choose</span><span class="p">(</span><span class="n">mask</span><span class="p">,</span> <span class="p">(</span><span class="n">clip</span><span class="p">,</span> <span class="n">nodata</span><span class="p">))</span>
<span class="k">return</span> <span class="p">(</span><span class="n">clip</span><span class="p">,</span> <span class="n">ulX</span><span class="p">,</span> <span class="n">ulY</span><span class="p">,</span> <span class="n">gt2</span><span class="p">)</span>
</code></pre></div>Identifying water bodies from Landsat TM/ETM+ with density slicing, machine learning2015-06-22T09:45:00+02:002015-06-22T09:45:00+02:00K. Arthur Endsleytag:karthur.org,2015-06-22:/2015/water-bodies-landsat-density-slicing-machine-learning.html<p>I recently needed to develop a way of detecting water bodies in a Landsat image in order to mask them. Here's a surprisingly robust solution that's easy to implement.</p><p>I recently found myself in need of a comprehensive water body mask I could generate from and apply to Landsat TM/ETM+ images.
Based on my past experience and the available literature (e.g., Frazier et al. 2000), I knew that any solution was best sourced from Landsat's thermal bands.
I wasn't entirely sure what to expect from a solution based only on spectral data, but Frazier et al. (2000) presented a very robust solution using only density slicing of Landsat TM Band 5 (1.55-1.75 um).</p>
<p>I decided I would test density slicing and compare the results to a simple supervised classification, just as Frazier et al. (2000) had done.
They selected a maximum likelihood classifier; I selected a naive Bayes classifier.
The study area is the Oakland County, Michigan subset of a Landsat 5 image from July 1999.
Oakland County has several small and mid-size lakes throughout its extent that make for a compelling example.
The examples below require, among other tools, the <a href="http://www.gdal.org/gdal_utilities.html">GDAL</a> and <a href="http://www.gdal.org/ogr_utilities.html">OGR</a> command line utilities.</p>
<h2>Data Preparation</h2>
<p>Supervised classification requires examples of the classes we wish to identify; in this case, water bodies and non-water areas.
Example areas of water and non-water were generated in two different ways:</p>
<ol>
<li>In one iteration, the water areas were generated directly from hydrology data provided by <a href="http://www.mcgi.state.mi.us/mgdl/">the Michigan Geographic Data Library (MiGDL)</a>.</li>
<li>In the second iteration, I prepared examples of water and non-water areas myself, from random samples of the study area.</li>
</ol>
<p>For the second part, I wrote a previous article on <a href="http://karthur.org/2015/generating-sample-validation-data-bash-qgis.html">how to generate sampling areas on the same grid as a Landsat image</a>; see that article for details.
Here, I proceed as if the human-produced training data are already prepared.</p>
<h3>Projection Issues</h3>
<p>Use of MiGDL data can be frustrating because the data are stored in the <a href="http://spatialreference.org/ref/esri/102123/">Michigan State Plane projection</a>.
In clipping the lake and stream data to Oakland county, for instance, I had to ensure my clipping features were in the same projection.
I re-projected the clipped data to UTM 17N (WGS84) afterwards.</p>
<div class="highlight"><pre><span></span><code><span class="nv">MICHIGAN_GEOREF</span><span class="o">=</span><span class="s1">'PROJCS["NAD_1983_Michigan_GeoRef_Meters",GEOGCS["GCS_North_American_1983",DATUM["North_American_Datum_1983",SPHEROID["GRS_1980",6378137,298.257222101]],PRIMEM["Greenwich",0],UNIT["Degree",0.017453292519943295]],PROJECTION["Hotine_Oblique_Mercator"],PARAMETER["False_Easting",2546731.496],PARAMETER["False_Northing",-4354009.816],PARAMETER["Scale_Factor",0.9996],PARAMETER["Azimuth",337.255555555556],PARAMETER["Longitude_Of_Center",-86],PARAMETER["Latitude_Of_Center",45.30916666666666],UNIT["Meter",1],AUTHORITY["EPSG","102123"]]'</span>
<span class="c1"># Create clipping features in Michigan GeoRef</span>
ogr2ogr clip_features.shp Oakland_county.shp -t_srs <span class="nv">$MICHIGAN_GEOREF</span>
<span class="c1"># Perform the clip</span>
ogr2ogr -f <span class="s2">"ESRI Shapefile"</span> -s_srs <span class="nv">$MICHIGAN_GEOREF</span><span class="se">\</span>
hydropoly_miv14a_clip.shp hydropoly_miv14a.shp<span class="se">\</span>
-t_srs <span class="s2">"EPSG:32617"</span> -clipsrc clip_features.shp
</code></pre></div>
<h3>Cleaning the Hydrology Exemplar Data</h3>
<p>After reprojection, the lake and stream areas delineated in the MiGDL dataset were filtered, discarding water bodies smaller than 20 acres, which might be ephemeral.</p>
<div class="highlight"><pre><span></span><code>ogr2ogr -where <span class="s2">"ACRES >= 20"</span> hydropoly_miv14a_clip_gte_20ac.shp hydropoly_miv14a_clip.shp
</code></pre></div>
<p>Next, the filtered water bodies were shrunk by buffering their edges inward by 90 meters (a negative buffer).
This is to account for a lack of spatial fit between the MiGDL data and the Landsat data and also eliminates uncertainty in our water area model due to nearshore vegetation cover.</p>
<div class="highlight"><pre><span></span><code>ogr2ogr hydropoly_miv14a_clip_gte_20ac_shrink_90m.shp hydropoly_miv14a_clip_gte_20ac.shp -dialect sqlite -sql <span class="s2">"SELECT GID, ACRES, SQKM, ST_Buffer(Geometry, -90) FROM hydropoly_miv14a_clip_gte_20ac"</span>
</code></pre></div>
<p>The final step in preparing known examples of water cover is to rasterize the layer of known water bodies and clip it to the area of analysis.
For this, I used my own <a href="https://github.com/arthur-e/gdal_extent.py">gdal_extent.py</a> utility and the <code>gdal_rasterize</code> <a href="http://www.gdal.org/gdal_rasterize.html">command line utility</a>.
In this iteration, all Landsat pixels not intersected by the water area examples are considered to be examples of land areas.</p>
<div class="highlight"><pre><span></span><code><span class="nv">EXTENT</span><span class="o">=</span><span class="k">$(</span>gdal_extent.py LandsatTM_raster.tiff<span class="k">)</span>
<span class="nv">SIZE</span><span class="o">=</span><span class="k">$(</span>gdal_extent.py -size LandsatTM_raster.tiff<span class="k">)</span>
gdal_rasterize -burn <span class="m">1</span> -ts <span class="nv">$SIZE</span> -tr <span class="m">30</span> <span class="m">30</span> -te <span class="nv">$EXTENT</span> hydropoly_miv14a_clip_gte_20ac_shrink_90m.shp hydropoly_miv14a_clip_gte_20ac_shrink_90m.tiff
</code></pre></div>
<p><strong>In the second iteration,</strong> the water bodies generated above are validated in Google Earth by a human analyst.
Those water bodies that intersect land are removed from the water examples layer.
After applying machine learning in the first iteration, it was discovered that some man-made reservoirs were incorrectly identified as land areas.
These reservoir areas were added to the water examples layer for the second iteration.</p>
<p>In contrast to the previous approach, land areas were explicitly identified for this iteration.
Rectangular areas of 300 meters squared were generated on a grid aligned with the Landsat data, randomly sampled, and then validated in Google Earth by a human analyst (me), discarding any areas that intersected a water body.
The remaining areas were used as exemplars of land areas.</p>
<h2>Results: Supervised Classification</h2>
<p>A Gaussian naive Bayes estimator was applied to known water and non-water areas.
A stratified 10-fold cross-validation was applied to the water examples generated from the MiGDL dataset (without human validation).
The mean precision and mean recall of the cross-validation folds are presented below.
The results are very good except for the precision with which we can detect water, which is poor.</p>
<div class="highlight"><pre><span></span><code>----------------------------
Class      Precision  Recall
---------  ---------  ------
Not water  0.9997     0.9787
Water      0.3090     0.9763
----------------------------
</code></pre></div>
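<p>To make the classification step concrete, here is a minimal, from-scratch Gaussian naive Bayes in pure Python. This is only a sketch: in practice a library implementation is the sensible choice (scikit-learn's <code>GaussianNB</code> together with <code>StratifiedKFold</code> covers both the estimator and the cross-validation), and the function names below are illustrative, not from the original workflow.</p>

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate per-class log-priors and per-feature means/variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    n = float(len(X))
    stats = {}
    for label, rows in by_class.items():
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        # Floor the variance to avoid division by zero for constant features
        varis = [max(sum((v - m) ** 2 for v in c) / len(c), 1e-9)
                 for c, m in zip(cols, means)]
        stats[label] = (math.log(len(rows) / n), means, varis)
    return stats

def predict_gaussian_nb(stats, row):
    """Return the class label with the highest posterior log-probability."""
    def log_posterior(label):
        log_prior, means, varis = stats[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * v) - ((x - m) ** 2) / (2 * v)
            for x, m, v in zip(row, means, varis))
    return max(stats, key=log_posterior)
```

<p>Each training row would hold a pixel's band values and each label would be water (1) or not water (0); a stratified K-fold split over those rows yields per-class precision and recall figures like the ones tabulated above.</p>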
<p>After human validation was used to better discriminate land and water in the training data, naive Bayes was applied again and the precision for the water class is much improved.</p>
<div class="highlight"><pre><span></span><code>----------------------------
Class      Precision  Recall
---------  ---------  ------
Not water  0.9675     1.0000
Water      1.0000     0.9664
----------------------------
</code></pre></div>
<p>These outstanding results are the first indication that a water discrimination technique based solely on spectral information might be very effective.
In the last step, we'll apply density slicing of Landsat TM Band 5.</p>
<h2>Results: Density Slicing</h2>
<p>If we examine histograms of segregated water and land areas (below), we can see that there is a sharp divide between the spectral responses of these land cover types in Band 5.</p>
<p><a href="/images/20150622_density_slice_water.png"><img style="float:left;margin-right:20px;" src="/images/thumbs/20150622_density_slice_water_thumbnail_wide.png" /></a>
<a href="/images/20150622_density_slice_land.png"><img style="float:left;margin-right:20px;clear:right;" src="/images/thumbs/20150622_density_slice_land_thumbnail_wide.png" /></a></p>
<div style="clear:both;"></div>
<p>There are two cutoffs we might choose: less than or equal to 500 or 1000 reflectance units.
In addition, we might want to sieve the results to reduce commission error.
Thus, I experimented with four combinations of thresholds and sieving: a cutoff at either 500 or 1000 reflectance units with or without sieving.
We find that sieving had no effect on the performance as measured by precision and recall:</p>
<div class="highlight"><pre><span></span><code>------------------------------------------
Approach     Class      Precision  Recall
-----------  ---------  ---------  ------
B5 &lt;= 500    Not water  0.94       1.00
B5 &lt;= 500    Water      1.00       0.96
B5 &lt;= 1000   Not water  0.96       1.00
B5 &lt;= 1000   Water      1.00       0.98
------------------------------------------
</code></pre></div>
<p>And that, on average, a cutoff of 1000 reflectance units performs slightly better:</p>
<div class="highlight"><pre><span></span><code>------------------------------
Approach     Precision  Recall
-----------  ---------  ------
B5 &lt;= 500    0.98       0.98
B5 &lt;= 1000   0.99       0.99
------------------------------
</code></pre></div>
<p>Density slicing performs exceptionally well!
It performs even better than the naive Bayes estimator.
The performance of simple density slicing seems surprising but, then again, we chose the slicing thresholds by examining the Band 5 histograms, where it was apparent that the land and water pixels separate somewhere between 500 and 1000 reflectance units.
The real question, then, is whether this relationship holds up across Landsat images (across time and space).
In my experience, it works very well for Landsat surface reflectance (SR) data from other dates, though I haven't tried this yet outside of southeast Michigan.</p>
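<p>The slicing operation itself is a one-line comparison; a minimal sketch on a 2-D list of Band 5 surface reflectance values (the <code>NODATA</code> fill value here is an assumption for illustration, not a value from the original workflow):</p>

```python
NODATA = -9999  # assumed fill value for masked pixels

def density_slice(band5, threshold=1000):
    """Label each pixel 1 (water) if its Band 5 reflectance is at or below
    the threshold, else 0 (not water); fill values are never water."""
    return [[1 if (px != NODATA and px <= threshold) else 0 for px in row]
            for row in band5]
```

<p>With a NumPy array, the equivalent slice is simply <code>(band5 &lt;= threshold).astype('uint8')</code>, with the NODATA pixels masked out beforehand.</p>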
<h2>References</h2>
<ol>
<li>Frazier, P. S., and K. J. Page. 2000. Water Body Detection and Delineation with Landsat TM Data. Photogrammetric Engineering &amp; Remote Sensing 66 (12):1461–1467.</li>
</ol>Generating sample validation points with the Unix Shell and QGIS2015-06-09T18:30:00+02:002015-06-09T18:30:00+02:00K. Arthur Endsleytag:karthur.org,2015-06-09:/2015/generating-sample-validation-data-bash-qgis.html<p>For ongoing work I'm doing with Landsat data, I recently needed to generate some quick validation points against high-resolution aerial photography.
I wanted to generate a fixed number of random rectangles, each 90 meters on a side (3 by 3 Landsat pixels), within the extent of a high-resolution aerial photograph I was using as …</p><p>For ongoing work I'm doing with Landsat data, I recently needed to generate some quick validation points against high-resolution aerial photography.
I wanted to generate a fixed number of random rectangles, each 90 meters on a side (3 by 3 Landsat pixels), within the extent of a high-resolution aerial photograph I was using as "ground data."
More specifically, we want to generate a grid of 90-meter-square polygons, distributed throughout that extent, that aligns with our Landsat image.</p>
<p>The steps involved are:</p>
<ol>
<li>Clip the Landsat image to the bounds of (the intersection with) my aerial photograph.</li>
<li>Resample this clip to the desired resolution of my validation window (e.g., 90 meters).</li>
<li>Use QGIS to generate a grid from the coarsened Landsat clip.</li>
<li>Randomly sample from the generated grid.</li>
</ol>
<h2>Clipping the Landsat Image to the Airphoto Reference</h2>
<p>So, we want, e.g., 100 validation points within the bounds of the aerial photograph.
To create the validation points within a defined extent, we first need to measure the extent of the reference image.
The <code>gdalinfo</code> utility is useful for this (as is the image properties dialog in QGIS).
I recently created <a href="https://github.com/arthur-e/gdal_extent.py">a GDAL-esque tool that can help automate this process</a> called <code>gdal_extent.py</code>.
After using it to extract the bounds of the aerial photograph, <code>airphoto.tiff</code> as a GeoJSON file, I can convert it to a Shapefile, which is the expected format for cutline features when clipping with GDAL.
As the input file to <code>ogr2ogr</code> is GeoJSON, I have to tell it the spatial reference system (SRS) it should expect from the source file with the <code>-s_srs</code> switch.</p>
<div class="highlight"><pre><span></span><code>python gdal_extent.py -geojson airphoto.tiff > extent.json
ogr2ogr -f <span class="s2">"ESRI Shapefile"</span> -s_srs <span class="s2">"EPSG:32617"</span> extent.shp extent.json
</code></pre></div>
<p>Alternatively, I could have just created a Shapefile that represents the bounds of my aerial photograph any other way.
Also, my approach generates a rectangular bounding box whereas we could use any more complex polygon.
So, assuming that you have a file <code>extent.shp</code> that somehow represents the bounds of your image, it's time to clip our Landsat image (<code>L7Image.tiff</code>) to this extent.
I'll use <code>gdalwarp</code>; note that I specify the output resolution as 90 meters in both the horizontal and vertical directions.</p>
<div class="highlight"><pre><span></span><code>gdalwarp -cutline extent.shp -cl extent -crop_to_cutline -tr <span class="m">90</span> <span class="m">90</span> L7Image.tiff grid_90m.tiff
</code></pre></div>
<h2>Generating a Sampling Grid in QGIS</h2>
<p>This is where QGIS comes in.
A close cousin of ArcGIS' Fishnet tool, QGIS' "Vector grid" tool (under "Vector, Research Tools") is what I'll use to convert the pixels of my Landsat image to polygons.</p>
<p><img alt="The Vector grid tool is a dear invention." src="http://karthur.org/images/20150609_qgis_vector_grid_tool.png"></p>
<p>The output Shapefile should resemble a grid of pixels.</p>
<h2>Selecting a Subset of Random Validation points</h2>
<p>In the last step, I must downselect from this grid of pixels a random subset.
Here, I'll present two techniques for introducing a stochastic selection process.</p>
<h3>Random Selection within the Database</h3>
<p>The simplest way would be to let a file database randomly select features from my polygon grid.
In this example, I'll generate KML output because, for another application, I want to validate my points in Google Earth.
The output Shapefile from the Vector grid tool is <code>vector_grid.shp</code> and we're downselecting 100 validation points:</p>
<div class="highlight"><pre><span></span><code>ogr2ogr -f <span class="s2">"KML"</span> -dialect <span class="s2">"sqlite"</span> -sql <span class="s2">"SELECT * FROM vector_grid ORDER BY RANDOM() LIMIT 100"</span> validation_points.kml vector_grid.shp
</code></pre></div>
<h3>Random Selection with Bash</h3>
<p>If you don't trust the randomization capabilities of a database or have your own reason for generating random features, you might prefer this approach.
In this example, the output Shapefile from the Vector grid tool is <code>vector_grid.shp</code>; we'll use a combination of Bash built-ins and GDAL tools here:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Generate 100 randoms, reverse the string, cut the last character, reverse again</span>
<span class="nv">PIXELS</span><span class="o">=</span><span class="k">$(</span>ogrinfo vector_grid.shp -al <span class="p">|</span> grep <span class="s2">"Feature Count: "</span> <span class="p">|</span> cut -c <span class="m">16</span>-<span class="k">)</span>
<span class="nv">RANDS</span><span class="o">=</span><span class="s2">"</span><span class="k">$(</span>shuf -i <span class="m">0</span>-<span class="nv">$PIXELS</span> -n <span class="m">100</span> <span class="p">|</span> tr <span class="s1">'\n'</span> <span class="s1">','</span> <span class="p">|</span> rev <span class="p">|</span> cut -c <span class="m">2</span>- <span class="p">|</span> rev<span class="k">)</span><span class="s2">"</span>
ogr2ogr -f <span class="s2">"ESRI Shapefile"</span> -where <span class="s2">"ID IN (</span><span class="nv">$RANDS</span><span class="s2">)"</span> validation_points.shp vector_grid.shp
</code></pre></div>
<p><strong>As an added bonus, this approach allows us to create fields for validating certain quantities directly in our Shapefile.</strong>
This is a great way to automate the generation of validation points for the worthwhile consideration of the undergraduate interns, graduate students, and other valuable research assistants working with you.
For example, if I want to validate the fraction of vegetation cover (<code>VegFrac</code>) and impervious surface cover (<code>ImpFrac</code>) and record that information directly in my Shapefile (while editing it in QGIS, later), I simply add those fields as part of a <code>SELECT</code> statement that includes my previous <code>WHERE</code> condition:</p>
<div class="highlight"><pre><span></span><code>ogr2ogr -f <span class="s2">"ESRI Shapefile"</span> -sql <span class="s2">"SELECT ID, 0.0 AS VegFrac, 0.0 AS ImpFrac FROM vector_grid WHERE ID IN (</span><span class="nv">$RANDS</span><span class="s2">)"</span> validation_points.shp vector_grid.shp
</code></pre></div>
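<p>If you would rather build the random ID list in Python than with shell string gymnastics, here is a small sketch (the feature count would come from your own layer, and the <code>field</code> default matches the <code>ID</code> column used above):</p>

```python
import random

def random_id_clause(n_features, n_samples=100, field="ID"):
    """Build a WHERE clause selecting a random subset of feature IDs,
    suitable for passing to ogr2ogr with the -where switch."""
    ids = random.sample(range(n_features), n_samples)
    return "{} IN ({})".format(field, ",".join(str(i) for i in ids))
```

<p>The resulting string can be dropped straight into the <code>-where</code> or <code>-sql</code> arguments shown above.</p>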
<h2>In Conclusion</h2>
<p>Now I'm ready to analyze the areas specified in <code>validation_points.shp</code> by going through them, one at a time, in QGIS or another desktop GIS tool.
Technically, I used more than Bash and QGIS but GDAL really is the most fundamental tool in Bash, right?</p>LEDAPS installation on Ubuntu GNU/Linux2015-05-06T11:30:00+02:002015-05-06T11:30:00+02:00K. Arthur Endsleytag:karthur.org,2015-05-06:/2015/ledaps-installation-ubuntu.html<p>The Landsat Ecosystem Disturbance Adaptive Processing System (LEDAPS) is a software system for generating surface reflectance data for Landsat 4, 5, and 7 TM or ETM+ sensors. The installation can be difficult, so I've prepared a guide based on my last successful installation of version 2.2.0.</p><p><strong>Update:</strong> The Landsat Ecosystem Disturbance Adaptive Processing System (LEDAPS) is a software system for generating surface reflectance data for Landsat 4, 5, and 7 TM or ETM+ sensors.
The installation can be difficult, so I've prepared a guide based on my last successful installation of version 2.2.0.
Since I first authored this article, I've heard reports of various difficulties I didn't encounter, including that the auxiliary files are no longer available.
On the other hand, new releases are now on GitHub and the documentation seems to have improved (over <a href="https://code.google.com/p/ledaps/wiki/Version_2_2_0">the original on Google Code</a>).
I am now downloading surface reflectance data directly and in bulk from <a href="http://espa.cr.usgs.gov/">the USGS EROS Science Products Architecture (ESPA)</a>, so I haven't needed to use LEDAPS myself in a while.
<strong>In short, this article may need to be updated, but what follows is a guide for installing LEDAPS 2.2.0 on Ubuntu GNU/Linux 14.04 or higher.</strong></p>
<h2>LEDAPS Dependencies</h2>
<p>The <a href="https://code.google.com/p/ledaps/wiki/Version_2_2_0">Google Code Wiki for LEDAPS</a> lists the following dependencies:</p>
<ul>
<li><a href="http://hdfeos.org/software/library.php">HDF-EOS2 libraries</a> (<code>hdf-eos</code>)</li>
<li><a href="https://code.google.com/p/espa-common/">ESPA common libraries</a> (<code>espa-common</code>)</li>
<li>GCTP libraries (Bundled with HDF-EOS2)</li>
<li>TIFF libraries (<code>libtiff5</code>)</li>
<li>GeoTIFF libraries (<code>libgeotiff2</code>)</li>
<li>HDF4 libraries (<code>libhdf4</code>)</li>
<li><code>libxml2</code></li>
</ul>
<p>Some of these dependencies have their own dependencies, which adds "JPEG support" (for <code>espa-common</code>) and <code>zlib</code> (for HDF4) to the list.</p>
<h2>A Walkthrough</h2>
<p>This walkthrough is written and tested for Ubuntu GNU/Linux 14.04 ("trusty").
<a href="https://gist.github.com/arthur-e/f836e3e612bdf4a0ca88">It is available in full as a public Gist</a>.
You'll want to make sure you have the basic dependencies installed first.</p>
<div class="highlight"><pre><span></span><code>sudo apt-get install zlib1g zlib1g-dev libtiff5 libtiff5-dev <span class="se">\</span>
libgeotiff2 libgeotiff-dev libxml2 libxml2-dev
</code></pre></div>
<p>I could not figure out which JPEG library is needed by <code>espa-common</code> but it is likely installed by default on Ubuntu.
For the following examples, I assume you want to build the libraries in <code>/usr/local/</code> and you have <code>sudo</code> privileges.
Don't forget to set a <code>USERNAME</code> variable.</p>
<div class="highlight"><pre><span></span><code><span class="nv">USERNAME</span><span class="o">=</span>heyyouguys
</code></pre></div>
<h3>HDF4 Library</h3>
<p><strong>First, we install HDF4 support.</strong>
You could install this as a package with <code>sudo apt-get install libhdf4-0 libhdf4-0-alt libhdf4-alt-dev</code>.
I prefer to <a href="http://www.hdfgroup.org/release4/obtainsrc.html">build it from source</a> in this case.
This also ensures that the paths I've provided for shared libraries later on in the walkthrough will match.</p>
<div class="highlight"><pre><span></span><code>sudo mkdir /usr/local/hdf4
sudo chown <span class="nv">$USERNAME</span> /usr/local/hdf4
<span class="nb">cd</span> /usr/local/hdf4
<span class="c1"># Download the latest HDF4 source code</span>
wget http://www.hdfgroup.org/ftp/HDF/HDF_Current/src/hdf-4.2.11.tar.gz
tar -xzvf hdf-4.2.11.tar.gz
<span class="nb">cd</span> hdf-4.2.11
./configure
make
make check
make install
<span class="c1"># Update shared links</span>
sudo ldconfig
</code></pre></div>
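<p>Before moving on, it's worth confirming that the <code>h4cc</code> compiler wrapper landed where the later steps expect it. A minimal check (the path assumes the default in-tree install prefix used above):</p>

```shell
# Sanity check: the HDF-EOS2 configure step below references this exact path.
H4CC=/usr/local/hdf4/hdf-4.2.11/hdf4/bin/h4cc
if [ -x "$H4CC" ]; then
  echo "h4cc found at $H4CC"
else
  echo "h4cc not found; re-check your HDF4 install prefix"
fi
```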
<h3>HDF-EOS2 Library</h3>
<p><strong>Next, we install the HDF-EOS2 library.</strong>
The HDF-EOS developers actually provide some fairly helpful instructions for <a href="http://hdfeos.org/software/hdfeos.php">building from source</a>, from which I've adapted these instructions.
Note that your path to the <code>h4cc</code> compiler may differ, especially if you did not build HDF4 from source.
HDF-EOS2 is available as a package (<code>sudo apt-get install libhdfeos0 libhdfeos-dev libgctp0d libgctp-dev</code>) but I could not get LEDAPS to build off of this package.</p>
<div class="highlight"><pre><span></span><code>sudo mkdir /usr/local/hdf-eos
sudo chown <span class="nv">$USERNAME</span> /usr/local/hdf-eos
<span class="nb">cd</span> /usr/local/hdf-eos
<span class="c1"># Download latest HDF-EOS source code</span>
wget ftp://edhs1.gsfc.nasa.gov/edhs/hdfeos/latest_release/HDF-EOS2.19v1.00.tar.Z
tar -xzvf HDF-EOS2.19v1.00.tar.Z
<span class="nb">cd</span> hdfeos
./configure --enable-install-include <span class="se">\</span>
<span class="nv">CC</span><span class="o">=</span>/usr/local/hdf4/hdf-4.2.11/hdf4/bin/h4cc
make
make check
make install
<span class="c1"># Update shared links</span>
sudo ldconfig
</code></pre></div>
<p>Pay attention to this message after <code>make install</code>, which might be slightly different on your machine.
You'll want to remember this path for later.</p>
<div class="highlight"><pre><span></span><code>Libraries have been installed in:
/usr/local/hdf4/hdf-4.2.11/hdf4/lib
</code></pre></div>
<h3>ESPA Common Library</h3>
<p><strong>Next is the ESPA Common Library,</strong> whose documentation is <a href="https://code.google.com/p/espa-common/wiki/Version_1_1_0">available here</a>.
The original documentation consists of only one instruction: "Goto the <code>src/raw_binary</code> directory and build the source code there."
Anyway, the first thing we have to do is get the source code.
Note that while I've linked to the 1.1.0 version documentation, the instructions are (apparently) the same for version 1.3.1, which is what we're installing.</p>
<div class="highlight"><pre><span></span><code>sudo svn checkout http://espa-common.googlecode.com/svn/releases/version_1.3.1 <span class="se">\</span>
/usr/local/espa-common/version_1.3.1
sudo chown -R <span class="nv">$USERNAME</span> /usr/local/espa-common
</code></pre></div>
<p>To inform <code>espa-common</code> and, later, LEDAPS where to find shared libraries, we have to set a number of environment variables.
At least one of the following paths will likely not match your setup.
The only hints I can give you are that the <code>*INC</code> paths should point to the respective library's includes while the <code>*LIB</code> paths should point to the exported library.</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">HDFEOS_GCTPINC</span><span class="o">=</span><span class="s2">"/usr/include/gctp/"</span>
<span class="nb">export</span> <span class="nv">HDFEOS_GCTPLIB</span><span class="o">=</span><span class="s2">"/usr/local/hdf-eos/hdfeos/hdfeos2/lib"</span>
<span class="nb">export</span> <span class="nv">TIFFINC</span><span class="o">=</span><span class="s2">"/usr/include/x86_64-linux-gnu/"</span>
<span class="nb">export</span> <span class="nv">TIFFLIB</span><span class="o">=</span><span class="s2">"/usr/lib/x86_64-linux-gnu/"</span>
<span class="nb">export</span> <span class="nv">GEOTIFF_INC</span><span class="o">=</span><span class="s2">"/usr/include/geotiff/"</span>
<span class="nb">export</span> <span class="nv">GEOTIFF_LIB</span><span class="o">=</span><span class="s2">"/usr/lib/"</span>
<span class="nb">export</span> <span class="nv">HDFINC</span><span class="o">=</span><span class="s2">"/usr/local/hdf4/hdf-4.2.11/hdf4/include/"</span>
<span class="nb">export</span> <span class="nv">HDFLIB</span><span class="o">=</span><span class="s2">"/usr/local/hdf4/hdf-4.2.11/hdf4/lib/"</span>
<span class="nb">export</span> <span class="nv">HDFEOS_INC</span><span class="o">=</span><span class="s2">"/usr/local/hdf-eos/hdfeos/include/"</span>
<span class="nb">export</span> <span class="nv">HDFEOS_LIB</span><span class="o">=</span><span class="s2">"/usr/local/hdf-eos/hdfeos/hdfeos2/lib"</span>
<span class="nb">export</span> <span class="nv">JPEGINC</span><span class="o">=</span><span class="s2">"/usr/include/"</span>
<span class="nb">export</span> <span class="nv">JPEGLIB</span><span class="o">=</span><span class="s2">"/usr/lib/x86_64-linux-gnu/"</span>
<span class="nb">export</span> <span class="nv">XML2INC</span><span class="o">=</span><span class="s2">"/usr/include/libxml2/"</span>
<span class="nb">export</span> <span class="nv">XML2LIB</span><span class="o">=</span><span class="s2">"/usr/lib/x86_64-linux-gnu/"</span>
</code></pre></div>
<p>It's easy to mistake where HDF-EOS2 keeps its <code>lib</code> files.
Remember the path we wrote down after <code>make install</code>?
You have to carefully read the output from <code>make install</code> to note that it installs the <code>lib</code> files in a subfolder called <code>hdfeos2</code> (by default, I assume).
If the above path doesn't work for you, just run <code>make install</code> on HDF-EOS2 again and look for the message that indicates the proper path.</p>
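<p>Because a single wrong path is the most common cause of build failures here, it can save time to check every exported path before running <code>make</code>. A minimal sketch (the variable list matches the exports above; adjust it to your own setup):</p>

```shell
# Warn about any *INC/*LIB variable that doesn't point at a real directory.
check_paths() {
  for var in HDFEOS_GCTPINC HDFEOS_GCTPLIB TIFFINC TIFFLIB GEOTIFF_INC \
             GEOTIFF_LIB HDFINC HDFLIB HDFEOS_INC HDFEOS_LIB JPEGINC \
             JPEGLIB XML2INC XML2LIB; do
    eval "val=\${$var:-}"
    [ -d "$val" ] || echo "WARNING: $var='$val' is not a directory"
  done
}
check_paths
```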
<p>We next <code>cd</code> to the <code>src/raw_binary</code> subfolder.
From there, with a little luck, building is straightforward.</p>
<div class="highlight"><pre><span></span><code><span class="nb">cd</span> /usr/local/espa-common/version_1.3.1/src/raw_binary
make
make install
sudo ldconfig
</code></pre></div>
<p>If you encounter errors about the <code>-lGctp</code> switch, then <code>make</code> is unable to find the GCTP <code>lib</code> files; you specified the wrong path under <code>HDFEOS_GCTPLIB</code>.</p>
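<p>To find the correct value for <code>HDFEOS_GCTPLIB</code>, you can search the likely install prefixes for the library itself (the directories searched below are examples and may differ on your system):</p>

```shell
# Search common prefixes for libGctp; empty output means GCTP isn't
# installed under any of these directories.
find_gctp() {
  for d in /usr/local/hdf-eos /usr/lib /usr/local/lib; do
    if [ -d "$d" ]; then
      find "$d" -name 'libGctp*' 2>/dev/null
    fi
  done
  return 0
}
find_gctp
```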
<h3>LEDAPS</h3>
<p><strong>Finally, we're ready to install LEDAPS.</strong>
Again, though I will link you to <a href="https://code.google.com/p/ledaps/wiki/Version_2_2_0">the version 2.2.0 documentation</a>, we're going to install version 2.2.1.
<a href="https://code.google.com/p/ledaps/issues/detail?id=5&can=1">This bug report</a> might also be helpful if you run into errors.
The <a href="https://github.com/usgs-eros/espa-surface-reflectance">GitHub site</a> (formerly on the <a href="https://code.google.com/p/ledaps/wiki/Version_2_2_0">Google Code Wiki</a>) indicates that you should download the LEDAPS auxiliary files (<code>ledaps_aux.1978-2014.tar.gz</code>) before building LEDAPS, but there is no reason to do this until LEDAPS has been successfully installed (it's a ~2.9 GB file).
We first have to export the paths to the includes and <code>lib</code> files we just built in <code>espa-common</code>.</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">ESPAINC</span><span class="o">=</span><span class="s2">"/usr/local/espa-common/version_1.3.1/src/raw_binary/include/"</span>
<span class="nb">export</span> <span class="nv">ESPALIB</span><span class="o">=</span><span class="s2">"/usr/local/espa-common/version_1.3.1/src/raw_binary/lib/"</span>
</code></pre></div>
<p>Now you're ready to build LEDAPS!
Good luck.</p>
<div class="highlight"><pre><span></span><code>sudo svn checkout http://ledaps.googlecode.com/svn/releases/version_2.2.1 <span class="se">\</span>
/usr/local/ledaps/version_2.2.1
sudo chown -R <span class="nv">$USERNAME</span> /usr/local/ledaps
<span class="nb">cd</span> /usr/local/ledaps/version_2.2.1/ledapsSrc/src/
make
make install
make clean
sudo ldconfig
</code></pre></div>
<h2>Clean-Up</h2>
<p>A description of the LEDAPS outputs, which you won't find linked to from the USGS or the Google Code site, is <a href="https://daac.ornl.gov/MODELS/guides/LEDAPS.html">available here</a>.</p>Holism versus reductionism: The holy war in ecology2015-03-18T11:00:00+01:002015-03-18T11:00:00+01:00K. Arthur Endsleytag:karthur.org,2015-03-18:/2015/holism-vs-reductionism-ecology.html<p>The philosophical tension between the worldviews of holism and reductionism persists in today's ecology classroom.
This debate traces its roots to the "individualistic" versus "organismal" debate at the beginning of the twentieth century between the population and community ecology schools [1].
The chief actors in this debate were Henry Gleason, proponent …</p><p>The philosophical tension between the worldviews of holism and reductionism persists in today's ecology classroom.
This debate traces its roots to the "individualistic" versus "organismal" debate at the beginning of the twentieth century between the population and community ecology schools [1].
The chief actors in this debate were Henry Gleason, proponent of the individualistic view, and Frederic Clements, who argued that plant communities function as "complex organisms."
Though not cast in the same terms, Gleason's point of view has come to be associated with reductionism while Clements' point of view, including his analogy relating plant communities to the human body, was soon attached to the doctrine of holism.</p>
<p>Odenbaugh (2007) provides an excellent summary of the debate between Gleason and Clements:</p>
<blockquote>
<p>Suppose a set of species in a particular place and time is disturbed by some exogenous process like a forest fire from a lightning strike. Clements argued that communities in response to such disturbances follow a very specific sequence of stages called "seres" and that there is a single self‐perpetuating and tightly integrated climax community. Clements considered communities to be "superorganisms"...Gleason considered Clements' views to be without empirical support and argued that succession results from individual species' physiological requirements and local meteorological conditions.</p>
</blockquote>
<p>In general, holism is the view that an integrated whole has a reality independent of and greater than the sum of its parts.
It is marked, particularly, by the belief in "emergent properties" which are only observed at the system-level.
Reductionism, in contrast, posits that all phenomena are at all times physically realized and therefore system-level behavior is determined by and can be derived from the constituent parts [3].</p>
<p><strong>Reductionism is clearly useful as a foundation on which concepts in ecosystem ecology can easily be built.</strong>
For example, in examining ecosystems as functional units of nature, biochemical pathways in photosynthesis are often discussed as determining the spatial distribution of certain plant communities.
We observe that C4 plants have higher photosynthetic and water-use efficiency than C3 plants and are stimulated by higher temperatures.
We also observe that C3 and C4 plants are found in different patterns on the landscape, with C4 plants found predominantly in drier ecoregions than C3 plants (and CAM plants in ecoregions drier still).
The reductionist argument is that these individual differences in plant life histories determine their spatial distribution and, thus, the plant communities associated with a particular ecosystem.</p>
<p>There are also problems with Clements' view.
His perspective was too broad-scale to appreciate real, fine-grained changes in species turnover along environmental gradients.
<strong>However, rejecting Clements' view does not necessitate rejecting holism.</strong>
There are very real examples of emergent phenomena at work.
Conway's oft-cited Game of Life is the premier example but there are examples from the physical world as well (beyond cellular automata).
While the spatial distribution of an animal species might be modeled effectively in a vacuum, when we consider multiple species interactions it becomes increasingly difficult if not impossible to predict their spatial distributions with any appreciable accuracy while restricting ourselves to thinking of them as linear combinations.</p>
<p>But is this merely "pragmatic anti-reductionism" [3]?
I think not.
As Joe Faith (1998) argues, there are two problems with the reductionist view of some physical systems.
One is that our understanding of some physical systems is limited to conceptual and mathematical models (e.g., ideal bodies in physics).
Another is that properties of the constituent elements of a system are often determined by system-level behavior (while an insistence that it is exclusively the reverse is reductionism at its purest).
An example of this from physics, provided by Faith, is the compression of an ideal gas, which results in an increased mean momentum per unit volume (and increased momentum of the constituent particles) [3].</p>
<p>Many ecologists seem to agree there are merits to both views.
My favorite postulation comes from Sierra et al. (2015):</p>
<blockquote>
<p>Where these ecosystem manipulation experiments have included the interaction with one or two additional factors, results suggest that effects are not additive or predictable from individual variables alone.</p>
</blockquote>
<p>While Currie (2011) and others push back on this view by distinguishing between holism and complexity, I would argue that the defense of reductionism by appealing to complexity is only as tenable as pragmatic anti-reductionism.
In the end, the reality of complexity, seen either way, has led to more ideological defensiveness rather than scientific advancement.
That is, the debate tends to generate more noise than heat (or more heat than work).</p>
<h2>References</h2>
<ol>
<li>Dalgaard, T., N.J. Hutchings, J.R. Porter. 2003. "Agroecology, scaling and interdisciplinarity." <em>Agriculture, Ecosystems & Environment.</em> 100 (1):39-51.</li>
<li>Odenbaugh, J. 2007. Seeing the Forest and the Trees: Realism about Communities and Ecosystems. <em>Philosophy of Science</em> 74 (5):628–641.</li>
<li>Faith, J. 1998. Why gliders don’t exist: Anti-reductionism and emergence. In Artificial Life VI: Proceedings of the Sixth International Conference on Artificial Life, eds. C. Adami, R. K. Belew, H. Kitano, and C. Taylor.</li>
<li>Sierra, C. A., S. E. Trumbore, E. A. Davidson, S. Vicca, and I. Janssens. 2015. Sensitivity of decomposition rates of soil organic matter with respect to simultaneous changes in temperature and moisture. <em>Journal of Advances in Modeling Earth Systems</em> 7:355–356.</li>
<li>Currie, W. S. 2011. Units of nature or processes across scales? The ecosystem concept at age 75. <em>New Phytologist</em> 190 (1):21–34.</li>
</ol>Custom citation styles in LaTeX with CSL and Pandoc2015-01-28T15:00:00+01:002015-01-28T15:00:00+01:00K. Arthur Endsleytag:karthur.org,2015-01-28:/2015/custom-citation-styles-in-latex.html<p>Fine-grained customization of citation styles in LaTeX's default bibliography environment can be hard to achieve. Here, I'll describe an alternative to the standard bibliography environment that I found for style customization without sacrificing the raw power of LaTeX.</p><p>I love to use LaTeX for typesetting my papers.
The flexibility of the environment and the crisp beauty of the final product: I think anyone who uses it regularly knows what I'm gushing about.
While working on a recent paper, however, I was frustrated by the prospect of customizing bibliography and citation styles in LaTeX.
I care about fine-grained control of my bibliography.
I wondered if there was another sensitive soul out there who felt the same way and decided to create a better solution.</p>
<p><strong>Here, I'll describe an alternative to the standard bibliography environment that I found for style customization without sacrificing the raw power of LaTeX.</strong>
I'll also comment briefly on some alternatives I've seen and tried.
I take for granted that a BibTeX library is already available, as creating one is outside the scope of this article.
I will mention that my favorite reference manager, <a href="http://blog.mendeley.com/tipstricks/howto-use-mendeley-to-create-citations-using-latex-and-bibtex/">Mendeley, can export its database as a BibTeX library</a>.</p>
<h2>The Problem</h2>
<p>LaTeX's default bibliography environment is simple enough to invoke.
For a complete description, see Stefan Kottwitz's book [<a href="http://www.amazon.com/LaTeX-Beginners-Guide-Stefan-Kottwitz/dp/1847199860/ref=sr_1_1">1</a>].
Inline citations are written with the <code>\cite{}</code> command, as in the example below.</p>
<div class="highlight"><pre><span></span><code>Previous studies have demonstrated that collective efficacy <span class="k">\cite</span><span class="nb">{</span>Morenoff2001<span class="nb">}</span> and prejudice <span class="k">\cite</span><span class="nb">{</span>Sampson2004<span class="nb">}</span> shape neighborhoods more strongly than their physical make-up and condition.
</code></pre></div>
<p>The final and only other requirement is to add the bibliography and point to a BibTeX reference database.
In the example below, it would be named <code>myrefs.bib</code>.</p>
<div class="highlight"><pre><span></span><code><span class="k">\bibliographystyle</span><span class="nb">{</span>alpha<span class="nb">}</span>
<span class="k">\bibliography</span><span class="nb">{</span>myrefs<span class="nb">}</span>
</code></pre></div>
<p>After a first typesetting pass (which generates the <code>.aux</code> file), it is necessary to invoke the <code>bibtex</code> program on the document's base name, not the <code>.tex</code> file.
This translates the references from our BibTeX library (<code>myrefs.bib</code>) that we have cited in our document into a <code>thebibliography</code> <em>environment</em>, which is put in place where we invoked the <code>\bibliography</code> command.
In short, we typeset once, call <code>bibtex</code>, and then typeset twice more.</p>
<div class="highlight"><pre><span></span><code>bibtex my_document
</code></pre></div>
<p>After typesetting (twice), we see in-line citations and our bibliography.
The bibliography style we specified, <code>alpha</code>, is formatted such that the in-line citation labels are a combination of a shortened author name and publication year, the bibliography is sorted by author name, and square brackets surround the labels.
<strong>But what if you want something different?</strong></p>
<p>Well, there are four default styles.
The other three, described by Kottwitz [<a href="http://www.amazon.com/LaTeX-Beginners-Guide-Stefan-Kottwitz/dp/1847199860/ref=sr_1_1">1</a>], are listed below.
<a href="https://www.sharelatex.com/learn/Bibtex_bibliography_styles">ShareLaTeX has a list of eight (8)</a>, including the four discussed here, with example output.</p>
<ul>
<li><code>plain</code>: Arabic numbers for the labels, sorted according to the names of the authors. The number is written in square brackets which also appear with <code>\cite</code>.</li>
<li><code>unsrt</code>: No sorting. All entries appear like they were cited in the text, otherwise it looks like <code>plain</code>.</li>
<li><code>abbrv</code>: Like <code>plain</code>, but first names and other field entries are abbreviated.</li>
</ul>
<p>The <code>bibtex</code> program figures out how to style your citations and bibliography from a specification that resides in a <code>*.bst</code> file (i.e., there is a <code>plain.bst</code> for the built-in <code>plain</code> style).
You could write one of these yourself, perhaps using one of the built-ins as a template, but the postfix language they are written in can be very difficult to read and write.</p>
<p>It seems that others have recognized the need for more flexible customization and, to be fair, there are other options out there in the form of preprocessors and LaTeX packages.
<a href="http://tex.stackexchange.com/a/25702/39510">This TeX StackExchange article</a> does a good job of summarizing their trade-offs.
The only alternative among the extant packages and programs that I've tried is <code>natbib</code> which doesn't completely solve the problem of fully customizable citations and bibliographies (though it's a good start).
In the end, one still has to provide a <code>*.bst</code> file with <code>natbib</code>.</p>
<h2>A Solution</h2>
<p>I should mention that my high expectations of full customization were shaped by my previous experience with <a href="http://rmarkdown.rstudio.com/">R Markdown</a> and <code>knitr</code>.
With R Markdown, formatting for bibliographies and in-line citations can be specified by <a href="http://citationstyles.org/downloads/primer.html">the Citation Style Language (CSL)</a>; see also <a href="http://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html">this reference</a>.
I have a vivid memory of first seeing the <a href="https://www.zotero.org/styles">Zotero repository of CSL stylesheets</a>: <strong>all 7,438 of them.</strong>
Clearly, 7,438 options are better than 4 (or 8), right?
If the numbers don't convince you, just open up a CSL stylesheet; it's an XML variant, making it much easier to read and write than <code>*.bst</code> files.</p>
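<p>To see what I mean, here is an abbreviated <code>citation</code> element in the style of a typical author–date CSL stylesheet (the attribute values are illustrative, not taken from any particular stylesheet):</p>

```xml
<citation>
  <!-- Renders in-line citations like "(Endsley 2015; Kottwitz 2011)" -->
  <layout prefix="(" suffix=")" delimiter="; ">
    <names variable="author">
      <name form="short" and="text"/>
    </names>
    <date variable="issued" prefix=" ">
      <date-part name="year"/>
    </date>
  </layout>
</citation>
```

Each element names the bibliographic part it renders, so tweaking, say, the delimiter between citations is a one-attribute change rather than a stack-manipulation exercise.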
<p>So, CSL works out-of-the-box in R Markdown—great!
And <code>knitr</code>, with help from <a href="http://johnmacfarlane.net/pandoc/">Pandoc</a>, enables R Markdown documents to be serialized to a wide variety of formats (e.g., PDF, HTML, Microsoft Word).
<strong>But what if you want to use the full range of TeX commands and environments available through third-party packages?</strong>
Some people might also object to using R Markdown to write their paper anyway, particularly if they don't use R or markdown.</p>
<p>I'm not one of those people but I do want to use raw TeX sometimes.
So, I started looking into Pandoc to preprocess my TeX documents.
Pandoc supports CSL and can defer raw TeX input to the LaTeX typesetting program.
<strong>By chaining Pandoc and LaTeX, with a custom CSL file to my liking, I can fully customize my bibliography and citation styles without sacrificing the full range of TeX features available in third-party libraries.</strong></p>
<p>As an example, here is the References section of a paper I wrote.
I wanted hanging indents in my bibliography—one last, obsessive detail to achieve my vision for a bibliography—so I have invoked <code>\setlength</code> and <code>\hangparas</code> which require the <code>setspace</code> and <code>hanging</code> packages, respectively.
The last line looks a lot like it did before, right?
I just point to my BibTeX bibliography database.</p>
<div class="highlight"><pre><span></span><code><span class="k">\section</span><span class="nb">{</span>References<span class="nb">}</span>
<span class="k">\setlength</span><span class="nb">{</span><span class="k">\parindent</span><span class="nb">}{</span>0pt<span class="nb">}</span> <span class="c">% Reset indentation for references...</span>
<span class="k">\hangparas</span><span class="nb">{</span>32pt<span class="nb">}{</span>1<span class="nb">}</span>
<span class="k">\bibliography</span><span class="nb">{</span>/home/arthur/library.bib<span class="nb">}</span>
</code></pre></div>
<p>To typeset this with Pandoc, I have a little shell script that encapsulates the variety of options available with that program.
I locate my BibTeX library for Pandoc with the <code>--bibliography</code> option and I tell it how I'd like my bibliography and citations formatted, in CSL, with the <code>--csl</code> option.</p>
<div class="highlight"><pre><span></span><code>pandoc <span class="se">\</span>
-M <span class="nv">author</span><span class="o">=</span><span class="s2">"K. Arthur Endsley"</span> <span class="se">\</span>
-M <span class="nv">date</span><span class="o">=</span><span class="s2">"February 2, 2015"</span> <span class="se">\</span>
-f latex+raw_tex -N -R <span class="se">\</span>
--smart <span class="se">\</span>
--include-in-header<span class="o">=</span>header.tex <span class="se">\</span>
--bibliography<span class="o">=</span>/home/arthur/library.bib <span class="se">\</span>
--csl<span class="o">=</span>citation_style.csl <span class="se">\</span>
--template<span class="o">=</span>template.latex <span class="se">\</span>
-o MyPaper.pdf MyPaper.tex
</code></pre></div>
<p>Note that I have a custom LaTeX template and a header TeX file that load some packages and set up my document.
<strong>These are important, as <code>\usepackage</code> and some other commands can only be used in the preamble—they can't go inside your input TeX file to Pandoc.</strong>
To get an idea of what I mean, see my input TeX file (<code>MyPaper.tex</code>) below:</p>
<div class="highlight"><pre><span></span><code><span class="k">\title</span><span class="nb">{</span>Assessment of Urban Change through a Land-Cover Change Proxy at the Neighborhood Scale with Subpixel Measurements from Satellite Remote Sensing<span class="nb">}</span>
<span class="k">\author</span><span class="nb">{</span>K. Arthur Endsley<span class="nb">}</span>
<span class="k">\begin</span><span class="nb">{</span>document<span class="nb">}</span>
<span class="k">\maketitle</span>
<span class="k">\section</span><span class="nb">{</span>Background<span class="nb">}</span>
Neighborhood change manifests in changes in the physical environment...
</code></pre></div>
<p>Anything else has to go in the template or in the header.</p>
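<p>As an example of what goes in the header, the packages required by the References section shown earlier would be loaded there. A minimal sketch of <code>header.tex</code> (your preamble will differ):</p>

```latex
% header.tex -- passed to Pandoc via --include-in-header.
% Preamble-only commands, like \usepackage, must live here,
% not in the input TeX file.
\usepackage{setspace} % spacing commands used in the References section
\usepackage{hanging}  % provides \hangparas for hanging indents
```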
<h2>References</h2>
<ol>
<li><a href="http://www.amazon.com/LaTeX-Beginners-Guide-Stefan-Kottwitz/dp/1847199860/ref=sr_1_1">LaTeX: Beginner's Guide</a> by Stefan Kottwitz</li>
</ol>Bayesian networks for land cover classification2014-12-04T13:00:00+01:002014-12-04T13:00:00+01:00K. Arthur Endsleytag:karthur.org,2014-12-04:/2014/bayesian-networks-for-land-cover.html<p>I experimented with Bayesian networks for land cover classification in the Detroit metro area using census data as predictors. An example of using Bayesian networks for land cover classification in R is presented with code samples.</p><p>For a term project in my first semester as a PhD student at the University of Michigan, in a class on landscape modeling, I wanted to investigate the relationship between socioeconomic or demographic change and land cover changes in urban areas.
<strong>My intent was to produce a model sensitive to neighborhood change—particularly new development or abandonment as signaled by census measures—and to explain that change in terms of physical changes on the landscape:</strong> changes in vegetation, impervious surface, or soil cover (i.e., the VIS model) [1].
My choice of Bayesian networks was inspired by a study [2] in which they were used to determine land cover transition probabilities and, in turn, drive a cellular automata model for predicting urbanization (new urban development).</p>
<p><img alt="A Bayesian network prediction without masking" src="http://karthur.org/images/20141204_bn_example_graph.png"></p>
<p>I found that while (static) Bayesian networks are very good at representing complex conditional probability distributions and reproducing static patterns, they aren't useful for predicting relatively rare events like slow and sparse land cover changes.
Furthermore, the interpretation of the conditional probabilities can be somewhat subjective.
<strong>Nonetheless, Bayesian networks could be a powerful tool for generalizing from a sparse dataset and may perform well on classification problems such as land cover classification.</strong></p>
<h2>Background</h2>
<p>There are many motivations for studying urban environments and urban change in terms of land use or land cover changes.
Studies include mitigating the impacts of urban sprawl [3], generating urban development scenarios [2], or monitoring rates of urban growth and impervious surface increase [4].
To these ends, previous studies have employed a variety of models that attempt either to predict future states of the landscape or to explain the drivers of urban change.</p>
<p>Some ostensibly explanatory models are not easily interpretable despite the modeler's intentions.
With some cellular automata models, a failure to reproduce fine-scale patterns and accurately locate urban growth may indicate a problem with the model's structure (transition rules, namely), but identifying which parts of the structure are at fault can be challenging, as it requires many different measures of model outcomes, both quantitative and qualitative [5].
<strong>Are Bayesian networks an <em>interpretable</em> (explanatory) and yet spatially accurate modeling approach for investigating changes in urban land cover?</strong></p>
<p>Detroit is a particularly interesting case study for urban change because of its considerable economic and population decline [6].
The "greening" of Detroit—an increase in vegetation cover within the urban core due to abandonment and revegetation—is a well-documented phenomenon [7,8] that further implicates land cover change as a signal of socioeconomic and demographic change.
From a modeling perspective, boom-and-bust dynamics are more challenging than constant growth and some landscape models are ill-equipped for such purposes [3].</p>
<h3>Bayesian Networks</h3>
<p>Bayesian networks (BNs)—also known variously as belief networks, Bayesian belief networks, Bayes nets, and causal probabilistic networks—are a relatively recent [9] tool for estimating probabilities of occurrence given sparse observations and have been demonstrated to be useful for land cover modeling.
<strong>They are directed, acyclic graphs</strong> where each node is a variable and the probabilities of both "predictor" and "response" variables can be queried [10].
Nodes connected to one another imply a conditional dependence in a certain direction (hence, the graph is directed) and the links between them cannot form cycles (hence, the graph is acyclic).
BNs must also exhibit the Markov property; that is, the conditional probability of any node must depend only on its immediate parents [10,11].</p>
<p><strong>In general, BNs are either discrete or continuous</strong>; discrete and continuous variables are usually not mixed and software tools that support mixed types in the network are not common [10,12].
This is to facilitate calculating the joint probability distribution, which is either a multinomial distribution in the case of discrete-valued variables or a Gaussian distribution in the case of continuous values.
In the case of continuous variables, the parameters are just regression coefficients.
Because the nodes of a Bayesian network are linked, multivariate regression is performed to predict the distribution arising at each node in the network, providing regression coefficients for each pairwise interaction between a node and its connections [10].</p>
<p><strong>Training a Bayesian network generally consists of two steps: learning the network structure and then fitting the parameters.</strong>
In some studies, the network structure may be known or specified by an expert.
The conditional probability tables (CPTs) for some or all of the variables might also be specified by an expert [2].</p>
<p>Structure learning is computationally intensive but many different algorithms are available that are all tractable on end-user hardware.
The second step, fitting parameters, is generally done through a maximum likelihood approach (whereby the best fit parameters are estimated) or a Bayesian approach (whereby the posterior distribution of the parameters for a discrete distribution is estimated).
The Bayesian approach is preferred as it provides more robust estimates and guarantees the conditional probability tables will be complete [10].</p>
<h2>Bayesian Networks in R</h2>
<p>The book <strong>Bayesian Networks in R</strong> by Nagarjan et al. [10] mentions a number of different R packages for investigating Bayesian networks.
I'll speak only about the <code>bnlearn</code> package [13], which I've found to be the easiest to use and yet quite robust.</p>
<p>I used three classes of network learning algorithms—all available in the <code>bnlearn</code> package for R—to investigate possible network structures and to select a stable (consistent) structure for modeling based on a random sample of my discretized training data.
Most of the algorithms I tried produced extremely dense graphs, including many complete graphs (i.e., every node is connected to every other), which generally do not perform well for prediction.
Ultimately, two hybrid algorithms, General 2-Phase Restricted Maximization (RSMAX2) and Max-Min Hill Climbing (MMHC), agreed upon the same network structure.
The hybrid class of algorithms, which combines score-based and constraint-based approaches, is considered to produce more reliable networks than either approach alone [10].</p>
<h3>Learning Network Structure in R</h3>
<p>The <code>bnlearn</code> package makes learning network structure a one-liner for any algorithm of choice.
Some algorithms have more options than others.
For instance, Incremental Association Markov Blanket (IAMB or <code>iamb</code>) doesn't require any parameters.
The Hill Climbing algorithm (<code>hc</code>) optionally allows the user to specify how many times it will randomly restart to avoid local maxima and how many times it will add, delete, or reverse arcs after a restart.</p>
<div class="highlight"><pre><span></span><code><span class="nf">iamb</span><span class="p">(</span><span class="n">training.sample</span><span class="p">)</span>
<span class="nf">hc</span><span class="p">(</span><span class="n">training.sample</span><span class="p">,</span> <span class="n">restart</span><span class="o">=</span><span class="m">10</span><span class="p">,</span> <span class="n">perturb</span><span class="o">=</span><span class="m">5</span><span class="p">)</span>
</code></pre></div>
<p>One particularly nice feature of the <code>bnlearn</code> package is that it has graph plotting built right in so that you can visualize the structure of the network that was learned.</p>
<div class="highlight"><pre><span></span><code><span class="n">mmhc.dag</span> <span class="o"><-</span> <span class="nf">mmhc</span><span class="p">(</span><span class="n">training.sample</span><span class="p">)</span>
<span class="nf">plot</span><span class="p">(</span><span class="n">mmhc.dag</span><span class="p">)</span>
</code></pre></div>
<p>We can manipulate the graphs post-hoc by setting arcs ourselves.
For instance, we might want to insist that there is an arc pointing from "old" land cover to "new" land cover.</p>
<div class="highlight"><pre><span></span><code><span class="n">mmhc.dag</span> <span class="o"><-</span> <span class="nf">set.arc</span><span class="p">(</span><span class="n">mmhc.dag</span><span class="p">,</span> <span class="s">'old'</span><span class="p">,</span> <span class="s">'new'</span><span class="p">)</span>
</code></pre></div>
<h3>Fitting Network Parameters in R</h3>
<p>Fitting networks in <code>bnlearn</code> is also short and sweet.
Note that we're using a different set of training data to fit the model parameters.
The <code>method</code> argument is where one specifies whether to use maximum likelihood estimation (<code>mle</code>) or Bayesian parameter estimation (<code>bayes</code>); the latter is currently only implemented for discrete data sets.</p>
<div class="highlight"><pre><span></span><code><span class="n">fitted.network</span> <span class="o"><-</span> <span class="nf">bn.fit</span><span class="p">(</span><span class="n">mmhc.dag</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">training.sample2</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s">'bayes'</span><span class="p">)</span>
</code></pre></div>
<h2>An Example</h2>
<p>The BNs were trained from high-resolution, 30-meter land cover data from 2001 and landscape measures (distance to parks and distance to roads) joined to the coarse-resolution census data for 2006.
A three-fold random sample of the combined predictors was created so that the samples used to learn the network structure, score the network structure, and fit the model were disjoint.
Each disjoint sample contains less than 4% of the complete dataset.
The predictor variables were then aggregated to 300 meters using nearest neighbor resampling to reduce the computational complexity of prediction (classification).</p>
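<p>Drawing the three disjoint samples needs only base R. A minimal sketch, assuming the combined predictors sit in a data frame called <code>all.data</code> (a hypothetical name; <code>scoring.sample</code> is also hypothetical, while <code>training.sample</code> and <code>training.sample2</code> match the names used in the code above):</p>

```r
# Draw three disjoint random samples, each less than 4% of the data
set.seed(42) # For reproducibility
k <- floor(0.04 * nrow(all.data))
idx <- sample(seq_len(nrow(all.data)), size=3 * k)
training.sample <- all.data[idx[1:k],] # To learn the network structure
scoring.sample <- all.data[idx[(k + 1):(2 * k)],] # To score candidate structures
training.sample2 <- all.data[idx[(2 * k + 1):(3 * k)],] # To fit the parameters
```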
<p><strong>Land cover classification with Bayesian networks consists of the following general steps:</strong></p>
<ol>
<li>For each pixel, get the available evidence (e.g., census measures, proximities, and land cover observations).</li>
<li>Obtain the posterior probability distributions given the evidence.</li>
<li>Considering the "new" land cover variable, choose the outcome (e.g., land cover) from the posterior probability distribution.</li>
</ol>
<p>I'll consider in detail each of these three steps in the following sections.</p>
<h3>Step 1: Get the Evidence</h3>
<p><strong>One of the virtues of BNs is that predictions do not require a simultaneous observation of all predictor variables;</strong> even just one predictor variable can be used as evidence.
Below is a striking example of how this works.</p>
<p><img alt="A Bayesian network prediction without masking" src="http://karthur.org/images/20141204_bn_prediction_example_without_masking.png"></p>
<p>In the lower right portion of this Detroit metro area land cover image and around the bottom and left edges there is what looks like random noise.
This is an area of the scene where we have no predictor variables; it's an area consisting chiefly of the Detroit River and Windsor, Canada and, thus, was masked out in our dataset.
<strong>Without any evidence to show to our model, land cover predictions are pulled from the prior distribution only.</strong>
When predictions are made in areas where evidence exists, however, the posterior distribution is obtained and structure emerges.</p>
<p>In this step, for each pixel we want to show the network the evidence (the pixel's predictor variables or the values across all image bands).
We first need to create an independence network so that we can query our network's conditional probabilities.
This requires the <code>gRain</code> package.
The <code>bnlearn</code> package knows how to manipulate <code>gRain</code> data structures, provided that package is available; it provides an <code>as.grain</code> function that returns our trained network as a <code>gRain</code> object.
Then, <code>gRain</code> compiles our network as an independence network using the junction tree algorithm.</p>
<div class="highlight"><pre><span></span><code><span class="nf">require</span><span class="p">(</span><span class="n">gRain</span><span class="p">)</span>
<span class="c1"># We use the junction tree algorithm to create an independence network that we can query</span>
<span class="n">prior</span> <span class="o"><-</span> <span class="nf">compile</span><span class="p">(</span><span class="nf">as.grain</span><span class="p">(</span><span class="n">fitted.network</span><span class="p">))</span>
</code></pre></div>
<p>We call the output junction tree <code>prior</code> because querying it before any evidence is shown will provide the prior distribution for land cover.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Get the prior probabilities for new land cover</span>
<span class="nf">querygrain</span><span class="p">(</span><span class="n">prior</span><span class="p">,</span> <span class="n">nodes</span><span class="o">=</span><span class="s">'new'</span><span class="p">)</span><span class="o">$</span><span class="n">new</span>
</code></pre></div>
<p>In the next step, we'll show the evidence to this independence network in order to obtain the posterior distribution.</p>
<h3>Step 2: Show Evidence and Get the Posterior Distribution</h3>
<p>This step is a little trickier.
I had to write my own function to update the junction tree with the evidence, in particular because of the unique application to land cover classification.
There's additional complexity due to the fact that our network's nodes (variables) have human-readable names (e.g., "med.hhold.income") and discrete states stored as factor levels, whereas the raster layers that contain our discrete training data are integer-valued.
As a result, we have to translate the raster layer class identifiers (e.g., 0, 1, 2...) into the corresponding factor levels for each variable.</p>
<p>In the <code>update.network</code> function, we take in the junction tree to be modified (shown evidence) and an associative array (a named list or vector in R) of the evidence (e.g., "med.hhold.income"=1, ...).</p>
<div class="highlight"><pre><span></span><code><span class="n">VARS</span> <span class="o"><-</span> <span class="nf">c</span><span class="p">(</span><span class="kc">...</span><span class="p">)</span> <span class="c1"># Some list of variable names as character strings</span>
<span class="c1"># A function to update the posterior probability distribution with evidence</span>
<span class="n">update.network</span> <span class="o"><-</span> <span class="nf">function </span><span class="p">(</span><span class="n">jtree</span><span class="p">,</span> <span class="n">states</span><span class="p">)</span> <span class="p">{</span>
<span class="n">states</span> <span class="o"><-</span> <span class="nf">na.omit</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
<span class="c1"># Do not do anything if the input data are all NA</span>
<span class="nf">if </span><span class="p">(</span><span class="nf">dim</span><span class="p">(</span><span class="n">states</span><span class="p">)[</span><span class="m">1</span><span class="p">]</span> <span class="o">==</span> <span class="m">0</span><span class="p">)</span> <span class="p">{</span>
<span class="nf">return</span><span class="p">(</span><span class="n">jtree</span><span class="p">)</span>
<span class="p">}</span>
<span class="c1"># Translate the raster classes [0, 1, 2, ...] into factors</span>
<span class="n">evidence</span> <span class="o"><-</span> <span class="nf">data.frame</span><span class="p">(</span><span class="nf">matrix</span><span class="p">(</span><span class="n">nrow</span><span class="o">=</span><span class="nf">dim</span><span class="p">(</span><span class="n">states</span><span class="p">)[</span><span class="m">1</span><span class="p">],</span> <span class="n">ncol</span><span class="o">=</span><span class="nf">length</span><span class="p">(</span><span class="n">VARS</span><span class="p">)))</span>
<span class="nf">names</span><span class="p">(</span><span class="n">evidence</span><span class="p">)</span> <span class="o"><-</span> <span class="n">VARS</span>
<span class="nf">for </span><span class="p">(</span><span class="n">var</span> <span class="n">in</span> <span class="n">VARS</span><span class="p">)</span> <span class="p">{</span>
<span class="n">evidence</span><span class="p">[,</span><span class="n">var</span><span class="p">]</span> <span class="o"><-</span> <span class="nf">t</span><span class="p">(</span><span class="n">factors</span><span class="p">[</span><span class="n">var</span><span class="p">,][</span><span class="n">states</span><span class="p">[,</span><span class="n">var</span><span class="p">]</span> <span class="o">+</span> <span class="m">1</span><span class="p">])</span>
<span class="p">}</span>
<span class="nf">for </span><span class="p">(</span><span class="n">i</span> <span class="n">in</span> <span class="nf">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="nf">dim</span><span class="p">(</span><span class="n">evidence</span><span class="p">)[</span><span class="m">1</span><span class="p">]))</span> <span class="p">{</span>
<span class="n">jtree</span> <span class="o"><-</span> <span class="nf">setEvidence</span><span class="p">(</span><span class="n">jtree</span><span class="p">,</span> <span class="n">nodes</span><span class="o">=</span><span class="nf">names</span><span class="p">(</span><span class="n">evidence</span><span class="p">),</span>
<span class="n">nslist</span><span class="o">=</span><span class="nf">mapply</span><span class="p">(</span><span class="n">list</span><span class="p">,</span> <span class="n">evidence</span><span class="p">[</span><span class="n">i</span><span class="p">,]))</span>
<span class="p">}</span>
<span class="nf">return</span><span class="p">(</span><span class="n">jtree</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div>
<p>To obtain the posterior distribution, then, looks something like this.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Update the posterior</span>
<span class="n">posterior</span> <span class="o"><-</span> <span class="nf">update.network</span><span class="p">(</span><span class="n">prior</span><span class="p">,</span> <span class="nf">c</span><span class="p">(</span><span class="n">med.hhold.income</span><span class="o">=</span><span class="m">1</span><span class="p">,</span> <span class="kc">...</span><span class="p">))</span>
<span class="c1"># Get the posterior probabilities for new land cover</span>
<span class="nf">querygrain</span><span class="p">(</span><span class="n">posterior</span><span class="p">,</span> <span class="n">nodes</span><span class="o">=</span><span class="s">'new'</span><span class="p">)</span><span class="o">$</span><span class="n">new</span>
</code></pre></div>
<h3>Step 3: Predict the Outcome from the Posterior Distribution</h3>
<p>Here, we need another convenience function; one to pick from the posterior distribution.
The <code>choose.outcome</code> function does this by creating a cumulative probability distribution, generating a random deviate on the uniform interval between 0 and 1, and then choosing the class that covers the interval in which the deviate is found.</p>
<p>For example, if there are two classes with posterior probabilities of 44% for class 0 and 56% for class 1, then a random deviate generated on [0, 0.44] will cause the pixel to be assigned to class 0 (44% of the time) while a random deviate generated on (0.44, 1.0] will cause the pixel to be assigned to class 1 (56% of the time).</p>
<div class="highlight"><pre><span></span><code><span class="c1"># A function to choose outcomes, one at a time, with the same probability as the given posterior distribution</span>
<span class="n">choose.outcome</span> <span class="o"><-</span> <span class="nf">function </span><span class="p">(</span><span class="n">posterior</span><span class="p">)</span> <span class="p">{</span>
<span class="n">posterior</span> <span class="o"><-</span> <span class="nf">sort</span><span class="p">(</span><span class="n">posterior</span><span class="p">)</span>
<span class="c1"># Sort the posterior probabilties by factors, e.g. "1=0.56,0=0.44" becomes "0=0.44,1=0.56"</span>
<span class="n">post</span> <span class="o"><-</span> <span class="nf">numeric</span><span class="p">()</span>
<span class="nf">for </span><span class="p">(</span><span class="n">i</span> <span class="n">in</span> <span class="nf">seq.int</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="nf">length</span><span class="p">(</span><span class="n">posterior</span><span class="p">)))</span> <span class="p">{</span>
<span class="n">post</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o"><-</span> <span class="n">posterior</span><span class="p">[</span><span class="nf">as.character</span><span class="p">(</span><span class="n">i</span> <span class="o">-</span> <span class="m">1</span><span class="p">)]</span>
<span class="p">}</span>
<span class="c1"># Generate a vector of probability thresholds e.g. [0.0, 0.44] for transitions to [0, 1]; upper bound of p=1.0 is implied.</span>
<span class="n">prob</span> <span class="o"><-</span> <span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="nf">length</span><span class="p">(</span><span class="n">post</span><span class="p">))</span>
<span class="nf">for </span><span class="p">(</span><span class="n">i</span> <span class="n">in</span> <span class="nf">seq.int</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">post</span><span class="p">)</span> <span class="o">-</span> <span class="m">1</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="n">by</span><span class="o">=</span><span class="m">-1</span><span class="p">))</span> <span class="p">{</span>
<span class="n">j</span> <span class="o"><-</span> <span class="nf">length</span><span class="p">(</span><span class="n">post</span><span class="p">)</span> <span class="o">-</span> <span class="n">i</span>
<span class="n">prob</span> <span class="o"><-</span> <span class="n">prob</span> <span class="o">+</span> <span class="nf">c</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="n">j</span><span class="p">),</span> <span class="n">post</span><span class="p">[(</span><span class="n">j</span><span class="m">-1</span><span class="p">)</span><span class="o">:</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">post</span><span class="p">)</span><span class="o">-</span><span class="n">j</span><span class="p">)])</span>
<span class="p">}</span>
<span class="c1"># Generate a random uniform deviate on [0, 1] to determine which factor to output</span>
<span class="n">r</span> <span class="o"><-</span> <span class="nf">runif</span><span class="p">(</span><span class="m">1</span><span class="p">)</span>
<span class="nf">for </span><span class="p">(</span><span class="n">i</span> <span class="n">in</span> <span class="nf">seq.int</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="nf">length</span><span class="p">(</span><span class="n">prob</span><span class="p">)</span> <span class="o">-</span> <span class="m">2</span><span class="p">))</span> <span class="p">{</span>
<span class="nf">if </span><span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="n">prob</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="m">2</span><span class="p">])</span> <span class="p">{</span>
<span class="nf">return</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="c1"># p < threshold in e.g. [0, 0.44]? Output that factor</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">prob</span><span class="p">)</span> <span class="o">-</span> <span class="m">1</span><span class="p">)</span> <span class="c1"># p < implied upper bound of 1.0? Output last factor</span>
<span class="p">}</span>
</code></pre></div>
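<p>As an aside, base R can draw from a discrete distribution directly with <code>sample()</code>, so an equivalent draw could be made in one line (a sketch, assuming the names of <code>posterior</code> are the class labels "0", "1", ...):</p>

```r
# Draw one class with probability proportional to the posterior;
# assumes names(posterior) are the class labels "0", "1", ...
chosen <- as.integer(sample(names(posterior), size=1, prob=posterior))
```

<p>The longhand version above has the advantage of making the cumulative thresholds explicit, which can be useful for debugging.</p>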
<p><strong>Finally, we're ready to make some predictions</strong>, which, in this case, means showing evidence to the network, obtaining the posterior distribution, and making a prediction by sampling from the posterior distribution.
For our land cover classification, we can use the <code>stackApply</code> function from the <code>raster</code> package to operate on a stack of raster layers, each corresponding to a predictor variable, for an efficient way of generating a vector of evidence.</p>
<div class="highlight"><pre><span></span><code><span class="nf">require</span><span class="p">(</span><span class="n">gRain</span><span class="p">)</span>
<span class="nf">require</span><span class="p">(</span><span class="n">raster</span><span class="p">)</span>
<span class="c1"># Use the junction tree algorithm to create an independence network to query</span>
<span class="n">prior</span> <span class="o"><-</span> <span class="nf">compile</span><span class="p">(</span><span class="nf">as.grain</span><span class="p">(</span><span class="n">fitted.network</span><span class="p">))</span>
<span class="c1"># A function to operate on each vector of predictors (vector of pixels across bands)</span>
<span class="n">func</span> <span class="o"><-</span> <span class="nf">function </span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="kc">...</span><span class="p">)</span> <span class="p">{</span>
<span class="nf">choose.outcome</span><span class="p">(</span><span class="nf">querygrain</span><span class="p">(</span><span class="nf">update.network</span><span class="p">(</span><span class="n">prior</span><span class="p">,</span> <span class="nf">as.data.frame</span><span class="p">(</span><span class="nf">t</span><span class="p">(</span><span class="n">r</span><span class="p">))),</span>
<span class="n">nodes</span><span class="o">=</span><span class="s">'new'</span><span class="p">)</span><span class="o">$</span><span class="n">new</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">expert.prediction</span> <span class="o"><-</span> <span class="nf">stackApply</span><span class="p">(</span><span class="n">layers</span><span class="p">,</span>
<span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="nf">length</span><span class="p">(</span><span class="nf">names</span><span class="p">(</span><span class="n">layers</span><span class="p">))),</span> <span class="n">func</span><span class="p">)</span>
</code></pre></div>
<p>In using <code>stackApply</code>, we need to convert the vector of discrete raster values into a data frame with <code>as.data.frame(t(r))</code>, where <code>r</code> is the vector of raster values; we take the transpose, <code>t(r)</code>, before turning it into a data frame so it has the shape (one row, with one column per variable) expected by the <code>update.network</code> function.</p>
<h3>Transition Probabilities</h3>
<p><strong>Another neat feature of Bayesian networks is that we can easily obtain transition probabilities for our outcomes.</strong>
In this case, transition probability refers to the probability that a given pixel will be assigned a certain land cover class by our model.</p>
<p>Recall that in the <code>choose.outcome</code> function we were sampling a single outcome from the posterior distribution.
To calculate transition probabilities, we instead want to assign the probability of a specific outcome for a given pixel as the value of that pixel.
This time, we use the raster calculator, <code>calc</code>, in the <code>raster</code> package to apply an arbitrary function over the pixels of our raster stack.</p>
<div class="highlight"><pre><span></span><code><span class="nf">require</span><span class="p">(</span><span class="n">gRain</span><span class="p">)</span>
<span class="nf">require</span><span class="p">(</span><span class="n">raster</span><span class="p">)</span>
<span class="c1"># In our case, we have 3 possible outcomes</span>
<span class="n">no.outcomes</span> <span class="o"><-</span> <span class="m">3</span>
<span class="c1"># Find transition probabilities for the expert graph</span>
<span class="n">trans.probs.expert</span> <span class="o"><-</span> <span class="n">raster</span><span class="o">::</span><span class="nf">calc</span><span class="p">(</span><span class="n">layers</span><span class="p">,</span> <span class="nf">function </span><span class="p">(</span><span class="n">states</span><span class="p">)</span> <span class="p">{</span>
<span class="n">trans</span> <span class="o"><-</span> <span class="nf">matrix</span><span class="p">(</span><span class="n">nrow</span><span class="o">=</span><span class="nf">dim</span><span class="p">(</span><span class="n">states</span><span class="p">)[</span><span class="m">1</span><span class="p">],</span> <span class="n">ncol</span><span class="o">=</span><span class="n">no.outcomes</span><span class="p">)</span>
<span class="nf">for </span><span class="p">(</span><span class="n">i</span> <span class="n">in</span> <span class="nf">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="nf">dim</span><span class="p">(</span><span class="n">states</span><span class="p">)[</span><span class="m">1</span><span class="p">]))</span> <span class="p">{</span>
<span class="c1"># Query the network for the posterior probability of a certain "new" outcome</span>
<span class="n">trans</span><span class="p">[</span><span class="n">i</span><span class="p">,]</span> <span class="o"><-</span> <span class="nf">querygrain</span><span class="p">(</span><span class="nf">update.network</span><span class="p">(</span><span class="n">prior</span><span class="p">,</span> <span class="nf">as.data.frame</span><span class="p">(</span><span class="nf">t</span><span class="p">(</span><span class="n">states</span><span class="p">[</span><span class="n">i</span><span class="p">,]))),</span>
<span class="n">nodes</span><span class="o">=</span><span class="s">'new'</span><span class="p">)</span><span class="o">$</span><span class="n">new</span>
<span class="p">}</span>
<span class="nf">return</span><span class="p">(</span><span class="n">trans</span><span class="p">)</span>
<span class="p">},</span> <span class="n">forcefun</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span>
</code></pre></div>
<p>Below are images of the transition probabilities for the Detroit metro area land cover in 2011 as predicted from 2010 U.S. Census and landscape measures (click on each for their full resolution).</p>
<p><a href="/images/20141204_bn_trans_prob_undev.png"><img style="float:left;margin-right:20px;" src="/images/thumbs/20141204_bn_trans_prob_undev_thumbnail_wide.png" /></a>
<a href="/images/20141204_bn_trans_prob_lowdev.png"><img style="float:left;margin-right:20px;" src="/images/thumbs/20141204_bn_trans_prob_lowdev_thumbnail_wide.png" /></a>
<a href="/images/20141204_bn_trans_prob_highdev.png"><img style="float:left;margin-right:20px;" src="/images/thumbs/20141204_bn_trans_prob_highdev_thumbnail_wide.png" /></a></p>
<div style="clear:both;"></div>
<p>We can see that the transition probabilities for all three of the predicted land cover classes—undeveloped, low development, and high development—are fairly high but spatially distinct.
Here, "development" refers to proportion of impervious surface area as indicated by the National Land Cover Dataset (NLCD).
Undeveloped areas are thought to have less than 20% impervious surface cover, low-development areas between 20% and 80%, and high-development areas more than 80%.
Thus, in the "undeveloped" transition probabilities, we see very high (>/= 0.8) probabilities in the outlying suburban and exurban areas where the urban core has basically 0% chance of transitioning to undeveloped.
The urban core is easily resolved in the "low development" transition probabilities and main road and highway corridors are seen in the "high development" transition probabilities, as expected.</p>
<p><strong>How does our final land cover classification look?</strong>
Land cover data from 2006, as the "new" or "outcome" land cover, were used to train the Bayesian network so it's no surprise the 2006 classification looks very good.
The classification from 2011 is just slightly worse.
The images below show the difference between the classifier's prediction and the observed NLCD land cover.</p>
<p><a href="/images/20141204_bn_observed_2006_diff.png"><img style="float:left;margin-right:20px;" src="/images/thumbs/20141204_bn_observed_2006_diff_thumbnail_wide.png" /></a>
<a href="/images/20141204_bn_observed_2011_diff.png"><img style="float:left;margin-right:20px;" src="/images/thumbs/20141204_bn_observed_2011_diff_thumbnail_wide.png" /></a></p>
<div style="clear:both;"></div>
<p>To quantify the agreement, we can use <a href="http://en.wikipedia.org/wiki/Cohen%27s_kappa">Cohen's kappa</a>, a chance-corrected measure of agreement between two categorical classifications, where the two sets are the scene-wide predicted and observed land cover values, pixel for pixel.
For the 2006 prediction, Cohen's kappa is 0.97—considerably high given that 1.0 would represent perfect agreement.
In 2011, this agreement drops to 0.92.
Different data structures are required in the training and validation processes, and there isn't an efficient way to remove the training pixels from the output in R.
Thus, these kappas are also slightly inflated by the inclusion of training data in the validation set, which is the entire image.
However, the training data constitute less than 4% of the dataset.</p>
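<p>Cohen's kappa itself takes only a few lines of base R once the predicted and observed rasters are flattened to vectors. A sketch, where <code>observed.2006</code> is a hypothetical <code>RasterLayer</code> of the NLCD observations:</p>

```r
require(raster)
# Flatten the predicted and observed classifications to vectors
pred <- getValues(expert.prediction)
obs <- getValues(observed.2006) # "observed.2006" is a hypothetical RasterLayer
ok <- !is.na(pred) & !is.na(obs)
# Confusion matrix; assumes both rasters use the same set of class codes
conf <- table(pred[ok], obs[ok])
p.obs <- sum(diag(conf)) / sum(conf) # Observed agreement
p.exp <- sum(rowSums(conf) * colSums(conf)) / sum(conf)^2 # Agreement expected by chance
kappa <- (p.obs - p.exp) / (1 - p.exp)
```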
<h2>Concluding Remarks</h2>
<p><strong>While the classification accuracies as indicated by Cohen's kappa are quite high, there are three important considerations that should mitigate our enthusiasm:</strong></p>
<ol>
<li>The model completely fails to predict rare events, in this case, new urban development.</li>
<li>The model included "old" land cover as a predictor, which is a considerable advantage as most pixels don't change their land cover from year-to-year.</li>
<li>This classification was based in part on another classification, the NLCD "development" land cover classification.</li>
</ol>
<p>The failure to predict rare events is related to our inclusion of "old" land cover.
<strong>In a sense, there is a considerable inertia to land cover change;</strong> extant land cover rarely does change.
It would be interesting to run the model again without the "old" land cover.
It would also be interesting to investigate the model's performance given a remote sensing dataset rather than a previous classification.
And we haven't even begun to look at the conditional probability tables!
<strong>In summary, I think it's fair to say that Bayesian networks are promising for reproducing static patterns given sparse evidence and deserve attention in future land cover classification applications that are considering machine learning approaches.</strong></p>
<h2>References</h2>
<ol>
<li>Ridd, M. 1995. Exploring a VIS (vegetation-impervious surface-soil) model for urban ecosystem analysis through remote sensing: comparative anatomy for cities. <em>International Journal of Remote Sensing</em> 16 (12):2165–2185.</li>
<li>Kocabas, V., and S. Dragicevic. 2007. Enhancing a GIS Cellular Automata Model of Land Use Change: Bayesian Networks, Influence Diagrams and Causality. <em>Transactions in GIS</em> 11 (5):681–702.</li>
<li>Jantz, C. A., S. J. Goetz, and M. K. Shelley. 2004. Using the SLEUTH urban growth model to simulate the impacts of future policy scenarios on urban land use in the Baltimore-Washington metropolitan area. <em>Environment and Planning B: Planning and Design</em> 31 (2):251–271.</li>
<li>Sexton, J. O., X.-P. Song, C. Huang, S. Channan, M. E. Baker, and J. R. Townshend. 2013. Urban growth of the Washington, D.C.–Baltimore, MD metropolitan region from 1984 to 2010 by annual, Landsat-based estimates of impervious cover. <em>Remote Sensing of Environment</em> 129:42–53.</li>
<li>Brown, D. G., P. H. Verburg, R. G. Pontius, and M. D. Lange. 2013. Opportunities to improve impact, integration, and evaluation of land change models. <em>Current Opinion in Environmental Sustainability</em> 5 (5):452–457.</li>
<li>Hoalst-Pullen, N., M. W. Patterson, and J. D. Gatrell. 2011. Empty spaces: neighbourhood change and the greening of Detroit, 1975–2005. <em>Geocarto International</em> 26 (6):417–434.</li>
<li>Emmanuel, R. 1997. Urban vegetational change as an indicator of demographic trends in cities: the case of Detroit. <em>Environment and Planning B: Planning and Design</em> 24:415–426.</li>
<li>Ryznar, R. M., and T. W. Wagner. 2001. Using Remotely Sensed Imagery to Detect Urban Change: Viewing Detroit from Space. <em>Journal of the American Planning Association</em> 67 (3):327–336.</li>
<li>Pearl, J. 1985. Bayesian networks: A model of self-activated memory for evidential reasoning. In Seventh Annual Conference of the Cognitive Science Society.</li>
<li>Nagarajan, R., M. Scutari, and S. Lebre. 2013. <em>Bayesian Networks in R</em>. New York, New York, USA: Springer.</li>
<li>Charniak, E. 1991. Bayesian Networks without Tears. <em>AI Magazine</em> 12 (4).</li>
<li>Uusitalo, L. 2007. Advantages and challenges of Bayesian networks in environmental modelling. <em>Ecological Modelling</em> 203 (3-4):312–318.</li>
<li>Scutari, M. 2014. bnlearn - an R package for Bayesian network learning and inference. <a href="http://www.bnlearn.com">http://www.bnlearn.com</a></li>
</ol>PostGIS 2.x: Getting raster data out of the database2013-08-21T12:00:00+02:002015-02-23T15:00:00+01:00K. Arthur Endsleytag:karthur.org,2013-08-21:/2013/postgis-2-getting-raster-data.html<p>A cross-posted article from AmericaView, this tutorial discusses how to access raster data in a PostGIS 2.x database and provides potential solutions, including code samples.</p><p><strong>This article was originally posted on the <a href="http://blog.americaview.org/2013/08/postgis-2x-getting-raster-data-out-of.html">AmericaView Blog</a>.
You can also fork <a href="https://gist.github.com/arthur-e/6222249">the Gist of this article</a>.</strong>
The Python and raw SQL examples are taken from my work on the Burned Area Emergency Response <a href="http://geodjango.mtri.org/geowepp/">Spatial WEPP Model Inputs Generator</a>.</p>
<p>PostGIS 2.x (<a href="http://blog.opengeo.org/2013/08/20/postgis-2-1-what-you-need-to-know/">latest release, 2.1</a>) enables users to do <a href="http://postgis.net/docs/manual-2.0/RT_reference.html">fairly sophisticated raster processing</a> directly in a database.
For many applications, these data can stay in the database; it's the insight into spatial phenomena that comes out.
Sometimes, however, you need to get file data (e.g. a GeoTIFF) out of PostGIS.
It isn't immediately obvious how to do this efficiently, despite <a href="http://postgis.net/docs/manual-2.0/RT_reference.html#Raster_Outputs">the number of helpful functions</a> that serialize a raster field to Well-Known Binary (WKB) or other "flat" formats.</p>
<h2>Background</h2>
<p>In particular, I recently needed to create a web service that delivers PostGIS raster outputs as file data.
The queries that we needed to support were well suited for PostGIS, and sometimes one query would consume one or more others as subqueries.
These and other considerations led me to decide to implement the service layer in Python using either GeoDjango or GeoAlchemy.
More on that later.
Suffice it to say, <strong>a robust and stable solution for exporting and attaching file data from PostGIS to an HTTP response was needed.</strong>
I found at least six (6) different ways of doing this; there may be more:</p>
<ul>
<li>Export an ASCII grid ("AAIGrid")</li>
<li>Connect to the database using a desktop client (e.g. QGIS) [1]</li>
<li>Use a procedural language (like PLPGSQL or PLPython) [<a href="http://geeohspatial.blogspot.com/2013/07/exporting-postgis-rasters-to-other.html">2</a>]</li>
<li>Use the COPY declaration to get a hex dump out, then convert to binary</li>
<li>Fill a 2D NumPy array with a byte array and serialize it to a binary file using GDAL or psycopg2 [<a href="http://stackoverflow.com/questions/10529351/using-a-psycopg2-converter-to-retrieve-bytea-data-from-postgresql">3</a>, <a href="http://www.gdal.org/gdal_tutorial.html">4</a>]</li>
<li>Use a raster output function to get a byte array, which can be written to a binary field</li>
</ul>
<p><strong>It's nice to have options.
But what's the most appropriate?</strong>
If that's a difficult question to answer, what's the easiest option?
I'll explore some of them in detail.</p>
<h2>Export An ASCII Grid</h2>
<p>This works great!
Because an ASCII grid file (or "ESRI Grid" file, with the *.asc or *.grd extension, typically) is just plain text, you can directly export it from the database.
The GDAL driver name is "AAIGrid" which should be the second argument to <code>ST_AsGDALRaster()</code>.
Be sure to remove the column header from your export (see image below).</p>
<p><img alt="Want an ASCII grid (or "ESRI Grid")? No problem! Just don't export the column names." src="http://karthur.org/images/20150223_postgis2_pgadmin_ascii_export.png"></p>
<p>Here's a contrived example:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">ST_AsGDALRaster</span><span class="p">(</span><span class="n">mytable</span><span class="p">.</span><span class="n">rast</span><span class="p">,</span><span class="w"> </span><span class="s1">'AAIGrid'</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">rast</span><span class="w"></span>
<span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">mytable</span><span class="w"></span>
</code></pre></div>
<p>This approach has a downside, however.
What you get is a file with no projection information, which you may then need to convert to another format.
This can present problems for your workflow, especially if you're trying to automate the production of raster files, say, through a web API.</p>
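<p>One workaround, if you know the grid's spatial reference in advance, is to write a *.prj "sidecar" file alongside the exported grid. The sketch below is plain Python and illustrates the AAIGrid layout; the function name, header values, and WKT string are my own examples, not outputs from PostGIS:</p>

```python
import os

def write_asc_with_prj(path, array, xll, yll, cellsize, nodata=-9999, wkt=None):
    '''
    Writes a 2D list of numbers as an Arc/Info ASCII grid ("AAIGrid");
    optionally writes a *.prj sidecar containing the projection's WKT string.
    '''
    nrows = len(array)
    ncols = len(array[0])
    with open(path, 'w') as stream:
        # The six-line AAIGrid header precedes the cell values
        stream.write('ncols %d\n' % ncols)
        stream.write('nrows %d\n' % nrows)
        stream.write('xllcorner %f\n' % xll)
        stream.write('yllcorner %f\n' % yll)
        stream.write('cellsize %f\n' % cellsize)
        stream.write('NODATA_value %d\n' % nodata)
        for row in array:
            stream.write(' '.join(str(v) for v in row) + '\n')
    if wkt is not None:
        # Same base name, *.prj extension, as GIS software expects
        with open(os.path.splitext(path)[0] + '.prj', 'w') as stream:
            stream.write(wkt)
```

A desktop GIS (or GDAL) will then pick up the projection from the sidecar when converting the grid to another format.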
<h2>Connecting Using the QGIS Desktop Client</h2>
<p>There is <a href="http://plugins.qgis.org/plugins/wktraster/">a plug-in for QGIS</a> that promises to allow you to load raster data from PostGIS directly into a QGIS workspace.
I used the Plugins Manager ("Plugins" > "Fetch Python Plugins...") in QGIS to get this plug-in package.
The first time I selected the "Load PostGIS Raster to QGIS" plug-in and tried to install it, I found that I couldn't write to the plug-ins directory (this with a relatively fresh installation of QGIS).
After creating and setting myself as the owner of the python/plugins directory, I was able to install the plug-in without any further trouble.
Connecting to the database and viewing the available relations was also no trouble at all.
One minor irritation is that <strong>you need to enter your password every time the plug-in interfaces with the database</strong>, which can be very often: every refresh of the list of available relations prompts for it again.</p>
<p><img alt="You'll be doing this a lot." src="http://karthur.org/images/20150223_postgis2_pgadmin_password.png"></p>
<p>There are a few options available to you in displaying raster data from the database: "Read table's vector representation," "Read one table as a raster," "Read one row as a raster," or "Read the dataset as a raster."
It's not clear what the second and last choices mean in practice, but "Read one table as a raster" did not work for me, where my table has one raster field and a couple of non-raster, non-geometry/geography fields; QGIS hung for a few seconds and then reported "Could not load PG..."
Reading one row worked; however, you have to select the row by its primary key (or its row number in a random selection; I'm not sure which it is returning).
<strong>In summary, this might work for a single raster of interest but it is very awkward and time-consuming.</strong></p>
<h2>Using the COPY Declaration in SQL</h2>
<p>My colleague suggested this method, demonstrated in Python, which requires the <code>pygresql</code> module to be installed; easy enough with <code>pip</code>:</p>
<div class="highlight"><pre><span></span><code>pip install psycopg2 pygresql
</code></pre></div>
<p>The basic idea is to use the COPY declaration in SQL to export the raster to a hexadecimal file, then to convert that file to a binary file using <code>xxd</code>.
The following is an implementation in Python:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">os</span><span class="o">,</span> <span class="nn">stat</span><span class="o">,</span> <span class="nn">pg</span>
<span class="c1"># See: http://www.pygresql.org/install.html</span>
<span class="c1"># pip install psycopg2, pygresql</span>
<span class="c1"># Designate path to output file</span>
<span class="n">outfile</span> <span class="o">=</span> <span class="s1">'/home/myself/temp.tiff'</span>
<span class="c1"># Name of PostgreSQL table to export</span>
<span class="n">pg_table</span> <span class="o">=</span> <span class="s1">'geowepp_soil'</span>
<span class="c1"># PostgreSQL connection parameters</span>
<span class="n">pg_server</span> <span class="o">=</span> <span class="s1">'my_server'</span>
<span class="n">pg_database</span> <span class="o">=</span> <span class="s1">'my_database'</span>
<span class="n">pg_user</span> <span class="o">=</span> <span class="s1">'myself'</span>
<span class="c1"># Designate a file to receive the hex data; make sure it exists with the right permissions</span>
<span class="n">pg_hex</span> <span class="o">=</span> <span class="s1">'/home/myself/temp.hex'</span>
<span class="n">os</span><span class="o">.</span><span class="n">mknod</span><span class="p">(</span><span class="n">pg_hex</span><span class="p">,</span> <span class="n">stat</span><span class="o">.</span><span class="n">S_IRUSR</span> <span class="o">|</span> <span class="n">stat</span><span class="o">.</span><span class="n">S_IWUSR</span> <span class="o">|</span> <span class="n">stat</span><span class="o">.</span><span class="n">S_IRGRP</span> <span class="o">|</span> <span class="n">stat</span><span class="o">.</span><span class="n">S_IWGRP</span><span class="p">)</span>
<span class="n">conn</span> <span class="o">=</span> <span class="n">pg</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="n">pg_database</span><span class="p">,</span> <span class="n">pg_server</span><span class="p">,</span> <span class="mi">5432</span><span class="p">,</span> <span class="kc">None</span><span class="p">,</span> <span class="kc">None</span><span class="p">,</span> <span class="n">pg_user</span><span class="p">)</span>
<span class="n">sql</span> <span class="o">=</span> <span class="s2">"COPY (SELECT encode(ST_AsTIFF(ST_Union("</span> <span class="o">+</span> <span class="n">pg_table</span> <span class="o">+</span> <span class="s2">".rast)), 'hex') FROM "</span> <span class="o">+</span> <span class="n">pg_table</span> <span class="o">+</span> <span class="s2">") TO '"</span> <span class="o">+</span> <span class="n">pg_hex</span> <span class="o">+</span> <span class="s2">"'"</span>
<span class="c1"># You can check it with: print sql</span>
<span class="n">conn</span><span class="o">.</span><span class="n">query</span><span class="p">(</span><span class="n">sql</span><span class="p">)</span>
<span class="n">cmd</span> <span class="o">=</span> <span class="s1">'xxd -p -r '</span> <span class="o">+</span> <span class="n">pg_hex</span> <span class="o">+</span> <span class="s1">' > '</span> <span class="o">+</span> <span class="n">outfile</span>
<span class="n">os</span><span class="o">.</span><span class="n">system</span><span class="p">(</span><span class="n">cmd</span><span class="p">)</span>
</code></pre></div>
<p>This needs to be done on the file system of the database server, which is where PostgreSQL will write.</p>
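<p>If shelling out to <code>xxd</code> is undesirable, or the hex dump must be processed on a machine without it, the same conversion can be done with the standard-library <code>binascii</code> module. This is a sketch (the function name is mine), assuming the dump is a run of hex digits possibly broken across lines:</p>

```python
import binascii

def hex_dump_to_binary(hex_path, out_path):
    '''
    Converts a plain-text hex dump, as written by COPY ... encode(..., 'hex'),
    to a binary file; equivalent to `xxd -p -r`.
    '''
    with open(hex_path, 'r') as stream:
        # Strip whitespace and newlines so unhexlify sees contiguous hex digits
        hex_data = ''.join(stream.read().split())
    with open(out_path, 'wb') as stream:
        stream.write(binascii.unhexlify(hex_data))
```

This also sidesteps the `os.system()` call, which is worth avoiding in a web service.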
<h2>Serializing from a Byte Array</h2>
<p>Despite the seeming complexity of this option (then again, compare it to the above), I think it is the most flexible approach.
I'll provide two examples here, with code: using <a href="http://www.geodjango.org/">GeoDjango</a> to execute a raw query and using <a href="https://geoalchemy-2.readthedocs.org/en/latest/">GeoAlchemy2</a>'s object-relational model to execute the query.
Finally, I'll show an example of writing the output to a file or to a Django <code>HttpResponse()</code> instance.</p>
<h3>Using GeoDjango</h3>
<p>First, some setup.
We'll define a <code>RasterQuery</code> class to help with handling the details.
While a new class isn't exactly an idiomatic example, I'm hoping it will succinctly illustrate the considerations involved in <a href="https://docs.djangoproject.com/en/1.5/topics/db/sql/">performing raw SQL queries with Django</a>.</p>
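<p>The global <code>FORMATS</code> dictionary that the class consults is not shown in the original post; here is one plausible definition, mapping each format key to its GDAL driver name, file extension, and MIME type (the specific keys and values are my assumptions):</p>

```python
# Hypothetical FORMATS registry assumed by the RasterQuery class below;
# the entries are illustrative, not taken from the original post
FORMATS = {
    'geotiff': {
        'driver': 'GTiff',          # GDAL driver name for ST_AsGDALRaster()
        'file_extension': '.tiff',
        'mime': 'image/tiff'
    },
    'aaigrid': {
        'driver': 'AAIGrid',        # The ASCII grid discussed earlier
        'file_extension': '.asc',
        'mime': 'text/plain'
    }
}
```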
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">RasterQuery</span><span class="p">:</span>
    <span class="sd">'''</span>
<span class="sd">    Assumes some global FORMATS dictionary describes the valid file formats,</span>
<span class="sd">    their file extensions and MIME type strings.</span>
<span class="sd">    '''</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">qs</span><span class="p">,</span> <span class="n">params</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">file_format</span><span class="o">=</span><span class="s1">'geotiff'</span><span class="p">):</span>
        <span class="k">assert</span> <span class="n">file_format</span> <span class="ow">in</span> <span class="n">FORMATS</span><span class="o">.</span><span class="n">keys</span><span class="p">(),</span> <span class="s1">'Not a valid file format'</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">cursor</span> <span class="o">=</span> <span class="n">connection</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">params</span> <span class="o">=</span> <span class="n">params</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">query_string</span> <span class="o">=</span> <span class="n">qs</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">file_format</span> <span class="o">=</span> <span class="n">file_format</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">file_extension</span> <span class="o">=</span> <span class="n">FORMATS</span><span class="p">[</span><span class="n">file_format</span><span class="p">][</span><span class="s1">'file_extension'</span><span class="p">]</span>
    <span class="k">def</span> <span class="nf">execute</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">params</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
        <span class="sd">'''Execute the stored query string with the given parameters'''</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">params</span> <span class="o">=</span> <span class="n">params</span>
        <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">params</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">cursor</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">query_string</span><span class="p">,</span> <span class="n">params</span><span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">cursor</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">query_string</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">fetch_all</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="sd">'''Return all results in a List; a List of buffers is returned'''</span>
        <span class="k">return</span> <span class="p">[</span>
            <span class="n">row</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">cursor</span><span class="o">.</span><span class="n">fetchall</span><span class="p">()</span>
        <span class="p">]</span>
    <span class="k">def</span> <span class="nf">write_all</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
        <span class="sd">'''For each raster in the query, writes it to a file on the given path'''</span>
        <span class="n">name</span> <span class="o">=</span> <span class="n">name</span> <span class="ow">or</span> <span class="s1">'raster_query'</span>
        <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">each</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fetch_all</span><span class="p">()):</span>
            <span class="c1"># Number the output files from 1 so their names don't collide</span>
            <span class="n">filename</span> <span class="o">=</span> <span class="n">name</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">file_extension</span>
            <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">filename</span><span class="p">),</span> <span class="s1">'wb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">stream</span><span class="p">:</span>
                <span class="n">stream</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">each</span><span class="p">)</span>
</code></pre></div>
<p>With the <code>RasterQuery</code> class available to us, we can more cleanly execute our raw SQL queries and serialize the response to a file attachment in a Django view:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">clip_one_raster_by_another</span><span class="p">(</span><span class="n">request</span><span class="p">):</span>
    <span class="c1"># Our raw SQL query, with parameter strings</span>
    <span class="n">query_string</span> <span class="o">=</span> <span class="s1">'''</span>
<span class="s1">    SELECT ST_AsGDALRaster(ST_Clip(landcover.rast,</span>
<span class="s1">        ST_Buffer(ST_Envelope(burnedarea.rast), </span><span class="si">%s</span><span class="s1">)), </span><span class="si">%s</span><span class="s1">) AS rast</span>
<span class="s1">    FROM landcover, burnedarea</span>
<span class="s1">    WHERE ST_Intersects(landcover.rast, burnedarea.rast)</span>
<span class="s1">    AND burnedarea.rid = </span><span class="si">%s</span><span class="s1">'''</span>
    <span class="c1"># The file format key; matches the 'GTiff' GDAL driver below</span>
    <span class="n">_format</span> <span class="o">=</span> <span class="s1">'geotiff'</span>
    <span class="c1"># Create a RasterQuery instance; apply the parameters</span>
    <span class="n">query</span> <span class="o">=</span> <span class="n">RasterQuery</span><span class="p">(</span><span class="n">query_string</span><span class="p">,</span> <span class="n">file_format</span><span class="o">=</span><span class="n">_format</span><span class="p">)</span>
    <span class="n">query</span><span class="o">.</span><span class="n">execute</span><span class="p">([</span><span class="mi">1000</span><span class="p">,</span> <span class="s1">'GTiff'</span><span class="p">,</span> <span class="mi">2</span><span class="p">])</span>
    <span class="n">filename</span> <span class="o">=</span> <span class="s1">'blah.tiff'</span>
    <span class="c1"># Outputs:</span>
    <span class="c1"># [(<read-only buffer for 0x2613470, size 110173, offset 0 at 0x26a05b0>),</span>
    <span class="c1"># (<read-only buffer for 0x26134b0, size 142794, offset 0 at 0x26a01f0>)]</span>
    <span class="c1"># Return only the first item</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">HttpResponse</span><span class="p">(</span><span class="n">query</span><span class="o">.</span><span class="n">fetch_all</span><span class="p">()[</span><span class="mi">0</span><span class="p">],</span> <span class="n">content_type</span><span class="o">=</span><span class="n">FORMATS</span><span class="p">[</span><span class="n">_format</span><span class="p">][</span><span class="s1">'mime'</span><span class="p">])</span>
    <span class="n">response</span><span class="p">[</span><span class="s1">'Content-Disposition'</span><span class="p">]</span> <span class="o">=</span> <span class="s1">'attachment; filename="</span><span class="si">%s</span><span class="s1">"'</span> <span class="o">%</span> <span class="n">filename</span>
    <span class="k">return</span> <span class="n">response</span>
</code></pre></div>
<p>Seem simple enough?
To write to a file instead, see the <code>write_all()</code> method of the <code>RasterQuery</code> class.
The <code>query.fetch_all()[0]</code> at the end is contrived.
I'll show a better way of getting to a nested buffer in the next example.</p>
<h3>Using GeoAlchemy2</h3>
<p>GeoAlchemy2's object-relational model (ORM) allows tables to be represented as classes, just like in Django.</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">LandCover</span><span class="p">(</span><span class="n">DeclarativeBase</span><span class="p">):</span>
    <span class="n">__tablename__</span> <span class="o">=</span> <span class="s1">'landcover'</span>
    <span class="n">rid</span> <span class="o">=</span> <span class="n">Column</span><span class="p">(</span><span class="n">Integer</span><span class="p">,</span> <span class="n">primary_key</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
    <span class="n">rast</span> <span class="o">=</span> <span class="n">Column</span><span class="p">(</span><span class="n">ga2</span><span class="o">.</span><span class="n">types</span><span class="o">.</span><span class="n">Raster</span><span class="p">)</span>
    <span class="n">filename</span> <span class="o">=</span> <span class="n">Column</span><span class="p">(</span><span class="n">String</span><span class="p">(</span><span class="mi">255</span><span class="p">))</span>

<span class="k">class</span> <span class="nc">BurnedArea</span><span class="p">(</span><span class="n">DeclarativeBase</span><span class="p">):</span>
    <span class="n">__tablename__</span> <span class="o">=</span> <span class="s1">'burnedarea'</span>
    <span class="n">rid</span> <span class="o">=</span> <span class="n">Column</span><span class="p">(</span><span class="n">Integer</span><span class="p">,</span> <span class="n">primary_key</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
    <span class="n">rast</span> <span class="o">=</span> <span class="n">Column</span><span class="p">(</span><span class="n">ga2</span><span class="o">.</span><span class="n">types</span><span class="o">.</span><span class="n">Raster</span><span class="p">)</span>
    <span class="n">filename</span> <span class="o">=</span> <span class="n">Column</span><span class="p">(</span><span class="n">String</span><span class="p">(</span><span class="mi">255</span><span class="p">))</span>
    <span class="n">burndate</span> <span class="o">=</span> <span class="n">Column</span><span class="p">(</span><span class="n">Date</span><span class="p">)</span>
    <span class="n">burnname</span> <span class="o">=</span> <span class="n">Column</span><span class="p">(</span><span class="n">String</span><span class="p">(</span><span class="mi">255</span><span class="p">))</span>
</code></pre></div>
<p>Assuming that SESSION and ENGINE global variables are available, the gist of this approach can be seen in this example:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">clip_fccs_by_mtbs_id2</span><span class="p">(</span><span class="n">request</span><span class="p">):</span>
    <span class="n">query</span> <span class="o">=</span> <span class="n">SESSION</span><span class="o">.</span><span class="n">query</span><span class="p">(</span><span class="n">LandCover</span><span class="o">.</span><span class="n">rast</span>\
        <span class="o">.</span><span class="n">ST_AsGDALRaster</span><span class="p">(</span><span class="n">LandCover</span><span class="o">.</span><span class="n">rast</span>\
            <span class="o">.</span><span class="n">ST_Clip</span><span class="p">(</span><span class="n">LandCover</span><span class="o">.</span><span class="n">rast</span><span class="p">,</span> <span class="n">BurnedArea</span><span class="o">.</span><span class="n">rast</span>\
                <span class="o">.</span><span class="n">ST_Envelope</span><span class="p">()</span>\
                <span class="o">.</span><span class="n">ST_Buffer</span><span class="p">(</span><span class="mi">1000</span><span class="p">)),</span> <span class="s1">'GTiff'</span><span class="p">)</span><span class="o">.</span><span class="n">label</span><span class="p">(</span><span class="s1">'rast'</span><span class="p">))</span>\
        <span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">LandCover</span><span class="o">.</span><span class="n">rast</span><span class="o">.</span><span class="n">ST_Intersects</span><span class="p">(</span><span class="n">BurnedArea</span><span class="o">.</span><span class="n">rast</span><span class="p">),</span> <span class="n">BurnedArea</span><span class="o">.</span><span class="n">rid</span><span class="o">==</span><span class="mi">2</span><span class="p">)</span>
    <span class="c1"># The file format key for the FORMATS lookup below</span>
    <span class="n">_format</span> <span class="o">=</span> <span class="s1">'geotiff'</span>
    <span class="n">filename</span> <span class="o">=</span> <span class="s1">'blah.tiff'</span>
    <span class="c1"># Outputs:</span>
    <span class="c1"># [(<read-only buffer for 0x2613470, size 110173, offset 0 at 0x26a05b0>),</span>
    <span class="c1"># (<read-only buffer for 0x26134b0, size 142794, offset 0 at 0x26a01f0>)]</span>
    <span class="n">result</span> <span class="o">=</span> <span class="n">query</span><span class="o">.</span><span class="n">all</span><span class="p">()</span>
    <span class="k">while</span> <span class="nb">type</span><span class="p">(</span><span class="n">result</span><span class="p">)</span> <span class="o">!=</span> <span class="n">buffer</span><span class="p">:</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="c1"># Unwrap until a buffer is found</span>
    <span class="c1"># Consequently, it returns only the first item</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">HttpResponse</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="n">content_type</span><span class="o">=</span><span class="n">FORMATS</span><span class="p">[</span><span class="n">_format</span><span class="p">][</span><span class="s1">'mime'</span><span class="p">])</span>
    <span class="n">response</span><span class="p">[</span><span class="s1">'Content-Disposition'</span><span class="p">]</span> <span class="o">=</span> <span class="s1">'attachment; filename="</span><span class="si">%s</span><span class="s1">"'</span> <span class="o">%</span> <span class="n">filename</span>
    <span class="k">return</span> <span class="n">response</span>
</code></pre></div>
<p>Here we see a better way of getting at a nested buffer.
If we wanted all of the rasters that were returned (all of the buffers), we could call <code>ST_Union</code> on our final raster selection before passing it to <code>ST_AsGDALRaster</code>.</p>
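<p>The unwrapping loop in the view above generalizes: the ORM returns a list of one-element row tuples, so the payload sits at some nesting depth. Here is the same idea in plain Python, written for Python 3 <code>bytes</code> (under Python 2, psycopg2 returned <code>buffer</code> objects, as in the comments above); <code>unwrap_first</code> is my name for it, not something from the libraries discussed:</p>

```python
def unwrap_first(result, target_type=bytes):
    '''
    Descends into nested sequences until a value of target_type is found;
    mirrors the while-loop above, but also guards against empty result sets.
    '''
    while not isinstance(result, target_type):
        if not result:
            # An empty list/tuple means the query matched nothing
            raise ValueError('Empty result set')
        result = result[0]  # Unwrap one level of nesting
    return result
```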
<h2>In Summary...</h2>
<p><strong>After considering all my (apparent) options, I found this last technique, using the PostGIS raster output function(s) and writing the byte array to a file-attachment in an HTTP response, to be best suited for my application.</strong>
I'd be interested in hearing about other techniques not described here.</p>
<h2>References</h2>
<ol>
<li>Obe, Regina, and Leo S. Hsu. "Using PostGIS in a desktop environment." Chapter 12. <em>PostGIS in Action</em>. 2011.</li>
<li><a href="http://geeohspatial.blogspot.com/2013/07/exporting-postgis-rasters-to-other.html">"Exporting PostGIS Rasters to Other Formats...Quickly"</a></li>
<li>StackOverflow: <a href="http://stackoverflow.com/questions/10529351/using-a-psycopg2-converter-to-retrieve-bytea-data-from-postgresql">Using a psycopg2 converter to retrieve bytea data from PostgreSQL</a></li>
<li><a href="http://www.gdal.org/gdal_tutorial.html">GDAL API Tutorial</a></li>
</ol>