Exploratory data analysis is a topic that doesn't get enough attention in courses, formal or otherwise, on statistical analysis or so-called "data science." While some scientists or data professionals may be approaching a problem with what they feel is a great amount of well-established theory behind them, in many cases an a priori understanding of the system under consideration is lacking. In other cases, what is thought to be well-understood about a system is actually ripe for skepticism and further inquiry.

Either way, approaching a dataset with an open mind is a good thing. While any system of study with two or more variables can exhibit complex behavior, including thresholds, non-linearity, and feedbacks, problems in both the physical and social sciences are frequently modeled as purely linear relationships. There are a couple of reasons for this and both are understandable. First, the general linear model (and generalized linear models) are relatively uncomplicated. A second, less compelling reason, is that linear relationships may be thought to be more useful, or at least easier to understand. In exploratory data analysis (EDA), we want to get a sense of the range of possibilities in the system under study, to the extent that our available data accurately represent it.

Here, I present a visual tool and R code to help guide the detection of bivariate trends among groups means (or among quantiles of one continuous variable), which also applies and displays an analysis of variance (ANOVA) test, while also controlling for the influence of a third variable. I've found this script useful for approaching a variety of datasets and, in the interest of cleaning up my code when I have multiple relationships to test, turned it into a re-usable R function.

Example Using American Community Survey

For this example, I'll use data from the 2012 5-Year American Community Survey (ACS) for the Detroit-area counties of Wayne, Oakland, and Macomb; survey data are at the block-group level. I obtained the ACS data from SocialExplorer.com, which is a lot more convenient than going to the U.S. Census Bureau website. With these data, we might ask questions similar to those posed for a general, tabular dataset:

  • What is the relationship between housing vacancy and median household income, controlling for housing density?
  • Is there a relationship between median household income and white population proportion, controlling for population density?

First, I need to process the ACS 2012 data such that:

  • Variables have meaningful names;
  • Only block groups with non-zero population and housing totals are considered;
  • Housing density and population density are calculated for every block group;
  • Median household income is log-transformed;
  • My other variables of interest are appropriately normalized;
  • Calculate quantiles for my variables of interest.

I present this example using the R programming environment (version 3.3.2). For most of the processing, I'll use dplyr and pipes. I'm going to choose to cut my data into quintiles (\(n=5\) quantiles) but this approach works for any kind of discretization.

library(dplyr)

# Load the 2012 ACS data
acs2012.raw <- read.csv('acs2012_5year_by_block_groups.csv',
  header = T, skip = 1, colClasses = c('Geo_FIPS' = 'character'))

quintile <- function (v) {
  # Also, rename levels so they fit inside plot labels
  cut(v, breaks = quantile(v, probs = seq(0, 1, 0.2), na.rm = T),
    include.lowest = T, # Must include for cases of zero vacant housing, etc.
    labels = c('1st', '2nd', '3rd', '4th', '5th'))
}

acs2012 <- acs2012.raw %>%
  select(
    FIPS = Geo_FIPS,
    county.code = Geo_COUNTY,
    area.land.sq.m = Geo_AREALAND,
    Pop.Total = SE_T001_001,
    Pop.White = SE_T013_002,
    Pop.Black = SE_T013_003,
    Median.Hhold.Income = SE_T057_001,
    Housing.Units.Total = SE_T093_001,
    Housing.Units.Vacant = SE_T096_001) %>%
  # Consider only block groups with non-zero housing, population
  filter(Pop.Total > 0) %>%
  filter(Housing.Units.Total > 0) %>%
  filter(!is.na(Median.Hhold.Income)) %>%
  # Normalize variables
  mutate(
    Log10.Median.Hhold.Income = log10(Median.Hhold.Income),
    Housing.Density = Housing.Units.Total / area.land.sq.m,
    Pop.Density = Housing.Units.Total / area.land.sq.m,
    Prop.White = Pop.White / Pop.Total,
    Prop.Black = Pop.Black / Pop.Total,
    Prop.Vacant = Housing.Units.Vacant / Housing.Units.Total) %>%
  mutate_each(funs('quintile'), Pop.Density, Housing.Density,
    Prop.White, Prop.Black, Prop.Vacant)

We lost just 16 block groups by our decision to exclude those block groups with zero population or zero housing. It turns out that median household income (Median.Hhold.Income) is missing for 5 more block groups, so I also remove these cases.

The annotated source code for the levels.plot function seen in subsequent code snippets is at the bottom of this post.

First Plot: Income versus Vacant Housing

Let's answer the first question with data: What is the relationship between housing vacancy and median household income, controlling for housing density? Here, it makes more sense (to me) to have median household income on the y-axis, as a continuous variable; thus, it should be the first variable name passed to levels.plot. The (proportion of) vacant housing should then be our second variable. Finally, we want to control for the effect of housing density; that is, we want to examine this relationship at different levels of housing density.

levels.plot(acs2012, 'Log10.Median.Hhold.Income', 'Prop.Vacant',
  'Housing.Density', y1 = 5.6)

The function will automatically determine where to place both the p-value statistic associated with the ANOVA (within each density level) and the numbers of observations in each bin. However, I felt the bin counts were getting overplotted by the high-end outliers for the boxplots, so I added a custom value for the y1 argument, the position of these labels; it's a little higher, now, than the maximum value for Log10.Median.Hhold.Income (5.60 instead of 5.37).

The levels plot of median household income versus vacant housing proportion, controlling for housing density

How do we read this plot? The subplots with gray boxes as titles (the facets in ggplot2 parlance) correspond to each of the five density quintiles (from lowest to highest housing density, in this case). Within each of those subplots, the five quintiles of the second variable (var2), the proportion of vacant housing (Prop.Vacant), in this case, are each plotted against the log of median household income.

We can see from this plot that there is a clear, statistically significant negative relationship between vacant housing and median household income at all density levels. This is just as expected; presumably, people with higher incomes can avoid living in neighborhoods with high vacancy rates. Conversely, high vacancy rates do not tend to occur where people have higher incomes (because they are more likely to be able to afford to pay their mortgage and property taxes).

We also see that, with the logarithmic transformation of median household income, there is a slight saturation effect at low levels of vacant housing. That is, when the proportion of housing that is vacant is small, we see very small differences in median household income. But at the highest three quintiles of vacant housing density, the difference is much larger.

Second Plot: Income and White Population

Our second question: Is there a relationship between median household income and white population proportion, controlling for population density? The call to levels.plot is very similar, but we're controlling for population density this time.

We see a clear, statistically significant, positive trend in the log of median household income as the white proportion rises. At the highest population density level, the relationship is linear; at lower population densities, it exhibits the exponential behavior we saw in the relationship between vacant housing and income.

levels.plot(acs2012, 'Log10.Median.Hhold.Income', 'Prop.White',
  'Pop.Density', y1 = 5.6)

The levels plot of median household income versus white population proportion, controlling for population density

The Levels Plot Function

There are many other examples I could provide, but you've seen enough of what this function does. It is a fairly simple, yet effective, visualization tool. The R code I provide below demonstrates how simple it is. Because R is not a very well-designed language (e.g., awful tracebacks, pollution of the global namespace, proliferation of global functions that are very similar, factors), I don't profess to be any good at writing base R code; it is quite likely this function definition could be written with more sophistication. But I think this attempt is more than adequate. I welcome any suggested changes.

# Levels plot function
# -- Presents a plot of one continuous variable (`var1`) across quintiles of
#     a discrete variable (`var2`), faceted by a third density variable (`dens`)

require(ggplot2)
require(grid)
require(reshape2)
levels.plot <- function (q.dat, var1, var2, dens, yaxis = 'fixed', y0 = NULL, y1 = NULL, y.label = NULL) {
  # Calculate cross-tabulation
  ctab <- reshape2::melt(table(subset(q.dat, select = c(var2, dens))), id.vars = var2)

  # Set the y-axis position of quantile N labels
  ctab$y <- rep(ifelse(is.null(y1), max(q.dat[,var1]), y1), dim(ctab)[1])

  # var1 must be a continuous variable
  stopifnot(class(q.dat[,var1]) %in% c('integer', 'numeric'))

  # Iterate over the density levels...
  tests <- c()
  for (q in levels(q.dat[,dens])) {
    test <- aov(as.formula(paste0(var1, ' ~ ', var2)),
                data = q.dat[q.dat[,dens] == q,])
    tests <- c(tests, sprintf('p-value: ~%.4f', summary(test)[[1]][['Pr(>F)']][[1]]))
  }
  # Set the y-position of the p-value text
  tests <- data.frame(p.value = tests, dens = levels(q.dat[,dens]),
    x = 1, y = ifelse(is.null(y0), min(q.dat[,var1]), y0))
  # MUST rename the "dens" column to the variable name so it can be found by ggplot2
  .names <- names(tests)
  .names[2] <- dens
  names(tests) <- .names

  # The initial plot object
  ggplot(q.dat, aes_string(y = var1)) +
    geom_boxplot(mapping = aes_string(x = var2)) +
    geom_text(aes_string(x = var2, y = 'y', label = 'value'),
      data = ctab, vjust = 0.7, size = 4.5) +
    geom_text(aes(x = x, y = y, label = p.value), data = tests,
      hjust = 0.1, vjust = 0.1, size = 4.5) +
    xlab(paste0('2012 ACS ', gsub('\\.', ' ', var2), ' Quintile')) +
    ylab(ifelse(is.null(y.label), gsub('\\.', ' ', var1), y.label)) +
    labs(title = paste0(gsub('\\.', ' ', var2), ' by ', gsub('\\.', ' ', dens), ' Quintiles')) +
    facet_wrap(as.formula(paste0('~ ', dens)), scales = yaxis) +
    theme_bw() +
    theme(text = element_text(size = 16),
      plot.margin = unit(c(0.5, 0.2, 0.5, 0), 'cm'))
}