Kennedy Lake, Emigrant Wilderness, CA

Random Stuff

Contents

  1. Project Name Generator
  2. Time Management Musings
  3. Code Snippets I Always Lose


Project Name Generator

(Summer 2024)


Time Management Presentation

(Spring 2024)

This was a presentation I gave to a group of FSU PhD students on time management. It was a lot of fun to put together, but there is absolutely no guarantee any of my advice was good! You can see the slides here.


Code Snippets

Block Jackknife Resampling

This is a (kind of outdated) resampling technique for cross-validation and conservative variance estimation, named by John Tukey because, like a jack knife, it’s good enough for a lot of jobs and perfect for almost none. The general approach is to re-estimate a model on $n$ sub-samples, where sub-sample $i$ leaves out observation $i$. The variance of a parameter of interest $\beta$ can then be estimated by calculating the variance of the $\hat\beta$ estimates across the leave-one-out replicates.
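
A minimal sketch of the basic leave-one-out version for a single regression coefficient (the data frame df and the names y and x are just placeholders):

n        <- nrow(df)
beta_hat <- numeric(n)

for (i in seq_len(n)) {
  beta_hat[i] <- coef(lm(y ~ x, data = df[-i, ]))["x"] # refit without observation i
}

# Tukey's jackknife variance estimate for the coefficient on x
v_jack  <- (n - 1) / n * sum((beta_hat - mean(beta_hat))^2)
se_jack <- sqrt(v_jack)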

I like it as a very intuitive way of looking under the hood of regression results, especially where observations are grouped into substantively meaningful units (age cohorts, geographic units, etc.). To use it this way, you sequentially leave out each group of observations $j$ from the data, fitting one model per group on the subsample $\{i : i \notin j\}$. The results allow you to see which sub-groups are individually necessary for the overall effect. If your coefficient of interest moves a lot when you leave out group $j$, or better yet, if your coefficient is no longer statistically distinguishable from zero when you leave out group $j$, then you know $j$ is “important” or driving your results. This can be super useful for thinking up robustness checks, for identifying cases for mixed-methods or nested analysis, or just for figuring out what your results mean.

Drew Stommes and I used it in the appendices of our article about descriptive representation and insurgent violence in India to show that the “important” sub-districts in our regression results were the ones that motivated the substantive question! Here’s some code to get it going.

library(estimatr)
library(tidyverse)
library(ggthemes)

units <- unique(df$UNIT) # placeholder column holding your group IDs

est <- matrix(NA, nrow = length(units), ncol = 3)

for (i in seq_along(units)){
  tmp   <- df %>% filter(UNIT != units[i]) # drop every observation in group i
  z_tmp <- tmp %>% select(RESPONSE, TREATMENT, COVARIATES) # swap in your own column names
  r_tmp <- lm_robust(RESPONSE ~ .,
                     data = z_tmp)
  # Position [1] is the intercept, so [2] pulls your "treatment" of interest,
  # *iff* the treatment variable comes first among the regressors!
  est[i,] <- c(r_tmp$coefficients[2], r_tmp$conf.low[2], r_tmp$conf.high[2])
  print(i) # crude progress tracker
}


# Plot jackknife estimates

est <- cbind.data.frame(units, est)
colnames(est) <- c("unit", "est", "cilo", "cihi")
est$sig <- ifelse(est$cilo < 0 & est$cihi > 0, "no", "yes")

ggplot(est) + geom_pointrange(aes(x = factor(unit),
                                  ymin= cilo, ymax = cihi,
                                  y = est, color = factor(sig))) +
  scale_color_ptol() + theme_minimal() + theme(axis.text.y = element_text(size = rel(0.8))) +
  geom_hline(aes(yintercept = 0), color = "darkgrey", lty = 2) +
  coord_flip() + labs(title = "Block Jackknife by Unit",
                      y = "Coefficient",
                      x = "Dropped Unit",
                      color = "Significant?")


Geocoding and Unit Aggregation

This is code for spatially aggregating any sort of data where records are geolocated to a particular lat/lon coordinate. It returns a dataframe in which the records are assigned to, and summarized within, polygons of the user’s choosing, e.g. the number of violent events within an administrative unit. This is really useful because we often don’t have theories (or good covariate measurement) at the level of discrete lat/lon points.

library(sf)
library(tidyverse) # for the pipe and dplyr verbs below

# Pull in shapefiles
shapefile <- st_read('shapefile.json') |>  st_make_valid()

# Turn geolocated records into an sf points object
df_sf <- df %>% 
  filter(complete.cases(lon, lat)) %>% 
  mutate_at(vars(lon, lat), as.numeric) %>%
  st_as_sf(
    coords = c("lon", "lat"), # Column names for coordinates
    agr = "constant",
    crs = 4326, # same as WGS84
    stringsAsFactors = FALSE,
    remove = TRUE
  )

# Assign records to polygons (both layers need the same CRS;
# st_transform() the shapefile to 4326 first if they don't match)

df_in_shape <- st_join(df_sf, shapefile, join = st_within)

# Group by polygons, summarise

shape_df <- df_in_shape |>
    st_drop_geometry() |> # keep just the attributes; the polygon geometry comes back in the join below
    group_by(SHAPEID) |>
    summarise(count = n(),
              shapeName = first(name),
              # Other summary variables as needed
              )

# Join back onto the shapefile (by attribute, not spatially), because shape_df
# has no rows for polygons that include no records from df
# Then put 0 in for those empty polygons

shape_df <- left_join(shapefile, shape_df, by = "SHAPEID")

cols <- c("count") # add your other summary variables as needed
shape_df <- shape_df |>
    mutate(across(all_of(cols), ~ replace_na(.x, 0)))
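
A quick way to eyeball the result (assuming ggplot2 is attached via tidyverse above) is a choropleth of the aggregated counts:

ggplot(shape_df) +
  geom_sf(aes(fill = count)) +
  theme_minimal()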

Multiple Imputation with estimatr

If you want to use the popular R package Amelia for multiple imputation, but you prefer to estimate your linear models with estimatr::lm_robust() rather than the inferior lm(), you get an error message because lm_robust() returns tidy objects and Amelia therefore can’t internally tidy them…

This little chunk reproduces the “combination” procedure in King et al. (2001), which is what the Amelia::mi.combine() function implements.
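
For reference, with $m$ imputed datasets the combined point estimate is the average of the per-imputation estimates, and its standard error mixes within- and between-imputation variance:

$$\bar\beta = \frac{1}{m}\sum_{k=1}^{m}\hat\beta_k, \qquad W = \frac{1}{m}\sum_{k=1}^{m}\widehat{se}_k^{\,2}, \qquad B = \frac{1}{m-1}\sum_{k=1}^{m}\left(\hat\beta_k - \bar\beta\right)^2,$$

$$\widehat{se}(\bar\beta) = \sqrt{W + \left(1 + \frac{1}{m}\right)B}.$$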

imp_models <- list() # Should be an m-length list of lm_robust() objects, one fitted to each of the m imputed datasets from Amelia

mi_combine_estimatr <- function(imp_models, conf.level = 0.95) {
  m    <- length(imp_models)
  out  <- imp_models[[1L]]
  ests <- matrix(
                unlist(
                      lapply(imp_models, function(x) x$coefficients)),
                ncol = m
                )
  ses  <- matrix(
                unlist(
                      lapply(imp_models, function(x) x$std.error)),
                ncol = m
                )
  v_within  <- rowMeans(ses ^ 2)
  v_between <- rowSums((ests - rowMeans(ests)) ^ 2) / (m - 1)
  out$estimate     <- rowMeans(ests)
  out$coefficients <- out$estimate # keep the object's coefficient slot in sync with the combined estimates
  out$std.error    <- sqrt(v_within + v_between * (1 + 1 / m))
  
  r  <- ((1 + 1 / m) * v_between) / v_within
  df <- (m - 1) * (1 + 1 / r) ^ 2
  miss.info <- (r + 2 / (df + 3)) / (r + 1)
  
  out$statistic <- out$estimate / out$std.error
  out$p.value   <- 2 * stats::pt(abs(out$statistic), df = df, lower.tail = FALSE)
  out$df        <- df
  out$r         <- r
  out$miss.info <- miss.info
  t.c <- stats::qt((1 - conf.level) / 2, df = df, lower.tail = FALSE) # upper-tail critical value, so conf.low < conf.high
  out$conf.low  <- out$estimate - t.c * out$std.error
  out$conf.high <- out$estimate + t.c * out$std.error
  
  return(out)
}
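
Here’s a rough usage sketch (the formula and column names are placeholders, and amelia()’s arguments will depend on your data):

a_out      <- Amelia::amelia(df, m = 5)
imp_models <- lapply(a_out$imputations,
                     function(d) lm_robust(RESPONSE ~ TREATMENT + COVARIATES, data = d))
combined   <- mi_combine_estimatr(imp_models)
combined$estimate
combined$conf.low
combined$conf.high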