# Random Stuff

## Contents

## Project Name Generator

(Summer 2024)

## Time Management Presentation

(Spring 2024)

This was a presentation I gave to a group of FSU PhD students on time management. It was a lot of fun to put together, but there is absolutely no guarantee any of my advice was good! You can see the slides here.

## Code Snippets

### Block Jackknife Resampling

This is a (kind of outdated) resampling technique for cross-validation and conservative variance estimation, named by John Tukey because like a jack knife, it’s good enough for a lot of jobs and perfect for almost none. The general approach is to re-estimate a model $n$ sub-samples, where each sub-sample leaves out observation $i$. The variance of a parameter of interest $\beta$ can then be estimated by caculating the variance of the $\hat\beta$ estimates from each leave-one-out replicate.

I like it as a very intuitive way of looking under the hood of regression results, especially where observations are grouped into substantively meaningful units (age cohorts, geographic units, etc.). To use it this way, you sequentially leave out each *group* of observations $j$ from the data, fitting $j$ separate models for the subsamples $i \in j\neq j$. The results allow you to see which sub-groups are *individually necessary* for the overall effect. If your coefficient of interest moves a lot when you leave out group $j$, or better yet, if your coefficient is no longer statistically distinguishable from zero when you leave out group $j$, then you know $j$ is “important” or driving your results. This can be super useful for thinking up robustness checks, for identifying cases for mixed-methods analysis or nested analysis, or for just figuring out what your results *mean*.

Drew Stommes and I used it in the appendices of our article about descriptive representation and insurgent violence in India to show that the “important” sub-districts in our regression results were the ones that motivated the substantive question! Here’s some code to get it going.

```
library(estimatr)
library(tidyverse)
library(ggthemes)
units <- unique(df$UNIT_IDs)
est <- matrix(NA, nrow = length(units), ncol = 3)
for (i in 1:length(units)){
tmp <- df %>% filter(UNIT != units[i])
z_tmp <- cbind(tmp$RESPONSE, tmp$TREATMENT, tmp$COVARIATES)
r_tmp <- lm_robust(RESPONSE ~ .,
data = z_tmp)
est[i,] <- c(r_tmp$coefficients[1], r_tmp$conf.low[1], r_tmp$conf.high[1]) # This pulls your "treatment" of interest, *iff* the treatment variable is first in the regression call!
print(i)
}
# Plot jackknife estimates
est <- cbind.data.frame(units, est)
colnames(est) <- c("unit", "est", "cilo", "cihi")
est$sig <- ifelse(est$cilo < 0 & est$cihi > 0, "no", "yes")
ggplot(est) + geom_pointrange(aes(x = factor(unit),
ymin= cilo, ymax = cihi,
y = est, color = factor(sig))) +
scale_color_ptol() + theme_minimal() + theme(axis.text.y = element_text(size =rel(.8))) +
geom_hline(aes(yintercept = 0), color = "darkgrey", lty = 2) +
coord_flip() + labs(title = "Block Jackknife by Unit",
y = "Coefficient",
x = "Dropped Unit",
color = "Significant?")
```

### Geocoding and Unit Aggregation

This is code for spatially aggregating any sort of data where records are geolocated to a particular lat/lon coordinate. It returns a dataframe where the records are assigned to polygons of the user’s choosing, i.e. number of violent events within an administrative unit. This is really useful because we often don’t have theories (or good covariate measurement) at the level of discrete lat/lon points.

```
library(sf)
# Pull in shapefiles
shapefile <- st_read('shapefile.json') |> st_make_valid()
# Turn geolocated records into spatialpoints object
df_sf <- df %>%
filter(complete.cases(lon, lat)) %>%
mutate_at(vars(lon, lat), as.numeric) %>%
st_as_sf(
coords = c("lon", "lat"), # Column names for coordinates
agr = "constant",
crs = 4326, # same as WGS84
stringsAsFactors = F,
remove = T
)
# Assign records to polygons
df_in_shape <- st_join(df_sf, shapefile, join = st_within)
# Group by polygons, summarise
shape_df <- df_in_shape |>
group_by(SHAPEID) |>
summarise(count = n(),
shapeName = first(name),
# Other summary variables as needed
)
# Join with shapefiles, because shape_df has no rows for polygons that include no records from df
# And put 0 in for polygons with no records from df
shape_df <- st_join(shapefile, shape_df, by = "id")
cols <- c("count", "other summary variables as needed")
shape_df[cols][is.na(shape_df[cols])] <- 0
```

### Multiple Imputation with `estimatr`

If you want to use the popular `R`

package `Amelia`

for multiple imputation, but you prefer to estimate your linear models with `estimatr::lm_robust()`

rather than the inferior `lm()`

, you get an error message because `lm_robust()`

returns tidy objects and `Amelia`

therefore can’t internally tidy them…

This little chunk reproduces the “combination” procedure in King et al. (2001), which is what the `Amelia::mi_combine()`

function is implementing.

```
imp_models <- list() # Should be a k-length list of the lm_robust() objects fitted to each of k imputed datasets from Ameila
mi_combine_estimatr <- function(imp_models, conf.level = 0.95) {
m <- length(imp_models)
out <- imp_models[[1L]]
ests <- matrix(
unlist(
lapply(imp_models, function(x) x$coefficients)),
ncol = m
)
ses <- matrix(
unlist(
lapply(imp_models, function(x) x$std.error)),
ncol = m
)
v_within <- rowMeans(ses ^ 2)
v_between <- rowSums((ests - rowMeans(ests)) ^ 2) / (m - 1)
out$estimate <- rowMeans(ests)
out$std.error <- sqrt(v_within + v_between * (1 + 1 / m))
r <- ((1 + 1 / m) * v_between) / v_within
df <- (m - 1) * (1 + 1 / r) ^ 2
miss.info <- (r + 2 / (df + 3)) / (r + 1)
out$statistic <- out$estimate / out$std.error
out$p.value <- 2 * stats::pt(out$statistic, df = df, lower.tail = FALSE)
out$df <- df
out$r <- r
out$miss.info <- miss.info
t.c <- stats::qt(1 - (1 - conf.level) / 2, df = df, lower.tail = FALSE)
out$conf.low <- out$estimate - t.c * out$std.error
out$conf.high <- out$estimate + t.c * out$std.error
return(out)
}
```