Chris Blattman recently lamented reviewers asking him to cluster standard errors for a true experiment, which he viewed as incorrect, but had no citation to support his claim. It seems intuitive to me that Chris is right (and everyone commenting on his blog post agreed), but no one could point to something definitive.
I asked on Twitter whether a blog post with some simulations might help placate reviewers and he replied “beggars can’t be choosers”—and so here it is. My full code is on github.
To keep things simple, suppose we have a collection of individuals, indexed by i, that are nested in groups, indexed by j. For some outcome of interest y, there's an individual-specific effect, epsilon_i, and a group-specific effect, eta_j. This outcome also depends on whether a binary treatment has been applied (status indicated by trt), which has an effect size of beta:

y_ij = beta * trt + epsilon_i + eta_j
We are interested in estimating beta and correctly reporting the uncertainty in that estimate.
First, we need to create a data set with a nested structure. The R code below does this, with a few things hardwired: the epsilon_i and eta_j are both drawn from a standard normal, and the probability of treatment assignment is 1/2. Note that the function takes a boolean parameter randomize.by.group that lets us randomize by group instead of by individual. We can specify the sample size, the number of groups, and the size of the treatment effect.
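Here is a sketch of that function (the exact version is in my github repo); the group-assignment scheme is an assumption on my part, chosen to be consistent with the example output shown below:

```r
CreateClusteredData <- function(num.groups, sample.size, beta,
                                randomize.by.group = FALSE) {
  "Create a data frame of individuals nested in groups, with a binary treatment"
  # assign each individual to a group at random
  group <- sample(1:num.groups, sample.size, replace = TRUE)
  eta <- rnorm(num.groups)[group]       # group-specific effect (shared within group)
  epsilon <- rnorm(sample.size)         # individual-specific effect
  if (randomize.by.group) {
    trt <- rbinom(num.groups, 1, 0.5)[group]  # whole group treated or not
  } else {
    trt <- rbinom(sample.size, 1, 0.5)        # individual-level assignment
  }
  data.frame(y = beta * trt + epsilon + eta,
             individual = 1:sample.size,
             group = group, trt = trt,
             epsilon = epsilon, eta = eta)
}
```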
This function returns a data frame that we can analyze. Here's an example of the output. Note that for two individuals with the same group assignment, the eta term is the same, but the treatment varies within groups.

```
> CreateClusteredData(2, 10, 1, randomize.by.group = FALSE)
            y individual group trt   epsilon       eta
1  0.37803923          1     2   0 0.7122716 1.0903109
2  1.26061541          2     2   1 1.1703046 1.0903109
3  0.42009564          3     1   0 0.1618996 0.5819952
4  0.06142989          4     2   0 1.0288810 1.0903109
5  0.33955484          5     2   1 0.4298657 1.0903109
6  2.21558690          6     2   0 1.1252760 1.0903109
7  0.03748754          7     2   0 1.0528233 1.0903109
8  0.27829270          8     1   0 0.3037025 0.5819952
9  0.76669849          9     2   1 0.6763876 1.0903109
10 1.60490249         10     1   0 1.0229073 0.5819952
```
Now we need a function that simulates us running an experiment and analyzing the data using a simple linear regression of the outcome on the treatment indicator. The function below returns the estimate and the standard error from one "run" of an experiment:
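A minimal sketch of that simulation function, assuming CreateClusteredData is defined as above (the actual code is in the github repo):

```r
SimExperiment <- function(num.groups, sample.size, beta,
                          randomize.by.group = FALSE) {
  "Simulate one experiment and analyze it with plain-vanilla OLS"
  df <- CreateClusteredData(num.groups, sample.size, beta,
                            randomize.by.group = randomize.by.group)
  m <- lm(y ~ trt, data = df)
  c(as.numeric(coef(m)[2]),               # estimated treatment effect
    as.numeric(sqrt(diag(vcov(m))[2])))   # its OLS standard error
}
```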
Let’s simulate running the experiment 1,000 times. (NB: if the “%>%” notation looks funny to you, I’m using the magrittr package.)

```
num.replications <- 1000
df.sim.results <- replicate(num.replications, SimExperiment(20, 1000, 1)) %>%
  t %>%
  as.data.frame %>%
  set_colnames(c("beta.hat", "se"))
```
The standard error itself has a sampling distribution, but let’s just take the median value from all our simulations:

```
> # median value of standard errors
> (median.se <- df.sim.results %$% se %>% median)
[1] 0.08785142
```
If we compare this to the standard deviation of our collection of point estimates, we see the two values are nearly identical (which is good news):

```
> # standard deviation of the estimated betas
> (se <- df.sim.results %$% beta.hat %>% sd)
[1] 0.08791535
```
If we plot the empirical sampling distribution of beta.hat and label the 2.5% and 97.5% percentiles, as well as the 95% CI (constructed using that median standard error) around the true beta, the two intervals are right on top of each other:
Code for the figure above:

```
# CI comparison
alpha <- 0.05
(empirical.ci <- df.sim.results %$% beta.hat %>%
   quantile(., probs = c(alpha/2, 1 - alpha/2)))
(calculated.ci <- c(1 - 1.96 * median.se, 1 + 1.96 * median.se))
g <- ggplot(data = df.sim.results, aes(x = beta.hat)) +
  geom_density() +
  theme_bw() +
  xlab("Simulated estimates of beta") +
  geom_vline(xintercept = empirical.ci, colour = "red") +
  geom_vline(xintercept = calculated.ci, colour = "blue")
```
Main takeaway: Despite the group structure, the plain vanilla OLS run with data from a true experiment returns the correct standard errors (at least for the parameters I’ve chosen for this particular simulation).
What if we randomize at the group level but don’t account for this group structure?
At the end of his blog post, Chris adds another cluster-related complaint:
Reviewing papers that randomize at the village or higher level and do not account for this through clustering or some other method. This too is wrong, wrong, wrong, and I see it happen all the time, especially political science and public health.
Let’s redo the analysis but change the level of randomization to group and see what happens if we ignore this level of randomization change. As before, we simulate and then compare the median standard error we observed from our simulations to the standard deviation of the sampling distribution of our estimated treatment effect:

```
> num.replications <- 1000
> df.sim.results <- replicate(num.replications, SimExperiment(20, 1000, 1, TRUE)) %>%
+   t %>%
+   as.data.frame %>%
+   set_colnames(c("beta.hat", "se"))
> (median.se <- df.sim.results %$% se %>% median)
[1] 0.08871206
> (se <- df.sim.results %$% beta.hat %>% sd)
[1] 0.4626796
```
The OLS standard errors are (way) too small: the median value from OLS is still about 0.09 (as expected), but the standard deviation of the sampling distribution of the estimated treatment effect is about 0.46. The resultant CIs look like this:
Eek. Here are two R-specific fixes, both of which seem to work fine. First, we can use a random effects model (from the lme4 package):

```
SimExperimentRE <- function(num.groups, sample.size, beta, randomize.by.group = FALSE){
  "Simulate running an analysis of the experiment with a random effects model"
  df <- CreateClusteredData(num.groups, sample.size, beta,
                            randomize.by.group = randomize.by.group)
  m <- lmer(y ~ trt + (1 | group), data = df)
  c(as.numeric(fixef(m)[2]), as.numeric(sqrt(diag(vcov(m))[2])))
}
```
Or we can cluster standard errors. The package I use for this is lfe, which is really fantastic. Note that you put the factor you want to cluster by in the third part following the formula:

```
SimExperimentFE <- function(num.groups, sample.size, beta, randomize.by.group = FALSE){
  "Simulate running an analysis of the experiment with clustered standard errors"
  df <- CreateClusteredData(num.groups, sample.size, beta,
                            randomize.by.group = randomize.by.group)
  m <- felm(y ~ trt | 0 | 0 | group, data = df)
  c(as.numeric(coef(m)[2]), as.numeric(m$cse[2]))
}
```
One closing thought: a non-econometric argument for why clustering can’t be necessary in a true experiment with randomization at the individual level. For *any* experiment, presumably there is some latent (i.e., unobserved to the researcher) grouping of the data such that the errors within each group are correlated. If taking such latent groupings into account were necessary, we could never use our standard tools for analyzing experiments to get the right standard errors.