Allocating online experimental subjects to cells—doing better than random

Most people running experiments understand that they need to randomize subjects to their experimental groups. When I teach causal inference to my undergraduates, I try to drive this point home and I tell them horror stories about would-be experimenters “randomizing” on things like last name or even worse, effectively letting people opt into their treatment cell by conditioning assignment on some behaviour the user has taken.

But this randomization requirement for valid inference is actually a noble lie of sorts—what you really need is “unconfoundedness” and in experiments I have run on Mechanical Turk—I don’t randomize but rather I stratify, allocating subjects sequentially to experimental groups based on their “arrival” to my experiment. In other words, if I had two cells, treatment and control, my assignments would go like this:

  • Subject 1 goes to Treatment
  • Subject 2 goes to Control
  • Subject 3 goes to Treatment
  • Subject 4 goes to Control
  • Subject 5 goes to Treatment

So long as subjects do not know relative arrival order and cannot condition on it (which is certainly the case), this method of assignment, while not random, is OK for valid causal inference. In fact, arrival time stratification approach does better than OK—it gives you more precise estimates of the treatment effect for a given sample size.

The reason is that this stratification ensures your experiment is better balanced on arrival times, which are very likely correlated with user demographics (because of time-zones) and behaviour (early arrivals are more likely to be heavy users of the site). For example, suppose on Mechanical Turk you run your experiment over several days and the country of each subject, in order is:

  1. US
  2. US
  3. US
  4. US
  5. India
  6. India
  7. India
  8. India.

With randomization, there is a non-zero chance you could get an assignment of Treatment, Treatment, Treatment, Treatment, Control, Control, Control, Control, which is as badly biased as possible (all Americans in the treatment, all Indians in the control). There are a host of other assignments that are better, but not by much, still giving us bad balance. Of course, our calculated standard errors take this possibility into “account” but we don’t have to do this to ourselves—with the stratified on arrival time method, we get a perfectly balanced experiment on this one important user attribute, and hence more precision. As the experiment gets larger and larger this matters less and less, but at any sample size, we do better with stratification if we can pull it off.

We could show the advantages of stratification mathematically, but we can also just see if through simulation. Suppose we have some unobserved attribute of experimental subjects—in my simulation (the R code of which is available at the end of this post), the variable x—that has a strong effect on the outcome, y  (in my simulation the effect is 3 * x) and we have a treatment that has a constant treatment effect (in my simulation, 1)  whenever it is applied. And let us suppose that subjects arrive in x order (smallest to largest). Below I plot the improvement in the average improvement in the absolute difference between the actual treatment effect (which is 1) and the experimental estimate from using stratification rather than randomization for assignment. As we can see, stratification always gives us an improvement, though the advantage is declining in the sample size.

strat_results