
Allocating online experimental subjects to cells—doing better than random

Most people running experiments understand that they need to randomize subjects to their experimental groups. When I teach causal inference to my undergraduates, I try to drive this point home and tell them horror stories about would-be experimenters “randomizing” on things like last name or, even worse, effectively letting people opt into their treatment cell by conditioning assignment on some behaviour the user has taken.

But this randomization requirement for valid inference is actually a noble lie of sorts: what you really need is “unconfoundedness.” In experiments I have run on Mechanical Turk, I don’t randomize but rather stratify, allocating subjects sequentially to experimental groups based on their “arrival” at my experiment. In other words, if I had two cells, treatment and control, my assignments would go like this:

  • Subject 1 goes to Treatment
  • Subject 2 goes to Control
  • Subject 3 goes to Treatment
  • Subject 4 goes to Control
  • Subject 5 goes to Treatment

So long as subjects do not know their relative arrival order and cannot condition on it (which is certainly the case), this method of assignment, while not random, is OK for valid causal inference. In fact, the arrival-time stratification approach does better than OK: it gives you more precise estimates of the treatment effect for a given sample size.
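To make the scheme concrete, here is a minimal sketch of this round-robin assignment in R (the function name is mine, for illustration; it is not the code used in the original experiments):

```r
# Assign each subject to a cell in round-robin order of arrival.
assign_cell <- function(arrival_index, cells = c("Treatment", "Control")) {
  cells[(arrival_index - 1) %% length(cells) + 1]
}

sapply(1:5, assign_cell)
#> "Treatment" "Control" "Treatment" "Control" "Treatment"
```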

The reason is that this stratification ensures your experiment is better balanced on arrival times, which are very likely correlated with user demographics (because of time zones) and behaviour (early arrivals are more likely to be heavy users of the site). For example, suppose you run your experiment on Mechanical Turk over several days and the country of each subject, in order, is:

  1. US
  2. US
  3. US
  4. US
  5. India
  6. India
  7. India
  8. India.

With randomization, there is a non-zero chance you could get the assignment Treatment, Treatment, Treatment, Treatment, Control, Control, Control, Control, which is as badly balanced as possible (all Americans in the treatment, all Indians in the control). Under complete randomization into two equal groups, 2 of the 8-choose-4 = 70 possible assignments are fully confounded like this, and a host of others are better, but not by much, still giving us bad balance. Of course, our calculated standard errors take this possibility into account, but we don’t have to do this to ourselves: with the stratify-on-arrival-time method, we get a perfectly balanced experiment on this one important user attribute, and hence more precision. As the experiment gets larger, this matters less and less, but at any sample size, we do better with stratification if we can pull it off.
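As a quick sanity check on that count, here is a small R snippet (illustrative only, not from the original post) that enumerates the equal-split assignments:

```r
# Under complete randomization of 8 subjects (4 US, then 4 India) into two
# equal cells, count the assignments that put all Americans in one cell.
country <- c(rep("US", 4), rep("India", 4))
assignments <- combn(8, 4)            # all 70 ways to choose the treatment group
confounded <- apply(assignments, 2, function(treat) {
  length(unique(country[treat])) == 1 # treatment group is all one country
})
mean(confounded)                      # 2/70, about 2.9%
```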

We could show the advantages of stratification mathematically, but we can also just see it through simulation. Suppose we have some unobserved attribute of experimental subjects—in my simulation (the R code of which is available at the end of this post), the variable x—that has a strong effect on the outcome y (in my simulation the effect is 3 * x), and we have a treatment with a constant treatment effect (in my simulation, 1) whenever it is applied. And let us suppose that subjects arrive in x order (smallest to largest). Below I plot the average improvement in the absolute difference between the actual treatment effect (which is 1) and the experimental estimate when using stratification rather than randomization for assignment. As we can see, stratification always gives us an improvement, though the advantage declines with sample size.

[Figure (strat_results): average improvement in the absolute estimation error from stratification versus randomization, by sample size]
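Since the R code referred to above is not reproduced in this archive, here is a minimal sketch of the simulation as described (unobserved x with effect 3 * x on y, a constant treatment effect of 1, subjects arriving in x order); the function names, replication count, and sample sizes are my own choices:

```r
set.seed(1)

# One experiment of size n: subjects arrive sorted on an unobserved x that
# strongly affects the outcome. Returns the absolute error of the estimated
# treatment effect under complete randomization and under alternation.
run_once <- function(n) {
  x  <- sort(rnorm(n))                      # arrival order tracks the confounder
  y0 <- 3 * x + rnorm(n)                    # untreated outcome
  est <- function(treat) {
    y <- y0 + 1 * treat                     # constant treatment effect of 1
    mean(y[treat == 1]) - mean(y[treat == 0])
  }
  assign_random <- sample(rep(0:1, n / 2))  # complete randomization, equal cells
  assign_strat  <- rep(c(1, 0), n / 2)      # alternate on arrival order
  c(random = abs(est(assign_random) - 1),
    strat  = abs(est(assign_strat)  - 1))
}

# Average improvement (random error minus stratified error) by sample size.
sizes <- c(10, 20, 50, 100, 200)
improvement <- sapply(sizes, function(n) {
  errs <- replicate(2000, run_once(n))
  mean(errs["random", ] - errs["strat", ])
})
round(data.frame(n = sizes, improvement = improvement), 3)
```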


Marketing in Networks: The Case of Blue Apron

For the past several months, I’ve been a very happy Blue Apron customer. I’ve always liked to cook, but my interest in and devotion to cooking have waxed and waned over the years, with the main determinant being the (in)convenience of shopping. Blue Apron basically solves the shopping problem: they send you a box of all the ingredients you need to make a number of meals, together with detailed, visual preparation instructions.

But this blog post isn’t about cooking; it’s about Blue Apron’s interesting marketing strategy. Blue Apron makes extensive use of free meals that existing customers can send to friends and family who aren’t already signed up. My anecdotal impression is that this meal-sharing approach is remarkably successful: it’s how my family joined, and how we’ve since gotten siblings, parents, extended family, and many friends to join. They in turn have gushed about Blue Apron on Facebook and earned their own free boxes to send, and so on.

What I think is interesting about Blue Apron’s marketing is that although free samples are costly, these customer-chosen free samples are amazingly well targeted: presumably their customers know how likely their friends and family are to like the service and thus can tailor their invitations accordingly. This is particularly important for a product that has (somewhat) niche appeal. It would be interesting to compare the conversion rate of these social referrals to that of other acquisition channels.

This marketing strategy also raises some design questions. First, which of your existing customers are the best ones to target with free boxes to give away? Presumably the most enthusiastic ones are better targets (the ones that never cancel deliveries, order many meals a week, and so on). It also makes sense to target customers with lots of social connections to the target Blue Apron demographic. However, it is perhaps not as simple as identifying super-users. These customers, while likely to give great word-of-mouth endorsements, might also not be as persuasive with would-be customers who might infer that the service is not for them based on the others’ enthusiasm: if the gourmet chef aunt loves the service, the pb&j-fan nephew may infer that it’s only for experts and abstain.

Even once you identify the super-users, how do you prompt them to allocate the free samples optimally? Presumably customers will start by sending the meals to the people they think will value them most and then work their way down, offering diminishing returns to Blue Apron. However, individuals probably have differently shaped curves, with some rapidly depleting their pool of interested friends and others that are “deep” with potential customers. Blue Apron also has to consider not just the probability that a free sample leads to a new customer, but also the probability that the new customer will bring in other customers, and so on.
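To illustrate the diminishing-returns intuition, here is a toy model (entirely my own construction, not from the post): each customer sends meals to friends in decreasing order of conversion probability, with that probability decaying geometrically at a customer-specific rate.

```r
# Toy model: expected conversions when a customer sends k free meals to friends
# sorted by conversion probability, decaying geometrically from p_best.
# (Hypothetical illustration; all parameters are invented.)
expected_conversions <- function(k, p_best, decay) {
  sum(p_best * decay^(0:(k - 1)))
}

expected_conversions(k = 4, p_best = 0.6, decay = 0.3)  # "shallow" network: ~0.85
expected_conversions(k = 4, p_best = 0.6, decay = 0.9)  # "deep" network:    ~2.06
```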

There is some academic work on marketing in networks. For example, Sinan Aral and Dylan Walker have a great paper on the design of products to encourage viral growth. One of their conclusions is that more passive broadcast messaging is more effective than “active-personalized” features because the passive broadcast approach is used more often, and the greater usage outweighs the lower per-use effectiveness. It would be really interesting to compare analogous interventions in the Blue Apron context, say a Facebook experiment where the customer chooses whom to send their free meals to versus one where they just generically share that they have free meals to give, and individuals in their network self-select based on their interest.