Experimenting with targeted features

Suppose you run a website and you have some experience or feature that you think might be good for some subset of your users (but ineffective, at best, for others). You might try to (1) identify who would benefit based on observed characteristics and then (2) alter the experience only for the targeted subset of users expected to benefit.

To make things concrete, in some cities, Uber offers “UberFamily” which means the Uber comes with a car seat. For us (I have two kids), UberFamily is awesome, but the option takes up valuable screen real estate and for a user that Uber thinks does not have kids, adding it to the app screen is a waste. So Uber would like to both (a) figure out if it is likely that I have kids and then (b) adjust the experience based on that model. But they’d also like to know if it’s worth it in general to offer this service even among those they think could use it. This isn’t the example that motivated this blog post, but it makes the scenario clear.

If you are testing features of this sort, then you want to both (a) assess your targeting and (b) assess the feature itself. How should you proceed? I’m sure there’s probably some enormous literature on this question (there’s a literature on everything), but I figure by offering my thoughts and potentially being wrong on the Internet, I can be usefully corrected.

I think what you want to do is not test your targeting experimentally but rather roll out the feature to everyone you reasonably can and then evaluate your targeting algorithms on your experimental data. So, you would run the experiment with a design that maximizes power to detect treatment effects (e.g., 50% to treatment, 50% to control). In other words, completely ignore your targeting algorithm’s recommendations when assigning treatment.

Then, after the experimental data comes in, look for heterogeneous treatment effects conditioned on the predictive model score, where the score can be thought of as a measure of how much we think a person should have benefitted from the treatment. The simplest thing you could do would be to normalize all scores (so the scores have the same mean and variance across algorithms, making model coefficients directly comparable across algorithms). Then just run the regression:

y = \beta_0 + \beta_1 (score \times trt) + \beta_2 score + \beta_3 trt

Hopefully, if the treatment was better for people the model thought would be helped, then \hat{\beta}_1 should be positive (assuming y is such that bigger is better).
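To make the regression concrete, here is a minimal sketch in Python with NumPy. The function name `fit_interaction_model`, the synthetic data, and the "true" coefficients used to generate it are all my own assumptions for illustration; nothing here comes from a real experiment.

```python
import numpy as np

def fit_interaction_model(y, score, trt):
    """OLS fit of y = b0 + b1*(score*trt) + b2*score + b3*trt.

    Scores are normalized (mean 0, variance 1) before fitting.
    Returns the coefficients in the order (b0, b1, b2, b3).
    """
    y = np.asarray(y, dtype=float)
    score = np.asarray(score, dtype=float)
    trt = np.asarray(trt, dtype=float)
    # Normalize so coefficients are comparable across scoring algorithms.
    z = (score - score.mean()) / score.std()
    X = np.column_stack([np.ones_like(z), z * trt, z, trt])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Synthetic data, purely illustrative: the treatment helps high-score users more.
rng = np.random.default_rng(0)
n = 20000
score = rng.normal(size=n)
trt = rng.integers(0, 2, size=n)  # 50/50 assignment that ignores the score
y = 1.0 + 0.5 * score * trt + 0.2 * score + 0.1 * trt + rng.normal(size=n)

b0, b1, b2, b3 = fit_interaction_model(y, score, trt)
print(f"interaction coefficient b1 = {b1:.3f}")
```

In practice you would probably reach for a regression package that also reports standard errors, since the sign of the interaction estimate alone says little without them.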

You’d also want to find the minimum score at which you should be targeting people, i.e., the score at which the expected benefit from targeting first becomes positive. You can then simply select the algorithm with the greatest expected improvement, given the minimum score for targeting.
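One way to read this under the regression above: the estimated treatment effect for a user with normalized score s is \beta_3 + \beta_1 s, so when \beta_1 > 0 the cutoff is s* = -\beta_3 / \beta_1. The sketch below assumes that linear form; the helper names `min_targeting_score` and `expected_gain` are hypothetical, and the coefficients are made-up numbers.

```python
def min_targeting_score(b1, b3):
    """Score above which the estimated effect b3 + b1*score is positive."""
    if b1 <= 0:
        raise ValueError("no positive cutoff: effect does not increase with score")
    return -b3 / b1

def expected_gain(b1, b3, scores):
    """Average gain per user if only users with a positive estimated effect are treated."""
    effects = [b3 + b1 * s for s in scores]
    return sum(e for e in effects if e > 0) / len(scores)

# Illustrative numbers only.
cutoff = min_targeting_score(b1=0.5, b3=-0.25)
gain = expected_gain(b1=0.5, b3=-0.25, scores=[-1, 0, 0.5, 1, 2])
print(cutoff, gain)
```

Comparing algorithms would then amount to computing `expected_gain` for each algorithm's fitted coefficients and normalized scores and picking the largest.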

This seems like a reasonable approach (and maybe bordering on obvious but it wasn’t obvious to me at first). Any other suggestions?

Simple distributed text editing with Mechanical Turk

When I write a sentence, there’s about a 10% chance it will have a typo or grammatical error of some kind. It’s often painful to find them later, as like most people, I tend to “fill in the gaps” or glide over typos when reading my own writing. Fortunately, this kind of editing, unlike, say, reading for structure or consistency, is very parallelizable. In fact, reading each sentence alone, out of order, might even be better than reading the whole document, sentence by sentence.

As an experiment, I wrote a little script that splits a document up into sentences and writes them to a CSV file, with one sentence per line (the script is here). With this CSV, I can use Mechanical Turk to create HITs, with one HIT per sentence. The instructions ask workers to label each sentence as “OK” or “Not OK”, with an optional field to explain their reasoning. The MTurk interface looks like this:
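For readers who want to try something similar, here is a minimal sketch of a sentence splitter in Python. It is not the script linked above; the function names, the `sentence` column header, and the naive splitting rule are assumptions of mine.

```python
import csv
import io
import re

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text):
    """Split text into sentences using the naive rule above."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]

def sentences_to_csv(text, out):
    """Write one sentence per CSV row under a single 'sentence' header."""
    writer = csv.writer(out)
    writer.writerow(["sentence"])
    for sentence in split_sentences(text):
        writer.writerow([sentence])

# Usage: write to an in-memory buffer instead of a file.
buffer = io.StringIO()
sentences_to_csv("This is fine. This are not! Is it OK?", buffer)
print(buffer.getvalue())
```

A rule this simple will wrongly split on abbreviations such as "e.g." and will not skip LaTeX commands, so some cleanup by hand is still needed.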

[Screenshot: the MTurk HIT interface, 2015-12-02 10.53.31]


After splitting the sentences, I went through the CSV file to remove blank lines and LaTeX commands by hand, though one could easily add this feature to the script.

I posted the HITs on MTurk this morning, paying 2 cents per HIT, with 4 HITs per sentence (so each sentence will be checked 4 times by different workers). The text was a paper I’m working on. Results started coming in remarkably quickly. Here is the status after 30 minutes:

[Screenshot: results after 30 minutes, 2015-12-02 10.50.13]

I’m not thrilled with the hourly rate (I try to shoot for $5/hour) but this average is always very sensitive to workers who take a long time. So far, the comments are very helpful, especially since with multiple ratings, you can find problematic sentences—for example:

[Screenshot: multiple ratings for one problematic sentence, 2015-12-02 10.58.50]

The “86” is the line number from the LaTeX document, which is nice because it makes it easier to go find the appropriate sentence to fix. Here are some more samples of the kinds of responses I’m getting:

[Screenshot: sample worker responses, 2015-12-02 11.06.07]

Overall, I think it’s a successful experiment, though it was already well known from Soylent that MTurk workers can do editing tasks well.


Documenting your work as you go with GNU Make

I’ve long been a convert to using ‘Make’ to turn LaTeX into a PDF. However, you can also easily use Make to back up your work as you go along and take “snapshots” of what a draft looked like at a moment in time (see example Makefile below). This complements using GitHub, Dropbox and some external backup (I might add another block to the Makefile that pushes a snapshot to Amazon S3).

My folder structure for a project:

  • writeup (where I store the LaTeX & BibTeX)
  • code
  • data
  • backups (where I store entire snapshots of the directory, w/o backups included, for obvious reasons)
  • snapshots (where just the PDF draft is stored)

How much *should* MTurk cost (if you’re Amazon)?

Amazon recently announced that they are going to double their percentage fee, from 10% to 20%. At least among the people I follow on Twitter (lots of academics who use MTurk for research), this has caused much consternation. A price increase is clearly bad for workers and bad for requesters (who gets hurt worse will depend on relative elasticities), but what about Amazon?

When I first got interested in online labor markets, I wrote a short paper (“Online Labor Markets”) in which I tried to figure out the optimal ad valorem charge from the platform’s perspective. The relevant section is below, but the main conclusion was that most online platforms were pricing as if demand and/or supply were highly elastic. In other words, as if even a small increase in price would send nearly all customers elsewhere.

[Screenshot: the relevant section of “Online Labor Markets”, 2015-06-25 09.04.14]

The basic reason is that when the platform doubles its fee from 10% to 20%, it doubles its revenue (if everything else stays the same) but only increases the cost of using MTurk for users by about 10%. Of course, there is some decline in usage (demand curves slope down), which reduces profits, but it has to be a huge reduction to offset the direct increase in revenue to the platform. This is more or less the same argument for why cutting taxes doesn’t increase revenue unless tax rates are incredibly high: the consensus estimate is that the revenue-maximizing tax rate is in the mid-70% range.
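The arithmetic behind this can be sketched in a few lines of Python. I am assuming the fee is charged as a percentage on top of what workers are paid, that the platform's marginal cost is zero (so revenue and profit move together), and the function names are my own.

```python
def platform_revenue(fee_rate, wage_bill):
    """Platform's take when it charges fee_rate on top of worker payments."""
    return fee_rate * wage_bill

def requester_cost(fee_rate, wage_bill):
    """Total cost to requesters: worker payments plus the platform fee."""
    return (1 + fee_rate) * wage_bill

old_fee, new_fee = 0.10, 0.20

# How much more expensive does MTurk get for a requester, per dollar of wages?
cost_increase = requester_cost(new_fee, 1.0) / requester_cost(old_fee, 1.0) - 1

# Fraction of the original wage bill that must remain for revenue to be unchanged.
break_even_volume = platform_revenue(old_fee, 1.0) / platform_revenue(new_fee, 1.0)

print(f"cost to requesters rises by {cost_increase:.1%}")
print(f"revenue falls only if volume drops below {break_even_volume:.0%} of its old level")
```

So under these assumptions the price to requesters goes up by roughly 9% (the "about 10%" quoted above), and volume would have to fall by more than half before the fee increase lowers the platform's revenue.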

My guess is that MTurk has fairly few substitutes and someone at Amazon decided it should be making more money as a service. Fortunately for us as observers, because of the great work of Panos Ipeirotis, we’ll get to see what happens.





The impact & potential of online work – some new references

In the last month, two new reports have come out looking at online work and online labor markets, with a focus on their potential for economic development.

There is:

  • A report from the McKinsey Global Institute, “Connecting talent with opportunity in the digital age”. The McKinsey report covers a lot of the economic rationale for online work, and digitization a bit more broadly.
  • A report from the World Bank, “Jobs without Borders” (pdf), which focuses more on what these markets could do for workers in less developed countries. It brings together quite a bit of disparate data on the size of online marketplaces, worker composition and so on.

The McKinsey report relies on some of the work that went into my NBER working paper w/ Ajay Agrawal, Liz Lyons and Nico Lacetera on “Digitization and the Contract Labor Market”, which in turn leans heavily on data from oDesk (now Upwork). This paper, along with all the others from the conference, is now available as a book from the University of Chicago Press.


Allocating online experimental subjects to cells—doing better than random

Most people running experiments understand that they need to randomize subjects to their experimental groups. When I teach causal inference to my undergraduates, I try to drive this point home and I tell them horror stories about would-be experimenters “randomizing” on things like last name or even worse, effectively letting people opt into their treatment cell by conditioning assignment on some behaviour the user has taken.

But this randomization requirement for valid inference is actually a noble lie of sorts. What you really need is “unconfoundedness”, and in experiments I have run on Mechanical Turk I don’t randomize but rather stratify, allocating subjects sequentially to experimental groups based on their “arrival” to my experiment. In other words, if I had two cells, treatment and control, my assignments would go like this:

  • Subject 1 goes to Treatment
  • Subject 2 goes to Control
  • Subject 3 goes to Treatment
  • Subject 4 goes to Control
  • Subject 5 goes to Treatment

So long as subjects do not know relative arrival order and cannot condition on it (which is certainly the case), this method of assignment, while not random, is OK for valid causal inference. In fact, the arrival-time stratification approach does better than OK: it gives you more precise estimates of the treatment effect for a given sample size.

The reason is that this stratification ensures your experiment is better balanced on arrival times, which are very likely correlated with user demographics (because of time-zones) and behaviour (early arrivals are more likely to be heavy users of the site). For example, suppose on Mechanical Turk you run your experiment over several days and the country of each subject, in order is:

  1. US
  2. US
  3. US
  4. US
  5. India
  6. India
  7. India
  8. India

With randomization, there is a non-zero chance you could get an assignment of Treatment, Treatment, Treatment, Treatment, Control, Control, Control, Control, which is as badly biased as possible (all Americans in the treatment, all Indians in the control). There are a host of other assignments that are better, but not by much, still giving us bad balance. Of course, our calculated standard errors take this possibility into “account” but we don’t have to do this to ourselves—with the stratified on arrival time method, we get a perfectly balanced experiment on this one important user attribute, and hence more precision. As the experiment gets larger and larger this matters less and less, but at any sample size, we do better with stratification if we can pull it off.
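To put a number on that "non-zero chance", here is a small enumeration in Python. I am assuming "randomization" means picking exactly 4 of the 8 subjects for treatment uniformly at random; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

countries = ["US"] * 4 + ["India"] * 4  # arrival order from the example above

def worst_case_probability(countries, n_treated):
    """Chance that a random choice of n_treated subjects splits the arms exactly by country."""
    n = len(countries)
    total = 0
    worst = 0
    for treated in combinations(range(n), n_treated):
        total += 1
        treated_set = set(treated)
        treated_countries = {countries[i] for i in treated_set}
        control_countries = {countries[i] for i in range(n) if i not in treated_set}
        # Worst case: each arm contains subjects from only one country.
        if len(treated_countries) == 1 and len(control_countries) == 1:
            worst += 1
    return Fraction(worst, total)

p_worst = worst_case_probability(countries, 4)
print(p_worst)  # 2 of the 70 possible assignments

# Alternating assignment by arrival order, for comparison.
alternating = ["T" if i % 2 == 0 else "C" for i in range(len(countries))]
us_treated = sum(1 for c, a in zip(countries, alternating) if c == "US" and a == "T")
```

If each subject were instead assigned by an independent coin flip, the corresponding chance would be 2 / 2^8 = 1/128. Either way, alternating by arrival order puts exactly two subjects from each country in each arm.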

We could show the advantages of stratification mathematically, but we can also just see it through simulation. Suppose we have some unobserved attribute of experimental subjects (the variable x in my simulation, the R code of which is available at the end of this post) that has a strong effect on the outcome, y (in my simulation the effect is 3 * x), and we have a treatment that has a constant treatment effect (in my simulation, 1) whenever it is applied. And let us suppose that subjects arrive in x order (smallest to largest). Below I plot the average improvement, from using stratification rather than randomization for assignment, in the absolute difference between the actual treatment effect (which is 1) and the experimental estimate. As we can see, stratification always gives us an improvement, though the advantage is declining in the sample size.
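Here is a rough Python re-creation of that kind of simulation. It is not the R code referred to above: the normal distributions for x and the noise, the sample sizes, the number of replications, and the function name `mean_abs_error` are all assumptions on my part.

```python
import numpy as np

def mean_abs_error(n, n_sims, stratified, rng, effect=1.0, slope=3.0):
    """Average |estimate - true effect| over n_sims simulated experiments."""
    errors = np.empty(n_sims)
    for s in range(n_sims):
        x = np.sort(rng.normal(size=n))  # subjects arrive in x order
        if stratified:
            trt = np.arange(n) % 2 == 0  # Treatment, Control, Treatment, ...
        else:
            trt = rng.permutation(n) < n // 2  # a random half is treated
        y = slope * x + effect * trt + rng.normal(size=n)
        estimate = y[trt].mean() - y[~trt].mean()  # difference in means
        errors[s] = abs(estimate - effect)
    return errors.mean()

rng = np.random.default_rng(42)
improvement = {}
for n in (20, 100, 500):
    err_random = mean_abs_error(n, 2000, stratified=False, rng=rng)
    err_strat = mean_abs_error(n, 2000, stratified=True, rng=rng)
    improvement[n] = err_random - err_strat
    print(n, round(improvement[n], 3))
```

Each printed number is how much smaller the average absolute error is under alternating assignment than under randomization for that sample size.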



Marketing in Networks: The Case of Blue Apron

For the past several months, I’ve been a very happy Blue Apron customer. I’ve always liked to cook, but my interest in and devotion to cooking have waxed and waned over the years, with the main determinant being the (in)convenience of shopping. Blue Apron basically solves the shopping problem: they send you a box of all the ingredients you need to make a number of meals, together with detailed and visual preparation instructions.

But this blog post isn’t about cooking–it’s about Blue Apron’s interesting marketing strategy. Blue Apron makes extensive use of free meals that existing customers can send to friends and family who aren’t already signed up. My anecdotal impression is that this meal-sharing approach is remarkably successful: it’s how my family joined, and how we’ve since gotten siblings, parents, extended family and many friends to join. They in turn have gushed about Blue Apron on Facebook and earned their own free boxes to send, and so on.

What I think is interesting about Blue Apron’s marketing is that although free samples are costly, these kinds of customer-chosen free samples are amazingly well targeted: presumably their customers know how likely their friends and family are to like the service and thus can tailor their invitations accordingly. This is particularly important for a product that has (somewhat) niche appeal. It would be interesting to compare the conversion of these social referrals to other acquisition channels.

This marketing strategy also raises some design questions. First, which of your existing customers are the best ones to target with free boxes to give away? Presumably the most enthusiastic ones are better targets (the ones that never cancel deliveries, order many meals a week, and so on). It also makes sense to target customers with lots of social connections to the target Blue Apron demographic. However, it is perhaps not as simple as identifying super-users. These customers, while likely to give great word-of-mouth endorsements, might also not be as persuasive with would-be customers who might infer that the service is not for them based on the others’ enthusiasm: if the gourmet chef aunt loves the service, the pb&j-fan nephew may infer that it’s only for experts and abstain.

Even once you identify the super-users, how do you prompt them to optimally allocate the free samples? Presumably customers will start by sending the meals to people they think will value them the most and then work their way down, offering diminishing returns to Blue Apron. However, individuals probably have differently-shaped curves, with some who rapidly deplete their interested friends and others that are “deep” with potential customers. Blue Apron also has to consider not just the probability that a free sample leads to a new customer, but also the probability that the new customer will bring in other customers, and so on.

There is some academic work on marketing in networks. For example, Sinan Aral and Dylan Walker have a great paper on the design of products to encourage viral growth. One of their conclusions is that more passive broadcast messaging is more effective than “active-personalized” features because the passive broadcast approach is used more often & thus the greater usage outweighs the lower effectiveness. It would be really interesting to compare analogous interventions in the Blue Apron context, say a Facebook experiment comparing one arm where the customer chooses who to send their free meals to with another where they just generically share that they have free meals to give and individuals in their network self-select based on their interest.

Trusting Uber with Your Data


There is a growing concern about companies that possess enormous amounts of our personal data (such as Google, Facebook and Uber) using it for bad purposes. A recent NYTimes op-ed proposed that we need information fiduciaries that would ensure data was being used properly:

Codes of conduct developed by companies are a start, but we need information fiduciaries: independent, external bodies that oversee how data is used, backed by laws that ensure that individuals can see, correct and opt out of data collection.

I’m not convinced that there has been much harm from all of this data collection—these companies usually want to collect this data for purposes no more nefarious than up-selling you or maybe price discriminating against you (or framed in more positive terms, offering discounts to customers with a low willingness to pay). More commonly, they collect data because they need it to make the service work—or it just gets captured as a by-product of running a computer-mediated platform and deleting the data is more hassle than storing it.

Even if we assume there is some great harm from this data collection, injecting a third party to oversee how the data is used seems very burdensome. For most of these sites, nearly every product decision touches upon data that will be collected or uses data already collected. When I was employed at oDesk I worked daily with the database to figure out precisely what “we” were capturing and how it should or could be used in the product. I also designed features that would capture new data. From this experience, I can’t imagine any regulatory body being able to learn enough about a particular site to regulate it efficiently, unless they want a regulator, one who also knows SQL and has access to the firm’s production database, sitting in every product meeting at every tech company.

One could argue that the regulator could stick to broad principles. But if their mandate is to decide who can opt out of certain kinds of data collection and what data can be used for what purpose, then they will need to make decisions at a very micro level. Where would you get regulators that could operate at this micro-level and simultaneously make decisions about what was good for society? I think you couldn’t and the end result would probably be either poor, innovation-stifling mis-regulation or full-on regulatory capture—with regulations probably used as a cudgel to impose large burdens on new entrants.

So should sites just be allowed to do whatever they want data-wise? Probably, at least until we have more evidence of some actual rather than hypothetical harm. If there are systematic abuses, these sites (with their millions of identifiable users) would make juicy targets for class action lawsuits. The backlash (way overblown, in my opinion) from the Facebook experiment was illustrative of how popular pressure can change policies and that these companies are sensitive to customer demands: the enormous sums these companies pay for upstart would-be rivals suggest they see themselves as being in an industry with low switching costs. We can also see market forces leading to new entrants whose specific differentiator is supposedly better privacy protections, e.g., DuckDuckGo and Ello.

Not fully trusting Uber is not a good enough reason to introduce a regulatory body that would find it nearly impossible to do its “job”—and more likely, this “job” would get subverted into serving the interests of the incumbents they are tasked with regulating.

Human Capital in the “Sharing Economy”

Most of my academic research has focused on online labor markets. Lately, I’ve been getting interested in another kind of online service, namely one for the transfer of human capital, or in non-econ jargon, teaching. There have been a number of new companies in this space (Coursera, edX, Udacity and so on), but one that strikes me as fundamentally different, and different in an important way, is Udemy.

Unlike other ed tech companies, Udemy is an actual marketplace:  instructors from around the world can create online courses in whatever their area of expertise and then let students access those courses, often for a fee but not always. Instructors decide the topic, the duration and price: students are free to explore the collection of courses and decide what courses are right for their needs and their budget. The main reason I think this marketplace model is so important is that it creates strong incentives for instructors to create new courses, thus partially fixing the “supply problem” in online education (which I’ll discuss below).

Formal courses are a great way to learn some topic, but not everything worth learning has been turned into a course. Some topics are just too new for a course to exist yet. It takes time to create courses, and for fields that change very rapidly (technology being a notable example) no one has had the time to create a course. The rapid change in these fields also reduces the incentives for would-be instructors, many of whom likely do not even think of themselves as teachers, to make courses, as the material can rapidly become obsolete. Universities can and do create new courses, but it’s hard to get faculty to take on more work. Further, the actual knowledge that needs to be “course-ified” is often held by practitioners and not professors.

I recently worked with Udemy to develop a survey of their instructors. We asked a host of questions (and I hope to blog about some of the other interesting ones) but one that I think is particularly interesting was “Where did you acquire the knowledge that you teach in your course?” We wanted to see whether a lot of what was being taught on Udemy was knowledge that was acquired through some means other than formal schooling. In the figure below, I plot the fraction of respondents selecting different answers.


We can see that the most common answer is a mixture of formal education and on-the-job experience (about 45%). The next most common answer was strictly on-the-job experience, at a little less than 30%. Less than 10% of instructors were teaching things they had learned purely in school.

These results strongly support the view that Udemy is in fact taking knowledge acquired from non-academic sources and turning it into formal courses. Given that Udemy is filling in a gap left by traditional course offerings, it is perhaps not surprising that the answers skew towards “on the job training”, but the skew is even more pronounced than I would have expected. I also think it’s interesting that the most common answer was a “mixture”, suggesting that for instructors their on-the-job training was a complement to their formal education.

Online education and ed tech are exciting in general: they promise to potentially overcome the Baumol’s cost disease characterization of the education sector and let us educate a lot more people at a lower cost. However, I suspect that business models that simply take offline courses and move them online will not create the incentives needed to bring the large amounts of practical knowledge into the course format; by creating a marketplace, Udemy creates those incentives. By having an open platform for instructors, it can potentially tap the expertise of a much larger cross section of the population. Expertise does not just reside within academia, and Udemy (unlike platforms that simply take traditional courses and put them online) can unearth this expertise and catalyze its transformation into courses. And by forcing these courses to compete in a market for students, it creates strong incentives for both quality and timeliness.

Although having a true marketplace has many advantages, running marketplace businesses is quite difficult: they create challenges like setting pricing policies, building and maintaining a reputation system, ensuring product quality without controlling production, mediating disputes and so on. But taking on these challenges seems worth it, particularly as businesses are getting better at running marketplaces (see Uber, Lyft, Airbnb, Elance-oDesk, etc.). In future blog posts, I hope to talk about some other interesting aspects of the survey and how they relate to market design. There are some really interesting questions raised by Udemy, such as how instructors should position their courses vis-a-vis what’s already offered and how they should set prices, and, on the Udemy side, how you share revenue, how you market courses, how you have your reputation/feedback system work, how you decide to screen courses and so on. It’s a really rich set of interesting problems.