# Reputation systems are great for buyers (and good for sellers too)

In a recent NYTimes article about Uber drivers organizing in response to fare cuts, there was a description of the rating system and how it affects drivers:

They [drivers] are also constrained by the all-important rating system — maintain an average of around 4.6 out of 5 stars from customers in many cities or risk being deactivated — to behave a certain way, like not marketing other businesses to passengers.

Using “marketing a side business” as an example of behavior the reputation system curtails is like saying “the police prevent many crimes, like selling counterfeit maple syrup”: technically true, but it gives the wrong impression about what’s typical.

Bad experiences on ride-sharing apps presumably mirror bad experiences in taxis: drivers having a dirty car, talking while driving, being rude, driving dangerously or inefficiently and so on. I’d wager that “marketing a side business” complaints more or less never happen. If they do happen, it’s probably because the driver was particularly aggressive or annoying about promoting their business (or the passenger was idiosyncratically touchy). It certainly doesn’t seem to be against Uber’s policy: an Uber spokesperson said recently that Uber not only condones it, but encourages it.

Being subject to a reputation system is certainly personally costly to drivers (who likes being accountable?), but it’s not clear to me that even drivers as a whole should dislike such systems, so long as they apply to every driver. Bad experiences from things like poor driving or unclean vehicles are not just costly to passengers, but are also costly to other drivers, as they reduce the total demand for transportation services (NB: Chris Nosko & Steve Tadelis have a really nice paper quantifying the effects of these negative spillovers on other sellers, in the context of eBay). The problem with quality in the taxi industry historically is that competition doesn’t “work” to fix quality problems.

Competition can’t solve quality problems because a passenger only learns a driver was bad after already having the bad experience. Because of the way taxi hails work, passengers can’t meaningfully harm the driver by taking their business elsewhere in the future, like they could after a bad experience at a restaurant. As such, the bad-apple drivers don’t have incentives up front to be good or to improve. (The same goes for the other problem, bad passengers, who do exist and whom the reputation system also helps deal with.) Reputation systems, while far from perfect, solve this problem.

While reputation systems seem like something only a computer-mediated platform like Uber or Lyft can have, there’s no reason (other than cost) why regulated taxis couldn’t also start having reputation systems. Taxis could ask for passenger feedback in the car using the touch screen, and then use some of the advertising real estate outside the car to show average driver feedback scores to would-be passengers. This would probably be more socially useful than the usual NYC advertisements on top of yellow cabs, such as for gentlemen’s clubs, e-cigarettes, and yellow cabs.

Disclosure: I worked with Uber’s data science team in the summer of 2015. However, the direction of causality is that I wanted to work with Uber because they are amazing; I don’t think Uber is amazing because I worked for them.

# One weird trick for eye-balling a means comparison

Often, I’m in a seminar or reading a paper and I want to quickly see if the difference in two means is likely to be due to chance or not. This comparison requires computing the standard error of the difference in means, which is $SE(\Delta) = \sqrt{SE_1^2 + SE_2^2}$, where $SE_1$ is the standard error of the first mean and $SE_2$ is the standard error of the second mean. (Let’s call the difference in means $\Delta$.)

Squaring and taking square roots in your head (or on paper for that matter) is a hassle, but if the two standard errors are about the same, we can approximate this as $SE(\Delta) \approx \frac{3}{2} \times SE_1$ (because $\sqrt{2} \approx 3/2$), which is a particularly useful approximation. The reason is that the 95% CI for $\Delta$ is about $4 \times SE(\Delta) = 6 SE_1$ wide (i.e., 6 of our “original” standard errors). As such, we can construct the 95% CI for the difference Greek-geometer style, by taking the original CI, dividing it into fourths and then adding one more SE to each end.
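To see how close the shortcut is, here is a minimal Python sketch that compares the exact interval with the eyeballed one. The means and standard errors are made-up numbers, the function names are hypothetical, and a multiplier of 2 stands in for the usual 1.96.

```python
import math

def diff_ci_exact(mean_a, mean_b, se_a, se_b, z=2.0):
    """CI for the difference in means using the exact SE of the difference."""
    delta = mean_a - mean_b
    se_delta = math.sqrt(se_a ** 2 + se_b ** 2)
    return delta - z * se_delta, delta + z * se_delta

def diff_ci_eyeball(mean_a, mean_b, se, z=2.0):
    """Same CI using SE(delta) ~ 1.5 * SE, for two roughly equal standard errors."""
    delta = mean_a - mean_b
    se_delta = 1.5 * se
    return delta - z * se_delta, delta + z * se_delta

# Illustrative (made-up) inputs: two means with a common standard error of 0.5.
exact = diff_ci_exact(10.0, 8.0, 0.5, 0.5)
eyeball = diff_ci_eyeball(10.0, 8.0, 0.5)
print("exact:  ", exact)
print("eyeball:", eyeball)
```

Because 3/2 is a little larger than $\sqrt{2}$, the eyeballed interval is slightly wider than the exact one, so the shortcut errs on the conservative side.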

The figure below illustrates the idea – we’re comparing A & B and so we construct a confidence interval for the difference between them, that is 6 SE’s in height. And we can easily see if that CI includes the top of B.

## What if the SE’s are different?

Often the means we compare don’t have the same standard error, and so the above approximation would be poor. However, so long as the standard errors are not too different, we can compute a better approximation without any squaring or taking square roots. One approximation for the true standard error that’s fairly easy to remember is:

$\sqrt{SE_1^2 + SE_2^2} \approx \frac{3}{2}SE_1 + \frac{2}{3}(SE_2 - SE_1)$.

This is just the Taylor series approximation of the correct formula about $SE_1 - SE_2 \approx 0$ (and using $\sqrt{2} \approx 3/2$ and $1/\sqrt{2} \approx 2/3$).
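A quick numerical check is easy to write. The sketch below expands $\sqrt{SE_1^2 + SE_2^2}$ to first order around $SE_2 = SE_1$, which gives $\sqrt{2} SE_1 + \frac{1}{\sqrt{2}}(SE_2 - SE_1)$, and then applies the two rounded constants. The function names and the test values are hypothetical, chosen only for illustration.

```python
import math

def se_diff_exact(se1, se2):
    """Exact standard error of the difference in two independent means."""
    return math.sqrt(se1 ** 2 + se2 ** 2)

def se_diff_approx(se1, se2):
    """First-order expansion around se2 == se1, with sqrt(2) ~ 3/2 and 1/sqrt(2) ~ 2/3."""
    return 1.5 * se1 + (2.0 / 3.0) * (se2 - se1)

# Illustrative pairs of standard errors that are "not too different".
for se1, se2 in [(1.0, 1.0), (1.0, 1.2), (1.0, 0.8), (1.0, 1.5)]:
    exact = se_diff_exact(se1, se2)
    approx = se_diff_approx(se1, se2)
    print(f"SE1={se1} SE2={se2} exact={exact:.3f} approx={approx:.3f}")
```

Note the direction of the correction: a larger $SE_2$ has to raise the approximation, since the exact value is increasing in $SE_2$.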

# Monte Carlo Clusterjerk

Chris Blattman recently lamented reviewers asking him to cluster standard errors for a true experiment, which he viewed as incorrect, though he had no citation to support his claim. It seems intuitive to me that Chris is right (and everyone commenting on his blog post agreed), but no one could point to something definitive.

I asked on Twitter whether a blog post with some simulations might help placate reviewers and he replied “beggars can’t be choosers,” so here it is. My full code is on GitHub.

To keep things simple, suppose we have a collection of individuals that are nested in groups, indexed by $g$. For some outcome of interest $y$, there’s an individual-specific effect, $\epsilon$, and a group-specific effect, $\eta$. This outcome also depends on whether a binary treatment has been applied (status indicated by $W$), which has an effect size of $\beta$.

$y = \beta \times W + \eta_g + \epsilon$

We are interested in estimating $\beta$ and correctly reporting the uncertainty in that estimate.

First, we need to create a data set with a nested structure. The R code below does this, with a few things hard-wired: the $\eta$ and $\epsilon$ are both drawn from a standard normal and the probability of treatment assignment is 1/2. Note that the function takes a boolean parameter randomize.by.group that lets us randomize by group instead of by individual. We can specify the sample size, the number of groups and the size of the treatment effect.

This function returns a data frame that we can analyze. Here’s an example of the output. Note that for two individuals with the same group assignment, the $\eta$ term is the same, but that the treatment varies within groups.

Now we need a function that simulates us running an experiment and analyzing the data using a simple linear regression of the outcome on the treatment indicator. The function below returns the estimate, $\hat{\beta}$, and the standard error, $SE(\hat{\beta})$, from one “run” of an experiment:

Let’s simulate running the experiment 1,000 times (NB: if the “%>%” notation looks funny to you, I’m using the magrittr package):

The standard error also has a sampling distribution but let’s just take the median value from all our simulations:

If we compare this to the standard deviation of our collection of $\hat{\beta}$ point estimates, we see the two values are nearly identical (which is good news):

If we plot the empirical sampling distribution of $\hat{\beta}$ and label the 2.5% and 97.5% percentiles as well as the 95% CI (constructed using that median standard error) around the true $\beta$, the two intervals are right on top of each other:

Code for the figure above:

Main takeaway: Despite the group structure, the plain vanilla OLS run with data from a true experiment returns the correct standard errors (at least for the parameters I’ve chosen for this particular simulation).
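For readers who want to try this without R, here is a minimal Python sketch of the same individual-level simulation, using only the standard library. The function names, the sample size (1,000), the number of groups (20) and the number of replications (300) are assumptions for illustration, not necessarily the settings behind the numbers above. With a single binary regressor, OLS reduces to a difference in means, which is what the helper computes.

```python
import math
import random
import statistics

def create_data(n, n_groups, beta, rng, randomize_by_group=False):
    """Simulate y = beta * W + eta_g + eps with standard-normal eta and eps."""
    eta = [rng.gauss(0, 1) for _ in range(n_groups)]
    group_trt = [rng.random() < 0.5 for _ in range(n_groups)]
    rows = []
    for _ in range(n):
        g = rng.randrange(n_groups)  # each individual lands in a random group
        w = group_trt[g] if randomize_by_group else (rng.random() < 0.5)
        y = beta * w + eta[g] + rng.gauss(0, 1)
        rows.append((g, int(w), y))
    return rows

def ols_binary(rows):
    """OLS of y on an intercept and W; returns (beta_hat, classical SE)."""
    y1 = [y for _, w, y in rows if w == 1]
    y0 = [y for _, w, y in rows if w == 0]
    m1, m0 = statistics.fmean(y1), statistics.fmean(y0)
    rss = sum((y - m1) ** 2 for y in y1) + sum((y - m0) ** 2 for y in y0)
    sigma2 = rss / (len(rows) - 2)  # residual variance with 2 fitted parameters
    se = math.sqrt(sigma2 * (1 / len(y1) + 1 / len(y0)))
    return m1 - m0, se

rng = random.Random(1)
runs = [ols_binary(create_data(1000, 20, 1.0, rng)) for _ in range(300)]
median_se = statistics.median([se for _, se in runs])
sd_beta = statistics.stdev([b for b, _ in runs])
print(f"median SE: {median_se:.3f}   SD of beta_hat: {sd_beta:.3f}")
```

If the plain OLS standard errors are right, the two printed numbers should be close to each other.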

## What if we randomize at the group level but don’t account for this group structure?

At the end of his blog post, Chris adds another cluster-related complaint:

Reviewing papers that randomize at the village or higher level and do not account for this through clustering or some other method. This too is wrong, wrong, wrong, and I see it happen all the time, especially political science and public health.

Let’s redo the analysis but change the level of randomization to group and see what happens if we ignore this change in the level of randomization. As before, we simulate and then compare the median standard error we observed from our simulations to the standard deviation of the sampling distribution of our estimated treatment effect:

The OLS standard errors are (way) too small: the median value from OLS is still about 0.08 (as expected) but the standard deviation of the sampling distribution of the estimated treatment effect is 0.45. The resultant CIs look like this:

Eek. Here are two R-specific fixes, both of which seem to work fine. First, we can use a random effects model (from the lme4 package):

Or we can cluster standard errors. The package I use for this is lfe, which is really fantastic. Note that you put the factor you want to cluster by in the 3rd position following the formula:
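The same comparison can be sketched in Python with only the standard library. This is not the lme4 or lfe approach: it hand-rolls a basic cluster-robust (sandwich) variance for the difference in means, with no small-sample correction. The function names, the 1,000 individuals, the 20 groups, the 300 replications, and the choice to treat exactly half of the groups are all assumptions made for illustration.

```python
import math
import random
import statistics
from collections import defaultdict

def create_grouped_data(n, n_groups, beta, rng):
    """y = beta * W + eta_g + eps, with W assigned at the group level."""
    eta = [rng.gauss(0, 1) for _ in range(n_groups)]
    # Assumption: treat exactly half of the groups so both arms are non-empty.
    treated = set(rng.sample(range(n_groups), n_groups // 2))
    rows = []
    for _ in range(n):
        g = rng.randrange(n_groups)
        w = 1 if g in treated else 0
        rows.append((g, w, beta * w + eta[g] + rng.gauss(0, 1)))
    return rows

def fit(rows):
    """Return beta_hat, the classical OLS SE, and a cluster-robust SE."""
    n1 = sum(w for _, w, _ in rows)
    n0 = len(rows) - n1
    m1 = sum(y for _, w, y in rows if w == 1) / n1
    m0 = sum(y for _, w, y in rows if w == 0) / n0
    rss = 0.0
    score = defaultdict(float)  # per-group sum of weighted residuals
    for g, w, y in rows:
        u = y - (m1 if w == 1 else m0)
        rss += u * u
        score[g] += u * (1 / n1 if w == 1 else -1 / n0)
    se_ols = math.sqrt(rss / (len(rows) - 2) * (1 / n1 + 1 / n0))
    se_cluster = math.sqrt(sum(s * s for s in score.values()))
    return m1 - m0, se_ols, se_cluster

rng = random.Random(2)
runs = [fit(create_grouped_data(1000, 20, 1.0, rng)) for _ in range(300)]
sd_beta = statistics.stdev([b for b, _, _ in runs])
median_ols = statistics.median([s for _, s, _ in runs])
median_cluster = statistics.median([s for _, _, s in runs])
print(f"SD of beta_hat: {sd_beta:.3f}")
print(f"median OLS SE: {median_ols:.3f}   median clustered SE: {median_cluster:.3f}")
```

With only 20 clusters, the uncorrected clustered standard error tends to run a little below the true sampling variability, but it should land far closer to it than the plain OLS value does.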

One closing thought, a non-econometric argument for why clustering can’t be necessary for a true experiment with randomization at the individual level: for *any* experiment, presumably there is some latent (i.e., unobserved to the researcher) grouping of the data such that the errors within that group are correlated with each other. As such, we could never use our standard tools for analyzing experiments to get the right standard errors if taking this latent grouping into account was necessary.

# Experimenting with targeted features

Suppose you run a website and you have some experience or feature that you think might be good for some subset of your users (but ineffective, at best, for others). You might try to (1) identify who would benefit based on observed characteristics, then (2) alter the experience only for a targeted subset of users expected to benefit.

To make things concrete, in some cities, Uber offers “UberFamily” which means the Uber comes with a car seat. For us (I have two kids), UberFamily is awesome, but the option takes up valuable screen real estate and for a user that Uber thinks does not have kids, adding it to the app screen is a waste. So Uber would like to both (a) figure out if it is likely that I have kids and then (b) adjust the experience based on that model. But they’d also like to know if it’s worth it in general to offer this service even among those they think could use it. This isn’t the example that motivated this blog post, but it makes the scenario clear.

If you are testing features of this sort, then you want to both (a) assess your targeting and (b) assess the feature itself. How should you proceed? I’m sure there’s probably some enormous literature on this question (there’s a literature on everything), but I figure by offering my thoughts and potentially being wrong on the Internet, I can be usefully corrected.

I think what you want to do is not test your targeting experimentally but rather roll out the feature for everyone you reasonably can, then evaluate your targeting algorithms on your experimental data. So, you would run the experiment with a design that maximizes power to detect treatment effects (e.g., 50% to treatment, 50% to control). In other words, completely ignore your targeting algorithm recommendations.

Then, after the experimental data comes in, look for heterogeneous treatment effects conditioned on the predictive model score, where the score can be thought of as a measure of how much we think a person should have benefitted from the treatment. The simplest thing you could do would be to normalize all scores (so the scores have the same mean and variance across algorithms, making model coefficients directly interpretable across algorithms). Then just run the regression:

$y = \beta_0 + \beta_1 (score \times trt) + \beta_2 score + \beta_3 trt$

Hopefully, if the treatment was better for people the model thought would be helped, then $\hat{\beta_1}$ should be positive (assuming the y is such that bigger is better).
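As a rough illustration of the regression above, here is a self-contained Python sketch that simulates an experiment and recovers the coefficients by solving the normal equations. Everything in it is an assumption made for the example: the data-generating process, the true coefficient values (0.5 on the interaction, -0.1 on treatment), the sample size, and the helper names `solve` and `ols`.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    k = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

rng = random.Random(0)
X, y = [], []
for _ in range(5000):
    score = rng.gauss(0, 1)        # normalized targeting score
    trt = int(rng.random() < 0.5)  # 50/50 assignment that ignores the score
    # Hypothetical outcome: the treatment helps more at higher scores.
    outcome = 1.0 + 0.5 * score * trt + 0.2 * score - 0.1 * trt + rng.gauss(0, 1)
    X.append([1.0, score * trt, score, trt])  # columns: const, score x trt, score, trt
    y.append(outcome)

b0, b1, b2, b3 = ols(X, y)
print(f"b1 (score x trt) = {b1:.2f}, b3 (trt) = {b3:.2f}")
```

In this made-up setup the interaction coefficient comes out positive, which is the pattern you would hope to see from a targeting model that works.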

You’d also want to find the minimum score at which you should be targeting people, i.e., the score at which the expected benefit from targeting first becomes positive. You can then simply select the algorithm with the greatest expected improvement, given the minimum score for targeting.
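As a sketch of how that minimum score could be read off the regression above (assuming the linear specification is adequate and $\beta_1 > 0$; this is an illustration, not a result from data), the expected effect of treatment at a given score is

$E[y \mid trt = 1, score] - E[y \mid trt = 0, score] = \beta_3 + \beta_1 \, score$,

which is positive whenever

$score > -\frac{\beta_3}{\beta_1}$.

Plugging in $\hat{\beta}_1$ and $\hat{\beta}_3$ for each algorithm gives a candidate cutoff, though a cutoff estimated this way inherits the sampling noise of both coefficients.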

This seems like a reasonable approach (and maybe bordering on obvious but it wasn’t obvious to me at first). Any other suggestions?

# Simple distributed text editing with Mechanical Turk

When I write a sentence, there’s about a 10% chance it will have a typo or grammatical error of some kind. It’s often painful to find them later, as like most people, I tend to “fill in the gaps” or glide over typos when reading my own writing. Fortunately, this kind of editing, unlike, say, reading for structure or consistency, is very parallelizable. In fact, reading each sentence alone, out of order, might even be better than reading the whole document, sentence by sentence.

As an experiment, I wrote a little script that splits a document up into sentences and writes them to a CSV file, with one sentence per line (the script is here). With this CSV, I can use Mechanical Turk to create HITs, with one HIT per sentence. The instructions ask workers to label each sentence as “OK” or “Not OK” with an optional field to explain their reasoning. The MTurk interface looks like this:

After splitting the sentences, I went through the CSV file to remove blank lines and LaTeX commands by hand, though one could easily add this feature to the script.
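For illustration, here is a minimal Python sketch of this kind of splitter. It is not the script linked above: the regular expression, the function name `sentences_to_csv`, and the column names are all assumptions, and the end-of-sentence rule is deliberately naive (it will break on abbreviations such as “e.g.”). It records the line on which each sentence starts, which makes it easier to find the sentence in the source file later.

```python
import csv
import io
import re

# A sentence starts at a non-space character and runs to the first ., ! or ?
# that is followed by whitespace, or to the end of the text.
SENTENCE = re.compile(r"\S.*?(?:[.!?](?=\s)|\Z)", re.S)

def sentences_to_csv(text):
    """Return CSV text with one row per sentence: starting line number, sentence."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["line", "sentence"])
    for match in SENTENCE.finditer(text):
        line_no = text.count("\n", 0, match.start()) + 1
        sentence = " ".join(match.group().split())  # collapse line breaks and runs of spaces
        writer.writerow([line_no, sentence])
    return out.getvalue()

sample = "Hello world. This is\na test! Done.\n"
print(sentences_to_csv(sample))
```

The resulting CSV can then be uploaded as the input file for a batch of HITs, one row per HIT.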

I posted the HITs on MTurk this morning, paying 2 cents, with 4 HITs per sentence (so each sentence will be checked 4 times by different workers). The text was a paper I’m working on. Results started coming in remarkably quickly; here is how it looked after 30 minutes:

I’m not thrilled with the hourly rate (I try to shoot for \$5/hour) but this average is always very sensitive to workers who take a long time. So far, the comments are very helpful, especially since with multiple ratings, you can find problematic sentences. For example:

The “86” is the line number from the LaTeX document, which is nice because it makes it easier to go find the appropriate sentence to fix. Here are some more samples of the kinds of responses I’m getting:

Overall, I think it’s a successful experiment, though it was already well known from Soylent that MTurk workers can do editing tasks well.

# Documenting your work as you go with GNU Make

I’ve long been a convert to using ‘Make’ to turn LaTeX into a PDF. However, you can also easily use Make to back up your work as you go along and take “snapshots” of what a draft looked like at a moment in time (see example Makefile below). This complements using GitHub, Dropbox and some external backup (I might add another block to the Makefile that pushes a snapshot to Amazon S3).

My folder structure for a project:

• writeup (where I store the LaTeX & BibTeX)
• code
• data
• backups (where I store entire snapshots of the directory, w/o backups included, for obvious reasons)
• snapshots (where just the PDF draft is stored)

# How much *should* MTurk cost (if you’re Amazon)?

Amazon recently announced that they are going to double their percentage fee, from 10% to 20%. At least among the people I follow on Twitter (lots of academics that use MTurk for research), this has caused much consternation. A price increase is clearly bad for workers and bad for requestors (who gets hurt worse will depend on relative elasticities), but what about Amazon?

When I first got interested in online labor markets, I wrote a short paper (“Online Labor Markets”) in which I tried to figure out the optimal ad valorem charge from the platform’s perspective. The relevant section is below, but the main conclusion was that most online platforms were pricing as if demand and/or supply was highly elastic. In other words, they were pricing as if even a small increase in price would send nearly all customers elsewhere.

The basic reason is that when the platform doubles its fee from 10% to 20%, it doubles its revenue (if everything else stays the same) but only increases the cost of using MTurk for users by about 10%. Of course, there is some decline in usage (demand curves slope down), which reduces profits, but it has to be a huge reduction to offset the direct increase in revenue to the platform. This is more or less the same argument for why cutting taxes doesn’t increase revenue unless tax rates are incredibly high: the consensus estimate is that the revenue-maximizing tax rate is in the mid-70% range.
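The arithmetic behind this argument fits in a few lines of Python. The sketch assumes the fee is charged on top of the payment to the worker, that worker payments per task stay fixed, and that platform revenue is simply the fee rate times the volume of payments; the function names are hypothetical.

```python
def buyer_price_increase(old_fee, new_fee):
    """Relative increase in the all-in price a buyer pays, for a fixed worker payment."""
    return (1 + new_fee) / (1 + old_fee) - 1

def break_even_volume_ratio(old_fee, new_fee):
    """Volume (relative to before) at which platform revenue is unchanged."""
    return old_fee / new_fee

old_fee, new_fee = 0.10, 0.20
price_up = buyer_price_increase(old_fee, new_fee)
volume_ratio = break_even_volume_ratio(old_fee, new_fee)
# Percentage change in volume divided by percentage change in price.
implied_elasticity = (volume_ratio - 1) / price_up

print(f"buyer price rises by {price_up:.1%}")
print(f"revenue falls only if volume drops below {volume_ratio:.0%} of its old level")
print(f"break-even elasticity of demand: {implied_elasticity:.1f}")
```

Under these assumptions, the fee increase only loses money for the platform if a roughly 9% price increase cuts usage by more than half, which would require very elastic demand.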

My guess is that MTurk has fairly few substitutes and someone at Amazon decided it should be making more money as a service. Fortunately for us as observers, because of the great work of Panos Ipeirotis, we’ll get to see what happens.

# The impact & potential of online work – some new references

In the last month, two new reports have come out looking at online work and online labor markets, with a focus on their potential for economic development.

There is:

• A report from the McKinsey Global Institute, “Connecting talent with opportunity in the digital age.” The McKinsey report covers a lot of the economic rationale for online work and digitization a bit more broadly.
• A report from the World Bank, “Jobs without Borders” (pdf), which focuses more on what these markets could do for workers in less developed countries. It brings together quite a bit of disparate data on the size of online marketplaces, worker composition and so on.

The McKinsey report relies on some of the work that went into my NBER working paper w/ Ajay Agrawal, Liz Lyons and Nico Lacetera on “Digitization and the Contract Labor Market,” which in turn leans heavily on data from oDesk (now Upwork). This paper, along with all the others from the conference, is now available as a book from the University of Chicago Press.

# Allocating online experimental subjects to cells—doing better than random

Most people running experiments understand that they need to randomize subjects to their experimental groups. When I teach causal inference to my undergraduates, I try to drive this point home and I tell them horror stories about would-be experimenters “randomizing” on things like last name or even worse, effectively letting people opt into their treatment cell by conditioning assignment on some behaviour the user has taken.

But this randomization requirement for valid inference is actually a noble lie of sorts: what you really need is “unconfoundedness.” In experiments I have run on Mechanical Turk, I don’t randomize but rather I stratify, allocating subjects sequentially to experimental groups based on their “arrival” to my experiment. In other words, if I had two cells, treatment and control, my assignments would go like this:

• Subject 1 goes to Treatment
• Subject 2 goes to Control
• Subject 3 goes to Treatment
• Subject 4 goes to Control
• Subject 5 goes to Treatment

So long as subjects do not know their relative arrival order and cannot condition on it (which is certainly the case), this method of assignment, while not random, is OK for valid causal inference. In fact, the arrival-time stratification approach does better than OK: it gives you more precise estimates of the treatment effect for a given sample size.

The reason is that this stratification ensures your experiment is better balanced on arrival times, which are very likely correlated with user demographics (because of time-zones) and behaviour (early arrivals are more likely to be heavy users of the site). For example, suppose on Mechanical Turk you run your experiment over several days and the country of each subject, in order is:

1. US
2. US
3. US
4. US
5. India
6. India
7. India
8. India

With randomization, there is a non-zero chance you could get an assignment of Treatment, Treatment, Treatment, Treatment, Control, Control, Control, Control, which is as badly biased as possible (all Americans in the treatment, all Indians in the control). There are a host of other assignments that are better, but not by much, still giving us bad balance. Of course, our calculated standard errors take this possibility into “account” but we don’t have to do this to ourselves—with the stratified on arrival time method, we get a perfectly balanced experiment on this one important user attribute, and hence more precision. As the experiment gets larger and larger this matters less and less, but at any sample size, we do better with stratification if we can pull it off.

We could show the advantages of stratification mathematically, but we can also just see it through simulation. Suppose we have some unobserved attribute of experimental subjects (in my simulation, the R code of which is available at the end of this post, the variable x) that has a strong effect on the outcome, y (in my simulation the effect is 3 * x), and we have a treatment that has a constant treatment effect (in my simulation, 1) whenever it is applied. And let us suppose that subjects arrive in x order (smallest to largest). Below I plot the average improvement in the absolute difference between the actual treatment effect (which is 1) and the experimental estimate from using stratification rather than randomization for assignment. As we can see, stratification always gives us an improvement, though the advantage is declining in the sample size.
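A Python version of this kind of simulation might look like the sketch below (standard library only). The effect of x (3 * x), the treatment effect (1) and the arrival order follow the description above; the standard-normal distributions for x and for the noise, the sample sizes, the number of replications and the function names are assumptions made for illustration.

```python
import random
import statistics

def one_experiment(n, rng, stratify):
    """Return the absolute error of the estimated treatment effect (true effect = 1)."""
    xs = sorted(rng.gauss(0, 1) for _ in range(n))  # subjects arrive in x order
    if stratify:
        assign = [i % 2 == 0 for i in range(n)]     # alternate T, C, T, C, ...
    else:
        assign = [rng.random() < 0.5 for _ in range(n)]
        if all(assign) or not any(assign):          # guard against an empty cell
            assign[0] = not assign[0]
    treated, control = [], []
    for x, t in zip(xs, assign):
        y = 3 * x + (1 if t else 0) + rng.gauss(0, 1)
        (treated if t else control).append(y)
    estimate = statistics.fmean(treated) - statistics.fmean(control)
    return abs(estimate - 1)

def mean_abs_error(n, stratify, reps=300, seed=0):
    """Average absolute error over repeated simulated experiments."""
    rng = random.Random(seed)
    return statistics.fmean([one_experiment(n, rng, stratify) for _ in range(reps)])

for n in (20, 100, 400):
    gain = mean_abs_error(n, stratify=False) - mean_abs_error(n, stratify=True)
    print(f"n={n:4d}  improvement from stratification: {gain:.3f}")
```

A positive printed value means the alternating assignment produced a smaller average error than coin-flip randomization at that sample size.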