Data openness by private firms

The New York Times has a story today about social scientists working with company data and being unable or unwilling to make it public. The story begins:

When scientists publish their research, they also make the underlying data available so the results can be verified by other scientists.

I think the first sentence is probably more a description of how we’d like the world to be than how it actually is right now, especially in the social sciences. The main so-what of the story is that private companies are collecting enormous amounts of high quality data that lets you do fascinating social science, but companies are understandably reluctant to make this data public, primarily for privacy reasons (and probably also because they are afraid of giving up some competitive advantage).

I think the options for any organization that does or might do research are:

1) Do research for business purposes. Make neither the findings nor the data public.
2) Do research for business purposes. Make the findings but not the full data public.
3) Do research for business purposes. Make the findings and data public.  
4) Do research. Make findings and data public.

Most companies probably aren’t interested in (4) and this is probably academia’s biggest comparative advantage. Barring (4), I think from a social perspective, privacy issues aside, the best outcomes in order are (3) > (2) > (1).   I can understand (1) in some cases, but at least in the kind of companies I’m familiar with, the advantages of keeping everything secret probably aren’t that great. 
The advantages of (2) or (3) over (1):  
a) If you’re a software company and you release a feature that works, it will probably get copied anyway, regardless of whether you publish a paper, so you might as well get the thought leadership credit for coming up with the idea in the first place. This paper is/was the basis for Google’s secret sauce—posting it to the InfoLab servers back in 1999 didn’t doom the company and probably did a lot to increase the perception that they were doing something smarter (even though there were antecedents of this idea going back many years—including in Economics, by my academic grandfather).
b) If you give outside academics access to your data and let them publish, you can get them to work on your problems for free (the Netflix Prize is an obvious example). You can also recruit those academics to come work for you, or at least their grad students.
c) If you let your internal researchers publish, you can get them to work at reduced cost or get researchers you otherwise wouldn’t be able to attract (see Scott Stern’s paper on scientists “paying” to do science).

On (2) versus (3), I think there is a real dilemma: openness and privacy concerns are in tension. Furthermore, just releasing more aggregated or somehow obfuscated versions of the data is not risk-free: there’s actually an emerging literature in Computer Science on how to release data in ways that are guaranteed to still have the right privacy properties (UPenn professor Aaron Roth recently taught a course on the topic). The fact that smart people are working on it is exciting, since they might figure out provably safe ways to release data publicly, but it’s also evidence that this isn’t a trivially easy problem—seemingly innocuous data disclosures can let someone unravel the obfuscation.
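To make the flavor of that literature concrete, here is a minimal sketch of the Laplace mechanism, the canonical differential-privacy tool for releasing noisy answers to queries about a dataset. This is a generic illustration rather than code from any particular paper or course; the function name and the example numbers are my own.

```python
import random

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Release a noisy query answer satisfying epsilon-differential privacy.

    sensitivity: the most any one person's data can change the true answer.
    epsilon: the privacy budget; smaller epsilon = stronger privacy, more noise.
    """
    scale = sensitivity / epsilon
    # A Laplace(0, scale) draw is the difference of two iid exponentials.
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return true_answer + noise

# Example: a firm releases how many users clicked an ad. Counting queries
# have sensitivity 1, since one person changes the count by at most 1.
noisy_count = laplace_mechanism(1042, sensitivity=1, epsilon=0.5)
```

The appeal of this framework is that the guarantee is formal: any single individual’s presence or absence changes the distribution of the released number only by a bounded amount, no matter what auxiliary data an attacker holds.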

As a coda, I have a personal anecdote to share about this story. One of the people discussed in the article is Bernardo Huberman: 

The chairman of the conference panel — Bernardo A. Huberman, a physicist who directs the social computing group at HP Labs here — responded angrily. In the future, he said, the conference should not accept papers from authors who did not make their data public. He was greeted by applause from the audience. 

When I was a grad student, I taught a course to Harvard sophomore economics majors called “Online Labor” (syllabus).  I assigned some of Huberman’s papers on motivation. I emailed him to ask for the data from one of his papers. He wrote back: 

Dear Dr. Horton:
Thank you for your interest in my work and I certainly feel pleased when I learn that you liked my paper enough to assign it to your class.
As to your request, let me talk with the person who now handles the youtube data (we lately used it to uncover the persistence paradox) and I’ll get back to you.
Incidentally if you are interested in the role that attention and status (its marker) play among people I could send you a paper that reports on a experiment (as opposed to observational data) that elucidates it quite cleanly across cultures.

I got the data within days—I can state that he privately practices what he preaches publicly.

Update: I incorrectly stated that Aaron Roth was a professor at CMU—he did his PhD at CMU. He’s a professor at UPenn. Apologies.

Economics of the Cold Start Problem in Talent Discovery

[Image: supply train steaming into a railhead]
Tyler Cowen recently highlighted this paper by Marko Terviö as an explanation for labor shortages in certain areas of IT. The gist of the model is that in hiring novices, firms cannot fully recoup their hiring costs if the novices’ true talents will become common knowledge post-hire. It’s a great paper, but what people might not know is that the theory it proposes has been tested and found to perform very well. For her job market paper, Mandy Pallais conducted a large experiment on oDesk where she essentially played the role of the talent-revealing firm.

Here’s the abstract from her paper:

… I formalize this intuition in a model of the labor market in which positive hiring costs and publicly observable output lead to inefficiently low novice hiring. I test the model’s relevance in an online labor market by hiring 952 workers at random from an applicant pool of 3,767 for a 10-hour data entry job. In this market, worker performance is publicly observable. Consistent with the model’s prediction, novice workers hired at random obtain significantly more employment and have higher earnings than the control group, following the initial hiring spell. A second treatment confirms that this causal effect is likely explained by information revelation rather than skills acquisition. Providing the market with more detailed information about the performance of a subset of the randomly-hired workers raised earnings of high-productivity workers and decreased earnings of low-productivity workers.

In a nutshell, as a worker, you can’t get hired unless you have feedback, and you can’t get feedback unless you’ve been hired. This “cold start” problem is one of the key challenges of online labor markets, where there are far fewer signals about a worker’s ability and less common knowledge about what different signals even mean (quick: what’s the MIT of Romania?). I would argue that scalable talent discovery and revelation is the most important applied problem in online labor/crowdsourcing.
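The feedback loop is easy to see in a toy simulation. This is my own illustrative sketch, not the Terviö or Pallais model: employers hire only workers who already have public ratings, except for a treated fraction hired at random up front, loosely in the spirit of Pallais’s experiment.

```python
import random

def simulate(n_workers=1000, n_rounds=20, treat_fraction=0.0, seed=0):
    """Toy cold-start simulation: employers hire only workers who already
    have a public rating, except for a treated fraction hired at random
    up front. Heavily stylized; for illustration only."""
    rng = random.Random(seed)
    ability = [rng.random() for _ in range(n_workers)]   # unobserved talent
    ratings = [[] for _ in range(n_workers)]             # public feedback
    earnings = [0.0] * n_workers

    # Initial round: hire a random subset, revealing their ability publicly.
    treated = rng.sample(range(n_workers), int(treat_fraction * n_workers))
    for w in treated:
        ratings[w].append(ability[w])
        earnings[w] += ability[w]

    # Later rounds: employers hire the best-rated tenth of rated workers.
    for _ in range(n_rounds):
        rated = sorted((w for w in range(n_workers) if ratings[w]),
                       key=lambda w: -sum(ratings[w]) / len(ratings[w]))
        for w in rated[: n_workers // 10]:
            ratings[w].append(ability[w])
            earnings[w] += ability[w]
    return earnings, set(treated)

# With no random hiring of novices, the market never starts: nobody has a
# rating, so nobody is ever hired, and total earnings are zero.
earnings_no_treat, _ = simulate(treat_fraction=0.0)
# With some random hiring, treated workers accumulate ratings and earnings,
# while untreated workers remain locked out entirely.
earnings_treat, treated = simulate(treat_fraction=0.25)
```

The stark all-or-nothing lockout is an artifact of the toy setup, but it captures the core point: someone has to bear the cost of hiring unrated novices for the market to generate any information at all.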

Although acute in online labor markets, the problem of talent discovery and revelation is no cakewalk in traditional markets either. Not surprisingly, several new start-ups (e.g., smarterer and gild) are focusing on scalable skill assessment, and there is excitement in the tech community about using talent-revealing sites like StackOverflow and GitHub as replacements for traditional resumes. It is not hard to imagine these low-cost tools or their future incarnations being paired with scalable tools for creating human capital, like the automated training programs and courses offered by Udacity, Khan Academy, Codecademy, and MITx. Taken together, they could create a kind of substitute for the combined training/signaling role that traditional higher education plays today.

Like what you read? 
Why not follow me on twitter or subscribe to this blog via RSS?