Data openness by private firms

The New York Times has a story today about social scientists working with company data and being unable or unwilling to make it public. The story begins:

When scientists publish their research, they also make the underlying data available so the results can be verified by other scientists.

I think the first sentence is probably more a description of how we’d like the world to be than how it actually is right now, especially in the social sciences. The main so-what of the story is that private companies are collecting enormous amounts of high-quality data that let you do fascinating social science, but companies are understandably reluctant to make these data public, primarily for privacy reasons (and probably also because they are afraid of giving up some competitive advantage).

I think the options for any organization that does or might do research are:

1) Do research for business purposes. Make neither the findings nor the data public.
2) Do research for business purposes. Make the findings but not the full data public.
3) Do research for business purposes. Make the findings and data public.  
4) Do research. Make findings and data public.

Most companies probably aren’t interested in (4) and this is probably academia’s biggest comparative advantage. Barring (4), I think from a social perspective, privacy issues aside, the best outcomes in order are (3) > (2) > (1).   I can understand (1) in some cases, but at least in the kind of companies I’m familiar with, the advantages of keeping everything secret probably aren’t that great. 
The advantages of (2) or (3) over (1):  
a) If you’re a software company and you release a feature that works, it will probably get copied anyway, regardless of whether you publish a paper, so you might as well get the thought leadership credit for coming up with the idea in the first place. This paper is/was the basis for Google’s secret sauce. Posting it to the InfoLab servers back in 1999 didn’t doom the company and probably did a lot to strengthen the perception that they were doing something smarter (even though there were antecedents of this idea going back many years, including in Economics, by my academic grandfather).
b) If you give outside academics access and let them publish, you can get them to work on your problems for free (the Netflix prize is an obvious example). You can recruit those academics to come work for you, or at least get their grad students to come work for you.
c) If you let your internal researchers publish, you can get them to work at reduced cost or get researchers you otherwise wouldn’t be able to attract (see Scott Stern’s paper on scientists “paying” to do science).

On (2) versus (3), I think there is a real dilemma: openness and privacy concerns are in tension. Furthermore, just releasing more aggregated or somehow obfuscated versions of the data is not risk free: there’s actually an emerging literature in Computer Science on how to release data in ways that are guaranteed to still have the right privacy properties (UPenn professor Aaron Roth recently taught a course on the topic). The fact that smart people are working on it is exciting, since they might figure out provably risk-free ways to release data publicly, but it’s also evidence that this isn’t a trivially easy problem: seemingly innocuous data disclosures could let someone unravel the obfuscation.

As a coda, I have a personal anecdote to share about this story. One of the people discussed in the article is Bernardo Huberman: 

The chairman of the conference panel — Bernardo A. Huberman, a physicist who directs the social computing group at HP Labs here — responded angrily. In the future, he said, the conference should not accept papers from authors who did not make their data public. He was greeted by applause from the audience. 

When I was a grad student, I taught a course to Harvard sophomore economics majors called “Online Labor” (syllabus).  I assigned some of Huberman’s papers on motivation. I emailed him to ask for the data from one of his papers. He wrote back: 

Dear Dr. Horton:
Thank you for your interest in my work and I certainly feel pleased when I learn that you liked my paper enough to assign it to your class.
As to your request, let me talk with the person who now handles the youtube data (we lately used it to uncover the persistence paradox) and I’ll get back to you.
Incidentally if you are interested in the role that attention and status (its marker) play among people I could send you a paper that reports on a experiment (as opposed to observational data) that elucidates it quite cleanly across cultures.

I got the data within days—I can state that he privately practices what he preaches publicly.

Update: I incorrectly stated that Aaron Roth was a professor at CMU—he did his PhD at CMU. He’s a professor at UPenn. Apologies.

All public government data should be easily machine readable

The Bureau of Labor Statistics (BLS) has an annual budget of over $640 million (FY 2011), a budget they use to create and then distribute detailed labor market data and analysis to policy makers, researchers, journalists and the general public. I can’t speak to the “creation” part of their mission, but on the “distribution” part, they are failing: organizations with tiny fractions of their resources do a far better job.

It’s not the case that government IT is invariably bad—the Federal Reserve Bank of St. Louis has an amazing interface (FRED) and API for working with their data. Unfortunately, not all government statistics are available here, especially some of the more interesting BLS series.
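To give a sense of what working with that API looks like, here is a minimal Python sketch. It assumes the `series/observations` endpoint and JSON layout as I understand them from FRED's public documentation. The helper names are mine, `YOUR_KEY` is a placeholder for a registered API key, and the sample payload is made up rather than real FRED output.

```python
import json
from urllib.parse import urlencode

# Endpoint as I understand it from the FRED API docs (treat as an assumption).
FRED_OBSERVATIONS = "https://api.stlouisfed.org/fred/series/observations"


def observations_url(series_id, api_key):
    """Build the request URL for one series, asking for JSON output."""
    params = {"series_id": series_id, "api_key": api_key, "file_type": "json"}
    return FRED_OBSERVATIONS + "?" + urlencode(params)


def parse_observations(payload):
    """Turn a JSON response body into (date, value) pairs."""
    data = json.loads(payload)
    pairs = []
    for obs in data["observations"]:
        if obs["value"] == ".":  # assumed marker for a missing observation
            continue
        pairs.append((obs["date"], float(obs["value"])))
    return pairs


# Made-up payload in the assumed response shape, for illustration only.
SAMPLE_PAYLOAD = json.dumps({
    "observations": [
        {"date": "2000-01-01", "value": "1.5"},
        {"date": "2000-02-01", "value": "."},
        {"date": "2000-03-01", "value": "2.0"},
    ]
})

if __name__ == "__main__":
    print(observations_url("UNRATE", "YOUR_KEY"))
    print(parse_observations(SAMPLE_PAYLOAD))
```

The point is less the specific calls than the shape of the workflow: one URL per series, one structured response, no hand-selection and no spreadsheet cleanup.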

The essential problem with BLS is that all of their work products—reports, tables etc.—are designed to be printed out, not accessed electronically. Many BLS tables are embedded in PDFs, which makes the data they contain essentially impossible to extract; non-PDF, text-based tables, which are better, are difficult to parse electronically: structure is conveyed by tabs and white space, column headings are split over multiple lines with no separators; heading lengths vary etc.
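A small Python sketch shows the kind of work this forces on the reader. The table text below is hypothetical, with made-up numbers in the style of a BLS text table, and the function names are my own. It assumes that columns are separated by at least two spaces and that the multi-line heading is simply skipped and replaced with hand-written column names.

```python
import csv
import io
import re

# Hypothetical table with made-up numbers: a two-line heading, columns
# aligned with spaces, and no explicit delimiters.
SAMPLE = """\
                      Employment    Change,
Industry                    2010    2010-20
Manufacturing             11,524        -73
Durable goods              7,067        208
"""


def parse_table(text, header_lines=2):
    """Return data rows as [label, int, int, ...], skipping the heading lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    rows = []
    for ln in lines[header_lines:]:
        # Two or more spaces mark a column boundary; a single space stays
        # inside a label such as "Durable goods".
        cells = re.split(r"\s{2,}", ln.strip())
        label, values = cells[0], cells[1:]
        rows.append([label] + [int(v.replace(",", "")) for v in values])
    return rows


def to_csv(rows, columns):
    """Serialize the rows to CSV text under hand-supplied column names."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(columns)
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    rows = parse_table(SAMPLE)
    print(to_csv(rows, ["industry", "employment_2010", "change_2010_20"]))
```

Even this toy version depends on guesses (how many heading lines there are, how wide the gaps between columns are) that a machine-readable format would make unnecessary.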

Why does it matter? For one, when users can access data electronically, via an API, they can combine it with other sources, look for patterns, test hypotheses, find bugs / measurement errors, create visualizations and do all sorts of other things that make the data more useful.

BLS does offer a GUI tool for downloading data, but it’s kludgy, requires a Java Applet, requires series to be hand-selected and then returns an Excel(!) spreadsheet w/ extraneous headers and formatting. Furthermore, it’s not clear which series, and which transformations of the GUI data, are needed to reproduce the more refined, aggregated tables.

To illustrate how hard it is to get the data out, I wrote a Python script to extract the results from this table (which shows the expected and estimated changes in employment for a number of industries). What I wanted to do was make this, which I think is far easier to understand than the table alone:

To actually create this figure, I needed to get the data into R by way of a CSV file. The code required to get the table data into a useful CSV file, while not rocket science, isn’t trivial: there are lots of one-off/hacky things to work around the limitations of the table. Getting the nested structure of the industries (e.g., “Durable Goods” is a subset of “Manufacturing” and “Durable Goods” has 4 sub-classifications) required recursion (see the “bread_crumb” function). FWIW, here’s the code:
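What follows is not the original script, only a minimal sketch of one way the nesting could be recovered recursively. It borrows the name `bread_crumb` from the text, but the body is a guess. It assumes that nesting depth is encoded by leading spaces, and the industry labels are hypothetical.

```python
# Hypothetical indented labels; leading spaces encode the nesting depth.
LINES = [
    "Manufacturing",
    "  Durable goods",
    "    Wood products",
    "    Machinery",
    "  Nondurable goods",
    "Construction",
]


def indent_of(line):
    """Number of leading spaces on a line."""
    return len(line) - len(line.lstrip(" "))


def find_parent(lines, i):
    """Index of the nearest earlier line with a smaller indent, or None."""
    for j in range(i - 1, -1, -1):
        if indent_of(lines[j]) < indent_of(lines[i]):
            return j
    return None


def bread_crumb(lines, i):
    """Recursively build the path of labels from the top level down to line i."""
    label = lines[i].strip()
    parent = find_parent(lines, i)
    if parent is None:  # top-level industry: the path is just the label
        return [label]
    return bread_crumb(lines, parent) + [label]


if __name__ == "__main__":
    for i in range(len(LINES)):
        print(" > ".join(bread_crumb(LINES, i)))
```

Each row then carries its full path (for example "Manufacturing > Durable goods > Wood products"), which is easy to split into columns once the data is in R.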

Most of the code deals with the problems shown in this sketch:

My suggestion: BLS should borrow someone from FRED and help them create a proper API.