Empirics in the very long-run

Much of my empirical work uses proprietary data from firms. I’m fully aware of the problem this creates for science—my work is, by and large, not (easily) reproducible. There are some things I do to try to enhance the credibility of my work despite this limitation, which I’ll save for another blog post. But I have an idea that I want to try, and that I think other similarly situated researchers should try: ask the data provider to agree to some release date in the future—potentially far in the future.

Researcher: Can I release this data next year?

Lawyer: No way.

Researcher: How about 5 years?

Lawyer: No.

Researcher: How about 15?

Lawyer: Hmm (thinks they might be around in 15 years). Still no.

Researcher: How about 40 years from now?

Lawyer: (50-year-old lawyer contemplates own mortality) Uh, sure, OK.

Given my age, I’m not likely to be the one to work with this 40-year-old data, but I’m pretty sure there will be empirical economists in 40 years who might like to revisit some aspect of the data I’ve worked with, hopefully with much more sophisticated methods and theories.

How could you actually make this happen? The first part is picking a storage medium that can stand the test of time, which is challenging in itself. I’ve heard good things about M-DISC, and I’ve been burning 100 GB backups with a burner I bought. The discs supposedly can last for 1,000+ years.
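As a sketch of what the archiving step could look like on Linux: write a checksum manifest alongside the files, so whoever reads the disc decades from now can verify nothing rotted, then burn with `growisofs` (from the dvd+rw-tools package). The `archive/` layout and `/dev/sr0` device path are assumptions, and the burn command is commented out since it needs real hardware.

```shell
#!/bin/sh
set -e

# Hypothetical archive layout; put your real code and data here.
mkdir -p archive
echo "example data" > archive/data.csv

# Record checksums of everything (excluding the manifest itself),
# so the disc's contents can be verified after a long shelf life.
(cd archive && find . -type f ! -name MANIFEST.sha256 -exec sha256sum {} + > MANIFEST.sha256)

# Verify now; the same command works after re-reading the disc later.
(cd archive && sha256sum -c MANIFEST.sha256)

# Burn to the disc in the drive (assumed device path /dev/sr0):
# growisofs -Z /dev/sr0 -R -J -V RESEARCH_ARCHIVE archive/
```

Re-running the `sha256sum -c` line against the mounted disc is the whole verification story; if it prints `OK` for every file in 2065, the medium held up.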

The second part is the software and data. It’s too much of a burden to rewrite all your code in some future-proof language, and picking winners would be hard anyway. I think the most promising approach is to use whatever you normally use, but save not just your code and data but all the dependencies for the whole OS. In other words, use something like a Docker image of the code and data needed to produce your paper, with a script that orchestrates the production of the paper from raw data to final PDF. I feel pretty confident that 50 years from now, some variant of Linux will still be in use that can run it, or can easily run a virtual environment that mimics what we use today.
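A minimal sketch of such an orchestration script, of the kind you might ship inside the image. The file names and the `awk` “analysis” are placeholders standing in for your real cleaning, estimation, and typesetting code; the typesetting step is commented out since `pdflatex` may not exist on a given machine.

```shell
#!/bin/sh
# run.sh — one command that rebuilds the paper from raw data.
set -e

# 1. Raw data -> cleaned data (placeholder cleaning step: drop nonpositive values).
mkdir -p build
printf 'id,value\n1,10\n2,20\n' > build/raw.csv
awk -F, 'NR==1 || $2 > 0' build/raw.csv > build/clean.csv

# 2. Analysis: compute a summary statistic (stand-in for your estimation code).
awk -F, 'NR>1 {s+=$2; n++} END {print "mean:", s/n}' build/clean.csv > build/results.txt

# 3. Typeset the paper (needs your project's paper.tex and a TeX install).
# pdflatex -output-directory build paper.tex
```

To freeze the environment around it, a short Dockerfile (e.g., `FROM ubuntu:24.04`, `COPY . /proj`, `CMD ["/proj/run.sh"]`) plus `docker save -o paper-image.tar <image-name>` turns the whole OS, code, and data into a single tarball you can burn to the disc alongside the raw files.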

In addition to the technical issue, there is also a social one: how do you get your data out there in 50 or 100 years? If the timeline is long but not too long, I think adding a section to your will with instructions to your executor could be sufficient. I’ve been working on a “research will” describing how I’d like my various projects wrapped up if I were to pass suddenly. If there were some time-related data releases, you could add sections to your will that assign the data release to a young colleague or to some institution likely to persist (e.g., the head of the department at your university). If the timeline is really long (say 100+ years), I don’t have a great answer, but perhaps a university library would be willing to take on this role. I’d be curious to hear other ideas on how to make sure your “time capsule” gets opened.