Category Archives: statistics

Would a job by any other name pay as much?

I’m working on a project where it would be useful to know what an oDesk job is likely to pay at the time it is posted. Although there are plenty of structured predictors available (e.g., the category, skills, estimated duration, etc.), presumably the job description and the job title contain lots of wage-relevant information. The title in particular is likely to identify the main skill needed, the task to be done, and perhaps the quality of the person the would-be employer is looking for (e.g., “beginner” or “senior”).

Unfortunately, I haven’t done any natural language processing before, so I’m a bit out of my element. However, there are good tutorials online as well as R packages that can guide you through the rough parts. I thought writing up my explorations might be useful to others who want to get started with this approach. The gist of the code I wrote is available here.

What I did:

1) I took 20K recent hourly oDesk jobs where the freelancer worked at least 5 hours. I calculated the log wage over the course of each contract. Incidentally, oDesk wages, like real wages, are pretty well approximated by a normal distribution.
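
As a minimal sketch of that step (the data frame `jobs` and its columns `total_charge` and `hours_worked` are hypothetical names, not the actual oDesk schema):

```r
# Sketch of step 1, assuming a data frame `jobs` with one row per contract
# and hypothetical columns `total_charge` and `hours_worked`.
jobs <- subset(jobs, hours_worked >= 5)     # contracts with at least 5 hours
jobs$log_wage <- log(jobs$total_charge / jobs$hours_worked)

# Eyeball how close the log wages are to a normal distribution.
hist(jobs$log_wage, breaks = 50, freq = FALSE,
     main = "Log hourly wages", xlab = "log(wage)")
curve(dnorm(x, mean(jobs$log_wage), sd(jobs$log_wage)), add = TRUE)
```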
 
2) I used the RTextTools package to create a document-term matrix from the job titles (this is just a matrix of 1s and 0s where the rows are jobs and the columns are relatively frequent words that are not common English words; if the job title contains a given word, it gets a 1, otherwise a 0).
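
Roughly, that step looks like the following; `jobs$title` and the sparsity cutoff are placeholders rather than the exact settings used:

```r
library(RTextTools)

# Sketch of step 2: a document-term matrix from the job titles.
# `jobs$title` is an assumed column name; the sparsity cutoff is a placeholder.
dtm <- create_matrix(jobs$title,
                     language          = "english",
                     removeStopwords   = TRUE,   # drop common English words
                     removeNumbers     = TRUE,
                     stemWords         = TRUE,
                     removeSparseTerms = 0.999)  # keep relatively frequent words

# Convert to a plain 0/1 matrix: 1 if the title contains the word, else 0.
X <- 1 * (as.matrix(dtm) > 0)
```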

3) I fit a linear model using the lasso for regularization (via the glmnet package), with cross-validation to select the best lambda. A linear model probably isn’t ideal for this, but at least it gives nicely interpretable coefficients.
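
Continuing the sketch with the matrix `X` from above (the 10K train/test split mirrors the training-sample size mentioned below; the seed is arbitrary):

```r
library(glmnet)

# Sketch of step 3: lasso-regularized linear model of log wage on title words.
set.seed(1)
train <- sample(nrow(X), 10000)

# alpha = 1 is the lasso; cv.glmnet picks lambda by cross-validation.
cv_fit <- cv.glmnet(X[train, ], jobs$log_wage[train], alpha = 1)

# Words whose coefficients were not zeroed out by the lasso,
# ordered by magnitude.
beta <- as.matrix(coef(cv_fit, s = "lambda.min"))[, 1]
beta <- beta[-1]                              # drop the intercept
nonzero <- beta[beta != 0]
head(nonzero[order(-abs(nonzero))], 20)
```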

So, how does it do? Here is a sample of the coefficients that didn’t get set to zero by the lasso, ordered by magnitude (point sizes are scaled by the log of the number of times the word appears in the 10K training sample):

The coefficients can be interpreted as % changes from the mean wage in the sample when the corresponding word (or word fragment) is present in the title. Nothing too surprising, I think: at the extremes, SEO is a very low-paying job, whereas developing true applications is high-paying.

In terms of out-of-sample prediction, the R-squared was a little over 0.30. I’ll have to see how much of an improvement can be obtained by using some of the structured data available, but explaining 30% of the variation using just the titles is higher than I would have expected before fitting the model.
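
For reference, the out-of-sample R-squared in the sketch above can be computed from the held-out jobs like so:

```r
# Out-of-sample fit on the held-out half of the jobs
# (continuing from the `train` and `cv_fit` objects above).
test <- setdiff(seq_len(nrow(X)), train)
pred <- as.vector(predict(cv_fit, newx = X[test, ], s = "lambda.min"))

y <- jobs$log_wage[test]
1 - sum((y - pred)^2) / sum((y - mean(y))^2)   # out-of-sample R-squared
```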

Some light data munging with R, with an application to ranking NFL Teams

I recently submitted this blog to R-bloggers, which aggregates R-related blog posts. It’s a fantastic site and has been invaluable to me as I’ve learned R. One of my favorite kinds of articles is the hands-on, “hello world”-style weekend project that dips into a topic/technology, so here’s my first attempt at one in this style.

First, some background: I’ve been working with Greg on a project that analyzes the results of two-person contests. An important part of the problem is comparing different ranking systems that can adjust for the strength of the opponent (e.g., the Elo rating system, TrueSkill, Glicko, etc.). As I understand it, all of these systems work around the intractability of a fully Bayesian treatment of the problem and try to deal with things like trends in ability, the distribution of the unobserved component, etc.

We’re still collecting data from a pilot, but in the interim, I wanted to start getting my feet wet with some real competition data. Sports statistics provide a readily available source, so my plan was:

  1. Pull some data on NFL games from the 2011 season to date.
  2. Fit a simple model that produces a rank ordering of teams.
  3. Pull data on ESPN’s Power Rankings of NFL teams (based on votes by their columnists), using the XML package (see the sketch after this list).
  4. Make a comparison plot, showing how the two ranks compare, using ggplot2.
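
For step 3, a minimal sketch with the XML package might look like this; the URL is a placeholder for ESPN’s power-rankings page, and the position of the rankings table in the returned list has to be checked by hand:

```r
library(XML)

# Sketch of step 3: scrape ESPN's power rankings.
# The URL is a placeholder; point it at the current power-rankings page.
url    <- "http://espn.go.com/nfl/powerrankings"
tables <- readHTMLTable(url, stringsAsFactors = FALSE)

# readHTMLTable returns one data frame per <table> on the page; inspect the
# list and keep whichever one holds the rankings (position and column names
# will need checking by hand).
str(tables, max.level = 1)
espn <- tables[[1]]
```
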
For the model, I wanted something really simple (hoping no one from Football Outsiders is reading this). In my model, the difference in scores between the two teams is simply the difference in their “abilities,” plus an error term:

\Delta S_{ij} = \alpha^H_i - \alpha^A_j + \epsilon


where the alphas are team-and-venue-specific (i.e., home or away) random effects. For our actual rating, we can order teams by the sum of their estimated home and away effects, i.e.:

\hat{\alpha}_i^H + \hat{\alpha}_i^A

Estimating the 32 x 2 parameters separately, given how little data we actually have, would probably lead to poor results. Instead, I used the excellent lme4 package, which approximates a Bayesian estimation in which we start with a prior that the alpha parameters are normally distributed, shrinking the noisier team estimates toward the overall mean.
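
A minimal version of that fit might look like the following; the data frame `games` and its column names are assumptions, and the intercept simply absorbs the average home-field advantage:

```r
library(lme4)

# Sketch of the model above, assuming a data frame `games` with one row per
# game and hypothetical columns: home, away (team names) and score_diff
# (home score minus away score).
fit <- lmer(score_diff ~ 1 + (1 | home) + (1 | away), data = games)

# ranef() returns the estimated team-and-venue effects. Because the away
# effect enters the regression with the opposite sign of alpha^A in the
# equation above, a team's overall rating is its home intercept minus its
# away intercept.
re_home <- ranef(fit)$home
re_away <- ranef(fit)$away
teams   <- rownames(re_home)

rating <- data.frame(team   = teams,
                     rating = re_home[teams, "(Intercept)"] -
                              re_away[teams, "(Intercept)"])
rating <- rating[order(-rating$rating), ]   # best team first
head(rating)
```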

Putting the last thing first, here’s the result of step 4, comparing my “homebrew” ranking to the ESPN ranking, as of Week 5 (before the October 9th games):

No real comment on my model, other than that it (a) thinks ESPN vastly overrates the Chargers and (b) thinks more highly of the Ravens than ESPN does.
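
The comparison plot in step 4 can be sketched with ggplot2 roughly as follows, assuming the `rating` table from the lme4 sketch above and an `espn` data frame boiled down to `team` and `espn_rank` columns (team names must match across the two sources):

```r
library(ggplot2)

# Sketch of step 4: compare the homebrew ranking with ESPN's.
rating$homebrew_rank <- rank(-rating$rating)   # 1 = best homebrew team
both <- merge(rating, espn, by = "team")

ggplot(both, aes(x = espn_rank, y = homebrew_rank, label = team)) +
  geom_abline(intercept = 0, slope = 1, colour = "grey60") +
  geom_text(size = 3) +
  xlab("ESPN power ranking") +
  ylab("Homebrew (lme4) ranking")
```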

The code for all the steps is posted below, with explanatory comments:

All public government data should be easily machine readable

The Bureau of Labor Statistics (BLS) has an annual budget of over $640 million (FY 2011), a budget they use to create and then distribute detailed labor market data and analysis to policy makers, researchers, journalists and the general public. I can’t speak to the “creation” part of their mission, but on the “distribution” part, they are failing: organizations with tiny fractions of their resources do a far better job.

It’s not the case that government IT is invariably bad: the Federal Reserve Bank of St. Louis has an amazing interface (FRED) and API for working with their data. Unfortunately, not all government statistics are available there, especially some of the more interesting BLS series.
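
As a small illustration of how pleasant FRED is to work with from R, one well-known route is quantmod’s FRED source (UNRATE is FRED’s civilian unemployment rate series):

```r
library(quantmod)

# Pull a FRED series straight into R: getSymbols() with src = "FRED"
# creates an xts object named UNRATE in the workspace.
getSymbols("UNRATE", src = "FRED")
head(UNRATE)
plot(UNRATE, main = "US civilian unemployment rate (FRED: UNRATE)")
```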


The essential problem with BLS is that all of their work products (reports, tables, etc.) are designed to be printed out, not accessed electronically. Many BLS tables are embedded in PDFs, which makes the data they contain essentially impossible to extract. The non-PDF, text-based tables are better, but they are still difficult to parse electronically: structure is conveyed by tabs and white space, column headings are split over multiple lines with no separators, heading lengths vary, and so on.

Why does it matter? For one, when users can access data electronically, via an API, they can combine it with other sources, look for patterns, test hypotheses, find bugs and measurement errors, create visualizations and do all sorts of other things that make the data more useful.

BLS does offer a GUI tool for downloading data, but it’s kludgy: it requires a Java applet, requires series to be hand-selected and then returns an Excel(!) spreadsheet with extraneous headers and formatting. Furthermore, it’s not clear which series and which transformations are needed to go from the GUI data to the more refined, aggregated tables.

To illustrate how hard it is to get the data out, I wrote a Python script to extract the results from this table (which shows the expected and estimated changes in employment for a number of industries). What I wanted to do was make this figure, which I think is far easier to understand than the table alone:

To actually create the figure, I needed to get the data into R by way of a CSV file. The code required to get the table data into a useful CSV file, while not rocket science, isn’t trivial: there are lots of one-off, hacky things needed to work around the limitations of the table. Getting the nested structure of the industries (e.g., “Durable Goods” is a subset of “Manufacturing,” and “Durable Goods” has 4 sub-classifications) required recursion (see the “bread_crumb” function). FWIW, here’s the code:

Most of the code is dealing with the problems shown in this sketch:

My suggestion: BLS should borrow someone from FRED to help them create a proper API.