Governments can do things at scale, especially things that tie into our shared understanding of life and society. For example, any entity with resources and a wide theater of activity can collect large and important data sets — but it’s generally only the government that releases the data for the rest of us to see, analyze, and build on. In a world where increases in inequality run parallel to increases in technical capabilities, open government data can be an invaluable resource.
The data set that I’ve used most often in the past few years comes from the National Science Foundation: the Survey of Earned Doctorates. Every US institution of higher education that produces PhDs, EdDs, or some other doctoral degree is represented in this data. As their site explains,
The Survey of Earned Doctorates (SED) is an annual census conducted since 1957 of all individuals receiving a research doctorate from an accredited U.S. institution in a given academic year. The SED is sponsored by the National Center for Science and Engineering Statistics (NCSES) within the National Science Foundation (NSF) and by five other federal agencies: the National Institutes of Health, U.S. Department of Education, U.S. Department of Agriculture, National Endowment for the Humanities, and National Aeronautics and Space Administration. The SED collects information on the doctoral recipient's educational history, demographic characteristics, and postgraduation plans. Results are used to assess characteristics of the doctoral population and trends in doctoral education and degrees.
There is a qualitative side, drawn from an actual survey that one fills out upon completion of the doctorate. There is also a wide variety of quantitative data, including the Academic Institution Profiles: rich single-institution data on
- Number of earned doctorates
- Number of full-time graduate students
- Total federal obligations
- Total R&D expenditures
While this is all interesting data, for my purposes the first category is the best: how many doctorates each institution produces in a calendar year, subdivided into numerous disciplinary categories.
While it is possible to download aggregated data summarizing all institutions up to the most recent published years, data for single institutions is published on individual web pages. The release cycle for this data has a two-year delay: 2013 data was released in 2015, 2014 data in 2016, and so on. Nevertheless, for my work in higher education services and research, this level of granularity into doctoral production trends at US universities, colleges, and other research institutions is invaluable.
Thus I was excited to receive a notification that the individual institution data had been updated. After reviewing the structure of the website holding the data, refactoring a few parts of the source code I’d written to download the several hundred spreadsheets covering every institution in the new release, and then carrying out the download itself, I had added another year to my running archive of this data.
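For readers curious what that download step looks like in practice, here is a minimal sketch in Python. The archive directory, URL list, and file naming are hypothetical stand-ins, not the actual structure of the SED site or of my own scripts.

```python
# Minimal sketch: fetch each institution's spreadsheet and save it locally.
# The URL list and directory layout here are hypothetical.
import os
import time
import requests

ARCHIVE_DIR = "sed_archive/2018_release"  # hypothetical local archive path
os.makedirs(ARCHIVE_DIR, exist_ok=True)

def download_profiles(spreadsheet_urls):
    """Fetch each institution's spreadsheet and save it under ARCHIVE_DIR."""
    for url in spreadsheet_urls:
        filename = os.path.join(ARCHIVE_DIR, url.rsplit("/", 1)[-1])
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        with open(filename, "wb") as f:
            f.write(response.content)
        time.sleep(1)  # be polite to the server between requests
```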
As I’ve written elsewhere, after relying on this data for several years as a source of facts about VPhD’s slice of the higher education services market, I’d become very familiar with a lot of its features and the things they told me about the state of doctoral education in the United States. One of these was what I came to think of as the upper boundary of doctoral production.
In the past decade, the institutions that produced the greatest number of doctorates every year posted very high triple-digit totals, but never more than a thousand. The University of California at Berkeley, the University of Michigan at Ann Arbor, and the University of Wisconsin-Madison reliably graduated 700, 800, and occasionally even 900 doctorates in a year. As at every level of higher education, production at each also trended upward. Thus when I started spot-checking the downloads I was expecting to see that trend continue.
I was surprised to find that it did not. In fact, in the most recent year the data indicated that these universities had not only produced in excess of a thousand doctorates, but that the totals appeared to be around twice what I would otherwise have expected!
In addition to the spreadsheets, the NSF also presents each institution’s data in tables on the main SED site. Imagining that I might somehow have introduced the anomalous data that I was seeing in the downloads, I checked the web pages with data from those leading institutions. I found the same figures.
By this point, I was beginning to think that there had been some error in the preparation or publication of the data: the doubling appeared not only in the newly released year but also in the nine prior years that made up the remaining columns of every individual institutional spreadsheet and table I checked.
I was also a little distraught. This resource, which I had relied on for years, and which I knew better than any other set of external data, was now broken! I found myself wondering whether others who relied on it would have noticed, and—if not—whether anyone was making decisions on the basis of the data in its current state. Was there a way to fix it? How could I alert the NSF, or the National Center for Science and Engineering Statistics, who managed this particular data set? Should I make the drive out to Northern Virginia to the NSF offices? Or tweet at them?
Setting aside that question, I knew that I needed to push my analysis a little further. Did the data have any other features that could make my claims more compelling, or suggest anything more about how the error came to be?
Why, yes. It did.
Every number in every cell in every institution’s data was even.
For example, examine this table for Virginia Tech:
Not a single odd number in any category in any year. Every other institution I checked showed the same pattern.
This is not exactly a Benford’s Law situation, but it felt like a distant, low-grade relation. A 28 × 10 matrix representing real-world phenomena where every number is even? Implausible, at best. To have that same pattern repeated across several hundred distinct entities? Impossible.
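To make that check concrete, here is a rough sketch of how one might verify the all-even pattern programmatically, assuming each institution’s profile has been downloaded as a spreadsheet of counts by field and year. The file handling and the treatment of blank cells are my own assumptions, not the NSF’s format.

```python
# Rough sketch: confirm that every numeric cell in a profile spreadsheet is even.
import pandas as pd

def all_values_even(path):
    """Return True if every numeric cell in one institution's profile is even."""
    df = pd.read_excel(path)                       # one downloaded spreadsheet
    counts = df.select_dtypes("number").fillna(0)  # numeric cells; treat blanks as 0
    return bool(((counts % 2) == 0).all().all())

# For example, flag every spreadsheet in the archive showing the pattern:
# suspicious = [p for p in spreadsheet_paths if all_values_even(p)]
```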
Checking the older years against what I had in my archive from past explorations of the data only confirmed this. The 2015 figures that were released for the first time in 2017 were half of the 2015 figures shown in this newly released 2018 update.
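A comparison along those lines might look like the following sketch, assuming the 2015 column from each release has been pulled into a pandas Series indexed by field of study; the names and structure here are illustrative only.

```python
# Illustrative sketch: line up the archived and newly released 2015 figures.
import pandas as pd

def compare_releases(old_2015: pd.Series, new_2015: pd.Series) -> pd.DataFrame:
    """Show old vs. new 2015 figures and their ratio, field by field."""
    comparison = pd.DataFrame({
        "2017 release": old_2015,
        "2018 release": new_2015,
    })
    comparison["ratio"] = comparison["2018 release"] / comparison["2017 release"]
    return comparison  # a ratio of roughly 2.0 across the board confirms the doubling
```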
My personal experience as a consumer of government services and as a contractor for and consultant to government entities has shown me that the people who do the day-to-day work of making it all run can be as frustrated as the rest of us about the slow pace of change. It has also shown me that they are more often than not looking for ways to create change, small and large. And they love it when everyday citizens step up and point out where things could be better.
In the end, I decided it would probably be easiest and best to find the contact information of somebody on the NCSES team who seemed like they would be close to or aware of the Academic Institution Profile data, and just call them up. On a Friday morning, I did it, and left a voicemail: “Hi, I’m an independent higher education researcher. I use the Survey of Earned Doctorates and related data sets a lot. I’ve discovered what I think is a processing error in one of those data sets. I’d like to share my findings with somebody in your office.”
In the best case, I thought it might take a few days to hear back. But to my surprise, my phone rang that afternoon. It was the program officer for whom I had left the voicemail. He said, “I heard your message, and although I wasn’t really sure what it was going to amount to, I decided to call you back.”
In a few minutes, we were looking at the same web page of institutional data, and I was sharing my knowledge of the data and my observations about the current update. When I pointed out the universally even numbers, he agreed that there had to have been some error in the release. He promised to alert the team in charge of the data so they could review the issue and correct it. He also promised to follow up with me when the corrections had been made. We talked a bit more about working with data, and then rang off.
Less than a week later, I checked my voicemail to find a message from him, letting me know that the data had been re-run and was now published in its correct state.
I was pleased that I'd been able to make a small but meaningful contribution to a national data set. I was also glad to have had a chance to give this data team at the NCSES a positive and meaningful interaction with an average citizen who cared enough about what they were doing to try to help.
A few days later, the NSF published some findings from the most recent Survey of Earned Doctorates, including “Number of doctorates awarded by US institutions in 2016 close to all-time high.” The full report on the 2016 data is worth a read.