So you’ve read Stephen Ruggles, and now you want to engage in some quantitative history of your own. Here are the basics of data analysis.
Sources of data
There are a few general-purpose websites that act as central repositories for data. They are generally affiliated with disciplines or government bureaus. If you are interested in education history since 1960, say, there is an incredible amount of information on the Department of Education’s site; if you are interested in
- The Inter-university Consortium for Political and Social Research has an enormous quantity of data related to historical political behavior, along with American and European demographic surveys.
- IPUMS, the Integrated Public Use Microdata Series, contains an incredible collection of census data from many countries, particularly in the North Atlantic.
Other sources, often the ones of the greatest historical interest, may only exist as a single file somewhere.
- The Trans-Atlantic Slave Trade Database contains information about slave voyages.
- A great deal of library data, both about books and including information about their language, is available from a number of different sources, including the Bookworm project I work on; see me if you want some help creating an extract of some sort.
- Along similar lines: Data for Research, from JSTOR, allows you to export both wordcount and metadata information about journal articles, which can be useful for the history of scholarship.
- The IMDB contains far more information about the history of movies and television than cultural historians have used.
- Information about 19th-century whaling ship crew members in Massachusetts is available from the Whaling Crew List Database of the New Bedford Whaling Museum.
Create your own
Sometimes the data you want exists, but not in a standard digital form. If you know where to find the precise numbers for something, but they aren’t in a standard format like a CSV, let me know. Maybe your love of regular expressions can carry you through to a clean copy.
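As a rough sketch of what "regular expressions into a clean copy" can look like, here is a small Python example. The sample lines, the pattern, and the column names are all invented for illustration; a real source will need its own pattern, and probably several rounds of checking what got skipped.

```python
import csv
import io
import re

# Invented sample lines standing in for a transcribed printed table.
RAW = """\
Smith, John ........ 160 acres
Jones, Mary A. ..... 80 acres
(illegible)
O'Brien, Patrick ... 1,240 acres
"""

# Surname, given names, a run of dot leaders, then a number that may
# contain thousands separators, followed by the word "acres".
LINE = re.compile(
    r"^(?P<surname>[^,]+),\s*(?P<given>.+?)\s*\.{2,}\s*(?P<acres>[\d,]+)\s+acres$"
)

def parse(text):
    """Return (rows, skipped): parsed records and the lines that did not match."""
    rows, skipped = [], []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m is None:
            skipped.append(line)  # keep these so they can be checked by hand
            continue
        rows.append({
            "surname": m["surname"].strip(),
            "given": m["given"],
            "acres": int(m["acres"].replace(",", "")),
        })
    return rows, skipped

def to_csv(rows):
    """Write the parsed rows as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["surname", "given", "acres"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

rows, skipped = parse(RAW)
print(to_csv(rows))
print("Lines to check by hand:", skipped)
```

Keeping the unmatched lines, rather than silently dropping them, is the important habit here: they are usually where the interesting problems in the source are.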
Note that as with all primary sources, you have to work from what already exists. Saying “I’d really like a dataset that shows the property holdings of every settler in Kansas in the 1880s” is fine, but unless the data was collected by the census, it’s unlikely it will exist in the form that you want.
Tools for analysis
How do you crunch numbers?
The easiest way, and not a bad one, is to use a spreadsheet.
There are two places you may encounter limits to spreadsheets.
They make it extremely hard to share your work. You can share results, but if you have made a special correction along the way, it tends to get lost. This can be catastrophically embarrassing: just ask Reinhart and Rogoff. It also makes it difficult to update your research: if you made a mistake at the beginning, it’s hard to wade through to the end.
They can’t handle extremely large collections of data. You’ll know this one when you encounter it.
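To make those two limits concrete, here is a minimal sketch in Python of what the scripted alternative looks like. The data, the column names, and the correction are all invented for illustration. The point is only that every hand correction is written down in one place with its reason, and that rows are read one at a time rather than loaded all at once.

```python
import csv
import io

# Invented sample data. With a real file you would pass
# open("your_file.csv", newline="") instead of a StringIO.
SAMPLE_TEXT = (
    "year,place,population\n"
    "1880,A,200\n"
    "1880,B,100\n"
    "1890,A,250\n"
    "1890,B,15\n"
)

# Every hand correction is recorded here, keyed by (year, place),
# so it cannot get lost and anyone rerunning the script applies it too.
CORRECTIONS = {
    # Hypothetical example: a digit dropped during transcription.
    ("1890", "B"): {"population": "150"},
}

def corrected_rows(handle):
    """Yield rows one at a time, applying any recorded corrections."""
    for row in csv.DictReader(handle):
        fix = CORRECTIONS.get((row["year"], row["place"]))
        if fix:
            row.update(fix)
        yield row

def total_by_year(handle):
    """Sum population by year without holding the whole file in memory."""
    totals = {}
    for row in corrected_rows(handle):
        totals[row["year"]] = totals.get(row["year"], 0) + int(row["population"])
    return totals

totals = total_by_year(io.StringIO(SAMPLE_TEXT))
print(totals)
```

If you later find a mistake at the beginning, you fix one line and rerun the whole thing, instead of wading through a spreadsheet to the end.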
Scripted tools for data analysis
If you take my “Humanities Data Analysis” class, you will learn the language R. This is the most widely used tool for statistical analysis nowadays, and beats SPSS or Stata or whatever you learned in your intro statistics class. And it’s free. The downside is that it’s basically writing computer code. But that’s essentially true of everything else.
If you know some programming already, there are extremely well developed tools for analysis in the Python language. Hardcore scientists sometimes use languages like Julia or Matlab; people with money to spend on computer
Special online tools
There are a variety of sites online that you can upload your data to in order to perform particular types of analysis, most often making a chart. These suffer many of the same shortcomings as spreadsheets, and you’ll usually use them in addition to rather than instead of the traditional forms of analysis. One particularly good resource is Palladio, from Stanford, which is explicitly designed for “Humanities data visualization:” it’s up to you to decide if there is such a thing.
Cleaning data is hard, and falls below the level of skills appropriate for this course. Regexes are a first and nearly necessary precondition to most data cleaning.
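For a sense of why regexes come first, here is a small sketch in Python of two typical cleaning jobs: pulling a usable year out of messy date strings, and tidying stray brackets and whitespace. The function names and the sample values are invented for illustration, and the year pattern assumes four-digit years between 1500 and 2099.

```python
import re

# A four-digit year from 1500 to 2099, standing alone as a "word".
YEAR = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")

def first_year(value):
    """Return the first plausible year in a string, or None if there is none."""
    m = YEAR.search(value)
    return int(m.group(1)) if m else None

def tidy(value):
    """Strip square brackets, collapse runs of whitespace, trim the ends."""
    without_brackets = re.sub(r"[\[\]]", "", value)
    return re.sub(r"\s+", " ", without_brackets).strip()

# Invented examples of the kind of thing catalog fields contain.
for raw in ["c. 1850", "[1851?]", "1850-51", "n.d."]:
    print(repr(raw), "->", first_year(raw))

print(tidy("  [New   Bedford]  "))
```

Note what this throws away: "c." and "?" record real uncertainty in the source, so a careful version would keep the original string in its own column next to the cleaned year.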
There is one great freely available, interactive tool for data cleaning.