You can download a copy of the code described on this page as an IPython notebook or a PDF.

Part 1 - Data crawling

The first challenge in predicting the 2018 House results was obtaining historical data from various public sources. We collected the following data:

Historical congressional election results for all the districts that exist in 2008
National unemployment rates
Presidential job approval
House seat distribution
Candidate fundraising

1.1 Historical congressional election results

We obtained the midterm House results by crawling two sources: wikipedia.org and ballotpedia.org.

1.1.1 Wikipedia

First, we extracted the list of all 2016 districts from this page: 2016 United States House of Representatives elections

Then, we extracted the congressional election results from each of these district pages.

1.1.2 Ballotpedia

The 2018 House results were not available on Wikipedia, so we had to find another source: Ballotpedia. On this website, we were able to retrieve the historical results by district from 2012 to 2018.

1.2 National unemployment rates

We downloaded the monthly national unemployment rate from 1948 to 2018 from the Bureau of Labor Statistics website.

1.3 Presidential job approval

We were able to scrape the data that Gallup uses to build this page.

1.4 House seat distribution

We extracted the number of seats by party and by year from the following Wikipedia page: List of United States House of Representatives elections, 1856–present

1.5 Candidate fundraising

We got the candidate fundraising data from 2009 to 2018 from followthemoney.org.
The candidate names there are not formatted in the same way as in our Wikipedia and Ballotpedia data, so we used a fuzzy search algorithm to match them.
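
As an illustration of this kind of name matching, here is a minimal sketch using Python's standard difflib module. It is not necessarily the algorithm we used: the normalize and match_candidate helpers, the "Last, First" input format, and the 0.8 cutoff are all assumptions made for the example.

```python
import difflib


def normalize(name):
    """Lowercase a name, turn 'Last, First' into 'first last', drop punctuation."""
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    cleaned = "".join(ch for ch in name.lower() if ch.isalpha() or ch.isspace())
    return " ".join(cleaned.split())


def match_candidate(name, known_names, cutoff=0.8):
    """Return the closest entry of known_names, or None if nothing clears the cutoff.

    The cutoff is a hypothetical threshold; it would need tuning on real data.
    """
    by_normalized = {normalize(known): known for known in known_names}
    hits = difflib.get_close_matches(
        normalize(name), list(by_normalized), n=1, cutoff=cutoff
    )
    return by_normalized[hits[0]] if hits else None
```

For example, match_candidate("PELOSI, NANCY", ["Nancy Pelosi", "Paul Ryan"]) returns "Nancy Pelosi". Names that find no match above the cutoff come back as None, so they can be reviewed by hand.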

Part 2 - Detect and manually fix errors and add missing results

Wikipedia is a great collaborative knowledge base, but it sometimes lacks structure. As a result, in some edge cases the crawler did not extract the data correctly. In other cases, the source data itself was messy: for example, some elections had more than one winner, no winner at all, or duplicate candidates. We wrote tests to detect such errors and then fixed them manually.
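
The checks described above could look like the following sketch. The row layout (dicts with year, district, candidate and won keys) and the find_result_errors name are assumptions for illustration, not the actual structure of our data.

```python
from collections import Counter, defaultdict


def find_result_errors(rows):
    """Return (year, district, problem) tuples for suspicious races.

    Each row is assumed to be a dict with the keys
    'year', 'district', 'candidate' and 'won' (hypothetical layout).
    """
    # Group the candidate rows by race (one race per year and district).
    races = defaultdict(list)
    for row in rows:
        races[(row["year"], row["district"])].append(row)

    problems = []
    for (year, district), candidates in sorted(races.items()):
        winners = sum(1 for c in candidates if c["won"])
        if winners == 0:
            problems.append((year, district, "no winner"))
        elif winners > 1:
            problems.append((year, district, "multiple winners"))

        # A candidate listed more than once in the same race is a duplicate.
        counts = Counter(c["candidate"] for c in candidates)
        for name, n in sorted(counts.items()):
            if n > 1:
                problems.append((year, district, f"duplicate candidate: {name}"))
    return problems
```

An empty list means that no race tripped any of the three checks. Anything else is a list of races to inspect and fix by hand.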

Part 3 - Data derivation

With the data at hand, we were able to derive the following new predictors:

Whether this is a presidential year or not
Whether the president can stand for re-election
The year in which an incumbent was first elected
The number of past victories of a candidate
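
The predictors listed above can be sketched as follows. The row layout (dicts with year, candidate and won keys) and all function names are assumptions for illustration. The sketch also identifies candidates by name only, and the re-election rule is a simplification of the two-term limit that ignores partial terms.

```python
def is_presidential_year(year):
    """US presidential elections fall on years divisible by four."""
    return year % 4 == 0


def president_can_run_again(elected_terms):
    """Simplified two-term limit: True if the president has won fewer than two terms."""
    return elected_terms < 2


def add_candidate_history(rows):
    """Return copies of rows, in chronological order, with derived predictors.

    'first_elected' and 'past_victories' only count elections held
    before the row's own year, so they do not leak the current result.
    """
    first_win = {}  # candidate name -> year of first victory
    wins = {}       # candidate name -> number of victories so far
    enriched = []
    for row in sorted(rows, key=lambda r: r["year"]):
        name = row["candidate"]
        enriched.append({
            **row,
            "is_presidential_year": is_presidential_year(row["year"]),
            "first_elected": first_win.get(name),
            "past_victories": wins.get(name, 0),
        })
        # Update the history only after the row has been emitted.
        if row["won"]:
            first_win.setdefault(name, row["year"])
            wins[name] = wins.get(name, 0) + 1
    return enriched
```

Processing the rows in chronological order and updating the history after each row is what keeps the current election's outcome out of its own predictors.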