Data Crawling
You can download a copy of the code described on this page as an IPython notebook or a PDF.
Part 1 - Data crawling
The first challenge in predicting the 2018 House results was to obtain historical data from various public sources. The data we collected are:
- Historical congressional election results for all the districts that existed in 2008
- National unemployment rates
- Presidential job approval
- House seat distribution
- Candidate fundraising
1.1 Historical congressional election results
We obtained the midterm House results by crawling two sources: Wikipedia.org and Ballotpedia.org.
1.1.1 Wikipedia
First, we extracted the list of all 2016 districts from this page: 2016 United States House of Representatives elections
Then, we extracted the congressional election results from each of these district pages.
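The crawler itself is not reproduced on this page. As a rough illustration of the extraction step, the sketch below pulls the rows of an HTML results table using only the Python standard library. The table layout, the column names, and the candidates in `SAMPLE_HTML` are made up for the example, and `TableRowParser` and `parse_results_table` are hypothetical names; the real Wikipedia district pages are less regular than this.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags (such as links) still belongs to the cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def parse_results_table(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample page fragment (not real election data).
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Candidate</th><th>Party</th><th>Votes</th></tr>
  <tr><td>2014</td><td><a href="#">Jane Doe</a></td><td>Democratic</td><td>101,234</td></tr>
  <tr><td>2014</td><td>John Roe</td><td>Republican</td><td>98,765</td></tr>
</table>
"""

rows = parse_results_table(SAMPLE_HTML)
header, records = rows[0], rows[1:]
```

In a real crawl, the HTML string would come from downloading each district page, and each page would likely need its own handling for missing columns or merged cells.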
1.1.2 Ballotpedia
The 2018 House results were not available on Wikipedia, so we had to find another source: Ballotpedia. On this website, we were able to retrieve the historical results by district from 2012 to 2018.
1.2 National unemployment rates
We downloaded the monthly national unemployment rate from 1948 to 2018 from the Bureau of Labor Statistics website.
1.3 Presidential job approval
We were able to scrape the data used by Gallup to build this page.
1.4 House seat distribution
We extracted the number of seats by party and by year from the following Wikipedia page: List of United States House of Representatives elections, 1856–present
1.5 Candidate fundraising
We got the candidate fundraising data from 2009 to 2018 from followthemoney.org.
The candidate names are not formatted in the same way as our data from Wikipedia and Ballotpedia, so we used a fuzzy search algorithm to match them.
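The exact fuzzy search algorithm is not specified here. The sketch below shows one possible approach using `difflib.get_close_matches` from the Python standard library. The `normalize` and `match_candidate` helpers, the assumed "LAST, FIRST" input format, the 0.8 cutoff, and the names in the example are all assumptions made for illustration.

```python
import difflib


def normalize(name):
    """Lowercase a name, drop periods, and turn 'LAST, FIRST' into 'first last'."""
    name = name.strip().lower()
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first.strip()} {last.strip()}"
    return " ".join(name.replace(".", "").split())


def match_candidate(name, known_names, cutoff=0.8):
    """Return the entry of known_names closest to name, or None if none is close enough."""
    by_key = {normalize(known): known for known in known_names}
    hits = difflib.get_close_matches(normalize(name), list(by_key), n=1, cutoff=cutoff)
    return by_key[hits[0]] if hits else None


# Made-up names standing in for the Wikipedia/Ballotpedia candidate list.
known = ["Jane Doe", "John Roe", "Maria Alvarez"]

exact = match_candidate("DOE, JANE", known)       # same name, different format
close = match_candidate("ROE, JOHN Q.", known)    # extra middle initial
missing = match_candidate("ZZZZ, QQQQ", known)    # no plausible match
```

A stricter cutoff reduces false matches at the cost of more unmatched names, which then need to be reviewed by hand.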
Part 2 - Detect and manually fix errors and add missing results
Wikipedia is a great collaborative knowledge base, but it sometimes lacks structure. As a result, in some edge cases the crawler did not extract the data correctly. In other cases, the source data itself was messy: for example, some elections had more than one winner, no winner at all, or duplicate candidates. We wrote tests to detect such errors and then fixed them manually.
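The checks described above could look roughly like the following sketch. The record layout (dictionaries with `year`, `district`, `candidate`, and `won` keys), the `find_errors` name, and the sample rows are assumptions for illustration, not the actual tests.

```python
from collections import Counter, defaultdict


def find_errors(results):
    """Return (election, problem) pairs for elections with suspicious data.

    results: list of dicts with keys year, district, candidate, won.
    """
    # Group candidate rows by election (one election per year and district).
    by_election = defaultdict(list)
    for row in results:
        by_election[(row["year"], row["district"])].append(row)

    errors = []
    for key, rows in sorted(by_election.items()):
        winners = sum(1 for r in rows if r["won"])
        if winners == 0:
            errors.append((key, "no winner"))
        elif winners > 1:
            errors.append((key, "multiple winners"))

        # The same candidate listed twice in one election.
        counts = Counter(r["candidate"] for r in rows)
        for name, n in sorted(counts.items()):
            if n > 1:
                errors.append((key, f"duplicate candidate: {name}"))
    return errors


# Made-up rows that trigger each check once.
sample = [
    {"year": 2012, "district": "XX-1", "candidate": "Jane Doe", "won": True},
    {"year": 2012, "district": "XX-1", "candidate": "John Roe", "won": True},
    {"year": 2014, "district": "XX-1", "candidate": "Jane Doe", "won": False},
    {"year": 2014, "district": "XX-1", "candidate": "John Roe", "won": False},
    {"year": 2016, "district": "XX-1", "candidate": "Jane Doe", "won": True},
    {"year": 2016, "district": "XX-1", "candidate": "Jane Doe", "won": False},
]

problems = find_errors(sample)
```

Each reported pair points to an election that needs to be checked against the source page and corrected by hand.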
Part 3 - Data derivation
With the data at hand, we were able to derive the following new predictors:
- Whether this is a presidential year or not
- Whether the president can stand for re-election
- The year in which an incumbent was first elected
- The number of past victories of a candidate
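As a minimal sketch, three of these predictors can be derived from the cleaned results as shown below. The record layout (`year`, `candidate`, `won`) and the function names are assumptions for illustration. Whether the president can stand for re-election is left out, because it requires a separate table of presidential terms.

```python
from collections import defaultdict


def is_presidential_year(year):
    """US presidential elections fall on years divisible by four."""
    return year % 4 == 0


def add_history_features(results):
    """Return copies of the rows with three derived predictors added.

    results: list of dicts with keys year, candidate, won.
    """
    first_win = {}            # candidate -> year of first victory
    wins = defaultdict(int)   # candidate -> number of victories so far
    enriched = []
    # Go through elections in chronological order so that each row only
    # sees victories from earlier years.
    for row in sorted(results, key=lambda r: r["year"]):
        name = row["candidate"]
        enriched.append({
            **row,
            "presidential_year": is_presidential_year(row["year"]),
            "first_elected": first_win.get(name),
            "past_victories": wins[name],
        })
        # Update the history only after the features have been recorded.
        if row["won"]:
            first_win.setdefault(name, row["year"])
            wins[name] += 1
    return enriched


# Made-up results for two candidates in one district.
sample_results = [
    {"year": 2014, "candidate": "Jane Doe", "won": True},
    {"year": 2012, "candidate": "Jane Doe", "won": True},
    {"year": 2012, "candidate": "John Roe", "won": False},
    {"year": 2016, "candidate": "Jane Doe", "won": False},
]

features = add_history_features(sample_results)
```

Recording the features before updating the history keeps the current election's outcome out of its own predictors, which avoids leaking the target into the inputs.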