Exploratory Data Analysis
You can download a copy of the code described on this page as an IPython notebook or as a PDF.
We extracted data for 20 variables, which formed the basis for the Exploratory Data Analysis. Altogether we had 9974 observations.
These are the variables we used for our analysis:
- won: Response variable. 1 if the candidate won, 0 otherwise
- district: The congressional district’s name
- is_incumbent: Whether an incumbent is running for re-election
- name: The candidate’s name
- party: The candidate’s party
- percent: The percentage of votes received by the candidate
- state: The name of the state
- votes: The number of votes received by the candidate
- year: The year of the election.
- first_time_elected: The year of the first election won in this district by the candidate (NaN if not applicable).
- count_victories: The number of elections won in this district by the candidate
- unemployement_rate: The unemployment rate of the district (or of the country if the information was not available) for the month before the election.
- is_presidential_year: 1 if there is a presidential election this year, 0 otherwise.
- president_can_be_re_elected: Whether the president can stand for re-election. 1 = Yes, 0 = No
- president_party: The president’s party (R or D)
- president_overall_avg_job_approval: The presidential job approval rating (available from Truman to Trump). Source: Gallup.
- last_D_house_seats: The number of House seats won by Democrats in the last elections
- last_R_house_seats: The number of House seats won by Republicans in the last elections
- last_house_majority: Which party had the majority (R or D)
- fundraising: How much money the candidate raised for the campaign
Our training set consists of all data excluding the year 2018; the test set is restricted to the year 2018 only. Both datasets are well balanced with regard to the values of the response variable:

So we don’t need to resample the data, stratify, or generate any synthetic samples.
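As a minimal sketch of the split and the balance check (the file name and variable names below are placeholders, not our actual code):

```python
import pandas as pd

# Placeholder file name for our prepared dataset
df = pd.read_csv("election_data.csv")

# Training set: every year except 2018; test set: 2018 only
df_train = df[df["year"] != 2018]
df_test = df[df["year"] == 2018]

# Share of winners (won = 1) and losers (won = 0) in each set
print(df_train["won"].value_counts(normalize=True))
print(df_test["won"].value_counts(normalize=True))
```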
Data imputation
The first data quality corrections were already made while extracting the data from the different sources. In some cases the data quality was so poor that we even had to add data manually. During the final preparation of the data we had to deal with the remaining missing values. For most of them the solution was to group the data by state, district and/or year and take the mean. The following table shows where we had missing values and how we corrected them:
Variable | # NaN | Data imputation | # NaN after imputation |
---|---|---|---|
district | 0 | – | 0 |
is_incumbent | 112 | if the grouped sum for state/district/year > 0, assign 0, else 1 | 0 |
name | 0 | – | 0 |
party | 0 | – | 0 |
percent | 15 | calculated with the mean of votes from state, district and year | 0 |
state | 0 | – | 0 |
votes | 58 | replaced with the mean of votes from state and district | 0 |
won (response variable) | 0 | – | 0 |
year | 0 | – | 0 |
first_time_elected | 4445 | take the value from year if won = 1, else 0 | 0 |
count_victories | 0 | – | 0 |
unemployement_rate | 979 | replaced with the mean from state and district | 0 |
is_presidential_year | 102 | set to 0 | 0 |
president_can_be_re_elected | 102 | set to 0 | 0 |
president_party | 102 | set to 0 | 0 |
president_overall_avg_job_approval | 1060 | version 1: replaced with the mean from state and district; version 2: model-based imputation | 0 |
last_D_house_seats | 102 | replaced with the mean from state and district | 0 |
last_R_house_seats | 102 | replaced with the mean from state and district | 0 |
last_house_majority | 102 | replaced with the most common value from state and district | 0 |
fundraising | 7161 | version 1: replaced with the mean from state and district; version 2: model-based imputation | 0 |
The reason for most of these NaN values is that some observations go back as far as 1824, when much of this information was not yet available. Besides mean imputation we also implemented a function that allows model-based imputation, but since we didn’t see any improvement when testing with the classification models, we kept the mean imputation method.
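As an illustration of the grouped mean imputation, a helper along these lines could be used (the function name and the exact group columns are assumptions for this sketch):

```python
# Sketch of the grouped mean imputation; the group columns are illustrative
def impute_group_mean(df, column, group_cols=("state", "district")):
    group_mean = df.groupby(list(group_cols))[column].transform("mean")
    # Fill from the group mean first, then fall back to the overall mean
    df[column] = df[column].fillna(group_mean).fillna(df[column].mean())
    return df

df_train = impute_group_mean(df_train, "votes")
df_train = impute_group_mean(df_train, "unemployement_rate")
```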
Variable selection
We are only using 20 variables in our dataset, so dimension reduction was not that important, but for the modeling part we wanted to know about the feature importance.
We used 7 categorical variables, including the response variable. To test whether there is a significant relationship between a predictor variable and the response variable we used the Chi-Square test:
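The notebook code is not reproduced here; a minimal sketch of such a test uses `scipy.stats.chi2_contingency` on a contingency table of each predictor against `won` (the decision rule that labels a variable as important or not is omitted):

```python
from scipy.stats import chi2_contingency
import pandas as pd

categorical = ["president_party", "state", "district",
               "last_house_majority", "name"]

for col in categorical:
    # Contingency table of the predictor against the response variable
    contingency = pd.crosstab(df_train[col], df_train["won"])
    chi2, p_value, dof, expected = chi2_contingency(contingency)
    print(f"{col}: p-value {p_value:+.3f}, chi2 {chi2:+.1f}")
```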
The printed result was this:
- Important for the prediction model: president_party (p-value: +0.231, chi2: +2.9)
- Important for the prediction model: state (p-value: +1.000, chi2: +18.5)
- Important for the prediction model: district (p-value: +1.000, chi2: +15.3)
- Important for the prediction model: last_house_majority (p-value: +0.933, chi2: +0.0)
- NOT important for the prediction model: name (p-value: +0.000, chi2: +9938.0)
In the end we could not simply remove the variable “name” from the variable set at this point, because it was needed for the feature engineering done in the modeling part.
To get a sense of the feature importance we used a Random Forest model with one-hot encoding. This method was then also used in the modeling part to decide which variables could be dropped. Variables like state or district, for which we had to create dummy variables, received lower feature importance values because their importance is spread over several columns.
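A minimal sketch of this approach (which columns are one-hot encoded and the hyperparameters are assumptions here, not our exact setup):

```python
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# One-hot encode the categorical predictors
X = pd.get_dummies(df_train.drop(columns=["won", "name"]),
                   columns=["state", "district", "party",
                            "president_party", "last_house_majority"])
y = df_train["won"]

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Pair each importance value with its column name, sorted descending
importances = sorted(zip(rf.feature_importances_, X.columns), reverse=True)
print(importances[:20])
```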
These are the first 20 variables sorted by importance:
- (0.128, 'percent'),
- (0.122, 'votes'),
- (0.092, 'fundraising'),
- (0.053, 'unemployement_rate'),
- (0.050, 'first_time_elected'),
- (0.040, 'year'),
- (0.031, 'last_D_house_seats'),
- (0.030, 'last_R_house_seats'),
- (0.024, 'count_victories'),
- (0.022, 'president_overall_avg_job_approval'),
- (0.015, 'state_California'),
- (0.014, 'is_incumbent'),
- (0.0128, 'won'),
- (0.011, 'district_District 1'),
- (0.011, 'is_presidential_year'),
- (0.011, 'district_District 2'),
- (0.009, 'district_District 4'),
- (0.009, 'state_Texas'),
- (0.009, 'district_District 3'),
- (0.008, 'state_New York'),
We also used a scatterplot matrix to find out which variables would be important and to gain insights into their relevance. During the EDA phase we had already experimented with combining some variables, so this is just one of the examples we analyzed. Here we differentiated the observations by the “won” factor:
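Such a plot can be produced with seaborn’s pairplot; a minimal sketch (the variable subset chosen here is an assumption):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative subset of the continuous variables
cols = ["fundraising", "count_victories", "unemployement_rate", "percent"]
sns.pairplot(df_train, vars=cols, hue="won")
plt.show()
```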
Although we can’t see clear linear correlations or patterns, separating the winners from the losers shows that the green and orange observations are, in some cases, very distinct.
Some examples:
- The fundraising of winners is much higher than that of losers
- Winners reach a higher count of victories
Then we plotted only the candidates who won the elections, differentiated by party:
We still can’t see clear linear correlations or patterns, although we notice some differences:
- Democratic winners reach a higher count_victories, a sign of greater political longevity or district stability
- When the unemployment rate is higher, Democrats win more often than Republicans
- Democratic winners raise slightly less funds than Republican winners
We also created heat map matrices that give us insights into the correlations between the continuous variables. In the plots we differentiated whether Democrats or Republicans won. This matrix shows the correlations when the Democrats won, together with a colorbar on the right side:
This matrix shows the correlations when the Republicans won:
We can see that the strength of the relationships is very often similar for Democrats and Republicans. As expected, the correlations between the variables percent, votes and won are high.
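A minimal sketch of how these heat maps could be generated (assuming the party column uses the codes D and R):

```python
import seaborn as sns
import matplotlib.pyplot as plt

for party in ["D", "R"]:
    # Winning candidates of one party
    winners = df_train[(df_train["won"] == 1) & (df_train["party"] == party)]
    # Correlation matrix of the numeric columns only
    corr = winners.select_dtypes("number").corr()
    sns.heatmap(corr, cmap="coolwarm", center=0)
    plt.title(f"Correlations when party {party} won")
    plt.show()
```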
We also wanted to know about the spread of the data and how the values for Democrats and Republicans compare to each other, so we created some boxplots:
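A minimal sketch for one such boxplot, again assuming the party codes D and R (fundraising is just one example variable):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Compare the spread of fundraising between winners of both parties
winners = df_train[df_train["won"] == 1]
sns.boxplot(data=winners, x="party", y="fundraising", order=["D", "R"])
plt.show()
```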
In the modeling phase we will use this information from the EDA to create features that improve model performance and to concentrate on the most important features.
Baseline
For the baseline model we created a very simple data model to check whether we have a promising set of features. We used a simple prediction for each district: we just took the most frequently occurring winner per district before 2018. Compared to the actual election results in 2018, we get 77% accuracy.
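A minimal sketch of this baseline (whether the most frequent winner is tracked by candidate name or by party is an assumption in this sketch):

```python
# Winners in all elections before 2018
past_winners = df[(df["year"] < 2018) & (df["won"] == 1)]

# Most frequently occurring winner per state/district before 2018
prediction = (past_winners
              .groupby(["state", "district"])["name"]
              .agg(lambda s: s.mode().iloc[0]))

# Actual winners in 2018, indexed the same way
actual = (df[(df["year"] == 2018) & (df["won"] == 1)]
          .set_index(["state", "district"])["name"])

accuracy = (prediction.reindex(actual.index) == actual).mean()
print(f"Baseline accuracy: {accuracy:.0%}")
```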
So we are confident that we will be able to create good models in the modeling phase.