A Data-Driven Political Landscape
In an effort to promote transparency, the Dutch government has created a website that can be used to access data about votes in the Dutch House of Representatives (Tweede Kamer). We've downloaded some of the data to see if we can create a data-driven political landscape based on voting behaviour.
Political parties that are placed close together exhibit similar voting behaviour, meaning that they vote for or against the same issues. The X and Y axes in this visualisation are dimensionless; what matters is the distance between the data points.
We can see that the far-right party PVV of Geert Wilders is at the top of the visualisation. Although FvD tries to present itself as a sensible alternative to the larger, established parties, it clearly votes similarly to the right-wing PVV.
On the bottom-left, the leftist parties are visualised. The green party (GroenLinks) votes roughly the same as the labour party (PvdA), and the party for animal welfare (PvdD) votes similarly to the socialist party SP.
On the bottom-right we see center (D66, CDA) and right-wing (VVD) parties. The fact that these center and right-wing parties are placed close together in this visualisation might be an indication that these center parties have shifted towards the right in the past years. One would have expected D66 to be somewhere in between PvdA and VVD if it were a truly central party, but it seems that its place in the cabinet coalition with VVD and CDA has shifted its voting behaviour to the right.
The SGP and ChristenUnie are Christian parties. We see that the SGP leans more towards the PVV-FvD than the ChristenUnie does. Indeed the SGP is seen as the more conservative of the two.
All in all, this visualisation reveals how the parties actually vote, and it can be used to make a more informed decision when voting in new elections or when holding parties accountable for their promises.
The Open Data Portal, as the website is called, allows us to query which party has voted in favor of or against motions and other decisions. For the purposes of this analysis, we're not going to look at the subject of the vote.
There are over 15,000 unique decisions in this dataset (on which every party has voted). After extracting and cleaning the relevant data, we're left with the following input dataset:
besluit (decision) has a unique identifier for each decision
fractie (political party) has the names of the political parties in it
stem (vote) contains the vote of a party for a certain decision where Voor means in favor and Tegen means against.
We transform the input dataset to generate a new dataset where every row contains the voting behaviour of a political party. There are two columns for each decision. A value of 1 is put in the first of those columns if a party voted against this decision, and a 1 is put in the second of those columns if the party voted in favor of this decision. Doing this yields a dataset with a very large number of columns (over 27,000) and only as many rows as there are political parties: 13.
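The transformation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the notebook: the decision identifiers and votes are made up, and the helper name `one_hot_votes` is hypothetical. It assumes the cleaned input only contains the values Voor and Tegen and silently skips anything else.

```python
# Toy input in the long format described above: (besluit, fractie, stem).
# The decision ids and votes are invented for illustration only.
votes = [
    ("b1", "VVD", "Voor"),
    ("b1", "SP", "Tegen"),
    ("b1", "D66", "Voor"),
    ("b2", "VVD", "Tegen"),
    ("b2", "SP", "Voor"),
    ("b2", "D66", "Tegen"),
]


def one_hot_votes(rows):
    """Return (parties, columns, matrix) with two 0/1 columns per decision."""
    parties = sorted({fractie for _, fractie, _ in rows})
    decisions = sorted({besluit for besluit, _, _ in rows})
    # For every decision: first the "against" column, then the "in favor" column.
    columns = [(d, s) for d in decisions for s in ("Tegen", "Voor")]
    col_index = {col: i for i, col in enumerate(columns)}
    row_index = {party: i for i, party in enumerate(parties)}
    matrix = [[0] * len(columns) for _ in parties]
    for besluit, fractie, stem in rows:
        # Skip vote values other than Voor/Tegen (assumed absent after cleaning).
        if (besluit, stem) in col_index:
            matrix[row_index[fractie]][col_index[(besluit, stem)]] = 1
    return parties, columns, matrix


parties, columns, matrix = one_hot_votes(votes)
for party, row in zip(parties, matrix):
    print(party, row)
```

Each party ends up with one row, and each decision contributes exactly two columns, which is why the real dataset becomes so wide.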
Correlation between Parties
We have created a long list of ones and zeros for each party based on their voting behaviour, but now we need to find some sort of similarity metric to compare them. How does the voting behaviour of one party correlate with the voting behaviour of another party? One answer is to view each voting behaviour as a signal over time and calculate the correlation between the various signals. To help you understand this, the votes of four parties on the first 20 decisions are visualised as a line chart in the image below.
We can calculate the correlation between each pair of these signals and display the result in the matrix below. The larger the blue circle, the more the voting of a party correlates with that of a different party. A red circle indicates that a party's voting behaviour correlates negatively with that of another party. The blue circles along the diagonal are the largest because they represent the correlation of each party's voting behaviour with itself.
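As a rough sketch of how such a correlation matrix could be computed, the example below uses NumPy's `np.corrcoef`. It assumes the Pearson correlation coefficient is meant (the article does not name the coefficient), and the three parties and their vote vectors are hypothetical.

```python
import numpy as np

# Hypothetical one-hot vote vectors: one row per party, one column per outcome.
parties = ["A", "B", "C"]
votes = np.array([
    [0, 1, 1, 0, 0, 1],  # party A
    [0, 1, 1, 0, 0, 1],  # party B votes identically to A
    [1, 0, 0, 1, 1, 0],  # party C always votes the opposite way
])

# np.corrcoef treats each row as one variable and returns the Pearson
# correlation between every pair of rows as a square matrix.
corr = np.corrcoef(votes)

for party, row in zip(parties, np.round(corr, 2)):
    print(party, row)
```

Identical voting gives a correlation of 1, exactly opposite voting gives -1, and the diagonal is always 1 because every party correlates perfectly with itself.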
Dimensionality reduction is a really cool technique that allows us to display high-dimensional data in a two-dimensional visualisation. Dimensionality reduction works by placing data points in such a way that similar data points are placed close together and very different data points are placed far apart. The dataset we use as an input for this dimensionality reduction has 13 rows (one for each party) and 13 columns (one for each correlation value with another party). Sometimes we refer to columns as being dimensions, so this would be a 13-dimensional dataset. At what number of dimensions a dataset could be called high-dimensional is a topic for another day.
We apply a dimensionality reduction technique called PCA to this dataset. You can specify the number of dimensions that the dataset should be reduced to. Since we want to visualise the result in a scatter plot, we choose this value to be 2. The final visualisation, along with an interpretation, can be seen at the start of this article.
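To make the reduction step concrete, here is a minimal PCA sketch based on NumPy's singular value decomposition. The notebook may well use a library implementation instead; the function name `pca_2d` and the 4x4 matrix standing in for the 13x13 correlation matrix are assumptions made for illustration.

```python
import numpy as np


def pca_2d(data):
    """Project the rows of `data` onto the first two principal components."""
    # PCA works on data whose columns have been centred around zero.
    centered = data - data.mean(axis=0)
    # The rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Hypothetical correlation matrix for four parties (two similar pairs).
corr = np.array([
    [1.0, 0.8, -0.3, -0.4],
    [0.8, 1.0, -0.2, -0.5],
    [-0.3, -0.2, 1.0, 0.7],
    [-0.4, -0.5, 0.7, 1.0],
])

coords = pca_2d(corr)
print(coords.shape)  # one (x, y) pair per party
```

The two columns of `coords` can be used directly as the x and y values of a scatter plot. Parties with similar rows in the correlation matrix end up near each other.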
The code that has been used for this analysis can be found here: https://colab.research.google.com/gist/BastiaanGrisel/2d64124d847ca1c5c176fba7cc4edbd0/analysis.ipynb
Every analysis comes with its assumptions; we'll discuss the most obvious ones here.
We have ignored parties that haven't voted on enough issues because we wouldn't have enough data to compare them to other parties.
"In favor" and "against" are mutually exclusive, so we could have removed half of the columns and reduced the dimensionality. If there is a 1 in the "in favor" column, that automatically means that there is a 0 in the "against" column.
Calculating the correlation between 27,000-dimensional vectors might lead to unexpected results, even though it did make for a good visualisation in the end. More investigation should be done into why certain parties are similar.
Using PCA to reduce the dimensionality from 13 to 2 might introduce errors where similar data points are placed far apart or vice versa. That risk appears limited in this analysis, since 91% of the variance in the original dataset is retained in the two-dimensional visualisation.
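A retained-variance figure like the one above can be derived from the singular values of the centred data. The sketch below shows one way to do that with NumPy; the function name and the small input matrix are hypothetical, so the printed percentage will not match the 91% reported for the real data.

```python
import numpy as np


def explained_variance_ratio(data):
    """Return the share of total variance captured by each principal component."""
    centered = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    # Squared singular values are proportional to the variance per component.
    variances = singular_values ** 2
    return variances / variances.sum()


# Hypothetical stand-in for the 13x13 correlation matrix.
corr = np.array([
    [1.0, 0.8, -0.3, -0.4],
    [0.8, 1.0, -0.2, -0.5],
    [-0.3, -0.2, 1.0, 0.7],
    [-0.4, -0.5, 0.7, 1.0],
])

ratio = explained_variance_ratio(corr)
retained = ratio[:2].sum()
print(f"variance retained by two components: {retained:.0%}")
```

The closer this value is to 100%, the less the two-dimensional picture distorts the distances between parties.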
What do these visualisations look like if you separate out each subject of the vote (climate, economy, etc.)?
How does this visualisation change over time? What if you generate this visualisation based on a 2-month time frame and animate the result over time?
Make a voting tool where you are asked to vote on some issues and the tool would return the parties your voting pattern correlates with.
Create an app that allows people at home to vote on issues as well: some sort of direct-polling application that could be used to give ordinary people a vote in the House of Representatives.