A Data-Driven Political Landscape

The image on the left is generated by analysing 15,000 decisions in the Dutch parliament. Parties that are plotted close together had similar voting behaviour.

Step 1 - Access the Data

The Dutch government has published its parliamentary data as open data. It is accessible via an API that can be found here: https://opendata.tweedekamer.nl/

A sample of the dataset that has been used in this analysis can be viewed on the right.
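As a sketch of how this data can be fetched: the portal exposes an OData API that can be paged through with standard query parameters. The base URL and the entity name ("Stemming", the voting records) below are assumptions based on the portal's documentation, not code from the article's notebook; check https://opendata.tweedekamer.nl/ for the authoritative schema.

```python
import json
from urllib.request import urlopen

# Assumed base URL of the Tweede Kamer OData service.
BASE = "https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0"

def vote_query_url(entity="Stemming", top=250, skip=0):
    """Build an OData query URL for one page of voting records."""
    return f"{BASE}/{entity}?$top={top}&$skip={skip}"

def fetch_votes(pages=1, page_size=250):
    """Download pages of vote records as a flat list of dicts."""
    records = []
    for page in range(pages):
        url = vote_query_url(top=page_size, skip=page * page_size)
        with urlopen(url, timeout=30) as resp:
            # OData responses put the rows under the "value" key.
            records.extend(json.load(resp).get("value", []))
    return records
```

Paging with `$top` and `$skip` is the standard OData convention; the real dataset spans many pages, so `fetch_votes(pages=...)` would be called with a larger page count in practice.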

Step 2 - Prepare the data for analysis

Now things get interesting! We transform the input dataset to generate a new dataset where every row contains the voting behaviour of a political party.

There are two columns for each decision: a 1 is put in the first of those columns if a party voted against the decision, and a 1 is put in the second if the party voted in favor of it.

Doing this yields a dataset with a very large number of columns (over 27,000) and only as many rows as there are political parties: 13.
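The transformation above can be sketched with pandas. The toy input below stands in for the real API data, and the column names are assumptions for illustration; `pd.crosstab` produces exactly the wide layout described: one row per party and a pair of indicator columns per decision.

```python
import pandas as pd

# Toy sample of long-format vote records (column names are assumptions).
votes = pd.DataFrame({
    "party":    ["VVD", "VVD", "PvdA", "PvdA"],
    "decision": ["D1",  "D2",  "D1",   "D2"],
    "vote":     ["for", "against", "against", "for"],
})

# One row per party; for every decision a ("for") and an ("against")
# indicator column, holding 1 where that vote was cast and 0 otherwise.
wide = pd.crosstab(votes["party"], [votes["decision"], votes["vote"]])
```

On the real dataset this yields the 13-row, 27,000-plus-column matrix described above.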

Step 3 - Find the Correlation

We have created a long list of ones and zeros for each party based on their voting behaviour, but now we need some sort of similarity metric to compare them. How does the voting behaviour of one party correlate with the voting behaviour of another party? One answer is to view each party's voting behaviour as a signal over time and calculate the correlation between these signals. To help you understand this, the votes on the first 20 decisions of four parties are visualised as a line chart in the image on the right.

We can calculate the correlation between each of these signals and display the result in the matrix below. The larger the blue circle, the more a party's voting correlates with that of another party. A red circle indicates that a party's voting behaviour negatively correlates with another party's. The blue circles along the diagonal are the largest because each represents the correlation of a party's voting behaviour with itself, which is always 1.
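Treating each party's row of ones and zeros as a signal, the full correlation matrix is one call to NumPy. This is a minimal sketch with a toy 3-party, 4-column vote matrix, not the article's actual data:

```python
import numpy as np

# Toy vote matrix: one row per party, one column per vote indicator.
X = np.array([
    [1, 0, 1, 0],   # party A
    [1, 0, 0, 1],   # party B
    [0, 1, 0, 1],   # party C: votes exactly opposite to party A
])

# Pearson correlation between every pair of rows (party signals).
corr = np.corrcoef(X)
```

Here `corr[0, 2]` comes out as -1 because parties A and C voted exactly opposite on every decision, mirroring the red circles in the matrix visualisation.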

Step 4 - Add some Machine Learning

Dimensionality reduction is a really cool technique that allows us to display high-dimensional data in a two-dimensional visualisation. Dimensionality reduction works by placing data points in such a way that similar data points are placed close together and very different data points are placed far apart.

The dataset we use as an input for this dimensionality reduction has 13 rows (one for each party) and 13 columns (one for each correlation value with another party). Sometimes we refer to columns as being dimensions, so this would be a 13-dimensional dataset.

We apply a dimensionality reduction technique called PCA (Principal Component Analysis) to this dataset. You can specify the number of dimensions that the dataset should be reduced to. Since we want to visualise the result in a scatter plot, we choose this value to be 2. The final visualisation along with an interpretation can be seen at the start of this article.
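The projection can be sketched in a few lines of NumPy via the singular value decomposition; the article's notebook may well use a library implementation such as scikit-learn's `PCA` instead, and the toy 3-party correlation matrix below is an illustration, not the real data:

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto the first two principal components.

    Returns the 2-D coordinates and the fraction of variance retained.
    """
    Xc = X - X.mean(axis=0)                          # centre each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = Xc @ Vt[:2].T                           # one (x, y) per row
    explained = (S[:2] ** 2).sum() / (S ** 2).sum()  # retained variance
    return coords, explained

# Toy 3x3 correlation matrix standing in for the 13x13 one in the article.
corr = np.array([[ 1.0,  0.8, -0.5],
                 [ 0.8,  1.0, -0.4],
                 [-0.5, -0.4,  1.0]])
coords, explained = pca_2d(corr)
```

Each row of `coords` is one party's position in the scatter plot, and `explained` is the retained-variance figure the Discussion section refers to.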

A video explaining the concept behind Dimensionality Reduction

Step 5 - Open Source the Result

I’ve uploaded the code that was used to analyse the data and generate the visualisation to Google Colab. This is a platform where people can see and run the code. They can learn from it, or build on top of it to generate their own analyses.

The code can be found here: https://colab.research.google.com/gist/BastiaanGrisel/2d64124d847ca1c5c176fba7cc4edbd0/analysis.ipynb#scrollTo=WmG8pg2DwhyO

Discussion

Every analysis comes with its assumptions; we'll discuss the most obvious ones here.

  • We have ignored parties that haven't voted on enough issues because we wouldn't have enough data to compare them to other parties.

  • "In favor of" and "against" are mutually exclusive, so we could have removed half of the columns and halved the dimensionality. If there is a 1 in the "in favor of" column, that automatically means there is a 0 in the "against" column.

  • Calculating the correlation between 27,000-dimensional vectors might lead to unexpected results, even though it did make for a good visualisation in the end. More investigation should be done into why certain parties are similar.

  • Using PCA to reduce the dimensionality from 13 to 2 can introduce errors where similar data points are placed far apart or vice versa. This distortion is limited in this analysis, since 91% of the variance in the original dataset is retained in the two-dimensional visualisation.

Next Steps

  • What do these visualisations look like if you separate out each subject of the vote (climate, economy, etc.)?

  • How does this visualisation change over time? What if you generate this visualisation based on a 2-month time frame and animate the result over time?

  • Make a voting tool where you are asked to vote on some issues and the tool returns the parties your voting pattern correlates with most.

  • Create an app that allows people at home to vote on issues as well: some sort of direct-polling application that gives ordinary people a vote in the House of Representatives.
