How to Build and Deploy a Data Science App for Free
In this article, I summarise my experience building and deploying a Streamlit web app that compares the similarity in voting behaviour of Members of the German Parliament. After reading this article and having a look at the GitHub repository, you should be able to build and deploy data science apps with Streamlit yourself.
In one of the lectures of my master’s program in data science, I came across the concept of similarity measures and wondered whether they could be applied to the voting behaviour of German members of parliament. I wanted to find out whether you could measure an MP’s alignment with the party line, or at least identify those MPs who aren’t very aligned. The second question that popped up in my mind was whether some parties are more aligned than others. Since I also wanted to learn more about building and deploying data science apps, I figured this would be a good study project to combine both topics.
The first step was to find a dataset containing the voting behaviour of German MPs. Fortunately, a quick Google search yielded promising results: I found the following dataset, which contains voting data from 1949 until 2022.
As a start, I only investigated the data for the last complete legislative period, from 2017 to 2021. I filtered the large dataset down to that timeframe and stored it as a CSV file. This allowed me to go forward with a more lightweight dataset, speeding up processing times.
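That filtering step can be sketched roughly like this (the column name and the period boundary dates are my assumptions; the real dataset has its own schema):

```python
import pandas as pd

# Toy stand-in for the full 1949-2022 dataset (the real file has many more columns).
df = pd.DataFrame({
    'vote_date': pd.to_datetime(['1953-05-01', '2018-03-14', '2020-06-18', '2022-01-12']),
    'vote_id': [1, 2, 3, 4],
})

# Keep only votes from the 19th legislative period
# (constituted 2017-10-24, ended 2021-10-26).
mask = (df['vote_date'] >= '2017-10-24') & (df['vote_date'] <= '2021-10-26')
df_1721 = df.loc[mask]

# Store the lightweight subset as a CSV for faster loading later on:
# df_1721.to_csv('data/voting_1721.csv', index=False)
```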
First, I thought of visualising the similarity as a network graph with networkx and plotly. This approach produced some results, but the resulting network graph was not very insightful due to the large number of nodes. You can have a look at the whole code in this Kaggle notebook.
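For reference, the network idea can be sketched as follows; the vote encoding, the choice of cosine similarity, and the 0.8 threshold are illustrative assumptions, not the exact notebook code:

```python
import numpy as np
import networkx as nx

# Toy vote matrix: rows = MPs, columns = votes (1 = yes, -1 = no, 0 = abstention).
votes = np.array([
    [1, 1, -1, 1],
    [1, 1, -1, 0],
    [-1, -1, 1, -1],
])

# Cosine similarity between every pair of MPs.
unit = votes / np.linalg.norm(votes, axis=1, keepdims=True)
sim = unit @ unit.T

# Connect MPs whose voting similarity exceeds a threshold.
G = nx.Graph()
G.add_nodes_from(range(len(votes)))
for i in range(len(votes)):
    for j in range(i + 1, len(votes)):
        if sim[i, j] > 0.8:
            G.add_edge(i, j, weight=sim[i, j])
```

With several hundred MPs and thousands of edges, a graph built this way quickly becomes too dense to read, which is exactly the problem described above.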
To make things easier to understand, I suspected it might make sense to apply principal component analysis (PCA) to reduce the feature space to three dimensions so that it can be easily visualised. To create an immersive visualisation of the 3D feature space, I chose a plotly 3D scatterplot. This resulted in the following visualisation:
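A sanity check worth doing at this point (my own addition, not part of the original notebook) is to look at how much of the total variance the three components actually retain:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy feature matrix standing in for the MP voting features
# (300 MPs x 50 votes, values in {-1, 0, 1}).
rng = np.random.default_rng(42)
features = rng.choice([-1, 0, 1], size=(300, 50))

pca = PCA(n_components=3)
comps3d = pca.fit_transform(features)

# Fraction of the total variance captured by the 3D projection.
retained = pca.explained_variance_ratio_.sum()
print(comps3d.shape, f'{retained:.0%}')
```

If the retained variance is very low, the 3D scatterplot may hide much of the structure in the data; with real voting data, where party blocs vote together, the first few components usually capture far more variance than in this random toy example.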
With the 3D scatterplot as the centerpiece, I could now turn my Jupyter notebook into a .py file so that I can run it as a Streamlit application. In case you aren’t familiar with Streamlit: it is an easy-to-use data science application library with which you can turn code into a running web app with just a few lines of additional code.
import pandas as pd
import numpy as np
import plotly.express as px
import streamlit as st
from sklearn.decomposition import PCA

def main():
    # Load the preprocessed feature matrix and the raw voting data.
    df_features = pd.read_csv('data/cleaned_features.csv')
    df = pd.read_csv('data/voting_1721.csv')

    # Look up each MP's name and party via their parliament ID.
    names = []
    parties = []
    for index in df_features['id_de_parliament']:
        party = df.loc[df['id_de_parliament'] == index, 'party_text'].values[0]
        lname = df.loc[df['id_de_parliament'] == index, 'lastname'].values[0]
        fname = df.loc[df['id_de_parliament'] == index, 'firstname'].values[0]
        names.append(fname + ' ' + lname)
        parties.append(party)

    # Reduce the voting feature space to three principal components.
    pca3d = PCA(n_components=3)
    comps3d = pca3d.fit_transform(df_features.iloc[:, 1:])

    df_3dpca = pd.DataFrame()
    df_3dpca['id_de_parliament'] = df_features['id_de_parliament']
    df_3dpca['pc1'] = comps3d[:, 0]
    df_3dpca['pc2'] = comps3d[:, 1]
    df_3dpca['pc3'] = comps3d[:, 2]
    df_3dpca['name'] = names
    df_3dpca['party'] = parties

    # Colour the points by party affiliation.
    color_mapper = {'': 'white', 'AfD': 'lightblue', 'CDU': 'black', 'CSU': 'darkblue', 'FDP': 'yellow',
                    'GRÜNE': 'green', 'Linke': 'magenta', 'SPD': 'red'}
    fig = px.scatter_3d(df_3dpca, x='pc1', y='pc2', z='pc3',
                        color='party', color_discrete_map=color_mapper, hover_data=['name'])

    st.title('Similarity of German Members of Parliament by Voting Behavior')
    st.plotly_chart(fig)

if __name__ == "__main__":
    main()
You can also host the apps you develop on Streamlit Cloud for free. Once I was finished developing my app, I pushed it to a remote GitHub repository and connected it to my Streamlit Cloud account. Streamlit has extensive documentation on how to do this. In my case, I hadn’t created a Conda or virtual environment and therefore had to set one up afterwards. This was necessary to create a requirements file listing all libraries needed for the app. If you aren’t sure how to do this, simply google it, or have a look at my source code in the linked GitHub repository.
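For an app like this one, the requirements file can be as short as the following (unpinned versions shown for brevity; pinning exact versions is generally safer for reproducible deployments):

```
streamlit
pandas
numpy
plotly
scikit-learn
```

Streamlit Cloud picks up a requirements.txt from the repository and installs the listed packages when building the app.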
When I ran the app for the first time, I faced an error with my automatically created requirements file: the certifi entry pointed to a file at a local path that was out of scope of the GitHub repository. I removed the file location from my requirements file and replaced it with the plain library name, certifi. After rebooting the app in Streamlit Cloud, the error was gone.
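The offending entry and its fix looked roughly like this (the local path shown is illustrative; pip freeze in a Conda environment writes whatever path exists on your machine):

```
# Before: a machine-local file path that Streamlit Cloud cannot resolve
certifi @ file:///local/conda/path/certifi

# After: the plain package name, resolved from PyPI instead
certifi
```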
Conclusion
In this article, I summarised my process from an initial idea to a deployed data science app. This is of course a simple example, but the underlying process can be used for more complex and extensive applications as well. I personally became a fan of Streamlit, and especially of its cloud functionality, during this project. In my humble opinion, Streamlit is a powerful tool that should be in every data scientist’s arsenal. The ability to go from initial idea to deployed prototype in hours will help you become a superb data scientist.
The whole code is stored in the following GitHub repository. You can also view the resulting app.