Building a high-scale machine learning pipeline with Apache Spark and Kafka

Gutefrage.net is a question and answer website where questions are asked, answered and edited by its community of users. With over 100 million page views per month, it is one of the biggest websites in Germany. With 18 million questions and 70 million answers, presenting the answers to a specific question in a relevant order was a huge challenge for us. Thanks to Spark, Kafka and a couple of other components, we managed to set up a machine learning pipeline that efficiently processes incoming events from the website, calculates various features for every answer and finally feeds them to a neural network that decides how ‘good’ a specific answer is. In this talk I’d like to present our infrastructure, the obstacles we faced while building the pipeline and how we monitored the success of our application.

Bedő Dániel
Senior Engineer, gutefrage.net

Dániel Bedő has 10+ years of experience as a software engineer. As a passionate Scala developer working for gutefrage.net GmbH in Munich, Germany, he was part of the team that helped migrate gutefrage.net to a micro-service architecture. In recent years he has shifted his focus to big data, utilising machine learning to improve the user experience on gutefrage.net and gain insights into the available datasets.