
Evolution of Data Ingestion at Prezi

Everybody knows that using data to build your product is a necessity, but that is easier said than done. At Prezi, we have gone through several iterations of getting the right data, and we learned that success ultimately depends on how you capture that data in the first place and on how quickly you can give feedback to teams. In this talk, we will cover how our need to better understand how users use our product led us to design an event processing system that delivers those insights. Even though this does not sound hard, we got burnt a couple of times and redesigned our data ingestion pipeline more than once to get to where we are today. We will start with how the pipeline evolved from semi-structured event data copied to S3 with a bash script to Avro events, managed with the Confluent Schema Registry, ingested from Apache Kafka to S3 with Apache Gobblin. Apache Kafka helps a lot with scaling, but Kafka alone is not a silver bullet: we had to introduce additional components, such as the Avro format and the schema registry, to fill in the missing pieces. We will also cover what we built around this ecosystem to make sure engineers can’t break our system and to make it less painful to instrument the right events, as well as how instrumentation works across the platforms we support (Mac, Windows, iOS, Android, Web). Lastly, we will cover how this framework allows us to move faster by benefiting data science projects and self-service data in general.

Krasznai Gergely
Senior Data Analyst, Prezi

Németh Tamás
Tech Lead of Data Platform Team, Prezi