English

Building a data warehouse using Spark SQL

Why on Earth would you replace your data warehouse with a bunch of files lying around “in the cloud” and expect everyone, from engineers to data analysts and scientists, to tap into that data more to power their analyses, their research, and their work in general?

Separating the storage of data from the computational work and giving everyone the right tool for their job within a single, coherent environment has made it easy for our engineers, analysts, data scientists, and even non-technical people to collaborate on and work with data.

Working with data is hard. Far from the La La Land of Machine Learningstan and the United States of AI, there are many of us who still need to deal with messy data, ETL, backfilling, failed jobs, inefficient SQL queries, overloaded production databases, partitioning, and, with the advent of cloud computing, unterminated Spark clusters and infrastructure costs. This talk is about how we tackled some of those challenges while building out our new data warehouse using S3 and Spark.

This talk is for the rest of us.

Ratky Gábor
CTO, Secret Sauce Partners, Inc.

I’ve been the technical lead at Secret Sauce Partners since its inception in 2010. We started out as a small startup focused on solving problems in underserved markets. Today, our services are used by over 50 million people on some of the largest online apparel stores and marketplaces. Every day we collect, extract, transform, and process a ton of data to better understand our customers and our products. My job is to make sure we can scale our business efficiently, not just from a technical perspective, but also as an organization: a group of holly jolly people trying to work together.

Previously, I spent three great years at EPAM and another four commuting between San Francisco and Budapest.