Orchestrating Data Pipelines with Prefect
Modern data applications are more complex than ever, and with that complexity comes an ever-growing number of ways they can fail. Data sources can go down intermittently. Input data may arrive malformed. Machine learning model training may not converge. To guard against this, a non-trivial amount of effort goes into code paths that handle these failures gracefully: task retries, execution timeouts, and notifications on failure. Collectively, the effort spent guarding against failure is called negative engineering, while positive engineering refers to the work that accomplishes the original goals. Unfortunately, in most cases the effort spent on negative engineering significantly outweighs the effort spent on positive engineering.
Workflow orchestration frameworks are designed specifically to reduce the effort spent on negative engineering. These tools let users schedule and monitor workflows, and beyond monitoring, they let users define failure paths so the application knows how to respond when something unexpected happens. This talk will cover basic workflow orchestration functionality such as restarting failed workflows, skipping redundant tasks, and handling downstream tasks differently when upstream tasks fail. We'll familiarize ourselves with the features that let us handle these problems gracefully.
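As a concrete illustration of that last point, here is a minimal sketch using the Prefect 1.x API (an assumption about the version covered; the task and flow names are hypothetical) of a downstream task that only runs when an upstream task fails:

    from prefect import task, Flow
    from prefect.triggers import any_failed

    @task
    def load_data():
        # Simulate an intermittently failing data source
        raise RuntimeError("source went down")

    @task(trigger=any_failed)  # only runs if an upstream task failed
    def alert_on_failure():
        print("Upstream task failed; sending an alert")

    with Flow("failure-handling-example") as flow:
        data = load_data()
        alert_on_failure(upstream_tasks=[data])

    flow.run()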
We will first construct a basic workflow in native Python, then add workflow orchestration capabilities to it using Prefect, an open-source Python-based workflow orchestration framework. Along the way, we'll use features such as retries, notifications, parameterization, and scheduling. This talk is meant for anyone who is already running Python code and struggling to handle errors gracefully.
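As a rough sketch of what those features might look like with the Prefect 1.x API (the task names, retry settings, schedule interval, and URL below are illustrative assumptions, not the exact code from the talk):

    from datetime import timedelta
    from prefect import task, Flow, Parameter
    from prefect.schedules import IntervalSchedule

    @task(max_retries=3, retry_delay=timedelta(seconds=30))  # retry transient failures
    def fetch(url):
        print(f"fetching {url}")
        return [1, 2, 3]

    @task
    def transform(raw):
        return [x * 2 for x in raw]

    schedule = IntervalSchedule(interval=timedelta(hours=1))  # run every hour

    with Flow("etl", schedule=schedule) as flow:
        url = Parameter("url", default="https://example.com/data.csv")  # parameterization
        raw = fetch(url)
        transform(raw)

    flow.run()  # with a schedule attached, this keeps running on the interval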
Finally, one distinctive feature of Prefect is that it was built natively on Dask, allowing it to distribute and parallelize the tasks in a workflow and scale out over a cluster by changing just a few lines of code.
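For instance, under the same Prefect 1.x assumption (with a placeholder Dask scheduler address), swapping in a Dask executor might look like:

    from prefect import task, Flow
    from prefect.executors import DaskExecutor

    @task
    def square(x):
        return x ** 2

    with Flow("parallel-example") as flow:
        results = square.map(list(range(10)))  # mapped tasks can run in parallel

    # The address below is a placeholder for an existing Dask cluster;
    # omitting it spins up a temporary local Dask cluster instead.
    flow.run(executor=DaskExecutor(address="tcp://dask-scheduler:8786"))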
Kevin Kho
Community Engineer, Prefect
Kevin Kho is an Open Source Community Engineer at Prefect, an open-source workflow orchestration system. Previously, he was a data scientist at Paylocity, where he worked on adding machine learning features to its Human Capital Management (HCM) Suite. Outside of work, he is a contributor to Fugue, an abstraction layer for distributed computing, and he organizes the Orlando Machine Learning and Data Science Meetup.