How to incrementally scale existing workflows on Spark, Dask or Ray?
Using Spark, Dask, or Ray is not an all-or-nothing thing. It may seem daunting for new practitioners expecting to translate existing Pandas pipelines to these big data frameworks. In reality, distributed computing can be incrementally adopted. There are many use cases where only one or two steps of a pipeline require expensive computation. This talk covers the strategies and best practices around moving portions of workloads to distributed computing through the open-source Fugue project. The Fugue API has a suite of standalone functions compatible with Pandas, Spark, Dask, and Ray. Collectively, these functions allow users to scale any part of their pipeline when ready for full-scale production workloads on big data.
Kevin Kho is a maintainer for the Fugue project, an abstraction layer for distributed computing. Previously, he was an Open Source Community Engineer at Prefect, a workflow orchestration management system. Before working on data tooling, he was a data scientist for 4 years.