
Data-aware execution and visualization in Spark

Highly skewed and temporally inhomogeneous key distributions can lead to slower-than-expected execution on common workloads. Understanding the characteristics of the data, in addition to the data flow during execution, is highly beneficial for both batch and streaming use cases. We show that making Spark data-aware yields more stable and faster execution. We extend this data-awareness with record-level tracing capabilities to co-locate services and data-processing workloads effectively.

Zvara Zoltán
Research Scientist, Hungarian Academy of Sciences

Zoltán Zvara is a PhD student at Eötvös Loránd University, Hungary, and a researcher at the Hungarian Academy of Sciences. His research focuses on the execution and scheduling of general-purpose distributed data processing frameworks. Zoltán has led various Spark-related research projects to success in cooperation with well-known industrial partners.