At Zeotap, we manage over 1000 data pipelines, many of them interlinked, across our 1st and 3rd party data assets, in both batch and streaming mode. These pipelines were written using various compute engines such as Apache Spark, Dataflow (Apache Beam), and BigQuery. At times, many of these pipelines would run on clusters that faced performance bottlenecks or would block due to resource unavailability (spot nodes). At those times we wished we could take code written for one engine and run it on another, say move a Spark job onto BigQuery. But the coupling of the domain with the platform at the code level prevented this. We saw value in a unifying language, more precisely a DSL, to combine all these data processes and make them interoperable. Zeoflow is the result of creating such a unified data processing DSL. It is based on Free Monads and works as a high-level programming model, so compute engines like Spark, Beam, or any other can become plug-and-play interpreters. Additionally, data pipelines have other requirements to address, such as data read/write, data quality and metric reporting, and writing easily testable and debuggable code. Hence, we created an ensemble of applications, based on Free Monads and Applicatives, to better manage and decouple all of these aspects.
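To make the plug-and-play interpreter idea concrete, here is a minimal sketch of a Free Monad pipeline DSL in Scala using cats-free. The operation names (Read, Transform, Write) and the in-memory interpreter are illustrative assumptions for this abstract, not Zeoflow's actual API; a Spark or Beam interpreter would simply be another natural transformation targeting that engine.

```scala
import cats.free.Free
import cats.free.Free.liftF
import cats.{Id, ~>}

// Hypothetical algebra of pipeline operations (not Zeoflow's real API)
sealed trait PipelineOp[A]
case class Read(source: String) extends PipelineOp[List[Map[String, String]]]
case class Transform(rows: List[Map[String, String]],
                     f: Map[String, String] => Map[String, String])
    extends PipelineOp[List[Map[String, String]]]
case class Write(rows: List[Map[String, String]], sink: String) extends PipelineOp[Unit]

type Pipeline[A] = Free[PipelineOp, A]

// Smart constructors lift operations into the Free monad
def read(source: String): Pipeline[List[Map[String, String]]] = liftF(Read(source))
def transform(rows: List[Map[String, String]],
              f: Map[String, String] => Map[String, String]): Pipeline[List[Map[String, String]]] =
  liftF(Transform(rows, f))
def write(rows: List[Map[String, String]], sink: String): Pipeline[Unit] = liftF(Write(rows, sink))

// A pipeline described purely as data; no compute engine is involved yet
val job: Pipeline[Unit] = for {
  rows    <- read("users.csv")
  cleaned <- transform(rows, _.filter { case (_, v) => v.nonEmpty })
  _       <- write(cleaned, "users_clean")
} yield ()

// One possible interpreter: run everything in memory.
// A Spark or Beam backend would be another PipelineOp ~> F for a suitable F.
val inMemory: PipelineOp ~> Id = new (PipelineOp ~> Id) {
  def apply[A](op: PipelineOp[A]): Id[A] = op match {
    case Read(_)            => List(Map("id" -> "1", "name" -> "a")) // stubbed read
    case Transform(rows, f) => rows.map(f)
    case Write(rows, sink)  => println(s"writing ${rows.size} rows to $sink")
  }
}

job.foldMap(inMemory)
```

Because the pipeline is just a data structure until foldMap is applied, the same program can be run against different interpreters, which is the decoupling of domain from platform described above.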
The presentation will take you through the journey of what started as a small library built with reusability and ease of modeling business rules in mind, covering the design principles, requirements, and functional domain modeling we followed while choosing constructs like Free and State Monads. This helped it grow into an extensible DSL-based application suite that can operate over anything from simple SQL engines to complex Beam-based pipelines. This is a production system at Zeotap, and the project is in the final stages of being open-sourced. We try to show the beauty of pure FP constructs used to solve a complex real-world use case: data pipelines and data processing in general.