Data Parallel Data Flow
Repa Flow is a library for data parallel data flow programming in Haskell. A flow is a bundle of independent streams, and the library provides operators such as map, fold and filter that apply to all streams in a bundle. Data parallelism is introduced by evaluating each stream in a separate thread on a multi-core machine.

Like a souped-up version of the Haskell conduit or pipes libraries, Repa Flow adds support for efficient chunked streams of unboxed data, bucketed files, and analytic operators such as the SQL-like groupBy and the Hadoop-like shuffle. Repa Flow uses three separate array fusion methods to gain good numeric performance, all while maintaining a pleasant user-facing API.
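To make the idea of a flow concrete, here is a minimal sketch of a bundle of independent streams, with a map that applies to every stream and a fold that evaluates each stream in its own thread. It uses only the base library. The names Bundle, mapBundle and foldBundle are hypothetical and chosen for illustration; they are not the Repa Flow API, and plain lists stand in for the chunked unboxed streams the library actually uses.

```haskell
import Control.Concurrent      (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Data.List               (foldl')

-- Hypothetical model: a bundle of independent streams (lists stand in for streams).
newtype Bundle a = Bundle [[a]]

-- Apply a function to every element of every stream in the bundle.
mapBundle :: (a -> b) -> Bundle a -> Bundle b
mapBundle f (Bundle ss) = Bundle (map (map f) ss)

-- Fold each stream in its own thread, returning one result per stream.
foldBundle :: (b -> a -> b) -> b -> Bundle a -> IO [b]
foldBundle f z (Bundle ss) = do
  vars <- mapM spawn ss
  mapM takeMVar vars
  where
    spawn s = do
      var <- newEmptyMVar
      -- Force the fold inside the worker thread, not in the thread that reads the MVar.
      _ <- forkIO (putMVar var $! foldl' f z s)
      return var

main :: IO ()
main = do
  let bundle = Bundle [[1 .. 10], [11 .. 20], [21 .. 30 :: Int]]
  sums <- foldBundle (+) 0 (mapBundle (* 2) bundle)
  print sums  -- prints [110,310,510]
```

With GHC, the worker threads only run on separate cores if the program is built with -threaded and started with a suitable +RTS -N setting.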
I’ll use Repa Flow to introduce the Data Parallel Data Flow paradigm, and discuss the general concept of flow polarity, meaning that data is pulled from stream sources but pushed to stream sinks.

If a given flow operator (like map or zip) has multiple input or output ports, then polarities can be assigned to these ports in various ways. However, only some assignments permit the operator to run in constant space. This fact is intrinsic to the data-flow model, rather than being specific to any particular implementation. In Repa Flow, only operators that run in constant space are exported by the library, which ensures that all programs written with the library also run in constant space by construction.
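Polarity can be sketched with two small types: a source that the consumer pulls from, and a sink that the producer pushes into. The example below is an illustration under assumed names (Source, Sink, drain, zipSource and the rest are hypothetical, not the Repa Flow API). It shows one polarity assignment for zip that runs in constant space: both inputs are pulled in lockstep, so at most one element from each is held at a time. If both inputs were pushed instead, elements arriving on one input would have to be buffered until the other caught up.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef, modifyIORef')

-- Pulled endpoint: the consumer asks for the next element.
newtype Source a = Source { pull :: IO (Maybe a) }

-- Pushed endpoint: the producer hands over elements, then signals the end.
data Sink a = Sink { push :: a -> IO (), eject :: IO () }

-- Build a source that yields the elements of a list one at a time.
sourceOfList :: [a] -> IO (Source a)
sourceOfList xs0 = do
  ref <- newIORef xs0
  return $ Source $ do
    xs <- readIORef ref
    case xs of
      []     -> return Nothing
      y : ys -> writeIORef ref ys >> return (Just y)

-- map with a pulled input and a pulled output.
mapSource :: (a -> b) -> Source a -> Source b
mapSource f (Source p) = Source (fmap (fmap f) p)

-- zip with both inputs pulled: one element from each side per step,
-- so no buffer is needed. Stops when either input ends.
zipSource :: Source a -> Source b -> Source (a, b)
zipSource (Source pa) (Source pb) = Source $ do
  ma <- pa
  case ma of
    Nothing -> return Nothing
    Just a  -> fmap (fmap ((,) a)) pb

-- The point where pull meets push: move elements one at a time.
drain :: Source a -> Sink a -> IO ()
drain src snk = loop
  where
    loop = do
      mx <- pull src
      case mx of
        Nothing -> eject snk
        Just x  -> push snk x >> loop

-- A sink that accumulates a running total in a mutable reference.
sumSink :: IORef Int -> Sink Int
sumSink ref = Sink (\x -> modifyIORef' ref (+ x)) (return ())

main :: IO ()
main = do
  xs    <- sourceOfList [1, 2, 3]
  ys    <- sourceOfList [10, 20, 30, 40]
  total <- newIORef 0
  drain (mapSource (uncurry (+)) (zipSource xs ys)) (sumSink total)
  readIORef total >>= print  -- prints 66
```

In this sketch, exporting only zipSource (and no push-push variant) is what would keep every program built from these pieces in constant space.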