Realizing the Promise of Portable Data Processing with Apache Beam
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the Big Data ecosystem together; it enables users to "run-anything-anywhere".
This talk will briefly cover the capabilities of the Beam model for data processing, as well as the current state of the Beam ecosystem. We'll discuss Beam architecture and dive into the portability layer. We'll offer a technical analysis of Beam's powerful primitive operations that enable true and reliable portability across diverse environments. Finally, we'll demonstrate a complex pipeline running on multiple runners in multiple deployment scenarios (e.g., Apache Spark on Amazon Web Services, Apache Flink on Google Cloud, Apache Apex on-premises), and give a glimpse of some of the challenges Beam aims to address in the future.
Outline/structure of the Session
We'll start by discussing the Apache Beam programming model and how it can effectively express data-parallel pipelines.
Then, we'll dive into Beam's vision of portability: "write once, run anywhere" for any big data problem, from both the theoretical and the practical standpoint. We'll show the current state of the vision, and demonstrate a reasonably complex pipeline running live on multiple execution engines.
Finally, we'll discuss how Apache Beam serves as a glue that connects the big data ecosystem together, bridging the gap between various big data technologies.
The audience will learn how to leverage the power of Apache Beam to solve a broad range of big data analytics problems, anywhere from a simple batch computation to a complex streaming system with early/speculative results and handling of late data.
Software Architects; Software Engineers; Data Analysts; Data Scientists; Operations/IT
A basic understanding of the big data analytics space would be beneficial, as would familiarity with key terms and technologies, including Apache Spark, Apache Flink, and/or Apache Beam.
Submitted 1 year ago