
Dean Wampler
Director of Engineering
IBM Research
United States
Member for 5 years
Dean Wampler is an expert in data systems, focusing on applications of ML/AI. He leads an engineering team at IBM Research that helps research organizations leverage quantum computing, advanced simulations, cloud services, and other technologies for accelerated discovery of new drugs and materials, modeling climate trends, etc. Dean is the author of several O’Reilly books, including "Programming Scala, Third Edition", and he blogs on various topics at Medium. Dean contributes to several open source projects and he co-organizes and speaks at many technology conferences and Chicago-based user groups. Dean has a Ph.D. in Physics from the University of Washington.
Lessons Learned from 15 Years of Scala in the Wild
45 Mins
Keynote
Advanced
Scala 3 was released last year, introducing significant changes to the language, many of them motivated by lessons learned from roughly 15 years of actual use in open-source and commercial applications.
I'll explore these lessons and how Scala 3 addresses them. Many revolve around the pros and cons of implicits. Also, changes to the type system make it more "regular", robust, and expressive. Finally, the new, optional, and controversial "Python-like" syntax promotes even more brevity. It also acknowledges how influential and pervasive Python has become across our industry.
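For example, here is a minimal sketch (my own illustration, not taken from the talk) of how Scala 3's `given`/`using` clauses replace Scala 2's overloaded `implicit` keyword, written in the optional brace-free syntax:

```scala
// A type class and an instance, declared with Scala 3's `given` syntax.
trait Show[A]:
  def show(a: A): String

given Show[Int] with
  def show(a: Int): String = s"Int($a)"

// `using` marks the contextual parameter explicitly; note the optional
// indentation-based, "Python-like" syntax throughout.
def printAll[A](items: List[A])(using s: Show[A]): Unit =
  items.foreach(i => println(s.show(i)))

@main def demo(): Unit =
  printAll(List(1, 2, 3)) // prints Int(1), Int(2), Int(3)
```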
But there are many practical areas where future work is required, many of which are larger than the scope of Scala itself. We still live in "dependency hell". We still use too many obsolete idioms that hide accidental complexity, rather than forcing us to fix it. What should we do about these issues?
Cluster-wide Scaling of Machine Learning with Ray
45 Mins
Invited Talk
Intermediate
Popular ML techniques like reinforcement learning (RL) and hyperparameter optimization (HPO) require a variety of computational patterns for data processing, simulation (e.g., game engines), model search, training, serving, and other tasks. Few frameworks efficiently support all of these patterns, especially when scaling to clusters.
Ray is an open-source, distributed framework from U.C. Berkeley’s RISELab that easily scales applications from a laptop to a cluster. It was created to address the needs of reinforcement learning and hyperparameter tuning, in particular, but it is broadly applicable for almost any distributed Python-based application, with support for other languages forthcoming.
I'll explain the problems Ray solves and how Ray works. Then I'll discuss RLlib and Tune, the RL and HPO systems implemented with Ray. You'll learn when to use Ray versus alternatives, and how to adopt it for your projects.
Reactive Designs & Language Paradigms
50 Mins
Talk
Intermediate
Can reactive designs be implemented in any programming language? Or are some languages and programming paradigms better for building reactive systems? How do traditional design approaches, like Object-Oriented Design (OOD) and Domain-Driven Design (DDD), apply to reactive applications?
The Reactive Manifesto strikes a balance between specifying the essential features for reactive systems and allowing implementation variations appropriate for each language and execution environment. We'll compare and contrast techniques like Reactive Streams, callbacks, Actors, Futures, and Functional Reactive Programming (FRP), and we'll see examples of how they are realized in various languages and toolkits. We'll examine their relative strengths and weaknesses, their similarities and differences, and draw lessons for building reactive applications more effectively.
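To give a flavor of these contrasts, here is a tiny illustrative Scala sketch (not code from the talk) of two of the techniques: composing a single asynchronous value with a Future versus handling it by registering a callback:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// A Future models one asynchronous value, composed with combinators:
val adjusted: Future[Double] = Future(41.0).map(_ * 1.02)

// A callback handles the same value by registering a continuation instead:
adjusted.foreach(price => println(f"adjusted price: $price%.2f"))

// Block briefly only to keep this tiny example deterministic:
Await.ready(adjusted, 1.second)
```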
Copious Data, the “Killer App” for Functional Programming
30 Mins
Talk
Advanced
The world of Copious Data (permit me to avoid the overexposed term Big Data) is currently dominated by Apache Hadoop, a clean-room version of the MapReduce computing model and distributed, (mostly) reliable file system invented at Google.
But the MapReduce computing model is hard to use. It's very coarse-grained and relatively inflexible. Translating many otherwise intuitive algorithms to MapReduce requires specialized expertise. The industry is already starting to look elsewhere…
However, the very name MapReduce tells us its roots: the core concepts of mapping and reducing, familiar from Functional Programming (FP). We'll discuss how to return MapReduce, and Copious Data in general, to their ideal place, rooted in FP. We'll discuss the core operations ("combinators") of FP that meet our requirements, finding the right granularity for modularity, myths about mutability and performance, and trends that are already moving us in the right direction. We'll see why the dominance of Java in Hadoop is harming progress. You might think that concurrency is the "killer app" for FP, and maybe you're right, but I'll argue that Copious Data is just as important for driving FP into the mainstream. Actually, FP has a long tradition in data systems, but we've been calling it SQL…
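To ground the "combinators" claim, here is the canonical MapReduce example, word count, written with the map, group, and reduce combinators of Scala's standard collections. This is an illustrative sketch only; a real Hadoop or Spark job distributes each phase across a cluster:

```scala
// Word count with FP combinators. Each stage corresponds to a phase of
// the MapReduce model: map (emit words), shuffle (group), reduce (count).
val lines = List("the quick fox", "the lazy dog", "the fox")

val counts: Map[String, Int] =
  lines
    .flatMap(_.split("\\s+"))                   // "map": emit one word per token
    .groupBy(identity)                          // "shuffle": collect words by key
    .map { case (word, ws) => word -> ws.size } // "reduce": aggregate per key

// counts == Map("the" -> 3, "quick" -> 1, "fox" -> 2, "lazy" -> 1, "dog" -> 1)
```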
Reactive Design & Language Paradigms
60 Mins
Keynote
Advanced
Can reactive designs be implemented in any programming language? Or are some languages and programming paradigms better for building reactive systems? How do traditional design approaches, like Object-Oriented Design (OOD) and Domain-Driven Design (DDD), apply to reactive applications?
The Reactive Manifesto strikes a balance between specifying the essential features for reactive systems and allowing implementation variations appropriate for each language and execution environment. We'll compare and contrast techniques like Reactive Streams, callbacks, Actors, Futures, and Functional Reactive Programming (FRP), and we'll see examples of how they are realized in various languages and toolkits. We'll examine their relative strengths and weaknesses, their similarities and differences, and draw lessons for building reactive applications more effectively.
Scala and the JVM as a Big Data Platform: Lessons from the Spark Project
60 Mins
Talk
Advanced
Apache Spark is implemented in Scala, and its user-facing Scala API is very similar to Scala's own collections API. The power and concision of this API are bringing many developers to Scala.
On the other hand, while the JVM is an excellent, general-purpose platform for scalable computing, its management of objects is suboptimal for high-performance data crunching. Hence, the Spark team recently started an effort called "Tungsten" to build internal optimizations based on custom data layouts, manual memory management (both on-heap and off-heap), and related techniques.
Using these and other examples from the Spark project, this talk discusses the strengths and weaknesses of Scala and the JVM for Big Data.
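As a hedged illustration of that similarity (my sketch, not the talk's code; `sc` is an assumed `SparkContext`, so the Spark lines are shown commented out), compare the same per-key aggregation written against a Scala collection and against a Spark RDD:

```scala
// The same per-key maximum, first with Scala's collections API:
val scores = List(("alice", 80), ("bob", 95), ("alice", 90))

val bestLocal: Map[String, Int] =
  scores.groupBy(_._1).map { case (name, xs) => name -> xs.map(_._2).max }

// ...and with Spark's RDD API: nearly identical combinators, but the
// computation is partitioned across a cluster (requires a SparkContext):
// val bestDistributed = sc.parallelize(scores)
//   .reduceByKey((a, b) => math.max(a, b))
//   .collect()
```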
Workshop on Streaming Data with Kafka and Microservices
480 Mins
Workshop
Intermediate
When we think of modern data processing, we often think of batch-oriented ecosystems like Hadoop, including processing engines like Spark. However, the sooner we can extract useful information from our data, the better, which is driving an evolution towards stream processing or “fast data”. Many of the legacy tools, including Spark, provide various levels of support for stream processing, but deeper architectural changes are emerging.
We'll then work through code examples that use Akka Streams and Kafka Streams with Kafka to implement a machine-learning application in which the model is updated periodically, simulating the real-world problem of retraining and serving ML models in a streaming context. In particular, if you retrain the model with a separate tool chain, say once a day, how do you incorporate the updated model into a running scoring pipeline without restarting the pipeline?
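The heart of that pattern can be sketched with Akka Streams. This is a hedged illustration, not the workshop's code; the `Model` trait, the tick source, and the finite record list are placeholders for real trained models and Kafka topics:

```scala
import java.util.concurrent.atomic.AtomicReference
import scala.concurrent.duration._
import akka.actor.ActorSystem
import akka.stream.scaladsl.Source

implicit val system: ActorSystem = ActorSystem("model-serving")

// Hypothetical model type, standing in for a real trained model.
trait Model { def score(record: String): Double }

// The model in service lives behind an atomic reference, so the scoring
// stream never restarts when a new model arrives.
val current = new AtomicReference[Model](record => record.length.toDouble)

// "Retraining" stream: periodically publish a replacement model.
Source.tick(5.seconds, 5.seconds, ())
  .runForeach(_ => current.set(record => 2.0 * record.length))

// Scoring stream: each record is scored with whatever model is current.
Source(List("short", "a longer record"))
  .map(record => record -> current.get.score(record))
  .runForeach(println)
```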
Streaming Data with Kafka and Microservices
50 Mins
Talk
Intermediate
When we think of modern data processing, we often think of batch-oriented ecosystems like Hadoop, including processing engines like Spark. However, the sooner we can extract useful information from our data, the better, which is driving an evolution towards stream processing or “fast data”. Many of the legacy tools, including Spark, provide various levels of support for stream processing, but deeper architectural changes are emerging.
Hands-on Kafka Streaming Microservices with Akka Streams and Kafka Streams
300 Mins
Workshop
Intermediate
If you're building streaming data apps, your first inclination might be to reach for Spark Streaming, Flink, Apex, or similar tools, which run as services to which you submit jobs for execution. But sometimes, writing conventional microservices, with embedded stream processing, is a better fit for your needs.
In this hands-on tutorial, we start with the premise that Kafka is the ideal backplane for reliable capture and organization of data streams for downstream consumption. Then, we build several applications using Akka Streams and Kafka Streams on top of Kafka. The goal is to understand the relative strengths and weaknesses of these toolkits for building Kafka-based streaming applications. We'll also compare and contrast them to systems like Spark Streaming and Flink, to understand when those tools are better choices. Briefly, Akka Streams and Kafka Streams are best for data-centric microservices, where maximum flexibility is required for running the applications and interoperating with other systems, while systems like Spark Streaming and Flink are best for richer analytics over large streams where horizontal scalability through "automatic" partitioning of the data is required.
Each engine has particular strengths that we'll demonstrate:
- Kafka Streams is purpose-built for reading data from Kafka topics, processing it, and writing the results to new topics (see the sketch after this list). With powerful stream and table abstractions and an "exactly-once" capability, it supports a variety of common scenarios involving transformation, filtering, and aggregation.
- Akka Streams emerged as a dataflow-centric abstraction for the general-purpose Akka Actors model, designed for general-purpose microservices, especially when _per-event_ low latency is important, such as for complex event processing, where each event requires individual handling. In contrast, many other systems are most efficient at scale when the overhead is amortized over sets of records, processing "in bulk". Also, because of its general-purpose nature, Akka Streams supports a wider class of application problems and third-party integrations, but it is less focused on Kafka-specific capabilities.
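As a minimal sketch of what such a Kafka Streams topology looks like from Scala (illustrative only; the topic names, application id, and broker address are placeholders, not the workshop's code):

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

// Build a topology: read a topic, transform each record, write the results.
val builder = new StreamsBuilder()
builder.stream[String, String]("input-topic")
  .mapValues(v => v.toUpperCase)  // per-record transformation
  .to("output-topic")

// Minimal configuration; Kafka Streams manages partitions and offsets.
val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

new KafkaStreams(builder.build(), props).start()
```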
Kafka Streams and Akka Streams are both libraries that you integrate into your microservices, which means you must manage their lifecycles yourself, but you also get lots of flexibility to do this as you see fit.
In contrast, Spark Streaming and Flink run their own services. You write "jobs" or use interactive shells that tell these services what computations to do over data sources and where to send results. Spark and Flink then determine what processes to run in your cluster to implement the dataflows. Hence, there is less of a DevOps burden to bear, but also less flexibility when you might need it. Both systems are also more focused on data analytics problems, with various levels of support for SQL over streams, machine learning model training and scoring, etc.
For the tutorial, you'll be given an execution environment and the code examples in a GitHub repo. We'll experiment with the examples together, interspersed with short presentations, to understand their strengths, weaknesses, performance characteristics, and lifecycle management requirements.
Stream All the Things!!
50 Mins
Keynote
Intermediate
Streaming data architectures aren't just "faster" Big Data architectures. They must be reliable and scalable as never before, more like microservice architectures.
This talk has three goals:
- Justify the transition from batch-oriented big data to stream-oriented fast data.
- Explain the requirements that streaming architectures must meet and the tools and techniques used to meet them.
- Discuss the ways that fast data and microservice architectures are converging.
Big data started with an emphasis on batch-oriented architectures, where data is captured in large, scalable stores, then processed using batch jobs. To reduce the gap between data arrival and information extraction, these architectures are now evolving to be stream oriented, where data is processed as it arrives. Fast data is the new buzz word.
These architectures introduce new challenges for developers. Whereas a batch job might run for hours, a stream processing system typically runs for weeks or months, which raises the bar for making these systems reliable and scalable to handle any contingency.
The microservice world has faced this challenge for a while. Microservices are inherently message driven, responding to requests for service and sending messages to other microservices, in turn. Hence, they are also stream oriented, in the sense that they must respond reliably to never-ending input. So, they offer guidance for how to build reliable streaming data systems. I'll discuss how these architectures are merging in other ways, too.
We'll also discuss how to pick streaming technologies based on four axes of concern:
- Low latency: What's my time budget for handling this data?
- High volume: How much data per unit time must I handle?
- Data processing: Do I need machine learning, SQL queries, conventional ETL processing, etc.?
- Integration with other tools: Which ones and how is data exchanged between them?
We'll consider specific examples of streaming tools and how they fit on these axes, including Spark, Flink, Akka Streams, and Kafka.