Kongo: Building a Scalable Streaming IoT Logistics Application using Apache Kafka and Cassandra
Imagine that you have (or want) a Logistics (or any other IoT) Empire with hundreds of warehouses and trucks transporting thousands of Goods every hour. Warehouses and trucks have sensors measuring and reporting tens of environmental metrics, and Goods carry RFID tags so their locations can be tracked continuously. Complex rules need to be checked in real time, including whether the environment is safe for each Goods item and whether Goods can safely be transported together. This is a demanding distributed spatiotemporal processing problem. How can you build a low-latency (sub-second) application that delivers events to multiple consumers, scales massively with increasing numbers of "things" and business tempo, copes with event delays, failures and load spikes, and ensures robustness by persisting events for future reprocessing?
Join me on a journey of exploration upriver for highlights of my experiences over the last six months building "Kongo", an Apache Kafka IoT streaming logistics demonstration application. We'll start the journey with an overview of the logistics problem domain, look at the application's design choices and evolution as it leverages different Kafka components and patterns, augment the application with Kafka Streams and the Kafka Connect Cassandra sink, and end the journey by scaling Kongo on a production Instaclustr-managed AWS Kafka cluster.
Outline/Structure of the Talk
We'll explore the example logistics problem domain, and the requirements and design of an initial standalone Kongo simulation and application.
Next, we start "Kafkafying" Kongo by adding Kafka Producers, Consumers and (de-)serialization, explore the pros and cons of having one topic versus many, look at ways to handle high fan-out (each event sent to many subscribers), and ask whether event order matters (it does).
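To see why event order matters, recall that Kafka only guarantees ordering within a partition, so a producer that keys events by Goods ID keeps each Goods item's event history in order. The sketch below (not from the talk; a minimal illustration, using md5 in place of Kafka's actual murmur2 partitioner) shows the property that matters: the key-to-partition mapping is deterministic, so all events for one key land on one partition.

```python
import hashlib

NUM_PARTITIONS = 6  # illustrative topic size

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a record key to a partition, as a keyed
    Kafka producer does. Kafka itself hashes with murmur2; md5 here is
    only for illustration -- the point is the mapping is deterministic."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for the same Goods ID land on the same partition, so
# Kafka's per-partition ordering guarantee keeps that Goods item's
# history (loaded -> in-transit -> unloaded) in order.
events = [("goods-42", "loaded"), ("goods-7", "loaded"),
          ("goods-42", "in-transit"), ("goods-42", "unloaded")]
partitions = [partition_for(key) for key, _ in events]
assert partitions[0] == partitions[2] == partitions[3]
```

Events with different keys (here `goods-42` vs `goods-7`) may interleave across partitions, which is exactly the trade-off between ordering and parallelism the talk explores.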
We then explore another Kafka API, Kafka Connect, and augment Kongo with a Kafka Connect Cassandra sink connector that persists a new event type (produced by the rule checks) to Cassandra for later processing and querying, playing whack-a-mole with distributed Connect workers along the way.
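A Cassandra sink connector is configured declaratively rather than coded. The fragment below is an illustrative sketch only: the connector class, option names, topic, keyspace and table are assumptions (they vary by connector vendor and version, and Kongo's actual names are not shown here); the shape, mapping a Kafka topic's record fields to Cassandra table columns, is the part that carries over.

```json
{
  "name": "kongo-violations-sink",
  "config": {
    "connector.class": "com.datastax.oss.kafka.sink.CassandraSinkConnector",
    "tasks.max": "3",
    "topics": "violation-events",
    "contactPoints": "cassandra-host",
    "loadBalancing.localDc": "datacenter1",
    "topic.violation-events.kongo.violations.mapping": "goods_id=value.goodsId, event_time=value.time, details=value.details"
  }
}
```

Posting this JSON to the Connect REST API starts distributed sink tasks that drain the topic into Cassandra, which is where the whack-a-mole with failing and rebalancing workers begins.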
We take a look at a fourth Kafka API, Kafka Streams, which enables more complex stream processing use cases (state, transformations, etc.). We'll add a Streams enhancement to Kongo that checks whether trucks are overloaded, encountering TopologyExceptions along the way, and using some transactional magic pixie dust!
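The overload check is, at heart, a stateful aggregation: group load/unload events by truck and keep a running total. In Kongo this is built with the Kafka Streams DSL; the pure-Python fold below is a minimal sketch of the same logic (function names, the event shape, and the 1000 kg limit are all hypothetical), with a plain dict standing in for the Streams state store.

```python
from collections import defaultdict

MAX_LOAD_KG = 1000  # hypothetical per-truck limit

def check_overloads(events, max_load=MAX_LOAD_KG):
    """Fold a stream of (truck_id, delta_kg) load/unload events into a
    running load per truck, emitting an alert whenever a truck goes over
    its limit. Analogous to a Kafka Streams groupByKey + aggregate."""
    loads = defaultdict(int)   # truck_id -> current load (the "state store")
    alerts = []
    for truck_id, delta_kg in events:
        loads[truck_id] += delta_kg
        if loads[truck_id] > max_load:
            alerts.append((truck_id, loads[truck_id]))
    return loads, alerts

loads, alerts = check_overloads([
    ("truck-1", 600), ("truck-2", 300),
    ("truck-1", 500),            # truck-1 now over the 1000 kg limit
    ("truck-1", -600),           # goods unloaded, back under the limit
])
assert alerts == [("truck-1", 1100)]
assert loads["truck-1"] == 500
```

Because the running totals are state that must survive failures and rebalances, this is exactly the kind of use case where Streams' state stores and transactions (the "magic pixie dust") earn their keep.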
We complete the journey by deploying Kongo to a production Instaclustr Kafka cluster on AWS and benchmarking it to see how well it scales at sub-second latencies. We'll revisit the high fan-out design choices, work out how to scale consumers, and show how we finally achieved linear scalability and a throughput of 400,000 events per second on a 3-node Kafka cluster with 4 cores per node.
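One constraint that shapes consumer scaling: within a consumer group, at most one consumer can read each partition, so partition count caps consumer parallelism. The back-of-envelope sizing below is a hedged sketch (the per-consumer rate is an illustrative assumption, not a figure from the benchmarks), but the ceiling rule itself is standard Kafka behavior.

```python
import math

def partitions_needed(target_events_per_sec: int,
                      per_consumer_events_per_sec: int) -> int:
    """Minimum partitions so that a consumer group, with one consumer
    per partition, can keep up with the target throughput."""
    return math.ceil(target_events_per_sec / per_consumer_events_per_sec)

# e.g. 400,000 events/s with consumers that each handle ~20,000 events/s
# (the 20,000 figure is hypothetical) needs at least 20 partitions.
assert partitions_needed(400_000, 20_000) == 20
```

Under-partitioned topics can't be rescued by adding consumers, which is why partition count is one of the first knobs to revisit when scaling out.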
Learning Outcomes
Overview of Apache Kafka use cases, architecture and APIs, including Producers, Consumers, Connect, and Streams.
Overview of an example IoT problem and application implemented using Apache Kafka and Cassandra.
Pros and cons of Apache Kafka design choices, including number of topics, number of consumers, high fan-out patterns, and event ordering.
Example applications of Kafka Connect and Kafka Streams.
How to scale a Kafka application in production.
Prerequisites for Attendees
Some knowledge of, or interest in, pub-sub and event-based distributed systems.
An interest in scalable, low-latency open source technologies (e.g. Apache Kafka, Cassandra) for IoT application development.