Bringing Continuous Delivery to Big Data Applications
In this presentation I will talk about our experience at SEEK implementing Continuous Integration & Continuous Delivery (CI/CD) in two of our Big Data applications.
I will talk about the Data Lake project and its use of microservices to break down data ingestion and validation tasks, and how this enables us to deploy changes to production more frequently. Data enters SEEK’s data lake through a variety of sources, including AWS S3, Kinesis and SNS. We use a number of loosely coupled serverless microservices and Spark jobs to implement a multi-layer data ingestion and validation pipeline. The microservices architecture enables us to develop, test and deploy the components of the pipeline independently, even while the pipeline is operating.
We use Test-Driven Development (TDD) to define the behaviour of each microservice and verify that it transforms the data correctly. Our deployment pipeline is triggered on each code check-in and deploys a component once its tests pass. The data processing pipeline is idempotent, so if there is a bug or integration problem in a component, we can deploy a fix and replay the affected data batches through the component.
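To illustrate what idempotent replay means here, the following is a minimal sketch (not SEEK's actual code; all names are hypothetical): output is keyed by batch, so reprocessing a batch overwrites its earlier results rather than duplicating them.

```python
# Hypothetical sketch of idempotent batch processing: output is written
# under the batch's key, so replaying a batch overwrites the previous
# output instead of appending duplicate records.

def process_batch(store: dict, batch_id: str, records: list) -> None:
    """Clean the records and write them under batch_id."""
    cleaned = [r.strip().lower() for r in records if r.strip()]
    store[batch_id] = cleaned  # overwrite-by-key makes replays safe

store = {}
process_batch(store, "batch-001", ["  Hello", "", "World "])
process_batch(store, "batch-001", ["  Hello", "", "World "])  # replay after a fix
```

Because the second (replayed) run produces exactly the same state as a single run, a buggy component can be fixed, redeployed and fed the affected batches again without corrupting downstream data.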
In the last part of the talk, I’ll dive deeper into some of the challenges we solved while implementing a CI/CD pipeline for our Spark applications written in Scala.
Outline/Structure of the Case Study
I will start by introducing the Data Lake project, walk through its architecture and its use of microservices, and briefly recap the advantages of the microservices architecture and Continuous Delivery.
Next, I will show the Continuous Delivery pipeline that takes each code commit, tests it and deploys it to production if the tests pass. The data dictionary is also generated automatically from the source code by the same pipeline.
I will then talk about Test-Driven Development and how we use it to write more robust components. This was challenging at the beginning, and we provided training for the team, through code retreats and pair programming, to build familiarity with TDD practices. I will provide examples of some Python services that are each built for a specific purpose.
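As a flavour of the TDD style used for these single-purpose services, here is a hedged sketch (hypothetical service and field names, not the talk's actual examples): the assertions are written first to define the expected behaviour, then the function is implemented to satisfy them.

```python
# Hypothetical single-purpose validation service, developed test-first.

def validate_record(record: dict) -> list:
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    if not record.get("id"):
        errors.append("missing id")
    if "timestamp" not in record:
        errors.append("missing timestamp")
    return errors

# In TDD these assertions come first; they pin down the behaviour
# before the implementation above is written.
assert validate_record({"id": "a1", "timestamp": 0}) == []
assert validate_record({"timestamp": 0}) == ["missing id"]
assert "missing timestamp" in validate_record({"id": "a1"})
```

Keeping each service this narrow is what lets its behaviour be specified exhaustively by a small test suite, which in turn is what makes deploying on every green build safe.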
After that, I’ll show how we use Docker to enable other teams to validate their changes before submitting Pull Requests to the Data Lake. As mentioned earlier, the SEEK Data Platform can receive data from a variety of sources, and any team can contribute new data sources or transformations in a few simple steps. Docker gives teams the ability to use the same tools that are used in the production system to develop their code.
In the last part of the talk, I’ll dive deeper into how we implement the CI/CD pipeline for our Spark applications (written in Scala) that process large data sets, using only Scala test frameworks and Spark’s local mode. I will cover a few approaches that seemed reasonable at first but didn’t work in practice, and how we arrived at a practical solution.
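The local-mode testing pattern referred to above can be sketched roughly as follows (a hedged illustration only; the suite and job names are hypothetical, not the talk's actual code): a ScalaTest suite spins up an in-process SparkSession with a `local[*]` master, so the transformation under test runs on the build agent's JVM without any cluster.

```scala
// Hypothetical sketch: testing a Spark transformation with ScalaTest
// and Spark's local mode, so CI needs no cluster.
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class DedupJobSpec extends FunSuite with BeforeAndAfterAll {

  // local[*] runs Spark inside the test JVM, using all available cores
  private val spark = SparkSession.builder()
    .master("local[*]")
    .appName("dedup-test")
    .getOrCreate()

  test("duplicate rows are removed") {
    import spark.implicits._
    val input  = Seq("a", "a", "b").toDF("value")
    val result = input.dropDuplicates() // the transformation under test
    assert(result.count() === 2)
  }

  override def afterAll(): Unit = spark.stop()
}
```

Because the whole suite is an ordinary JVM test run, it slots into the same commit-triggered pipeline as the rest of the code base.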
The main takeaway for attendees will be techniques for testing their big data applications, especially Spark applications, more effectively. They will also learn about the advantages of breaking big data applications into microservices.
Target Audience
Software Developers, Data Engineers
Prerequisites for Attendees
Familiarity with the following concepts and technologies will be very helpful:
Continuous Delivery, Microservices, Python, Scala, Spark.