Hadoop has been a major influence on organisations' decisions about how to manage and process large volumes of data outside the traditional OLTP database market. Whilst it has been relatively successful in terms of market adoption, our thesis is that it has not delivered on its promises, because its architecture is close to 15 years old and rests on some flawed assumptions (such as the importance of data locality).
This talk explains the history of how we got to Hadoop, its strengths and weaknesses, and why many of its core assumptions are not necessarily relevant in 2017, when technical constraints are vastly different from those of 2004, when Hadoop was born. We will show the results of AGL's move from Hadoop to cloud-based ETL (Extract, Transform, Load) using Apache Spark against Blob storage, in an environment where true elastic compute is available.
We also explain how Event Sourcing came about as an idea and a technology, how it fits the Beyond Hadoop world for tasks such as ETL, and how it vastly simplifies the architecture we have chosen at AGL Energy.
Outline/structure of the Session
The talk is structured in three parts. The intention is to present only photos/images on the screen rather than text/code, as most of the day will be heavily text/code based for participants, which gets overwhelming.
1. The Rise of Hadoop
This part covers the history of data processing in large organisations, effectively comparing the IBM Mainframe (or equivalent) approach to resiliency with the Google model, following the Pets vs Cattle pattern. It also discusses the scale of cloud infrastructure and why some of the Hadoop decisions no longer make sense.
2. Event Sourcing
In the next few years we will hear a lot about Event Sourcing. This part introduces Event Sourcing through a brief history of bookkeeping/accounting, to explain what Event Sourcing is and why a lot of legacy systems were built the way they were (primarily because of a lack of compute, which is no longer a constraint). The intention is to start people thinking about the benefits of applying the functional programming concepts of immutability and pure functions to help reason about data pipelines.
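The bookkeeping analogy can be sketched in a few lines of Python. This is only an illustration under assumed names (`Entry`, `apply_entry` and `replay` are hypothetical and not taken from the AGL implementation): the ledger is an append-only sequence of immutable entries, and the current balances are derived by folding a pure function over it.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Dict, Iterable


@dataclass(frozen=True)
class Entry:
    """An immutable ledger entry (hypothetical shape): positive credits, negative debits."""
    account: str
    amount_cents: int


def apply_entry(balances: Dict[str, int], entry: Entry) -> Dict[str, int]:
    """Pure function: returns a new balances dict and leaves the input untouched."""
    updated = dict(balances)
    updated[entry.account] = updated.get(entry.account, 0) + entry.amount_cents
    return updated


def replay(entries: Iterable[Entry]) -> Dict[str, int]:
    """Derive the current state by folding apply_entry over the whole event log."""
    return reduce(apply_entry, entries, {})


# The ledger is append-only: a tuple of events in the order they happened.
ledger = (
    Entry("cash", 10_000),
    Entry("cash", -2_500),
    Entry("savings", 2_500),
)

# A mistake is corrected by appending a compensating entry, not by editing history.
corrected = ledger + (Entry("cash", -500),)

print(replay(ledger))
print(replay(corrected))
```

Because the entries are never modified, state at any earlier point can be rebuilt by replaying a prefix of the ledger, which is the property that makes such pipelines easier to reason about.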
3. How we have combined these two concepts at AGL
A summary of the outcomes from AGL, with metrics.
- That Hadoop should not be the go-to solution for companies seeking Big Data and/or Data Warehousing solutions.
- That Event Sourcing can greatly simplify the types of work people often do in the Hadoop ecosystem.
People who have implemented or are considering implementing Apache Hadoop