Building a Scalable Data Science Pipeline at REA
REA Group is a multinational digital advertising company specializing in property, most well known for realestate.com.au.
REA has a 5+ year history of using machine learning to segment and profile consumer intent; for example, determining whether a user on our site is most likely a buyer, seller, renter or investor. While we have had success in applied data science, the journey from ideation to a shipped product traditionally took a considerable amount of time.
This talk will explore how REA rebuilt its data science pipeline to optimise data scientist autonomy. The focus will be on the technical solutions and social challenges faced by engineering and data science teams.
Outline/Structure of the Case Study
- Outline of the challenges faced when trying to productionise machine learning models in a two team model (data engineering and data science as separate teams)
- Data science team would work on a model (mostly in isolation) and then hand it over to engineering to implement.
- Overview of our tech stack at this point in time (EMR, RedShift, EC2 & SQS).
First we implemented Airflow to ease our job orchestration pains
- What is Airflow?
- Pipeline as code demo/example
- How did this help?
- Dependency management
- Backfilling data
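The two wins above (dependency management and backfilling) can be sketched in plain Python. This is a rough illustration of the idea, not Airflow's actual API; the task names are hypothetical:

```python
from datetime import date, timedelta

# Hypothetical mini-pipeline: each task lists the tasks it depends on,
# mirroring how Airflow lets you declare dependencies in code.
TASKS = {
    "clean_events": [],
    "build_features": ["clean_events"],
    "train_model": ["build_features"],
}

def run_order(tasks):
    """Return tasks in dependency order (a tiny topological sort)."""
    ordered, seen = [], set()

    def visit(name):
        if name in seen:
            return
        for dep in tasks[name]:
            visit(dep)
        seen.add(name)
        ordered.append(name)

    for name in tasks:
        visit(name)
    return ordered

def backfill(tasks, start, end):
    """Re-run the whole pipeline for each day in [start, end],
    the way an orchestrator backfills a date range."""
    runs, day = [], start
    while day <= end:
        for task in run_order(tasks):
            runs.append((day.isoformat(), task))
        day += timedelta(days=1)
    return runs
```

In real Airflow the same two ideas appear as operator dependencies declared in a Python DAG file and the built-in backfill over a date range.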
Then we started using BigQuery to enable organisation wide data publishing
- What is BigQuery?
- How it allowed us to share and manage data
Some trends we started to notice
- BigQuery was a really powerful and relatively cheap processing engine.
- Data scientists can write SQL really well, and can therefore create features at scale in BigQuery themselves without needing data engineers to translate it into ‘real’ code.
...but we needed a way to make this process robust.
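As a flavour of what "features as SQL" means, here is a minimal sketch. The table, columns and feature logic are hypothetical, and sqlite3 stands in for BigQuery purely so the snippet is self-contained (BigQuery's SQL dialect differs):

```python
import sqlite3

# Hypothetical page-view events; in practice these would live in BigQuery.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, section TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("u1", "buy"), ("u1", "buy"), ("u1", "rent"), ("u2", "rent")],
)

# Feature creation as plain SQL: views per site section, one row per user.
feature_sql = """
SELECT user_id,
       SUM(CASE WHEN section = 'buy'  THEN 1 ELSE 0 END) AS buy_views,
       SUM(CASE WHEN section = 'rent' THEN 1 ELSE 0 END) AS rent_views
FROM events
GROUP BY user_id
ORDER BY user_id
"""
features = conn.execute(feature_sql).fetchall()
# → [('u1', 2, 1), ('u2', 0, 1)]
```

The point is that this is a query a data scientist can write and own end to end, with no translation step into another language.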
Introducing Breeze (internal tool)
Breeze enables data scientists to run BigQuery SQL tasks in Airflow. The interface is a set of YAML files stored in a Git repo owned by engineering.
- Can chain dependencies together. For example first clean some data and then build feature x and y.
- Scheduled to run daily
- Alerting if anything fails
- Demo of a basic breeze yaml file
- Upskilling the data science team to use Git and raise PRs.
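To give a feel for the interface, a Breeze task file might look something like the following. This is a hypothetical example only; the real Breeze schema is internal to REA, and all names here are illustrative:

```yaml
# Hypothetical example -- the real Breeze schema is internal to REA.
task: build_buyer_features
type: bigquery_sql
sql: sql/buyer_features.sql        # query checked into the same repo
destination: analytics.buyer_features
schedule: daily
depends_on:
  - clean_page_events              # chained dependency, run first
alerts:
  - data-science-team@example.com  # notified on failure
```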
We wanted to add some sort of testing, so that we could detect errors early, enable refactoring and get all the other good things that come with testing.
- Ability to add test fixtures for SQL and UDFs (User Defined Functions).
- A bit of convincing and training required to get data scientists writing tests but we got there in the end.
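The fixture idea can be sketched like this: seed a throwaway database with a small, known input, run the production SQL against it, and assert on the rows that come back. sqlite3 stands in for BigQuery here, and the table, SQL and business rule are hypothetical:

```python
import sqlite3

def run_sql_with_fixture(sql, fixture_rows):
    """Run `sql` against a throwaway database seeded with fixture rows.
    Stand-in for BigQuery so the test is self-contained."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sessions (user_id TEXT, pages INTEGER)")
    conn.executemany("INSERT INTO sessions VALUES (?, ?)", fixture_rows)
    return conn.execute(sql).fetchall()

# The SQL under test: flag heavy users (hypothetical business rule).
HEAVY_USER_SQL = """
SELECT user_id FROM sessions WHERE pages >= 10 ORDER BY user_id
"""

# A test is just a known input paired with an expected output.
rows = run_sql_with_fixture(HEAVY_USER_SQL, [("u1", 3), ("u2", 12)])
assert rows == [("u2",)]
```

Because the fixture is tiny and deterministic, the SQL can be refactored with confidence that behaviour hasn't changed.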
Adding Data Integrity Checks
Testing that the SQL does what you expect is great but it doesn't help when the data itself can change unexpectedly. For example, the business decides to turn off a site section that you heavily rely on for predicting a particular behaviour.
- Add validation SQL to check outputs are as expected.
- Demo validation yaml
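A validation file might look something like this. Again, a hypothetical sketch only; the real schema and check names are internal:

```yaml
# Hypothetical validation config -- the real schema is internal to REA.
validate: analytics.buyer_features
checks:
  - name: table_not_empty
    sql: SELECT COUNT(*) > 0 FROM analytics.buyer_features
  - name: no_null_user_ids
    sql: SELECT COUNT(*) = 0 FROM analytics.buyer_features WHERE user_id IS NULL
on_failure: alert    # page the owning team rather than publish bad data
```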
Building on the framework
Once we had these core components it wasn't hard to add additional YAML templates:
- Export / Import to S3, SFTP & GCS.
- Export to our internal email and marketing systems.
- Export to S3, run a SageMaker model and import the results back into BQ.
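As an example of the last template, a SageMaker round trip might be declared like this. Every name, bucket and field here is hypothetical, purely to show the shape of a template:

```yaml
# Hypothetical template usage -- names and schema are illustrative only.
task: score_listings
type: sagemaker_batch
export:
  source: analytics.listing_features
  to: s3://example-bucket/features/      # staged for the model
model: listing-price-model
import:
  from: s3://example-bucket/predictions/
  destination: analytics.listing_scores  # back into BQ for downstream use
```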
How other teams might benefit from the process we went through.
- Engineering effort is now focused on adding to the platform instead of implementing individual data features/models.
- Data scientists can write production code themselves and not be waiting on engineering.
- The speed at which we can build a new model has dramatically improved.
- Data cleaning and feature creation is lightning fast on BigQuery.
- Not as much control over the process, and probably not the same rigour in testing.
- Mostly only a batch pipeline.
- We have hit some limitations with BigQuery
- Lack of tooling around SQL, such as style checks.
- Giving data scientists autonomy to productionise their own features & models gives you fast feedback loops and lowers the time from ideation to shipped product.
- Techniques and examples of how engineering teams can build robust pipelines when you have less control of the end to end process and are working with SQL.
- A basic introduction to some products in the market that we use: AWS SageMaker, Airflow & BigQuery.
- Finding common ground with data scientists and working with them to up-skill where required is crucial.
Anyone interested in the process, tooling and trade-offs involved in supporting data scientists getting models into production.
Prerequisites for Attendees
Interest in productionisation of models. Basic understanding of the process undertaken when building a machine learning model: Data sourcing, cleaning, feature creation, model creation and deployment.