Taming the Spark beast for Deep Learning predictions at scale
Predicting at scale is challenging, especially with Deep Learning models, which tend to have high inference times. At idealo.de, Germany's biggest price comparison platform, the Data Science team was tasked with carrying out image tagging to improve our product galleries.
One of the biggest challenges we faced was generating predictions for more than 300 million images within a short time while keeping costs low. Moreover, solving the scaling problem became critical because we intended to apply other Deep Learning models to the same large dataset. We ended up formulating a batch-prediction solution using an Apache Spark setup that ran on an AWS EMR cluster.
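To make the scale concrete, here is a back-of-the-envelope estimate. Only the 300 million figure comes from the text above; the per-image latency and the worker counts are illustrative assumptions, not measurements from the project.

```python
# Rough wall-clock estimate for tagging a large image set.
# Only N_IMAGES comes from the abstract; everything else is assumed.

N_IMAGES = 300_000_000
ASSUMED_SECONDS_PER_IMAGE = 0.05  # hypothetical per-image inference latency


def wall_clock_hours(n_images: int, seconds_per_image: float, n_workers: int) -> float:
    """Hours needed if n_workers process images independently and in parallel."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    total_seconds = n_images * seconds_per_image / n_workers
    return total_seconds / 3600


if __name__ == "__main__":
    for workers in (1, 100, 1000):
        hours = wall_clock_hours(N_IMAGES, ASSUMED_SECONDS_PER_IMAGE, workers)
        print(f"{workers:>5} workers: {hours:,.1f} hours")
```

Under these assumed numbers, a single worker would need roughly 4,167 hours (about 174 days), which is why distributing the work across a cluster matters.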
Spark is notorious for being difficult to configure and tune. As a result, we had to carry out several optimisation steps to meet the scale requirements within our time and financial constraints. In this talk, I will present our Spark setup and focus on the journey of optimising the Spark tagging solution. I will also talk briefly about the underlying Deep Learning model that was used to predict the image tags.
Outline/Structure of the Talk
- Business Motivation to solve the problem
- Deep Learning model
- Optimisation to the Deep Learning model
- Introducing the Spark setup (Terraform + AWS EMR + Spark DataFrames + pandas_udf + S3)
- First iteration of batch prediction
- Second iteration: Reducing the number of Spark operations
- Third iteration: Choosing better Spark parameters to suit our requirements
- Fourth iteration: Parallelisation of operations
- Cost and time involved in the final solution
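The outline does not say which operations were parallelised in the fourth iteration. As one hypothetical illustration, assuming the step is I/O-bound (for example, fetching image bytes before inference), a thread pool can overlap the waiting time. `fetch_image` below is a stand-in that only sleeps; it is not part of the original setup.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_image(key: str) -> bytes:
    """Hypothetical stand-in for an I/O-bound download (e.g. from object storage)."""
    time.sleep(0.01)  # simulate network latency
    return key.encode("utf-8")


def fetch_all(keys, max_workers: int = 8):
    """Fetch all keys concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_image, keys))


if __name__ == "__main__":
    keys = [f"img-{i}.jpg" for i in range(40)]

    start = time.perf_counter()
    sequential = [fetch_image(k) for k in keys]
    t_seq = time.perf_counter() - start

    start = time.perf_counter()
    threaded = fetch_all(keys)
    t_par = time.perf_counter() - start

    assert threaded == sequential  # same results, same order
    print(f"sequential {t_seq:.2f}s, threaded {t_par:.2f}s")
```

Threads suit this sketch because the simulated work is waiting, not computing; a CPU-bound step would call for a different approach.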
Learning Outcome
- How to choose the right parameters for working with Spark
- How to use vectorized pandas_udf functions to handle batch inference with Deep Learning models
- How to reduce the number of Spark operations
- How to parallelise operations to achieve a speedup
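As a minimal sketch of the pandas_udf idea listed above, the following separates the batching logic from the Spark wrapper. It assumes the iterator-style pandas UDF API of Spark 3.0+, which may differ from the version used in the talk. The names `tag_batch`, `make_tagging_udf`, `load_model`, and the shape of the model callable are all hypothetical.

```python
from typing import Callable, Iterator, List, Sequence

# Hypothetical model interface: takes a list of image paths, returns one tag per path.
Model = Callable[[List[str]], List[str]]


def tag_batch(paths: Sequence[str], model: Model, mini_batch_size: int = 64) -> List[str]:
    """Run `model` over `paths` in fixed-size mini-batches, collecting one tag per path."""
    tags: List[str] = []
    for start in range(0, len(paths), mini_batch_size):
        chunk = list(paths[start:start + mini_batch_size])
        tags.extend(model(chunk))
    return tags


def make_tagging_udf(load_model: Callable[[], Model]):
    """Wrap tag_batch in an iterator-style pandas UDF (assumes Spark 3.0+)."""
    # Imported lazily so tag_batch stays usable without a Spark installation.
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("string")
    def tag_images(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
        model = load_model()  # loaded once per iterator of batches, not per batch
        for paths in batches:
            yield pd.Series(tag_batch(paths.tolist(), model))

    return tag_images
```

With a DataFrame `df` that has a hypothetical `image_path` column, this could be applied as `df.withColumn("tag", make_tagging_udf(load_model)("image_path"))`. The number of rows Spark hands to each pandas batch is capped by `spark.sql.execution.arrow.maxRecordsPerBatch` (default 10,000), which is one of the parameters worth tuning for inference workloads.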
Target Audience
Machine learning practitioners who are interested in learning how to do batch predictions effectively at scale.
Prerequisites for Attendees
- Knowledge of machine learning basics
- Some experience of deploying Machine Learning models to production
- Basics of Convolutional Neural Networks (nice to have)
- Some idea of working with Apache Spark (nice to have)
Submitted 1 week ago
People who liked this proposal, also liked:
Dat Tran - Image ATM - Image Classification for Everyone
Dat Tran, Head of Data Science, idealo.de
At idealo.de we store and display millions of images. Our gallery contains pictures of all sorts: you'll find vacuum cleaners, bike helmets, and hotel rooms there. Working with a huge volume of images brings some challenges: How do we organize the galleries? What exactly is in there? Do we actually need all of it?
To tackle these problems, you first need to label all the pictures. In 2018 our Data Science team completed four projects in the area of image classification, and in 2019 there were many more to come. Therefore, we decided to automate this process by creating a tool we called Image ATM (Automated Tagging Machine). With the help of transfer learning, Image ATM enables the user to train a Deep Learning model without knowledge or experience in the area of Machine Learning. All you need is data and a spare couple of minutes!
In this talk we will discuss the state-of-the-art technologies available for image classification and present Image ATM in the context of these technologies. We will then give a crash course on our product, guiding you through different ways of using it: in the shell, in a Jupyter Notebook, and in the cloud. We will also talk about our roadmap for Image ATM.