Predicting at scale is a challenging pursuit, especially when working with Deep Learning models. This is because Deep Learning models tend to have high inference time. At, Germany's biggest price comparison platform, the Data Science team was tasked with carrying out image tagging to improve our product galleries.

One of the biggest challenges we faced was to generate predictions for more than 300 million images within a short time while keeping the costs low. Moreover, a resolution for the scaling problem became critical since we intended to apply other Deep Learning models on the same big dataset. We ended up formulating a batch-prediction solution by employing an Apache Spark setup that ran on an AWS EMR cluster.

Spark is notorious for being difficult to configure and tune. As a result, we had to carry on several optimisation steps in order to meet the scale requirements that adhered to our time and financial constraints. In this talk, I would present our Spark setup and focus on the journey of optimising the Spark tagging solution. Additionally, I would also talk briefly about the underlying deep learning model which was used to predict the image tags.

3 favorite thumb_down thumb_up 0 comments visibility_off  Remove from Watchlist visibility  Add to Watchlist

Outline/Structure of the Talk

  • Agenda
  • Business Motivation to solve the problem
  • Deep Learning model
  • Optimisation to the Deep Learning model
  • Introducing Spark setup (Terraform + AWS EMR + spark Dataframes + pandas_udf + s3)
  • First iteration of batch prediction
  • Second iteration: Reducing number of spark operations
  • Third iteration: Choosing better Spark parameters to suit our requirements
  • Fourth iteration: Parallelisation of operations
  • Cost and Time involved for the final solution
  • Summary

Learning Outcome

  1. How to choose right parameters for working with spark
  2. How to use pandas_udf vectorizer to handle batch inference on Deep Learning models
  3. How to reduce spark operations
  4. How to parallelise operations which leads to a speedup

Target Audience

Machine learning practitioners who are interested in learning how to do batch predictions effectively at scale.

Prerequisites for Attendees

  • Knowledge of machine learning basics
  • Some experience of floating Machine Learning models to production
  • Basics of Convolutional Neural Networks (nice to have)
  • Some idea of working with Apache Spark (nice to have)
schedule Submitted 2 months ago

Public Feedback

comment Suggest improvements to the Speaker

  • Liked Dr. Vikas Agrawal

    Dr. Vikas Agrawal - Non-Stationary Time Series: Finding Relationships Between Changing Processes for Enterprise Prescriptive Systems

    45 Mins

    It is too tedious to keep on asking questions, seek explanations or set thresholds for trends or anomalies. Why not find problems before they happen, find explanations for the glitches and suggest shortest paths to fixing them? Businesses are always changing along with their competitive environment and processes. No static model can handle that. Using dynamic models that find time-delayed interactions between multiple time series, we need to make proactive forecasts of anomalous trends of risks and opportunities in operations, sales, revenue and personnel, based on multiple factors influencing each other over time. We need to know how to set what is “normal” and determine when the business processes from six months ago do not apply any more, or only applies to 35% of the cases today, while explaining the causes of risk and sources of opportunity, their relative directions and magnitude, in the context of the decision-making and transactional applications, using state-of-the-art techniques.

    Real world processes and businesses keeps changing, with one moving part changing another over time. Can we capture these changing relationships? Can we use multiple variables to find risks on key interesting ones? We will take a fun journey culminating in the most recent developments in the field. What methods work well and which break? What can we use in practice?

    For instance, we can show a CEO that they would miss their revenue target by over 6% for the quarter, and tell us why i.e. in what ways has their business changed over the last year. Then we provide the prioritized ordered lists of quickest, cheapest and least risky paths to help turn them over the tide, with estimates of relative costs and expected probability of success.

  • Liked Juan Manuel Contreras

    Juan Manuel Contreras - Beyond Individual Contribution: How to Lead Data Science Teams

    Juan Manuel Contreras
    Juan Manuel Contreras
    Head of Data Science
    schedule 2 months ago
    Sold Out!
    45 Mins

    Despite the increasing number of data scientists who are being asked to take on managerial and leadership roles as they grow in their careers, there are still few resources on how to manage data scientists and lead data science teams. There is also scant practical advice on how to serve as head of a data science practice: how to set a vision and craft a strategy for an organization to use data science.

    In this talk, I will describe my experience as a data science leader both at a political party (the Democratic Party of the United States of America) and at a fintech startup (, share lessons learned from these experiences and conversations with other data science leaders, and offer a framework for how new data science leaders can better transition to both managing data scientists and heading a data science practice.

  • Liked Badri Narayanan Gopalakrishnan

    Badri Narayanan Gopalakrishnan / Shalini Sinha / Usha Rengaraju - Lifting Up: Deep Learning for implementing anti-hunger and anti-poverty programs

    45 Mins
    Case Study

    Ending poverty and zero hunger are top two goals United Nations aims to achieve by 2030 under its sustainable development program. Hunger and poverty are byproducts of multiple factors and fighting them require multi-fold effort from all stakeholders. Artificial Intelligence and Machine learning has transformed the way we live, work and interact. However economics of business has limited its application to few segments of the society. A much conscious effort is needed to bring the power of AI to the benefits of the ones who actually need it the most – people below the poverty line. Here we present our thoughts on how deep learning and big data analytics can be combined to enable effective implementation of anti-poverty programs. The advancements in deep learning , micro diagnostics combined with effective technology policy is the right recipe for a progressive growth of a nation. Deep learning can help identify poverty zones across the globe based on night time images where the level of light correlates to higher economic growth. Once the areas of lower economic growth are identified, geographic and demographic data can be combined to establish micro level diagnostics of these underdeveloped area. The insights from the data can help plan an effective intervention program. Machine Learning can be further used to identify potential donors, investors and contributors across the globe based on their skill-set, interest, history, ethnicity, purchasing power and their native connect to the location of the proposed program. Adequate resource allocation and efficient design of the program will also not guarantee success of a program unless the project execution is supervised at grass-root level. Data Analytics can be used to monitor project progress, effectiveness and detect anomaly in case of any fraud or mismanagement of funds.

  • Liked Dat Tran

    Dat Tran - Image ATM - Image Classification for Everyone

    Dat Tran
    Dat Tran
    Head of AI
    Axel Springer AI
    schedule 4 months ago
    Sold Out!
    45 Mins

    At we store and display millions of images. Our gallery contains pictures of all sorts. You’ll find there vacuum cleaners, bike helmets as well as hotel rooms. Working with huge volume of images brings some challenges: How to organize the galleries? What exactly is in there? Do we actually need all of it?

    To tackle these problems you first need to label all the pictures. In 2018 our Data Science team completed four projects in the area of image classification. In 2019 there were many more to come. Therefore, we decided to automate this process by creating a software we called Image ATM (Automated Tagging Machine). With the help of transfer learning, Image ATM enables the user to train a Deep Learning model without knowledge or experience in the area of Machine Learning. All you need is data and spare couple of minutes!

    In this talk we will discuss the state-of-art technologies available for image classification and present Image ATM in the context of these technologies. We will then give a crash course of our product where we will guide you through different ways of using it - in shell, on Jupyter Notebook and on the Cloud. We will also talk about our roadmap for Image ATM.

  • Liked Gaurav Shekhar

    Gaurav Shekhar - AIOps - Prediction of Critical Events

    45 Mins
    Case Study

    With the rise of cloud, distributed architectures, containers, and microservices, a rise in data overload is visible. With growing amounts of DevOps processes; alerts, repeated mundane jobs etc. have put new demands to both synthesize meaning from this influx of information and connect it to broader business objectives.

    AIOps is the application of artificial intelligence for IT operations. AIOps uses machine learning and data science to give IT operations teams a real-time understanding of any issues affecting the availability or performance of the systems under their care. Rather than reacting to issues as they arise in the application environment, AIOps platforms allow IT operations teams to proactively manage performance challenges faster, and in real-time

    This case study focuses on solving the following business needs:

    1. With an ever-increasing rise in alerts, a large number of incidents were getting generated. There was a need to develop a framework that can generate correlations and identify correlated events, thereby reduce overall incidents volume.

    2. For many incidents a reactive strategy does not work and can lead to a loss of reputation; there was a need to develop predictive capabilities that can detect anomalous events and predict critical events well in advance.

    3. Given the pressures of reducing the Resolution time and short window of opportunity available to the analysts, there was a need to provide search capabilities so that the analysts can have a head start as to how similar incidents were solved in past.

    Data from multiple systems sending alerts, including traditional IT monitoring, log events in text format, application and network performance data etc were made available for the PoC.

    The solution framework developed had a discovery phase where the base data was visualized and explored, a NLP driven text mining layer where log data in text format was pre-processed, clustered and correlations were developed to identify related events using Machine Learning algorithms. Topic Mining was used to get a quick overview of a large number of event data. Next, a temporal mining layer explored the temporal relationship between nodes and cluster groups, necessary features were developed on top of the associations generated from temporal layers. Advanced Machine learning algorithms were then developed on these features to predict critical events almost 12 hours in advance. Last but not the least a search layer that computed the similarity of any incident with those in Service Now database was developed that provided analysts insights readily available information on similar incidents and how they were solved in past so that the analysts do not have to reinvent the wheel.

  • Liked Shalini Sinha

    Shalini Sinha / Ashok J / Yogesh Padmanaban - Hybrid Classification Model with Topic Modelling and LSTM Text Classifier to identify key drivers behind Incident Volume

    45 Mins
    Case Study

    Incident volume reduction is one of the top priorities for any large-scale service organization along with timely resolution of incidents within the specified SLA parameters. AI and Machine learning solutions can help IT service desk manage the Incident influx as well as resolution cost by

    • Identifying major topics from incident description and planning resource allocation and skill-sets accordingly
    • Producing knowledge articles and resolution summary of similar incidents raised earlier
    • Analyzing Root Causes of incidents and introducing processes and automation framework to predict and resolve them proactively

    We will look at different approaches to combine standard document clustering algorithms such as Latent Dirichlet Allocation (LDA) and K-mean clustering on doc2vec along-with Text classification to produce easily interpret-able document clusters with semantically coherent/ text representation that helped IT operations of a large FMCG client identify key drivers/topics contributing towards incident volume and take necessary action on it.

  • Liked Pankaj Kumar

    Pankaj Kumar / Abinash Panda / Usha Rengaraju - Quantitative Finance :Global macro trading strategy using Probabilistic Graphical Models

    90 Mins

    { This is a handson workshop in pgmpy package. The creator of pgmpy package Abinash Panda will do the code demonstration }

    Crude oil plays an important role in the macroeconomic stability and it heavily influences the performance of the global financial markets. Unexpected fluctuations in the real price of crude oil are detrimental to the welfare of both oil-importing and oil-exporting economies.Global macro hedge-funds view forecast the price of oil as one of the key variables in generating macroeconomic projections and it also plays an important role for policy makers in predicting recessions.

    Probabilistic Graphical Models can help in improving the accuracy of existing quantitative models for crude oil price prediction as it takes in to account many different macroeconomic and geopolitical variables .

    Hidden Markov Models are used to detect underlying regimes of the time-series data by discretising the continuous time-series data. In this workshop we use Baum-Welch algorithm for learning the HMMs, and Viterbi Algorithm to find the sequence of hidden states (i.e. the regimes) given the observed states (i.e. monthly differences) of the time-series.

    Belief Networks are used to analyse the probability of a regime in the Crude Oil given the evidence as a set of different regimes in the macroeconomic factors . Greedy Hill Climbing algorithm is used to learn the Belief Network, and the parameters are then learned using Bayesian Estimation using a K2 prior. Inference is then performed on the Belief Networks to obtain a forecast of the crude oil markets, and the forecast is tested on real data.