Low Latency Polyglot Model Scoring using Apache Apex

Sydney | Sep 19th 02:55 - 03:25 PM | Grand Lodge | 73 Interested

Data science is fast becoming a complementary approach and process for solving business challenges. The explosion of frameworks that help data scientists build models bears testimony to this. However, when a model needs to be turned into a production version in very low latency, enterprise grade environments, there are very few choices, each with its own strengths and weaknesses. Adding to this is the current disconnect between the data scientist's world, which is all about modelling, and the engineer's world, which is about SLAs and service guarantees. A framework like Apache Apex can complement each of these roles and provide constructs for both worlds, which would help enterprises drastically cut the cost of deploying models to production environments.

The talk presents Apache Apex as a framework that enables engineers and data scientists to build low latency, enterprise grade applications. We first cover the foundations of Apex that contribute to the platform's low latency processing capabilities. We then discuss the aspects that qualify it as an enterprise grade platform. Finally, we cover the main theme of the talk: how models developed in Java, R and Python can co-exist in the same scoring application, enabling a truly polyglot framework.
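To make the idea of co-existing models concrete, here is a minimal sketch in plain Java. It does not use the Apex operator API, and every name in it (`ModelScorer`, `JavaLogisticScorer`, `ForeignScorer`, `ScoringStage`) is hypothetical. The R and Python models are simulated by simple functions, on the assumption that a real adapter would call into the foreign runtime at that point.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Hypothetical contract: a model scores one event, whatever language it was built in. */
interface ModelScorer {
    double score(Map<String, Double> features);
}

/** A model written directly in Java: a hand-coded logistic regression. */
final class JavaLogisticScorer implements ModelScorer {
    private final Map<String, Double> weights;
    private final double bias;

    JavaLogisticScorer(Map<String, Double> weights, double bias) {
        this.weights = weights;
        this.bias = bias;
    }

    @Override
    public double score(Map<String, Double> features) {
        double z = bias;
        for (Map.Entry<String, Double> w : weights.entrySet()) {
            // Missing features contribute nothing to the linear term.
            z += w.getValue() * features.getOrDefault(w.getKey(), 0.0);
        }
        return 1.0 / (1.0 + Math.exp(-z)); // sigmoid
    }
}

/**
 * Stand-in for a model authored in R or Python. The bridge function is an
 * assumption: a real adapter would invoke the foreign runtime here.
 */
final class ForeignScorer implements ModelScorer {
    private final String language;
    private final ToDoubleFunction<Map<String, Double>> bridge;

    ForeignScorer(String language, ToDoubleFunction<Map<String, Double>> bridge) {
        this.language = language;
        this.bridge = bridge;
    }

    String language() {
        return language;
    }

    @Override
    public double score(Map<String, Double> features) {
        return bridge.applyAsDouble(features);
    }
}

/** Scores each incoming event with every registered model, in registration order. */
final class ScoringStage {
    private final Map<String, ModelScorer> scorers = new LinkedHashMap<>();

    void register(String name, ModelScorer scorer) {
        scorers.put(name, scorer);
    }

    Map<String, Double> scoreAll(Map<String, Double> features) {
        Map<String, Double> scores = new LinkedHashMap<>();
        scorers.forEach((name, scorer) -> scores.put(name, scorer.score(features)));
        return scores;
    }
}

class PolyglotScoringDemo {
    public static void main(String[] args) {
        ScoringStage stage = new ScoringStage();
        stage.register("java-logistic", new JavaLogisticScorer(Map.of("amount", 0.002), -1.0));
        // Simulated foreign models; the rules below are placeholders, not real R or Python calls.
        stage.register("r-model", new ForeignScorer("R", f -> f.getOrDefault("amount", 0.0) > 1000 ? 0.9 : 0.1));
        stage.register("python-model", new ForeignScorer("Python", f -> 0.5));

        System.out.println(stage.scoreAll(Map.of("amount", 1500.0)));
    }
}
```

The point of the single `ModelScorer` interface is that the surrounding pipeline does not need to know which language produced a model; only the adapter does.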


Outline/Structure of the Talk

The session is logically divided into three sections:

  1. A general overview of the Apache Apex platform and the features that make it a low latency processing framework
  2. Features of the framework that make it enterprise ready
  3. Features of the framework that can accommodate models developed in R, Java and Python to enable a true polyglot platform for low latency scoring

Learning Outcome

The learning outcomes are:

  1. Basic understanding of the Apache Apex platform as a low latency processing framework
  2. Alternative approaches to building low latency machine learning model scoring applications with true enterprise grade capabilities

Target Audience

Software Engineers, Data Scientists, and Architects

Prerequisites for Attendees

  • A general idea of true streaming vs. micro-batch vs. batch processing models
  • The typical process involved today in turning a model into a production version
Submitted 2 years ago

Public Feedback

  • Josh Graham
    By Josh Graham  ~  2 years ago

    In terms of the business benefit of reducing "the cost of model deployment to production environments" and having data scientists work better with engineering / devops teams, how would you distinguish this from Apache Amaterasu (which we also have a proposed talk on)?

    • Ananth Gundabattula
      By Ananth Gundabattula  ~  2 years ago

      Thanks for reviewing the proposal, Josh.

      Apache Amaterasu aims to ease the pain points in building and running data pipelines. The framework broadly aims to ensure proper execution by coordinating resource management and the pipeline definitions.

      Apache Apex, on the other hand, is all about low latency execution. Consider the use case of a fraud monitoring application that is responsible for declining a fraudulent transaction while it is being processed. Such a use case requires a very low latency execution platform as its foundation, so that the fraud monitoring application can be built using the constructs provided by the platform. The first part of the presentation covers the constructs that Apache Apex provides for this.

      This use case involves more than a low latency execution platform. With machine learning frameworks becoming more prevalent, such a fraud processing application would score each incoming transaction as fraud or not fraud. A data scientist might have built the model using R. However, R is rarely used as a platform for low latency processing. Hence the more sensible approach is to invoke the R scoring function from within the low latency platform. The next section of the talk describes how to integrate scoring components, such as R code developed by a data scientist, with the low latency components developed by the software engineer. By allowing direct integration of R scoring components with the feature engineering steps, a project can save time by avoiding the time-consuming process of rewriting an R function in the target platform's programming language, such as Scala or Java.

      Once these two aspects are addressed using the foundations of the platform, a project needs certain enterprise constructs such as security, resource management and checkpointing. The third part of the talk describes these aspects.

      Thus Apache Amaterasu differs significantly from Apache Apex in terms of feature set and even goals.