May 14th, 10:45 - 11:15 AM | Wesley Theatre | 212 Interested

Machine Learning is often discussed in the context of data science, but little attention is given to the complexities of engineering production-ready ML systems. This talk will explore some of the most important of these challenges and offer advice on how to solve them.

Outline/Structure of the Talk

I will talk through the following topics, describing the issues you will come across and then discussing solutions to these problems.

  • Data problems
    • Dealing with lagged or late-arriving data, incorrect data, and broken pipelines, and how to adjust your pipeline semantics and modelling strategy to handle these issues, e.g. graceful degradation of the model in the face of missing data (see the first sketch below)
  • Deployment problems
    • The challenges of going from Kubernetes-style deployments, where you move the data to the models, to big data deployments, where you move the model to the data. Network issues, scoring efficiency issues and DevOps concerns will be discussed in this section (see the scoring sketch below)
  • Metrics problems
    • What information is useful to capture and how you can use it to improve your system and work out when things have gone wrong, for example comparing the distribution of training labels against the labels the model predicts in production (see the drift-check sketch below)
  • Big Data Iteration Speed problems
    • Ways to turn big data problems into small data problems to improve time to market, e.g. sampling techniques (see the sampling sketch below)
    • Discussion of how to spot a bias vs. variance problem and how to prioritise your time as a result (see the learning-curve sketch below)
    • Brief discussion on choice of frameworks

The examples will be rooted in a common example use case of building a churn prediction model. The rough Python sketches below give a flavour of the kinds of solutions that will be discussed; they are illustrative only, not the exact code from the talk.
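
For the missing-data point, a minimal sketch of one possible approach, assuming a scikit-learn-style churn model plus a simpler fallback model trained only on the feature columns that always arrive on time (all column names are made up for illustration):

    import pandas as pd

    # Assumed feature groups for the churn model (names are illustrative only).
    ALWAYS_AVAILABLE = ["tenure_months", "monthly_spend"]
    LATE_ARRIVING = ["support_tickets_30d", "logins_7d"]

    def score_with_degradation(batch: pd.DataFrame, full_model, fallback_model):
        """Score a batch of customers; fall back to a simpler model for rows whose
        late-arriving features never turned up, instead of failing the whole run."""
        has_full = batch[LATE_ARRIVING].notna().all(axis=1)
        scores = pd.Series(index=batch.index, dtype=float)
        if has_full.any():
            scores[has_full] = full_model.predict_proba(
                batch.loc[has_full, ALWAYS_AVAILABLE + LATE_ARRIVING])[:, 1]
        if (~has_full).any():
            # Degraded path: a model trained only on the always-available columns.
            scores[~has_full] = fallback_model.predict_proba(
                batch.loc[~has_full, ALWAYS_AVAILABLE])[:, 1]
        # Returning the mask lets the pipeline log how often it had to degrade.
        return scores, has_full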
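
For the deployment point, a sketch of one way to move the model to the data rather than the data to the model: broadcast a picklable, already-trained model into the Spark executors and score with a pandas UDF (the saved model file, paths and column names are assumptions):

    import joblib
    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.appName("churn-batch-scoring").getOrCreate()

    # Assumed: the training job saved a picklable estimator exposing predict_proba.
    model = joblib.load("churn_model.pkl")
    bc_model = spark.sparkContext.broadcast(model)

    @pandas_udf(DoubleType())
    def churn_score(tenure_months: pd.Series, monthly_spend: pd.Series) -> pd.Series:
        # Runs on the executors, next to the data, instead of shipping rows to a service.
        features = pd.concat([tenure_months, monthly_spend], axis=1)
        return pd.Series(bc_model.value.predict_proba(features)[:, 1])

    scored = (spark.read.parquet("s3://bucket/customers/")   # assumed input location
                   .withColumn("churn_score",
                               churn_score("tenure_months", "monthly_spend")))
    scored.write.mode("overwrite").parquet("s3://bucket/churn_scores/")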
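
For the metrics point, a sketch of comparing the score distribution seen at training time with what the production model is actually emitting, using the population stability index (the file names and the 0.2 alert threshold are illustrative assumptions, not prescriptions from the talk):

    import numpy as np

    def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
        """Population Stability Index between a reference and a live score distribution."""
        cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
        cuts[0], cuts[-1] = -np.inf, np.inf           # make the outer bins open-ended
        e_frac = np.histogram(expected, cuts)[0] / len(expected)
        a_frac = np.histogram(actual, cuts)[0] / len(actual)
        e_frac = np.clip(e_frac, 1e-6, None)          # avoid log of / division by zero
        a_frac = np.clip(a_frac, 1e-6, None)
        return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

    # Assumed: scores on the training set and scores logged from production.
    train_scores = np.load("train_predictions.npy")
    prod_scores = np.load("prod_predictions.npy")

    drift = psi(train_scores, prod_scores)
    if drift > 0.2:                                   # common rule-of-thumb threshold
        print(f"WARNING: prediction distribution has drifted (PSI={drift:.3f})")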
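
For the iteration-speed point, a sketch of one sampling approach: keep every churner but only a small fraction of non-churners so the problem fits in memory for quick experiments (fractions and paths are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("churn-sampling").getOrCreate()

    events = spark.read.parquet("s3://bucket/churn_training/")   # assumed input path

    # Keep all positives, 1% of negatives; fixed seed for reproducible experiments.
    sample = events.sampleBy("churned", fractions={1: 1.0, 0: 0.01}, seed=42)

    # Small enough to iterate on locally with pandas / scikit-learn.
    local_df = sample.toPandas()
    local_df.to_csv("churn_sample.csv", index=False)

    # NOTE: if the model outputs probabilities, they must be re-calibrated to undo
    # the class-prior shift introduced by downsampling before being used downstream.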
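
Finally, for the bias-vs-variance point, a sketch of using learning curves on the small sampled dataset to decide where time is best spent (the file name, model choice and scoring metric are assumptions):

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import learning_curve

    # Assumed: a small stratified sample like the one produced in the sampling sketch.
    df = pd.read_csv("churn_sample.csv")
    X, y = df.drop(columns=["churned"]), df["churned"]

    sizes, train_scores, val_scores = learning_curve(
        GradientBoostingClassifier(), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=3, scoring="roc_auc")

    for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
        print(f"n={n:>7}  train AUC={tr:.3f}  validation AUC={va:.3f}  gap={tr - va:.3f}")

    # Rule of thumb: a large, persistent train/validation gap points at a variance
    # problem (more data or regularisation helps); both curves low and flat point
    # at a bias problem (better features or a richer model first).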

Learning Outcome

Most attendees should walk out with a stronger appreciation of the complexities of building ML systems. Data engineers in particular will become more aware of what they need to consider when building these systems, and will leave with a toolbox of possible solutions.

Data scientists will understand more about how ML systems function after they hand off their models, and will hopefully learn a thing or two about how to improve the way they work, both on their modelling and with engineering teams.

Target Audience

Data Engineers and Data Scientists

Prerequisites for Attendees

N/A

Submitted 1 year ago


  • Liked Aidan O'Brien

    Aidan O'Brien - DevOps 2.0: Evidence-based evolution of serverless architecture through automatic evaluation of “infrastructure as code” deployments

    Aidan O'Brien, PhD student, CSIRO
    30 Mins / Talk / Intermediate

    The scientific approach teaches us to formulate hypotheses and test them experimentally in order to advance systematically. DevOps, and software architecture in particular, do not traditionally follow this approach. Here, decisions like “scaling up to more machines or simply employing a batch queue” or “using Apache Spark or sticking to a job scheduler across multiple machines” are worked out theoretically rather than implemented and tested objectively. Furthermore, the paucity of knowledge in unestablished systems like serverless cloud architecture hampers the theoretical approach.

    We therefore partnered with James Lewis and Kief Morris to establish a fundamentally different approach for serverless architecture design that is based on scientific principles. For this, the serverless architecture stack needs to firstly be fully defined through code/text, e.g. AWS CloudFormation, so that it can easily and consistently be deployed. This “architecture as text”-base can then be modified and re-deployed to systematically test hypotheses, e.g. is an algorithm faster or a particular autoscaling group more efficient. The second key element to this novel way of evolving architecture is the automatic evaluation of any newly deployed architecture without manually recording runtime or defining interactions between services, e.g. Epsagon’s monitoring solution.

    Here we describe the two key aspects in detail and showcase the benefits by describing how we improved runtime by 80% for the bioinformatics software framework GT-Scan, which is used by Australia’s premier research organization to conduct medical research.

  • Liked Dana Bradford

    Dana Bradford - How to Save a Life: Could Real-Time Sensor Data Have Saved Mrs Elle?

    Dana Bradford, Sr. Research Scientist, CSIRO
    30 Mins / Case Study / Intermediate

    This is the story of Mrs Elle*, a participant in a smart home pilot study. The pilot study aimed to test the efficacy of sensors to capture in-home activity data, including meal preparation, attention to hygiene and movement around the house. The in-home monitoring and response service associated with the sensors had not been implemented, and as such, data was not analyzed in real time. Sadly, Mrs Elle suffered a massive stroke one night, and was found some time after. She later died in hospital without regaining consciousness. This paper looks at the data leading up to Mrs Elle's stroke, to see if there were any clues that a neurological insult was imminent. We were most interested to know: had we been monitoring in real time, could the sensors have told us how to save a life?

    *pseudonym

  • Liked Elaina Hyde

    Elaina Hyde - What happens when Galactic Evolution and Data Science collide?

    Elaina Hyde, Consultant, Servian
    30 Mins / Case Study / Intermediate

    This talk will cover a short trip around our Milky Way Galaxy and a discussion of how data science can be used to detect faint and sparse objects such as the dwarf satellites and streams that helped form the galaxy we live in. The data science applications and algorithms used determine the accuracy with which we can detect these mysterious bodies, and with the advent of greater cloud computing capability, the sky is no longer the limit when it comes to programming or astronomy.

  • Liked Holden Karau

    Holden Karau - Testing & Validating Big Data Pipelines with examples in Apache Spark & Apache BEAM

    Holden Karau, Software Engineer, Apple
    30 Mins / Talk / Intermediate

    As distributed data parallel systems, like Spark, are used for more mission-critical tasks, it is important to have effective tools for testing and validation. This talk explores the general considerations and challenges of testing systems like Spark & BEAM through spark-testing-base and other related libraries. Testing isn't enough, though: the real world will always find a way to throw a new wrench in the pipeline, and in those cases the best we can hope for is figuring out that something has gone terribly wrong and stopping the deployment of a new model before we get woken up with a 2am page asking why we are recommending sensitive products to the wrong audience*.

    With over 40% of folks automatically deploying the results of their Spark jobs to production, testing is especially important. Many of the tools for working with big data systems (like notebooks) are great for exploratory work, and can give a false sense of security (as well as additional excuses not to test). This talk explores why testing these systems is hard, special considerations for simulating "bad" partitioning, figuring out when your stream tests are stopped, and solutions to these challenges.

    *Any resemblance to real pager alert the presenter may have received in the past is coincidence.
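
    A rough illustration of the kind of check this abstract describes, written here with plain PySpark and pytest rather than the spark-testing-base API itself, against a made-up deduplication step:

        # Illustrative only: a local-mode PySpark test of a hypothetical pipeline step.
        import pytest
        from pyspark.sql import SparkSession, Window, functions as F

        @pytest.fixture(scope="module")
        def spark():
            return (SparkSession.builder.master("local[2]")
                    .appName("pipeline-tests").getOrCreate())

        def dedupe_latest(df):
            """Pipeline step under test: keep only the most recent row per user."""
            w = Window.partitionBy("user_id").orderBy(F.col("event_ts").desc())
            return (df.withColumn("rn", F.row_number().over(w))
                      .filter("rn = 1").drop("rn"))

        def test_dedupe_latest_keeps_newest_row(spark):
            df = spark.createDataFrame(
                [("u1", 1, "a"), ("u1", 2, "b"), ("u2", 1, "c")],
                ["user_id", "event_ts", "payload"])
            out = {r["user_id"]: r["payload"] for r in dedupe_latest(df).collect()}
            assert out == {"u1": "b", "u2": "c"}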

  • Liked Graham Polley

    Graham Polley - Look Ma, no servers! Building a petabyte scale data pipeline on Google Cloud with less than 100 lines of code.

    30 Mins / Demonstration / Intermediate

    In this talk/demo, I'll describe the Google Cloud Platform architecture that we've used on several client projects to help them easily ingest large amounts of their data into BigQuery for analysis.

    Its zero-ops and petabyte-scale features unburden the team from managing any infrastructure, and ultimately free them up to focus on more important things - like analysing, understanding, and actually drawing insights from the data.

    Forming a conga line of Cloud Storage, Cloud Functions, Cloud Dataflow (templates) and BigQuery in less than 100 lines of code, I'll show how to wire up each component of the data pipeline. Finally, if the Demo Gods are shining down on me that day, I'll even attempt a live demo (I usually regret saying that).
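
    A rough sketch of how one link in that conga line might be wired: a Python Cloud Function, fired when a file lands in Cloud Storage, that launches a pre-staged Dataflow template writing into BigQuery (the project, template path and template parameters below are made up, not the code from the demo):

        # Illustrative only: background Cloud Function triggered by a
        # google.storage.object.finalize event on a Cloud Storage bucket.
        from googleapiclient.discovery import build

        PROJECT = "my-project"   # assumed GCP project id

        def on_file_arrival(event, context):
            """Launch a Dataflow template for the file that just arrived."""
            input_file = f"gs://{event['bucket']}/{event['name']}"
            job_name = "ingest-" + event["name"].replace("/", "-").replace(".", "-").lower()

            dataflow = build("dataflow", "v1b3", cache_discovery=False)
            dataflow.projects().templates().launch(
                projectId=PROJECT,
                gcsPath="gs://my-templates/ingest-to-bigquery",   # assumed pre-staged template
                body={
                    "jobName": job_name,
                    "parameters": {                 # parameter names depend on the template
                        "inputFilePattern": input_file,
                        "outputTable": f"{PROJECT}:analytics.events",
                    },
                },
            ).execute()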