What happens when Galactic Evolution and Data Science collide?

May 14th, 11:15 - 11:45 AM | Wesley Theatre | 209 Interested

This talk takes a short trip around our Milky Way Galaxy and discusses how data science can be used to detect faint and sparse objects, such as the dwarf satellites and streams that helped form the galaxy we live in. The data science applications and algorithms we use determine the accuracy with which we can detect these mysterious bodies, and with the advent of greater cloud computing capability, the sky is no longer the limit when it comes to programming or astronomy.
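The abstract doesn't name a specific algorithm, so as a minimal, hypothetical sketch of the kind of detection it describes, here is one common approach: treating a dwarf satellite as a density-based overdensity in a sparse star catalogue and recovering it with DBSCAN. All positions, counts, and thresholds below are invented for illustration, not the speaker's actual pipeline.

```python
# Hypothetical sketch: recovering a faint stellar overdensity (a stand-in
# for a dwarf satellite) from a sparse star catalogue with DBSCAN.
# All positions, counts, and thresholds are invented for illustration.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)

# Simulated field: 5,000 foreground stars over a 10 x 10 degree patch of sky.
field = rng.uniform(0.0, 10.0, size=(5000, 2))
# A faint, compact "satellite": 80 stars clustered around (6.0, 4.0) degrees.
satellite = rng.normal(loc=(6.0, 4.0), scale=0.1, size=(80, 2))
stars = np.vstack([field, satellite])

# DBSCAN keeps points with at least min_samples neighbours within eps;
# in practice eps and min_samples must be tuned to the survey's depth.
labels = DBSCAN(eps=0.1, min_samples=10).fit_predict(stars)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"overdensities found: {n_clusters}")
for k in range(n_clusters):
    members = stars[labels == k]
    print(f"cluster {k}: {len(members)} stars near {members.mean(axis=0)}")
```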


Outline/Structure of the Case Study

Our Milky Way Galaxy

A problem of stars

The investigation

Machine Learning Solutions

Next steps

Learning Outcome

Understanding of Data Science and just a dash of Astrophysics

Target Audience

Machine Learning beginners, technical people, academics, and anyone interested in Data Science

Prerequisites for Attendees

No prerequisites necessary

Submitted 1 year ago

Public Feedback


  • Liked Cameron Joannidis

    Cameron Joannidis - Machine Learning Systems for Engineers

    30 Mins
    Talk
    Intermediate

    Machine Learning is often discussed in the context of data science, but little attention is given to the complexities of engineering production-ready ML systems. This talk will explore some of the important challenges and offer advice on solutions to these problems.

  • Liked Aidan O'Brien

    Aidan O'Brien - DevOps 2.0: Evidence-based evolution of serverless architecture through automatic evaluation of “infrastructure as code” deployments

    Aidan O'Brien, PhD student, CSIRO
    30 Mins
    Talk
    Intermediate

    The scientific approach teaches us to formulate hypotheses and test them experimentally in order to advance systematically. DevOps, and software architecture in particular, do not traditionally follow this approach. Here, decisions like “scaling up to more machines or simply employing a batch queue” or “using Apache Spark or sticking to a job scheduler across multiple machines” are worked out theoretically rather than implemented and tested objectively. Furthermore, the paucity of knowledge about less-established systems like serverless cloud architecture hampers the theoretical approach.

    We therefore partnered with James Lewis and Kief Morris to establish a fundamentally different approach to serverless architecture design, one based on scientific principles. For this, the serverless architecture stack first needs to be fully defined through code/text, e.g. AWS CloudFormation, so that it can be deployed easily and consistently. This “architecture as text” base can then be modified and re-deployed to systematically test hypotheses, e.g. whether an algorithm is faster or a particular autoscaling group more efficient. The second key element of this novel way of evolving architecture is the automatic evaluation of any newly deployed architecture, without manually recording runtimes or defining interactions between services, e.g. via Epsagon’s monitoring solution.

    Here we describe the two key aspects in detail and showcase the benefits by describing how we improved runtime by 80% for the bioinformatics software framework GT-Scan, which is used by Australia’s premier research organization to conduct medical research.
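    The abstract doesn't publish the team's tooling, but the deploy-measure-compare loop it describes can be sketched hypothetically with boto3. The stack name, template files, and workload stub below are all assumptions, not the authors' actual pipeline.

    ```python
    # Hypothetical sketch of the hypothesis-testing loop described above:
    # deploy a CloudFormation variant, time a workload against it, tear it down.
    # Stack name, template files, and measure_workload() are assumptions.
    import time
    import boto3

    cf = boto3.client("cloudformation")

    def deploy(stack_name, template_path):
        """Create the stack and block until it is ready."""
        with open(template_path) as f:
            cf.create_stack(StackName=stack_name, TemplateBody=f.read(),
                            Capabilities=["CAPABILITY_IAM"])
        cf.get_waiter("stack_create_complete").wait(StackName=stack_name)

    def tear_down(stack_name):
        cf.delete_stack(StackName=stack_name)
        cf.get_waiter("stack_delete_complete").wait(StackName=stack_name)

    def measure_workload():
        """Placeholder: run the real benchmark against the deployed stack."""
        start = time.monotonic()
        # ... invoke the serverless pipeline under test here ...
        return time.monotonic() - start

    # Each template variant encodes one hypothesis (e.g. a different
    # autoscaling group); the measured runtimes can then be compared
    # objectively rather than argued about theoretically.
    for variant in ["baseline.yaml", "bigger-autoscaling-group.yaml"]:
        deploy("experiment-stack", variant)
        print(variant, measure_workload(), "seconds")
        tear_down("experiment-stack")
    ```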

  • Liked Dana Bradford

    Dana Bradford - How to Save a Life: Could Real-Time Sensor Data Have Saved Mrs Elle?

    Dana Bradford, Sr. Research Scientist, CSIRO
    30 Mins
    Case Study
    Intermediate

    This is the story of Mrs Elle*, a participant in a smart home pilot study. The pilot study aimed to test the efficacy of sensors in capturing in-home activity data, including meal preparation, attention to hygiene and movement around the house. The in-home monitoring and response service associated with the sensors had not been implemented, and as such, the data was not analyzed in real time. Sadly, Mrs Elle suffered a massive stroke one night and was found some time later. She died in hospital without regaining consciousness. This paper looks at the data leading up to Mrs Elle’s stroke to see if there were any clues that a neurological insult was imminent. We were most interested to know: had we been monitoring in real time, could the sensors have told us how to save a life?

    *pseudonym

  • Liked Holden Karau

    Holden Karau - Testing & Validating Big Data Pipelines with examples in Apache Spark & Apache BEAM

    Holden Karau, Software Engineer, Apple
    30 Mins
    Talk
    Intermediate

    As distributed data-parallel systems like Spark are used for more mission-critical tasks, it is important to have effective tools for testing and validation. This talk explores the general considerations and challenges of testing systems like Spark and BEAM through spark-testing-base and other related libraries. Testing isn't enough, though: the real world will always find a way to throw a new wrench into the pipeline, and in those cases the best we can hope for is figuring out that something has gone terribly wrong and stopping the deployment of a new model before we get woken up by a 2am page asking why we are recommending sensitive products to the wrong audience*.

    With over 40% of folks automatically deploying the results of their Spark jobs to production, testing is especially important. Many of the tools for working with big data systems (like notebooks) are great for exploratory work but can give a false sense of security (as well as additional excuses not to test). This talk explores why testing these systems is hard, special considerations for simulating "bad" partitioning, figuring out when your stream tests are stopped, and solutions to these challenges.

    *Any resemblance to real pager alert the presenter may have received in the past is coincidence.
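    As a minimal, hypothetical illustration of the kind of pipeline test discussed above, here is a pytest-style test that runs a Spark transformation against a local SparkSession (rather than spark-testing-base's helpers); filter_adults() is an invented example, not from the talk.

    ```python
    # Hypothetical sketch of a unit test for a Spark transformation,
    # runnable locally with pytest. filter_adults() is an invented example.
    import pytest
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    def filter_adults(df):
        """Toy transformation under test: keep rows with age >= 18."""
        return df.filter(F.col("age") >= 18)

    @pytest.fixture(scope="module")
    def spark():
        # local[2] runs with more than one partition, which can surface
        # bugs that only appear once data is actually distributed.
        session = (SparkSession.builder
                   .master("local[2]")
                   .appName("pipeline-tests")
                   .getOrCreate())
        yield session
        session.stop()

    def test_filter_adults(spark):
        df = spark.createDataFrame(
            [("alice", 34), ("bob", 17), ("carol", 18)],
            ["name", "age"])
        result = {row["name"] for row in filter_adults(df).collect()}
        assert result == {"alice", "carol"}
    ```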

  • Liked Graham Polley

    Graham Polley - Look Ma, no servers! Building a petabyte scale data pipeline on Google Cloud with less than 100 lines of code.

    30 Mins
    Demonstration
    Intermediate

    In this talk/demo, I'll describe the Google Cloud Platform architecture that we've used on several client projects to help them easily ingest large amounts of their data into BigQuery for analysis.

    Its zero-ops, petabyte-scale features unburden the team from managing any infrastructure and ultimately free them up to focus on more important things, like analysing, understanding, and actually drawing insights from the data.

    Forming a conga line of Cloud Storage, Cloud Functions, Cloud Dataflow (templates) and BigQuery in less than 100 lines of code, I'll show how to wire up each component of the data pipeline. Finally, if the Demo Gods are shining down on me that day, I'll even attempt a live demo (I usually regret saying that).
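    The talk's exact wiring isn't reproduced here, but one link in that conga line can be sketched hypothetically: a background Cloud Function that launches a Dataflow template whenever a file lands in Cloud Storage. The project id, template path, and parameter names below are assumptions, not the speaker's code.

    ```python
    # Hypothetical sketch: a background Cloud Function that fires on a
    # Cloud Storage object-finalize event and launches a Dataflow template
    # to load the new file into BigQuery. Project id, template path, and
    # parameter names are illustrative assumptions.
    from googleapiclient.discovery import build

    PROJECT = "my-project"                                 # assumed project id
    TEMPLATE = "gs://my-bucket/templates/ingest-template"  # assumed template

    def on_file_arrival(event, context):
        """Entry point for the GCS-triggered Cloud Function."""
        path = f"gs://{event['bucket']}/{event['name']}"
        dataflow = build("dataflow", "v1b3")
        dataflow.projects().templates().launch(
            projectId=PROJECT,
            gcsPath=TEMPLATE,
            body={
                "jobName": "ingest-" + event["name"].replace("/", "-"),
                # Parameter names depend on the template; "inputFile" is
                # an assumed parameter of the assumed custom template.
                "parameters": {"inputFile": path},
            },
        ).execute()
    ```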