YOW! Data 2018 Day 1

Mon, May 14
Timezone: Australia/Sydney (AEST)
08:00

    Registration for YOW! Data 2018 - 45 mins

08:45

    Session Overviews & Introductions - 15 mins

09:00

    Dean Wampler - Stream All the Things!!

    09:00 - 09:50 AM, Wesley Theatre, 208 Interested

    Streaming data architectures aren't just "faster" Big Data architectures. They must be reliable and scalable as never before, more like microservice architectures.

    This talk has three goals:

    1. Justify the transition from batch-oriented big data to stream-oriented fast data.
    2. Explain the requirements that streaming architectures must meet and the tools and techniques used to meet them.
    3. Discuss the ways that fast data and microservice architectures are converging.

    Big data started with an emphasis on batch-oriented architectures, where data is captured in large, scalable stores, then processed using batch jobs. To reduce the gap between data arrival and information extraction, these architectures are now evolving to be stream oriented, where data is processed as it arrives. Fast data is the new buzzword.

    These architectures introduce new challenges for developers. Whereas a batch job might run for hours, a stream processing system typically runs for weeks or months, which raises the bar for making these systems reliable and scalable to handle any contingency.

    The microservice world has faced this challenge for a while. Microservices are inherently message driven, responding to requests for service and sending messages to other microservices, in turn. Hence, they are also stream oriented, in the sense that they must respond reliably to never-ending input. So, they offer guidance for how to build reliable streaming data systems. I'll discuss how these architectures are merging in other ways, too.

    We'll also discuss how to pick streaming technologies based on four axes of concern:

    • Low latency: What's my time budget for handling this data?
    • High volume: How much data per unit time must I handle?
    • Data processing: Do I need machine learning, SQL queries, conventional ETL processing, etc.?
    • Integration with other tools: Which ones and how is data exchanged between them?

    We'll consider specific examples of streaming tools and how they fit on these axes, including Spark, Flink, Akka Streams, and Kafka.

09:50

    Simon Carryer - Data is a Soft Science

    09:50 - 10:20 AM, Wesley Theatre, 202 Interested

    The perception of data science, and often the way it is taught, is like this: You have some nice, tidy data, you use the latest, coolest algorithm, and you get some super clever results. You know it’s good ‘cause your r-squared value is through the roof, and you could play checkers on your confusion matrix.

    But the reality is different. That nice, tidy dataset has to be wrangled out of a big, nasty production system that was built by a coffee-fueled maniac. Those cool results have to somehow be translated into a user interface in which it’s the 12th most important thing on the page, and you have to fight for every pixel. And in front of that production system, entering that data, clicking on that user interface, is a data scientist’s worst nightmare: People.

    As much as we might want to believe that data science is a pure “hard” science, about writing Greek letters on chalkboards and stroking our chins, the truth is that what we do is more usefully thought of as a social science. Data science is a lens for understanding human behaviour. It is a tool for communicating with people. Data is a soft science.

    This talk is about how my background in Social Anthropology gave me a unique approach to doing data science. I’ll show how taking this view of data science led to some cool discoveries in some interesting projects. And I’ll talk about how, building accounting software at Xero, we’ve started on the journey towards building a “smarter” application. As we've done this, the hardest problems have not been about technical implementation; they’ve been about understanding the interface between these technologies and our users. Our data science problems at Xero, it turns out, are mostly about how to understand humans.

10:20

    Morning Break - 25 mins

10:45
11:15
11:45
12:15

    Lunch Break - 50 mins

13:05
13:35

    Boris Savkovic - Machine learning applications for the autonomous/connected vehicle: perspectives, applications and methods

    01:35 - 02:05 PM, Wesley Theatre, 213 Interested

    The application of streaming and real-time data science/analytics to connected and autonomous vehicles is gaining traction around the world. Intelematics is an Australian leader and innovator in the field of telematics/connected vehicles as well as in big data traffic analytics, with Intelematics services used by Australian and overseas giants such as Ford, Toyota and Google.

    The topic of the talk is the application of streaming and big data analytics to autonomous and connected vehicles. Applications covered will include the ability to predict vehicle failures, forecast traffic conditions, automate vehicle insurance claims with automated crash/incident detection, and deliver data into the vehicle (traffic signal states and forecasts).

    The talk will outline general trends and give some examples of concrete solutions that we have developed at Intelematics in this emerging field, both in Australia and overseas (US and EU). The focus will be on the application of data science and algorithms, and the key role that these have to play in the emerging field of connected and autonomous vehicles. The relevant data streams bring a number of technical and non-technical challenges that will be discussed: the complexity of dealing with geo-spatial and temporal data, safety and security, privacy, the streaming and event-driven nature of the data, the volume of data, and the complexity of the relationships/patterns to be modelled.

    The underlying algorithms and technologies will also be discussed in some detail.

    Link to company website:

    http://www.intelematics.com/

    Speaker bio:

    https://au.linkedin.com/in/borislav-savkovic-23969154

14:05
14:35

    Afternoon Break - 20 mins

14:55

    Wai Chee Yau / Jeffrey Theobald - Deep Learning, Production and You

    02:55 - 03:25 PM, Wesley Theatre, 223 Interested

    Simply building a successful machine learning model is extremely challenging, and just as much effort is needed to turn that model into a customer-facing product. Drawing on their experience working on Zendesk’s article recommendation product, Wai Chee Yau and Jeffrey Theobald discuss design challenges and real-world problems you may encounter when building a machine learning product at scale.

    Wai Chee and Jeffrey cover the evolution of the machine learning system, from individual models per customer (using Hadoop to aggregate the training data) to a universal deep learning model for all customers using TensorFlow, and outline some challenges they faced while building the infrastructure to serve TensorFlow models. They also explore the complexities of seamlessly upgrading to a new version of the model and detail the architecture that handles the constantly changing collection of articles that feed into the recommendation engine.

    Topics include:

    • Infrastructure for continuously changing textual data
    • Deploying and serving TensorFlow models in production
    • Real-world production problems when dealing with a machine learning model
    • Data, customer feedback, and user experience
15:25
15:55

    Shujia Zhang - Graph Neural Networks: Algorithm and Applications

    03:55 - 04:25 PM, Wesley Theatre, 214 Interested

    Artificial neural networks help us cluster and classify. Since "deep learning" became a buzzword, it has driven many advances in AI, such as self-driving cars, image classification and AlphaGo. There are many different deep learning architectures; the most popular are based on the well-known convolutional neural network, which is one type of feed-forward neural network. This talk will introduce another variant of deep neural network, the graph neural network, which can model data represented as generic graphs (a graph can have labelled nodes connected via weighted edges). The talk will cover:

    • the graph (graph of graphs - GoGs) representation: how we represent different data with graphs
    • architecture of graph neural networks (GNN): the architecture of deep graph neural networks and learning algorithm
    • applications of GoGs and GNNs: document classification, web spam detection, human action recognition in video

16:25

    Afternoon Break - 20 mins

16:45
17:15

    Tomasz Bednarz - Visual Analytics on Steroids: High Performance Visualisation, Simulations and AI

    05:15 - 05:45 PM, Wesley Theatre, 128 Interested

    In the time that someone takes to read this abstract, another could solve a detective puzzle, if only they had enough quantitative evidence with which to prove their suspicions. Equally, one could use visualisation and computational tools like a microscope, to seek a new cure for cancer or predict hospitalisation prevention. In this presentation, we will demonstrate new visual analytics techniques that use various mixed reality approaches to link simulations with collaborative, complex and interactive data exploration, placing the human in the loop. Recently, thanks to advances in graphics hardware and compute power (especially GPGPU and modern Big Data / HPC infrastructures), the opportunities have become immense, especially for improving our understanding of complex models that represent real or hybrid worlds. The use cases presented will be drawn from ongoing research at CSIRO and the Expanded Perception and Interaction Centre (EPICentre), using world-class GPU clusters and visualisation capabilities.

17:45
18:30

    Conference Drinks & Networking - 60 mins

YOW! Data 2018 Day 2

Tue, May 15
Timezone: Australia/Sydney (AEST)
08:45

    Session Overviews & Introductions - 15 mins

09:00
09:50

    Tim Garnsey - Respecting privacy with synthetically generated "look-alike" data sets

    09:50 - 10:20 AM, Wesley Theatre, 188 Interested

    Safely handling data that contains sensitive or private information about people is a multi-million dollar problem at many companies. It adds time to the data engineering process, it can cost a lot in software licenses for specialised tools, and it brings a range of reputational and legal risks.

    Recent advances in deep learning have prompted an interesting way to attack this problem. By fitting a certain class of model on a source data set that contains sensitive information, we can produce a generator that outputs a supply of synthetic "look-alike" data. This output data preserves many of the statistical relationships between fields that the source has, and offers mathematical guarantees around the identifiability of individuals in the source data set.

    This talk will provide an overview of the approach and show how it can speed up data engineering effort and reduce risk.

10:20

    Morning Break - 25 mins

10:45

    Fiona Tweedie - On the quest for advanced analytics: governance and the Internet of Things

    10:45 - 11:15 AM, Wesley Theatre, 199 Interested

    Data scientists dream of crystal clear data lakes and perfectly ordered warehouses with comprehensive dictionaries, consistent formats and never a null value or encoding error to mar their analysis. The reality, however, is that the bulk of time on most data projects is spent sourcing and munging data before the exploration and analysis can begin. Governance is often presented as the solution to all data woes but all too often generates more meetings than results.

    The University of Melbourne is home to 8000 staff and 48000 students across seven campuses. Both researchers and professional staff recognise that data is going to be key to understanding this complex community and supporting its members. Sensor data collected from around the campuses promises the opportunity to analyse everything from demands on public transport to the impact of weather on coffee consumption. With researchers spread across ten faculties, there is a danger that multiple projects will collect fragmented data and the real power that comes from joining multiple datasets will never be realised. Conversely, overly prescriptive policies will date quickly and hamper innovation. Is it possible to satisfy both the desire to move rapidly to take advantage of new opportunities and the need to maintain data quality?

    This case study will present some of the IoT projects currently being explored at the University and examine the governance efforts that are being trialled to ensure the adoption of standards and future interoperability of devices and data.

11:15
11:45
12:15

    Lunch Break - 50 mins

13:05
13:50
14:20

    Afternoon Break - 20 mins

14:40
15:10

    Aidan O'Brien - DevOps 2.0: Evidence-based evolution of serverless architecture through automatic evaluation of “infrastructure as code” deployments

    03:10 - 03:40 PM, Wesley Theatre, 178 Interested

    The scientific approach teaches us to formulate hypotheses and test them experimentally in order to advance systematically. DevOps, and software architecture in particular, do not traditionally follow this approach. Here decisions like “scaling up to more machines or simply employing a batch queue” or “using Apache Spark or sticking to a job scheduler across multiple machines” are worked out theoretically rather than implemented and tested objectively. Furthermore, the paucity of knowledge about unestablished systems like serverless cloud architecture hampers the theoretical approach.

    We therefore partnered with James Lewis and Kief Morris to establish a fundamentally different approach for serverless architecture design that is based on scientific principles. For this, the serverless architecture stack first needs to be fully defined through code/text, e.g. AWS CloudFormation, so that it can easily and consistently be deployed. This “architecture as text” base can then be modified and re-deployed to systematically test hypotheses, e.g. whether an algorithm is faster or a particular autoscaling group is more efficient. The second key element of this novel way of evolving architecture is the automatic evaluation of any newly deployed architecture without manually recording runtime or defining interactions between services, e.g. with Epsagon’s monitoring solution.

    Here we describe the two key aspects in detail and showcase the benefits by describing how we improved runtime by 80% for the bioinformatics software framework GT-Scan, which is used by Australia’s premier research organization to conduct medical research.

15:40
16:10

    Afternoon Break - 20 mins

16:30

    Gala Camacho - Using Social Media Data to Interactively Explore the Personality of a Neighbourhood

    04:30 - 05:00 PM, Wesley Theatre, 147 Interested

    Why do people choose to live in one neighbourhood over another? Every day government makes decisions that can change the way a neighbourhood operates and feels. Understanding the impact that these decisions have is convoluted and hard to measure.

    In April, the Gold Coast held the 2018 Commonwealth Games. These events, usually advertised as urban renewal or regeneration projects, have a lasting impact on the neighbourhoods where they take place. Usually, there is a strong push to predict the impact of the games through economic assessments and surveys. However, once the games are happening, is that impact tracked? Sometimes, further economic assessments are produced years after the event which evaluate whether the impact was as expected. Can we do better?

    Social data can provide us with more timely evidence of whether the event is activating the economy as expected, or maybe highlight issues that can be resolved.

    Using social media data we will explore Facebook places data during and after the games, ultimately generating a dashboard to help us visualise change by giving us the ability to filter and aggregate the data. We will manage the data using Python’s Pandas, generate quick visualisations using Plotly, and finish off by spinning up a dashboard using Plotly’s Dash.
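
    As a rough illustration of the "filter and aggregate" step described above, here is a minimal Pandas sketch. It is not the speaker's code: the column names, the sample check-in counts and the helper names (label_period, checkins_by_suburb) are hypothetical, and the 4 to 15 April window is assumed as the period of the games.

```python
import pandas as pd

# Hypothetical check-in counts per place and day (invented values, not real Facebook data).
records = [
    {"place": "Cafe A", "suburb": "Southport", "date": "2018-04-03", "checkins": 40},
    {"place": "Cafe A", "suburb": "Southport", "date": "2018-04-10", "checkins": 65},
    {"place": "Bar B", "suburb": "Broadbeach", "date": "2018-04-03", "checkins": 30},
    {"place": "Bar B", "suburb": "Broadbeach", "date": "2018-04-20", "checkins": 25},
    {"place": "Shop C", "suburb": "Southport", "date": "2018-04-20", "checkins": 10},
]
df = pd.DataFrame(records)
df["date"] = pd.to_datetime(df["date"])

# Assumed window for the games; adjust to the period of interest.
GAMES_START = pd.Timestamp("2018-04-04")
GAMES_END = pd.Timestamp("2018-04-15")


def label_period(date):
    """Tag a date as before, during or after the assumed games window."""
    if date < GAMES_START:
        return "before"
    if date <= GAMES_END:
        return "during"
    return "after"


df["period"] = df["date"].apply(label_period)


def checkins_by_suburb(frame, suburb=None):
    """Optionally filter to one suburb, then total check-ins per suburb and period."""
    if suburb is not None:
        frame = frame[frame["suburb"] == suburb]
    totals = frame.groupby(["suburb", "period"])["checkins"].sum()
    # One row per suburb, one column per period; missing combinations become 0.
    return totals.unstack(fill_value=0)


summary = checkins_by_suburb(df)
print(summary)
```

    A table shaped like summary is the kind of input that a chart or a dashboard filter could then read from.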

17:00
17:30

    Katie Bell - Is the 370 the worst bus in Sydney?

    05:30 - 06:00 PM, Wesley Theatre, 108 Interested

    In Switzerland, people will be surprised at a bus that's 2 min late. In Sydney, people will only consider it noteworthy if a bus is more than 20 min late, and this varies greatly between routes and providers. So, how do Sydney bus routes stack up? And if we're talking about privatisation, how do the private bus providers stack up against the state buses?

    To answer these questions we need data… lots of data. Hooray for open government data! Transport for NSW publishes real-time information on the location and lateness of all public transport. Unfortunately it's ephemeral – there is no public log of historical lateness for us to analyse. To gather the data I needed, I had to fetch, log and aggregate ephemeral real-time data that was never intended to be used this way. There are random gaps and spontaneous route or timetable changes for special events, roadworks or holidays. Even with noisy data, patterns start to emerge across months and we can start to answer some questions. The 370 bus route is one of the most complained-about routes in Sydney; it even has its own Facebook group of ironic fans... but is it really the worst bus? Let's look at the data.
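
    The "log and aggregate" idea above can be sketched in a few lines of Python. This is only an assumption about how such a log might be summarised, not the speaker's pipeline: the snapshot layout, the sample delays, the route "333", the 300-second lateness threshold and the function names are all hypothetical, and the step that fetches the real-time feed is left out.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical logged snapshots of a real-time feed:
# (timestamp, route, trip_id, delay_seconds). Values are invented.
snapshots = [
    ("2018-03-01T08:00", "370", "trip-1", 120),
    ("2018-03-01T08:01", "370", "trip-1", 600),
    ("2018-03-01T08:00", "370", "trip-2", 1500),
    ("2018-03-01T08:00", "333", "trip-9", 60),
    ("2018-03-01T08:01", "333", "trip-9", 90),
]


def worst_delay_per_trip(rows):
    """Collapse repeated snapshots of the same trip into its worst observed delay."""
    worst = {}
    for _timestamp, route, trip, delay in rows:
        key = (route, trip)
        worst[key] = max(worst.get(key, delay), delay)
    return worst


def summarise_routes(rows, late_threshold=300):
    """Per route: number of trips, mean worst delay, and share of trips over the threshold."""
    per_route = defaultdict(list)
    for (route, _trip), delay in worst_delay_per_trip(rows).items():
        per_route[route].append(delay)
    return {
        route: {
            "trips": len(delays),
            "mean_delay": mean(delays),
            "share_late": sum(d > late_threshold for d in delays) / len(delays),
        }
        for route, delays in per_route.items()
    }


summary = summarise_routes(snapshots)
print(summary)
```

    Taking the worst delay per trip is one choice among several; the mean or the final observed delay would give a different picture of the same log.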

Workshop Day 1

Wed, May 16
Timezone: Australia/Sydney (AEST)
08:00

Workshop Day 2

Thu, May 17
Timezone: Australia/Sydney (AEST)
08:00