Keeping systems up to date is an inherently complex task, greatly complicated by integration challenges, organisational complexity and, increasingly, a massive amount of state. This talk will explore some of the drivers and patterns, and present some approaches we are taking to address this problem in a sustainable manner.
-
Dali Kaafar - Don’t Give the Network a Function, Teach the Network how to Function!
45 Mins
Keynote
Advanced
Organizations are increasingly outsourcing network functions to the cloud, aiming to reduce the cost and complexity of maintaining network infrastructure. At the same time, however, outsourcing implies that sensitive network policies, such as firewall rules, are revealed to the cloud provider. In this talk, I will walk you through our investigation of several cryptographic primitives for processing outsourced network functions so that the provider does not learn any sensitive information.
I will present a cryptographic treatment of privacy-preserving outsourcing of network functions, introducing security definitions as well as an abstract model of generic network functions, and then propose a few instantiations using homomorphic encryption and public-key encryption with keyword search. This serves as an illustration of what not to do if you are after high-performance function outsourcing; on the other hand, it shows that such outsourcing is feasible when run-time performance is not critical.
I will then present SplitBox, an efficient system for privacy-preserving processing of network functions that are outsourced as software processes to the cloud. Specifically, the cloud providers processing the network functions do not learn the network policies instructing how the functions are to be processed. First, I will present an abstract model of a generic network function based on match-action pairs. We assume that this function is processed in a distributed manner by multiple honest-but-curious cloud service providers. Then, I will describe the SplitBox system for private network function virtualization in detail and present a proof-of-concept implementation on FastClick, an extension of the Click modular router, using a firewall as a use case. The PoC achieves a throughput of over 2 Gbps with 1 kB-sized packets on average, traversing up to 60 firewall rules.
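For readers who want a concrete picture of the match-action abstraction the abstract refers to, here is a minimal, unencrypted toy sketch of a firewall policy as a list of match-action rules. Names and structure are invented for illustration only and do not reflect the SplitBox implementation, whose whole point is that the provider never sees these rules in the clear.

```python
# Toy match-action firewall model (illustrative only, not the SplitBox API).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    src_prefix: str          # e.g. "10.0.0." (simplified prefix match)
    dst_port: Optional[int]  # None matches any port
    action: str              # "ACCEPT" or "DROP"

def matches(rule: Rule, src_ip: str, dst_port: int) -> bool:
    return src_ip.startswith(rule.src_prefix) and (
        rule.dst_port is None or rule.dst_port == dst_port
    )

def apply_policy(rules, src_ip, dst_port, default="DROP"):
    # First matching rule wins, as in a classic firewall rule table.
    for rule in rules:
        if matches(rule, src_ip, dst_port):
            return rule.action
    return default

policy = [Rule("10.0.0.", 22, "DROP"), Rule("10.0.", None, "ACCEPT")]
print(apply_policy(policy, "10.0.0.5", 22))   # DROP
print(apply_policy(policy, "10.0.1.7", 443))  # ACCEPT
```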
-
Mark Hibberd - Lake, Swamp or Puddle: Data Quality at Scale
30 Mins
Talk
Intermediate
Data is a powerful tool. Data-driven systems leveraging modern analytical and predictive techniques can offer significant improvements over static or heuristic driven systems. The question is: how much can you trust your data?
Data collection, processing and aggregation is a challenging task. How do we build confidence in our data? Where did the data come from? How was it generated? What checks have or should be applied? What is affected when it all goes wrong?
This talk looks at the mechanics of maintaining data quality at scale: first at bad data, what it is and where it comes from, then at the techniques required to detect, avoid and ultimately deal with it. At the end of this talk the audience should come away with an idea of how to design quality data-driven systems that ultimately build confidence and trust rather than inflate expectations.
-
Sarah Bolt - Big Data, Little Data, Fast Data, Slow… Understanding the Potential Value of Click-Stream Data Processing
30 Mins
Talk
Intermediate
Digital event data (or click-stream data) is the collective record of users’ interactions with a website or other application. With the growing popularity of collecting this data has come waves of hype about particular approaches and tools for processing and analysing it, from spreadsheets, to dynamic reporting tools, to data warehouses, to massively parallel architectures, to real-time processing. Which processing approach is right for you?
As for any data system, the collection of digital event data is fundamentally about supporting, or even completely automating, our decisions and this has important implications for how it is processed. This presentation will examine some of these core considerations in formulating an approach to the processing of this data. What decisions does it support? Who uses the data? And increasingly importantly, as users become more aware of the value exchange for the data they provide, how to ensure that this data ultimately provides for a better service?
-
Sandy Taylor - Infrastructure for Smart Cities: Bridging Research and Production
30 Mins
Talk
Intermediate
This talk will explore our process for taking research algorithms into production as part of large-scale IoT systems. This will include our experiences developing a condition monitoring system for the Sydney Harbour Bridge, and case studies into some of the challenges we have faced. It will also cover general IoT challenges such as bandwidth limits, weatherproofing, and hardware lock-in, and how we have addressed them.
-
Philip Haynes - Keeping RAFT afloat – Cloud Scale Distributed Consensus
30 Mins
Talk
Intermediate
Strong consistency for cloud-scale systems is typically viewed as too hard and too expensive. This talk provides an overview of how implementing the RAFT distributed consensus algorithm with high-performance methods and the Aeron network library enables low-cost processing of over 100M transactions per day.
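As background, the sketch below shows the core of Raft's commit rule, where a log entry is committed once a majority of replicas hold it. It is simplified illustrative pseudo-code (it omits terms, elections and the current-term commit restriction) and is not the speaker's Aeron-based implementation.

```python
# Simplified Raft-style commit tracking on the leader (illustrative only).
from dataclasses import dataclass, field

@dataclass
class Leader:
    cluster_size: int
    log: list = field(default_factory=list)          # replicated log entries
    match_index: dict = field(default_factory=dict)  # highest index stored per follower
    commit_index: int = -1                           # last entry known to be committed

    def append(self, command) -> int:
        self.log.append(command)
        return len(self.log) - 1

    def on_ack(self, follower: str, index: int):
        self.match_index[follower] = max(self.match_index.get(follower, -1), index)
        # Count replicas (leader + followers) holding each index; commit on majority.
        for i in range(self.commit_index + 1, len(self.log)):
            replicas = 1 + sum(1 for m in self.match_index.values() if m >= i)
            if replicas * 2 > self.cluster_size:
                self.commit_index = i

leader = Leader(cluster_size=3)
idx = leader.append("debit account 42")
leader.on_ack("follower-1", idx)
print(leader.commit_index)  # 0: committed once 2 of 3 nodes hold the entry
```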
-
Pablo Caif - Stomping on Big Data using Google’s BigQuery
30 Mins
Talk
Intermediate
Managing infrastructure, worrying about scalability, and waiting for queries to finish executing are some of the biggest challenges when working with massive volumes of data. One solution is to outsource the heavy lifting to someone else, thereby allowing you to spend more time on actually analyzing and drawing insights out of your data. In other words, look to harnessing the cloud to solve big data problems.
BigQuery is a SaaS tool from Google that is designed to make it easy to get up and running without having to care about any operational overheads. It has a true zero-ops model. BigQuery’s bloodline traces back to Dremel, which was the inspiration for many open-source projects such as Apache Drill. Using a massively parallel, tree-structured execution architecture with columnar storage, your queries will run on thousands of cores inside Google’s data centres without you spinning up a single VM. This talk will cover its core features, cost model, available APIs, and caveats. Finally, there will be a live demo of BigQuery in action.
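To give a feel for the zero-ops model, here is a minimal sketch using the google-cloud-bigquery Python client against one of Google's public sample datasets; it assumes the library is installed and default application credentials are configured.

```python
# Minimal BigQuery usage sketch (assumes `pip install google-cloud-bigquery`
# and default application credentials).
from google.cloud import bigquery

client = bigquery.Client()  # no clusters or VMs to provision

sql = """
    SELECT corpus, SUM(word_count) AS total_words
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
    ORDER BY total_words DESC
    LIMIT 5
"""

# The query fans out across Google's infrastructure; we just wait for rows.
for row in client.query(sql).result():
    print(row.corpus, row.total_words)
```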
-
Natalia Rümmele - Automating Data Integration with Machine Learning
30 Mins
Talk
Intermediate
The world of data is a messy and unstructured place, making it difficult to gain value from data. Things get worse when the data resides in different sources or systems. Before we can perform any analytics in such a case, we need to combine the sources and build a unified view of the data. To handle this situation, a data scientist would typically go through each data source, identify which data is of interest, and define transformations and mappings which unify these data with other sources. This process usually includes writing lots of scripts with potentially overlapping code – a real headache in the everyday life of a data scientist! In this talk we will discuss how machine learning techniques and semantic modelling can be applied to automate the data integration process.
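As a rough illustration of one building block of this idea, the sketch below learns to label columns from simple content features so that mappings to a unified schema can be suggested automatically. The features, labels and training data are invented for illustration and are not the speaker's system.

```python
# Toy column-labelling sketch for automated schema mapping (illustrative only).
import re
from sklearn.tree import DecisionTreeClassifier

def column_features(values):
    values = [str(v) for v in values]
    return [
        sum(v.replace(".", "", 1).isdigit() for v in values) / len(values),              # numeric fraction
        sum("@" in v for v in values) / len(values),                                     # email-like fraction
        sum(bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", v)) for v in values) / len(values),  # ISO-date fraction
    ]

# Tiny labelled training set: columns sampled from already-integrated sources.
training = [
    (["a@x.com", "bob@y.org", "c@z.net"], "email"),
    (["10.5", "3.2", "99.0"],             "price"),
    (["2017-05-01", "2016-12-31"],        "date"),
]
X = [column_features(col) for col, _ in training]
y = [label for _, label in training]

model = DecisionTreeClassifier(random_state=0).fit(X, y)
print(model.predict([column_features(["sales@acme.com", "info@foo.io"])]))  # ['email']
```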
-
Michael Fernandez - Data Analytics for Accelerated Materials Discovery
30 Mins
Talk
Intermediate
Data analytics and machine learning are at the centre of social, marketing, healthcare and manufacturing research. In materials discovery, they play a fundamental role in successfully tackling the exponential increase in the size and complexity of functional materials. This presentation will discuss how data analytics tools can drastically accelerate materials discovery and reveal intrinsic relationships between structural features and functional properties of novel materials. Multivariate statistics techniques and simple decision tree predictors can identify design principles from high-throughput data on candidate materials. Meanwhile, more complex deep learning models are calibrated on the performance of a small set of materials and later generalise to identify high-performing candidates across large virtual material libraries. It will be demonstrated that data-driven predictors can rapidly discriminate among potential candidate materials at a fraction of the traditional cost, whilst providing new opportunities to understand structure-performance paradigms for novel material applications.
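A toy sketch of the screening idea: fit a simple decision tree on descriptors of a small set of characterised materials, then rank a larger virtual library by predicted performance. The descriptors, values and target property below are invented purely for illustration.

```python
# Illustrative virtual-screening sketch with a decision tree (invented data).
from sklearn.tree import DecisionTreeRegressor

# Descriptors: [pore size (nm), surface area (1000 m^2/g)] for characterised materials.
X_train = [[0.8, 1.2], [1.5, 2.9], [0.5, 0.7], [2.1, 3.5], [1.0, 1.8]]
y_train = [12.0, 31.0, 8.0, 38.0, 19.0]   # measured uptake (arbitrary units)

model = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_train, y_train)

# Virtual library: cheap-to-compute descriptors for many untested candidates.
library = [[0.6, 0.9], [1.8, 3.1], [1.2, 2.0]]
ranked = sorted(zip(model.predict(library), library), reverse=True)
print(ranked[0])  # best predicted candidate and its descriptors
```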
-
Max Ott - Data Analytics Without Seeing the Data
30 Mins
Talk
Intermediate
Today, we first need to collect data before we can analyse them. This not only creates privacy concerns but also security risks for the collector. For many use cases we really only want the analysis, and data collection becomes a necessary evil.
In this talk we describe some of the fundamental techniques that allow us to compute on encrypted data, as well as protocols for distributed analysis and the associated security models. We will use some standard algorithms, such as logistic regression, to highlight the differences from conventional big-data analytics frameworks.
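As a small taste of computing on encrypted data, the sketch below uses the open-source python-paillier (`phe`) library to score a logistic-regression model on additively encrypted features: the analyst combines ciphertexts with plaintext weights, and only the key holder can decrypt the score. The weights and features are made up, and this is an illustration of the homomorphic idea rather than the N1 Analytics Platform itself.

```python
# Encrypted logistic-regression scoring sketch using python-paillier (`phe`).
import math
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Data owner: encrypts the feature vector before sharing it.
features = [0.7, 1.3, -0.2]
enc_features = [public_key.encrypt(x) for x in features]

# Analyst: holds plaintext model weights, never sees the features.
weights, bias = [0.5, -1.1, 2.0], 0.3
enc_score = public_key.encrypt(bias)
for w, x in zip(weights, enc_features):
    enc_score = enc_score + x * w   # ciphertext * plaintext scalar, ciphertext + ciphertext

# Only the key holder can decrypt the score and apply the non-linear sigmoid.
score = private_key.decrypt(enc_score)
print(1.0 / (1.0 + math.exp(-score)))
```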
Finally, we will discuss the architecture and some interesting implementation details of our N1 Analytics Platform, one of the few emerging industry-strength implementations in this space. We will present some performance and scalability measurements collected from initial customer trials.
-
Josh Wilson - Unit Testing Data
30 Mins
Talk
Intermediate
“Can I trust this data?” is a question that can be difficult to measure and answer objectively. Just as unit tests have provided metrics for code coverage and bug regressions, this talk aims to show techniques and recipes developed to quantify data sanitisation and coverage. It also demonstrates an extensible design pattern that allows further tests to be developed.
If you can write a query, you can write data unit tests. These strategies have been implemented at Invoice2go in their ETL pipeline for the last 2 years to detect data regressions in their Amazon Redshift data warehouse.
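A minimal sketch of the "tests as queries" idea: each data unit test is a SQL query that returns violating rows, and the test passes only when it returns none. It is shown against an in-memory SQLite database for self-containment; the same pattern applies to a Redshift warehouse. This is illustrative and not Invoice2go's actual framework.

```python
# Data unit tests expressed as SQL queries that must return zero violating rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO invoices VALUES (1, 10, 99.5), (2, 11, -3.0), (3, NULL, 12.0);
""")

DATA_TESTS = {
    "totals_are_non_negative": "SELECT * FROM invoices WHERE total < 0",
    "customer_id_is_present":  "SELECT * FROM invoices WHERE customer_id IS NULL",
}

def run_data_tests(conn, tests):
    failures = {}
    for name, query in tests.items():
        bad_rows = conn.execute(query).fetchall()
        if bad_rows:
            failures[name] = bad_rows
    return failures

for name, rows in run_data_tests(conn, DATA_TESTS).items():
    print(f"FAILED {name}: {rows}")
```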
-
Glenn Bunker - Property Recommendations for all Australians
30 Mins
Talk
Advanced
We would like to share our journey and experiences in building a large scale recommendation engine at REA. Attendees will learn about choosing the right algorithms, architecture and toolset for a highly-scalable recommender system.
-
Elena Akhmatova - Text Classification: Defining Targeted in Targeted Digital Advertising
30 Mins
Talk
Intermediate
This talk describes the ideas and conclusions drawn from five years of applying text classification in digital (aka programmatic, RTB) advertising to build targeted audiences to advertise to. I will talk about the importance of text classification for targeting, and describe both the academic and the technical aspects of applying it to advertising. At the end of the talk I will summarise what text classification can achieve. The talk may be useful both for developers and for business people who run RTB advertising companies: developers may gain some technical knowledge, while those more interested in making their own companies competitive will learn whether building audiences for targeting in-house is the way to go.
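For a concrete, if toy, picture of what classifying page text into audience segments looks like, here is a small TF-IDF plus logistic-regression sketch. The segments and training snippets are invented for illustration; production RTB systems work at vastly larger scale.

```python
# Toy page-text classifier for audience segments (illustrative only).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pages = [
    "compare car insurance quotes and premiums",
    "best family SUVs and hatchbacks reviewed",
    "mortgage rates and home loan refinancing tips",
    "first home buyer grants and property prices",
]
segments = ["auto", "auto", "home-finance", "home-finance"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(pages, segments)
print(model.predict(["new hatchback review and insurance costs"]))  # likely 'auto'
```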
-
Cam Grant - The Why and How of Why in a World of What
30 Mins
Talk
Intermediate
All data exists in context, and understanding that context is key to unlocking its potential. In this talk you will learn how we consciously and unconsciously influence the context of data, and how qualitative and quantitative methods can be combined to better interpret and extract insights from data.
The presentation will cover:
- What data in context means
- How bias and interpretation affect data collection, data analysis, and the design of data-driven applications
- The importance of combining quantitative data with qualitative insights; data tells you “what”, whereas qualitative insights tell you “why”
- Lessons learned from five years and over 15 data-driven projects
- A framework for connecting the “why” to the “what”
-
Ben Kuai - Property Recommendations for all Australians
30 Mins
Talk
Intermediate
We would like to share our journey and experiences in building a large scale recommendation engine at REA. Attendees will learn about choosing the right algorithms, architecture and toolset for a highly-scalable recommender system.
-
Quinton Anderson - Moving Forward Under the Weight of all that State
30 Mins
Talk
Advanced
-
Jeffrey Theobald - Intermediate Datasets and Complex Problems
30 Mins
Talk
Intermediate
We often build data processing systems by starting from simple cases and slowly adding functionality. Usually, a system like this is thought of as a set of operations that take raw data and create meaningful output. These operations tend to organically grow into monoliths, which become very hard to debug and reason about. As such a system expands, development of new features tends to slow down and the cost of maintenance dramatically increases. One way to manage this complexity is to produce denormalized intermediate datasets which can then be reused for both automated processes and ad hoc querying. This separates the knowledge of how the data is connected from the process of extracting information, and allows these parts to be tested separately and more thoroughly. While there are disadvantages to this approach, there are many reasons to consider it. If this technique applies to you, it makes the hard things easy, and the impossible things merely hard.
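A small sketch of the pattern: join the raw sources once into a denormalized intermediate dataset, then answer both automated and ad hoc questions from that dataset instead of re-deriving the joins in every job. The tables and column names below are invented for illustration.

```python
# Denormalized intermediate dataset sketch with pandas (invented data).
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 10, 11], "amount": [20.0, 35.0, 15.0]})
customers = pd.DataFrame({"customer_id": [10, 11], "region": ["NSW", "VIC"]})

# Intermediate dataset: one wide, query-friendly table that encodes how the
# sources relate. This is the piece that gets tested and versioned on its own.
order_facts = orders.merge(customers, on="customer_id", how="left")

# Downstream consumers only extract information; they no longer need to know
# how orders and customers are connected.
revenue_by_region = order_facts.groupby("region")["amount"].sum()
orders_per_customer = order_facts.groupby("customer_id")["order_id"].count()
print(revenue_by_region)
print(orders_per_customer)
```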
-
Hilary Cinis - The Why and How of Why in a World of What
30 Mins
Talk
Intermediate
All data exists in context, and understanding that context is key to unlocking its potential. In this talk you will learn how we consciously and unconsciously influence the context of data, and how qualitative and quantitative methods can be combined to better interpret and extract insights from data.
The presentation will cover:
- What data in context means
- How bias and interpretation affect data collection, data analysis, and the design of data-driven applications
- The importance of combining quantitative data with qualitative insights; data tells you “what”, whereas qualitative insights tell you “why”
- Lessons learned from five years and over 15 data-driven projects
- A framework for connecting the “why” to the “what”
-
Danielle Stein Fairhurst - Data Visualisation for Analysts
30 Mins
Talk
Intermediate
The demand for analytical skills has increased rapidly in recent years, and with data analysts generating and analysing large quantities of data, the visualisation and communication of the output has never been more important. A skilled data analyst can not only synthesise information into a logical framework and summarise it in a meaningful format, but can also communicate the output of the analysis in a well-laid-out chart, infographic or other visual representation of the data. Hear from data modelling analyst and author of “Using Excel for Business Analysis”, Danielle Stein Fairhurst, as we study the principles of good design and how to convert your data into powerful visuals that tell a story and communicate the message uncovered by your analysis.
-
Ben Barnes - Infrastructure for Smart Cities: Bridging Research and Production
30 Mins
Talk
Intermediate
This talk will explore our process for taking research algorithms into production as part of large-scale IoT systems. This will include our experiences developing a condition monitoring system for the Sydney Harbour Bridge, and case studies into some of the challenges we have faced. It will also cover general IoT challenges such as bandwidth limits, weatherproofing, and hardware lock-in, and how we have addressed them.
-
Tiberio Caetano - The Best Data Isn’t Data: Why Experiments Are The Future of Data Science
30 Mins
Talk
Intermediate
Data is technically the plural of datum, which in Latin is the neuter past participle of dare, which means “to give”. Thus data means “givens”. Indeed, the overwhelming majority of data being analysed out there is given, i.e., the analyst can’t change it. It’s “just there” for you to analyse. You can slice and dice it, model it, act based on it, but you very likely didn’t control even partially the process that gave rise to it. In this talk I’ll try to convince you that, although this given, passive data is important, the real game changer for the future of data science is to combine it with the best possible data, which is not given at all: active data that arises as the outcome of carefully designed experiments. Just as experiments have propelled science to realms unattainable had it focused exclusively on passive observations, I expect the same to happen with data science. I’ll illustrate my arguments with real-world case studies from Ambiata’s experience in serving our clients.