  • Davor Bonaci - Realizing the Promise of Portable Data Processing with Apache Beam

    Davor Bonaci
    Senior Software Engineer
    Google Inc.
    30 mins
    Talk
    Intermediate

    The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is the glue that can connect the Big Data ecosystem together; it enables users to "run-anything-anywhere".

    This talk will briefly cover the capabilities of the Beam model for data processing, as well as the current state of the Beam ecosystem. We'll discuss Beam architecture and dive into the portability layer. We'll offer a technical analysis of Beam's powerful primitive operations that enable true and reliable portability across diverse environments. Finally, we'll demonstrate a complex pipeline running on multiple runners in multiple deployment scenarios (e.g. Apache Spark on Amazon Web Services, Apache Flink on Google Cloud, Apache Apex on-premise), and give a glimpse at some of the challenges Beam aims to address in the future.
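
    For readers who want a concrete feel for what "run-anything-anywhere" means, here is a minimal sketch (mine, not from the talk) of a Beam pipeline in the Python SDK; the pipeline code stays identical across runners, and only the `--runner` option changes:

```python
# A minimal word-count pipeline with the Beam Python SDK. The same
# pipeline runs locally or on a cluster depending solely on the runner
# passed in via PipelineOptions.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(runner="DirectRunner"):
    options = PipelineOptions(["--runner", runner])
    with beam.Pipeline(options=options) as p:
        (p
         | "Read"  >> beam.Create(["a b", "b c c"])
         | "Split" >> beam.FlatMap(str.split)
         | "Pair"  >> beam.Map(lambda w: (w, 1))
         | "Count" >> beam.CombinePerKey(sum)
         | "Print" >> beam.Map(print))

if __name__ == "__main__":
    run()                    # local DirectRunner
    # run("FlinkRunner")     # the same code on a Flink cluster
    # run("SparkRunner")     # or on Spark
```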

  • Radek Ostrowski - Dipping into the Big Data River: Stream Analytics at Scale

    Radek Ostrowski
    Big Data Engineer
    CBA/Toptal
    30 mins
    Demonstration
    Intermediate

    This presentation explains the concepts of the Kappa and Lambda architectures and showcases how useful business knowledge can be extracted from the constantly flowing river of data.

    It also demonstrates how a simple POC can be built in a day, getting only your toes wet, by leveraging Docker and technologies like Kafka, Spark and Cassandra.
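
    As a rough sketch of that kind of day-one POC (my simplification, not the demo's actual code), Spark Structured Streaming can read directly from a Kafka topic; the broker address and topic name below are placeholders, and a Cassandra sink would come from the spark-cassandra-connector:

```python
# Read a stream of events from Kafka with Spark Structured Streaming
# and maintain a running count, printed to the console.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("stream-poc").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
          .option("subscribe", "events")                        # placeholder topic
          .load())

counts = (events.select(col("value").cast("string").alias("event"))
          .groupBy("event")
          .count())

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```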

  • Ananth Gundabattula - Low Latency Polyglot Model Scoring using Apache Apex

    30 mins
    Talk
    Intermediate

    Data science is fast becoming a complementary approach and process for solving business challenges today. The explosion of frameworks that help data scientists build models bears testimony to this. However, when a model needs to be turned into a production version in a very low latency, enterprise-grade environment, there are very few choices, each with its own strengths and weaknesses. Adding to this is the current disconnect between a data scientist's world, which is all about modelling, and an engineer's world, which is about SLAs and service guarantees. A framework like Apache Apex can complement each of these roles and provide constructs for both worlds, helping enterprises drastically cut the cost of deploying models to production environments.

    The talk will present Apache Apex as a framework that enables engineers and data scientists to build low latency, enterprise-grade applications. We will cover the foundations of Apex that contribute to the platform's low latency processing capabilities, then discuss the aspects that qualify it as enterprise grade. Finally, we will turn to the heart of the title: models developed in Java, R and Python co-existing in the same scoring application, enabling a truly polyglot framework.

  • Roman Kovalik - Batch as a Special Case of Streaming

    Roman Kovalik
    Big Data Engineer
    Quantium
    30 mins
    Case Study
    Intermediate

    In this talk I will share my team's gruelling journey in attempting to migrate a batch-like system to a streaming framework.

    Walking through the various solutions we tested using Flink, I'll discuss each one's performance characteristics and bring to light misconceptions in their designs.

  • Natalia Ruemmele - Cast a Net Over your Data Lake

    Natalia Ruemmele
    Data Scientist
    Data61, CSIRO
    30 mins
    Talk
    Intermediate

    As the variety of data continues to expand, the need for different kinds of analytics is increasing – big data is no longer just about volume, but also about increasing diversity. Unfortunately, there is no one-size-fits-all approach to analytics – no magic pill that will get your organization the insight it needs from data. Graph analytics offers a toolset to visualize your diverse data and to build more accurate predictive models by uncovering non-obvious interconnections among your data sources.

    In this talk we will discuss some use cases for graph analytics and walk through a particular scenario: finding power-users for a promotion campaign. We will also cover machine learning approaches that can assist you in constructing graphs from diverse data sources.
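
    As a toy illustration of the power-user scenario (my example, not the speaker's), one could rank users in an interaction graph by PageRank using networkx; the edge list below is made-up data:

```python
# Build a directed graph of user interactions and rank users by
# PageRank; highly ranked users are candidates for the campaign.
import networkx as nx

interactions = [("alice", "bob"), ("alice", "carol"),
                ("bob", "carol"), ("dave", "alice")]

G = nx.DiGraph()
G.add_edges_from(interactions)

scores = nx.pagerank(G)
power_users = sorted(scores, key=scores.get, reverse=True)[:2]
print(power_users)
```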

  • Ondrej Ivanič - Writing Better R Code

    30 mins
    Talk
    Intermediate

    Data scientists, analysts, and statisticians are passionate about the data, models, and insights, but the code used to produce those results is, in many cases, left behind. We have a very good understanding of our code base while we are working on a project, but most of the time we do not write the code for the "future me".

    In this talk, I describe and explain common coding pitfalls in R and then introduce functional programming, using functions from base R, purrr (part of the tidyverse) and pipes, as the preferred approach to creating robust and reusable R code. Along the way, I briefly touch on controversial topics such as "loops are bad" and "pipes are the best".
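
    The talk's examples are in R; as a loose Python analogue of the same idea, an imperative loop can be replaced by small pure functions threaded through a pipeline (the `pipe` helper below is hypothetical, standing in for magrittr's `%>%`):

```python
# Replace an imperative loop with composed, reusable functions.
from functools import reduce

def pipe(value, *funcs):
    """Thread a value through a sequence of functions, left to right."""
    return reduce(lambda acc, f: f(acc), funcs, value)

raw = ["  3.5", "4.0 ", "bad", "5.5"]

def parse_numbers(xs):
    # Keep only the entries that parse as floats.
    out = []
    for x in xs:
        try:
            out.append(float(x))
        except ValueError:
            pass
    return out

result = pipe(raw,
              parse_numbers,
              lambda xs: [x * 2 for x in xs],
              sum)
print(result)  # 26.0
```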

  • Daniel Filonik - A Geometric Approach towards Data Analysis and Visualisation

    30 mins
    Talk
    Intermediate

    Beginning with the work of Bertin, visualisation scholars have attempted to systematically study and deconstruct visualisations in order to gain insights about their fundamental structure. More recently, the idea of deconstructing visualisations into fine-grained, modular units of composition also lies at the heart of graphics grammars. These theories provide the foundation for visualisation frameworks and interfaces developed as part of ongoing research, as well as state-of-the-art commercial software such as Tableau. In a similar vein, scholars like Tufte have long advocated forgoing embellishments and decorations in favour of abstract and minimalist representations. They argue that such representations facilitate data analysis by communicating only essential information and minimizing distraction.

    This presentation continues along such lines of thought, proposing that this pursuit naturally leads to a geometric approach towards data analysis and visualisation. Looking at data from a sufficiently high level of abstraction, one inevitably returns to fundamental mathematical concepts. As one of the oldest branches of mathematics, geometry offers a vast amount of knowledge that can be applied to the formal study of visualisations.

    "Visualization is a method of computing. It transforms the symbolic into the geometric." (McCormick et al., 1987)

    In other words, geometry is the mathematical link between abstract information and graphic representation. In order to graphically represent information, we assign to it a geometric form. In this presentation we will explore the nature of these mappings from symbolic to geometric representations. This geometric approach provides an alternative perspective for analysing data. This perspective is inherently equipped with high-level abstractions and invites generalization. It enables the study of abstract geometric objects independent of a concrete presentation medium. Consequently, it allows us to interpret data directly through geometric primitives and transformations.
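
    One way to make this mapping explicit (the notation below is mine, not the speaker's): a chart can be viewed as the composition of an encoding of data records into an abstract coordinate space with a geometric realisation of those coordinates as primitives,

```latex
% Illustrative notation: s encodes the data domain D into coordinates,
% g realises coordinates as geometric primitives in G.
v = g \circ s, \qquad
s : D \to \mathbb{R}^{k}, \qquad
g : \mathbb{R}^{k} \to \mathcal{G}
```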

    The presentation illustrates the geometric approach using diverse examples and illustrations. In turn, we discuss the opportunities and challenges that arise from this perspective. For instance, a key benefit of this approach is that it allows us to consider seemingly disparate visualisation types in a unified framework. By systematically enumerating the design space of geometric representations, it is possible to trivially apply extensions and modifications, resulting in great expressiveness. The approach naturally extends to visualisation techniques for complex, multidimensional, multivariate data sets. However, the effectiveness of the resulting representations and the cognitive challenges of interpreting them require careful consideration.

  • Gareth Jones - Image Recognition for Non-Experts: From Google Cloud Vision to Tensorflow

    Gareth Jones
    Consultant
    Shine Solutions
    30 mins
    Talk
    Intermediate

    Displaying an inappropriate ad on a website can be a major headache for an ad network. Showing ads for a site's major competitor, or ads in a category at odds with the site's brand, for example, can cause embarrassment and lost revenue. With the selection of ads being largely algorithmic, it can be hard to set up rules to make sure this doesn't happen. You also don't want your first awareness of the problem to be a call from an angry CEO.

    This talk shows how we built a system that uses image recognition to detect Ad Breaches. Our first version makes use of Google's Cloud Vision API, a pre-trained service that recognises many categories of objects in images, along with some text recognition. I'll discuss how to use the Cloud Vision API in your applications, what it is good at, and what it is not.
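
    Label detection with the Cloud Vision API is only a few lines in recent versions of the google-cloud-vision Python client; this sketch (the bucket path is a placeholder) mirrors the kind of check described:

```python
# Ask the Cloud Vision API for labels on an ad image stored in GCS.
# Requires the google-cloud-vision package and application credentials.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
image = vision.Image()
image.source.image_uri = "gs://my-bucket/ad-creative.png"  # placeholder

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, round(label.score, 2))
```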

    I'll then look at using transfer learning to improve our system's ability to recognise Ad Breaches, using the popular TensorFlow library to build our own image recognition model. TensorFlow comes with several pre-trained models for image recognition - using these, I will show you how to build your own specialised image recognition models in a fraction of the time, and with a fraction of the input data, by re-using pre-trained layers from the best models out there. I'll investigate whether we can train a model to detect potential ad breaches from a small set of examples.
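
    A minimal transfer-learning sketch along these lines, assuming the Keras API bundled with TensorFlow (the frozen MobileNetV2 base and the two-class "breach"/"ok" head are my illustrative choices, not necessarily the talk's):

```python
# Reuse a pre-trained image model and train only a small new head on a
# handful of labelled examples.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # freeze the pre-trained layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation="softmax"),  # breach / ok
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(small_labelled_dataset, epochs=5)
```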

  • J. Rosenbaum - The Future of Art

    J. Rosenbaum
    Artist
    J. Rosenbaum
    30 mins
    Talk
    Intermediate

    Most people are aware of the impact machine learning will have on jobs, on the future of research and autonomous machines, but few seem to be aware of the future role machine learning could play in the creative arts, in visual art and music. What will art be like when artists and musicians routinely work collaboratively with machines to create new and interesting artworks? What can we learn from art created using neural networks, and what can we create? From the frivolous to the beautiful, what does art created by computers look like, and where can it take us?

    This talk will explore Magenta in TensorFlow and Neural Style in Caffe, Google Deep Dream, The Next Rembrandt, and convolutional neural networks. I will look into some of the beautiful applications of machine learning in art, and some of the ridiculous ones as well.

  • Bob Raman - Learnings from Building Data Products at Zendesk

    Bob Raman
    Engineering Manager
    Zendesk
    30 mins
    Case Study
    Intermediate

    In this talk you will learn about the team structure and process for building data products, drawn from the lessons of one of the teams that builds them at Zendesk. The Data Product team uses machine learning to build data products that reduce the cost of customer support for Zendesk's 100,000-odd customers.

    This talk will explain the journey of the Data Product team to date: its structure and how it has evolved, its challenges, and its successes and failures.

  • Richard Morwood - Video Game Analytics on AWS

    30 mins
    Demonstration
    Intermediate

    This talk will cover how to use AWS technologies to build an analytics system for video games which can be used to analyse player behaviour in near real-time. This system enables developers to identify trends in player difficulties, ease of use, and the highs and lows of player engagement, and to visualise these results in-game. The demo uses a serverless approach for data capture, processing and serving, built on AWS Mobile Analytics, Apache Spark on Databricks, Athena and Lambda.
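
    As an illustrative sketch of the capture step (my simplification, not the demo's actual code), a Python Lambda handler could batch gameplay events into S3 as newline-delimited JSON for Athena to query; the bucket name and event shape are placeholders:

```python
# Receive a batch of gameplay events and write them to S3 in a layout
# that Athena can query.
import json
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "game-telemetry"  # placeholder

def handler(event, context):
    records = event.get("events", [])  # hypothetical event shape
    body = "\n".join(json.dumps(r) for r in records)
    key = f"raw/{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
    return {"stored": len(records), "key": key}
```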

    Representing data in-game enables developers to see results in an environment they are already very familiar with and adjust level design to maximise engagement. Developers can use this information to track updated releases and easily identify whether changes have had the intended effect. These same techniques can be applied in many scenarios, including web tracking and click stream analytics.

  • Mick Semb Wever - Looking Behind Microservices to Brewer's Theorem, Externalised Replication, and Event Driven Architecture

    Mick Semb Wever
    Consultant
    The Last Pickle
    30 mins
    Talk
    Advanced

    Scaling data is difficult, scaling people even more so.

    Today Microservices make it possible to effectively scale both data and people by taking advantage of bounded contexts and Conway's law.
    But there's still a lot more theory coming together in our adventures with ever more data. Some of these ideas and theories are just history repeating, while others are newer concepts.

    These ideas can be seen in many Microservices platforms, within the services' code but also in the surrounding infrastructural tools we become ever more reliant upon.

    Mick will dive into these ideas with examples and offer recommendations from seven years of coding Microservices around 'big data' platforms. The presentation will be relevant to people wanting to move beyond REST-based synchronous platforms to eventually consistent asynchronous designs that aim towards the goal of linear scalability and 100% availability.

  • Joyce Wang - Covariate Shift - Challenges and Good Practice

    Joyce Wang
    Engineer
    Data61, CSIRO
    30 mins
    Talk
    Intermediate

    A fundamental assumption in supervised machine learning is that both the training and query data are drawn from the same population/distribution. However, in real-life applications this is very often not the case, as the query data distribution is unknown and cannot be guaranteed a priori. Selection bias in collecting training samples will change the distribution of the training data from that of the overall population. This problem is known as covariate shift in the machine learning literature, and using a machine learning algorithm in this situation can result in spurious and often over-confident predictions.

    Covariate shift is only detectable when we have access to query data. Visualising the training and query data is helpful for gaining an initial impression. Machine learning models can also be used to detect covariate shift: for example, a Gaussian Process can model how similar each query point is to the feature space of the training data, while a one-class SVM can flag query points as outliers relative to the training data. Both strategies detect query points that live in a different region of feature space from the training dataset.
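
    A minimal sketch of the one-class SVM strategy using scikit-learn (the data here is synthetic; the speakers' implementation details are not specified):

```python
# Fit a one-class SVM on the training features, then flag query points
# that fall outside the training domain.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(500, 2))   # training distribution
X_query = rng.normal(2.0, 1.0, size=(100, 2))   # shifted query distribution

detector = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(X_train)
in_domain = detector.predict(X_query) == 1      # -1 marks outliers

print(f"{(~in_domain).mean():.0%} of query points look out-of-domain")
```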

    We suggest two strategies to mitigate covariate shift: re-weighting training data, and active learning with probabilistic models.

    First, re-weighting the training data is the process of matching distribution statistics between the training and query sets in feature space. When the model is trained (and validated) on re-weighted data, it is expected to generalise better to the query data. However, this requires significant overlap between the training and query datasets.
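
    One standard way to implement such re-weighting (a common density-ratio trick, not necessarily the speakers' exact method) is to train a probabilistic classifier to separate training from query points and use its odds as importance weights:

```python
# Estimate importance weights w(x) ~ p_query(x) / p_train(x) with a
# classifier that discriminates training points from query points.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_train = rng.normal(0.0, 1.0, size=(500, 2))   # training features
X_query = rng.normal(1.0, 1.0, size=(200, 2))   # shifted query features

X = np.vstack([X_train, X_query])
z = np.r_[np.zeros(len(X_train)), np.ones(len(X_query))]  # 0=train, 1=query

clf = LogisticRegression().fit(X, z)

# Points that look like query data get more weight.
p = clf.predict_proba(X_train)[:, 1]
weights = p / (1.0 - p)

# downstream_model.fit(X_train, y_train, sample_weight=weights)
```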

    Second, there may be situations where we can acquire the labels of a small portion of the query set, potentially at great expense, to reduce the effects of covariate shift. Probabilistic models are required in this case because they indicate the uncertainty in their predictions. Active learning enables us to optimally select small subsets of query points that maximally shrink the uncertainty in our overall prediction.

  • Jesse Anderson - Processing Data of Any Size with Apache Beam

    45 mins
    Talk
    Intermediate

    Rewriting code as you scale is a terrible waste of time. You have perfectly working code, but it doesn’t scale. You really need code that works at any size, whether that’s a megabyte or a terabyte. Beam allows you to learn a single API and process data as it grows. You don’t have to rewrite at every step.

    In this session, we will talk about Beam and its API. We'll see how Beam executes on big data or small data. We'll touch on some of the advanced features that make Beam an interesting choice.

  • Lynn Langit / Denis Bauer - Cloud Data Pipelines for Genomics from a Bioinformatician and a Developer

    45 mins
    Keynote
    Intermediate

    Dr. Bauer and her team have been working to build genome-scale data pipelines that address the computational challenges and limits present in today's cancer genomics (bioinformatics) data workflows.

    Dr. Bauer and her team have built solutions using modern architectures, such as serverless (AWS Lambda) and customised machine learning on Apache Spark. AWS Community Hero and cloud architect Lynn Langit is also collaborating with the CSIRO team to push solutions at the cutting edge of bioinformatics research that best utilise advances in cloud technologies.

    In this demo-filled session, Lynn and Denis will discuss and demonstrate some of the latest cloud data pipelines they have been building together for the bioinformatics community.

  • Rachel Bunder - What is the Most Common Street Name in Australia?

    Rachel Bunder
    Data Scientist
    Solar Analytics
    30 mins
    Case Study
    Intermediate

    Finding the most common street name in Australia may sound relatively simple, but it quickly leads to other questions. What is a street name? Do The Avenue, The Grand Parade and The Serpentine all share the same name? And what is a street? Is the M5 Motorway a street? What about M5 Motorway Offramp?

    This talk will answer these questions using OpenStreetMap and Python: in particular, reading in and manipulating OpenStreetMap data with geopandas, exploring the structure of OpenStreetMap, creating models for parsing street names, and finally finding the most common street name in Australia.
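
    A hedged sketch of the counting step with geopandas (the extract path and the naive suffix-stripping are placeholders; the talk's point is precisely that real street-name parsing needs far more care):

```python
# Load road geometries from an OSM extract, normalise names, and count.
import geopandas as gpd

roads = gpd.read_file("australia-roads.gpkg")  # placeholder OSM extract

names = (roads["name"]
         .dropna()
         .str.lower()
         .str.replace(r"\s+(street|st|road|rd|avenue|ave)$", "", regex=True)
         .str.strip())

print(names.value_counts().head(10))
```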

  • Mick Semb Wever - The Network, The Kingmaker: Distributed Tracing and Zipkin

    Mick Semb Wever
    Consultant
    The Last Pickle
    30 mins
    Case Study
    Intermediate

    Adding Zipkin instrumentation to a codebase makes it possible to create one tracing view across an entire platform. This is the oft-mentioned "correlation identifier" recommended in the Microservices literature, for which so few solid open-source solutions are available. It is an aspect of monitoring distributed platforms akin to the separate concerns of aggregating metrics and logs.

    This talk will use the case of extending Apache Cassandra's tracing to use Zipkin, so as to demonstrate a single tracing view across an entire system: from the browser and HTTP, through a distributed platform, and into the database, down to seeks on disk. Put together, this makes it easy to identify which queries to a particular service took the longest and to trace back how the application made them.
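
    The Cassandra instrumentation itself is Java, but the span-and-transport model Zipkin uses is easy to sketch; here is an illustrative Python example using the py_zipkin library (service names, collector URL and sample rate are all placeholders):

```python
# Open a root span around a request, with a nested child span, and ship
# encoded spans to a Zipkin collector over HTTP.
import requests
from py_zipkin.zipkin import zipkin_span

def http_transport(encoded_span):
    # Forward the encoded span to the Zipkin collector (placeholder URL).
    requests.post("http://localhost:9411/api/v1/spans",
                  data=encoded_span,
                  headers={"Content-Type": "application/x-thrift"})

def query_database():
    with zipkin_span(service_name="web-frontend", span_name="query_database"):
        pass  # the traced work

def handle_request():
    with zipkin_span(service_name="web-frontend",
                     span_name="handle_request",
                     transport_handler=http_transport,
                     sample_rate=100.0):  # sample every trace
        query_database()
```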

    This presentation will raise the requirements and expectations DevOps places on infrastructural tools. It is for people who want to take their infrastructural tools to the next level, where the network is known as the kingmaker.

  • Christopher Biggs - From Little Things, Big Data Grow - IoT at Scale

    Christopher Biggs
    Director
    Accelerando Consulting
    30 mins
    Talk
    Intermediate

    The Internet of Things (IoT) is really about the ubiquity of data and the possibility of humans extending their awareness and reach globally, or further.
    IoT frees us from the tedium of physically monitoring or maintaining remote systems, but to be effective we must be able to rely on data being both accessible and comprehensible.

    This presentation covers three main areas of an IoT big data strategy:

    • The Air Gap - options (from the obvious to the inventive) for connecting wireless devices to the internet.
    • Tributaries - designing a scalable architecture for amalgamating IoT data flows into your data lake, covering recommended API and message-bus architectures (a sketch of one such tributary follows this list).
    • Management and visualisation - how to characterise and address IoT devices in ways that scale to continental populations, with examples of large scale installations to which I've contributed and how to cope with information overload.
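
    As a hedged illustration of the "Tributaries" item above, a minimal Python bridge could subscribe to device telemetry over MQTT and forward readings onto an internal message bus (the broker address and topic layout are illustrative):

```python
# Subscribe to per-device telemetry topics and forward readings onward.
import json
import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    reading = json.loads(msg.payload)
    # A real bridge would publish onto Kafka or another bus here;
    # printing stands in for that step.
    print(msg.topic, reading)

client = mqtt.Client()
client.on_message = on_message
client.connect("broker.example.com", 1883)   # placeholder broker
client.subscribe("devices/+/telemetry")      # one topic per device
client.loop_forever()
```
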
  • Yaniv Rodenski - Introduction to Apache Amaterasu (Incubating): A CD Framework for your Big Data Pipelines

    Yaniv Rodenski
    Developer
    Shinto
    30 mins
    Demonstration
    Advanced

    In the last few years, the DevOps movement has introduced groundbreaking approaches to the way we manage the lifecycle of software development and deployment. Today organisations aspire to fully automate the deployment of microservices and web applications with tools such as Chef, Puppet and Ansible. However, the deployment of data-processing pipelines remains a relic from the dark ages of software development.

    Processing large-scale data pipelines is the main engineering task of the Big Data era, and it should be treated with the same respect and craftsmanship as any other piece of software. That is why we created Apache Amaterasu (Incubating) - an open source framework that takes care of the specific needs of Big Data applications in the world of continuous delivery.

    In this session, we will take a close look at Apache Amaterasu (Incubating), a simple and powerful framework for building and dispensing pipelines. Amaterasu aims to help data engineers and data scientists compose, configure, test, package, deploy and execute data pipelines written using multiple tools, languages and frameworks.
    We will see what Amaterasu provides today, how it can help existing Big Data applications, and demo some of the new bits coming in the near future.

  • Aaron Morton - Scalable IoT with Apache Cassandra

    Aaron Morton
    Co-Founder & CEO
    The Last Pickle
    45 mins
    Talk
    Intermediate

    IoT and event-based systems can process huge volumes of data, which typically needs to be stored and read in near real time for event processing, as well as read in bulk to feed data-hungry learning systems. Apache Cassandra provides a high performance, scalable, and fault tolerant database platform with excellent support for the time series data models typically seen in IoT systems. Its millisecond (or better) latency can support systems that react to events in real time, while scalable bulk reads via batch processing systems such as Apache Hadoop and Apache Spark can support learning applications. These features, and more, make Cassandra an ideal persistence platform for modern data intensive, event driven systems.

    In this talk Aaron Morton, CEO at The Last Pickle, will discuss lessons learned using Cassandra for IoT systems. He will explain how Cassandra fits into the modern technology landscape and dive into data modelling for common IoT use cases, capacity planning for huge data loads, tuning for high performance, and integration with other data driven systems. Whether starting a new project or deep in the weeds on an existing system, attendees will leave with an understanding of how Apache Cassandra can help build robust infrastructure for IoT systems.
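
    A hedged sketch of the classic time-series data model pattern this refers to, using the Python cassandra-driver (keyspace, host and column names are placeholders): partitioning by device and day keeps partitions bounded, and clustering by timestamp makes recent reads fast:

```python
# Create a time-series table partitioned by (device, day), clustered by
# timestamp descending, and insert a reading.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("iot")  # placeholder keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        device_id text,
        day       date,
        ts        timestamp,
        value     double,
        PRIMARY KEY ((device_id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

session.execute(
    "INSERT INTO readings (device_id, day, ts, value) "
    "VALUES (%s, toDate(now()), toTimestamp(now()), %s)",
    ("sensor-42", 21.5),
)
```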
