YOW! Data 2021 Day 1
Wed, May 12
Timezone: Australia/Sydney (AEST)
08:45
Session Overviews and Introductions - 15 mins
09:00
Hilary Mason - Playing with Words: Building Products with NLP
Imagine machines that interact with us using the same interface we use to interact with each other — spoken language! Recent progress in NLP has opened up new possibilities for language-based systems. In this talk, we'll explore the recent history of language models and highlight novel applications of statistical and deep learning approaches. Then, we'll explore emerging products that automate, generate, and create using these models, and discuss the implications for building them, including safety, ethics, and the invention of new design metaphors. Finally, we'll speculate about where this might take us in the next few years. Can machines ... play?
09:45
Break / Q&A with Hilary Mason - 25 mins
10:10
Jennifer Marsman - Using AI to Mine Unstructured Research Papers to Fight COVID-19
There is an overwhelming amount of information (and misinformation) about COVID-19. How can we use AI to better understand this disease? In this session, we take an open dataset of research papers on COVID-19 and apply several machine learning techniques (named entity recognition of medical terms, finding semantically similar words, contextual summarization, and knowledge graphs) which can help first responders and medical professionals better find and make sense of the research they need. We will dive into the techniques used and share the code repository, so developers will walk away with an understanding of how to build a similar solution using Cognitive Search.
10:40
Break / Q&A with Jennifer Marsman - 25 mins
11:05
Hien Luu - Scaling the Machine Learning Platform at DoorDash
DoorDash’s mission is to grow and empower local economies. DoorDash’s business is a 3-sided marketplace composed of Dashers, consumers, and merchants.
As DoorDash's business grows, it is essential to establish a centralized ML platform to accelerate the ML development process and to power the numerous ML use cases. We are making good progress, but we are still in the early days of building out our ML platform.
This presentation will detail the DoorDash ML platform journey that includes the way we establish a close collaboration and relationship with the Data Science community, how we intentionally set the guardrails in the early days to enable us to make progress, the principled approach of building out the ML platform while meeting the needs of the Data Science community, and finally the technology stack and architecture that powers billions of predictions per day and supports a diverse set of ML use cases. They include search ranking, recommendation, fraud detection, food delivery assignment, food delivery arrival time prediction, and more.
11:35
Break / Q&A with Hien Luu - 25 mins
12:00
Julie Amundson - Evolving the ML Platform organisation at Netflix: a case study
Do you wish there was a Machine Learning model to tell you how to structure your ML teams? So do I! While we're waiting for that, I'll share the story of how the ML Platform organisation evolved at Netflix. Although this story is specific to our own journey to expand Netflix ML investments, there are a few lessons learned along the way that you'll be able to relate to. There are several factors going into org structure that we'll discuss, including: the specialty and skillsets of ML practitioners, the variety and depth of ML use cases, who's responsible for the data, the ownership model as ML projects go to production, and how the underlying Platforms are situated. I look forward to sharing and hearing your own thoughts afterward!
12:30
Break / Q&A with Julie Amundson - 25 mins
12:55
Lunch - 30 mins
13:25
Savin Goyal - Taming the Long Tail of Industrial ML Applications
Data Science usage at Netflix goes much beyond our eponymous recommendation systems. It touches almost all aspects of our business - from optimizing content delivery and informing buying decisions to fighting fraud. Our unique culture affords our data scientists extraordinary freedom of choice in ML tools and libraries, all of which results in an ever-expanding set of interesting problem statements and a diverse set of ML approaches to tackle them. Our data scientists, at the same time, are expected to build, deploy, and operate complex ML workloads autonomously without the need to be significantly experienced with systems or data engineering. In this talk, I will discuss some of the challenges involved in improving the development and deployment experience for ML workloads. I will focus on Metaflow, our ML framework, which offers useful abstractions for managing the model’s lifecycle end-to-end, and how a focus on human-centric design positively affects our data scientists' velocity.
13:55
Break / Q&A with Savin Goyal - 25 mins
14:20
Will Radford - Assisting design with machine learning in Canva’s editor
Our team at Canva focuses on building features that make design simple, enjoyable and collaborative for more than 55 million people across the globe. For many who haven’t used design tools, starting with a blank page can be intimidating, which is where Canva’s library of more than 500,000 templates comes in. Unfortunately, switching between templates once required retyping your content. To fix this, we created a feature for our users to bring their text with them while exploring the library. The initial challenge was that the template metadata the feature relied on was scarce and costly for our in-house designers to annotate.
We wanted to predict metadata for our designers inside the Canva editor, but had to consider a number of real-world engineering tradeoffs. First, we’ll explain the user problem and provide a glimpse inside some of our templates and the metadata that enables text transfer. Then, we’ll explain what features we extracted for our scikit-learn random forest classifier and how we combined it with a designer-in-the-loop to bootstrap enough batch-predicted metadata to launch an MVP version of the feature. Finally, we’ll explain how we decided to reimplement model storage and inference in our TypeScript frontend stack. Creating this new feature was a joint effort made possible by a multidisciplinary team of designers, engineers and data scientists. We’re looking forward to sharing some of the lessons we learned along the way to shipping this smart feature.
14:50
Break / Q&A with Will Radford - 25 mins
15:15
Xuanyi Chew - Yepoko Lessons For Machine Learning on Small Data
Let's face it, in most companies, the amount of good data available to perform machine learning is very small. Most data are small data. So how can we do good machine learning on small data?
15:45
Break / Q&A with Xuanyi Chew - 25 mins
16:10
Mikio Braun - Lessons learned from building ML products
Building products based on machine learning requires much more than taking an ML algorithm and deploying it in the cloud. Based on my experience as a researcher, in ecommerce, and as an independent consultant, I talk about some of the lessons learned about what is needed beyond pure ML algorithms to successfully build products with ML. How do you identify customer problems that can be tackled with ML? What does the technology landscape around ML look like? How do you set up teams and organizations to be "AI ready"? I'll be sharing some of my observations and insights.
16:40
Break / Q&A with Mikio Braun - 25 mins
17:05
Kendra Vant - Do you want ML with that? When to say yes and why to say no.
In this talk I'll speak about why you should only use ML when you really need to, some techniques we've used successfully at Xero to help cut through the noise/analysis paralysis, and why it might help to consider approaching the build of an ML inside the system the same way you might decide what car to buy.
17:50
Break / Q&A with Kendra Vant - 25 mins
YOW! Data 2021 Day 2
Thu, May 13
08:45
Session Overviews and Introductions - 15 mins
09:00
Sid Anand - Building & Operating Autonomous Data Streams
The world we live in today is fed by data. From self-driving cars and route planning to fraud prevention to content and network recommendations to ranking and bidding, the world we live in today not only consumes low-latency data streams, it adapts to changing conditions modeled by that data.
While the world of software engineering has settled on best practices for developing and managing both stateless service architectures and database systems, the larger world of data infrastructure still presents a greenfield opportunity. To thrive, this field borrows from several disciplines: distributed systems, database systems, operating systems, control systems, and software engineering, to name a few.
Of particular interest to me is the subfield of data streams, specifically regarding how to build high-fidelity nearline data streams as a service within a lean team. To build such systems, human operations is a non-starter. All aspects of operating streaming data pipelines must be automated. Come to this talk to learn how to build such a system soup-to-nuts.
09:45
Break / Q&A with Sid Anand - 25 mins
10:10
Nathan Wallace - Data Rainbows - select * from cloud;
Drowning in a lake? Stuck inside a warehouse? See your data in a different light! Postgres Foreign Data Wrappers provide SQL queries to live cloud data - all the structure and much lighter weight. In this session, we'll explore the potential of Data Rainbows for growing cloud environments and outline the challenges of working with data you can see but can't quite touch.
10:40
Break / Q&A with Nathan Wallace - 25 mins
11:05
Zhamak Dehghani - Data Mesh; A principled introduction
For over half a century organizations have assumed that data is an asset to collect more of, and data must be centralized to be useful. These assumptions have led to centralized and monolithic architectures, such as data warehousing and data lake, that limit organizations' ability to innovate with data at scale.
Data Mesh is an alternative architecture and organizational structure for managing analytical data. Its objective is enabling access to high quality data for analytical and machine learning use cases - at scale. It's an approach that shifts the data culture, technology and architecture:
- from centralized collection and ownership of data to domain-oriented connection and ownership of data
- from data as an asset to data as a product
- from proprietary big platforms to an ecosystem of self-serve data infrastructure with open protocols
- from top-down manual data governance to a federated computational one.
In this talk, Zhamak will introduce the principles underpinning Data Mesh and architecture.
11:35
Break / Q&A with Zhamak Dehghani - 25 mins
12:00
Matteo Merli - Apache Pulsar and the Streaming Ecosystem
Apache Pulsar is an open-source distributed pub-sub messaging system, developed under the stewardship of the Apache Software Foundation.
This talk will show how its unique architecture enables Pulsar to seamlessly support both streaming and messaging use cases in a single unified platform.
We will also show where Pulsar fits with the broader ecosystem of data streaming technologies and all the interoperability that is available out of the box, making it a particularly good choice for supporting any kind of data platform, where versatility, interoperability and scalability are the key requirements.
12:30
Break / Q&A with Matteo Merli - 25 mins
12:55
Lunch - 30 mins
13:25
Jesse Anderson - Foundations of Data Teams
Successful data projects are built on solid foundations. What happens when we’re misled or unaware of what a solid foundation for data teams means? When a data team is missing or understaffed, the entire project is at risk of failure.
This talk will cover the importance of a solid foundation and what management should do to fix it. To do this I’ll be sharing a real-life analogy to show how we can be misled and what that means for our success rates.
We will talk about the teams in data teams: data science, data engineering, and operations. This will include detailing what each is, does, and the unique skills for the team. It will cover what happens when a team is missing and the effect on the other teams.
The analogy will come from my own experience with a house that had major cracks in the foundation. We were going to simply remodel the kitchen. We were never told about the cracks, and the house needed a completely new foundation. In a similar way, most managers think adding in advanced analytics such as machine learning is a simple addition (remodel the kitchen). However, management isn’t ever told that you need all three data teams to do it right. Instead, management has to go all the way back to the foundation and fix it. If they don’t, the house (team) will crumble underneath the strain.
13:55
Break / Q&A with Jesse Anderson - 25 mins
14:20
Caito Scherr - Sweet Streams are Made of These: Data Driven Development for Stream Processing
The strength of a powerful stream processing engine is in how fast, and how much data it can process. This naturally adds complexity to existing integration points and can lead to development overhead. Luckily, there is a set of data-driven development principles that are built to alleviate precisely these challenges. This talk will go over what these are and how to apply them at various points throughout the development process, using real-world successes (and failures!) as examples. Although the examples are for highly complex systems, this talk will be beginner-friendly and applicable to non-streaming use cases.
14:50
Break / Q&A with Caito Scherr - 25 mins
15:15
Rimma Shafikova - Analyzing a Terabyte of Game Data
A couple of terabytes of data is not impressive by today's standards. A hard drive of that capacity costs about a hundred dollars. But things quickly get complicated when one needs to draw insights from a corpus of unstructured game scenarios that are increasing at a rate of a terabyte a year.
You will hear some lessons learned by a data scientist wearing the extra hat of a data engineer on this fun side project. The talk will cover topics from using the Apache Spark distributed computing framework and optimizing Delta tables to making sense of the resulting mega-dataset with graph theory and an interactive Streamlit application.
15:45
Break / Q&A with Rimma Shafikova - 25 mins
16:10
Simon Aubury - Islands in the Stream - What country music can teach us about event driven systems
Event driven systems are all the rage. It's with good reason we're witnessing a transformation with businesses adopting event driven systems. Building systems around an event-driven architecture is a powerful pattern for creating awesome data intensive applications. But before we sail away to another world, let's avoid the common pitfalls of designing & running event driven systems.
Islands in the Stream: what Kenny Rogers, and the wisdom of a country music classic, can teach us about event driven systems.
16:40
Break / Q&A with Simon Aubury - 25 mins
17:05
Kalinda Griffiths - Rights, Sovereignty and Governance in Official Reporting: Considerations in the Use of Aboriginal and Torres Strait Islander data
The realisation for Indigenous people in Australia to be counted in official statistics occurred in 1967. The identification of Indigenous people in Australia in national data highlights a range of historical and contemporary issues that require our attention. This includes how Indigenous people have been defined and by whom, as well as how identification is operationalised in official data collections. Furthermore, the completeness and accuracy of Indigenous people identified in the data and the impact this has on the measurement of health and wellbeing must also be taken into account. Official national reporting of Indigenous people is calculated using data from censuses, vital statistics, and existing administrative data collections and/or surveys. In alignment with human rights standards, individuals in Australia can opt to self-identify as ‘Indigenous’ in the data. Australia’s colonial context in which Aboriginal and Torres Strait Islander data is derived results in considerations about the sovereign rights of Indigenous people globally in the use of data and how this can be actioned through data governance processes.