Search at Scale: Using Machine Learning to Automate Content Metadata

Location: Sydney | Schedule: May 6th, 04:50 - 05:20 PM | Place: Red Room

For media organisations, reach is everything. Getting content in front of eyes and ears is their raison d'être.

Search plays a critical role in connecting audiences with t-1 content (yesterday's news, last week's podcast). However, with audience expectations conditioned by Google and others, it is challenging to deliver robust, scalable search that people actually want to use.

The relevance of your results is everything, and to produce relevant results you need good metadata for every object in your search index. With hundreds of thousands of content objects and an audience of millions, the ABC has unique challenges in this regard.

This talk will explore the ABC's use of Machine Learning (ML) to automatically generate meaningful metadata for pieces of content (audio, video and text). It covers AWS MLaaS for full transcripts of audio podcasts, and a platform developed in-house for NLP tasks such as entity recognition and automated document summarisation, as well as image-related tasks such as segmentation and tagging.


Outline/Structure of the Case Study

The presentation will begin with a brief overview of the ABC and the history of its search product.

It will then cover the design of a custom data pipeline that uses AWS Transcribe to generate transcripts for podcasts. This includes early experiments with in-house models (a modified DeepSpeech), why AWS was chosen, and the comparative quality of the transcripts themselves.
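
As an illustration of how transcript quality can be compared, here is a minimal sketch of word error rate (WER), a common measure for speech-to-text output. The abstract does not say which metric the ABC used, so treating WER as the comparison metric is an assumption, and the function name and sample strings are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: the reference is a human-checked transcript and the
    hypothesis is the output of a speech-to-text system.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical example: score two candidate transcripts against one reference.
reference = "the quick brown fox"
candidates = {
    "system_a": "the quick brown fox",
    "system_b": "the quik brown",
}
scores = {name: word_error_rate(reference, text) for name, text in candidates.items()}
```

A lower score means fewer word-level edits are needed to turn the system output into the reference, so scores from different systems are comparable only when they are measured against the same reference transcripts.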

The third part of the presentation will look at NLP (on the resulting transcripts, news article text and closed captions from video content) and image segmentation/tagging, as well as the platform the ABC has built to serve these models.
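
To give a feel for automated document summarisation, the sketch below implements a very simple frequency-based extractive summariser. It is not the ABC's platform, which the abstract does not describe in detail and which presumably relies on trained models. The stopword list, function name and sample text are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {
    "the", "a", "an", "and", "of", "to", "in", "on", "for", "was", "is",
    "are", "with", "that", "this", "it", "as", "at", "by", "from",
}


def _content_words(text: str) -> list:
    """Lower-case word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarise(text: str, max_sentences: int = 2) -> str:
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequency = Counter(_content_words(text))

    def score(sentence: str) -> float:
        tokens = _content_words(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(frequency[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in chosen)


# Hypothetical transcript snippet.
sample = (
    "Bushfires spread across the state. "
    "Bushfires forced the state to close roads. "
    "The weather was mild."
)
summary = summarise(sample, max_sentences=2)
```

The same summary text could then be stored as a metadata field alongside each content object in the search index.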

The final part will cover how we tested the effectiveness of these techniques, including our approach to A/B testing, and what did and didn't work.
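
One way an A/B test of search relevance can be evaluated is to compare click-through rates between the control and the variant with a two-proportion z-test. The abstract does not state which statistic or success metric the ABC used, so the sketch below is an assumption throughout, and the function name and counts are hypothetical.

```python
import math


def two_proportion_z_test(clicks_a: int, searches_a: int,
                          clicks_b: int, searches_b: int):
    """Return (z, two-sided p-value) for the difference in click-through rate.

    Assumption: group A is the control (existing metadata) and group B is the
    variant (ML-generated metadata); a click on a result counts as a success.
    """
    rate_a = clicks_a / searches_a
    rate_b = clicks_b / searches_b
    # Pooled rate under the null hypothesis that both groups behave the same.
    pooled = (clicks_a + clicks_b) / (searches_a + searches_b)
    standard_error = math.sqrt(
        pooled * (1 - pooled) * (1 / searches_a + 1 / searches_b)
    )
    if standard_error == 0:
        raise ValueError("cannot test when the pooled rate is exactly 0 or 1")
    z = (rate_b - rate_a) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 100 clicks from 1000 searches vs 150 clicks from 1000.
z_score, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

A small p-value suggests the difference in click-through rate is unlikely to be due to chance alone; it says nothing about whether the difference is large enough to matter for the product.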

Learning Outcome

The audience should walk away with an understanding of how Machine Learning can be used to solve the challenges in delivering good content search at scale. These challenges go beyond the usual 'how do we deliver the thing' and into 'how do we deliver relevant things, reliably'.

Target Audience

Data scientists, infrastructure engineers, product managers

Prerequisites for Attendees

High-level familiarity with the data/ML 'landscape' (what's possible with ML in 2019 and what some of the tools are), exposure to ABC content, curiosity!

Submitted 1 year ago