Image caption generation is the task of generating a descriptive and appropriate sentence for a given image. For humans, the task looks straightforward: summarise the image in a single sentence that captures the interactions between its various components. Replicating this in an artificial framework, however, is very challenging. Attention addresses this by letting the network focus on the relevant encoder features as input to the decoder at each time step. In this session, we show how the attention mechanism improves the performance of language translation tasks in an encoder-decoder framework.

Before attention mechanisms were introduced in sequence-to-sequence settings, the entire input sequence was encoded into a single thought/context vector, which was used to initialize the decoder that generates the output sequence. The major shortcoming of this approach was that the encoder features were not weighted according to the part of the output sequence being generated, which confounded the network and resulted in inadequate output sequences.

Inspired by the outstanding results of attention mechanisms in machine translation and other seq2seq tasks, there have been a few advancements in computer vision using attention techniques. In this session, we incorporate visual attention mechanisms to generate relevant captions from images using a deep learning framework.
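
As a rough illustration of the idea, the sketch below computes attention weights over a handful of image-region features and blends them into a context vector for one decoder step. It is a minimal sketch under stated assumptions: the names `attend`, `features` and `query` are hypothetical, the feature values are toy numbers, and plain dot-product scoring is used for brevity rather than the learned scoring function a real captioning model would train.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, query):
    """Weight each encoder feature by its relevance to the decoder state.

    features: list of region vectors from the encoder (toy values here)
    query:    current decoder state, same dimensionality (an assumption)
    """
    # Relevance score per region: dot product with the decoder state.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the region features.
    dim = len(features[0])
    context = [sum(w * feat[i] for w, feat in zip(weights, features))
               for i in range(dim)]
    return context, weights

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three image regions
query = [2.0, 0.0]                               # decoder state at one step
context, weights = attend(features, query)
```

Because the weights are recomputed at every decoding step, each generated word can draw on a different part of the image instead of a single fixed context vector.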


Outline/Structure of the Talk

  • Intro to Image Captioning - 2 mins
  • Real-life use cases of image captioning - 2 mins
  • History & Evolution of Image Captioning - 4 mins
  • Sequence to Sequence Learning in NLP - 2 mins
  • Show and Tell: Sequence to Sequence Learning in Vision - 3 mins
  • Show Attend and Tell: Image Captioning with Attention - 5 mins
  • Image Captioning demo - 2 mins

Learning Outcome

  • A drill-down into past approaches to image captioning and the intuitive reasons for their failure
  • An understanding of receptive fields in CNNs
  • The motivation for using attention in an encoder-decoder framework
  • The different variations of the attention mechanism in an encoder-decoder framework and the intuitions behind them

Target Audience

Our target audience is both industry ML/DL practitioners and academic researchers in the fields of computer vision and captioning. Image captioning has many applications in information retrieval, search engines, recommendation engines, etc., and is therefore a highly relevant topic for industrial ML/DL practitioners. Using the attention mechanism in this field has opened up many research directions, which makes it an extremely fascinating topic for academicians and researchers.

Prerequisites for Attendees

Our session will cover the basics of image captioning, explaining both the earlier methods and the attention mechanisms in detail, with real-life examples and use cases. Anybody with a basic understanding of probability, linear algebra, machine learning, and computer vision will be able to follow and gain from our session.



Submitted 3 years ago

  • Rajesh Shreedhar Bhat

    Rajesh Shreedhar Bhat / Pranay Dugar - Text Extraction from Images using deep learning techniques

    20 Mins

    Extracting text of various sizes, shapes and orientations from images containing multiple objects is an important problem in many contexts, especially e-commerce, augmented reality assistance systems in natural scenes, content moderation on social media platforms, etc. Text extracted from an image can be a richer and more accurate source of data than human input, and can be used in several applications such as Attribute Extraction, Profanity Checks, etc.

    Typically, text extraction is achieved in 2 stages:

    Text detection: this module helps to know the regions in the input image where the text is present.

    Text recognition: given the regions in the image where the text is present, this module gives the raw text out of it.

    In this session, I will talk about character-level text detection for detecting both regular and arbitrarily shaped text. I will then discuss the CRNN-CTC network and the need for CTC loss to obtain the raw text from images.
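
To give a feel for what the CTC output looks like, here is a minimal sketch of greedy CTC decoding: take the best class per frame, collapse consecutive repeats, then drop the blank symbol. It covers only the decoding step, not the CTC loss itself, and the names `ctc_greedy_decode` and `BLANK`, the blank index 0, and the toy per-frame scores are all assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumption: class index 0 is reserved for the CTC blank symbol

def ctc_greedy_decode(frame_scores, alphabet):
    """Greedy CTC decoding over per-frame class scores.

    frame_scores: one list of class scores per time step
    alphabet:     characters for class indices 1..len(alphabet)
    """
    # 1. Pick the highest-scoring class in every frame.
    best = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]
    # 2. Collapse consecutive repeats of the same class.
    collapsed = [k for k, _ in groupby(best)]
    # 3. Remove blanks and map the remaining indices to characters.
    return "".join(alphabet[i - 1] for i in collapsed if i != BLANK)

# Toy one-hot scores whose best path is: a a _ a b b _
path = [1, 1, 0, 1, 2, 2, 0]
frame_scores = [[1.0 if k == p else 0.0 for k in range(4)] for p in path]
decoded = ctc_greedy_decode(frame_scores, "abc")  # "aab"
```

The blank between the two runs of `a` is what allows a genuinely doubled letter to survive the collapse step.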


    SOURADIP CHAKRABORTY / Sayak Paul - Implicit Data Modelling using Self-Supervised Transfer Learning

    20 Mins

    Transfer learning is especially helpful when data is scarce, when limited resources do not allow deep models to be trained from scratch, and so on. In the world of computer vision, ImageNet pre-training has been widely successful across a number of different tasks, image classification being the most popular one. All of that success has been possible mainly because of the ImageNet dataset, which is a collection of images spanning 1000 labels. This is where a severe limitation comes in - the need for labeled data. In this session, we want to take a deep dive into the world of self-supervised learning, which allows models to exploit the implicit labels of input data. In the first half of the session, we will cover the basics of transfer learning, its successes, and its challenges. We will then formulate the problem that self-supervised learning tries to address. In the second half of the session, we will discuss the ABCs of self-supervised learning along with some examples. We will conclude with a short code walk-through and a discussion of the challenges of self-supervised learning.
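
One way to picture "implicit labels" is a rotation-prediction pretext task, where the label is simply how many quarter turns were applied to an unlabeled image. The sketch below is an illustrative assumption rather than the method covered in the session: the helper names `rotate90` and `rotation_pretext_examples` are hypothetical, and the image is a tiny grid of numbers.

```python
def rotate90(image):
    """Rotate a 2D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def rotation_pretext_examples(image):
    """Build (rotated_image, label) pairs from one unlabeled image.

    The label is the number of clockwise quarter turns (0 to 3), so the
    supervision comes from the data itself rather than from annotators.
    """
    examples = []
    current = [list(row) for row in image]  # copy so the input is untouched
    for label in range(4):
        examples.append((current, label))
        current = rotate90(current)
    return examples

image = [[1, 2],
         [3, 4]]  # toy stand-in for an unlabeled image
examples = rotation_pretext_examples(image)
```

A model trained to predict these labels has to pick up on object shape and orientation, and the resulting features can then be transferred to a downstream task that has few labeled examples.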