Extracting text of various sizes, shapes, and orientations from images that contain multiple objects is an important problem in many contexts, particularly e-commerce, augmented-reality assistance systems in natural scenes, and content moderation on social media platforms. Text in an image can be a richer and more accurate source of data than human input, and it can feed applications such as attribute extraction and profanity checks.

Typically, text extraction is done in two stages:

Text detection: this module identifies the regions of the input image that contain text.

Text recognition: given those regions, this module outputs the raw text they contain.

In this session, I will talk about character-level text detection for both regular and arbitrarily shaped text. I will then discuss the CRNN-CTC network and why CTC loss is needed to obtain raw text from images.
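To make the role of CTC more concrete, here is a minimal sketch of greedy CTC decoding, the step that turns per-time-step network outputs into raw text. It is an illustration only, not the model from the talk: the function names, the choice of index 0 as the blank symbol, and the toy alphabet are all assumptions.

```python
from itertools import groupby

BLANK = 0  # assumption: class index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_scores, alphabet):
    """Decode per-time-step class scores into text.

    frame_scores: list of score lists, one per time step.
    alphabet: string where alphabet[i - 1] is the character for class i.
    """
    # 1. Pick the highest-scoring class at every time step.
    best_path = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    # 2. Collapse consecutive repeats of the same class.
    collapsed = [k for k, _ in groupby(best_path)]
    # 3. Drop the blank symbol and map class indices to characters.
    return "".join(alphabet[k - 1] for k in collapsed if k != BLANK)


def one_hot(index, size):
    """Hypothetical helper that fakes a confident network output."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Toy alphabet: 1 = h, 2 = e, 3 = l, 4 = o. The path reads "hh-ell-loo".
path = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4]
scores = [one_hot(k, 5) for k in path]
print(ctc_greedy_decode(scores, "helo"))  # prints "hello"
```

The blank between the two runs of "l" is what lets the decoder keep a genuine double letter, which is one reason the blank symbol is needed when the output has no per-character alignment to the input frames.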


Outline/Structure of the Talk

  • Motivation for Text extraction from Images: 2 mins
  • Defining a pipeline for Text Extraction: 2 mins
  • Deep Learning techniques for Text Detection: 5 mins
  • Understanding receptive fields in CNN: 2 mins
  • Data preparation for training text recognition model: 1 min
  • CRNN-CTC model for Text Recognition: 6 mins
  • Use cases of text extraction in different domains: 2 mins
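For the receptive-field item in the outline, the standard layer-by-layer recurrence can be sketched as below. This is an illustrative helper, not code from the talk; the function name is hypothetical, and the sketch assumes plain convolution or pooling layers with no dilation.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) of one output unit.

    layers: list of (kernel_size, stride) tuples, ordered input to output.
    Assumes no dilation.
    """
    rf = 1    # receptive field of a single input pixel
    jump = 1  # distance in input pixels between adjacent units
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field
        jump *= stride             # strides spread units further apart
    return rf


# Three stacked 3x3 convolutions with stride 1.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # prints 7
# 3x3 conv, 2x2 pooling with stride 2, then another 3x3 conv.
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # prints 8
```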

Learning Outcome


  1. Understanding the need for Text Extraction from Images.
  2. Deep Learning Techniques for detecting highly oriented text.
  3. Understanding of receptive fields in CNNs.
  4. Theoretical understanding of the CRNN-CTC network for text recognition and the need for CTC loss.
  5. Usage of Text Extraction in various fields/domains.

Target Audience

Data Scientists, Machine Learning Engineers, Researchers.

Prerequisites for Attendees

Basic understanding of CNNs and LSTMs.

Submitted 2 years ago


    SOURADIP CHAKRABORTY / Rajesh Shreedhar Bhat - Learning Generative models with Visual attentions in the field of Image Captioning

    20 Mins

    Image caption generation is the task of generating a descriptive and appropriate sentence for a given image. For humans, the task looks straightforward: summarise the image in a single sentence that captures the interactions between its components. Replicating this in an artificial system, however, is very challenging. Attention helps by letting the network focus on the relevant encoder features as input to the decoder at each time step. In this session, we show how the attention mechanism improves the performance of language translation tasks in an encoder-decoder framework.

    Before attention was introduced in sequence-to-sequence settings, the entire input sequence was encoded into a single thought/context vector, which was used to initialize the decoder that generates the output sequence. The major shortcoming of this approach was that the encoder features were not weighted according to the output being generated, which confused the network and produced inadequate output sequences.

    Inspired by the outstanding results of attention mechanisms in machine translation and other seq2seq tasks, there have been a few advancements in computer vision using attention techniques. In this session, we incorporate visual attention mechanisms to generate relevant captions from images using a deep learning framework.


    SOURADIP CHAKRABORTY / Sayak Paul - Implicit Data Modelling using Self-Supervised Transfer Learning

    20 Mins

    Transfer learning is especially helpful when data is scarce, when limited bandwidth does not allow training deep models from scratch, and so on. In computer vision, ImageNet pre-training has been widely successful across a number of different tasks, image classification being the most popular one. That success has been possible mainly because of the ImageNet dataset, a collection of images spanning 1000 labels. This is where a severe limitation comes in: the need for labeled data. In this session, we take a deep dive into self-supervised learning, which allows models to exploit the implicit labels of input data. In the first half of the session, we will cover the basics of transfer learning, its successes, and its challenges. We will then formulate the problem that self-supervised learning tries to address. In the second half of the session, we will discuss the ABCs of self-supervised learning along with some examples. We will conclude with a short code walk-through and a discussion of the challenges of self-supervised learning.