Extracting text of various sizes, shapes, and orientations from images containing multiple objects is an important problem in many contexts: e-commerce, augmented-reality assistance systems in natural scenes, content moderation on social media platforms, and more. Text extracted from an image can be a richer and more accurate source of data than human input, and it can feed several applications such as attribute extraction and profanity checks.

Text extraction is typically achieved in two stages:

Text detection: this module identifies the regions of the input image where text is present.

Text recognition: given the regions of the image where text is present, this module extracts the raw text from them (a minimal pipeline sketch follows below).
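
To make the two-stage pipeline concrete, here is a minimal sketch in Python. `detect_text_regions` and `recognize_text` are hypothetical placeholders for a detection model and a recognition model; they are not a specific library API.

```python
# Minimal sketch of a two-stage text-extraction pipeline.
# detect_text_regions() and recognize_text() are hypothetical placeholders
# for a text detection model and a text recognition model (e.g. CRNN-CTC).

from typing import List, Tuple

import numpy as np


def detect_text_regions(image: np.ndarray) -> List[Tuple[int, int, int, int]]:
    """Return bounding boxes (x, y, w, h) of regions likely to contain text."""
    raise NotImplementedError  # stage 1: text detection model goes here


def recognize_text(crop: np.ndarray) -> str:
    """Return the raw text contained in a single cropped text region."""
    raise NotImplementedError  # stage 2: text recognition model goes here


def extract_text(image: np.ndarray) -> List[str]:
    """Run detection, crop each detected region, and recognize its text."""
    texts = []
    for x, y, w, h in detect_text_regions(image):
        crop = image[y:y + h, x:x + w]
        texts.append(recognize_text(crop))
    return texts
```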

In this session, I will talk about character-level text detection for detecting both regular and arbitrarily shaped text. I will then discuss the CRNN-CTC network and the need for CTC loss to obtain the raw text from the images.
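
As an illustration of why CTC is attractive for recognition, the snippet below shows how a CRNN-style output sequence can be trained with PyTorch's `nn.CTCLoss` without any frame-level character alignment; the tensor shapes, class count, and random data are arbitrary assumptions made only for the example.

```python
# Illustrative use of CTC loss for a CRNN-style recognizer (PyTorch).
# The CRNN emits a per-timestep distribution over characters plus a "blank";
# CTC marginalizes over all possible alignments, so no frame-level character
# labels are required. Shapes and sizes below are arbitrary for illustration.

import torch
import torch.nn as nn

T, N, C = 32, 4, 37  # time steps, batch size, classes (36 characters + blank at index 0)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)  # CRNN output: (T, N, C)

targets = torch.randint(1, C, (N, 10), dtype=torch.long)       # target character indices (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)           # valid timesteps per sample
target_lengths = torch.randint(5, 11, (N,), dtype=torch.long)   # true text length per sample

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # trainable end-to-end without character-level alignment
```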


Outline/Structure of the Talk

  • Motivation for Text extraction from Images: 2 mins
  • Defining a pipeline for Text Extraction: 2 mins
  • Deep Learning techniques for Text Detection: 5 mins
  • Understanding receptive fields in CNN: 2 mins
  • Data preparation for training text recognition model: 1 min
  • CRNN-CTC model for Text Recognition: 6 mins
  • Use cases of text extraction in different domains: 2 mins

Learning Outcome


  1. Understanding the need for Text Extraction from Images.
  2. Deep Learning Techniques for detecting highly oriented text.
  3. Understanding of receptive fields in CNNs.
  4. Theoretical understanding of the CRNN-CTC network for text recognition and the need for CTC loss.
  5. Usage of Text Extraction in various fields/domains.

Target Audience

Data Scientists, Machine Learning Engineers, Researchers.

Prerequisites for Attendees

Basic understanding of CNNs and LSTMs.

Submitted 7 months ago

Public Feedback

Suggest improvements to the Author
  • By Natasha Rodrigues  ~  7 months ago

    Hi Rajesh,

    Thanks for your proposal! Could you update the Outline/Structure section of your proposal with a time-wise breakup of how you plan to use the 20 mins for the topics you've highlighted?

    To help the program committee understand your presentation style, can you provide a link to your past recording or record a small 1-2 mins trailer of your talk and share the link to the same?

    Also, in order to ensure the completeness of your proposal, we suggest you go through the review process requirements.

    Thanks,
    Natasha

    • By Rajesh Shreedhar Bhat  ~  7 months ago

      I have added the time-wise breakup for the talk.

      I have the PPT which I used for my talk at the Kaggle Days meetup. Will that suffice, or should I still prepare a video?

      • By Natasha Rodrigues  ~  7 months ago

        Hi Rajesh, 

        Thanks for the update. Regarding the video, kindly share a video of you presenting, so that the program committee can understand your presentation style. (The video can also be a 1-2 min clip.)

        • By Rajesh Shreedhar Bhat  ~  6 months ago

          Hi Natasha,

          When will we get the decision on our proposal?

          • By Natasha Rodrigues  ~  6 months ago

            Hi Rajesh,

            The program committee is currently in the process of reviewing the proposals. We will keep you informed on the decisions taken.

            Thank you,

            Natasha 

        • By Rajesh Shreedhar Bhat  ~  7 months ago

          Sure Natasha.


  • SOURADIP CHAKRABORTY / Rajesh Shreedhar Bhat - Learning Generative models with Visual attentions in the field of Image Captioning

    20 Mins
    Talk
    Intermediate

    Image caption generation is the task of generating a descriptive and appropriate sentence for a given image. For humans, the task looks straightforward: summarize the image in a single sentence that incorporates the interactions between the various components present in the image. But replicating this in an artificial framework is very challenging. Attention addresses this problem, as it allows the network to look over the relevant encoder features as input to the decoder at each time step. In this session, we show how the attention mechanism enhances the performance of language translation tasks in an encoder-decoder framework.

    Before the attention mechanism, in sequence-to-sequence settings the entire input sequence was encoded into a single thought/context vector, which was used to initialize the decoder and generate the output sequence. The major shortcoming of this approach was that no weight was given to individual encoder features in the context of the sequence being generated, which confounded the network and resulted in inadequate output sequences.

    Inspired by the outstanding results of attention mechanisms in machine translation and other seq2seq tasks, there have been a number of advances in computer vision using attention techniques. In this session, we incorporate visual attention mechanisms to generate relevant captions for images using a deep learning framework.
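
    As a rough illustration of the idea (not necessarily the exact model used in the talk), the sketch below computes additive (Bahdanau-style) attention weights over encoder features for a single decoder state; all dimensions and the random inputs are illustrative assumptions.

    ```python
    # Minimal additive (Bahdanau-style) attention sketch (PyTorch).
    # At each decoding step, the decoder state is scored against every encoder
    # feature, and a softmax over the scores produces the context vector fed
    # to the decoder. Dimensions below are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class AdditiveAttention(nn.Module):
        def __init__(self, enc_dim: int, dec_dim: int, attn_dim: int):
            super().__init__()
            self.w_enc = nn.Linear(enc_dim, attn_dim)
            self.w_dec = nn.Linear(dec_dim, attn_dim)
            self.v = nn.Linear(attn_dim, 1)

        def forward(self, enc_feats, dec_state):
            # enc_feats: (N, L, enc_dim), dec_state: (N, dec_dim)
            scores = self.v(torch.tanh(self.w_enc(enc_feats)
                                       + self.w_dec(dec_state).unsqueeze(1)))  # (N, L, 1)
            weights = F.softmax(scores, dim=1)                                  # attention over L positions
            context = (weights * enc_feats).sum(dim=1)                          # (N, enc_dim)
            return context, weights.squeeze(-1)


    # Example: 4 images, 49 spatial encoder features of size 512, decoder state of size 256.
    attn = AdditiveAttention(enc_dim=512, dec_dim=256, attn_dim=128)
    context, weights = attn(torch.randn(4, 49, 512), torch.randn(4, 256))
    ```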

  • SOURADIP CHAKRABORTY / Sayak Paul - Implicit Data Modelling using Self-Supervised Transfer Learning

    20 Mins
    Talk
    Intermediate

    Transfer learning is especially helpful when there is a scarcity of data, limited bandwidth that might not allow training deep models from scratch, and so on. In the world of computer vision, ImageNet pre-training has been widely successful across a number of different tasks, image classification being the most popular one. All of that success has been possible mainly because of the ImageNet dataset, a collection of images spanning 1000 labels. This is where a stark limitation comes in: the need for labeled data. In this session, we take a deep dive into the world of self-supervised learning, which allows models to exploit the implicit labels of the input data. In the first half of the session, we will cover the basics of transfer learning, its successes, and its challenges. We will then formulate the problem that self-supervised learning tries to address. In the second half of the session, we will discuss the ABCs of self-supervised learning along with some examples. We will conclude with a short code walk-through and a discussion of the challenges of self-supervised learning.
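
    As a toy illustration of what "implicit labels" can mean (a generic example, not necessarily the method covered in the session), the snippet below uses rotation prediction as a self-supervised pretext task: the label is derived from the input itself, so no manual annotation is needed. The tiny backbone and random images are assumptions made only for the example.

    ```python
    # Toy self-supervised pretext task: rotation prediction (PyTorch).
    # The "label" (how much each image was rotated) comes from the data itself,
    # so the network can be trained without any human-provided annotations.

    from typing import Tuple

    import torch
    import torch.nn as nn


    def rotate_batch(images: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Rotate each image by a random multiple of 90 degrees; the multiple is the label."""
        labels = torch.randint(0, 4, (images.size(0),))
        rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2))
                               for img, k in zip(images, labels)])
        return rotated, labels


    # A deliberately tiny backbone, only to keep the example self-contained.
    backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                             nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 4))

    images = torch.randn(8, 3, 32, 32)          # unlabeled images
    inputs, labels = rotate_batch(images)       # implicit labels from the pretext task
    loss = nn.CrossEntropyLoss()(backbone(inputs), labels)  # trained without manual labels
    ```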