Transfer learning is especially helpful when data is scarce, when limited bandwidth rules out training deep models from scratch, and in similar situations. In computer vision, ImageNet pre-training has been widely successful across a number of tasks, image classification being the most popular one. That success has been possible mainly because of the ImageNet dataset, a collection of images spanning 1000 labels. This is where a stern limitation comes in: the need for labeled data. In this session, we take a deep dive into self-supervised learning, which allows models to exploit the implicit labels of input data. In the first half of the session, we will cover the basics of transfer learning, its successes, and its challenges. We will then formulate the problem that self-supervised learning tries to address. In the second half, we will discuss the ABCs of self-supervised learning along with some examples. We will conclude with a short code walk-through and a discussion of the challenges of self-supervised learning.


Outline/Structure of the Talk


  • The ImageNet pre-training era in Computer Vision
    • Success [5 mins]
      • Case studies
        • Less data
        • Architectural decision making
        • Faster prototyping
      • Novel architectures
        • ResNet
        • MobileNet
        • EfficientNet
    • Challenges [5 mins]
      • What if the knowledge from the pre-trained models does not make much sense for the data on which transfer learning is being applied? Ex: medical imaging
      • What if the data distribution on which the pre-training took place differs from the target data?
      • What if there are not any explicit discrete labels?
        • Manual labeling and active learning can be time-consuming and might not scale well with small corps.

So the central question becomes: how can we leverage the inherent patterns of the given data?

  • Introducing self-supervised learning [10 mins]
    • Pretext tasks
    • Downstream tasks
    • Examples:
      • One from NLP
      • One from Vision
    • Remarkable results:
    • The idea of training models with masked inputs and having the models learn to unmask them (courtesy of LeCun)
    • Short code demo
      • Train an image inpainting model and use its knowledge for an image classification task
    • Challenges
      • Pretext invariant representation learning (PIRL)
      • Loss estimation
        • Two approaches: FixMatch, SimCLR
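
To make the idea of "implicit labels" in a masked-input pretext task concrete, here is a minimal sketch in plain Python. It is not the code from the session demo: the function name `make_inpainting_pair`, the `patch` and `fill` parameters, and the toy 4x4 image are all assumptions made for illustration. It only shows how a training pair can be built from an unlabeled image by hiding a patch and keeping the hidden pixels as the target.

```python
import random


def make_inpainting_pair(image, patch=2, fill=0, rng=random):
    """Build one (masked_image, target_patch, position) training example.

    `image` is a 2D list of pixel values. The target is cut out of the
    image itself, so no human-provided label is involved.
    (Hypothetical helper, for illustration only.)
    """
    height, width = len(image), len(image[0])
    top = rng.randrange(height - patch + 1)
    left = rng.randrange(width - patch + 1)

    # The "label" is the patch we are about to hide.
    target = [row[left:left + patch] for row in image[top:top + patch]]

    # Copy the image, then overwrite the chosen patch with the fill value.
    masked = [list(row) for row in image]
    for r in range(top, top + patch):
        for c in range(left, left + patch):
            masked[r][c] = fill
    return masked, target, (top, left)


# Toy 4x4 "image" with pixel values 1..16 (assumed data, not a real dataset).
image = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
masked, target, (top, left) = make_inpainting_pair(image, rng=random.Random(0))
```

In an inpainting setup along the lines of the outline above, a model would be trained to predict `target` from `masked`, and its learned features would then be reused for a downstream task such as image classification.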

Learning Outcome

1. Concept of Transfer Learning and the primary ImageNet models

2. Architectural Decision Making and Faster Prototyping

3. Formulating Energy-efficient models using Novel architectures like ResNet

5. Applying transfer learning in self-supervised learning tasks

6. Code walk-through of the entire process

Target Audience

Our target audience is both industry ML and DL practitioners and academic researchers in the fields of computer vision and semi-supervised learning. The topic is highly relevant to both. In industry, many problems come with few labels, even though most of them are framed as supervised problems. In such scenarios, our approach might be very helpful and relevant. Moreover, self-supervised learning is an emerging and very active area of research, which should specifically interest researchers and academicians.

Prerequisites for Attendees

Our session ranges from the basics of transfer learning, including the well-known ImageNet models, to transfer learning using the self-supervised approach. Anybody with a basic understanding of probability, linear algebra, machine learning, and computer vision will be able to follow and gain from our session.


Submitted 1 year ago
