The problem of finding duplicates in an image collection is widespread. Many online businesses rely on image galleries to deliver a good customer experience and, consequently, to generate more revenue, so these galleries need to be of the highest quality. Duplicates in such galleries can degrade the customer experience. In addition, image-based machine learning models can produce misleading results when duplicates are present in the training, evaluation, or test sets.

Finding and removing duplicates is therefore an important requirement across several use cases. In this talk, we present imagededup, a Python package we built to find exact and near duplicates in an image collection. We will cover the motivation behind building it and its functionality, and give a demo.

 
 

Outline/Structure of the Talk

1. Motivation behind building this for idealo Internet GmbH (5-10 mins)
- Research state of image deduplication as a problem in Computer Vision
- E-commerce: talking about companies that could potentially benefit from the product
2. Components of an image deduplication system (20 mins)
- Query problem (Image search)
- Near and exact duplicates
- Hashing
- Use of convolutional neural networks as feature extractors
- Evaluation of deduplication quality
- Scaling problem
3. Discussing the Python package imagededup (10 mins)
- How to install/use it
- Design perspective while building the package
- Showing demo for finding exact and near duplicates
- Use of Cython for speed optimization
4. How to contribute (1 min)
5. Summary (1 min)
6. Q&A
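
The hashing component listed in part 2 can be illustrated with a small sketch. It assumes difference hashing (one common perceptual hashing scheme) applied to a tiny grayscale grid given as nested lists. The names `dhash` and `hamming_distance` are hypothetical, and this is not imagededup's implementation.

```python
def dhash(pixels):
    """Difference hash: one bit per pair of horizontally adjacent pixels.

    `pixels` is a grayscale image given as a list of rows (hypothetical
    input format chosen for this sketch).
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            # Bit is 1 when brightness decreases from left to right.
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which two integer hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


original = [
    [10, 20, 30, 40, 50],
    [50, 40, 30, 20, 10],
    [10, 50, 10, 50, 10],
    [90, 80, 85, 70, 75],
]

# Uniform brightening keeps every left/right comparison unchanged.
brightened = [[value + 5 for value in row] for row in original]

# A small local edit flips a single comparison.
retouched = [row[:] for row in original]
retouched[3][2] = 75

# Mirroring each row changes several comparisons.
different = [list(reversed(row)) for row in original]

print(hamming_distance(dhash(original), dhash(brightened)))  # 0
print(hamming_distance(dhash(original), dhash(retouched)))   # 1
print(hamming_distance(dhash(original), dhash(different)))   # 8
```

A distance of 0 corresponds to an exact duplicate under this hash, a small distance to a near duplicate, and a large distance to a different image.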

Learning Outcome

1. Learn more about the image deduplication problem and what it has to do with image classification

2. Learn how to use our library imagededup to find exact and near duplicates in an image collection

3. Learn about the basic components of an image deduplication system

Target Audience

Machine Learning Engineers, Data Scientists, Software Engineers

Submitted 6 months ago

Public Feedback

  • Kuldeep Jiwani
    By Kuldeep Jiwani  ~  5 months ago

    Hi Dat and Tanuj,

    For the sake of the program committee, can you also briefly explain how imagededup works? Also, how is it different from applying standard CNN pooling and finding similarity over the embedding vectors? What is the quantitative measure used to define near-similar/duplicate images?

    • Tanuj Jain
      By Tanuj Jain  ~  5 months ago

      Here are the answers:

      Q1: Can you also briefly explain how imagededup works?

      A1: Not sure about the meaning of the question. It's a Python package, and there will be a demo of its functionality. We will talk about the different components of an image deduplication system and how each of these components is handled by imagededup (along with some optimizations).

      Q2: Also, how is it different from applying standard CNN pooling and finding similarity over the embedding vectors?

      A2: There's no difference. We use a MobileNet pretrained on the ImageNet dataset, with the output of the final pooling layer giving the embeddings. It's a simple approach, but quite effective (simplicity does not imply ineffectiveness ;) ). Additionally, there are 4 hashing methods that can be used. However, CNN/hashing is only the feature extraction component of imagededup. There's also an optimized search component and an evaluation framework, along with the flexibility to choose thresholds for each of these algorithms.

      Q3: What is the quantitative measure used to define near-similar/duplicate images?

      A3: There is no fixed quantitative measure. Users choose it according to what they consider a duplicate for their own use case and dataset. The thresholds for each method give more or less leeway in that definition. E.g., choosing a small 'max_distance_threshold' for the hashing methods moves one closer to selecting only exact duplicates, while a bigger threshold moves one closer to near duplicates (and, of course, false positives). For the CNN method, the semantics are reversed, since we work with similarity there instead of distance.
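
The threshold semantics described in A3 can be sketched as follows. This is a minimal illustration, not the package's code. It assumes Hamming distance between hex-encoded hashes and cosine similarity between embedding vectors. The helper names and the parameter name `min_similarity_threshold` are assumptions made for this example; only 'max_distance_threshold' appears in the reply above.

```python
import math


def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two equal-length hex hash strings."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)


def is_duplicate_by_hash(hash_a, hash_b, max_distance_threshold):
    # Distance semantics: SMALLER threshold means a STRICTER match.
    return hamming_distance(hash_a, hash_b) <= max_distance_threshold


def is_duplicate_by_embedding(vec_a, vec_b, min_similarity_threshold):
    # Similarity semantics: LARGER threshold means a STRICTER match.
    return cosine_similarity(vec_a, vec_b) >= min_similarity_threshold


# Two made-up 64-bit hashes that differ in exactly 2 bits.
hash_a = "ffd8e0c0b0a09080"
hash_b = "ffd8e0c0b0a09083"

# Two made-up embedding vectors pointing in almost the same direction.
emb_a = [1.0, 0.0, 1.0]
emb_b = [1.0, 0.1, 0.9]

print(is_duplicate_by_hash(hash_a, hash_b, max_distance_threshold=0))   # False
print(is_duplicate_by_hash(hash_a, hash_b, max_distance_threshold=2))   # True
print(is_duplicate_by_embedding(emb_a, emb_b, min_similarity_threshold=0.9))    # True
print(is_duplicate_by_embedding(emb_a, emb_b, min_similarity_threshold=0.999))  # False
```

Raising the distance threshold admits more pairs as duplicates, whereas raising the similarity threshold admits fewer, which is the reversal mentioned for the CNN method.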

      For more details, please refer to the package documentation (https://idealo.github.io/imagededup/).

      • Kuldeep Jiwani
        By Kuldeep Jiwani  ~  5 months ago

        Hi Tanuj,

        Thanks for the detailed reply, much appreciated. This should be good.

  • Sujoy Roychowdhury
    By Sujoy Roychowdhury  ~  6 months ago

    I was looking at your package. Your GitHub link covers multiple algorithms. Can you please detail which algorithms you will cover, along with a time breakdown of the steps in your talk?

    • Tanuj Jain
      By Tanuj Jain  ~  5 months ago

      Hi,

      The talk description has been updated.

      • Sujoy Roychowdhury
        By Sujoy Roychowdhury  ~  5 months ago

        As @natasha has mentioned, talks are for 20 minutes.

        • Dat Tran
          By Dat Tran  ~  5 months ago

          I talked to Naresh before submitting to the CfP. He said we can do a 45-min talk, as our topic is quite sophisticated.

  • Natasha Rodrigues
    By Natasha Rodrigues  ~  6 months ago

    Hi Dat/Tanuj,

    Thanks for your proposal! Could you update the Outline/Structure section of your proposal with a time-wise breakdown of how you plan to use the 20 mins for the topics you've highlighted?

    Also, in order to ensure the completeness of your proposal, we suggest you go through the review process requirements.

    Thanks,

    Natasha

    • Tanuj Jain
      By Tanuj Jain  ~  5 months ago

      Hi,

      The talk description has been updated (45 mins).