Finding duplicates in an image collection is a widespread problem. Many online businesses rely on image galleries to deliver a good customer experience and, consequently, to generate more revenue, so these galleries need to be of the highest quality. Duplicates in such galleries can degrade the customer experience. In addition, duplicates in the training, evaluation, or test sets can cause image-based machine learning models to produce misleading results.

Finding and removing duplicates is therefore an important requirement across several use cases. In this talk, we present imagededup, a Python package we built to find exact and near duplicates in an image collection. We will talk about the motivation behind building it, walk through its functionality, and give a demo.


Outline/Structure of the Talk

1. Motivation behind building imagededup for idealo Internet GmbH (5-10 mins)
- State of research on image deduplication as a computer vision problem
- E-commerce: companies that could benefit from the product
2. Components of an image deduplication system (20 mins)
- Query problem (Image search)
- Near and exact duplicates
- Hashing
- Use of convolutional neural networks as feature extractors
- Evaluation of deduplication quality
- Scaling problem
3. Discussing the Python package imagededup (10 mins)
- How to install/use it
- Design perspective while building the package
- Showing demo for finding exact and near duplicates
- Use of Cython for speed optimization
4. How to contribute (1 min)
5. Summary (1 min)
6. Q&A
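
To give a feel for the "Hashing" and "Near and exact duplicates" items in the outline, here is a minimal sketch of average hashing with a Hamming-distance threshold. It is an illustration only and not the imagededup implementation: the function names, the `max_distance` parameter, and the tiny 4x4 grayscale "images" (plain lists of pixel values) are all assumptions made for this example.

```python
from itertools import combinations


def average_hash(pixels):
    """Return a bit string: '1' where a pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if value > mean else "0" for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the bit positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def find_duplicates(images, max_distance=2):
    """Map each image name to the names whose hash is within max_distance bits.

    A distance of 0 means the hashes are identical (exact duplicates under
    this hash); small non-zero distances indicate near duplicates.
    """
    hashes = {name: average_hash(pixels) for name, pixels in images.items()}
    duplicates = {name: [] for name in images}
    for name_a, name_b in combinations(hashes, 2):
        if hamming_distance(hashes[name_a], hashes[name_b]) <= max_distance:
            duplicates[name_a].append(name_b)
            duplicates[name_b].append(name_a)
    return duplicates


# Hypothetical toy data: a checkerboard, an exact copy, a copy with one
# altered pixel, and an unrelated block pattern.
checkerboard = [
    [10, 200, 10, 200],
    [200, 10, 200, 10],
    [10, 200, 10, 200],
    [200, 10, 200, 10],
]
noisy = [row[:] for row in checkerboard]
noisy[0][0] = 250  # flip one dark pixel to bright

images = {
    "original.jpg": checkerboard,
    "exact_copy.jpg": [row[:] for row in checkerboard],
    "noisy.jpg": noisy,
    "different.jpg": [
        [200, 200, 10, 10],
        [200, 200, 10, 10],
        [10, 10, 200, 200],
        [10, 10, 200, 200],
    ],
}

duplicates = find_duplicates(images, max_distance=2)
for name, matches in duplicates.items():
    print(name, "->", matches)
```

Comparing every pair of hashes, as done here, grows quadratically with the collection size, which is one way to see why the outline lists scaling as a separate problem.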

Learning Outcome

1. Learn more about the image duplication problem and what it has to do with image classification

2. Learn how to use our library imagededup to find exact and near duplicates in an image collection

3. Learn about the basic components of an image deduplication system

Target Audience

Machine Learning Engineers, Data Scientists, Software Engineers

Submitted 1 year ago
