Data quality is of the utmost importance in Master Data Management solutions. Data curation and standardisation involve multiple iterations of exchange between customers and their vendors. Rules written for validation and correction pile up, and their maintenance gets costlier with time. A rule set can run from 500+ to 20K data quality rules, many of which become outdated but cannot be removed without risking regressions. To address these challenges, we turned to machine learning to autocorrect human errors and standardise content across the products being on-boarded.
This talk is about our journey to fix the problem at hand, which started with a simple spell-check algorithm based on edit distance (Levenshtein distance) and moved on to more complex language models. We used state-of-the-art approaches such as a char-to-char sequence model with an encoder-decoder, autoencoders, attention-based transformers and even BERT. The results from these models kept getting better, but did not reach the quality we expected. These experiments with the latest techniques helped us build a strong intuition for, and understanding of, language models.
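To make the starting point concrete, here is a minimal sketch of an edit-distance spell check. It is an illustration only, not our production code: the function names `levenshtein` and `correct`, the `max_distance` threshold and the sample vocabulary are all assumptions made up for this example.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in `b` so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the empty prefix of `a` and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def correct(word: str, vocabulary: Iterable[str], max_distance: int = 2) -> str:
    """Return the closest vocabulary entry within `max_distance` edits,
    or the word unchanged when nothing is close enough.
    `max_distance` is a hypothetical threshold chosen for illustration."""
    best_word, best_distance = word, max_distance + 1
    for candidate in vocabulary:
        distance = levenshtein(word.lower(), candidate.lower())
        if distance < best_distance:
            best_word, best_distance = candidate, distance
    return best_word


# Hypothetical product-attribute vocabulary, for illustration only.
vocabulary = ["stainless", "steel", "plastic"]
corrected = correct("stainles", vocabulary)
```

A lookup like this compares each word against every vocabulary entry and knows nothing about context, which is one reason to look beyond it towards sequence models.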
I will also touch upon data collection, its challenges and our workarounds. The key takeaway will be performance comparisons of the various techniques and approaches from our experiments (in the context of our use case), something I had longed to see before starting on this journey. I will also share the intuitions I built along the way and common mistakes to be aware of.
If anything blocks you today from trying new techniques, or keeps you wondering how and where to start, or if there is anything else I could help you with, please leave a comment and I will work to answer it in this talk. (If the talk is not accepted, please reach out to me on LinkedIn and I will be happy to help.)