AI/ML under the covers of modern Master Data Management

Data quality is of the utmost importance in Master Data Management (MDM) solutions. Data curation and standardisation involve multiple rounds of exchange between customers and their vendors. Rules written for validation and correction pile up, and their maintenance gets costlier with time. A customer's data quality rules can number anywhere from 500 to 20,000; many become outdated but cannot be removed without risking regressions. To address these challenges, we turned to machine learning to autocorrect human errors and standardise content across the products being on-boarded.

This talk is about our journey to fix this problem, which began with a simple spell-check algorithm based on edit (Levenshtein) distance and progressed to more complex language models. We used state-of-the-art approaches such as a character-to-character sequence model with an encoder-decoder architecture, autoencoders, attention-based transformers and even BERT. The results from these models kept improving but did not reach the quality we expected. These experiments with the latest techniques helped us build a strong intuition for, and understanding of, language models.
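
The edit-distance spell check named above as the starting point can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the project: the function names, the `VOCABULARY` list and the `max_distance` threshold are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions
    needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


# Hypothetical customer vocabulary; a real one would come from curated records.
VOCABULARY = ["glove", "leather", "black", "pair"]


def correct(word: str, vocabulary: list[str], max_distance: int = 2) -> str:
    """Return the closest vocabulary word if it is within `max_distance`
    edits, otherwise return `word` unchanged. Assumes a non-empty vocabulary."""
    best = min(vocabulary, key=lambda v: (levenshtein(word, v), v))
    return best if levenshtein(word, best) <= max_distance else word


print(correct("glv", VOCABULARY))
print(correct("lether", VOCABULARY))
```

A lookup like this only considers one word at a time and needs a fixed vocabulary, which is one plausible reason for moving on to sequence models that take the surrounding context into account.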

I will also touch upon data collection, its challenges and our workarounds. The key takeaway will be performance comparisons of the various techniques and approaches from our experiments (in the context of our use case), something I had longed to see before starting on this journey. I will also share the intuitions we learned and common mistakes to be aware of.

If anything blocks you today from trying new techniques, or keeps you wondering how and where to start, or if there is anything else I could help with, please leave a comment and I will work to answer it in this talk (if the talk is accepted; if not, please reach out to me on LinkedIn and I will be happy to help).


Outline/Structure of the Talk

  • Introduction
  • Data Quality Challenges ..... (2 mins)
  • Data Standardisation use case
    - Requirements
    - Expected solution ..... (2 mins)
  • Dataset
    - Data collection
    - Synthetic Data Generation
    - Data Preprocessing ..... (3 mins)
  • Language Modelling techniques:
    • Char2char seq model with encoder-decoder architecture
      Approach details, Results, Intuitions: What worked, what did not work
    • Transformers with attention, with beam search, with BERT
      Approach details, Results, Intuitions: What worked, what did not work ..... (8 mins)
  • Adopted Solution
    - Describe the chosen approach
    - Results summary ..... (2 mins)
  • Future Work ..... (1 min)
  • Questions and Answers ..... (2 mins)

Learning Outcome

This is a case study from an ongoing live project where we leverage ML to improve customer experience. The key takeaways from this talk are insights into:

1. Applying/customizing state-of-the-art techniques to match the requirement of the use case

2. Performance comparisons of the techniques on our use case.

3. The language modelling intuitions we learned from these experiments.

As data standardisation is a domain-agnostic area of improvement, many of our learnings can be applied directly to varied domains such as commerce, retail and manufacturing. It is worth your time!

Target Audience

Data Scientists, Machine Learning Practitioners, Data Science Enthusiasts with hands-on NLP experience

Prerequisites for Attendees

A basic understanding of NLP and language modelling is a prerequisite.

Submitted 3 months ago

Public Feedback

  • Ravi Balasubramanian
    By Ravi Balasubramanian  ~  1 month ago

    Thanks Bharati for your proposal. 

    1. Can you please share details on the specific use case or application where the data quality solution is being applied?

    2. Most of the steps you are covering during the talk are a little generic. Would it be possible to illustrate the steps through the application lens, so that the audience can relate to them more easily?

    3. If there is sensitivity in sharing the real use-case, please provide an equivalent example for us.


    Thanks a lot,


    • Bharati Patidar
      By Bharati Patidar  ~  1 month ago

      Hi Ravi,

      Apologies for the late reply. Thanks for your questions.


      There isn't any sensitivity in sharing the real use cases. The biggest challenge in a Master Data Management (MDM) product is that it doesn't enforce any schema; customers bring in their own schema/data model. Thus, ML features provided by MDM must inherently be generic enough to plug into any data model/schema, yet effective enough to address the challenges that each customer faces.


      I will try to explain this with a hypothetical scenario.

      Say Amazon is using an MDM solution to store its master records. Amazon wants to onboard new sellers onto its website and expects third-party sellers to fill out an Excel sheet to onboard their products into the Amazon catalogs. The data in these sheets is a cause for concern: it needs multiple cycles of human review and discussion with the seller before onboarding can be completed. The product description alone can have multiple problems, such as non-standard abbreviations (‘glove’ could be written as ‘glv’), typos, and numbers spelled out instead of written as digits. These descriptions need to be standardised to match the conventions defined by Amazon. The scale is large: thousands or even millions of products could be imported. Reviewing these standards manually takes weeks. And as these conventions are specific to Amazon, Amazon has to write and maintain the validation code itself.

      As MDM providers, we are working to help Amazon and other customers do away with validation code. We leverage ML to learn the conventions a given customer wants and enforce them on the data being imported for that customer.


      More than the varied use-cases we address in MDM, my focus would be to highlight the results from the comparisons of NLP techniques applied and experiments performed. Hope this helps.

  • Ashay Tamhane
    By Ashay Tamhane  ~  3 months ago

    Thanks Bharati for an interesting proposal. To further motivate the problem, may I suggest that you give one actual example from your experience to reinforce this point: "Rules written for validations and corrections, pile up and their maintenance gets costlier with time."

    • Bharati Patidar
      By Bharati Patidar  ~  3 months ago

      Thanks Ashay for your suggestion. There are many examples, depending on which data is being brought into MDM. I have updated the proposal to list the number of rules that customers have today, which is why they are moving away from hand-crafted rules.

      I will state some very simple ones -

      1. The column containing zip code from a given geography cannot be more than 5 digits.

      2. The Cost field should not contain the '$' sign, if found strip it off.

      3. Expected Integer values should not have numbers spelled out.

      4. Abbreviations and other variations of a word should be normalised (say, 'gloves' written as 'glv').
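
A minimal sketch of how rules like the four above might look as hand-written code. Python, the function names and the two lookup tables are assumptions made for illustration; none of it is taken from an actual customer rule base.

```python
import re

# Hypothetical lookup tables; real ones would be customer-specific and far larger.
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}
ABBREVIATIONS = {"glv": "gloves"}


def valid_zip(value: str) -> bool:
    """Rule 1: a zip code is made of digits only and has at most 5 of them."""
    return re.fullmatch(r"\d{1,5}", value.strip()) is not None


def clean_cost(value: str) -> str:
    """Rule 2: strip any '$' sign from a cost field."""
    return value.replace("$", "").strip()


def to_integer(value: str) -> int:
    """Rule 3: convert a spelled-out number to an integer, else parse digits."""
    text = value.strip().lower()
    if text in NUMBER_WORDS:
        return NUMBER_WORDS[text]
    return int(text)


def expand_abbreviations(text: str) -> str:
    """Rule 4: replace known abbreviations with their standard form."""
    return " ".join(ABBREVIATIONS.get(token.lower(), token) for token in text.split())


print(valid_zip("560001"))
print(clean_cost("$12.50"))
print(to_integer("seven"))
print(expand_abbreviations("leather glv black"))
```

Each rule is trivial on its own, but every new convention or abbreviation needs another table entry or function, which is how rule sets grow to the sizes quoted in the proposal.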

      Please let me know if this helps. Thanks

      • Kuldeep Jiwani
        By Kuldeep Jiwani  ~  2 months ago

        Hi Bharati,

        Can you elaborate further on how you use the ML techniques mentioned in the abstract to solve the problems you mentioned here?

        • Bharati Patidar
          By Bharati Patidar  ~  2 months ago

          Hi Kuldeep,

          The examples I quoted were specific to the comment on the kind of rules being created today to check data quality. The ML techniques in the abstract were used to solve the problem of standardising, say, a product description using language modelling. This standardisation is also implemented using rules today; it is one of the many rules being written by customers that we are trying to replace with ML. Hope this helps. Thanks!
