AI/ML under the covers of modern Master Data Management
Data quality is of utmost importance in Master Data Management solutions. Data curation and standardisation involve multiple iterations of exchange between customers and their vendors. Rules written for validations and corrections pile up, and their maintenance gets costlier with time. Data quality rules can number anywhere from 500 to 20,000+, many of which become outdated but cannot be taken out without risking regressions. To address these challenges, we turned to machine learning to enable autocorrection of human errors and standardisation of the content across on-boarded products.
This talk is about our journey to fix the problem at hand, starting with a simple spell-check algorithm based on edit (Levenshtein) distance and moving on to more complex language models. We used state-of-the-art approaches such as a char-to-char sequence model with an encoder-decoder architecture, autoencoders, attention-based transformers and even BERT. The results from these models kept improving, but were not good enough for the quality expected. These experiments with the latest techniques helped us build a strong intuition for and understanding of language models.
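As a concrete starting point, the edit-distance spell check mentioned above can be sketched as below; the vocabulary and misspelt token are illustrative assumptions, not our actual product data:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def correct(token: str, vocabulary: list) -> str:
    """Return the vocabulary word with the smallest edit distance to the token."""
    return min(vocabulary, key=lambda w: levenshtein(token, w))
```

For example, `correct("stanless", ["stainless", "steel", "aluminium"])` returns `"stainless"`. The limitation that pushed us towards language models is visible here already: the correction is context-free and only as good as the vocabulary.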
I will also touch upon data collection, its challenges and our workarounds. The key takeaway will be a performance comparison of the various techniques and approaches from the experiments (in the context of our use case), something I had once longed to see before starting on this journey. I will also share my experience of the intuitions learned and common mistakes to be aware of.
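One common workaround when real noisy/clean pairs are scarce is synthetic data generation: injecting character-level typos into clean strings to produce (noisy, clean) training pairs. The edit operations and parameters below are illustrative assumptions, not our production pipeline:

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def add_noise(text: str, p: float, rng: random.Random) -> str:
    """Apply one random edit per character position with probability p."""
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if rng.random() < p:
            op = rng.choice(["delete", "substitute", "transpose", "insert"])
            if op == "delete":
                i += 1                      # drop this character
                continue
            if op == "substitute":
                out.append(rng.choice(ALPHABET))
                i += 1
                continue
            if op == "transpose" and i + 1 < len(chars):
                out.extend([chars[i + 1], chars[i]])
                i += 2
                continue
            # insert (also the fallback when transposing is impossible)
            out.append(rng.choice(ALPHABET))
            out.append(chars[i])
            i += 1
        else:
            out.append(chars[i])
            i += 1
    return "".join(out)

def make_pairs(clean_strings, p=0.1, seed=42):
    """Build (noisy, clean) supervision pairs from clean reference strings."""
    rng = random.Random(seed)
    return [(add_noise(s, p, rng), s) for s in clean_strings]
```

Seeding the generator keeps the corpus reproducible across experiments, and the noise rate `p` can be tuned to roughly match the error rate observed in real on-boarded data.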
If there is anything that blocks you today from trying new techniques, keeps you wondering how and where to start, or anything else I could help you with, please leave a comment and I will work the answers into this talk (if the talk gets accepted; if not, please reach out to me on LinkedIn and I will be happy to help).
Outline/Structure of the Talk
- Introduction
  - Data Quality Challenges (2 mins)
- Data Standardisation use case
  - Requirements
  - Expected solution (2 mins)
- Dataset
  - Data collection
  - Synthetic Data Generation
  - Data Preprocessing (3 mins)
- Language Modelling techniques
  - Char2char seq model with encoder-decoder architecture: approach details, results, intuitions (what worked, what did not work)
  - Transformers with attention, beam search, and BERT: approach details, results, intuitions (what worked, what did not work) (8 mins)
- Adopted Solution
  - Description of the chosen approach
  - Results summary (2 mins)
- Future Work (1 min)
- Questions and Answers (2 mins)
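To make the decoding step in the outline concrete, here is a model-agnostic sketch of beam search; `log_probs` stands in for any trained character-level model (seq2seq, transformer, ...), and the toy model is an assumption for illustration, not the model from the talk:

```python
import math

def beam_search(log_probs, vocab, width, max_len, eos="<eos>"):
    """Return up to `width` highest-scoring sequences as (tokens, log-score)."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:  # finished beams are carried over as-is
                candidates.append((seq, score))
                continue
            for tok, tok_lp in zip(vocab, log_probs(seq)):
                candidates.append((seq + [tok], score + tok_lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
        if all(seq and seq[-1] == eos for seq, _ in beams):
            break
    return beams

# Toy stand-in for a trained model: prefers 'a' twice, then ends the sequence.
VOCAB = ["a", "b", "<eos>"]
def toy_model(prefix):
    if len(prefix) < 2:
        return [math.log(0.7), math.log(0.2), math.log(0.1)]
    return [math.log(0.1), math.log(0.1), math.log(0.8)]
```

With this toy model, `beam_search(toy_model, VOCAB, width=3, max_len=5)` returns `["a", "a", "<eos>"]` as its top sequence; setting `width=1` reduces the search to greedy decoding.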
Learning Outcome
This is a case study from an ongoing live project where we leverage ML to improve customer experiences. The key takeaways from this talk are insights into:
1. Applying/customising state-of-the-art techniques to match the requirements of the use case
2. Performance Comparisons of the techniques on our use case.
3. The language modelling intuitions we learned from these experiments.
As data standardisation is a domain-agnostic area of improvement, many of our learnings can be applied directly to varied domains such as commerce, retail and manufacturing. It is worth your time!
Target Audience
Data Scientists, Machine Learning Practitioners, and Data Science Enthusiasts with hands-on NLP experience
Prerequisites for Attendees
A basic understanding of NLP and language modelling is a prerequisite.
I am a Data Scientist @Persistent Systems Limited. I have 14+ years of industry experience and for the last 2 years have been working on core NLP. I help teams leverage Machine Learning to build cutting-edge solutions for our customers.
I am also a Data Science Lead with WomenWhoCode Pune Chapter group. We organise sessions on machine learning/AI to help the community ramp up on the latest skills.
Submitted 3 years ago