Google AI stirred up the language processing domain with the introduction of the Transformer architecture and BERT models. Models built on Transformer-based architectures have outperformed earlier approaches and set new state-of-the-art (SOTA) results on NLP tasks such as text classification, question answering, and text summarization. BERT is said to improve 10% of Google search results, the single largest improvement brought in by any approach Google has tried in recent years. In this talk, we aim to demystify BERT and help industry practitioners gain a deeper understanding of it.


The ImageNet moment for NLP arrived with Bidirectional Encoder Representations from Transformers (BERT). The introduction of BERT created a wave in language research, and variations of BERT established new state-of-the-art (SOTA) results on all standard NLP tasks, records that had previously been held mostly by techniques using pre-trained word vectors. BERT and its variants demonstrate, with results widely validated across the research community, that pre-trained models can achieve SOTA results on a range of NLP tasks.

Owing to its success in academia, industry practitioners started using open-source BERT-based models in their own applications, for tasks ranging from named entity recognition (NER) and text classification to search recommendations and opinion mining. It is common to find applied scientists, ML engineers, or even researchers using BERT-based models as a black box for their tasks. In some cases the results are better than expected, but in many cases applying the model directly, with only a black-box understanding, does not produce encouraging results.

In this talk, we aim to get under the skin of BERT and help the audience build a better understanding of its internal workings.


Outline/Structure of the Talk

  1. Introduction (1-2 mins)
  2. State of language processing before BERT (2 mins)
  3. Why introduction of BERT was the ImageNet moment for NLP (3 mins)
  4. Discussion on BERT architecture (3 mins)
  5. Promises of BERT (2 mins)
  6. Practical limitations of BERT base models (2 mins)
  7. Solutions to practical problems offered by variants of BERT (3 mins)
  8. Tips on how we can utilize BERT in practice (3 mins)
  9. Questions from audience

Learning Outcome

  • State of language processing before BERT
  • Why introduction of BERT was the ImageNet moment for NLP
  • Deep Dive into BERT architecture
  • Promises of BERT
  • Practical limitations of BERT base models
  • Solutions to practical problems offered by variants of BERT
  • Tips on how we can utilize BERT in practice

Target Audience

NLP practitioners, ML engineers, data scientists, senior leaders of AI/ML/DS groups

Prerequisites for Attendees

Basic knowledge of machine learning and natural language processing is required to follow the talk.


Submitted 2 years ago

  • Ankit Kumar

    Ankit Kumar - Noisy Text Data: Achilles’ Heel of BERT

    Ankit Kumar
    NLP Researcher
    2 years ago
    20 Mins

    Pre-trained language models such as BERT have performed very well on various NLP tasks like text classification and question answering. Given BERT's success, industry practitioners are actively experimenting with fine-tuning BERT to build NLP applications for use cases like search recommendation, sentiment analysis, and opinion mining. Compared to benchmark datasets, the datasets used to build industrial NLP applications are often much noisier. While BERT has performed exceedingly well at transferring what it has learned from one use case to another, it remains unclear how BERT performs when fine-tuned on non-canonical text.

    In this talk, we dig deeper into the BERT architecture and systematically show the effect of noisy text data on the final outcome: when the text contains noise such as spelling mistakes and typos, BERT's performance degrades significantly. We further analyze the shortcomings in the existing BERT pipeline that are responsible for this drop in performance.

    This work is motivated by a business use case in which we are building a dialogue system over WhatsApp to screen candidates for blue-collar jobs. Our candidate user base often comes from underprivileged backgrounds, and most of them have not completed a college degree. This, coupled with the fat-finger problem on a mobile keypad, leads to a lot of typos and spelling mistakes in the responses received by our dialogue system.

    While this work is motivated by our business use case, our findings are applicable across various industry use cases that deal with non-canonical text - from sentiment classification on Twitter data to entity extraction over text collected from discussion forums.
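
The kind of noise this abstract describes (spelling mistakes, typos) can be simulated on clean text in order to measure how a fine-tuned model reacts to it. The sketch below is only an illustration: the function name `add_typos`, the three edit operations, and the `rate` and `seed` parameters are assumptions made for this example, not the procedure used in the talk.

```python
import random
import string


def add_typos(text: str, rate: float, seed: int = 0) -> str:
    """Return a copy of `text` with character-level typos (hypothetical helper).

    Each alphabetic character is, with probability `rate`, hit by one of
    three edits: deletion, substitution with a random lowercase letter,
    or a swap with the character that follows it.
    """
    rng = random.Random(seed)  # fixed seed keeps the noise reproducible
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(("delete", "substitute", "swap"))
            if op == "delete":
                # drop the character entirely
                i += 1
                continue
            if op == "substitute":
                # replace it with a random lowercase letter
                out.append(rng.choice(string.ascii_lowercase))
                i += 1
                continue
            # swap with the following character when one exists
            if i + 1 < len(chars):
                out.extend((chars[i + 1], c))
                i += 2
                continue
        # no edit applied: copy the character through unchanged
        out.append(c)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    sentence = "I have five years of experience as a delivery driver"
    for rate in (0.0, 0.1, 0.2):
        print(rate, add_typos(sentence, rate, seed=42))
```

One possible use is to apply the function to a copy of an evaluation set at increasing values of `rate` and compare the model's metrics on each copy against the clean baseline.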