A large fraction of NLP work in academia and research groups deals with clean datasets that are well structured and largely free of noise. However, when it comes to building real-world NLP applications, one often has to collect data from sources such as chats, user-discussion forums, social-media conversations, etc. Invariably, NLP applications in industrial settings have to deal with much noisier and more varied data: data with spelling mistakes, typos, acronyms, emojis, embedded metadata, etc.

There is a high level of disparity between the data SOTA language models are trained on and the data these models are expected to work on in practice. This leaves most commercial NLP applications that work with noisy data unable to take advantage of SOTA advances in the field of language computation.

Handcrafting rules and heuristics to correct this data is rarely a scalable option for industrial applications. Moreover, most SOTA models in NLP are not designed with noisy data in mind and often give substandard performance on it.

In this talk, we share our approach, experience, and learnings from designing a robust system to clean noise in data, without handcrafting the rules, using Machine Translation, and effectively making downstream NLP tasks easier to perform.
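As context for the machine-translation framing, treating normalization as translation requires a parallel corpus of noisy/clean sentence pairs. Below is a minimal sketch of preparing such data in the two-aligned-files layout common MT toolkits consume; the example pairs, file names, and helper function are hypothetical illustrations, not the actual pipeline described in the talk.

```python
# Hypothetical (noisy, clean) sentence pairs of Hindi-English
# code-mixed chat text, as a text-normalization-as-MT corpus.
pairs = [
    ("mera nam piyush h", "mera naam piyush hai"),
    ("i am intrested in ths job", "i am interested in this job"),
]

def write_parallel_corpus(pairs, src_path, tgt_path):
    # MT toolkits (e.g. fairseq, OpenNMT) typically consume two aligned
    # files: line i of src_path is the "translation source" (noisy text)
    # and line i of tgt_path is the target (clean text).
    with open(src_path, "w") as src, open(tgt_path, "w") as tgt:
        for noisy, clean in pairs:
            src.write(noisy + "\n")
            tgt.write(clean + "\n")

write_parallel_corpus(pairs, "train.noisy", "train.clean")
```

A seq2seq model trained on such pairs learns to "translate" noisy input into its normalized form, which is then fed to downstream NLP tasks.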

This work is motivated by our business use case, where we are building a conversational system over WhatsApp to screen candidates for blue-collar jobs. Our candidate user base often comes from tier-2 and tier-3 cities of India. Their responses to our conversational bot are mostly a code mix of Hindi and English coupled with non-canonical text (e.g. typos, non-standard syntactic constructions, spelling variations, phonetic substitutions, foreign-language words in a non-native script, grammatically incorrect text, colloquialisms, abbreviations, etc.). The raw text our system receives is far from clean, well-formatted text, and text normalization becomes a necessity before any further processing.

This talk is meant for computational language researchers, NLP practitioners, ML engineers, data scientists, senior leaders of AI/ML/DS groups, and linguists working with non-canonical text in resource-rich, resource-constrained (i.e. vernacular), and code-mixed languages.

 
 

Outline/Structure of the Talk

  1. Introduction (1-2 mins)
  2. Characteristics of User-generated content (4 mins)
  3. Impact of noisy data on downstream NLP tasks (2 mins)
  4. Techniques to deal with Noisy data (3 mins)
  5. Concrete case study of applying these techniques to our work (5 mins)
  6. Key Takeaways (1 min)
  7. Questions from audience (as time permits)

Learning Outcome

  • Characteristics of User-Generated Text Content
  • Impact of noisy data on downstream NLP tasks
  • Techniques to deal with Noisy data
  • Concrete case study of applying these techniques to our work

Target Audience

NLP practitioners, ML engineers, data scientists, senior leaders of AI/ML/DS groups

Prerequisites for Attendees

Basic Knowledge of Machine Learning and Natural Language Processing is required to understand the contents of the talk

Submitted 5 months ago

Public Feedback

Suggest improvements to the Author
  • Kuldeep Jiwani  ~  4 months ago

    Hi Piyush,

    You have mentioned that the majority of SOTA models are trained on clean data and cannot handle noisy data, but that you have devised a mechanism to deal with noisy data in NLP applications.

    For the review committee, can you briefly explain what techniques you are using that handle noisy data better than others?

    • Piyush Makhija  ~  4 months ago

      Hi Kuldeep

      This talk deals exactly with the point of how we can modify our NLP pipelines to handle noisy (i.e. non-canonical) data and make them perform well in an industry setting.

      We have studied SOTA models and their performance on noisy data in detail, along with existing work on how to deal with noisy data. Our approach introduces a step within existing NLP pipelines: in this step we perform normalization of noisy text so that downstream tasks can perform better.
      E.g. BERT, by default, uses a word-piece tokenizer to split a word into candidate subwords before passing them to the transformer part of its architecture. This tokenizer works decently well on clean data, as most tokens are found within a limited vocab, thereby limiting the Out-Of-Vocab (OOV) tokens. However, if we provide noisy data as input to BERT, this tokenizer produces many OOV fragments, which in turn degrades BERT's performance. By normalizing noisy data, we can avoid such OOV tokens and provide a significant boost to SOTA models' performance on noisy text.

      We are currently in the process of writing a paper on this very topic, which will go further into the details of this problem and our proposed resolution. I will add the paper link in the references once a preprint is available.

      • Kuldeep Jiwani  ~  4 months ago

        Thanks for the detailed explanation

  • Ashay Tamhane  ~  5 months ago

    Hi Piyush, thanks for the proposal. Could you clarify what metrics do you use to compare your approach with the existing approaches? Also, could you briefly elaborate on your approach in the proposal?

    • Piyush Makhija  ~  4 months ago

      Hi Ashay

      I use Word Error Rate (WER), the most common metric for evaluating ASR and text-normalization models. Apart from this, I also provide a BLEU-based evaluation, the standard for machine translation, to show how our model's performance improved with variations in data and modeling approaches.
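      For reference, WER is the word-level Levenshtein (edit) distance between hypothesis and reference, divided by the reference length. A minimal self-contained sketch follows; the example sentences are hypothetical, not drawn from the talk's evaluation data.

```python
# Minimal word-level WER via dynamic-programming edit distance
# (illustrative sketch; production code would use a library like jiwer).
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

print(wer("my name is piyush", "my nam is piyush"))  # 0.25 (1 substitution / 4 words)
```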

      I will update my proposal soon and attach some slides for reference.

  • Natasha Rodrigues  ~  5 months ago

    Hi Piyush,

    Thanks for your proposal! Requesting you to update the Outline/Structure section of your proposal with a time-wise breakup of how you plan to use 20 mins for the topics you've highlighted?

    Thanks,
    Natasha

    • Piyush Makhija  ~  5 months ago

      Hi Natasha

      I have updated the outline/structure with the requested details.
      Please let me know if you need any more details

      • Natasha Rodrigues  ~  5 months ago

        Thanks Piyush! Will let you know if we need more details.


  • Piyush Makhija

    Piyush Makhija / Ankit Kumar - Going Beyond "from huggingface import bert"

    20 Mins
    Talk
    Intermediate

    SHORT ABSTRACT

    Google AI stirred up the language-processing domain with the introduction of the Transformer architecture and BERT models. Models built on transformer-based architectures have outperformed previous approaches and set new State-of-the-Art (SOTA) standards for NLP tasks like text classification, question answering, text summarization, etc. BERT is said to have improved 10% of Google search results, single-handedly the largest improvement brought in by any approach Google has tried in recent years. In this talk, we aim to demystify BERT and help industry practitioners gain a deeper understanding of it.

    LONG ABSTRACT

    The ImageNet moment for NLP arrived with Bidirectional Encoder Representations from Transformers (BERT). The introduction of BERT created a wave in language research, and variations of BERT established new State-of-the-Art (SOTA) metrics for all standard NLP tasks, which were previously held mostly by techniques utilizing pre-trained word vectors. BERT and its variants demonstrate, with a high degree of validity across the research community, that pre-trained models can achieve SOTA results on a range of NLP tasks.

    Owing to its success in academia, industry practitioners started utilizing open-source BERT-based models in their own applications, for tasks ranging from NER extraction and text classification to search recommendations and opinion mining. It is common to find applied scientists, ML engineers, and even researchers using BERT-based models as a black box for their tasks. In some cases this yields miraculous, better-than-expected results, but in many cases a direct, black-box application does not produce encouraging results.

    In this talk, we aim to go under the skin of BERT and help the audience build a better understanding of the internal workings of the same.

  • Ankit Kumar

    Ankit Kumar - Noisy Text Data: Achilles’ Heel of BERT

    Ankit Kumar
    NLP Researcher
    vahan.co
    5 months ago
    Sold Out!
    20 Mins
    Talk
    Intermediate

    Pre-trained language models such as BERT have performed very well on various NLP tasks like text classification, question answering, etc. Given BERT's success, industry practitioners are actively experimenting with fine-tuning BERT to build NLP applications for use cases like search recommendation, sentiment analysis, and opinion mining. Compared to benchmark datasets, the datasets used to build industrial NLP applications are often much noisier. While BERT has performed exceedingly well at transferring learnings from one use case to another, it remains unclear how BERT performs when fine-tuned on non-canonical text.

    In this talk, we systematically dig deeper into the BERT architecture and show the effect of noisy text data on the final outcome. We show that when the text data is noisy (spelling mistakes, typos), there is a significant degradation in the performance of BERT. We further analyze the reasons and the shortcomings in the existing BERT pipeline that are responsible for this drop in performance.


    This work is motivated by the business use case where we are building a dialogue system over WhatsApp to screen candidates for blue-collar jobs. Our candidate user base often comes from underprivileged backgrounds, and many of them have not been able to complete a college degree. This, coupled with the fat-finger problem on a mobile keypad, leads to a lot of typos and spelling mistakes in the responses received by our dialogue system.

    While this work is motivated by our business use case, our findings are applicable across various industry use cases that deal with non-canonical text, from sentiment classification on Twitter data to entity extraction over text collected from discussion forums.