Normalizing User-Generated Text Data
A large fraction of NLP work in academia and research groups deals with clean datasets that are well structured and largely free of noise. However, when building real-world NLP applications, one often has to collect data from sources such as chats, user-discussion forums, and social-media conversations. Almost invariably, NLP applications in industrial settings have to deal with much noisier and more varied data: data with spelling mistakes, typos, acronyms, emojis, embedded metadata, etc.
There is a large disparity between the data that state-of-the-art (SOTA) language models were trained on and the data these models are expected to work on in practice. As a result, most commercial NLP applications working with noisy data are unable to take advantage of SOTA advances in the field of language computation.
Handcrafting rules and heuristics to correct this data is rarely a scalable option for industrial applications. Most SOTA models in NLP are not designed with noisy data in mind, and they often perform poorly on it.
In this talk, we share our approach, experience, and learnings from designing a robust system to clean noise in data, without handcrafting the rules, using Machine Translation, and effectively making downstream NLP tasks easier to perform.
This work is motivated by our business use case, where we are building a conversational system over WhatsApp to screen candidates for blue-collar jobs. Our candidate user base often comes from tier-2 and tier-3 cities of India. Their responses to our conversational bot are mostly a code mix of Hindi and English coupled with non-canonical text (e.g., typos, non-standard syntactic constructions, spelling variations, phonetic substitutions, foreign-language words in a non-native script, grammatically incorrect text, colloquialisms, abbreviations, etc.). The raw text our system receives is far from clean, well-formatted text, so text normalization becomes a necessity before it can be processed any further.
This talk is meant for computational language researchers, NLP practitioners, ML engineers, data scientists, senior leaders of AI/ML/DS groups, and linguists working with non-canonical text in resource-rich or resource-constrained (i.e., vernacular and code-mixed) languages.
Outline/Structure of the Talk
- Introduction (1-2 mins)
- Characteristics of User-generated content (4 mins)
- Impact of noisy data on downstream NLP tasks (2 mins)
- Techniques to deal with Noisy data (3 mins)
- Concrete case study of applying these techniques to our work (5 mins)
- Key Takeaways (1 min)
- Questions from audience (as time permits)
- Characteristics of User-Generated Text Content
- Impact of noisy data on downstream NLP tasks
- Techniques to deal with Noisy data
- Concrete case study of applying these techniques to our work
NLP practitioners, ML engineers, data scientists, senior leaders of AI/ML/DS groups
Prerequisites for Attendees
Basic knowledge of machine learning and natural language processing is required to understand the contents of the talk.
Submitted 1 year ago
People who liked this proposal, also liked:
Piyush Makhija / Ankit Kumar - Going Beyond "from huggingface import bert"
Piyush Makhija, Machine Learning Engineer, Vahan Inc
Ankit Kumar, NLP Researcher, vahan.co
Google AI stirred up the language processing domain with the introduction of the Transformer architecture and BERT models. Models built on Transformer-based architectures have outperformed earlier approaches and set new state-of-the-art (SOTA) standards for NLP tasks like text classification, question answering, and text summarization. BERT is said to improve 10% of Google search results, the single largest improvement brought in by any approach Google has tried in recent years. In this talk, we aim to demystify BERT and help industry practitioners gain a deeper understanding of it.
The ImageNet moment for NLP arrived with Bidirectional Encoder Representations from Transformers (BERT). The introduction of BERT created a wave in language research, and variations of BERT established new state-of-the-art (SOTA) results on all standard NLP tasks, which were previously dominated by techniques utilizing pre-trained word vectors. BERT and its variants demonstrate, with a high degree of validity across the research community, that pre-trained models can achieve SOTA results on a range of NLP tasks.
Owing to its success in academia, industry practitioners started utilizing open-source BERT-based models in their own applications for tasks ranging from named entity recognition (NER) and text classification to search recommendations and opinion mining. It is common to find applied scientists, ML engineers, or even researchers using BERT-based models as a black box for their tasks. In some cases the results are better than expected, but in many cases direct application with only a black-box understanding does not give encouraging results.
In this talk, we aim to go under the skin of BERT and help the audience build a better understanding of its internal workings.
Ankit Kumar - Noisy Text Data: Achilles’ Heel of BERT
Ankit Kumar, NLP Researcher, vahan.co
Pre-trained language models such as BERT have performed very well on various NLP tasks like text classification and question answering. Given BERT's success, industry practitioners are actively experimenting with fine-tuning BERT to build NLP applications for use cases like search recommendation, sentiment analysis, and opinion mining. Compared to benchmark datasets, the datasets used to build industrial NLP applications are often much noisier. While BERT has performed exceedingly well at transferring learnings from one use case to another, it remains unclear how BERT performs when fine-tuned on non-canonical text.
In this talk, we dig deeper into the BERT architecture and show the effect of noisy text data on the final outcome. We systematically show that when the text data is noisy (spelling mistakes, typos), there is a significant degradation in the performance of BERT. We further analyze the reasons and shortcomings in the existing BERT pipeline that are responsible for this drop in performance.
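One way to study this kind of degradation is to corrupt a clean evaluation set at a controlled rate and compare model scores across rates. The sketch below is only an illustration under stated assumptions, not the speakers' method: the function name `inject_typos`, the three edit types (swap, drop, repeat), and the rule of skipping words shorter than three characters are all hypothetical choices.

```python
import random


def inject_typos(text: str, rate: float, seed: int = 0) -> str:
    """Corrupt roughly `rate` of the words in `text` with one character-level edit.

    Hypothetical helper for illustration; the edit types are assumptions.
    """
    rng = random.Random(seed)  # seeded so the same corruption can be reproduced
    noisy_words = []
    for word in text.split():
        # Leave very short words alone; corrupt the rest with probability `rate`.
        if len(word) < 3 or rng.random() >= rate:
            noisy_words.append(word)
            continue
        i = rng.randrange(len(word) - 1)
        op = rng.choice(("swap", "drop", "repeat"))
        if op == "swap":
            # Transpose two adjacent characters.
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        elif op == "drop":
            # Delete one character.
            word = word[:i] + word[i + 1:]
        else:
            # Duplicate one character, as a repeated key press would.
            word = word[:i] + word[i] + word[i:]
        noisy_words.append(word)
    return " ".join(noisy_words)
```

Calling `inject_typos(sentence, rate=0.2)` on each test sentence, then repeating with higher rates, gives a series of progressively noisier copies of the same data on which a fine-tuned model can be scored.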
This work is motivated by our business use case, where we are building a dialogue system over WhatsApp to screen candidates for blue-collar jobs. Our candidate user base often comes from underprivileged backgrounds, and most of them have not been able to complete a college degree. This, coupled with the fat-finger problem on a mobile keypad, leads to a lot of typos and spelling mistakes in the responses received by our dialogue system.
While this work is motivated by our business use case, our findings are applicable across various industry use cases that deal with non-canonical text, from sentiment classification on Twitter data to entity extraction over text collected from discussion forums.