Short Abstract

It is a well-known fact that the more data we have, the better the performance ML models can achieve. However, getting a large amount of training data annotated is a luxury most practitioners cannot afford. Computer vision has circumvented this problem with data augmentation techniques and has reaped rich benefits. Can NLP not do the same? In this talk we will look at the various techniques available to practitioners for augmenting data for their NLP applications, along with the bells and whistles around these techniques.


Long Abstract

In AI, it is a well-established fact that data beats algorithms, i.e. large amounts of data with a simple algorithm often yield far superior results compared to the best algorithm with little data. This is especially true for deep learning algorithms, which are known to be data guzzlers. Getting data labeled at scale is a luxury most practitioners cannot afford. What does one do in such a scenario?


This is where data augmentation comes into play. Data augmentation is a set of techniques for increasing the size of a dataset and introducing more variability into the data, which helps train better and more robust models. Data augmentation is very popular in computer vision: from simple techniques like rotation, translation, and adding salt-and-pepper noise, all the way to GANs, we have a whole range of techniques to augment images. Augmentation is widely regarded as one of the key anchors behind the success of computer vision models in industrial applications.
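
To make this concrete, here is a minimal sketch of the classic image augmentations mentioned above, written against torchvision (an assumed choice; any similar library would do). Salt-and-pepper noise is added via a small custom transform, since torchvision has no built-in for it:

    import torch
    from torchvision import transforms

    def salt_and_pepper(img, p=0.02):
        # Randomly set pixels to 0 (pepper) or 1 (salt) with total probability p.
        # Expects a tensor with values in [0, 1].
        noise = torch.rand_like(img)
        img = img.clone()
        img[noise < p / 2] = 0.0
        img[noise > 1 - p / 2] = 1.0
        return img

    augment = transforms.Compose([
        transforms.RandomRotation(degrees=15),                     # rotation
        transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # translation
        transforms.ToTensor(),
        transforms.Lambda(salt_and_pepper),                        # noise
    ])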


Most natural language processing (NLP) projects in industry still suffer from data scarcity. This is where recent advances in data augmentation for NLP can be very helpful. In NLP, data augmentation is not that straightforward: you want to augment the data while preserving the syntactic and semantic properties of the text. In this talk we will take a deep dive into the various techniques available to practitioners for augmenting data in NLP.
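
As a taste of what is possible, here is a minimal back-translation sketch (English -> French -> English) using the publicly available Helsinki-NLP MarianMT checkpoints from Hugging Face transformers; the language pair and model choice are illustrative assumptions:

    from transformers import MarianMTModel, MarianTokenizer

    def translate(texts, model_name):
        # Translate a batch of sentences with a pretrained MarianMT model.
        tokenizer = MarianTokenizer.from_pretrained(model_name)
        model = MarianMTModel.from_pretrained(model_name)
        batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**batch)
        return [tokenizer.decode(t, skip_special_tokens=True) for t in generated]

    def back_translate(texts):
        # English -> French -> English; the round trip yields paraphrases.
        french = translate(texts, "Helsinki-NLP/opus-mt-en-fr")
        return translate(french, "Helsinki-NLP/opus-mt-fr-en")

    print(back_translate(["The quick brown fox jumps over the lazy dog."]))

The round-tripped sentences typically preserve meaning while varying the wording, which is exactly the property we want from augmented text.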


Outline/Structure of the Talk

  • What is data augmentation?
  • Why data augmentation is tricky in NLP
  • Recent advances in data augmentation in NLP
    • Deep dive: various techniques for data augmentation in NLP
    • Pros and cons of the various techniques
    • Practical tips
  • A concrete case study of applying these techniques to our work

Learning Outcome

  • Understand what data augmentation is and why it is tricky in NLP
  • Learn about recent advances and the various techniques for data augmentation in NLP
  • Understand the pros and cons of these techniques, along with practical tips
  • See a concrete case study of applying these techniques to our work

Target Audience

The talk is meant for Data Scientists, NLP engineers, ML engineers and industry leaders working on NLP problems.

Prerequisites for Attendees

None

Submitted 8 months ago

Public Feedback

Suggest improvements to the Author
  • By Deepti Tomar ~ 6 months ago

    Hello Anuj,

    Requesting your response to help us understand the following:

    How different and how detailed will the content covered in this proposal be compared to what will be covered in the workshop?

    Thanks,

    Deepti

    • By Anuj Gupta ~ 6 months ago

      Hey Deepti

      This talk has nothing to do with the workshop. The workshop is for beginners (hence the title: Zero to Hero).

      Synthetic data generation in NLP is a fairly advanced and very new concept that people run into only after they have built 8-10 NLP systems in a commercial environment. Until then they don't even have a clue that something like this is possible. The biggest problem most AI teams in industry face is lack of data; the talk addresses that.

      Hope that answers your question. Let me know if not.

      Thanks


  • By Natasha Rodrigues ~ 8 months ago

    Hi Anuj,

    Thanks for your proposal! Requesting you to update the Outline/Structure section of your proposal with a time-wise breakup of how you plan to use the 20 minutes for the topics you've highlighted.

    Thanks,
    Natasha

    • By Anuj Gupta ~ 7 months ago

      Hi Natasha

      I have added my slides with the broad outline. As we discussed over email as well, this would be a 30-35 minute talk, not a 20-minute talk.

      Request you to update the proposal to reflect the same.


      • By Natasha Rodrigues ~ 7 months ago

        Hi Anuj,

        Thank you for the slides.

        As discussed over email, we are only accepting 20-minute proposals through our submission system.

        We can only increase the duration of a proposal once the talk has been reviewed and accepted by the program committee. We have informed the program committee about your request, and the proposals are under review.


        Thanks,

        Natasha

  • By Madalasa Venkataraman ~ 8 months ago

    Thank you for your interest! The topic is highly relevant, but the proposal lacks details on the various techniques that will be discussed. Please provide a tentative slide deck with a time break-up and list the techniques that would be discussed.

    • By Anuj Gupta ~ 7 months ago

      Hi Madalasa

      I have added the details in the slides. Please do let me know if anything else is needed from my end.

      Thanks

  • By Dr. Vikas Agrawal ~ 7 months ago

    Dear Anuj: Are you referring to techniques like these?

    1. Synonym replacement: randomly choose n words from the sentence that are not stop words, and replace each of them with one of its synonyms chosen at random.
    2. Random insertion: find a random synonym of a random word in the sentence that is not a stop word, and insert that synonym into a random position in the sentence. Do this n times.
    3. Random swap: randomly choose two words in the sentence and swap their positions. Do this n times.
    4. Random deletion: randomly remove each word in the sentence with probability p (techniques 1-4 are sketched in code after this list).
    5. Entity replacement for NER: replace words with others of the same label from a dictionary.
    6. Perturbations (letter-, word-, or sentence-level noise), e.g. with noisemix.
    7. Contextual augmentation using a language model.
    8. Back-translation using machine translation.
    9. Round-trip translation, paraphrasing, low-resource parallel corpora, leveraging external data.
    10. Using external data derived from Wikipedia, by linking Wikipedia articles to arbitrary input text. The idea is that if the input text were on Wikipedia, it would have links to other Wikipedia articles (that are semantically related and provide additional info):
        • break the input text into n-grams
        • check whether each n-gram exists as a Wikipedia article to create a set of ‘candidate links’
        • prune the candidate links by computing the similarity of the input text and the abstract of each candidate
    11. Conversational systems, e.g. with fountain.
    12. Reading comprehension: entity replacement and permutation; generating strong negatives based on POS tags.
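
    For concreteness, here is a minimal Python sketch of techniques 1-4 above, using NLTK's WordNet for synonyms; stop-word filtering is omitted for brevity, and the parameters n and p are illustrative defaults, not recommendations:

        import random
        from nltk.corpus import wordnet  # requires nltk.download("wordnet")

        def synonym_replacement(words, n=2):
            # Replace up to n words that have WordNet synonyms.
            out = words[:]
            candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
            random.shuffle(candidates)
            for i in candidates[:n]:
                lemmas = {l.name().replace("_", " ")
                          for s in wordnet.synsets(out[i]) for l in s.lemmas()}
                lemmas.discard(out[i])
                if lemmas:
                    out[i] = random.choice(sorted(lemmas))
            return out

        def random_insertion(words, n=1):
            # Insert a synonym of a random word at a random position, n times.
            out = words[:]
            for _ in range(n):
                synsets = wordnet.synsets(random.choice(out))
                if synsets:
                    synonym = synsets[0].lemmas()[0].name().replace("_", " ")
                    out.insert(random.randrange(len(out) + 1), synonym)
            return out

        def random_swap(words, n=2):
            # Swap two randomly chosen positions, n times.
            out = words[:]
            for _ in range(n):
                i, j = random.sample(range(len(out)), 2)
                out[i], out[j] = out[j], out[i]
            return out

        def random_deletion(words, p=0.1):
            # Drop each word with probability p; never return an empty sentence.
            kept = [w for w in words if random.random() > p]
            return kept or [random.choice(words)]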
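
    And a rough sketch of the Wikipedia-linking recipe in item 10, assuming the third-party wikipedia package for search/summaries and scikit-learn TF-IDF for the similarity pruning (both library choices, and the threshold, are illustrative assumptions):

        import wikipedia
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        def candidate_links(text, max_n=3):
            # Collect n-grams of the input that match Wikipedia article titles.
            words = text.split()
            candidates = set()
            for n in range(1, max_n + 1):
                for i in range(len(words) - n + 1):
                    ngram = " ".join(words[i:i + n])
                    hits = wikipedia.search(ngram, results=1)
                    if hits and hits[0].lower() == ngram.lower():
                        candidates.add(hits[0])
            return candidates

        def prune(text, candidates, threshold=0.1):
            # Keep candidates whose article abstract is similar to the input text.
            kept = []
            for title in candidates:
                try:
                    abstract = wikipedia.summary(title, sentences=2)
                except wikipedia.exceptions.WikipediaException:
                    continue
                tfidf = TfidfVectorizer().fit_transform([text, abstract])
                if cosine_similarity(tfidf[0:1], tfidf[1:2])[0, 0] >= threshold:
                    kept.append(title)
            return kept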
    • By Anuj Gupta ~ 7 months ago

      Hi Vikas, yes, part of the talk will cover what you have listed above.

  • By Sujoy Roychowdhury ~ 8 months ago

    Can you please elaborate on the "various techniques"? I would request a list of the techniques you would cover and the time distribution for your talk. Also, please highlight how much of it is based on your experience in implementation.

    • By Anuj Gupta ~ 7 months ago

      Hi Sujoy

      Sorry for the delay. I have added a slide deck with the broad outline. I would say 30-35% would come from our experience in implementation. Please do let me know if anything else is needed from my end.


    • By Sujoy Roychowdhury ~ 7 months ago

      Anuj, please could you reply urgently to this?