A Hands-on Introduction to Natural Language Processing
Data is the new oil, and unstructured data, especially text, images and
videos, contains a wealth of information. However, because processing and
analyzing this data is inherently complex, people often stay with
structured datasets rather than spend the extra time and effort on these
unstructured sources, which can be a potential gold mine. Natural Language
Processing (NLP) is all about leveraging tools, techniques and algorithms
to process and understand natural language data, which is usually
unstructured, such as text and speech. In this workshop, we will look at
tried and tested strategies, techniques and workflows that practitioners
and data scientists can use to extract useful insights from text data.
Specializing in domains like computer vision and natural language
processing is no longer a luxury but a necessity expected of any data
scientist in today’s fast-paced world! With a hands-on and interactive
approach, we will cover essential concepts in NLP along with extensive
case studies and hands-on examples to master state-of-the-art tools,
techniques and frameworks for applying NLP to real-world problems. We use
Python 3 and current state-of-the-art frameworks including NLTK, Gensim,
spaCy, scikit-learn, TextBlob, Keras and TensorFlow for our examples.
In my journey in this field so far, I have struggled with various problems,
faced many challenges, and learned various lessons over time. This
workshop contains a major chunk of the knowledge I’ve gained in the world
of text analytics and natural language processing, where building a
fancy word cloud from a bunch of text documents is not enough
anymore. Perhaps the biggest problem in learning text analytics is not a
lack of information but too much of it, often called information
overload. There are so many resources, documentation, papers, books, and
journals containing so much content that they often overwhelm someone new
to the field. You might have had questions like ‘What is the right
technique to solve a problem?’, ‘How does text summarization really
work?’ and ‘Which are the best frameworks to solve multi-class text
categorization?’ among many others! Drawing on what I learned from
publishing a couple of books in this domain, this workshop should help
attendees avoid the pressing issues I’ve faced in my journey so far and
learn the strategies to master NLP.
This workshop follows a comprehensive and structured approach. First, it
tackles the basics of natural language understanding and Python for
handling text data in the initial sections. Once you’re familiar with the
basics, we cover text processing, parsing and understanding. Then, we
address interesting problems in text analytics in each of the remaining
sections, including text classification, clustering and similarity analysis,
text summarization and topic models, semantic analysis and named
entity recognition, sentiment analysis and model interpretation. The last
section covers recent advancements in NLP made possible by deep learning
and transfer learning, with an example of text classification using
universal sentence embeddings.
Outline/Structure of the Workshop
The following is the rough structure of the workshop:
- Introduction to Natural Language Processing
- Text pre-processing and Wrangling
  - Removing HTML tags/noise
  - Removing accented characters
  - Removing special characters/symbols
  - Handling contractions
  - Stop word removal
  - Project: Build a duplicate character removal module
  - Project: Build a spell-check and correction module
  - Project: Build an end-to-end text pre-processor
- Text Understanding
  - POS (Part of Speech) Tagging
  - Text Parsing
    - Shallow Parsing
    - Dependency Parsing
    - Constituency Parsing
  - NER (Named Entity Recognition) Tagging
  - Project: Build your own POS Tagger
  - Project: Build your own NER Tagger
- Text Representation – Feature Engineering
  - Traditional Statistical Models – BOW, TF-IDF
  - Newer Deep Learning Models for word embeddings – Word2Vec, GloVe, FastText
  - Project: Similarity and Movie Recommendations
  - Project: Interactive exploration of Word Embeddings
- Case Studies for other common NLP Tasks
  - Project: Sentiment Analysis using unsupervised learning and supervised learning (machine and deep learning)
  - Project: Text Clustering (grouping similar movies)
  - Project: Text Summarization and Topic Models
- Promise of Deep Learning for NLP, Transfer and Generative Learning
  - Hands-on with universal sentence embeddings in deep learning
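The pre-processing steps in the outline can be sketched with nothing but the Python standard library. This is a minimal illustration, not the workshop's actual module: the function names, the tiny `CONTRACTIONS` and `STOP_WORDS` tables and the order of the steps are all assumptions made for the example, and the workshop itself works with frameworks such as NLTK and spaCy.

```python
import re
import unicodedata
from html.parser import HTMLParser

# Tiny illustrative lookup tables (assumptions); real pipelines use far larger ones.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "i'm": "i am"}
STOP_WORDS = {"a", "an", "the", "is", "it", "and", "of", "to", "in"}
CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join(parser.parts)


def remove_accents(text):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand_contractions(text):
    # Expects lower-cased text, because the lookup keys are lower-case.
    return CONTRACTION_RE.sub(lambda match: CONTRACTIONS[match.group(0)], text)


def remove_special_characters(text):
    # Anything that is not a lower-case letter, digit or whitespace becomes a space.
    return re.sub(r"[^a-z0-9\s]", " ", text)


def remove_stop_words(tokens):
    return [token for token in tokens if token not in STOP_WORDS]


def preprocess(text):
    """Run the steps in order and return a list of cleaned tokens."""
    text = strip_html(text)
    text = remove_accents(text).lower()
    text = expand_contractions(text)
    text = remove_special_characters(text)
    return remove_stop_words(text.split())
```

For instance, `preprocess("<p>It's a café, and I can't wait!</p>")` returns `['cafe', 'i', 'cannot', 'wait']`. Lower-casing before expanding contractions keeps the lookup table small, at the cost of discarding case information that a later step such as NER might want.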
Learning Outcome
- Learn and understand popular NLP workflows with interactive examples
- Work through concepts and interactive projects on cleaning and handling noisy unstructured text data, including duplicate checks, spelling corrections and text wrangling
- Build your own POS and NER taggers and parse text data to understand it better
- Understand, build and explore text semantics and representations with traditional statistical models and newer word embedding models
- Complete projects on popular NLP tasks including text classification, sentiment analysis, text clustering, summarization, topic models and recommendations
- Implement recent state-of-the-art research on deep transfer learning for NLP
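As a rough sketch of the traditional statistical representations mentioned above, the following computes TF-IDF weights and cosine similarity in pure Python. The helper names and the three toy documents are made up for illustration, the smoothed IDF term is one common variant among several, and in practice a library such as scikit-learn would do this work.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Smoothed IDF, so terms present in every document keep a non-zero weight.
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (an assumption for the example).
docs = [
    "the sky is blue and beautiful".split(),
    "the beautiful blue sky".split(),
    "the quick brown fox jumps over the lazy dog".split(),
]
vectors = tfidf_vectors(docs)
sim_close = cosine_similarity(vectors[0], vectors[1])  # two sentences about the sky
sim_far = cosine_similarity(vectors[0], vectors[2])    # sky sentence vs. fox sentence
```

Here `sim_close` comes out larger than `sim_far`, since the first two documents share most of their distinctive terms while the third shares only "the". The same pairwise-similarity idea underlies simple content-based recommendations.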
Target Audience
Data Scientists, Engineers, Developers, AI Enthusiasts, Linguistic Experts
Prerequisites for Attendees
Basic knowledge of Python and Machine Learning.
All the examples will be covered in Python.
Submitted 3 months ago
People who liked this proposal, also liked:
Viral B. Shah - Growing a compiler - Getting to ML from the general-purpose Julia compiler
Viral B. Shah, Co-inventor of Julia, Julia Computing Inc.
Since we originally proposed the need for a first-class language, compiler and ecosystem for machine learning (ML), a view that is increasingly shared by many, there have been plenty of interesting developments in the field. Not only have the tradeoffs in existing systems, such as TensorFlow and PyTorch, not been resolved, but they are clearer than ever now that both frameworks contain distinct "static graph" and "eager execution" interfaces. Meanwhile, the idea of ML models fundamentally being differentiable algorithms – often called differentiable programming – has caught on.
Where current frameworks fall short, several exciting new projects have sprung up that dispense with graphs entirely, to bring differentiable programming to the mainstream. Myia, by the Theano team, differentiates and compiles a subset of Python to high-performance GPU code. Swift for TensorFlow extends Swift so that compatible functions can be compiled to TensorFlow graphs. And finally, the Flux ecosystem is extending Julia’s compiler with a number of ML-focused tools, including first-class gradients, just-in-time CUDA kernel compilation, automatic batching and support for new hardware such as TPUs.
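To make the idea of differentiable programming concrete, here is a minimal sketch of forward-mode automatic differentiation with dual numbers. It is written in Python purely for illustration and is not how Flux, Myia or Swift for TensorFlow are implemented; the `Dual` class and the `derivative` helper are hypothetical names, and only addition and multiplication are supported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value paired with its derivative with respect to one input."""
    value: float
    grad: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.grad + other.grad)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.value * other.value,
                    self.grad * other.value + self.value * other.grad)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants, i.e. duals with zero derivative."""
    return x if isinstance(x, Dual) else Dual(float(x))


def derivative(f, x):
    """Evaluate df/dx at x by seeding the input derivative with 1."""
    return f(Dual(float(x), 1.0)).grad


def poly(x):
    # Ordinary Python code, including a loop, stays differentiable.
    result = 0.0
    for coeff in (3.0, 2.0, 1.0):  # 3x^2 + 2x + 1 via Horner's rule
        result = result * x + coeff
    return result
```

`derivative(poly, 2.0)` evaluates to `14.0`, matching the analytic derivative 6x + 2 at x = 2. No graph is built: the derivative is carried along with the ordinary program execution, which is the property the projects above aim to provide at compiler level.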
This talk will demonstrate how Julia is increasingly becoming a natural language for machine learning, the kind of libraries and applications the Julia community is building, the contributions from India (there are many!), and our plans going forward.
Dipanjan Sarkar - Explainable Artificial Intelligence - Demystifying the Hype
Dipanjan Sarkar, Data Scientist, Red Hat
The field of Artificial Intelligence powered by Machine Learning and Deep Learning has gone through some phenomenal changes over the last decade. Starting off as a purely academic and research-oriented domain, it has seen widespread industry adoption across diverse domains including retail, technology, healthcare, science and many more. More often than not, the standard toolbox of machine learning, statistical or deep learning models remains the same. New models like Capsule Networks do come into existence, but industry adoption usually takes several years. Hence, in the industry, the main focus of data science or machine learning is ‘applied’ rather than theoretical, and effective application of these models on the right data to solve complex real-world problems is of paramount importance.
A machine learning or deep learning model by itself consists of an algorithm which tries to learn latent patterns and relationships from data without hard-coding fixed rules. Hence, explaining how a model works to the business always poses its own set of challenges. In some industry domains, especially in finance areas like insurance or banking, data scientists often end up having to use more traditional machine learning models (linear or tree-based), because model interpretability is very important for the business to explain each and every decision taken by the model. However, this often leads to a sacrifice in performance. Complex models like ensembles and neural networks typically give us better and more accurate performance (since true relationships are rarely linear in nature). We, however, end up being unable to properly interpret their decisions.
To address these gaps, I will take a conceptual yet hands-on approach where we will explore some of these challenges in explainable artificial intelligence (XAI) and human-interpretable machine learning in depth, and showcase examples using state-of-the-art model interpretation frameworks in Python!
Kuldeep Jiwani - "Sessionisation" of time sequenced events via Stochastic periods
Kuldeep Jiwani, Director / Data Scientist, Thales (Guavus)
In today's world, the majority of information is generated by self-sustaining systems like various kinds of bots, crawlers, servers and online services. This information flows along the axis of time and is generated by these actors under some complex logic. Examples include a stream of buy/sell order requests from an Order Gateway in the financial world, a stream of web requests from a monitoring or crawling service in the web world, or a hacker's bot sitting on the internet and attacking various computers. We may not be able to know the motive or intention behind these data sources, but via unsupervised techniques we can try to infer the pattern or correlate the events based on their multiple occurrences on the axis of time. Thus we could automatically identify signatures of various actors and take appropriate actions.
Sessionisation is one such unsupervised technique that tries to find the signal in a stream of timestamped events. In the ideal world it would reduce to finding the periods of a mixture of sinusoidal waves. But in the real world this is a much more complex activity, as even the systematic events generated by machines over the internet behave in a much more erratic manner. So the notion of a period for a signal also changes in the real world. We can no longer associate it with a single number; it has to be treated as a random variable, with an expected value and associated variance. Hence we need to model "Stochastic periods" and learn their probability distributions in an unsupervised manner. This would be done via non-parametric Bayesian techniques with a Gaussian prior.
In this talk we will walk through a real security use case solved via Sessionisation for the SOC (Security Operations Centre) of an international firm with offices in 56 countries, all monitored by a central SOC team.
We will also go through a Sessionisation technique based on stochastic periods. The journey begins by extracting relevant data from a sequence of timestamped events. We then apply techniques like FFT (Fast Fourier Transform), kernel density estimation, optimal signal selection and Gaussian Mixture Models to eventually discover patterns in timestamped events.
Key concepts explained in talk: Sessionisation, Bayesian techniques of Machine Learning, Gaussian Mixture Models, Kernel density estimation, FFT, stochastic periods, probabilistic modelling
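As a loose illustration of two of the ideas named above, the sketch below recovers a dominant period from binned event counts with a naive discrete Fourier transform (a slow stand-in for the FFT), then summarises the gaps between events by their mean and standard deviation, treating the period as a random variable. The synthetic event stream, the function name `dominant_period` and all parameters are assumptions for the example; the kernel density estimation, Gaussian Mixture Model and Bayesian steps of the talk are not attempted.

```python
import cmath
import random
import statistics


def dominant_period(timestamps, duration, bin_width=1.0):
    """Estimate the strongest period from a naive DFT of binned event counts.

    All timestamps are assumed to lie in the interval [0, duration).
    """
    n_bins = int(duration / bin_width)
    counts = [0] * n_bins
    for t in timestamps:
        counts[int(t / bin_width)] += 1
    mean = sum(counts) / n_bins
    centered = [c - mean for c in counts]  # drop the zero-frequency component
    best_k, best_magnitude = 1, 0.0
    for k in range(1, n_bins // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * n / n_bins)
                    for n, x in enumerate(centered))
        if abs(coeff) > best_magnitude:
            best_k, best_magnitude = k, abs(coeff)
    # Frequency bin k corresponds to k cycles over the whole observation window.
    return n_bins * bin_width / best_k


# Synthetic machine-generated events: one roughly every 10 time units,
# each shifted by bounded random jitter (all values are made up).
rng = random.Random(0)
true_period, n_events = 10.0, 60
events = [i * true_period + 5.0 + rng.uniform(-2.0, 2.0) for i in range(n_events)]

period = dominant_period(events, duration=n_events * true_period)

# Treat the period as a random variable: summarise the inter-arrival gaps.
gaps = [later - earlier for earlier, later in zip(events, events[1:])]
mean_gap = statistics.mean(gaps)
gap_std = statistics.stdev(gaps)
```

The spectral estimate gives a single number for the period, while `mean_gap` and `gap_std` describe how much individual gaps scatter around it. That scatter is what motivates modelling a distribution over periods rather than one fixed value.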
Favio Vázquez - Complete Data Science Workflows with Open Source Tools
Favio Vázquez, Sr. Data Scientist, Raken Data Group
Cleaning, preparing, transforming, exploring and modeling data is what we hear about all the time in data science, and these steps may be the most important ones. But that is not all there is to data science. In this talk you will learn how the combination of Apache Spark, Optimus, the Python ecosystem and Data Operations can form a whole framework for data science that will allow you and your company to go further, beyond common sense and intuition, to solve complex business problems.
Pushker Ravindra - Data Science Best Practices for R and Python
Pushker Ravindra, Data Analytics Lead, Monsanto/Bayer
How many times did you feel that you were not able to understand someone else’s code or sometimes not even your own? It’s mostly because of bad/no documentation and not following the best practices. Here I will be demonstrating some of the best practices in Data Science, for R and Python, the two most important programming languages in the world for Data Science, which would help in building sustainable data products.
- Integrated Development Environment (RStudio, PyCharm)
- Coding best practices (Google’s R Style Guide and Hadley’s Style Guide, PEP 8)
- Linter (lintr, Pylint)
- Documentation – Code (Roxygen2, reStructuredText), README/Instruction Manual (RMarkdown, Jupyter Notebook)
- Unit testing (testthat, unittest)
- Version control (Git)
These best practices significantly reduce technical debt in the long term, foster more collaboration and promote building more sustainable data products in any organization.
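On the Python side, a few of the practices listed above (PEP 8 naming, a reStructuredText docstring and a `unittest` test case) might look like the sketch below. The function `normalize_column_name` and its behaviour are invented for illustration and do not come from the talk.

```python
import unittest


def normalize_column_name(name):
    """Convert a raw column label to snake_case.

    :param name: Column label, for example ``" Total Sales "``.
    :type name: str
    :returns: Lower-case label with runs of whitespace replaced by underscores.
    :rtype: str
    :raises TypeError: If ``name`` is not a string.
    """
    if not isinstance(name, str):
        raise TypeError("name must be a string")
    return "_".join(name.strip().lower().split())


class NormalizeColumnNameTest(unittest.TestCase):
    """Each test documents one expected behaviour of the function."""

    def test_spaces_become_underscores(self):
        self.assertEqual(normalize_column_name(" Total  Sales "), "total_sales")

    def test_non_string_is_rejected(self):
        with self.assertRaises(TypeError):
            normalize_column_name(42)
```

Saved in a module (say `test_columns.py`, a hypothetical filename), the tests run with `python -m unittest test_columns`, and a linter such as Pylint can be run over the same file.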
Avishkar Gupta / Dipanjan Sarkar - Leveraging AI to Enhance Developer Productivity & Confidence - AI-based insights for Developers by Developers
Avishkar Gupta, Data Scientist, Red Hat
Dipanjan Sarkar, Data Scientist, Red Hat
A major application of AI is leveraging it to create a safer world around us and to help people make choices. The open source revolution has taken the world by storm, and developers rely on various upstream third-party dependencies (too many to choose from: http://www.modulecounts.com/) to develop applications that move petabytes of sensitive data and run mission-critical code, where failures can be disastrous. It is therefore required now more than ever to build better developer tooling that helps developers make safer, better choices about their dependencies and provides them with more insights into the code they are using.
Though we are data scientists, at heart we are also developers building intelligent systems powered by AI. We, the Red Hat developer group, seek to do the same through our “Dependency Analytics” platform and extension. We call this 'AI-based insights for developers by developers'! In this session we will go into the details of the deep learning models we have implemented and deployed to solve two major problems:
- Dependency Recommendations: Recommending dependencies to a user for their specific application stack by trying to guess their intent, along with an overview of how we maintain and manage these production AI systems.
- Pro-active Security and Vulnerability Analysis: We will also touch upon how our platform aims to make developer applications safer by way of CVE (Common Vulnerabilities and Exposures) analyses and the experimental deep learning models we have built to proactively identify potential vulnerabilities. This will be followed by a short architectural overview of the entire platform.
If we have enough time, we intend to showcase some sample code as part of a tutorial on how we built these deep learning models, and walk through it!