Cast a Net Over your Data Lake
As the variety of data continues to expand, the need for different kinds of analytics is increasing: big data is no longer just about volume, but also about growing diversity. Unfortunately, there is no one-size-fits-all approach to analytics, no magic pill that will get your organization the insight it needs from data. Graph analytics offers a toolset to visualize your diverse data and to build more accurate predictive models by uncovering non-obvious interconnections among your data sources.
In this talk we will discuss some use cases for graph analytics and walk through a particular scenario to find power-users for a promotion campaign. We will also cover machine learning approaches which can assist you in constructing graphs from diverse data sources.
Outline/structure of the Session
Title: Cast a Net over your Data Lake
Subtitle: Harnessing the power of graph analytics
Keywords: graph analytics, knowledge graph, heterogeneous information network, node classification, link prediction, data integration
Gentle Introduction to Graph Analytics
Graph analytics, also known as network analysis, uncovers the nature, extent and structure of interwoven connections among entities. These entities can be families of organisms for biologists, interlocking subgroups of humanity for anthropologists, or people and organizations for sociologists, historians and economists, among many others. The best-known use of the methodology is social network analysis, which studies connections between people in communities such as LinkedIn or Facebook.
Graph analytics is not a replacement for classic (relational) analytical approaches – there will always be use cases for both. Here we will explore scenarios which are the best fit for graph analytics.
Case Study: Power-users for a promotion campaign
As a new restaurant in the area, you are thinking of offering free vouchers to a select group of customers as a promotion initiative. Because your budget is limited, you want to find the people (let’s call them power-users) who are most likely both to create a “viral” effect that increases your visibility and to visit your restaurant. We will use the Yelp dataset as a practical example to illustrate how this can be achieved using graph analytics. Please note that estimating causal effects and revenue for such a promotion strategy is outside the scope of this talk.
To find the power-users, we will start by building the graph from the relational Yelp data sources. We will see that by including more and more pieces of knowledge about customers and businesses we can achieve better accuracy for our task. In the process, it will become clear that we are actually building a knowledge graph, or a heterogeneous information network, about businesses and customers. Then, we will get acquainted with such concepts as “node classification” and “link prediction” from the area of graph analytics, and will see how they are applied to find our power-users.
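As a rough illustration of the two steps described above, the following minimal sketch builds a small typed graph from relational-style rows and then scores candidate user-to-business links. The table contents, the function names and the common-neighbours score are all assumptions made for this example; they are not the real Yelp schema or necessarily the method used in the talk.

```python
from collections import defaultdict

# Hypothetical toy rows standing in for relational tables (not the real Yelp schema).
friendships = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan")]
reviews = [("bob", "cafe_1"), ("cat", "cafe_1"), ("cat", "bar_2"), ("dan", "bar_2")]

def build_graph(friendships, reviews):
    """Return adjacency sets for a graph with typed nodes ('user', 'business')."""
    adj = defaultdict(set)
    for u, v in friendships:
        adj[("user", u)].add(("user", v))
        adj[("user", v)].add(("user", u))
    for u, b in reviews:
        adj[("user", u)].add(("business", b))
        adj[("business", b)].add(("user", u))
    return adj

def link_score(adj, user, business):
    """Common-neighbours score: how many of the user's neighbours reviewed the business."""
    return len(adj[("user", user)] & adj[("business", business)])

def rank_candidates(adj, business):
    """Rank users who have not yet reviewed the business by their link score."""
    users = [n for n in adj if n[0] == "user" and ("business", business) not in adj[n]]
    scored = [(link_score(adj, name, business), name) for _, name in users]
    return sorted(scored, key=lambda s: (-s[0], s[1]))

graph = build_graph(friendships, reviews)
ranking = rank_candidates(graph, "cafe_1")
print(ranking)
```

In this toy data, ann has two friends who reviewed cafe_1 and dan has one, so ann ranks first as a likely future visitor. A real pipeline would add more node and edge types (categories, check-ins, tips) and would typically learn the scoring function rather than count neighbours.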
We will have a short look at other common applications for graph analytics:
- Data discovery: discover “unknown unknowns” by mining non-obvious patterns
- Cold start: modelling framework which simplifies the use of extra knowledge
- Uncover communities
- Information propagation/spread
- Identify trend setters
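To make the last application more concrete, one common way to surface influential nodes is a centrality measure such as PageRank. The sketch below is a plain power-iteration implementation on a hypothetical "follows" edge list; the choice of PageRank, the edge data and the parameter values are assumptions for illustration, not something the talk prescribes.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed edge list (source -> target)."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives the teleportation share first.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            targets = out_links[src] or nodes  # dangling nodes spread rank evenly
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

# Hypothetical "follows" edges: (a, b) means a pays attention to b.
follows = [("bob", "ann"), ("cat", "ann"), ("dan", "ann"), ("ann", "cat")]
scores = pagerank(follows)
top_user = max(scores, key=scores.get)
print(top_user)
```

Here ann is followed by everyone else and therefore receives the highest score. On real data, a candidate trend setter would be a user whose score is high within the community the promotion targets.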
Building Knowledge Graphs
In the case of the Yelp dataset, we have 8 data sources which we incorporate into the knowledge graph. As is typical when integrating data, this process can become extremely tedious as the number of data sources grows. To help automate this procedure, we will present a brief overview of interesting approaches based on Machine Learning. For more details you may refer to my previous talk at Yow!Data under the title “Automating Data Integration with Machine Learning”.
Learning Outcome
- Introduction to graph analytics
- What kind of problems graph analytics can solve
- How to build graphs from relational data sources
- References to tools
Target Audience
Data scientists, analysts, engineers, technical people
People who liked this proposal, also liked:
Joyce Wang (Software Engineer, Data61, CSIRO) - Covariate Shift - Challenges and Good Practice
A fundamental assumption in supervised machine learning is that both the training and query data are drawn from the same population/distribution. However, in real-life applications this is very often not the case, as the query data distribution is unknown and cannot be guaranteed a priori. Selection bias in collecting training samples will change the distribution of the training data from that of the overall population. This problem is known as covariate shift in the machine learning literature, and using a machine learning algorithm in this situation can result in spurious and often over-confident predictions.
Covariate shift is only detectable when we have access to query data. Visualizing the training and query data helps to gain an initial impression. Machine learning models can also be used to detect covariate shift. For example, a Gaussian process could model the similarity between each query point and the training data in feature space, and a one-class SVM could flag query points that are outliers relative to the training data. Both strategies detect query points that live in a different region of the feature space from the training dataset.
We suggest two strategies to mitigate covariate shift: re-weighting training data, and active learning with probabilistic models.
First, re-weighting the training data is the process of matching distribution statistics between the training and query sets in feature space. When the model is trained (and validated) on re-weighted data, it is expected to generalise better to query data. However, significant overlap between training and query datasets is required.
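A minimal sketch of the re-weighting idea, assuming a single numeric feature and a crude histogram density estimate: each training point is weighted by the estimated ratio of query density to training density at its location. The function name, the bin width and the data are hypothetical, and practical methods estimate this ratio far more carefully.

```python
from collections import Counter

def importance_weights(train_x, query_x, bin_width=1.0):
    """Weight each training point by the estimated ratio p_query(x) / p_train(x).

    Densities are approximated by histogram bin frequencies, a deliberately
    crude estimator chosen only to keep the sketch dependency-free.
    """
    def to_bin(x):
        return int(x // bin_width)

    train_counts = Counter(to_bin(x) for x in train_x)
    query_counts = Counter(to_bin(x) for x in query_x)
    weights = []
    for x in train_x:
        b = to_bin(x)
        p_train = train_counts[b] / len(train_x)
        p_query = query_counts[b] / len(query_x)
        weights.append(p_query / p_train)
    return weights

# Hypothetical 1-D feature values: training data is skewed towards small x.
train_x = [0.2, 0.4, 0.6, 0.8, 1.5, 2.5]
query_x = [0.5, 1.2, 1.7, 2.1, 2.4, 2.9]
weights = importance_weights(train_x, query_x)
print(weights)
```

Training points in the over-represented region receive weights below 1, while the sparse training points that resemble most of the query data are weighted up. The overlap requirement is visible here too: a query point that falls in a bin with no training data cannot be compensated for by any weight.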
Second, there may be a situation where we can acquire the labels of a small portion of the query set, potentially at great expense, to reduce the effects of covariate shift. Probabilistic models are required in this case because they indicate the uncertainty in their predictions. Active learning enables us to select small subsets of query points with the aim of maximally shrinking the uncertainty in our overall prediction.
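One simple way to choose which query points to label is uncertainty sampling: rank the points by the entropy of the model's predicted class distribution and label the most uncertain ones first. The sketch below assumes the probabilities come from some probabilistic classifier. The point ids, the probabilities and the function names are hypothetical, and this greedy heuristic is only one of several possible selection criteria.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labelling(predictions, budget):
    """Return the ids of the `budget` query points with the highest entropy."""
    ranked = sorted(
        predictions,
        key=lambda pid: predictive_entropy(predictions[pid]),
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical class probabilities from a probabilistic model on four query points.
predictions = {
    "q1": [0.98, 0.02],
    "q2": [0.55, 0.45],
    "q3": [0.80, 0.20],
    "q4": [0.50, 0.50],
}
to_label = select_for_labelling(predictions, budget=2)
print(to_label)
```

With a labelling budget of two, the points closest to a 50/50 prediction are selected. After the new labels are acquired, the model would be retrained and the selection repeated.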