Cast a Net over your Data Lake
As the variety of data continues to expand, the need for different kinds of analytics is increasing: big data is no longer just about volume, but also about growing diversity. Unfortunately, there is no one-size-fits-all approach to analytics, no magic pill that will give your organization the insight it needs from its data. Graph analytics offers a toolset to visualize your diverse data and to build more accurate predictive models by uncovering non-obvious interconnections among your data sources. In this talk we will discuss some use cases for graph analytics and walk through a particular scenario: finding power-users for a promotion campaign. We will also cover machine learning approaches that can assist you in constructing graphs from diverse data sources.
Outline/structure of the Session
Title: Cast a Net over your Data Lake
Subtitle: Harnessing the power of graph analytics
Keywords: graph analytics, knowledge graph, heterogeneous information network, node classification, link prediction, data integration
Gentle Introduction to Graph Analytics
Graph analytics, also known as network analysis, uncovers the nature, extent and structure of the interwoven connections among entities. These entities can be families of organisms for biologists, interlocking subgroups of humanity for anthropologists, or people and organizations for sociologists, historians and economists, among many others. The best-known use of the methodology is social network analysis, which studies the connections between people in communities such as LinkedIn or Facebook.
Graph analytics is not a replacement for classic (relational) analytical approaches – there will always be use cases for both. Here we will explore scenarios which are the best fit for graph analytics.
Case Study: Power-users for a promotion campaign
As a new restaurant in the area, you are thinking of offering free vouchers to a select number of customers as a promotion initiative. Because your budget is limited, you want to find the people (let’s call them power-users) who are most likely both to create a “viral” effect that increases your visibility and to visit your restaurant. We will use the Yelp dataset as a practical example to illustrate how this can be achieved using graph analytics. Please note that estimating causal effects and revenue for such a promotion strategy is out of the scope of this talk.
To find the power-users, we will start by building the graph from the relational Yelp data sources. We will see that by including more and more pieces of knowledge about customers and businesses we can achieve better accuracy for our task. In the process, it will become clear that we are actually building a knowledge graph, or a heterogeneous information network, about businesses and customers. Then, we will get acquainted with such concepts as “node classification” and “link prediction” from the area of graph analytics, and will see how they are applied to find our power-users.
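To make the idea concrete, here is a minimal sketch of how relational tables could be turned into a small heterogeneous graph and used to rank candidate power-users. Everything in it is an assumption made for illustration: the toy rows are invented and are not Yelp data, the names `visit_score`, `reach` and `rank_power_users` are hypothetical, and the scoring rule (a path count between a user and the new restaurant through shared categories, multiplied by the number of friends) is just one simple link-prediction heuristic, not necessarily the method presented in the talk.

```python
from collections import defaultdict

# Hypothetical toy rows standing in for relational tables (not real Yelp data).
friendships = [("u1", "u2"), ("u1", "u3"), ("u1", "u4"), ("u2", "u3")]
reviews = [("u1", "b1"), ("u2", "b1"), ("u2", "b2"), ("u3", "b2"), ("u4", "b3")]
categories = [("b1", "thai"), ("b2", "thai"), ("b3", "pizza"), ("new", "thai")]

# One adjacency map per edge type keeps the node types of the graph apart.
friends = defaultdict(set)      # user -> users
for a, b in friendships:
    friends[a].add(b)
    friends[b].add(a)

reviewed = defaultdict(set)     # user -> businesses
for user, business in reviews:
    reviewed[user].add(business)

cats = defaultdict(set)         # business -> categories
for business, category in categories:
    cats[business].add(category)


def visit_score(user, target):
    """Link-prediction heuristic: count the businesses the user reviewed
    that share at least one category with the target business."""
    return sum(1 for b in reviewed[user] if cats[b] & cats[target])


def reach(user):
    """Crude proxy for the 'viral' effect: the number of friends."""
    return len(friends[user])


def rank_power_users(target):
    """Rank users by visit_score * reach, highest first (ties by user id)."""
    users = sorted(friends.keys() | reviewed.keys())
    scored = [(u, visit_score(u, target) * reach(u)) for u in users]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


ranking = rank_power_users("new")
print(ranking)
```

In this toy example the new restaurant has no reviews yet, so the only way to connect it to users is through the extra category table. This is the sense in which adding more pieces of knowledge to the graph can help the task.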
We will have a short look at other common applications for graph analytics:
- Data discovery: discover “unknown unknowns” by mining non-obvious patterns
- Cold start: modelling framework which simplifies the use of extra knowledge
- Uncover communities
- Information propagation/spread
- Identify trend setters
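As a rough illustration of the last two items, the sketch below estimates how far a message could spread from each person within a fixed number of hops and picks the person with the widest reach as a candidate trend setter. The toy graph, the function name `reachable_within` and the two-hop limit are assumptions chosen for the example; real propagation models are usually probabilistic and considerably richer.

```python
from collections import deque

# Hypothetical undirected friendship graph as an adjacency map.
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d"],
}


def reachable_within(graph, seed, max_hops):
    """Breadth-first search: nodes reachable from seed in at most max_hops steps."""
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return seen - {seed}


# Candidate trend setter: the node whose two-hop neighbourhood is largest.
reach_sizes = {node: len(reachable_within(graph, node, 2)) for node in graph}
trend_setter = max(reach_sizes, key=reach_sizes.get)
print(trend_setter, reach_sizes)
```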
Building Knowledge Graphs
In the case of the Yelp dataset, we have 8 data sources which we incorporate into the knowledge graph. As is typical when integrating data, this process can become extremely tedious as the number of data sources grows. To help automate it, we will present a brief overview of interesting approaches based on machine learning. For more details you may refer to my previous talk at Yow!Data, “Automating Data Integration with Machine Learning”.
Learning Outcome
- Introduction to graph analytics
- What kind of problems graph analytics can solve
- How to build graphs from relational data sources
- References to tools
Target Audience
Data scientists, analysts, engineers and other technical people